Example 6 with Extractor

use of org.icij.extract.extractor.Extractor in project datashare by ICIJ.

the class ElasticsearchSpewerTest method test_extract_id_should_be_equal_to_datashare_id.

@Test
public void test_extract_id_should_be_equal_to_datashare_id() throws IOException {
    // Configure extract to hash document ids with the same algorithm as datashare.
    DocumentFactory tikaFactory = new DocumentFactory().configure(Options.from(new HashMap<String, String>() {

        {
            put("idDigestMethod", Document.HASHER.toString());
        }
    }));
    Extractor extractor = new Extractor(tikaFactory);
    extractor.setDigester(new UpdatableDigester("project", Document.HASHER.toString()));
    // Resolve the test resource once and reuse the same path on both sides.
    Path path = get(Objects.requireNonNull(getClass().getResource("/docs/embedded_doc.eml")).getPath());
    final TikaDocument extractDocument = extractor.extract(path);
    Document document = new Document(Project.project("project"), path, "This is a document to be parsed by datashare.", Language.FRENCH, Charset.defaultCharset(), "text/plain", convert(extractDocument.getMetadata()), Document.Status.INDEXED, 45L);
    // The id built by datashare must match the id computed by extract.
    assertThat(document.getId()).isEqualTo(extractDocument.getId());
}
Also used : Path(java.nio.file.Path) DocumentFactory(org.icij.extract.document.DocumentFactory) UpdatableDigester(org.icij.extract.extractor.UpdatableDigester) HashMap(java.util.HashMap) TikaDocument(org.icij.extract.document.TikaDocument) Extractor(org.icij.extract.extractor.Extractor) Document(org.icij.datashare.text.Document) Test(org.junit.Test)
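
The test above relies on both sides deriving the same id when they hash with the same algorithm. As a rough, stand-alone illustration of that idea, and not the extract or datashare implementation, the sketch below hashes a salt and some content with the JDK's MessageDigest. SHA-384 is an assumption, chosen only because the 96-character hexadecimal ids in the later examples have the length of a SHA-384 digest. Mixing a project name into the hash is likewise only a guess at what the first UpdatableDigester argument is for. The class DigestIdSketch and its method hexId are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class DigestIdSketch {

    // Hypothetical helper: hex-encoded digest of a salt followed by the content.
    static String hexId(String algorithm, String salt, byte[] content) {
        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance(algorithm);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalArgumentException("Unknown digest algorithm: " + algorithm, e);
        }
        // Assumption: the salt (for example a project name) is hashed before the content.
        digest.update(salt.getBytes(StandardCharsets.UTF_8));
        digest.update(content);
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        byte[] content = "This is a document to be parsed by datashare.".getBytes(StandardCharsets.UTF_8);
        String first = hexId("SHA-384", "project", content);
        String second = hexId("SHA-384", "project", content);
        // Same algorithm, salt and content give the same id on both sides.
        System.out.println(first.equals(second));
        // A 384-bit digest is 96 hexadecimal characters long.
        System.out.println(first.length());
    }
}
```

Changing either the algorithm or the salt on one side only would produce a different id, which is the kind of mismatch the test is guarding against.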

Example 7 with Extractor

use of org.icij.extract.extractor.Extractor in project datashare by ICIJ.

the class SourceExtractorTest method test_get_source_for_embedded_doc.

@Test
public void test_get_source_for_embedded_doc() throws Exception {
    DocumentFactory tikaFactory = new DocumentFactory().configure(Options.from(new HashMap<String, String>() {

        {
            put("idDigestMethod", Document.HASHER.toString());
        }
    }));
    Path path = get(getClass().getResource("/docs/embedded_doc.eml").getPath());
    Extractor extractor = new Extractor(tikaFactory);
    extractor.setDigester(new UpdatableDigester(TEST_INDEX, Document.HASHER.toString()));
    final TikaDocument document = extractor.extract(path);
    ElasticsearchSpewer spewer = new ElasticsearchSpewer(es.client, l -> Language.ENGLISH, new FieldNames(), Mockito.mock(Publisher.class), new PropertiesProvider()).withRefresh(IMMEDIATE).withIndex(TEST_INDEX);
    spewer.write(document);
    Document attachedPdf = new ElasticsearchIndexer(es.client, new PropertiesProvider()).get(TEST_INDEX, "1bf2b6aa27dd8b45c7db58875004b8cb27a78ced5200b4976b63e351ebbae5ececb86076d90e156a7cdea06cde9573ca", "f4078910c3e73a192e3a82d205f3c0bdb749c4e7b23c1d05a622db0f07d7f0ededb335abdb62aef41ace5d3cdb9298bc");
    assertThat(attachedPdf).isNotNull();
    assertThat(attachedPdf.getContentType()).isEqualTo("application/pdf");
    InputStream source = new SourceExtractor().getSource(project(TEST_INDEX), attachedPdf);
    assertThat(source).isNotNull();
    assertThat(getBytes(source)).hasSize(49779);
}
Also used : Path(java.nio.file.Path) HashMap(java.util.HashMap) InputStream(java.io.InputStream) TikaDocument(org.icij.extract.document.TikaDocument) Publisher(org.icij.datashare.com.Publisher) Document(org.icij.datashare.text.Document) PropertiesProvider(org.icij.datashare.PropertiesProvider) DocumentFactory(org.icij.extract.document.DocumentFactory) UpdatableDigester(org.icij.extract.extractor.UpdatableDigester) FieldNames(org.icij.spewer.FieldNames) Extractor(org.icij.extract.extractor.Extractor) Test(org.junit.Test)
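
The getBytes(source) call in the test above is a helper whose definition is not shown on this page. A minimal sketch of what such a helper could look like, assuming it simply drains the stream into a byte array; the class name StreamBytes is hypothetical and this is not the datashare code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class StreamBytes {

    // Hypothetical stand-in for the getBytes helper: reads the stream to its end.
    static byte[] getBytes(InputStream source) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int read;
        try {
            // read() returns -1 once the end of the stream is reached.
            while ((read = source.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        InputStream source = new ByteArrayInputStream(new byte[49779]);
        // The length of the returned array is the number of bytes the stream held.
        System.out.println(getBytes(source).length);
    }
}
```

With a helper of this shape, an assertion such as hasSize(49779) checks the total number of bytes the source stream delivered.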

Example 8 with Extractor

use of org.icij.extract.extractor.Extractor in project datashare by ICIJ.

the class SourceExtractorTest method test_get_source_for_embedded_doc_without_metadata.

@Test
public void test_get_source_for_embedded_doc_without_metadata() throws Exception {
    DocumentFactory tikaFactory = new DocumentFactory().configure(Options.from(new HashMap<String, String>() {

        {
            put("idDigestMethod", Document.HASHER.toString());
        }
    }));
    Path path = get(getClass().getResource("/docs/embedded_doc.eml").getPath());
    Extractor extractor = new Extractor(tikaFactory);
    extractor.setDigester(new UpdatableDigester(TEST_INDEX, Document.HASHER.toString()));
    final TikaDocument document = extractor.extract(path);
    ElasticsearchSpewer spewer = new ElasticsearchSpewer(es.client, l -> Language.ENGLISH, new FieldNames(), Mockito.mock(Publisher.class), new PropertiesProvider()).withRefresh(IMMEDIATE).withIndex(TEST_INDEX);
    spewer.write(document);
    Document attachedPdf = new ElasticsearchIndexer(es.client, new PropertiesProvider()).get(TEST_INDEX, "1bf2b6aa27dd8b45c7db58875004b8cb27a78ced5200b4976b63e351ebbae5ececb86076d90e156a7cdea06cde9573ca", "f4078910c3e73a192e3a82d205f3c0bdb749c4e7b23c1d05a622db0f07d7f0ededb335abdb62aef41ace5d3cdb9298bc");
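    // Unlike the previous example, SourceExtractor is built with its flag set to true; going by the test name, the source is then returned without its metadata.
    InputStream source = new SourceExtractor(true).getSource(project(TEST_INDEX), attachedPdf);
    assertThat(source).isNotNull();
    // 49779 is the size asserted for the same attachment in test_get_source_for_embedded_doc.
    assertThat(getBytes(source).length).isNotEqualTo(49779);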
}
Also used : Path(java.nio.file.Path) HashMap(java.util.HashMap) InputStream(java.io.InputStream) TikaDocument(org.icij.extract.document.TikaDocument) Publisher(org.icij.datashare.com.Publisher) Document(org.icij.datashare.text.Document) PropertiesProvider(org.icij.datashare.PropertiesProvider) DocumentFactory(org.icij.extract.document.DocumentFactory) UpdatableDigester(org.icij.extract.extractor.UpdatableDigester) FieldNames(org.icij.spewer.FieldNames) Extractor(org.icij.extract.extractor.Extractor) Test(org.junit.Test)

Example 9 with Extractor

use of org.icij.extract.extractor.Extractor in project datashare by ICIJ.

the class ElasticsearchSpewerTest method test_embedded_document.

@Test
public void test_embedded_document() throws Exception {
    Path path = get(Objects.requireNonNull(getClass().getResource("/docs/embedded_doc.eml")).getPath());
    final TikaDocument document = new Extractor().extract(path);
    spewer.write(document);
    GetResponse documentFields = es.client.get(new GetRequest(TEST_INDEX, document.getId()), RequestOptions.DEFAULT);
    assertTrue(documentFields.isExists());
    SearchRequest searchRequest = new SearchRequest();
    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    searchSourceBuilder.query(QueryBuilders.multiMatchQuery("simple.tiff", "content"));
    searchRequest.source(searchSourceBuilder);
    SearchResponse response = es.client.search(searchRequest, RequestOptions.DEFAULT);
    assertThat(response.getHits().getTotalHits().value).isGreaterThan(0);
    // assertThat(response.getHits().getAt(0).getId()).endsWith("embedded.pdf");
    verify(publisher, times(2)).publish(eq(Channel.NLP), any(Message.class));
}
Also used : Path(java.nio.file.Path) SearchRequest(org.elasticsearch.action.search.SearchRequest) Message(org.icij.datashare.com.Message) GetRequest(org.elasticsearch.action.get.GetRequest) TikaDocument(org.icij.extract.document.TikaDocument) Extractor(org.icij.extract.extractor.Extractor) GetResponse(org.elasticsearch.action.get.GetResponse) SearchSourceBuilder(org.elasticsearch.search.builder.SearchSourceBuilder) SearchResponse(org.elasticsearch.action.search.SearchResponse) Test(org.junit.Test)

Aggregations

Extractor (org.icij.extract.extractor.Extractor): 9
TikaDocument (org.icij.extract.document.TikaDocument): 8
Test (org.junit.Test): 7
DocumentFactory (org.icij.extract.document.DocumentFactory): 6
UpdatableDigester (org.icij.extract.extractor.UpdatableDigester): 6
Path (java.nio.file.Path): 5
HashMap (java.util.HashMap): 4
Document (org.icij.datashare.text.Document): 4
GetRequest (org.elasticsearch.action.get.GetRequest): 3
GetResponse (org.elasticsearch.action.get.GetResponse): 3
PropertiesProvider (org.icij.datashare.PropertiesProvider): 3
Publisher (org.icij.datashare.com.Publisher): 3
FieldNames (org.icij.spewer.FieldNames): 3
InputStream (java.io.InputStream): 2
DigestIdentifier (org.icij.extract.document.DigestIdentifier): 2
File (java.io.File): 1
ArrayList (java.util.ArrayList): 1
SearchRequest (org.elasticsearch.action.search.SearchRequest): 1
SearchResponse (org.elasticsearch.action.search.SearchResponse): 1
SearchSourceBuilder (org.elasticsearch.search.builder.SearchSourceBuilder): 1