Example 1 with UpdatableDigester

Use of org.icij.extract.extractor.UpdatableDigester in project datashare by ICIJ.

In class IndexerHelper, method indexEmbeddedFile:

File indexEmbeddedFile(String project, String docPath) throws IOException {
    Path path = get(getClass().getResource(docPath).getPath());
    Extractor extractor = new Extractor(new DocumentFactory().withIdentifier(new DigestIdentifier("SHA-384", Charset.defaultCharset())));
    extractor.setDigester(new UpdatableDigester(project, Entity.HASHER.toString()));
    TikaDocument document = extractor.extract(path);
    ElasticsearchSpewer elasticsearchSpewer = new ElasticsearchSpewer(client, l -> ENGLISH, new FieldNames(), mock(Publisher.class), new PropertiesProvider())
            .withRefresh(IMMEDIATE)
            .withIndex("test-datashare");
    elasticsearchSpewer.write(document);
    return path.toFile();
}
Also used: Path (java.nio.file.Path), PropertiesProvider (org.icij.datashare.PropertiesProvider), ElasticsearchSpewer (org.icij.datashare.text.indexing.elasticsearch.ElasticsearchSpewer), DocumentFactory (org.icij.extract.document.DocumentFactory), UpdatableDigester (org.icij.extract.extractor.UpdatableDigester), FieldNames (org.icij.spewer.FieldNames), DigestIdentifier (org.icij.extract.document.DigestIdentifier), TikaDocument (org.icij.extract.document.TikaDocument), Extractor (org.icij.extract.extractor.Extractor), Publisher (org.icij.datashare.com.Publisher)

Example 2 with UpdatableDigester

Use of org.icij.extract.extractor.UpdatableDigester in project datashare by ICIJ.

In class DatashareExtractIntegrationTest, method createExtractor:

Extractor createExtractor() {
    Extractor extractor = new Extractor(new DocumentFactory().withIdentifier(new DigestIdentifier("SHA-384", Charset.defaultCharset())));
    extractor.setDigester(new UpdatableDigester("test", Entity.HASHER.toString()));
    return extractor;
}
Also used: DocumentFactory (org.icij.extract.document.DocumentFactory), UpdatableDigester (org.icij.extract.extractor.UpdatableDigester), DigestIdentifier (org.icij.extract.document.DigestIdentifier), Extractor (org.icij.extract.extractor.Extractor)
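
Both examples above construct UpdatableDigester from a project name and a hash algorithm name. One plausible reading, which is an assumption here and is not confirmed by this page, is that the project name is mixed into the hash so that the same file gets a different identifier in each project. The following is a minimal sketch of that idea using only the JDK; ProjectSaltedDigest and hexDigest are hypothetical names, and this is not the library's implementation.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ProjectSaltedDigest {

    // Hypothetical helper, not part of extract or datashare.
    public static String hexDigest(String algorithm, String project, byte[] content)
            throws NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance(algorithm);
        // Assumption: the project name is fed to the digest before the file bytes.
        digest.update(project.getBytes(StandardCharsets.UTF_8));
        digest.update(content);
        // Render the raw digest as lowercase hexadecimal.
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        byte[] content = "same bytes".getBytes(StandardCharsets.UTF_8);
        String a = hexDigest("SHA-384", "project-a", content);
        String b = hexDigest("SHA-384", "project-b", content);
        System.out.println(a.length());  // 96 hex characters for a 384-bit digest
        System.out.println(a.equals(b)); // false: different project, different id
    }
}
```

Under this assumption, two projects that index the same file would never share a document identifier, while re-indexing within one project stays deterministic.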

Example 3 with UpdatableDigester

Use of org.icij.extract.extractor.UpdatableDigester in project datashare by ICIJ.

In class ElasticsearchSpewerTest, method test_duplicate_file:

@Test
public void test_duplicate_file() throws Exception {
    DocumentFactory tikaFactory = new DocumentFactory().configure(Options.from(new HashMap<String, String>() {
        {
            put("idDigestMethod", Document.HASHER.toString());
        }
    }));
    Extractor extractor = new Extractor(tikaFactory);
    extractor.setDigester(new UpdatableDigester("project", Document.HASHER.toString()));
    final TikaDocument document = extractor.extract(get(Objects.requireNonNull(getClass().getResource("/docs/doc.txt")).getPath()));
    final TikaDocument document2 = extractor.extract(get(Objects.requireNonNull(getClass().getResource("/docs/doc-duplicate.txt")).getPath()));
    spewer.write(document);
    spewer.write(document2);
    GetResponse actualDocument = es.client.get(new GetRequest(TEST_INDEX, document.getId()), RequestOptions.DEFAULT);
    GetResponse actualDocument2 = es.client.get(new GetRequest(TEST_INDEX, new Duplicate(document2.getPath(), document.getId()).getId()), RequestOptions.DEFAULT);
    assertThat(actualDocument.isExists()).isTrue();
    assertThat(actualDocument.getSourceAsMap()).includes(entry("type", "Document"));
    assertThat(actualDocument2.isExists()).isTrue();
    assertThat(actualDocument2.getSourceAsMap()).includes(entry("type", "Duplicate"));
}
Also used: DocumentFactory (org.icij.extract.document.DocumentFactory), UpdatableDigester (org.icij.extract.extractor.UpdatableDigester), HashMap (java.util.HashMap), GetRequest (org.elasticsearch.action.get.GetRequest), TikaDocument (org.icij.extract.document.TikaDocument), Duplicate (org.icij.datashare.text.Duplicate), Extractor (org.icij.extract.extractor.Extractor), GetResponse (org.elasticsearch.action.get.GetResponse), Test (org.junit.Test)
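
The test above expects a second file with identical content to be indexed with type "Duplicate" instead of "Document". A rough sketch of content-hash duplicate detection follows, using only the JDK. DuplicateSketch and classify are hypothetical names, and the code does not reflect how datashare's spewer is actually written.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

public class DuplicateSketch {

    // Maps a content digest to the first path seen with that content.
    private final Map<String, String> firstPathByDigest = new HashMap<>();

    // Returns "Document" the first time some content is seen, "Duplicate" afterwards.
    public String classify(String path, byte[] content) throws NoSuchAlgorithmException {
        byte[] raw = MessageDigest.getInstance("SHA-384").digest(content);
        StringBuilder id = new StringBuilder();
        for (byte b : raw) {
            id.append(String.format("%02x", b));
        }
        // putIfAbsent returns null only when no earlier path had this digest.
        String firstPath = firstPathByDigest.putIfAbsent(id.toString(), path);
        return firstPath == null ? "Document" : "Duplicate";
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        DuplicateSketch index = new DuplicateSketch();
        byte[] text = "identical content".getBytes(StandardCharsets.UTF_8);
        System.out.println(index.classify("/docs/doc.txt", text));           // Document
        System.out.println(index.classify("/docs/doc-duplicate.txt", text)); // Duplicate
    }
}
```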

Example 4 with UpdatableDigester

Use of org.icij.extract.extractor.UpdatableDigester in project datashare by ICIJ.

In class ElasticsearchSpewerTest, method test_extract_id_should_be_equal_to_datashare_id:

@Test
public void test_extract_id_should_be_equal_to_datashare_id() throws IOException {
    DocumentFactory tikaFactory = new DocumentFactory().configure(Options.from(new HashMap<String, String>() {
        {
            put("idDigestMethod", Document.HASHER.toString());
        }
    }));
    Extractor extractor = new Extractor(tikaFactory);
    extractor.setDigester(new UpdatableDigester("project", Document.HASHER.toString()));
    final TikaDocument extractDocument = extractor.extract(get(Objects.requireNonNull(getClass().getResource("/docs/embedded_doc.eml")).getPath()));
    Document document = new Document(Project.project("project"),
            get(Objects.requireNonNull(getClass().getResource("/docs/embedded_doc.eml")).getPath()),
            "This is a document to be parsed by datashare.", Language.FRENCH, Charset.defaultCharset(), "text/plain",
            convert(extractDocument.getMetadata()), Document.Status.INDEXED, 45L);
    assertThat(document.getId()).isEqualTo(extractDocument.getId());
}
Also used: DocumentFactory (org.icij.extract.document.DocumentFactory), UpdatableDigester (org.icij.extract.extractor.UpdatableDigester), HashMap (java.util.HashMap), TikaDocument (org.icij.extract.document.TikaDocument), Extractor (org.icij.extract.extractor.Extractor), Document (org.icij.datashare.text.Document), Test (org.junit.Test)

Example 5 with UpdatableDigester

Use of org.icij.extract.extractor.UpdatableDigester in project datashare by ICIJ.

In class SourceExtractorTest, method test_get_source_for_embedded_doc:

@Test
public void test_get_source_for_embedded_doc() throws Exception {
    DocumentFactory tikaFactory = new DocumentFactory().configure(Options.from(new HashMap<String, String>() {
        {
            put("idDigestMethod", Document.HASHER.toString());
        }
    }));
    Path path = get(getClass().getResource("/docs/embedded_doc.eml").getPath());
    Extractor extractor = new Extractor(tikaFactory);
    extractor.setDigester(new UpdatableDigester(TEST_INDEX, Document.HASHER.toString()));
    final TikaDocument document = extractor.extract(path);
    ElasticsearchSpewer spewer = new ElasticsearchSpewer(es.client, l -> Language.ENGLISH, new FieldNames(), Mockito.mock(Publisher.class), new PropertiesProvider())
            .withRefresh(IMMEDIATE)
            .withIndex(TEST_INDEX);
    spewer.write(document);
    Document attachedPdf = new ElasticsearchIndexer(es.client, new PropertiesProvider()).get(TEST_INDEX,
            "1bf2b6aa27dd8b45c7db58875004b8cb27a78ced5200b4976b63e351ebbae5ececb86076d90e156a7cdea06cde9573ca",
            "f4078910c3e73a192e3a82d205f3c0bdb749c4e7b23c1d05a622db0f07d7f0ededb335abdb62aef41ace5d3cdb9298bc");
    assertThat(attachedPdf).isNotNull();
    assertThat(attachedPdf.getContentType()).isEqualTo("application/pdf");
    InputStream source = new SourceExtractor().getSource(project(TEST_INDEX), attachedPdf);
    assertThat(source).isNotNull();
    assertThat(getBytes(source)).hasSize(49779);
}
Also used: Path (java.nio.file.Path), HashMap (java.util.HashMap), InputStream (java.io.InputStream), TikaDocument (org.icij.extract.document.TikaDocument), Publisher (org.icij.datashare.com.Publisher), Document (org.icij.datashare.text.Document), PropertiesProvider (org.icij.datashare.PropertiesProvider), DocumentFactory (org.icij.extract.document.DocumentFactory), UpdatableDigester (org.icij.extract.extractor.UpdatableDigester), FieldNames (org.icij.spewer.FieldNames), Extractor (org.icij.extract.extractor.Extractor), Test (org.junit.Test)

Aggregations

UpdatableDigester (org.icij.extract.extractor.UpdatableDigester) 7
DocumentFactory (org.icij.extract.document.DocumentFactory) 6
Extractor (org.icij.extract.extractor.Extractor) 6
TikaDocument (org.icij.extract.document.TikaDocument) 5
HashMap (java.util.HashMap) 4
Test (org.junit.Test) 4
Path (java.nio.file.Path) 3
PropertiesProvider (org.icij.datashare.PropertiesProvider) 3
Publisher (org.icij.datashare.com.Publisher) 3
Document (org.icij.datashare.text.Document) 3
FieldNames (org.icij.spewer.FieldNames) 3
InputStream (java.io.InputStream) 2
DigestIdentifier (org.icij.extract.document.DigestIdentifier) 2
TikaException (org.apache.tika.exception.TikaException) 1
CommonsDigester (org.apache.tika.parser.utils.CommonsDigester) 1
GetRequest (org.elasticsearch.action.get.GetRequest) 1
GetResponse (org.elasticsearch.action.get.GetResponse) 1
Duplicate (org.icij.datashare.text.Duplicate) 1
Hasher (org.icij.datashare.text.Hasher) 1
ElasticsearchSpewer (org.icij.datashare.text.indexing.elasticsearch.ElasticsearchSpewer) 1