Search in sources :

Example 1 with Duplicate

use of org.icij.datashare.text.Duplicate in project datashare by ICIJ.

the class ElasticsearchSpewerTest method test_duplicate_file.

@Test
public void test_duplicate_file() throws Exception {
    DocumentFactory tikaFactory = new DocumentFactory().configure(Options.from(new HashMap<String, String>() {

        {
            put("idDigestMethod", Document.HASHER.toString());
        }
    }));
    Extractor extractor = new Extractor(tikaFactory);
    extractor.setDigester(new UpdatableDigester("project", Document.HASHER.toString()));
    final TikaDocument document = extractor.extract(get(Objects.requireNonNull(getClass().getResource("/docs/doc.txt")).getPath()));
    final TikaDocument document2 = extractor.extract(get(Objects.requireNonNull(getClass().getResource("/docs/doc-duplicate.txt")).getPath()));
    spewer.write(document);
    spewer.write(document2);
    GetResponse actualDocument = es.client.get(new GetRequest(TEST_INDEX, document.getId()), RequestOptions.DEFAULT);
    GetResponse actualDocument2 = es.client.get(new GetRequest(TEST_INDEX, new Duplicate(document2.getPath(), document.getId()).getId()), RequestOptions.DEFAULT);
    assertThat(actualDocument.isExists()).isTrue();
    assertThat(actualDocument.getSourceAsMap()).includes(entry("type", "Document"));
    assertThat(actualDocument2.isExists()).isTrue();
    assertThat(actualDocument2.getSourceAsMap()).includes(entry("type", "Duplicate"));
}
Also used : DocumentFactory(org.icij.extract.document.DocumentFactory) UpdatableDigester(org.icij.extract.extractor.UpdatableDigester) HashMap(java.util.HashMap) GetRequest(org.elasticsearch.action.get.GetRequest) TikaDocument(org.icij.extract.document.TikaDocument) Duplicate(org.icij.datashare.text.Duplicate) Extractor(org.icij.extract.extractor.Extractor) GetResponse(org.elasticsearch.action.get.GetResponse) Test(org.junit.Test)

Aggregations

HashMap (java.util.HashMap)1 GetRequest (org.elasticsearch.action.get.GetRequest)1 GetResponse (org.elasticsearch.action.get.GetResponse)1 Duplicate (org.icij.datashare.text.Duplicate)1 DocumentFactory (org.icij.extract.document.DocumentFactory)1 TikaDocument (org.icij.extract.document.TikaDocument)1 Extractor (org.icij.extract.extractor.Extractor)1 UpdatableDigester (org.icij.extract.extractor.UpdatableDigester)1 Test (org.junit.Test)1