
Example 1 with TikaDocument

Use of org.icij.extract.document.TikaDocument in the datashare project by ICIJ.

From the class IndexerHelper, method indexEmbeddedFile:

File indexEmbeddedFile(String project, String docPath) throws IOException {
    Path path = get(getClass().getResource(docPath).getPath());
    Extractor extractor = new Extractor(new DocumentFactory().withIdentifier(new DigestIdentifier("SHA-384", Charset.defaultCharset())));
    extractor.setDigester(new UpdatableDigester(project, Entity.HASHER.toString()));
    TikaDocument document = extractor.extract(path);
    ElasticsearchSpewer elasticsearchSpewer = new ElasticsearchSpewer(client, l -> ENGLISH, new FieldNames(), mock(Publisher.class), new PropertiesProvider()).withRefresh(IMMEDIATE).withIndex("test-datashare");
    elasticsearchSpewer.write(document);
    return path.toFile();
}
Also used: Path (java.nio.file.Path), PropertiesProvider (org.icij.datashare.PropertiesProvider), ElasticsearchSpewer (org.icij.datashare.text.indexing.elasticsearch.ElasticsearchSpewer), DocumentFactory (org.icij.extract.document.DocumentFactory), UpdatableDigester (org.icij.extract.extractor.UpdatableDigester), FieldNames (org.icij.spewer.FieldNames), DigestIdentifier (org.icij.extract.document.DigestIdentifier), TikaDocument (org.icij.extract.document.TikaDocument), Extractor (org.icij.extract.extractor.Extractor), Publisher (org.icij.datashare.com.Publisher)
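
For orientation, a minimal sketch of how this helper might be invoked from a test in the same class. The resource path and the assertion are illustrative assumptions, not taken from the datashare sources; only indexEmbeddedFile itself comes from the example above.

@Test
public void calls_index_embedded_file() throws IOException {
    // Hypothetical caller: "/docs/embedded_doc.eml" is an assumed resource path
    // (it appears in Example 3), and "test-datashare" matches the index used above.
    File indexed = indexEmbeddedFile("test-datashare", "/docs/embedded_doc.eml");
    assertThat(indexed.exists()).isTrue();
}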

Example 2 with TikaDocument

Use of org.icij.extract.document.TikaDocument in the datashare project by ICIJ.

From the class DatashareExtractIntegrationTest, method test_spew_and_read_index:

@Test
public void test_spew_and_read_index() throws Exception {
    Path path = get(getClass().getResource("/docs/doc.txt").getPath());
    TikaDocument tikaDocument = createExtractor().extract(path);
    spewer.write(tikaDocument);
    Document doc = indexer.get(TEST_INDEX, tikaDocument.getId());
    assertThat(doc.getId()).isEqualTo(tikaDocument.getId());
    assertThat(doc.getContent()).isEqualTo("This is a document to be parsed by datashare.");
    assertThat(doc.getLanguage()).isEqualTo(ENGLISH);
    assertThat(doc.getContentLength()).isEqualTo(45);
    assertThat(doc.getDirname()).contains(get("docs"));
    assertThat(doc.getPath()).contains(get("doc.txt"));
    assertThat(doc.getContentEncoding()).isEqualTo(Charset.forName("iso-8859-1"));
    assertThat(doc.getContentType()).isEqualTo("text/plain");
    assertThat(doc.getExtractionLevel()).isEqualTo((short) 0);
    assertThat(doc.getMetadata()).hasSize(6);
    assertThat(doc.getParentDocument()).isNull();
    assertThat(doc.getRootDocument()).isEqualTo(doc.getId());
    assertThat(doc.getCreationDate()).isNull();
}
Also used: Path (java.nio.file.Path), TikaDocument (org.icij.extract.document.TikaDocument), Document (org.icij.datashare.text.Document), Test (org.junit.Test)
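
The createExtractor() helper is not shown on this page. A plausible sketch, assuming it mirrors the DocumentFactory, DigestIdentifier and UpdatableDigester wiring from Example 1; the method body and the use of TEST_INDEX as the project name are assumptions, not the actual datashare code.

private Extractor createExtractor() {
    // Assumed wiring, modeled on Example 1: SHA-384 document ids plus an updatable
    // digester keyed on the project name so re-extraction updates the same documents.
    Extractor extractor = new Extractor(new DocumentFactory().withIdentifier(
            new DigestIdentifier("SHA-384", Charset.defaultCharset())));
    extractor.setDigester(new UpdatableDigester(TEST_INDEX, Entity.HASHER.toString()));
    return extractor;
}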

Example 3 with TikaDocument

Use of org.icij.extract.document.TikaDocument in the datashare project by ICIJ.

From the class DatashareExtractIntegrationTest, method test_spew_and_read_embedded_doc:

@Test
public void test_spew_and_read_embedded_doc() throws Exception {
    Path path = get(getClass().getResource("/docs/embedded_doc.eml").getPath());
    TikaDocument tikaDocument = createExtractor().extract(path);
    spewer.write(tikaDocument);
    Document doc = indexer.get(TEST_INDEX, tikaDocument.getEmbeds().get(0).getId(), tikaDocument.getId());
    assertThat(doc).isNotNull();
    assertThat(doc.getId()).isNotEqualTo(doc.getRootDocument());
    assertThat(doc.getRootDocument()).isEqualTo(tikaDocument.getId());
    assertThat(doc.getCreationDate()).isNotNull();
    assertThat(new SimpleDateFormat("HH:mm:ss").format(doc.getCreationDate())).isEqualTo("23:22:36");
}
Also used: Path (java.nio.file.Path), TikaDocument (org.icij.extract.document.TikaDocument), Document (org.icij.datashare.text.Document), SimpleDateFormat (java.text.SimpleDateFormat), Test (org.junit.Test)
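
Note that the "23:22:36" assertion formats the creation date with the JVM's default time zone, so it depends on the environment the test runs in. A sketch that pins the zone explicitly; the UTC choice is an illustrative assumption, the original test relies on the build host's zone.

// Pinning the zone makes the formatted hour deterministic across machines.
SimpleDateFormat hourFormat = new SimpleDateFormat("HH:mm:ss");
hourFormat.setTimeZone(TimeZone.getTimeZone("UTC"));  // assumed zone, for illustration only
assertThat(hourFormat.format(doc.getCreationDate())).isEqualTo("23:22:36");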

Example 4 with TikaDocument

Use of org.icij.extract.document.TikaDocument in the datashare project by ICIJ.

From the class DatabaseSpewerTest, method test_spew_document_iso8859_encoded_is_stored_in_utf8_and_have_correct_parameters:

@Test
public void test_spew_document_iso8859_encoded_is_stored_in_utf8_and_have_correct_parameters() throws Exception {
    File file = tmp.newFile("test_iso8859-1.txt");
    Files.write(file.toPath(), singletonList("chaîne en iso8859"), forName("ISO-8859-1"));
    TikaDocument tikaDocument = new Extractor().extract(file.toPath());
    dbSpewer.write(tikaDocument);
    Document actual = dbSpewer.repository.getDocument(tikaDocument.getId());
    assertThat(actual.getContent()).isEqualTo("chaîne en iso8859");
    assertThat(actual.getContentEncoding()).isEqualTo(forName("iso8859-1"));
    assertThat(actual.getContentLength()).isEqualTo(18);
    assertThat(actual.getContentType()).isEqualTo("text/plain");
}
Also used: TikaDocument (org.icij.extract.document.TikaDocument), Extractor (org.icij.extract.extractor.Extractor), Document (org.icij.datashare.text.Document), File (java.io.File), Test (org.junit.Test)
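
The expected content length of 18 matches the 17 characters of "chaîne en iso8859" plus the line separator that Files.write appends after each list element (a single '\n', assuming a Unix build host). A small sketch verifying the fixture on disk; the readAllBytes check is illustrative, not part of the original test, and reuses the statically imported forName from the snippet above.

// 17 single-byte ISO-8859-1 characters plus a trailing '\n' (assumes a Unix line separator).
byte[] raw = Files.readAllBytes(file.toPath());
assertThat(raw.length).isEqualTo(18);
assertThat(new String(raw, forName("ISO-8859-1")).trim()).isEqualTo("chaîne en iso8859");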

Example 5 with TikaDocument

Use of org.icij.extract.document.TikaDocument in the datashare project by ICIJ.

From the class ElasticsearchSpewerTest, method test_duplicate_file:

@Test
public void test_duplicate_file() throws Exception {
    DocumentFactory tikaFactory = new DocumentFactory().configure(Options.from(new HashMap<String, String>() {
        {
            put("idDigestMethod", Document.HASHER.toString());
        }
    }));
    Extractor extractor = new Extractor(tikaFactory);
    extractor.setDigester(new UpdatableDigester("project", Document.HASHER.toString()));
    final TikaDocument document = extractor.extract(get(Objects.requireNonNull(getClass().getResource("/docs/doc.txt")).getPath()));
    final TikaDocument document2 = extractor.extract(get(Objects.requireNonNull(getClass().getResource("/docs/doc-duplicate.txt")).getPath()));
    spewer.write(document);
    spewer.write(document2);
    GetResponse actualDocument = es.client.get(new GetRequest(TEST_INDEX, document.getId()), RequestOptions.DEFAULT);
    GetResponse actualDocument2 = es.client.get(new GetRequest(TEST_INDEX, new Duplicate(document2.getPath(), document.getId()).getId()), RequestOptions.DEFAULT);
    assertThat(actualDocument.isExists()).isTrue();
    assertThat(actualDocument.getSourceAsMap()).includes(entry("type", "Document"));
    assertThat(actualDocument2.isExists()).isTrue();
    assertThat(actualDocument2.getSourceAsMap()).includes(entry("type", "Duplicate"));
}
Also used: DocumentFactory (org.icij.extract.document.DocumentFactory), UpdatableDigester (org.icij.extract.extractor.UpdatableDigester), HashMap (java.util.HashMap), GetRequest (org.elasticsearch.action.get.GetRequest), TikaDocument (org.icij.extract.document.TikaDocument), Duplicate (org.icij.datashare.text.Duplicate), Extractor (org.icij.extract.extractor.Extractor), GetResponse (org.elasticsearch.action.get.GetResponse), Test (org.junit.Test)
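
The second fixture ends up indexed as a Duplicate because both extractions hash to the same id under the shared digester configuration. A minimal sketch making that precondition explicit; the assertion is an assumption about the doc.txt and doc-duplicate.txt fixtures having identical content, and is not part of the original test.

// Identical content plus the same digester settings should yield identical hash-based ids,
// which is what leads the spewer to record the second path as a Duplicate.
assertThat(document2.getId()).isEqualTo(document.getId());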

Aggregations

TikaDocument (org.icij.extract.document.TikaDocument): 15 usages
Test (org.junit.Test): 13 usages
DocumentFactory (org.icij.extract.document.DocumentFactory): 9 usages
Extractor (org.icij.extract.extractor.Extractor): 8 usages
Path (java.nio.file.Path): 7 usages
GetRequest (org.elasticsearch.action.get.GetRequest): 7 usages
GetResponse (org.elasticsearch.action.get.GetResponse): 7 usages
Document (org.icij.datashare.text.Document): 7 usages
HashMap (java.util.HashMap): 6 usages
PropertiesProvider (org.icij.datashare.PropertiesProvider): 5 usages
UpdatableDigester (org.icij.extract.extractor.UpdatableDigester): 5 usages
FieldNames (org.icij.spewer.FieldNames): 5 usages
ByteArrayInputStream (java.io.ByteArrayInputStream): 4 usages
ParsingReader (org.apache.tika.parser.ParsingReader): 4 usages
PathIdentifier (org.icij.extract.document.PathIdentifier): 4 usages
Publisher (org.icij.datashare.com.Publisher): 3 usages
InputStream (java.io.InputStream): 2 usages
Message (org.icij.datashare.com.Message): 2 usages
File (java.io.File): 1 usage
Charset (java.nio.charset.Charset): 1 usage