
Example 6 with UpdatableDigester

Use of org.icij.extract.extractor.UpdatableDigester in project datashare by ICIJ.

Class SourceExtractorTest, method test_get_source_for_embedded_doc_without_metadata.

@Test
public void test_get_source_for_embedded_doc_without_metadata() throws Exception {
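    // Configure the factory so that document IDs are computed with the digest method named by Document.HASHER.
    DocumentFactory tikaFactory = new DocumentFactory().configure(Options.from(new HashMap<String, String>() {
        {
            put("idDigestMethod", Document.HASHER.toString());
        }
    }));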
    Path path = get(getClass().getResource("/docs/embedded_doc.eml").getPath());
    Extractor extractor = new Extractor(tikaFactory);
    extractor.setDigester(new UpdatableDigester(TEST_INDEX, Document.HASHER.toString()));
    final TikaDocument document = extractor.extract(path);
    ElasticsearchSpewer spewer = new ElasticsearchSpewer(es.client, l -> Language.ENGLISH, new FieldNames(), Mockito.mock(Publisher.class), new PropertiesProvider()).withRefresh(IMMEDIATE).withIndex(TEST_INDEX);
    spewer.write(document);
    Document attachedPdf = new ElasticsearchIndexer(es.client, new PropertiesProvider()).get(TEST_INDEX, "1bf2b6aa27dd8b45c7db58875004b8cb27a78ced5200b4976b63e351ebbae5ececb86076d90e156a7cdea06cde9573ca", "f4078910c3e73a192e3a82d205f3c0bdb749c4e7b23c1d05a622db0f07d7f0ededb335abdb62aef41ace5d3cdb9298bc");
    InputStream source = new SourceExtractor(true).getSource(project(TEST_INDEX), attachedPdf);
    assertThat(source).isNotNull();
    assertThat(getBytes(source).length).isNotEqualTo(49779);
}
Also used: Path (java.nio.file.Path), HashMap (java.util.HashMap), InputStream (java.io.InputStream), TikaDocument (org.icij.extract.document.TikaDocument), Publisher (org.icij.datashare.com.Publisher), Document (org.icij.datashare.text.Document), PropertiesProvider (org.icij.datashare.PropertiesProvider), DocumentFactory (org.icij.extract.document.DocumentFactory), UpdatableDigester (org.icij.extract.extractor.UpdatableDigester), FieldNames (org.icij.spewer.FieldNames), Extractor (org.icij.extract.extractor.Extractor), Test (org.junit.Test)

Example 7 with UpdatableDigester

Use of org.icij.extract.extractor.UpdatableDigester in project datashare by ICIJ.

Class SourceExtractor, method getSource.

public InputStream getSource(final Project project, final Document document) throws FileNotFoundException {
    if (document.isRootDocument()) {
        if (filterMetadata) {
            try {
                return new ByteArrayInputStream(metadataCleaner.clean(new FileInputStream(document.getPath().toFile())).getContent());
            } catch (IOException e) {
                throw new ExtractException("content cleaner error ", e);
            }
        } else {
            return new FileInputStream(document.getPath().toFile());
        }
    } else {
        LOGGER.info("extracting embedded document " + Identifier.shorten(document.getId(), 4) + " from root document " + document.getPath());
        TikaDocumentSource source;
        EmbeddedDocumentMemoryExtractor embeddedExtractor;
        DigestIdentifier identifier;
        if (document.getId().length() == SHA_384.digestLength) {
            embeddedExtractor = new EmbeddedDocumentMemoryExtractor(new UpdatableDigester(project.getId(), SHA_384.toString()));
            identifier = new DigestIdentifier(SHA_384.toString(), Charset.defaultCharset());
        } else {
            // backward compatibility: the ID is not SHA-384 length, so infer the hasher from the ID length
            Hasher hasher = Hasher.valueOf(document.getId().length());
            embeddedExtractor = new EmbeddedDocumentMemoryExtractor(new CommonsDigester(20 * 1024 * 1024, hasher.toString().replace("-", "")), hasher.toString(), false);
            identifier = new DigestIdentifier(hasher.toString(), Charset.defaultCharset());
        }
        TikaDocument rootDocument = new DocumentFactory().withIdentifier(identifier).create(document.getPath());
        try {
            source = embeddedExtractor.extract(rootDocument, document.getId());
            return filterMetadata ? new ByteArrayInputStream(metadataCleaner.clean(new ByteArrayInputStream(source.content)).getContent()) : new ByteArrayInputStream(source.content);
        } catch (SAXException | TikaException | IOException e) {
            throw new ExtractException("extract error for embedded document " + document.getId(), e);
        }
    }
}
Also used: EmbeddedDocumentMemoryExtractor (org.icij.extract.extractor.EmbeddedDocumentMemoryExtractor), TikaException (org.apache.tika.exception.TikaException), SAXException (org.xml.sax.SAXException), Hasher (org.icij.datashare.text.Hasher), UpdatableDigester (org.icij.extract.extractor.UpdatableDigester), CommonsDigester (org.apache.tika.parser.utils.CommonsDigester)
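
The getSource method above picks a digest algorithm from the length of the document ID. The standalone sketch below illustrates that idea with the JDK's MessageDigest only. The class DigestByIdLength and its methods are hypothetical and are not part of datashare or extract; it also assumes that IDs are lowercase hex strings, so a SHA-384 ID is 96 characters long. The real Hasher.valueOf(int) may map lengths differently.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not the datashare Hasher: maps a hex ID length to a JCA algorithm name.
public class DigestByIdLength {

    // Assumption: IDs are hex-encoded digests, so each digest byte contributes two characters.
    static String algorithmFor(int hexLength) {
        switch (hexLength) {
            case 40:  return "SHA-1";
            case 64:  return "SHA-256";
            case 96:  return "SHA-384";
            case 128: return "SHA-512";
            default:
                throw new IllegalArgumentException("no digest algorithm for ID length " + hexLength);
        }
    }

    // Computes the lowercase hex digest of the given bytes with the named algorithm.
    static String hexDigest(String algorithm, byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance(algorithm).digest(data);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("algorithm not available: " + algorithm, e);
        }
    }

    public static void main(String[] args) {
        String id = hexDigest("SHA-384", "example".getBytes(StandardCharsets.UTF_8));
        // Prints "96 -> SHA-384": the algorithm is recovered from the ID length alone.
        System.out.println(id.length() + " -> " + algorithmFor(id.length()));
    }
}
```

Inferring the algorithm from the ID length only works because each supported digest has a distinct output size, which is presumably why the original code can fall back to Hasher.valueOf(document.getId().length()) for older IDs.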

Aggregations

UpdatableDigester (org.icij.extract.extractor.UpdatableDigester) 7
DocumentFactory (org.icij.extract.document.DocumentFactory) 6
Extractor (org.icij.extract.extractor.Extractor) 6
TikaDocument (org.icij.extract.document.TikaDocument) 5
HashMap (java.util.HashMap) 4
Test (org.junit.Test) 4
Path (java.nio.file.Path) 3
PropertiesProvider (org.icij.datashare.PropertiesProvider) 3
Publisher (org.icij.datashare.com.Publisher) 3
Document (org.icij.datashare.text.Document) 3
FieldNames (org.icij.spewer.FieldNames) 3
InputStream (java.io.InputStream) 2
DigestIdentifier (org.icij.extract.document.DigestIdentifier) 2
TikaException (org.apache.tika.exception.TikaException) 1
CommonsDigester (org.apache.tika.parser.utils.CommonsDigester) 1
GetRequest (org.elasticsearch.action.get.GetRequest) 1
GetResponse (org.elasticsearch.action.get.GetResponse) 1
Duplicate (org.icij.datashare.text.Duplicate) 1
Hasher (org.icij.datashare.text.Hasher) 1
ElasticsearchSpewer (org.icij.datashare.text.indexing.elasticsearch.ElasticsearchSpewer) 1