Example 1 with EmbeddedDocumentMemoryExtractor

Use of org.icij.extract.extractor.EmbeddedDocumentMemoryExtractor in the datashare project by ICIJ.

From the class SourceExtractor, method getSource.

public InputStream getSource(final Project project, final Document document) throws FileNotFoundException {
    if (document.isRootDocument()) {
        if (filterMetadata) {
            // try-with-resources closes the file stream even if cleaning fails
            try (InputStream in = new FileInputStream(document.getPath().toFile())) {
                return new ByteArrayInputStream(metadataCleaner.clean(in).getContent());
            } catch (IOException e) {
                throw new ExtractException("content cleaner error ", e);
            }
        } else {
            return new FileInputStream(document.getPath().toFile());
        }
    } else {
        LOGGER.info("extracting embedded document " + Identifier.shorten(document.getId(), 4) + " from root document " + document.getPath());
        TikaDocumentSource source;
        EmbeddedDocumentMemoryExtractor embeddedExtractor;
        DigestIdentifier identifier;
        if (document.getId().length() == SHA_384.digestLength) {
            embeddedExtractor = new EmbeddedDocumentMemoryExtractor(new UpdatableDigester(project.getId(), SHA_384.toString()));
            identifier = new DigestIdentifier(SHA_384.toString(), Charset.defaultCharset());
        } else {
            // backward compatibility: older ids use a different hash algorithm, inferred here from the id length
            Hasher hasher = Hasher.valueOf(document.getId().length());
            embeddedExtractor = new EmbeddedDocumentMemoryExtractor(new CommonsDigester(20 * 1024 * 1024, hasher.toString().replace("-", "")), hasher.toString(), false);
            identifier = new DigestIdentifier(hasher.toString(), Charset.defaultCharset());
        }
        TikaDocument rootDocument = new DocumentFactory().withIdentifier(identifier).create(document.getPath());
        try {
            source = embeddedExtractor.extract(rootDocument, document.getId());
            return filterMetadata ? new ByteArrayInputStream(metadataCleaner.clean(new ByteArrayInputStream(source.content)).getContent()) : new ByteArrayInputStream(source.content);
        } catch (SAXException | TikaException | IOException e) {
            throw new ExtractException("extract error for embedded document " + document.getId(), e);
        }
    }
}
Also used: EmbeddedDocumentMemoryExtractor (org.icij.extract.extractor.EmbeddedDocumentMemoryExtractor), TikaException (org.apache.tika.exception.TikaException), SAXException (org.xml.sax.SAXException), Hasher (org.icij.datashare.text.Hasher), UpdatableDigester (org.icij.extract.extractor.UpdatableDigester), CommonsDigester (org.apache.tika.parser.utils.CommonsDigester)
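
The method above picks a digester by comparing the length of document.getId() with SHA_384.digestLength, and otherwise falls back to Hasher.valueOf(length). The idea of inferring a hash algorithm from the length of an id can be sketched with the JDK alone. This is only an illustration: it assumes ids are hex-encoded digests, and the class DigestLengthSketch and its methods are hypothetical names that belong to neither datashare nor extract.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration; not part of datashare or extract.
final class DigestLengthSketch {

    // Maps the length of a hex-encoded digest to a JDK algorithm name.
    // Assumption: ids are hex strings, so each digest byte takes two characters.
    static String algorithmForHexLength(int hexLength) {
        switch (hexLength) {
            case 32:  return "MD5";     // 16 bytes
            case 40:  return "SHA-1";   // 20 bytes
            case 64:  return "SHA-256"; // 32 bytes
            case 96:  return "SHA-384"; // 48 bytes
            case 128: return "SHA-512"; // 64 bytes
            default:
                throw new IllegalArgumentException("unknown digest length: " + hexLength);
        }
    }

    // Hashes the data with the given algorithm and returns lowercase hex.
    static String hexDigest(String algorithm, byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance(algorithm).digest(data);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalArgumentException("unsupported algorithm: " + algorithm, e);
        }
    }

    public static void main(String[] args) {
        byte[] data = "example".getBytes(StandardCharsets.UTF_8);
        String id = hexDigest("SHA-384", data);
        // prints "96 -> SHA-384"
        System.out.println(id.length() + " -> " + algorithmForHexLength(id.length()));
    }
}
```

Under this assumption a 96-character id maps to SHA-384, while shorter ids resolve to whichever older algorithm matches their length, which is one plausible reading of the backward-compatibility branch in getSource.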
