Example 11 with EmbeddedDocumentExtractor

Use of org.apache.tika.extractor.EmbeddedDocumentExtractor in project tika by apache.

Class OpenDocumentParser, method handleZipEntry.

private void handleZipEntry(ZipEntry entry, InputStream zip, Metadata metadata, ParseContext context, EndDocumentShieldingContentHandler handler) throws IOException, SAXException, TikaException {
    if (entry == null)
        return;
    if (entry.getName().equals("mimetype")) {
        String type = IOUtils.toString(zip, UTF_8);
        metadata.set(Metadata.CONTENT_TYPE, type);
    } else if (entry.getName().equals(META_NAME)) {
        meta.parse(zip, new DefaultHandler(), metadata, context);
    } else if (entry.getName().endsWith("content.xml")) {
        if (content instanceof OpenDocumentContentParser) {
            ((OpenDocumentContentParser) content).parseInternal(zip, handler, metadata, context);
        } else {
            // Foreign content parser was set:
            content.parse(zip, handler, metadata, context);
        }
    } else if (entry.getName().endsWith("styles.xml")) {
        if (content instanceof OpenDocumentContentParser) {
            ((OpenDocumentContentParser) content).parseInternal(zip, handler, metadata, context);
        } else {
            // Foreign content parser was set:
            content.parse(zip, handler, metadata, context);
        }
    } else {
        String embeddedName = entry.getName();
        //scrape everything under Thumbnails/ and Pictures/
        if (embeddedName.contains("Thumbnails/") || embeddedName.contains("Pictures/")) {
            EmbeddedDocumentExtractor embeddedDocumentExtractor = EmbeddedDocumentUtil.getEmbeddedDocumentExtractor(context);
            Metadata embeddedMetadata = new Metadata();
            embeddedMetadata.set(TikaCoreProperties.ORIGINAL_RESOURCE_NAME, entry.getName());
            /* if (embeddedName.startsWith("Thumbnails/")) {
                    embeddedMetadata.set(TikaCoreProperties.EMBEDDED_RESOURCE_TYPE,
                            TikaCoreProperties.EmbeddedResourceType.THUMBNAIL);
                }*/
            if (embeddedName.contains("Pictures/")) {
                embeddedMetadata.set(TikaMetadataKeys.EMBEDDED_RESOURCE_TYPE, TikaCoreProperties.EmbeddedResourceType.INLINE.toString());
            }
            if (embeddedDocumentExtractor.shouldParseEmbedded(embeddedMetadata)) {
                embeddedDocumentExtractor.parseEmbedded(zip, new EmbeddedContentHandler(handler), embeddedMetadata, false);
            }
        }
    }
}
Also used : EmbeddedDocumentExtractor(org.apache.tika.extractor.EmbeddedDocumentExtractor) Metadata(org.apache.tika.metadata.Metadata) EmbeddedContentHandler(org.apache.tika.sax.EmbeddedContentHandler) DefaultHandler(org.xml.sax.helpers.DefaultHandler)
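
The branching in handleZipEntry depends only on the entry name, so it can be sketched in isolation. The following is a minimal illustration, not Tika code: the class `OdfEntryRouter` and the enum `Kind` are hypothetical, and the value `"meta.xml"` for `META_NAME` is an assumption, since the excerpt does not show that constant.

```java
/** Hypothetical helper, not part of Tika: mirrors the name checks in handleZipEntry. */
final class OdfEntryRouter {
    enum Kind { MIMETYPE, META, CONTENT, STYLES, EMBEDDED, IGNORED }

    // Assumption: the excerpt does not show the value of META_NAME.
    static final String META_NAME = "meta.xml";

    /** Returns which branch of the if/else chain a given entry name would take. */
    static Kind classify(String name) {
        if (name == null) return Kind.IGNORED;
        if (name.equals("mimetype")) return Kind.MIMETYPE;
        if (name.equals(META_NAME)) return Kind.META;
        // endsWith, not equals: nested objects carry their own content.xml / styles.xml
        if (name.endsWith("content.xml")) return Kind.CONTENT;
        if (name.endsWith("styles.xml")) return Kind.STYLES;
        // Only these two folders are handed to the EmbeddedDocumentExtractor
        if (name.contains("Thumbnails/") || name.contains("Pictures/")) return Kind.EMBEDDED;
        return Kind.IGNORED;
    }

    public static void main(String[] args) {
        String[] names = {"mimetype", "content.xml", "Pictures/logo.png", "META-INF/manifest.xml"};
        for (String n : names) {
            System.out.println(n + " -> " + classify(n));
        }
    }
}
```

Keeping the classification separate from the parsing side effects makes the order of the checks easy to test. Note that the order matters: an entry such as `Pictures/content.xml` would be treated as content, because the `endsWith` checks run first.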

Example 12 with EmbeddedDocumentExtractor

Class BinaryDataHandler, method endPart.

@Override
public void endPart() throws SAXException, TikaException {
    if (hasData()) {
        EmbeddedDocumentExtractor embeddedDocumentExtractor = EmbeddedDocumentUtil.getEmbeddedDocumentExtractor(parseContext);
        Metadata embeddedMetadata = new Metadata();
        try (TikaInputStream stream = TikaInputStream.get(getInputStream())) {
            embeddedDocumentExtractor.parseEmbedded(stream, handler, embeddedMetadata, false);
        } catch (IOException e) {
            throw new TikaException("error in finishing part", e);
        }
        buffer.setLength(0);
    }
}
Also used : TikaException(org.apache.tika.exception.TikaException) EmbeddedDocumentExtractor(org.apache.tika.extractor.EmbeddedDocumentExtractor) Metadata(org.apache.tika.metadata.Metadata) TikaInputStream(org.apache.tika.io.TikaInputStream) IOException(java.io.IOException)

Example 13 with EmbeddedDocumentExtractor

Class CompressorParser, method parse.

public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException {
    // The underlying document stream should not be closed, so shield it
    if (stream.markSupported()) {
        stream = new CloseShieldInputStream(stream);
    } else {
        // Ensure that the stream supports the mark feature
        stream = new BufferedInputStream(new CloseShieldInputStream(stream));
    }
    CompressorInputStream cis;
    try {
        CompressorParserOptions options = context.get(CompressorParserOptions.class, new CompressorParserOptions() {

            public boolean decompressConcatenated(Metadata metadata) {
                return false;
            }
        });
        CompressorStreamFactory factory = new CompressorStreamFactory(options.decompressConcatenated(metadata), memoryLimitInKb);
        cis = factory.createCompressorInputStream(stream);
    } catch (CompressorException e) {
        if (e.getCause() instanceof MemoryLimitException) {
            throw new TikaMemoryLimitException(e.getMessage());
        }
        throw new TikaException("Unable to uncompress document stream", e);
    }
    MediaType type = getMediaType(cis);
    if (!type.equals(MediaType.OCTET_STREAM)) {
        metadata.set(CONTENT_TYPE, type.toString());
    }
    XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
    xhtml.startDocument();
    try {
        Metadata entrydata = new Metadata();
        String name = metadata.get(Metadata.RESOURCE_NAME_KEY);
        if (name != null) {
            if (name.endsWith(".tbz")) {
                name = name.substring(0, name.length() - 4) + ".tar";
            } else if (name.endsWith(".tbz2")) {
                name = name.substring(0, name.length() - 5) + ".tar";
            } else if (name.endsWith(".bz")) {
                name = name.substring(0, name.length() - 3);
            } else if (name.endsWith(".bz2")) {
                name = name.substring(0, name.length() - 4);
            } else if (name.endsWith(".xz")) {
                name = name.substring(0, name.length() - 3);
            } else if (name.endsWith(".zlib")) {
                name = name.substring(0, name.length() - 5);
            } else if (name.endsWith(".pack")) {
                name = name.substring(0, name.length() - 5);
            } else if (name.length() > 0) {
                name = GzipUtils.getUncompressedFilename(name);
            }
            entrydata.set(Metadata.RESOURCE_NAME_KEY, name);
        }
        // Use the delegate parser to parse the compressed document
        EmbeddedDocumentExtractor extractor = EmbeddedDocumentUtil.getEmbeddedDocumentExtractor(context);
        if (extractor.shouldParseEmbedded(entrydata)) {
            extractor.parseEmbedded(cis, xhtml, entrydata, true);
        }
    } finally {
        cis.close();
    }
    xhtml.endDocument();
}
Also used : TikaException(org.apache.tika.exception.TikaException) EmbeddedDocumentExtractor(org.apache.tika.extractor.EmbeddedDocumentExtractor) Metadata(org.apache.tika.metadata.Metadata) CompressorStreamFactory(org.apache.commons.compress.compressors.CompressorStreamFactory) CompressorInputStream(org.apache.commons.compress.compressors.CompressorInputStream) SnappyCompressorInputStream(org.apache.commons.compress.compressors.snappy.SnappyCompressorInputStream) XZCompressorInputStream(org.apache.commons.compress.compressors.xz.XZCompressorInputStream) BZip2CompressorInputStream(org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream) GzipCompressorInputStream(org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream) DeflateCompressorInputStream(org.apache.commons.compress.compressors.deflate.DeflateCompressorInputStream) LZMACompressorInputStream(org.apache.commons.compress.compressors.lzma.LZMACompressorInputStream) FramedSnappyCompressorInputStream(org.apache.commons.compress.compressors.snappy.FramedSnappyCompressorInputStream) ZCompressorInputStream(org.apache.commons.compress.compressors.z.ZCompressorInputStream) Pack200CompressorInputStream(org.apache.commons.compress.compressors.pack200.Pack200CompressorInputStream) XHTMLContentHandler(org.apache.tika.sax.XHTMLContentHandler) MemoryLimitException(org.apache.commons.compress.MemoryLimitException) TikaMemoryLimitException(org.apache.tika.exception.TikaMemoryLimitException) BufferedInputStream(java.io.BufferedInputStream) CompressorException(org.apache.commons.compress.compressors.CompressorException) MediaType(org.apache.tika.mime.MediaType) CloseShieldInputStream(org.apache.commons.io.input.CloseShieldInputStream)
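
The chain of `endsWith` checks above derives the name of the uncompressed resource from the compressed file name. As a rough sketch, the same mapping can be expressed as a table. The class `CompressedNames` and its method are hypothetical, and the final fallback to `GzipUtils.getUncompressedFilename` from the excerpt is deliberately left out so the example needs only the JDK.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not part of Tika: table-driven version of the suffix chain. */
final class CompressedNames {
    // Suffix -> replacement, kept in the same order as the if/else chain in the excerpt.
    private static final Map<String, String> SUFFIXES = new LinkedHashMap<>();
    static {
        SUFFIXES.put(".tbz", ".tar");
        SUFFIXES.put(".tbz2", ".tar");
        SUFFIXES.put(".bz", "");
        SUFFIXES.put(".bz2", "");
        SUFFIXES.put(".xz", "");
        SUFFIXES.put(".zlib", "");
        SUFFIXES.put(".pack", "");
    }

    /** Strips or rewrites the first matching compression suffix. */
    static String uncompressedName(String name) {
        for (Map.Entry<String, String> e : SUFFIXES.entrySet()) {
            String suffix = e.getKey();
            if (name.endsWith(suffix)) {
                return name.substring(0, name.length() - suffix.length()) + e.getValue();
            }
        }
        // The excerpt falls back to GzipUtils.getUncompressedFilename(name) here;
        // this sketch simply returns the name unchanged.
        return name;
    }

    public static void main(String[] args) {
        String[] names = {"backup.tbz2", "notes.txt.bz2", "dump.xz", "readme.txt"};
        for (String n : names) {
            System.out.println(n + " -> " + uncompressedName(n));
        }
    }
}
```

Deriving the cut length from `suffix.length()` avoids the hand-written offsets (3, 4, 5) in the original chain, which are easy to get wrong when a new suffix is added.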

Example 14 with EmbeddedDocumentExtractor

Class PackageParser, method parse.

public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException {
    //lazily load the MediaTypeRegistry at parse time
    //only want to call getDefaultConfig() once, and can't
    //load statically because of the ForkParser
    TikaConfig config = context.get(TikaConfig.class);
    MediaTypeRegistry mediaTypeRegistry = null;
    if (config != null) {
        mediaTypeRegistry = config.getMediaTypeRegistry();
    } else {
        if (bufferedMediaTypeRegistry == null) {
            //buffer this for next time.
            synchronized (lock) {
                //now that we're locked, check again
                if (bufferedMediaTypeRegistry == null) {
                    bufferedMediaTypeRegistry = TikaConfig.getDefaultConfig().getMediaTypeRegistry();
                }
            }
        }
        mediaTypeRegistry = bufferedMediaTypeRegistry;
    }
    // Ensure that the stream supports the mark feature
    if (!stream.markSupported()) {
        stream = new BufferedInputStream(stream);
    }
    TemporaryResources tmp = new TemporaryResources();
    ArchiveInputStream ais = null;
    try {
        ArchiveStreamFactory factory = context.get(ArchiveStreamFactory.class, new ArchiveStreamFactory());
        // At the end we want to close the archive stream to release
        // any associated resources, but the underlying document stream
        // should not be closed
        ais = factory.createArchiveInputStream(new CloseShieldInputStream(stream));
    } catch (StreamingNotSupportedException sne) {
        // Most archive formats work on streams, but a few need files
        if (sne.getFormat().equals(ArchiveStreamFactory.SEVEN_Z)) {
            // Rework as a file, and wrap
            stream.reset();
            TikaInputStream tstream = TikaInputStream.get(stream, tmp);
            // Seven Zip supports passwords; was one given?
            String password = null;
            PasswordProvider provider = context.get(PasswordProvider.class);
            if (provider != null) {
                password = provider.getPassword(metadata);
            }
            SevenZFile sevenz;
            if (password == null) {
                sevenz = new SevenZFile(tstream.getFile());
            } else {
                sevenz = new SevenZFile(tstream.getFile(), password.getBytes("UnicodeLittleUnmarked"));
            }
            // Pending a fix for COMPRESS-269 / TIKA-1525, this bit is a little nasty
            ais = new SevenZWrapper(sevenz);
        } else {
            tmp.close();
            throw new TikaException("Unknown non-streaming format " + sne.getFormat(), sne);
        }
    } catch (ArchiveException e) {
        tmp.close();
        throw new TikaException("Unable to unpack document stream", e);
    }
    updateMediaType(ais, mediaTypeRegistry, metadata);
    // Use the delegate parser to parse the contained document
    EmbeddedDocumentExtractor extractor = EmbeddedDocumentUtil.getEmbeddedDocumentExtractor(context);
    XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
    xhtml.startDocument();
    try {
        ArchiveEntry entry = ais.getNextEntry();
        while (entry != null) {
            if (!entry.isDirectory()) {
                parseEntry(ais, entry, extractor, metadata, xhtml);
            }
            entry = ais.getNextEntry();
        }
    } catch (UnsupportedZipFeatureException zfe) {
        // If it's an encrypted document of unknown password, report as such
        if (zfe.getFeature() == Feature.ENCRYPTION) {
            throw new EncryptedDocumentException(zfe);
        }
        // Otherwise throw the exception
        throw new TikaException("UnsupportedZipFeature", zfe);
    } catch (PasswordRequiredException pre) {
        throw new EncryptedDocumentException(pre);
    } finally {
        ais.close();
        tmp.close();
    }
    xhtml.endDocument();
}
Also used : StreamingNotSupportedException(org.apache.commons.compress.archivers.StreamingNotSupportedException) TikaException(org.apache.tika.exception.TikaException) EncryptedDocumentException(org.apache.tika.exception.EncryptedDocumentException) TikaConfig(org.apache.tika.config.TikaConfig) EmbeddedDocumentExtractor(org.apache.tika.extractor.EmbeddedDocumentExtractor) TemporaryResources(org.apache.tika.io.TemporaryResources) TikaInputStream(org.apache.tika.io.TikaInputStream) MediaTypeRegistry(org.apache.tika.mime.MediaTypeRegistry) ZipArchiveEntry(org.apache.commons.compress.archivers.zip.ZipArchiveEntry) ArchiveEntry(org.apache.commons.compress.archivers.ArchiveEntry) PasswordRequiredException(org.apache.commons.compress.PasswordRequiredException) ArchiveException(org.apache.commons.compress.archivers.ArchiveException) XHTMLContentHandler(org.apache.tika.sax.XHTMLContentHandler) PasswordProvider(org.apache.tika.parser.PasswordProvider) UnsupportedZipFeatureException(org.apache.commons.compress.archivers.zip.UnsupportedZipFeatureException) ArchiveStreamFactory(org.apache.commons.compress.archivers.ArchiveStreamFactory) ArArchiveInputStream(org.apache.commons.compress.archivers.ar.ArArchiveInputStream) TarArchiveInputStream(org.apache.commons.compress.archivers.tar.TarArchiveInputStream) JarArchiveInputStream(org.apache.commons.compress.archivers.jar.JarArchiveInputStream) ArchiveInputStream(org.apache.commons.compress.archivers.ArchiveInputStream) CpioArchiveInputStream(org.apache.commons.compress.archivers.cpio.CpioArchiveInputStream) ZipArchiveInputStream(org.apache.commons.compress.archivers.zip.ZipArchiveInputStream) DumpArchiveInputStream(org.apache.commons.compress.archivers.dump.DumpArchiveInputStream) SevenZFile(org.apache.commons.compress.archivers.sevenz.SevenZFile) BufferedInputStream(java.io.BufferedInputStream) CloseShieldInputStream(org.apache.commons.io.input.CloseShieldInputStream)
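
The lazy loading of the media type registry above follows the double-checked locking pattern: check, lock, check again, then initialize. Below is a minimal generic sketch of that pattern. `LazyHolder` is a hypothetical class, not part of Tika. The excerpt does not show how `bufferedMediaTypeRegistry` is declared; under the Java memory model the pattern is only safe when the cached field is `volatile`, which is what the sketch assumes.

```java
import java.util.function.Supplier;

/** Hypothetical helper, not part of Tika: lazy initialization via double-checked locking. */
final class LazyHolder<T> {
    private final Object lock = new Object();
    private final Supplier<T> loader;
    // volatile is required so that other threads never see a partially published value
    private volatile T value;

    LazyHolder(Supplier<T> loader) {
        this.loader = loader;
    }

    T get() {
        T local = value;              // first check, without locking
        if (local == null) {
            synchronized (lock) {
                local = value;        // second check, now that we hold the lock
                if (local == null) {
                    local = loader.get();
                    value = local;
                }
            }
        }
        return local;
    }

    public static void main(String[] args) {
        LazyHolder<String> holder = new LazyHolder<>(() -> "expensive default registry");
        System.out.println(holder.get());
    }
}
```

The unlocked first check keeps the common path cheap once the value exists, while the second check inside the lock prevents two threads from both running the expensive loader.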

Example 15 with EmbeddedDocumentExtractor

Class AppleSingleFileParser, method parse.

@Override
public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException {
    EmbeddedDocumentExtractor ex = EmbeddedDocumentUtil.getEmbeddedDocumentExtractor(context);
    short numEntries = readThroughNumEntries(stream);
    long bytesRead = 26;
    List<FieldInfo> fieldInfoList = getSortedFieldInfoList(stream, numEntries);
    bytesRead += 12 * numEntries;
    Metadata embeddedMetadata = new Metadata();
    bytesRead = processFieldEntries(stream, fieldInfoList, embeddedMetadata, bytesRead);
    FieldInfo contentFieldInfo = getContentFieldInfo(fieldInfoList);
    XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
    xhtml.startDocument();
    if (contentFieldInfo != null) {
        long diff = contentFieldInfo.offset - bytesRead;
        IOUtils.skipFully(stream, diff);
        if (ex.shouldParseEmbedded(embeddedMetadata)) {
            // TODO: we should probably add a readlimiting wrapper around this
            // stream to ensure that not more than contentFieldInfo.length bytes
            // are read
            ex.parseEmbedded(new CloseShieldInputStream(stream), xhtml, embeddedMetadata, false);
        }
    }
    xhtml.endDocument();
}
Also used : EmbeddedDocumentExtractor(org.apache.tika.extractor.EmbeddedDocumentExtractor) Metadata(org.apache.tika.metadata.Metadata) XHTMLContentHandler(org.apache.tika.sax.XHTMLContentHandler) CloseShieldInputStream(org.apache.commons.io.input.CloseShieldInputStream)
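
The TODO in the method above asks for a read-limiting wrapper so that the embedded parser cannot read past `contentFieldInfo.length` bytes. One possible shape for such a wrapper, using only the JDK, is sketched below. `LimitedInputStream` is a hypothetical class written for illustration; it is not what Tika uses, and Commons IO ships a `BoundedInputStream` that may be a better fit in practice.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical wrapper, not part of Tika: reports end of stream after `limit` bytes. */
final class LimitedInputStream extends FilterInputStream {
    private long remaining;

    LimitedInputStream(InputStream in, long limit) {
        super(in);
        this.remaining = limit;
    }

    @Override
    public int read() throws IOException {
        if (remaining <= 0) return -1;
        int b = in.read();
        if (b >= 0) remaining--;
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) return 0;
        if (remaining <= 0) return -1;
        // Never ask the underlying stream for more than the remaining budget
        int n = in.read(buf, off, (int) Math.min(len, remaining));
        if (n > 0) remaining -= n;
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = in.skip(Math.min(n, remaining));
        remaining -= skipped;
        return skipped;
    }

    @Override
    public int available() throws IOException {
        return (int) Math.min(in.available(), remaining);
    }

    // mark/reset would desynchronize the byte budget, so this sketch disables them
    @Override
    public boolean markSupported() { return false; }

    @Override
    public synchronized void mark(int readlimit) { }

    @Override
    public synchronized void reset() throws IOException {
        throw new IOException("mark/reset not supported");
    }

    public static void main(String[] args) throws IOException {
        InputStream data = new ByteArrayInputStream(new byte[] {1, 2, 3, 4, 5});
        InputStream limited = new LimitedInputStream(data, 3);
        int count = 0;
        while (limited.read() != -1) count++;
        System.out.println("bytes read: " + count);
    }
}
```

As written, `close()` still closes the wrapped stream (inherited from `FilterInputStream`), so in a parser like the one above it would need to wrap the `CloseShieldInputStream` rather than the raw document stream.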

Aggregations

EmbeddedDocumentExtractor (org.apache.tika.extractor.EmbeddedDocumentExtractor): 15
Metadata (org.apache.tika.metadata.Metadata): 9
TikaException (org.apache.tika.exception.TikaException): 8
XHTMLContentHandler (org.apache.tika.sax.XHTMLContentHandler): 8
TikaInputStream (org.apache.tika.io.TikaInputStream): 6
InputStream (java.io.InputStream): 4
ByteArrayInputStream (java.io.ByteArrayInputStream): 3
IOException (java.io.IOException): 3
CloseShieldInputStream (org.apache.commons.io.input.CloseShieldInputStream): 3
BufferedInputStream (java.io.BufferedInputStream): 2
ParsingEmbeddedDocumentExtractor (org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor): 2
ParseContext (org.apache.tika.parser.ParseContext): 2
BodyContentHandler (org.apache.tika.sax.BodyContentHandler): 2
ContentHandler (org.xml.sax.ContentHandler): 2
SAXException (org.xml.sax.SAXException): 2
Archive (com.github.junrar.Archive): 1
RarException (com.github.junrar.exception.RarException): 1
FileHeader (com.github.junrar.rarfile.FileHeader): 1
PSTException (com.pff.PSTException): 1
PSTFile (com.pff.PSTFile): 1