
Example 11 with EmbeddedContentHandler

use of org.apache.tika.sax.EmbeddedContentHandler in project tika by apache.

the class AbstractOOXMLExtractor method handleEmbeddedOLE.

/**
 * Handles an embedded OLE object in the document
 */
private void handleEmbeddedOLE(PackagePart part, ContentHandler handler, String rel, Metadata parentMetadata) throws IOException, SAXException {
    // A POIFSFileSystem needs to be at least 3 blocks big to be valid
    if (part.getSize() >= 0 && part.getSize() < 512 * 3) {
        // Too small, skip
        return;
    }
    // Open the POIFS (OLE2) structure and process
    POIFSFileSystem fs = null;
    try {
        fs = new POIFSFileSystem(part.getInputStream());
    } catch (Exception e) {
        EmbeddedDocumentUtil.recordEmbeddedStreamException(e, parentMetadata);
        return;
    }
    TikaInputStream stream = null;
    try {
        Metadata metadata = new Metadata();
        metadata.set(Metadata.EMBEDDED_RELATIONSHIP_ID, rel);
        DirectoryNode root = fs.getRoot();
        POIFSDocumentType type = POIFSDocumentType.detectType(root);
        if (root.hasEntry("CONTENTS") && root.hasEntry("Ole") && root.hasEntry("CompObj")) {
            // TIKA-704: OLE 2.0 embedded non-Office document?
            //TODO: figure out if the equivalent of OLE 1.0's
            //getCommand() and getFileName() exist for OLE 2.0 to populate
            //TikaCoreProperties.ORIGINAL_RESOURCE_NAME
            stream = TikaInputStream.get(fs.createDocumentInputStream("CONTENTS"));
            if (embeddedExtractor.shouldParseEmbedded(metadata)) {
                embeddedExtractor.parseEmbedded(stream, new EmbeddedContentHandler(handler), metadata, false);
            }
        } else if (POIFSDocumentType.OLE10_NATIVE == type) {
            // TIKA-704: OLE 1.0 embedded document
            Ole10Native ole = Ole10Native.createFromEmbeddedOleObject(fs);
            if (ole.getLabel() != null) {
                metadata.set(Metadata.RESOURCE_NAME_KEY, ole.getLabel());
            }
            if (ole.getCommand() != null) {
                metadata.add(TikaCoreProperties.ORIGINAL_RESOURCE_NAME, ole.getCommand());
            }
            if (ole.getFileName() != null) {
                metadata.add(TikaCoreProperties.ORIGINAL_RESOURCE_NAME, ole.getFileName());
            }
            byte[] data = ole.getDataBuffer();
            if (data != null) {
                stream = TikaInputStream.get(data);
            }
            if (stream != null && embeddedExtractor.shouldParseEmbedded(metadata)) {
                embeddedExtractor.parseEmbedded(stream, new EmbeddedContentHandler(handler), metadata, false);
            }
        } else {
            handleEmbeddedFile(part, handler, rel);
        }
    } catch (FileNotFoundException e) {
        // There was no CONTENTS entry, so skip this part
    } catch (Ole10NativeException e) {
        // Could not process an OLE 1.0 entry, so skip this part
    } catch (IOException e) {
        EmbeddedDocumentUtil.recordEmbeddedStreamException(e, parentMetadata);
    } finally {
        if (fs != null) {
            fs.close();
        }
        if (stream != null) {
            stream.close();
        }
    }
}
Also used : Ole10NativeException(org.apache.poi.poifs.filesystem.Ole10NativeException) Ole10Native(org.apache.poi.poifs.filesystem.Ole10Native) TikaInputStream(org.apache.tika.io.TikaInputStream) InputStream(java.io.InputStream) POIFSFileSystem(org.apache.poi.poifs.filesystem.POIFSFileSystem) NPOIFSFileSystem(org.apache.poi.poifs.filesystem.NPOIFSFileSystem) Metadata(org.apache.tika.metadata.Metadata) FileNotFoundException(java.io.FileNotFoundException) DirectoryNode(org.apache.poi.poifs.filesystem.DirectoryNode) POIFSDocumentType(org.apache.tika.parser.microsoft.OfficeParser.POIFSDocumentType) EmbeddedContentHandler(org.apache.tika.sax.EmbeddedContentHandler) IOException(java.io.IOException) TikaException(org.apache.tika.exception.TikaException) InvalidFormatException(org.apache.poi.openxml4j.exceptions.InvalidFormatException) XmlException(org.apache.xmlbeans.XmlException) SAXException(org.xml.sax.SAXException)
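
The size guard at the top of handleEmbeddedOLE can be read on its own. The following is a minimal sketch rather than Tika's code: the names OleSizeGuard and tooSmallForPoifs are hypothetical, and it assumes, as the comment in the example does, 512-byte blocks with a 3-block minimum, and that a negative size means the size is unknown.

```java
// Hypothetical helper mirroring the size guard in handleEmbeddedOLE.
final class OleSizeGuard {
    // Assumption taken from the example's comment: 512-byte blocks, at least 3 of them.
    static final int BLOCK_SIZE = 512;
    static final int MIN_BLOCKS = 3;

    /** Returns true when a known size is below the 3-block minimum (1536 bytes). */
    static boolean tooSmallForPoifs(long size) {
        // A negative size is treated as "unknown" and is not rejected here.
        return size >= 0 && size < (long) BLOCK_SIZE * MIN_BLOCKS;
    }

    public static void main(String[] args) {
        System.out.println(tooSmallForPoifs(1535)); // true
        System.out.println(tooSmallForPoifs(1536)); // false
        System.out.println(tooSmallForPoifs(-1));   // false
    }
}
```

Parts with an unknown size fall through to the POIFSFileSystem constructor, whose failure is then recorded on the parent metadata instead of being thrown.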

Example 12 with EmbeddedContentHandler

use of org.apache.tika.sax.EmbeddedContentHandler in project tika by apache.

the class AbstractOOXMLExtractor method handleEmbeddedFile.

/**
 * Handles an embedded file in the document
 */
protected void handleEmbeddedFile(PackagePart part, ContentHandler handler, String rel) throws SAXException, IOException {
    Metadata metadata = new Metadata();
    metadata.set(Metadata.EMBEDDED_RELATIONSHIP_ID, rel);
    // Get the name
    String name = part.getPartName().getName();
    metadata.set(Metadata.RESOURCE_NAME_KEY, name.substring(name.lastIndexOf('/') + 1));
    // Get the content type
    metadata.set(Metadata.CONTENT_TYPE, part.getContentType());
    // Call the recursing handler
    if (embeddedExtractor.shouldParseEmbedded(metadata)) {
        try (TikaInputStream tis = TikaInputStream.get(part.getInputStream())) {
            embeddedExtractor.parseEmbedded(tis, new EmbeddedContentHandler(handler), metadata, false);
        }
    }
}
Also used : Metadata(org.apache.tika.metadata.Metadata) TikaInputStream(org.apache.tika.io.TikaInputStream) EmbeddedContentHandler(org.apache.tika.sax.EmbeddedContentHandler)
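
The resource name in handleEmbeddedFile is derived from the part name by keeping everything after the last slash. A small sketch of that one expression follows; the class name PartNames, the method name leafName and the sample part names are hypothetical and chosen only for illustration.

```java
// Hypothetical helper isolating the substring/lastIndexOf expression from handleEmbeddedFile.
final class PartNames {
    /** Returns the text after the last '/', or the whole string if there is no '/'. */
    static String leafName(String partName) {
        // lastIndexOf returns -1 when '/' is absent, so the substring then starts at 0.
        return partName.substring(partName.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        System.out.println(leafName("/word/embeddings/oleObject1.bin")); // oleObject1.bin
        System.out.println(leafName("image1.png"));                      // image1.png
    }
}
```

A name that ends in a slash yields an empty string, so a caller that needs a non-empty name would have to check for that case separately.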

Example 13 with EmbeddedContentHandler

use of org.apache.tika.sax.EmbeddedContentHandler in project tika by apache.

the class Word2006MLParser method parse.

@Override
public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException {
    //set OfficeParserConfig if the user hasn't specified one
    configure(context);
    final XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
    xhtml.startDocument();
    try {
        context.getSAXParser().parse(new CloseShieldInputStream(stream), new OfflineContentHandler(new EmbeddedContentHandler(new Word2006MLDocHandler(xhtml, metadata, context))));
    } catch (SAXException e) {
        throw new TikaException("XML parse error", e);
    } finally {
        xhtml.endDocument();
    }
}
Also used : OfflineContentHandler(org.apache.tika.sax.OfflineContentHandler) TikaException(org.apache.tika.exception.TikaException) EmbeddedContentHandler(org.apache.tika.sax.EmbeddedContentHandler) XHTMLContentHandler(org.apache.tika.sax.XHTMLContentHandler) CloseShieldInputStream(org.apache.commons.io.input.CloseShieldInputStream) SAXException(org.xml.sax.SAXException)
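
In these examples EmbeddedContentHandler wraps the handler that receives the nested parse. Its documented purpose is to keep the startDocument() and endDocument() events of the nested parse from reaching the wrapped handler, so that several XML sources can feed one output document. The sketch below illustrates that idea with JDK SAX classes only. It is not Tika's implementation: DocumentEventFilter, RecordingHandler and the sample XML strings are hypothetical, and only three content events are forwarded to keep it short.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class DocumentEventFilterDemo {
    /** Parses two small documents into one outer handler and returns that handler. */
    static RecordingHandler parseTwice() {
        RecordingHandler outer = new RecordingHandler();
        try {
            outer.startDocument(); // the outer document is started exactly once
            for (String xml : new String[] {"<a>one</a>", "<b>two</b>"}) {
                SAXParserFactory.newInstance().newSAXParser()
                        .parse(new InputSource(new StringReader(xml)), new DocumentEventFilter(outer));
            }
            outer.endDocument();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return outer;
    }

    public static void main(String[] args) {
        RecordingHandler outer = parseTwice();
        System.out.println(outer.documentStarts + " " + outer.text); // 1 onetwo
    }
}

// Hypothetical stand-in for the idea behind EmbeddedContentHandler.
final class DocumentEventFilter extends DefaultHandler {
    private final ContentHandler target;

    DocumentEventFilter(ContentHandler target) {
        this.target = target;
    }

    @Override
    public void startDocument() { /* not forwarded */ }

    @Override
    public void endDocument() { /* not forwarded */ }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException {
        target.startElement(uri, localName, qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        target.endElement(uri, localName, qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        target.characters(ch, start, length);
    }
}

// Records how many document starts and which text reach the outer document.
final class RecordingHandler extends DefaultHandler {
    int documentStarts;
    final StringBuilder text = new StringBuilder();

    @Override
    public void startDocument() {
        documentStarts++;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }
}
```

Without the filter, each nested parse would deliver its own startDocument() to the outer handler, which is what the wrapping in these examples avoids.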

Example 14 with EmbeddedContentHandler

use of org.apache.tika.sax.EmbeddedContentHandler in project tika by apache.

the class AbstractXML2003Parser method parse.

@Override
public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException {
    setContentType(metadata);
    final XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
    xhtml.startDocument();
    TaggedContentHandler tagged = new TaggedContentHandler(xhtml);
    try {
        context.getSAXParser().parse(new CloseShieldInputStream(stream), new OfflineContentHandler(new EmbeddedContentHandler(getContentHandler(tagged, metadata, context))));
    } catch (SAXException e) {
        tagged.throwIfCauseOf(e);
        throw new TikaException("XML parse error", e);
    } finally {
        xhtml.endDocument();
    }
}
Also used : OfflineContentHandler(org.apache.tika.sax.OfflineContentHandler) TikaException(org.apache.tika.exception.TikaException) TaggedContentHandler(org.apache.tika.sax.TaggedContentHandler) EmbeddedContentHandler(org.apache.tika.sax.EmbeddedContentHandler) XHTMLContentHandler(org.apache.tika.sax.XHTMLContentHandler) CloseShieldInputStream(org.apache.commons.io.input.CloseShieldInputStream) SAXException(org.xml.sax.SAXException)
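
Both parse methods hand the SAX parser a CloseShieldInputStream rather than the caller's stream, so that the parser closing its input does not close the stream the caller still owns. A minimal sketch of that pattern with JDK classes follows. NonClosingInputStream and TrackingStream are hypothetical names, and the shield is simplified: unlike the Commons IO class, it does not make the wrapper itself unreadable after close().

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

final class CloseShieldDemo {
    // Hypothetical minimal shield: close() on the wrapper leaves the wrapped stream open.
    static final class NonClosingInputStream extends FilterInputStream {
        NonClosingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public void close() {
            // Intentionally does not close the underlying stream.
        }
    }

    // Test double that remembers whether close() was called on it.
    static final class TrackingStream extends ByteArrayInputStream {
        boolean closed;

        TrackingStream(byte[] data) {
            super(data);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    /** Returns true if the caller's stream is still open and readable after the shield is closed. */
    static boolean callerStreamSurvives() {
        TrackingStream callerStream = new TrackingStream(new byte[] {1, 2, 3});
        try (InputStream shielded = new NonClosingInputStream(callerStream)) {
            shielded.read(); // consumes the first byte through the shield
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        // The second byte is still available from the caller's stream.
        return !callerStream.closed && callerStream.read() == 2;
    }

    public static void main(String[] args) {
        System.out.println(callerStreamSurvives()); // true
    }
}
```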

Example 15 with EmbeddedContentHandler

use of org.apache.tika.sax.EmbeddedContentHandler in project tika by apache.

the class OpenDocumentParser method handleZipEntry.

private void handleZipEntry(ZipEntry entry, InputStream zip, Metadata metadata, ParseContext context, EndDocumentShieldingContentHandler handler) throws IOException, SAXException, TikaException {
    if (entry == null)
        return;
    if (entry.getName().equals("mimetype")) {
        String type = IOUtils.toString(zip, UTF_8);
        metadata.set(Metadata.CONTENT_TYPE, type);
    } else if (entry.getName().equals(META_NAME)) {
        meta.parse(zip, new DefaultHandler(), metadata, context);
    } else if (entry.getName().endsWith("content.xml")) {
        if (content instanceof OpenDocumentContentParser) {
            ((OpenDocumentContentParser) content).parseInternal(zip, handler, metadata, context);
        } else {
            // Foreign content parser was set:
            content.parse(zip, handler, metadata, context);
        }
    } else if (entry.getName().endsWith("styles.xml")) {
        if (content instanceof OpenDocumentContentParser) {
            ((OpenDocumentContentParser) content).parseInternal(zip, handler, metadata, context);
        } else {
            // Foreign content parser was set:
            content.parse(zip, handler, metadata, context);
        }
    } else {
        String embeddedName = entry.getName();
        //scrape everything under Thumbnails/ and Pictures/
        if (embeddedName.contains("Thumbnails/") || embeddedName.contains("Pictures/")) {
            EmbeddedDocumentExtractor embeddedDocumentExtractor = EmbeddedDocumentUtil.getEmbeddedDocumentExtractor(context);
            Metadata embeddedMetadata = new Metadata();
            embeddedMetadata.set(TikaCoreProperties.ORIGINAL_RESOURCE_NAME, entry.getName());
            /* if (embeddedName.startsWith("Thumbnails/")) {
                    embeddedMetadata.set(TikaCoreProperties.EMBEDDED_RESOURCE_TYPE,
                            TikaCoreProperties.EmbeddedResourceType.THUMBNAIL);
                }*/
            if (embeddedName.contains("Pictures/")) {
                embeddedMetadata.set(TikaMetadataKeys.EMBEDDED_RESOURCE_TYPE, TikaCoreProperties.EmbeddedResourceType.INLINE.toString());
            }
            if (embeddedDocumentExtractor.shouldParseEmbedded(embeddedMetadata)) {
                embeddedDocumentExtractor.parseEmbedded(zip, new EmbeddedContentHandler(handler), embeddedMetadata, false);
            }
        }
    }
}
Also used : EmbeddedDocumentExtractor(org.apache.tika.extractor.EmbeddedDocumentExtractor) Metadata(org.apache.tika.metadata.Metadata) EmbeddedContentHandler(org.apache.tika.sax.EmbeddedContentHandler) DefaultHandler(org.xml.sax.helpers.DefaultHandler)
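
The branching in handleZipEntry amounts to routing each entry by its name. The sketch below restates that routing as a pure function so the order of the checks is easy to see. OdfEntryRouter, Kind and classify are hypothetical names, and the value of META_NAME is not shown in the example, so "meta.xml" is an assumption used only for illustration.

```java
// Hypothetical restatement of the name-based routing in handleZipEntry.
final class OdfEntryRouter {
    enum Kind { MIMETYPE, META, CONTENT, STYLES, EMBEDDED_IMAGE, IGNORED }

    // Assumption: the example's META_NAME constant is not shown; "meta.xml" stands in for it.
    static final String META_NAME = "meta.xml";

    static Kind classify(String name) {
        if (name.equals("mimetype")) {
            return Kind.MIMETYPE;
        }
        if (name.equals(META_NAME)) {
            return Kind.META;
        }
        // endsWith also matches nested entries such as "Object 1/content.xml".
        if (name.endsWith("content.xml")) {
            return Kind.CONTENT;
        }
        if (name.endsWith("styles.xml")) {
            return Kind.STYLES;
        }
        if (name.contains("Thumbnails/") || name.contains("Pictures/")) {
            return Kind.EMBEDDED_IMAGE;
        }
        return Kind.IGNORED;
    }

    public static void main(String[] args) {
        System.out.println(classify("mimetype"));             // MIMETYPE
        System.out.println(classify("Object 1/content.xml")); // CONTENT
        System.out.println(classify("Pictures/img.png"));     // EMBEDDED_IMAGE
        System.out.println(classify("settings.xml"));         // IGNORED
    }
}
```

Only entries in the last matching category are passed to the embedded document extractor, wrapped in an EmbeddedContentHandler; everything else that matches no branch is skipped.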

Aggregations

EmbeddedContentHandler (org.apache.tika.sax.EmbeddedContentHandler) 20
TikaException (org.apache.tika.exception.TikaException) 10
Metadata (org.apache.tika.metadata.Metadata) 10
InputStream (java.io.InputStream) 8
TikaInputStream (org.apache.tika.io.TikaInputStream) 8
IOException (java.io.IOException) 7
SAXException (org.xml.sax.SAXException) 7
CloseShieldInputStream (org.apache.commons.io.input.CloseShieldInputStream) 6
OfflineContentHandler (org.apache.tika.sax.OfflineContentHandler) 6
XHTMLContentHandler (org.apache.tika.sax.XHTMLContentHandler) 6
BodyContentHandler (org.apache.tika.sax.BodyContentHandler) 5
AttributesImpl (org.xml.sax.helpers.AttributesImpl) 4
ByteArrayInputStream (java.io.ByteArrayInputStream) 3
InvalidFormatException (org.apache.poi.openxml4j.exceptions.InvalidFormatException) 3
FileNotFoundException (java.io.FileNotFoundException) 2
PackagePart (org.apache.poi.openxml4j.opc.PackagePart) 2
PackageRelationship (org.apache.poi.openxml4j.opc.PackageRelationship) 2
TaggedContentHandler (org.apache.tika.sax.TaggedContentHandler) 2
ContentHandler (org.xml.sax.ContentHandler) 2
ByteArrayOutputStream (java.io.ByteArrayOutputStream) 1