Search in sources :

Example 1 with OfflineContentHandler

use of org.apache.tika.sax.OfflineContentHandler in project tika by apache.

the class XMLParser method parse.

public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException {
    if (metadata.get(Metadata.CONTENT_TYPE) == null) {
        metadata.set(Metadata.CONTENT_TYPE, "application/xml");
    }
    final XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
    xhtml.startDocument();
    xhtml.startElement("p");
    TaggedContentHandler tagged = new TaggedContentHandler(handler);
    try {
        context.getSAXParser().parse(new CloseShieldInputStream(stream), new OfflineContentHandler(new EmbeddedContentHandler(getContentHandler(tagged, metadata, context))));
    } catch (SAXException e) {
        tagged.throwIfCauseOf(e);
        throw new TikaException("XML parse error", e);
    } finally {
        xhtml.endElement("p");
        xhtml.endDocument();
    }
}
Also used : OfflineContentHandler(org.apache.tika.sax.OfflineContentHandler) TikaException(org.apache.tika.exception.TikaException) TaggedContentHandler(org.apache.tika.sax.TaggedContentHandler) EmbeddedContentHandler(org.apache.tika.sax.EmbeddedContentHandler) XHTMLContentHandler(org.apache.tika.sax.XHTMLContentHandler) CloseShieldInputStream(org.apache.commons.io.input.CloseShieldInputStream) SAXException(org.xml.sax.SAXException)

Example 2 with OfflineContentHandler

use of org.apache.tika.sax.OfflineContentHandler in project tika by apache.

the class TesseractOCRParser method extractHOCROutput.

private void extractHOCROutput(InputStream is, ParseContext parseContext, XHTMLContentHandler xhtml) throws TikaException, IOException, SAXException {
    if (parseContext == null) {
        parseContext = new ParseContext();
    }
    SAXParser parser = parseContext.getSAXParser();
    xhtml.startElement("div", "class", "ocr");
    parser.parse(is, new OfflineContentHandler(new HOCRPassThroughHandler(xhtml)));
    xhtml.endElement("div");
}
Also used : OfflineContentHandler(org.apache.tika.sax.OfflineContentHandler) ParseContext(org.apache.tika.parser.ParseContext) SAXParser(javax.xml.parsers.SAXParser)

Example 3 with OfflineContentHandler

use of org.apache.tika.sax.OfflineContentHandler in project tika by apache.

the class SXSLFPowerPointExtractorDecorator method handleSlidePart.

private void handleSlidePart(PackagePart slidePart, XHTMLContentHandler xhtml) throws IOException, SAXException {
    Map<String, String> linkedRelationships = loadLinkedRelationships(slidePart, false, metadata);
    //        Map<String, String> hyperlinks = loadHyperlinkRelationships(packagePart);
    xhtml.startElement("div", "class", "slide-content");
    try (InputStream stream = slidePart.getInputStream()) {
        context.getSAXParser().parse(new CloseShieldInputStream(stream), new OfflineContentHandler(new EmbeddedContentHandler(new OOXMLWordAndPowerPointTextHandler(new OOXMLTikaBodyPartHandler(xhtml), linkedRelationships))));
    } catch (TikaException e) {
        metadata.add(TikaCoreProperties.TIKA_META_EXCEPTION_WARNING, ExceptionUtils.getStackTrace(e));
    }
    xhtml.endElement("div");
    handleBasicRelatedParts(XSLFRelation.SLIDE_LAYOUT.getRelation(), "slide-master-content", slidePart, new PlaceHolderSkipper(new OOXMLWordAndPowerPointTextHandler(new OOXMLTikaBodyPartHandler(xhtml), linkedRelationships)));
    handleBasicRelatedParts(XSLFRelation.NOTES.getRelation(), "slide-notes", slidePart, new OOXMLWordAndPowerPointTextHandler(new OOXMLTikaBodyPartHandler(xhtml), linkedRelationships));
    handleBasicRelatedParts(XSLFRelation.NOTES_MASTER.getRelation(), "slide-notes-master", slidePart, new OOXMLWordAndPowerPointTextHandler(new OOXMLTikaBodyPartHandler(xhtml), linkedRelationships));
    handleBasicRelatedParts(XSLFRelation.COMMENTS.getRelation(), null, slidePart, new XSLFCommentsHandler(xhtml));
}
Also used : OfflineContentHandler(org.apache.tika.sax.OfflineContentHandler) TikaException(org.apache.tika.exception.TikaException) CloseShieldInputStream(org.apache.commons.io.input.CloseShieldInputStream) InputStream(java.io.InputStream) EmbeddedContentHandler(org.apache.tika.sax.EmbeddedContentHandler) CloseShieldInputStream(org.apache.commons.io.input.CloseShieldInputStream)

Example 4 with OfflineContentHandler

use of org.apache.tika.sax.OfflineContentHandler in project tika by apache.

the class SXSLFPowerPointExtractorDecorator method handleBasicRelatedParts.

/**
     * This should handle the comments, master, notes, etc
     *
     * @param contentType
     * @param xhtmlClassLabel
     * @param parentPart
     * @param contentHandler
     */
private void handleBasicRelatedParts(String contentType, String xhtmlClassLabel, PackagePart parentPart, ContentHandler contentHandler) throws SAXException {
    PackageRelationshipCollection relatedPartPRC = null;
    try {
        relatedPartPRC = parentPart.getRelationshipsByType(contentType);
    } catch (InvalidFormatException e) {
        metadata.add(TikaCoreProperties.TIKA_META_EXCEPTION_WARNING, ExceptionUtils.getStackTrace(e));
    }
    if (relatedPartPRC != null && relatedPartPRC.size() > 0) {
        AttributesImpl attributes = new AttributesImpl();
        attributes.addAttribute("", "class", "class", "CDATA", xhtmlClassLabel);
        contentHandler.startElement("", "div", "div", attributes);
        for (int i = 0; i < relatedPartPRC.size(); i++) {
            PackageRelationship relatedPartPackageRelationship = relatedPartPRC.getRelationship(i);
            try {
                PackagePart relatedPartPart = parentPart.getRelatedPart(relatedPartPackageRelationship);
                try (InputStream stream = relatedPartPart.getInputStream()) {
                    context.getSAXParser().parse(stream, new OfflineContentHandler(new EmbeddedContentHandler(contentHandler)));
                } catch (IOException | TikaException e) {
                    metadata.add(TikaCoreProperties.TIKA_META_EXCEPTION_WARNING, ExceptionUtils.getStackTrace(e));
                }
            } catch (InvalidFormatException e) {
                metadata.add(TikaCoreProperties.TIKA_META_EXCEPTION_WARNING, ExceptionUtils.getStackTrace(e));
            }
        }
        contentHandler.endElement("", "div", "div");
    }
}
Also used : PackageRelationship(org.apache.poi.openxml4j.opc.PackageRelationship) AttributesImpl(org.xml.sax.helpers.AttributesImpl) OfflineContentHandler(org.apache.tika.sax.OfflineContentHandler) TikaException(org.apache.tika.exception.TikaException) PackageRelationshipCollection(org.apache.poi.openxml4j.opc.PackageRelationshipCollection) CloseShieldInputStream(org.apache.commons.io.input.CloseShieldInputStream) InputStream(java.io.InputStream) EmbeddedContentHandler(org.apache.tika.sax.EmbeddedContentHandler) IOException(java.io.IOException) PackagePart(org.apache.poi.openxml4j.opc.PackagePart) InvalidFormatException(org.apache.poi.openxml4j.exceptions.InvalidFormatException)

Example 5 with OfflineContentHandler

use of org.apache.tika.sax.OfflineContentHandler in project tika by apache.

the class Word2006MLParser method parse.

@Override
public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException {
    //set OfficeParserConfig if the user hasn't specified one
    configure(context);
    final XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
    xhtml.startDocument();
    try {
        context.getSAXParser().parse(new CloseShieldInputStream(stream), new OfflineContentHandler(new EmbeddedContentHandler(new Word2006MLDocHandler(xhtml, metadata, context))));
    } catch (SAXException e) {
        throw new TikaException("XML parse error", e);
    } finally {
        xhtml.endDocument();
    }
}
Also used : OfflineContentHandler(org.apache.tika.sax.OfflineContentHandler) TikaException(org.apache.tika.exception.TikaException) EmbeddedContentHandler(org.apache.tika.sax.EmbeddedContentHandler) XHTMLContentHandler(org.apache.tika.sax.XHTMLContentHandler) CloseShieldInputStream(org.apache.commons.io.input.CloseShieldInputStream) SAXException(org.xml.sax.SAXException)

Aggregations

OfflineContentHandler (org.apache.tika.sax.OfflineContentHandler)11 CloseShieldInputStream (org.apache.commons.io.input.CloseShieldInputStream)9 TikaException (org.apache.tika.exception.TikaException)7 EmbeddedContentHandler (org.apache.tika.sax.EmbeddedContentHandler)6 XHTMLContentHandler (org.apache.tika.sax.XHTMLContentHandler)5 SAXException (org.xml.sax.SAXException)5 InputStream (java.io.InputStream)3 SAXParser (javax.xml.parsers.SAXParser)3 TaggedContentHandler (org.apache.tika.sax.TaggedContentHandler)3 BufferedInputStream (java.io.BufferedInputStream)1 IOException (java.io.IOException)1 SAXParserFactory (javax.xml.parsers.SAXParserFactory)1 ZipArchiveEntry (org.apache.commons.compress.archivers.zip.ZipArchiveEntry)1 ZipArchiveInputStream (org.apache.commons.compress.archivers.zip.ZipArchiveInputStream)1 InvalidFormatException (org.apache.poi.openxml4j.exceptions.InvalidFormatException)1 PackagePart (org.apache.poi.openxml4j.opc.PackagePart)1 PackageRelationship (org.apache.poi.openxml4j.opc.PackageRelationship)1 PackageRelationshipCollection (org.apache.poi.openxml4j.opc.PackageRelationshipCollection)1 CloseShieldInputStream (org.apache.tika.io.CloseShieldInputStream)1 ParseContext (org.apache.tika.parser.ParseContext)1