
Example 6 with OfflineContentHandler

Use of org.apache.tika.sax.OfflineContentHandler in project tika by apache.

In class XmlRootExtractor, method extractRootElement.

/**
 * @since Apache Tika 0.9
 */
public QName extractRootElement(InputStream stream) {
    ExtractorHandler handler = new ExtractorHandler();
    try {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setValidating(false);
        try {
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        } catch (SAXNotRecognizedException e) {
            // TIKA-271 and TIKA-1000: Some XML parsers do not support the secure-processing
            // feature, even though it's required by JAXP in Java 5. Ignoring
            // the exception is fine here; deployments without this feature
            // are inherently vulnerable to XML denial-of-service attacks.
        }
        factory.newSAXParser().parse(new CloseShieldInputStream(stream), new OfflineContentHandler(handler));
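    } catch (Exception ignore) {
        // Best effort: on any failure, fall through and return whatever
        // root element the handler has recorded so far (possibly none).
    }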
    return handler.rootElement;
}
Also used : OfflineContentHandler(org.apache.tika.sax.OfflineContentHandler) CloseShieldInputStream(org.apache.tika.io.CloseShieldInputStream) SAXNotRecognizedException(org.xml.sax.SAXNotRecognizedException) SAXException(org.xml.sax.SAXException) SAXParserFactory(javax.xml.parsers.SAXParserFactory)
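
The example above combines a secure-processing SAX parser with a handler that keeps parsing "offline" and records the root element. The following is a minimal, JDK-only sketch of that idea, not Tika's actual implementation. The names RootElementSketch, OfflineRootHandler, and extractRoot are hypothetical. It assumes that staying offline amounts to answering every resolveEntity call with an empty stream, and that the parse can be aborted by throwing a SAXException as soon as the first element is seen.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.XMLConstants;
import javax.xml.namespace.QName;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class RootElementSketch {

    /** Hypothetical handler: records the first element and never fetches external entities. */
    static class OfflineRootHandler extends DefaultHandler {
        QName rootElement;

        @Override
        public InputSource resolveEntity(String publicId, String systemId) {
            // Assumption: hand the parser an empty stream instead of letting it open systemId.
            return new InputSource(new ByteArrayInputStream(new byte[0]));
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs)
                throws SAXException {
            rootElement = new QName(uri, localName);
            // Assumption: only the root is needed, so abort the parse here.
            throw new SAXException("root element found");
        }
    }

    /** Returns the root element's qualified name, or null if the input cannot be parsed. */
    static QName extractRoot(InputStream stream) {
        OfflineRootHandler handler = new OfflineRootHandler();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setValidating(false);
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            factory.newSAXParser().parse(stream, handler);
        } catch (Exception abortedOrMalformed) {
            // Either the deliberate abort in startElement or unparseable input.
        }
        return handler.rootElement;
    }

    public static void main(String[] args) {
        byte[] xml = "<doc xmlns='urn:example'><child/></doc>".getBytes(StandardCharsets.UTF_8);
        System.out.println(extractRoot(new ByteArrayInputStream(xml)));
    }
}
```

Unlike the Tika method, this sketch does not wrap the stream in a CloseShieldInputStream, so the caller should not rely on the stream staying open after the call.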

Example 7 with OfflineContentHandler

Use of org.apache.tika.sax.OfflineContentHandler in project tika by apache.

In class DIFParser, method parse.

@Override
public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException {
    // Wrap the caller's handler so the output is emitted as XHTML inside a single <p> element.
    final XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
    xhtml.startDocument();
    xhtml.startElement("p");
    TaggedContentHandler tagged = new TaggedContentHandler(handler);
    try {
        context.getSAXParser().parse(new CloseShieldInputStream(stream), new OfflineContentHandler(new EmbeddedContentHandler(getContentHandler(tagged, metadata, context))));
    } catch (SAXException e) {
        tagged.throwIfCauseOf(e);
        throw new TikaException("XML parse error", e);
    } finally {
        xhtml.endElement("p");
        xhtml.endDocument();
    }
}
Also used : OfflineContentHandler(org.apache.tika.sax.OfflineContentHandler) TikaException(org.apache.tika.exception.TikaException) TaggedContentHandler(org.apache.tika.sax.TaggedContentHandler) EmbeddedContentHandler(org.apache.tika.sax.EmbeddedContentHandler) XHTMLContentHandler(org.apache.tika.sax.XHTMLContentHandler) CloseShieldInputStream(org.apache.commons.io.input.CloseShieldInputStream) SAXException(org.xml.sax.SAXException)

Example 8 with OfflineContentHandler

Use of org.apache.tika.sax.OfflineContentHandler in project tika by apache.

In class EpubContentParser, method parse.

public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException {
    SAXParser parser = context.getSAXParser();
    parser.parse(new CloseShieldInputStream(stream), new OfflineContentHandler(handler));
}
Also used : OfflineContentHandler(org.apache.tika.sax.OfflineContentHandler) SAXParser(javax.xml.parsers.SAXParser) CloseShieldInputStream(org.apache.commons.io.input.CloseShieldInputStream)

Example 9 with OfflineContentHandler

Use of org.apache.tika.sax.OfflineContentHandler in project tika by apache.

In class Word2006MLParser, method parse.

@Override
public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException {
    // Set OfficeParserConfig if the user hasn't specified one
    configure(context);
    final XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
    xhtml.startDocument();
    try {
        context.getSAXParser().parse(new CloseShieldInputStream(stream), new OfflineContentHandler(new EmbeddedContentHandler(new Word2006MLDocHandler(xhtml, metadata, context))));
    } catch (SAXException e) {
        throw new TikaException("XML parse error", e);
    } finally {
        xhtml.endDocument();
    }
}
Also used : OfflineContentHandler(org.apache.tika.sax.OfflineContentHandler) TikaException(org.apache.tika.exception.TikaException) EmbeddedContentHandler(org.apache.tika.sax.EmbeddedContentHandler) XHTMLContentHandler(org.apache.tika.sax.XHTMLContentHandler) CloseShieldInputStream(org.apache.commons.io.input.CloseShieldInputStream) SAXException(org.xml.sax.SAXException)

Example 10 with OfflineContentHandler

Use of org.apache.tika.sax.OfflineContentHandler in project tika by apache.

In class AbstractXML2003Parser, method parse.

@Override
public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException {
    setContentType(metadata);
    final XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
    xhtml.startDocument();
    TaggedContentHandler tagged = new TaggedContentHandler(xhtml);
    try {
        context.getSAXParser().parse(new CloseShieldInputStream(stream), new OfflineContentHandler(new EmbeddedContentHandler(getContentHandler(tagged, metadata, context))));
    } catch (SAXException e) {
        tagged.throwIfCauseOf(e);
        throw new TikaException("XML parse error", e);
    } finally {
        xhtml.endDocument();
    }
}
Also used : OfflineContentHandler(org.apache.tika.sax.OfflineContentHandler) TikaException(org.apache.tika.exception.TikaException) TaggedContentHandler(org.apache.tika.sax.TaggedContentHandler) EmbeddedContentHandler(org.apache.tika.sax.EmbeddedContentHandler) XHTMLContentHandler(org.apache.tika.sax.XHTMLContentHandler) CloseShieldInputStream(org.apache.commons.io.input.CloseShieldInputStream) SAXException(org.xml.sax.SAXException)

Aggregations

OfflineContentHandler (org.apache.tika.sax.OfflineContentHandler) 11
CloseShieldInputStream (org.apache.commons.io.input.CloseShieldInputStream) 9
TikaException (org.apache.tika.exception.TikaException) 7
EmbeddedContentHandler (org.apache.tika.sax.EmbeddedContentHandler) 6
XHTMLContentHandler (org.apache.tika.sax.XHTMLContentHandler) 5
SAXException (org.xml.sax.SAXException) 5
InputStream (java.io.InputStream) 3
SAXParser (javax.xml.parsers.SAXParser) 3
TaggedContentHandler (org.apache.tika.sax.TaggedContentHandler) 3
BufferedInputStream (java.io.BufferedInputStream) 1
IOException (java.io.IOException) 1
SAXParserFactory (javax.xml.parsers.SAXParserFactory) 1
ZipArchiveEntry (org.apache.commons.compress.archivers.zip.ZipArchiveEntry) 1
ZipArchiveInputStream (org.apache.commons.compress.archivers.zip.ZipArchiveInputStream) 1
InvalidFormatException (org.apache.poi.openxml4j.exceptions.InvalidFormatException) 1
PackagePart (org.apache.poi.openxml4j.opc.PackagePart) 1
PackageRelationship (org.apache.poi.openxml4j.opc.PackageRelationship) 1
PackageRelationshipCollection (org.apache.poi.openxml4j.opc.PackageRelationshipCollection) 1
CloseShieldInputStream (org.apache.tika.io.CloseShieldInputStream) 1
ParseContext (org.apache.tika.parser.ParseContext) 1