Example 6 with MatchingContentHandler

Use of org.apache.tika.sax.xpath.MatchingContentHandler in the project tika by apache.

From the class ContentHandlerExample, method parseOnePartToHTML.

/**
 * Example of extracting just one part of the document's body
 * as an HTML string, excluding the rest.
 */
public String parseOnePartToHTML() throws IOException, SAXException, TikaException {
    // Only get things under html -> body -> div. The XPath below selects the
    // descendants of a div directly under body; the path itself does not
    // filter on the class attribute (e.g. class=header).
    XPathParser xhtmlParser = new XPathParser("xhtml", XHTMLContentHandler.XHTML);
    Matcher divContentMatcher = xhtmlParser.parse("/xhtml:html/xhtml:body/xhtml:div/descendant::node()");
    // Forward only the matching SAX events to the XML-serializing handler
    ContentHandler handler = new MatchingContentHandler(new ToXMLContentHandler(), divContentMatcher);
    AutoDetectParser parser = new AutoDetectParser();
    Metadata metadata = new Metadata();
    try (InputStream stream = ContentHandlerExample.class.getResourceAsStream("test2.doc")) {
        parser.parse(stream, handler, metadata);
        // toString() is delegated to the wrapped ToXMLContentHandler
        return handler.toString();
    }
}
Also used: ToXMLContentHandler (org.apache.tika.sax.ToXMLContentHandler), XPathParser (org.apache.tika.sax.xpath.XPathParser), Matcher (org.apache.tika.sax.xpath.Matcher), MatchingContentHandler (org.apache.tika.sax.xpath.MatchingContentHandler), InputStream (java.io.InputStream), Metadata (org.apache.tika.metadata.Metadata), AutoDetectParser (org.apache.tika.parser.AutoDetectParser), BodyContentHandler (org.apache.tika.sax.BodyContentHandler), XHTMLContentHandler (org.apache.tika.sax.XHTMLContentHandler), ContentHandler (org.xml.sax.ContentHandler)
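
The idea behind the example above, forwarding only the SAX events that lie under a given element path, can be sketched with the JDK's own SAX classes. This is a simplified illustration, not Tika's implementation: the class DivSubtreeSketch, its nested DivTextHandler, and the method extractDivText are hypothetical names, the path html/body/div is hard-coded instead of parsed from an XPath expression, namespaces are ignored, and only character data is collected rather than full XML output.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class; not part of Apache Tika.
class DivSubtreeSketch {

    /** Collects character data found below html/body/div and ignores everything else. */
    static class DivTextHandler extends DefaultHandler {
        // Element names from the document root down to the current element
        private final Deque<String> path = new ArrayDeque<>();
        private final StringBuilder out = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            path.addLast(qName);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            path.removeLast();
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Keep text only while the current position is inside the div subtree
            if (insideDiv()) {
                out.append(ch, start, length);
            }
        }

        /** True when the first three elements on the path are html, body, div. */
        private boolean insideDiv() {
            if (path.size() < 3) {
                return false;
            }
            Iterator<String> it = path.iterator(); // iterates root first
            return it.next().equals("html")
                    && it.next().equals("body")
                    && it.next().equals("div");
        }

        String text() {
            return out.toString();
        }
    }

    /** Parses the given XML string and returns the text inside html/body/div. */
    static String extractDivText(String xml) {
        DivTextHandler handler = new DivTextHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
        return handler.text();
    }

    public static void main(String[] args) {
        String xml = "<html><body><p>skip</p><div>keep <b>this</b></div><p>skip too</p></body></html>";
        System.out.println(extractDivText(xml)); // prints "keep this"
    }
}
```

In the Tika example the same filtering decision is made by the Matcher built from the XPath expression, and the matching events are passed to a wrapped ContentHandler instead of being appended to a StringBuilder.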

Aggregations

Matcher (org.apache.tika.sax.xpath.Matcher): 6
MatchingContentHandler (org.apache.tika.sax.xpath.MatchingContentHandler): 6
ContentHandler (org.xml.sax.ContentHandler): 6
TeeContentHandler (org.apache.tika.sax.TeeContentHandler): 4
CompositeMatcher (org.apache.tika.sax.xpath.CompositeMatcher): 4
AttributeMetadataHandler (org.apache.tika.parser.xml.AttributeMetadataHandler): 3
InputStream (java.io.InputStream): 2
Metadata (org.apache.tika.metadata.Metadata): 2
AutoDetectParser (org.apache.tika.parser.AutoDetectParser): 2
AttributeDependantMetadataHandler (org.apache.tika.parser.xml.AttributeDependantMetadataHandler): 2
XHTMLContentHandler (org.apache.tika.sax.XHTMLContentHandler): 2
XPathParser (org.apache.tika.sax.xpath.XPathParser): 2
StringWriter (java.io.StringWriter): 1
SolrException (org.apache.solr.common.SolrException): 1
NamedList (org.apache.solr.common.util.NamedList): 1
TikaException (org.apache.tika.exception.TikaException): 1
MediaType (org.apache.tika.mime.MediaType): 1
DefaultParser (org.apache.tika.parser.DefaultParser): 1
ParseContext (org.apache.tika.parser.ParseContext): 1
Parser (org.apache.tika.parser.Parser): 1