Example 1 with XPathParser

Use of org.apache.tika.sax.xpath.XPathParser in the project tika by apache.

From the class ContentHandlerExample, method parseOnePartToHTML.

/**
 * Example of extracting just one part of the document's body
 * as an HTML string, excluding the rest.
 */
public String parseOnePartToHTML() throws IOException, SAXException, TikaException {
    // Only get things under html -> body -> div. Note: the XPath below matches
    // any div directly under body; it does not filter on class=header.
    XPathParser xhtmlParser = new XPathParser("xhtml", XHTMLContentHandler.XHTML);
    Matcher divContentMatcher = xhtmlParser.parse("/xhtml:html/xhtml:body/xhtml:div/descendant::node()");
    ContentHandler handler = new MatchingContentHandler(new ToXMLContentHandler(), divContentMatcher);
    AutoDetectParser parser = new AutoDetectParser();
    Metadata metadata = new Metadata();
    try (InputStream stream = ContentHandlerExample.class.getResourceAsStream("test2.doc")) {
        parser.parse(stream, handler, metadata);
        return handler.toString();
    }
}
Also used: ToXMLContentHandler (org.apache.tika.sax.ToXMLContentHandler), XPathParser (org.apache.tika.sax.xpath.XPathParser), Matcher (org.apache.tika.sax.xpath.Matcher), MatchingContentHandler (org.apache.tika.sax.xpath.MatchingContentHandler), InputStream (java.io.InputStream), Metadata (org.apache.tika.metadata.Metadata), AutoDetectParser (org.apache.tika.parser.AutoDetectParser), BodyContentHandler (org.apache.tika.sax.BodyContentHandler), XHTMLContentHandler (org.apache.tika.sax.XHTMLContentHandler), ContentHandler (org.xml.sax.ContentHandler)
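
To make the matching idea in the example above easier to follow, here is a minimal JDK-only sketch of path-based filtering of SAX events. It is not Tika's implementation and does not use XPathParser or MatchingContentHandler. The class name SubtreeTextSketch and the method textUnder are hypothetical, it ignores namespaces, and it keeps only character data (not markup) found under the given element path.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only: a plain SAX handler that keeps text found
// under a fixed element path, roughly mirroring /html/body/div/descendant::node().
class SubtreeTextSketch {

    /** Returns the character data of everything nested under the given element path. */
    static String textUnder(String xml, String... path) throws Exception {
        List<String> open = new ArrayList<>();   // names of the currently open elements
        StringBuilder out = new StringBuilder();

        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                open.add(qName);
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                open.remove(open.size() - 1);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Forward text only while the open-element stack starts with the target path.
                if (insideTarget()) {
                    out.append(ch, start, length);
                }
            }

            private boolean insideTarget() {
                if (open.size() < path.length) {
                    return false;
                }
                for (int i = 0; i < path.length; i++) {
                    if (!open.get(i).equals(path[i])) {
                        return false;
                    }
                }
                return true;
            }
        };

        // The default factory is not namespace-aware, so qName is the plain element name.
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<html><head><title>t</title></head>"
                + "<body><p>skipped</p><div><p>kept</p></div></body></html>";
        System.out.println(textUnder(xml, "html", "body", "div")); // prints "kept"
    }
}
```

In the Tika example, the same role is played by the Matcher produced by XPathParser.parse, and MatchingContentHandler forwards only the matching events to the wrapped ToXMLContentHandler.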
