Example 1 with TeeContentHandler

Use of org.apache.tika.sax.TeeContentHandler in project tika by apache.

Class ParserPostProcessor, method parse.

/**
 * Forwards the call to the delegated parser and post-processes the
 * results as described above.
 */
public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException {
    ContentHandler body = new BodyContentHandler();
    ContentHandler tee = new TeeContentHandler(handler, body);
    super.parse(stream, tee, metadata, context);
    String content = body.toString();
    metadata.set("fulltext", content);
    int length = Math.min(content.length(), 500);
    metadata.set("summary", content.substring(0, length));
    for (String link : RegexUtils.extractLinks(content)) {
        metadata.add("outlinks", link);
    }
}
Also used: BodyContentHandler(org.apache.tika.sax.BodyContentHandler) TeeContentHandler(org.apache.tika.sax.TeeContentHandler) ContentHandler(org.xml.sax.ContentHandler)
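
The post-processing step in Example 1 (a summary capped at 500 characters, plus extracted links) can be sketched with the JDK alone. This is a minimal illustration, not Tika's code: the class PostProcessSketch, its method names, and the simplified URL pattern are assumptions made for this sketch, and the pattern is not the one RegexUtils.extractLinks uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper mirroring the summary/outlinks logic of Example 1.
class PostProcessSketch {
    // Simplified URL pattern (assumption): http or https followed by
    // any run of characters that are not whitespace, angle brackets or quotes.
    private static final Pattern LINK = Pattern.compile("https?://[^\\s<>\"]+");

    // Returns at most maxChars leading characters of content.
    static String summary(String content, int maxChars) {
        return content.substring(0, Math.min(content.length(), maxChars));
    }

    // Returns every match of the simplified pattern, in order of appearance.
    static List<String> links(String content) {
        List<String> out = new ArrayList<>();
        Matcher m = LINK.matcher(content);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }
}
```

Taking the minimum of the content length and the cap is what keeps substring from throwing on documents shorter than the cap.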

Example 2 with TeeContentHandler

Use of org.apache.tika.sax.TeeContentHandler in project tika by apache.

Class OpenDocumentMetaParser, method getStatistic.

private static ContentHandler getStatistic(ContentHandler ch, Metadata md, Property property, String attribute) {
    Matcher matcher = META_XPATH.parse("//meta:document-statistic/@meta:" + attribute);
    ContentHandler branch = new MatchingContentHandler(new AttributeMetadataHandler(META_NS, attribute, md, property), matcher);
    return new TeeContentHandler(ch, branch);
}
Also used: CompositeMatcher(org.apache.tika.sax.xpath.CompositeMatcher) Matcher(org.apache.tika.sax.xpath.Matcher) MatchingContentHandler(org.apache.tika.sax.xpath.MatchingContentHandler) TeeContentHandler(org.apache.tika.sax.TeeContentHandler) ContentHandler(org.xml.sax.ContentHandler) AttributeMetadataHandler(org.apache.tika.parser.xml.AttributeMetadataHandler)

Example 3 with TeeContentHandler

Use of org.apache.tika.sax.TeeContentHandler in project tika by apache.

Class TIAParsingExample, method testTeeContentHandler.

public static void testTeeContentHandler(String filename) throws Exception {
    InputStream stream = new ByteArrayInputStream(new byte[0]);
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();
    Parser parser = new AutoDetectParser();
    LinkContentHandler linkCollector = new LinkContentHandler();
    try (OutputStream output = new FileOutputStream(new File(filename))) {
        ContentHandler handler = new TeeContentHandler(new BodyContentHandler(output), linkCollector);
        parser.parse(stream, handler, metadata, context);
    }
}
Also used: BodyContentHandler(org.apache.tika.sax.BodyContentHandler) GZIPInputStream(java.util.zip.GZIPInputStream) ByteArrayInputStream(java.io.ByteArrayInputStream) TikaInputStream(org.apache.tika.io.TikaInputStream) FileInputStream(java.io.FileInputStream) InputStream(java.io.InputStream) OutputStream(java.io.OutputStream) FileOutputStream(java.io.FileOutputStream) Metadata(org.apache.tika.metadata.Metadata) ContentHandler(org.xml.sax.ContentHandler) LinkContentHandler(org.apache.tika.sax.LinkContentHandler) TeeContentHandler(org.apache.tika.sax.TeeContentHandler) Parser(org.apache.tika.parser.Parser) XMLParser(org.apache.tika.parser.xml.XMLParser) HtmlParser(org.apache.tika.parser.html.HtmlParser) TXTParser(org.apache.tika.parser.txt.TXTParser) CompositeParser(org.apache.tika.parser.CompositeParser) AutoDetectParser(org.apache.tika.parser.AutoDetectParser) ParseContext(org.apache.tika.parser.ParseContext) File(java.io.File)
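
The tee pattern in Example 3 sends a single parse to two consumers at once. The following JDK-only sketch shows the idea under stated assumptions: SimpleTee and TeeSketch are hypothetical names, SimpleTee is a simplified stand-in rather than Tika's TeeContentHandler, and it forwards only three of the ContentHandler callbacks where a full tee would forward all of them.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for a tee handler: forwards each event to every delegate.
class SimpleTee extends DefaultHandler {
    private final ContentHandler[] delegates;

    SimpleTee(ContentHandler... delegates) {
        this.delegates = delegates;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) throws SAXException {
        for (ContentHandler h : delegates) {
            h.startElement(uri, local, qName, atts);
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        for (ContentHandler h : delegates) {
            h.endElement(uri, local, qName);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        for (ContentHandler h : delegates) {
            h.characters(ch, start, length);
        }
    }
}

class TeeSketch {
    // Parses xml once and returns { collected text, comma-joined element names }.
    static String[] run(String xml) throws Exception {
        StringBuilder text = new StringBuilder();
        List<String> names = new ArrayList<>();
        // First consumer: accumulates character data only.
        ContentHandler textCollector = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        // Second consumer: records element names only.
        ContentHandler nameCollector = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                names.add(qName);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new SimpleTee(textCollector, nameCollector));
        return new String[] { text.toString(), String.join(",", names) };
    }

    public static void main(String[] args) throws Exception {
        String[] result = run("<doc><p>hello</p></doc>");
        System.out.println(result[0] + " | " + result[1]);
    }
}
```

Because both collectors see the same event stream, the input is read only once, which matters when the source is a stream that cannot be rewound.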

Example 4 with TeeContentHandler

Use of org.apache.tika.sax.TeeContentHandler in project tika by apache.

Class HtmlParserTest, method testParseAscii.

@Test
public void testParseAscii() throws Exception {
    String path = "/test-documents/testHTML.html";
    final StringWriter href = new StringWriter();
    final StringWriter name = new StringWriter();
    ContentHandler body = new BodyContentHandler();
    Metadata metadata = new Metadata();
    try (InputStream stream = HtmlParserTest.class.getResourceAsStream(path)) {
        ContentHandler link = new DefaultHandler() {

            @Override
            public void startElement(String u, String l, String n, Attributes a) throws SAXException {
                if ("a".equals(l)) {
                    if (a.getValue("href") != null) {
                        href.append(a.getValue("href"));
                    } else if (a.getValue("name") != null) {
                        name.append(a.getValue("name"));
                    }
                }
            }
        };
        new HtmlParser().parse(stream, new TeeContentHandler(body, link), metadata, new ParseContext());
    }
    assertEquals("Title : Test Indexation Html", metadata.get(TikaCoreProperties.TITLE));
    assertEquals("Tika Developers", metadata.get("Author"));
    assertEquals("5", metadata.get("refresh"));
    assertEquals("51.2312", metadata.get(Geographic.LATITUDE));
    assertEquals("-5.1987", metadata.get(Geographic.LONGITUDE));
    assertEquals("http://www.apache.org/", href.toString());
    assertEquals("test-anchor", name.toString());
    String content = body.toString();
    assertTrue("Did not contain expected text:" + "Test Indexation Html", content.contains("Test Indexation Html"));
    assertTrue("Did not contain expected text:" + "Indexation du fichier", content.contains("Indexation du fichier"));
}
Also used: BodyContentHandler(org.apache.tika.sax.BodyContentHandler) StringWriter(java.io.StringWriter) ByteArrayInputStream(java.io.ByteArrayInputStream) TikaInputStream(org.apache.tika.io.TikaInputStream) InputStream(java.io.InputStream) Metadata(org.apache.tika.metadata.Metadata) Attributes(org.xml.sax.Attributes) ParseContext(org.apache.tika.parser.ParseContext) TeeContentHandler(org.apache.tika.sax.TeeContentHandler) LinkContentHandler(org.apache.tika.sax.LinkContentHandler) ContentHandler(org.xml.sax.ContentHandler) DefaultHandler(org.xml.sax.helpers.DefaultHandler) Test(org.junit.Test) TikaTest(org.apache.tika.TikaTest)
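
The anonymous handler in Example 4 can be tried in isolation with the JDK's SAX parser. The sketch below is an assumption-laden illustration, not part of the Tika test: AnchorSketch and collect are hypothetical names, and the input must be well-formed XML because HtmlParser's lenient HTML handling is not reproduced here.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper reproducing the href/name branching of Example 4.
class AnchorSketch {
    // Returns { concatenated href values, concatenated name values } of <a> elements.
    static String[] collect(String xhtml) throws Exception {
        StringBuilder href = new StringBuilder();
        StringBuilder name = new StringBuilder();
        DefaultHandler link = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                // The default JDK parser is not namespace-aware, so the
                // qualified name is compared here instead of the local name.
                if ("a".equals(qName)) {
                    if (atts.getValue("href") != null) {
                        href.append(atts.getValue("href"));
                    } else if (atts.getValue("name") != null) {
                        name.append(atts.getValue("name"));
                    }
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), link);
        return new String[] { href.toString(), name.toString() };
    }
}
```

As in the original, an anchor carrying both attributes contributes only its href, since the name branch is reached only when href is absent.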

Example 5 with TeeContentHandler

Use of org.apache.tika.sax.TeeContentHandler in project spring-boot-quick by vector4wang.

Class TikaUtil, method handleStreamMetaDate.

public static Map<String, String> handleStreamMetaDate(byte[] file) throws Exception {
    Map<String, String> meta = new HashMap<>();
    Metadata md = new Metadata();
    StringWriter textBuffer = new StringWriter();
    ContentHandler handler = new TeeContentHandler(getTextContentHandler(textBuffer));
    // parser and context are members of TikaUtil declared outside this snippet.
    // try-with-resources closes the TikaInputStream even if parsing fails.
    try (TikaInputStream input = TikaInputStream.get(file, md)) {
        parser.parse(input, handler, md, context);
    }
    // Copy the metadata into a plain map, visiting the names in sorted order.
    String[] names = md.names();
    Arrays.sort(names);
    for (String name : names) {
        meta.put(name, md.get(name));
    }
    return meta;
}
Also used: HashMap(java.util.HashMap) Metadata(org.apache.tika.metadata.Metadata) TikaInputStream(org.apache.tika.io.TikaInputStream) TeeContentHandler(org.apache.tika.sax.TeeContentHandler) BodyContentHandler(org.apache.tika.sax.BodyContentHandler) BoilerpipeContentHandler(org.apache.tika.parser.html.BoilerpipeContentHandler) ContentHandler(org.xml.sax.ContentHandler) XHTMLContentHandler(org.apache.tika.sax.XHTMLContentHandler)

Aggregations

TeeContentHandler (org.apache.tika.sax.TeeContentHandler): 15
ContentHandler (org.xml.sax.ContentHandler): 14
Metadata (org.apache.tika.metadata.Metadata): 6
BodyContentHandler (org.apache.tika.sax.BodyContentHandler): 6
TikaInputStream (org.apache.tika.io.TikaInputStream): 5
ParseContext (org.apache.tika.parser.ParseContext): 4
XHTMLContentHandler (org.apache.tika.sax.XHTMLContentHandler): 4
CompositeMatcher (org.apache.tika.sax.xpath.CompositeMatcher): 4
Matcher (org.apache.tika.sax.xpath.Matcher): 4
MatchingContentHandler (org.apache.tika.sax.xpath.MatchingContentHandler): 4
ByteArrayInputStream (java.io.ByteArrayInputStream): 3
BoilerpipeContentHandler (org.apache.tika.parser.html.BoilerpipeContentHandler): 3
AttributeMetadataHandler (org.apache.tika.parser.xml.AttributeMetadataHandler): 3
ElementMetadataHandler (org.apache.tika.parser.xml.ElementMetadataHandler): 3
IOException (java.io.IOException): 2
InputStream (java.io.InputStream): 2
StringWriter (java.io.StringWriter): 2
AutoDetectParser (org.apache.tika.parser.AutoDetectParser): 2
CompositeParser (org.apache.tika.parser.CompositeParser): 2
Parser (org.apache.tika.parser.Parser): 2