Example 1 with XMLParser

Use of org.apache.tika.parser.xml.XMLParser in project tika by apache.

From class TikaParserConfigTest, method defaultParserBlacklist.

/**
 * TIKA-1558 It should be possible to exclude Parsers from being picked up by
 * DefaultParser.
 */
@Test
public void defaultParserBlacklist() throws Exception {
    TikaConfig config = new TikaConfig();
    assertNotNull(config.getParser());
    assertNotNull(config.getDetector());
    CompositeParser cp = (CompositeParser) config.getParser();
    List<Parser> parsers = cp.getAllComponentParsers();
    boolean hasXML = false;
    for (Parser p : parsers) {
        if (p instanceof XMLParser) {
            hasXML = true;
            break;
        }
    }
    assertTrue("Default config should include an XMLParser.", hasXML);
    // This custom TikaConfig should exclude XMLParser and all of its subclasses.
    config = getConfig("TIKA-1558-blacklistsub.xml");
    cp = (CompositeParser) config.getParser();
    parsers = cp.getAllComponentParsers();
    for (Parser p : parsers) {
        if (p instanceof XMLParser) {
            fail("Custom config should not include an XMLParser (" + p.getClass() + ").");
        }
    }
}
Also used : CompositeParser(org.apache.tika.parser.CompositeParser) XMLParser(org.apache.tika.parser.xml.XMLParser) Parser(org.apache.tika.parser.Parser) ExecutableParser(org.apache.tika.parser.executable.ExecutableParser) DefaultParser(org.apache.tika.parser.DefaultParser) EmptyParser(org.apache.tika.parser.EmptyParser) Test(org.junit.Test)
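
The test above expects a blacklisted parser class and all of its subclasses to disappear from the component list. The following is a minimal sketch of that filtering idea only, not Tika's implementation: every type in it (ExclusionSketch, Parser, XmlParser, DcXmlParser, TextParser) is a hypothetical stand-in so the snippet runs without Tika on the classpath.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for illustration only; these are NOT the Tika classes.
class ExclusionSketch {
    interface Parser {}
    static class XmlParser implements Parser {}
    static class DcXmlParser extends XmlParser {}  // subclass of the excluded type
    static class TextParser implements Parser {}

    // Keep every parser that is not an instance of the excluded type.
    // Class.isInstance behaves like the instanceof check in the test,
    // so subclasses of the excluded type are dropped as well.
    static List<Parser> exclude(List<Parser> parsers, Class<? extends Parser> excluded) {
        List<Parser> kept = new ArrayList<>();
        for (Parser p : parsers) {
            if (!excluded.isInstance(p)) {
                kept.add(p);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Parser> all = Arrays.<Parser>asList(
                new XmlParser(), new DcXmlParser(), new TextParser());
        List<Parser> kept = exclude(all, XmlParser.class);
        System.out.println(kept.size());  // prints 1: only TextParser remains
    }
}
```

Matching on the class hierarchy rather than on the exact class is what lets a single blacklist entry cover subclasses too, which is the behavior the second half of the test asserts.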

Example 2 with XMLParser

Use of org.apache.tika.parser.xml.XMLParser in project tika by apache.

From class TIAParsingExample, method useCompositeParser.

public static void useCompositeParser() throws Exception {
    // Empty input: the example only demonstrates how the parser is wired up.
    InputStream stream = new ByteArrayInputStream(new byte[0]);
    ContentHandler handler = new DefaultHandler();
    ParseContext context = new ParseContext();
    // Map each media type to the parser that should handle it.
    Map<MediaType, Parser> parsersByType = new HashMap<>();
    parsersByType.put(MediaType.parse("text/html"), new HtmlParser());
    parsersByType.put(MediaType.parse("application/xml"), new XMLParser());
    CompositeParser parser = new CompositeParser();
    parser.setParsers(parsersByType);
    // Parser used when no registered media type matches.
    parser.setFallback(new TXTParser());
    // The declared content type selects the component parser (HtmlParser here).
    Metadata metadata = new Metadata();
    metadata.set(Metadata.CONTENT_TYPE, "text/html");
    parser.parse(stream, handler, metadata, context);
}
Also used : HashMap(java.util.HashMap) GZIPInputStream(java.util.zip.GZIPInputStream) ByteArrayInputStream(java.io.ByteArrayInputStream) TikaInputStream(org.apache.tika.io.TikaInputStream) FileInputStream(java.io.FileInputStream) InputStream(java.io.InputStream) CompositeParser(org.apache.tika.parser.CompositeParser) Metadata(org.apache.tika.metadata.Metadata) BodyContentHandler(org.apache.tika.sax.BodyContentHandler) ContentHandler(org.xml.sax.ContentHandler) LinkContentHandler(org.apache.tika.sax.LinkContentHandler) TeeContentHandler(org.apache.tika.sax.TeeContentHandler) DefaultHandler(org.xml.sax.helpers.DefaultHandler) Parser(org.apache.tika.parser.Parser) XMLParser(org.apache.tika.parser.xml.XMLParser) HtmlParser(org.apache.tika.parser.html.HtmlParser) TXTParser(org.apache.tika.parser.txt.TXTParser) AutoDetectParser(org.apache.tika.parser.AutoDetectParser) ParseContext(org.apache.tika.parser.ParseContext) MediaType(org.apache.tika.mime.MediaType)
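
The example above registers one parser per media type and a fallback for everything else. As a rough sketch of that dispatch pattern, assuming a plain exact-match lookup (the real CompositeParser may resolve types differently, for example through aliases or supertypes), the code below uses hypothetical stand-in types (DispatchSketch, Parser, NamedParser) and plain strings in place of MediaType so it runs without Tika.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for illustration only; these are NOT the Tika classes.
class DispatchSketch {
    interface Parser {
        String name();
    }

    static class NamedParser implements Parser {
        private final String name;
        NamedParser(String name) { this.name = name; }
        public String name() { return name; }
    }

    private final Map<String, Parser> parsersByType = new HashMap<>();
    private Parser fallback;

    void register(String mediaType, Parser parser) {
        parsersByType.put(mediaType, parser);
    }

    void setFallback(Parser fallback) {
        this.fallback = fallback;
    }

    // Return the parser registered for the declared content type,
    // or the fallback when nothing is registered for it.
    Parser select(String contentType) {
        Parser p = parsersByType.get(contentType);
        return p != null ? p : fallback;
    }

    public static void main(String[] args) {
        DispatchSketch d = new DispatchSketch();
        d.register("text/html", new NamedParser("html"));
        d.register("application/xml", new NamedParser("xml"));
        d.setFallback(new NamedParser("txt"));
        System.out.println(d.select("text/html").name());       // prints html
        System.out.println(d.select("application/pdf").name()); // prints txt
    }
}
```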

Aggregations

CompositeParser (org.apache.tika.parser.CompositeParser): 2
Parser (org.apache.tika.parser.Parser): 2
XMLParser (org.apache.tika.parser.xml.XMLParser): 2
ByteArrayInputStream (java.io.ByteArrayInputStream): 1
FileInputStream (java.io.FileInputStream): 1
InputStream (java.io.InputStream): 1
HashMap (java.util.HashMap): 1
GZIPInputStream (java.util.zip.GZIPInputStream): 1
TikaInputStream (org.apache.tika.io.TikaInputStream): 1
Metadata (org.apache.tika.metadata.Metadata): 1
MediaType (org.apache.tika.mime.MediaType): 1
AutoDetectParser (org.apache.tika.parser.AutoDetectParser): 1
DefaultParser (org.apache.tika.parser.DefaultParser): 1
EmptyParser (org.apache.tika.parser.EmptyParser): 1
ParseContext (org.apache.tika.parser.ParseContext): 1
ExecutableParser (org.apache.tika.parser.executable.ExecutableParser): 1
HtmlParser (org.apache.tika.parser.html.HtmlParser): 1
TXTParser (org.apache.tika.parser.txt.TXTParser): 1
BodyContentHandler (org.apache.tika.sax.BodyContentHandler): 1
LinkContentHandler (org.apache.tika.sax.LinkContentHandler): 1