
Example 1 with TXTParser

Use of org.apache.tika.parser.txt.TXTParser in project tika by apache, in the class TikaEncodingDetectorTest, method testEncodingDetectorsAreLoaded.

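@Test
public void testEncodingDetectorsAreLoaded() {
    // The cast shows that TXTParser is an AbstractEncodingDetectorParser, which exposes getEncodingDetector().
    EncodingDetector encodingDetector = ((AbstractEncodingDetectorParser) new TXTParser()).getEncodingDetector();
    // A freshly constructed parser is expected to carry a composite of the loaded detectors.
    assertTrue(encodingDetector instanceof CompositeEncodingDetector);
}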
Also used: Icu4jEncodingDetector (org.apache.tika.parser.txt.Icu4jEncodingDetector), NonDetectingEncodingDetector (org.apache.tika.detect.NonDetectingEncodingDetector), UniversalEncodingDetector (org.apache.tika.parser.txt.UniversalEncodingDetector), CompositeEncodingDetector (org.apache.tika.detect.CompositeEncodingDetector), EncodingDetector (org.apache.tika.detect.EncodingDetector), HtmlEncodingDetector (org.apache.tika.parser.html.HtmlEncodingDetector), TXTParser (org.apache.tika.parser.txt.TXTParser), AbstractEncodingDetectorParser (org.apache.tika.parser.AbstractEncodingDetectorParser), Test (org.junit.Test)
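
The test above only checks that the parser holds a composite detector. As a rough illustration of what "composite" usually means for detectors, here is a minimal, self-contained sketch in plain Java: each child detector is tried in order and the first non-null answer wins. All names here (SimpleDetector, Utf8BomDetector, AsciiDetector, CompositeDetector) are hypothetical stand-ins, not Tika classes, and the sketch does not claim to reproduce Tika's actual implementation.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class CompositeDetectorSketch {

    /** Hypothetical stand-in for an encoding detector; returns null when it cannot decide. */
    interface SimpleDetector {
        Charset detect(byte[] prefix);
    }

    /** Recognizes a UTF-8 byte order mark (EF BB BF) and nothing else. */
    static class Utf8BomDetector implements SimpleDetector {
        @Override
        public Charset detect(byte[] prefix) {
            boolean hasBom = prefix.length >= 3
                    && (prefix[0] & 0xFF) == 0xEF
                    && (prefix[1] & 0xFF) == 0xBB
                    && (prefix[2] & 0xFF) == 0xBF;
            return hasBom ? StandardCharsets.UTF_8 : null;
        }
    }

    /** Accepts input made only of 7-bit bytes as US-ASCII. */
    static class AsciiDetector implements SimpleDetector {
        @Override
        public Charset detect(byte[] prefix) {
            for (byte b : prefix) {
                if (b < 0) {
                    return null; // high bit set, so not plain ASCII
                }
            }
            return StandardCharsets.US_ASCII;
        }
    }

    /** Tries each child in order and returns the first non-null result. */
    static class CompositeDetector implements SimpleDetector {
        private final List<SimpleDetector> detectors;

        CompositeDetector(List<SimpleDetector> detectors) {
            this.detectors = detectors;
        }

        @Override
        public Charset detect(byte[] prefix) {
            for (SimpleDetector detector : detectors) {
                Charset charset = detector.detect(prefix);
                if (charset != null) {
                    return charset;
                }
            }
            return null; // no child could decide
        }
    }

    /** Convenience entry point: BOM check first, then the ASCII check. */
    static Charset detect(byte[] prefix) {
        SimpleDetector composite = new CompositeDetector(
                Arrays.asList(new Utf8BomDetector(), new AsciiDetector()));
        return composite.detect(prefix);
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(detect(withBom));
        System.out.println(detect("plain".getBytes(StandardCharsets.US_ASCII)));
    }
}
```

The order of the child list matters in this sketch: a more specific detector placed first takes precedence over a more permissive one placed later.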

Example 2 with TXTParser

Use of org.apache.tika.parser.txt.TXTParser in project tika by apache, in the class TIAParsingExample, method useCompositeParser.

public static void useCompositeParser() throws Exception {
    // Empty input stream, used here only to demonstrate the call sequence.
    InputStream stream = new ByteArrayInputStream(new byte[0]);
    ContentHandler handler = new DefaultHandler();
    ParseContext context = new ParseContext();
    // Map each supported media type to the parser that should handle it.
    Map<MediaType, Parser> parsersByType = new HashMap<>();
    parsersByType.put(MediaType.parse("text/html"), new HtmlParser());
    parsersByType.put(MediaType.parse("application/xml"), new XMLParser());
    CompositeParser parser = new CompositeParser();
    parser.setParsers(parsersByType);
    // TXTParser handles any type that has no entry in the map.
    parser.setFallback(new TXTParser());
    // The declared content type decides which parser is chosen.
    Metadata metadata = new Metadata();
    metadata.set(Metadata.CONTENT_TYPE, "text/html");
    parser.parse(stream, handler, metadata, context);
}
Also used: HashMap (java.util.HashMap), GZIPInputStream (java.util.zip.GZIPInputStream), ByteArrayInputStream (java.io.ByteArrayInputStream), TikaInputStream (org.apache.tika.io.TikaInputStream), FileInputStream (java.io.FileInputStream), InputStream (java.io.InputStream), CompositeParser (org.apache.tika.parser.CompositeParser), Metadata (org.apache.tika.metadata.Metadata), BodyContentHandler (org.apache.tika.sax.BodyContentHandler), ContentHandler (org.xml.sax.ContentHandler), LinkContentHandler (org.apache.tika.sax.LinkContentHandler), TeeContentHandler (org.apache.tika.sax.TeeContentHandler), DefaultHandler (org.xml.sax.helpers.DefaultHandler), Parser (org.apache.tika.parser.Parser), XMLParser (org.apache.tika.parser.xml.XMLParser), HtmlParser (org.apache.tika.parser.html.HtmlParser), TXTParser (org.apache.tika.parser.txt.TXTParser), AutoDetectParser (org.apache.tika.parser.AutoDetectParser), ParseContext (org.apache.tika.parser.ParseContext), MediaType (org.apache.tika.mime.MediaType)
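
The dispatch idea in Example 2 (look up a parser by declared type, otherwise use the fallback) can be sketched without any Tika dependency. The following is a simplified illustration under stated assumptions: SimpleParser, Dispatcher, and the string tags are hypothetical, media types are plain strings, and the sketch does an exact-match lookup only, which may be simpler than what the real CompositeParser does.

```java
import java.util.HashMap;
import java.util.Map;

public class FallbackDispatchSketch {

    /** Hypothetical stand-in for a parser: turns input text into a tagged result. */
    interface SimpleParser {
        String parse(String input);
    }

    /** Chooses a parser by declared type; unknown types go to the fallback. */
    static class Dispatcher {
        private final Map<String, SimpleParser> parsersByType = new HashMap<>();
        private SimpleParser fallback = input -> ""; // default: produce nothing

        void register(String type, SimpleParser parser) {
            parsersByType.put(type, parser);
        }

        void setFallback(SimpleParser parser) {
            this.fallback = parser;
        }

        String parse(String declaredType, String input) {
            // Exact-match lookup; anything unregistered is handled by the fallback.
            SimpleParser parser = parsersByType.getOrDefault(declaredType, fallback);
            return parser.parse(input);
        }
    }

    /** Builds a dispatcher that mirrors the shape of Example 2. */
    static Dispatcher example() {
        Dispatcher dispatcher = new Dispatcher();
        dispatcher.register("text/html", input -> "html:" + input);
        dispatcher.register("application/xml", input -> "xml:" + input);
        dispatcher.setFallback(input -> "text:" + input);
        return dispatcher;
    }

    public static void main(String[] args) {
        Dispatcher dispatcher = example();
        System.out.println(dispatcher.parse("text/html", "<p>hi</p>")); // handled by the html entry
        System.out.println(dispatcher.parse("text/csv", "a,b"));        // no entry, goes to the fallback
    }
}
```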

Aggregations

TXTParser (org.apache.tika.parser.txt.TXTParser): 2
ByteArrayInputStream (java.io.ByteArrayInputStream): 1
FileInputStream (java.io.FileInputStream): 1
InputStream (java.io.InputStream): 1
HashMap (java.util.HashMap): 1
GZIPInputStream (java.util.zip.GZIPInputStream): 1
CompositeEncodingDetector (org.apache.tika.detect.CompositeEncodingDetector): 1
EncodingDetector (org.apache.tika.detect.EncodingDetector): 1
NonDetectingEncodingDetector (org.apache.tika.detect.NonDetectingEncodingDetector): 1
TikaInputStream (org.apache.tika.io.TikaInputStream): 1
Metadata (org.apache.tika.metadata.Metadata): 1
MediaType (org.apache.tika.mime.MediaType): 1
AbstractEncodingDetectorParser (org.apache.tika.parser.AbstractEncodingDetectorParser): 1
AutoDetectParser (org.apache.tika.parser.AutoDetectParser): 1
CompositeParser (org.apache.tika.parser.CompositeParser): 1
ParseContext (org.apache.tika.parser.ParseContext): 1
Parser (org.apache.tika.parser.Parser): 1
HtmlEncodingDetector (org.apache.tika.parser.html.HtmlEncodingDetector): 1
HtmlParser (org.apache.tika.parser.html.HtmlParser): 1
Icu4jEncodingDetector (org.apache.tika.parser.txt.Icu4jEncodingDetector): 1