Example 6 with AutoDetectReader

Use of org.apache.tika.detect.AutoDetectReader in the Apache Tika project.

Class TXTParser, method parse.

public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException {
    // Automatically detect the character encoding
    try (AutoDetectReader reader = new AutoDetectReader(new CloseShieldInputStream(stream), metadata, getEncodingDetector(context))) {
        // Try to get the detected content type; it could be a subtype of
        // text/plain, such as vcal.
        String incomingMime = metadata.get(Metadata.CONTENT_TYPE);
        MediaType mediaType = MediaType.TEXT_PLAIN;
        if (incomingMime != null) {
            MediaType tmpMediaType = MediaType.parse(incomingMime);
            if (tmpMediaType != null) {
                mediaType = tmpMediaType;
            }
        }
        Charset charset = reader.getCharset();
        MediaType type = new MediaType(mediaType, charset);
        metadata.set(Metadata.CONTENT_TYPE, type.toString());
        // deprecated, see TIKA-431
        metadata.set(Metadata.CONTENT_ENCODING, charset.name());
        XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
        xhtml.startDocument();
        xhtml.startElement("p");
        char[] buffer = new char[4096];
        int n = reader.read(buffer);
        while (n != -1) {
            xhtml.characters(buffer, 0, n);
            n = reader.read(buffer);
        }
        xhtml.endElement("p");
        xhtml.endDocument();
    }
}
Also used: AutoDetectReader (org.apache.tika.detect.AutoDetectReader), MediaType (org.apache.tika.mime.MediaType), Charset (java.nio.charset.Charset), XHTMLContentHandler (org.apache.tika.sax.XHTMLContentHandler), CloseShieldInputStream (org.apache.commons.io.input.CloseShieldInputStream)
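
The method above combines two patterns: it shields the caller's stream so that closing the reader does not close it, and it drains the reader in fixed-size chunks. The following is a minimal JDK-only sketch of both patterns, not Tika code. The names ReadLoopSketch, NoCloseInputStream, TrackingInputStream, and readAll are hypothetical, NoCloseInputStream is only a stand-in for what CloseShieldInputStream is used for here, and the charset is passed in explicitly rather than detected.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReadLoopSketch {

    /** Hypothetical stand-in for a close shield: close() never reaches the wrapped stream. */
    static final class NoCloseInputStream extends FilterInputStream {
        NoCloseInputStream(InputStream in) {
            super(in);
        }

        @Override
        public void close() {
            // Intentionally empty: the caller owns the underlying stream.
        }
    }

    /** Hypothetical helper that records whether close() was called on it. */
    static final class TrackingInputStream extends ByteArrayInputStream {
        boolean closed = false;

        TrackingInputStream(byte[] data) {
            super(data);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    /** Reads the whole stream as text with the given charset, leaving the stream open. */
    static String readAll(InputStream stream, Charset charset) throws IOException {
        StringBuilder out = new StringBuilder();
        // try-with-resources closes the reader, but the shield absorbs the close.
        try (Reader reader = new InputStreamReader(new NoCloseInputStream(stream), charset)) {
            char[] buffer = new char[4096];
            int n = reader.read(buffer);
            while (n != -1) {
                // Only the first n chars of the buffer are valid on each pass.
                out.append(buffer, 0, n);
                n = reader.read(buffer);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        TrackingInputStream in =
                new TrackingInputStream("hello, tika".getBytes(StandardCharsets.UTF_8));
        String text = readAll(in, StandardCharsets.UTF_8);
        System.out.println(text);
        System.out.println("caller's stream closed: " + in.closed);
    }
}
```

Passing the offset and length (0, n) on every iteration matters because the final read usually fills only part of the buffer.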

Aggregations

CloseShieldInputStream (org.apache.commons.io.input.CloseShieldInputStream) 6
AutoDetectReader (org.apache.tika.detect.AutoDetectReader) 6
Charset (java.nio.charset.Charset) 4
MediaType (org.apache.tika.mime.MediaType) 4
TikaConfig (org.apache.tika.config.TikaConfig) 3
CSVParser (org.apache.commons.csv.CSVParser) 2
CSVRecord (org.apache.commons.csv.CSVRecord) 2
TikaInputStream (org.apache.tika.io.TikaInputStream) 2
AbstractEncodingDetectorParser (org.apache.tika.parser.AbstractEncodingDetectorParser) 2
XHTMLContentHandler (org.apache.tika.sax.XHTMLContentHandler) 2
HTMLSchema (org.ccil.cowan.tagsoup.HTMLSchema) 2
Schema (org.ccil.cowan.tagsoup.Schema) 2
Renderer (com.uwyn.jhighlight.renderer.Renderer) 1
StringReader (java.io.StringReader) 1
InputSource (org.xml.sax.InputSource) 1