
Example 1 with BoilerpipeContentHandler

Use of org.apache.tika.parser.html.BoilerpipeContentHandler in the Apache Camel project.

Class TikaProducer, method getContentHandler.

private ContentHandler getContentHandler(TikaConfiguration configuration, OutputStream outputStream) throws TransformerConfigurationException, UnsupportedEncodingException {
    ContentHandler result = null;
    TikaParseOutputFormat outputFormat = configuration.getTikaParseOutputFormat();
    switch(outputFormat) {
        case xml:
            result = getTransformerHandler(outputStream, "xml", true);
            break;
        case text:
            result = new BodyContentHandler(new OutputStreamWriter(outputStream, this.encoding));
            break;
        case textMain:
            result = new BoilerpipeContentHandler(new OutputStreamWriter(outputStream, this.encoding));
            break;
        case html:
            result = new ExpandedTitleContentHandler(getTransformerHandler(outputStream, "html", true));
            break;
        default:
            throw new IllegalArgumentException(String.format("Unknown format %s", outputFormat));
    }
    return result;
}
Also used: BodyContentHandler (org.apache.tika.sax.BodyContentHandler), OutputStreamWriter (java.io.OutputStreamWriter), BoilerpipeContentHandler (org.apache.tika.parser.html.BoilerpipeContentHandler), ContentHandler (org.xml.sax.ContentHandler), ExpandedTitleContentHandler (org.apache.tika.sax.ExpandedTitleContentHandler)
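
The getTransformerHandler helper called in the xml and html branches is not part of the excerpt. The following is a minimal sketch of what such a helper might look like, using only the JDK's SAX transformer API. The class name TransformerHandlerSketch, the render method, and the explicit encoding parameter are assumptions made for illustration, not Camel's actual code.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UnsupportedEncodingException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

public class TransformerHandlerSketch {

    // Hypothetical stand-in for the helper that the excerpt calls but does not show.
    // The encoding is passed in explicitly here; the excerpt reads it from a field.
    static TransformerHandler getTransformerHandler(OutputStream out, String method,
            boolean prettyPrint, String encoding)
            throws TransformerConfigurationException, UnsupportedEncodingException {
        SAXTransformerFactory factory = (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, method);
        handler.getTransformer().setOutputProperty(OutputKeys.INDENT, prettyPrint ? "yes" : "no");
        handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, encoding);
        // SAX events received by the handler are serialized to the output stream.
        handler.setResult(new StreamResult(new OutputStreamWriter(out, encoding)));
        return handler;
    }

    // Feeds a tiny hand-written SAX event stream through the handler and returns the result.
    static String render(String method) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        TransformerHandler handler = getTransformerHandler(out, method, false, "UTF-8");
        handler.startDocument();
        handler.startElement("", "p", "p", new AttributesImpl());
        char[] text = "hello".toCharArray();
        handler.characters(text, 0, text.length);
        handler.endElement("", "p", "p");
        handler.endDocument();
        return out.toString("UTF-8");
    }

    public static void main(String[] args) throws Exception {
        System.out.println(render("xml"));
    }
}
```

Because a TransformerHandler is itself a ContentHandler, it can be handed to a parser directly or wrapped by another handler, as the html branch does with ExpandedTitleContentHandler.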

Example 2 with BoilerpipeContentHandler

Use of org.apache.tika.parser.html.BoilerpipeContentHandler in the OpenSextant Xponents project.

Class TikaHTMLConverter, method conversionImplementation.

/**
 * A bare-bones HTML parser.
 *
 * <pre>
 * TODO: mis-encoded HTML entities are not decoded properly.
 * For example, "&#8211;" (the 82xx range covers dashes and quotes)
 * does not decode correctly unless the page encoding is declared as UTF-8.
 * </pre>
 */
@Override
protected ConvertedDocument conversionImplementation(InputStream input, File doc) throws IOException {
    Metadata metadata = new Metadata();
    HashMap<String, String> moreMetadata = new HashMap<>();
    // The HTML conversion here does not reset its internal buffers.
    // It just accumulates text and errors out when it reaches the maximum (maxHTMLDocumentSize).
    ContentHandler handler = new BodyContentHandler(maxHTMLDocumentSize);
    BoilerpipeContentHandler scrubbingHandler = null;
    if (scrubHTMLArticle) {
        scrubbingHandler = new BoilerpipeContentHandler(handler);
    }
    try {
        parser.parse(input, (scrubHTMLArticle ? scrubbingHandler : handler), metadata, new ParseContext());
        if (doc != null) {
            parseHTMLMetadata(doc, moreMetadata);
        }
    } catch (Exception xerr) {
        throw new IOException("Unable to parse content", xerr);
    } finally {
        input.close();
    }
    ConvertedDocument textdoc = new ConvertedDocument(doc);
    textdoc.addTitle(metadata.get(TikaCoreProperties.TITLE));
    String text = null;
    if (scrubHTMLArticle) {
        text = scrubbingHandler.getTextDocument().getText(true, false);
    } else {
        text = handler.toString();
    }
    textdoc.setText(TextUtils.reduce_line_breaks(text));
    // -- Improve CHAR SET encoding answer.
    byte[] data = textdoc.buffer.getBytes();
    if (TextUtils.isASCII(data)) {
        textdoc.setEncoding("ASCII");
    } else {
        // Okay, okay... let Tika name whatever encoding it found or guessed
        // at.
        textdoc.setEncoding(metadata.get(Metadata.CONTENT_ENCODING));
    }
    // Indicate if we tried to filter the article at all.
    //
    textdoc.addProperty("filtered", scrubHTMLArticle);
    textdoc.addProperty("converter", TikaHTMLConverter.class.getName());
    if (!moreMetadata.isEmpty()) {
        for (String k : moreMetadata.keySet()) {
            textdoc.addUserProperty(k, moreMetadata.get(k));
        }
    }
    return textdoc;
}
Also used: BodyContentHandler (org.apache.tika.sax.BodyContentHandler), HashMap (java.util.HashMap), Metadata (org.apache.tika.metadata.Metadata), IOException (java.io.IOException), BoilerpipeContentHandler (org.apache.tika.parser.html.BoilerpipeContentHandler), ContentHandler (org.xml.sax.ContentHandler), ParseContext (org.apache.tika.parser.ParseContext), ConvertedDocument (org.opensextant.xtext.ConvertedDocument)
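
The code comments in this example describe a handler that accumulates text and errors out once a size limit is reached. The general pattern can be sketched with a plain SAX DefaultHandler from the JDK. LimitedTextHandlerSketch and its extract method are hypothetical names; this is an illustration of the idea, not Tika's BodyContentHandler implementation.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class LimitedTextHandlerSketch extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    private final int maxChars;

    public LimitedTextHandlerSketch(int maxChars) {
        this.maxChars = maxChars;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        int room = maxChars - buffer.length();
        if (length > room) {
            // Keep the part that still fits, then abort the parse.
            buffer.append(ch, start, room);
            throw new SAXException("Text limit of " + maxChars + " characters reached");
        }
        buffer.append(ch, start, length);
    }

    @Override
    public String toString() {
        return buffer.toString();
    }

    // Parses a small XML string and returns the text collected up to the limit.
    static String extract(String xml, int maxChars) throws Exception {
        LimitedTextHandlerSketch handler = new LimitedTextHandlerSketch(maxChars);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException stopped) {
            // Either the limit was hit or the input was malformed;
            // this sketch keeps whatever text was collected so far.
        }
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extract("<a>hello world</a>", 5));
    }
}
```

The sketch swallows the exception and returns the truncated text. The excerpt above instead wraps any parse exception in an IOException, so there an oversized document would fail the conversion rather than yield partial text.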

Example 3 with BoilerpipeContentHandler

Use of org.apache.tika.parser.html.BoilerpipeContentHandler in the Apache Tika project.

Class TikaResource, method produceTextMain.

public StreamingOutput produceTextMain(final InputStream is, @Context MultivaluedMap<String, String> httpHeaders, @Context final UriInfo info) {
    final Parser parser = createParser();
    final Metadata metadata = new Metadata();
    final ParseContext context = new ParseContext();
    fillMetadata(parser, metadata, context, httpHeaders);
    fillParseContext(context, httpHeaders, parser);
    logRequest(LOG, info, metadata);
    return new StreamingOutput() {

        public void write(OutputStream outputStream) throws IOException, WebApplicationException {
            Writer writer = new OutputStreamWriter(outputStream, UTF_8);
            ContentHandler handler = new BoilerpipeContentHandler(writer);
            parse(parser, LOG, info.getPath(), is, handler, metadata, context);
        }
    };
}
Also used: OutputStream (java.io.OutputStream), Metadata (org.apache.tika.metadata.Metadata), ParseContext (org.apache.tika.parser.ParseContext), StreamingOutput (javax.ws.rs.core.StreamingOutput), OutputStreamWriter (java.io.OutputStreamWriter), Writer (java.io.Writer), BoilerpipeContentHandler (org.apache.tika.parser.html.BoilerpipeContentHandler), ExpandedTitleContentHandler (org.apache.tika.sax.ExpandedTitleContentHandler), BodyContentHandler (org.apache.tika.sax.BodyContentHandler), ContentHandler (org.xml.sax.ContentHandler), RichTextContentHandler (org.apache.tika.sax.RichTextContentHandler), Parser (org.apache.tika.parser.Parser), HtmlParser (org.apache.tika.parser.html.HtmlParser), AutoDetectParser (org.apache.tika.parser.AutoDetectParser), DigestingParser (org.apache.tika.parser.DigestingParser)
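
In this example the handler is constructed around a Writer that encodes to the response stream as UTF-8. The write-through part of that arrangement can be sketched with JDK classes alone. WriterHandlerSketch and toUtf8Bytes are hypothetical names, and the sketch performs no boilerplate removal, which the real BoilerpipeContentHandler applies before any text is written.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.StringReader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class WriterHandlerSketch extends DefaultHandler {
    private final Writer writer;

    public WriterHandlerSketch(Writer writer) {
        this.writer = writer;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        try {
            // Forward character data straight to the writer.
            writer.write(ch, start, length);
        } catch (IOException e) {
            throw new SAXException("Error writing text", e);
        }
    }

    @Override
    public void endDocument() throws SAXException {
        try {
            // Without a flush, encoded bytes may stay buffered in the OutputStreamWriter.
            writer.flush();
        } catch (IOException e) {
            throw new SAXException("Error flushing text", e);
        }
    }

    // Parses a small XML string and returns the text content encoded as UTF-8 bytes.
    static byte[] toUtf8Bytes(String xml) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new WriterHandlerSketch(writer));
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toUtf8Bytes("<p>plain text</p>").length);
    }
}
```

Naming the charset explicitly, as the excerpt does with UTF_8, keeps the output independent of the platform default encoding.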

Aggregations

BoilerpipeContentHandler (org.apache.tika.parser.html.BoilerpipeContentHandler): 3
BodyContentHandler (org.apache.tika.sax.BodyContentHandler): 3
ContentHandler (org.xml.sax.ContentHandler): 3
OutputStreamWriter (java.io.OutputStreamWriter): 2
Metadata (org.apache.tika.metadata.Metadata): 2
ParseContext (org.apache.tika.parser.ParseContext): 2
ExpandedTitleContentHandler (org.apache.tika.sax.ExpandedTitleContentHandler): 2
IOException (java.io.IOException): 1
OutputStream (java.io.OutputStream): 1
Writer (java.io.Writer): 1
HashMap (java.util.HashMap): 1
StreamingOutput (javax.ws.rs.core.StreamingOutput): 1
AutoDetectParser (org.apache.tika.parser.AutoDetectParser): 1
DigestingParser (org.apache.tika.parser.DigestingParser): 1
Parser (org.apache.tika.parser.Parser): 1
HtmlParser (org.apache.tika.parser.html.HtmlParser): 1
RichTextContentHandler (org.apache.tika.sax.RichTextContentHandler): 1
ConvertedDocument (org.opensextant.xtext.ConvertedDocument): 1