Example 1 with HtmlTextExtractUtil

Use of org.apache.stanbol.enhancer.engines.metaxa.core.html.HtmlTextExtractUtil in the Apache Stanbol project.

From the class SimpleMailExtractor, method extractTextFromHtml.

protected String extractTextFromHtml(String string, String charset, RDFContainer rdf) throws ExtractorException {
    // parse the HTML and extract full-text and metadata
    HtmlTextExtractUtil extractor;
    try {
        extractor = new HtmlTextExtractUtil();
    } catch (InitializationException e) {
        throw new ExtractorException("Could not initialize HtmlExtractor: " + e.getMessage());
    }
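    // NOTE: getBytes() encodes with the platform default charset, which may
    // differ from the charset argument passed to extract() below
    InputStream stream = new ByteArrayInputStream(string.getBytes());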
    RDFContainerFactory containerFactory = new RDFContainerFactoryImpl();
    URI id = rdf.getDescribedUri();
    RDFContainer result = containerFactory.getRDFContainer(id);
    extractor.extract(id, charset, stream, result);
    Model meta = result.getModel();
    // append metadata and full-text to a string buffer
    StringBuilder buffer = new StringBuilder(32 * 1024);
    append(buffer, extractor.getTitle(meta), "\n");
    append(buffer, extractor.getAuthor(meta), "\n");
    append(buffer, extractor.getDescription(meta), "\n");
    List<String> keywords = extractor.getKeywords(meta);
    for (String kw : keywords) {
        append(buffer, kw, " ");
    }
    buffer.append("\n");
    append(buffer, extractor.getText(meta), " ");
    logger.debug("text extracted:\n{}", buffer);
    meta.close();
    // return the buffer's content
    return buffer.toString();
}
Also used:
RDFContainer (org.semanticdesktop.aperture.rdf.RDFContainer)
ByteArrayInputStream (java.io.ByteArrayInputStream)
FileInputStream (java.io.FileInputStream)
InputStream (java.io.InputStream)
RDFContainerFactory (org.semanticdesktop.aperture.rdf.RDFContainerFactory)
InitializationException (org.apache.stanbol.enhancer.engines.metaxa.core.html.InitializationException)
URI (org.ontoware.rdf2go.model.node.URI)
HtmlTextExtractUtil (org.apache.stanbol.enhancer.engines.metaxa.core.html.HtmlTextExtractUtil)
ExtractorException (org.semanticdesktop.aperture.extractor.ExtractorException)
Model (org.ontoware.rdf2go.model.Model)
RDFContainerFactoryImpl (org.semanticdesktop.aperture.rdf.impl.RDFContainerFactoryImpl)
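
The method above calls an append(...) helper that is not shown, and it turns the HTML string into bytes without using the charset parameter. The following is a minimal, self-contained sketch of how both pieces might look. The class name ExtractHelpersSketch, the null/empty-skipping behavior of append, and the encode method are assumptions made for illustration; they are not taken from the Stanbol sources.

```java
import java.nio.charset.Charset;

// Hypothetical helper class, not part of Stanbol or Aperture.
final class ExtractHelpersSketch {

    // Assumed behavior of the append(...) helper: add the value followed by
    // the separator, and skip values that are null or empty.
    static void append(StringBuilder buffer, String value, String separator) {
        if (value != null && !value.isEmpty()) {
            buffer.append(value).append(separator);
        }
    }

    // Encode the text with the named charset so the bytes agree with the
    // charset later handed to the extractor. Falls back to the platform
    // default when no charset name is given. Charset.forName throws an
    // unchecked exception if the name is illegal or unsupported.
    static byte[] encode(String text, String charsetName) {
        Charset cs = (charsetName == null || charsetName.isEmpty())
                ? Charset.defaultCharset()
                : Charset.forName(charsetName);
        return text.getBytes(cs);
    }

    public static void main(String[] args) {
        StringBuilder buffer = new StringBuilder();
        append(buffer, "A title", "\n");
        append(buffer, null, "\n");   // skipped: null value
        append(buffer, "keyword", " ");
        System.out.print(buffer);
        System.out.println();
        // U+00E9 takes two bytes in UTF-8 and one byte in ISO-8859-1
        System.out.println(encode("\u00e9", "UTF-8").length);
        System.out.println(encode("\u00e9", "ISO-8859-1").length);
    }
}
```

With a helper like encode, the stream in the example could be built as new ByteArrayInputStream(encode(string, charset)), so that the bytes and the declared charset stay consistent.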

Aggregations

ByteArrayInputStream (java.io.ByteArrayInputStream): 1
FileInputStream (java.io.FileInputStream): 1
InputStream (java.io.InputStream): 1
HtmlTextExtractUtil (org.apache.stanbol.enhancer.engines.metaxa.core.html.HtmlTextExtractUtil): 1
InitializationException (org.apache.stanbol.enhancer.engines.metaxa.core.html.InitializationException): 1
Model (org.ontoware.rdf2go.model.Model): 1
URI (org.ontoware.rdf2go.model.node.URI): 1
ExtractorException (org.semanticdesktop.aperture.extractor.ExtractorException): 1
RDFContainer (org.semanticdesktop.aperture.rdf.RDFContainer): 1
RDFContainerFactory (org.semanticdesktop.aperture.rdf.RDFContainerFactory): 1
RDFContainerFactoryImpl (org.semanticdesktop.aperture.rdf.impl.RDFContainerFactoryImpl): 1