Examples with HtmlExtractor - org.apache.stanbol.enhancer.engines.htmlextractor.impl.HtmlExtractor

Example 1 with HtmlExtractor

use of org.apache.stanbol.enhancer.engines.htmlextractor.impl.HtmlExtractor in project stanbol by apache.

the class TestHtmlExtractor method testMicrodataExtraction.

/** This tests the extraction of microdata from an HTML5 document
     * 
     * @throws Exception
     */
@Test
public void testMicrodataExtraction() throws Exception {
    HtmlExtractor extractor = new HtmlExtractor(registry, parser);
    Graph model = new SimpleGraph();
    String testFile = "test-microdata.html";
    // extract triples from microdata-annotated HTML
    InputStream in = getResourceAsStream(testFile);
    assertNotNull("failed to load resource " + testFile, in);
    extractor.extract("file://" + testFile, in, null, "text/html", model);
    // show triples
    int tripleCounter = model.size();
    LOG.debug("Microdata triples: {}", tripleCounter);
    printTriples(model);
    assertEquals(91, tripleCounter);
    ClerezzaRDFUtils.makeConnected(model, new IRI("file://" + testFile), new IRI(NIE_NS + "contains"));
}

Also used : IRI(org.apache.clerezza.commons.rdf.IRI) BlankNodeOrIRI(org.apache.clerezza.commons.rdf.BlankNodeOrIRI) SimpleGraph(org.apache.clerezza.commons.rdf.impl.utils.simple.SimpleGraph) Graph(org.apache.clerezza.commons.rdf.Graph) InputStream(java.io.InputStream) HtmlExtractor(org.apache.stanbol.enhancer.engines.htmlextractor.impl.HtmlExtractor) Test(org.junit.Test)

Example 2 with HtmlExtractor

use of org.apache.stanbol.enhancer.engines.htmlextractor.impl.HtmlExtractor in project stanbol by apache.

the class TestHtmlExtractor method testMFExtraction.

/** This tests some Microformat extraction
     * 
     * @throws ExtractorException if there is an error during extraction
     * @throws IOException if there is an error when reading the document
     */
@Test
public void testMFExtraction() throws Exception {
    HtmlExtractor extractor = new HtmlExtractor(registry, parser);
    Graph model = new SimpleGraph();
    String testFile = "test-MF.html";
    // extract triples from Microformat-annotated HTML
    InputStream in = getResourceAsStream(testFile);
    assertNotNull("failed to load resource " + testFile, in);
    extractor.extract("file://" + testFile, in, null, "text/html", model);
    // show triples
    int tripleCounter = model.size();
    LOG.debug("Microformat triples: {}", tripleCounter);
    printTriples(model);
    assertEquals(127, tripleCounter);
    ClerezzaRDFUtils.makeConnected(model, new IRI("file://" + testFile), new IRI(NIE_NS + "contains"));
}

Example 3 with HtmlExtractor

use of org.apache.stanbol.enhancer.engines.htmlextractor.impl.HtmlExtractor in project stanbol by apache.

the class TestHtmlExtractor method testRootExtraction.

/** This tests the merging of disconnected graphs under a single root
     * 
     * @throws Exception
     */
@Test
public void testRootExtraction() throws Exception {
    HtmlExtractor extractor = new HtmlExtractor(registry, parser);
    Graph model = new SimpleGraph();
    String testFile = "test-MultiRoot.html";
    // extract triples from the multi-root test document
    InputStream in = getResourceAsStream(testFile);
    assertNotNull("failed to load resource " + testFile, in);
    extractor.extract("file://" + testFile, in, null, "text/html", model);
    // show triples
    int tripleCounter = model.size();
    LOG.debug("Triples: {}", tripleCounter);
    printTriples(model);
    Set<BlankNodeOrIRI> roots = ClerezzaRDFUtils.findRoots(model);
    assertTrue(roots.size() > 1);
    ClerezzaRDFUtils.makeConnected(model, new IRI("file://" + testFile), new IRI(NIE_NS + "contains"));
    roots = ClerezzaRDFUtils.findRoots(model);
    assertEquals(1, roots.size());
}

Also used : IRI(org.apache.clerezza.commons.rdf.IRI) BlankNodeOrIRI(org.apache.clerezza.commons.rdf.BlankNodeOrIRI) SimpleGraph(org.apache.clerezza.commons.rdf.impl.utils.simple.SimpleGraph) Graph(org.apache.clerezza.commons.rdf.Graph) InputStream(java.io.InputStream) HtmlExtractor(org.apache.stanbol.enhancer.engines.htmlextractor.impl.HtmlExtractor) Test(org.junit.Test)
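
The idea behind Example 3 (several disconnected roots merged under one document node) can be illustrated without any Stanbol or Clerezza classes. The sketch below is a toy model only: it assumes a root is a subject that never appears as an object, and the class SingleRootSketch, its Edge type, and its two methods are hypothetical names that merely echo findRoots and makeConnected. It is not how ClerezzaRDFUtils is implemented.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Toy model of single-rooting a graph; NOT the Stanbol/Clerezza implementation. */
class SingleRootSketch {

    /** A subject-predicate-object statement over plain string node names (hypothetical type). */
    static final class Edge {
        final String subject;
        final String predicate;
        final String object;

        Edge(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    /** Assumption: a root is a subject that never occurs as the object of any edge. */
    static Set<String> findRoots(List<Edge> graph) {
        Set<String> roots = new LinkedHashSet<>();
        for (Edge e : graph) {
            roots.add(e.subject);
        }
        for (Edge e : graph) {
            roots.remove(e.object);
        }
        return roots;
    }

    /** Adds one edge from newRoot to every other root, so only newRoot remains a root. */
    static void makeConnected(List<Edge> graph, String newRoot, String predicate) {
        Set<String> roots = findRoots(graph);
        roots.remove(newRoot);
        for (String root : roots) {
            graph.add(new Edge(newRoot, predicate, root));
        }
    }

    public static void main(String[] args) {
        List<Edge> graph = new ArrayList<>();
        graph.add(new Edge("_:a", "ex:name", "Alice"));
        graph.add(new Edge("_:b", "ex:name", "Bob"));

        // Two disconnected subjects, hence two roots
        System.out.println(findRoots(graph).size());

        makeConnected(graph, "file://test-MultiRoot.html", "nie:contains");

        // Only the document node is left as a root
        System.out.println(findRoots(graph));
    }
}
```

As in the test, the number of roots is greater than one before the call and exactly one afterwards. A graph made up only of cycles has no root under this definition, a case this sketch does not handle.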

Example 4 with HtmlExtractor

use of org.apache.stanbol.enhancer.engines.htmlextractor.impl.HtmlExtractor in project stanbol by apache.

the class TestHtmlExtractor method testRdfaExtraction.

/**
     * This tests the RDFa extraction.
     *
     * @throws ExtractorException if there is an error during extraction
     * @throws IOException if there is an error when reading the document
     */
@Test
public void testRdfaExtraction() throws Exception {
    HtmlExtractor extractor = new HtmlExtractor(registry, parser);
    Graph model = new SimpleGraph();
    String testFile = "test-rdfa.html";
    // extract triples from RDFa-annotated HTML
    InputStream in = getResourceAsStream(testFile);
    assertNotNull("failed to load resource " + testFile, in);
    extractor.extract("file://" + testFile, in, null, "text/html", model);
    // show triples
    int tripleCounter = model.size();
    LOG.debug("RDFa triples: {}", tripleCounter);
    printTriples(model);
    assertEquals(8, tripleCounter);
    ClerezzaRDFUtils.makeConnected(model, new IRI("file://" + testFile), new IRI(NIE_NS + "contains"));
}

Example 5 with HtmlExtractor

use of org.apache.stanbol.enhancer.engines.htmlextractor.impl.HtmlExtractor in project stanbol by apache.

the class HtmlExtractorEngine method computeEnhancements.

@Override
public void computeEnhancements(ContentItem ci) throws EngineException {
    HtmlExtractor extractor = new HtmlExtractor(htmlExtractorRegistry, htmlParser);
    Graph model = new SimpleGraph();
    ci.getLock().readLock().lock();
    try {
        extractor.extract(ci.getUri().getUnicodeString(), ci.getStream(), null, ci.getMimeType(), model);
    } catch (ExtractorException e) {
        throw new EngineException("Error while processing ContentItem " + ci.getUri() + " with HtmlExtractor", e);
    } finally {
        ci.getLock().readLock().unlock();
    }
    ClerezzaRDFUtils.urifyBlankNodes(model);
    // make the model single rooted
    if (singleRootRdf) {
        ClerezzaRDFUtils.makeConnected(model, ci.getUri(), new IRI(NIE_NS + "contains"));
    }
    // add the extracted triples to the metadata of the ContentItem
    ci.getLock().writeLock().lock();
    try {
        LOG.info("Model: {}", model);
        ci.getMetadata().addAll(model);
        model = null;
    } finally {
        ci.getLock().writeLock().unlock();
    }
}

Also used : IRI(org.apache.clerezza.commons.rdf.IRI) SimpleGraph(org.apache.clerezza.commons.rdf.impl.utils.simple.SimpleGraph) Graph(org.apache.clerezza.commons.rdf.Graph) ExtractorException(org.apache.stanbol.enhancer.engines.htmlextractor.impl.ExtractorException) EngineException(org.apache.stanbol.enhancer.servicesapi.EngineException) HtmlExtractor(org.apache.stanbol.enhancer.engines.htmlextractor.impl.HtmlExtractor)

Aggregations

Graph (org.apache.clerezza.commons.rdf.Graph) 5
IRI (org.apache.clerezza.commons.rdf.IRI) 5
SimpleGraph (org.apache.clerezza.commons.rdf.impl.utils.simple.SimpleGraph) 5
HtmlExtractor (org.apache.stanbol.enhancer.engines.htmlextractor.impl.HtmlExtractor) 5
InputStream (java.io.InputStream) 4
BlankNodeOrIRI (org.apache.clerezza.commons.rdf.BlankNodeOrIRI) 4
Test (org.junit.Test) 4
ExtractorException (org.apache.stanbol.enhancer.engines.htmlextractor.impl.ExtractorException) 1
EngineException (org.apache.stanbol.enhancer.servicesapi.EngineException) 1