Search in sources :

Example 6 with ExtractorException

use of org.semanticdesktop.aperture.extractor.ExtractorException in project stanbol by apache.

the class MetaxaEngine method computeEnhancements.

public void computeEnhancements(ContentItem ci) throws EngineException {
    // get model from the extraction
    URIImpl docId;
    Model m = null;
    ci.getLock().readLock().lock();
    try {
        docId = new URIImpl(ci.getUri().getUnicodeString());
        m = this.extractor.extract(ci.getStream(), docId, ci.getMimeType());
    } catch (ExtractorException e) {
        throw new EngineException("Error while processing ContentItem " + ci.getUri() + " with Metaxa", e);
    } catch (IOException e) {
        throw new EngineException("Error while processing ContentItem " + ci.getUri() + " with Metaxa", e);
    } finally {
        ci.getLock().readLock().unlock();
    }
    // the extracted plain text from the model
    if (null == m) {
        log.debug("Unable to preocess ContentItem {} (mime type {}) with Metaxa", ci.getUri(), ci.getMimeType());
        return;
    }
    ContentSink plainTextSink;
    try {
        plainTextSink = ciFactory.createContentSink("text/plain");
    } catch (IOException e) {
        m.close();
        throw new EngineException("Unable to initialise Blob for storing" + "the plain text content", e);
    }
    HashMap<BlankNode, BlankNode> blankNodeMap = new HashMap<BlankNode, BlankNode>();
    RDF2GoUtils.urifyBlankNodes(m);
    ClosableIterator<Statement> it = m.iterator();
    BufferedWriter out = new BufferedWriter(new OutputStreamWriter(plainTextSink.getOutputStream(), UTF8));
    // used to detect if some text was extracted
    boolean textExtracted = false;
    try {
        // first add to a temporary graph
        Graph g = new SimpleGraph();
        while (it.hasNext()) {
            Statement oneStmt = it.next();
            // the plain text Blob!
            if (oneStmt.getSubject().equals(docId) && oneStmt.getPredicate().equals(NIE_PLAINTEXT_PROPERTY)) {
                String text = oneStmt.getObject().toString();
                if (text != null && !text.isEmpty()) {
                    try {
                        out.write(oneStmt.getObject().toString());
                    } catch (IOException e) {
                        throw new EngineException("Unable to write extracted" + "plain text to Blob (blob impl: " + plainTextSink.getBlob().getClass() + ")", e);
                    }
                    textExtracted = true;
                    if (includeText) {
                        BlankNodeOrIRI subject = (BlankNodeOrIRI) asClerezzaResource(oneStmt.getSubject(), blankNodeMap);
                        IRI predicate = (IRI) asClerezzaResource(oneStmt.getPredicate(), blankNodeMap);
                        RDFTerm object = asClerezzaResource(oneStmt.getObject(), blankNodeMap);
                        g.add(new TripleImpl(subject, predicate, object));
                    }
                }
            } else {
                // add metadata to the metadata of the contentItem
                BlankNodeOrIRI subject = (BlankNodeOrIRI) asClerezzaResource(oneStmt.getSubject(), blankNodeMap);
                IRI predicate = (IRI) asClerezzaResource(oneStmt.getPredicate(), blankNodeMap);
                RDFTerm object = asClerezzaResource(oneStmt.getObject(), blankNodeMap);
                if (null != subject && null != predicate && null != object) {
                    Triple t = new TripleImpl(subject, predicate, object);
                    g.add(t);
                    log.debug("added " + t.toString());
                }
            }
        }
        // add the extracted triples to the metadata of the ContentItem
        ci.getLock().writeLock().lock();
        try {
            ci.getMetadata().addAll(g);
            g = null;
        } finally {
            ci.getLock().writeLock().unlock();
        }
    } finally {
        it.close();
        m.close();
        IOUtils.closeQuietly(out);
    }
    if (textExtracted) {
        // add plain text to the content item
        IRI blobUri = new IRI("urn:metaxa:plain-text:" + randomUUID());
        ci.addPart(blobUri, plainTextSink.getBlob());
    }
}
Also used : IRI(org.apache.clerezza.commons.rdf.IRI) BlankNodeOrIRI(org.apache.clerezza.commons.rdf.BlankNodeOrIRI) HashMap(java.util.HashMap) Statement(org.ontoware.rdf2go.model.Statement) EngineException(org.apache.stanbol.enhancer.servicesapi.EngineException) BlankNode(org.apache.clerezza.commons.rdf.BlankNode) BlankNode(org.ontoware.rdf2go.model.node.BlankNode) BlankNodeOrIRI(org.apache.clerezza.commons.rdf.BlankNodeOrIRI) URIImpl(org.ontoware.rdf2go.model.node.impl.URIImpl) RDFTerm(org.apache.clerezza.commons.rdf.RDFTerm) IOException(java.io.IOException) BufferedWriter(java.io.BufferedWriter) Triple(org.apache.clerezza.commons.rdf.Triple) SimpleGraph(org.apache.clerezza.commons.rdf.impl.utils.simple.SimpleGraph) Graph(org.apache.clerezza.commons.rdf.Graph) Model(org.ontoware.rdf2go.model.Model) ExtractorException(org.semanticdesktop.aperture.extractor.ExtractorException) SimpleGraph(org.apache.clerezza.commons.rdf.impl.utils.simple.SimpleGraph) OutputStreamWriter(java.io.OutputStreamWriter) ContentSink(org.apache.stanbol.enhancer.servicesapi.ContentSink) TripleImpl(org.apache.clerezza.commons.rdf.impl.utils.TripleImpl)

Example 7 with ExtractorException

use of org.semanticdesktop.aperture.extractor.ExtractorException in project stanbol by apache.

the class XsltExtractor method extract.

public synchronized void extract(String id, Document doc, Map<String, Object> params, RDFContainer result) throws ExtractorException {
    if (params == null) {
        params = new HashMap<String, Object>();
    }
    params.put(this.uriParameter, id);
    initTransformerParameters(params);
    Source source = new DOMSource(doc);
    StringWriter writer = new StringWriter();
    StreamResult output = new StreamResult(writer);
    try {
        this.transformer.transform(source, output);
        String rdf = writer.toString();
        LOG.debug(rdf);
        StringReader reader = new StringReader(rdf);
        result.getModel().readFrom(reader, this.syntax);
        reader.close();
    } catch (TransformerException e) {
        throw new ExtractorException(e.getMessage(), e);
    } catch (ModelRuntimeException e) {
        throw new ExtractorException(e.getMessage(), e);
    } catch (IOException e) {
        throw new ExtractorException(e.getMessage(), e);
    }
}
Also used : DOMSource(javax.xml.transform.dom.DOMSource) StringWriter(java.io.StringWriter) StreamResult(javax.xml.transform.stream.StreamResult) StringReader(java.io.StringReader) ExtractorException(org.semanticdesktop.aperture.extractor.ExtractorException) ModelRuntimeException(org.ontoware.rdf2go.exception.ModelRuntimeException) IOException(java.io.IOException) DOMSource(javax.xml.transform.dom.DOMSource) StreamSource(javax.xml.transform.stream.StreamSource) Source(javax.xml.transform.Source) TransformerException(javax.xml.transform.TransformerException)

Aggregations

ExtractorException (org.semanticdesktop.aperture.extractor.ExtractorException)7 IOException (java.io.IOException)6 Model (org.ontoware.rdf2go.model.Model)2 Document (org.w3c.dom.Document)2 ID3Wrapper (com.mpatric.mp3agic.ID3Wrapper)1 ID3v1 (com.mpatric.mp3agic.ID3v1)1 ID3v2 (com.mpatric.mp3agic.ID3v2)1 InvalidDataException (com.mpatric.mp3agic.InvalidDataException)1 Mp3File (com.mpatric.mp3agic.Mp3File)1 UnsupportedTagException (com.mpatric.mp3agic.UnsupportedTagException)1 BufferedInputStream (java.io.BufferedInputStream)1 BufferedWriter (java.io.BufferedWriter)1 ByteArrayInputStream (java.io.ByteArrayInputStream)1 FileInputStream (java.io.FileInputStream)1 InputStream (java.io.InputStream)1 OutputStreamWriter (java.io.OutputStreamWriter)1 StringReader (java.io.StringReader)1 StringWriter (java.io.StringWriter)1 ArrayList (java.util.ArrayList)1 HashMap (java.util.HashMap)1