
Example 1 with ExtractorException

use of org.semanticdesktop.aperture.extractor.ExtractorException in project stanbol by apache.

the class HtmlTextExtractUtil method extract.

public void extract(URI id, String charset, InputStream input, RDFContainer result) throws ExtractorException {
    String encoding = charset;
    if (charset == null) {
        try {
            encoding = CharsetRecognizer.detect(input, "html", null);
        } catch (IOException e) {
            LOG.error("Charset detection problem: " + e.getMessage());
            throw new ExtractorException("Charset detection problem: " + e.getMessage());
        }
    }
    Document doc = htmlParser.getDOM(input, encoding);
    htmlExtractor.extract(id.toString(), doc, null, result);
}
Also used : ExtractorException(org.semanticdesktop.aperture.extractor.ExtractorException) IOException( Document(org.w3c.dom.Document)
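
The method above hands the same InputStream first to CharsetRecognizer.detect and then to htmlParser.getDOM. That is only safe if the bytes consumed during detection are returned to the stream; whether detect does this internally is not visible in the snippet. The following is a minimal sketch of the usual mark/reset pattern using only the JDK. The class PeekableStreams and its methods are hypothetical names for illustration, not part of Stanbol or Aperture.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration; not part of Stanbol or Aperture.
final class PeekableStreams {

    private PeekableStreams() {
    }

    /** Returns a stream that supports mark/reset, wrapping the input only when needed. */
    static InputStream ensureMarkSupported(InputStream input) {
        return input.markSupported() ? input : new BufferedInputStream(input);
    }

    /** Reads up to limit bytes for sniffing, then rewinds the stream to where it was. */
    static byte[] peek(InputStream markable, int limit) throws IOException {
        markable.mark(limit);
        try {
            byte[] buf = new byte[limit];
            int total = 0;
            while (total < limit) {
                int n = markable.read(buf, total, limit - total);
                if (n < 0) {
                    break; // end of stream before the limit was reached
                }
                total += n;
            }
            return Arrays.copyOf(buf, total);
        } finally {
            markable.reset();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] html = "<html><head><title>t</title></head></html>".getBytes(StandardCharsets.US_ASCII);
        InputStream in = ensureMarkSupported(new ByteArrayInputStream(html));
        byte[] head = peek(in, 16);
        System.out.println(new String(head, StandardCharsets.US_ASCII));
        // After peek, a parser reading from 'in' still starts at the first byte.
        System.out.println((char) in.read());
    }
}
```

The limit passed to mark must be at least as large as the number of bytes read before reset, otherwise the mark may be invalidated.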

Example 2 with ExtractorException

use of org.semanticdesktop.aperture.extractor.ExtractorException in project stanbol by apache.

the class SimpleMailExtractor method extractTextFromHtml.

protected String extractTextFromHtml(String string, String charset, RDFContainer rdf) throws ExtractorException {
    // parse the HTML and extract full-text and metadata
    HtmlTextExtractUtil extractor;
    try {
        extractor = new HtmlTextExtractUtil();
    } catch (InitializationException e) {
        throw new ExtractorException("Could not initialize HtmlExtractor: " + e.getMessage());
    }
    InputStream stream = new ByteArrayInputStream(string.getBytes());
    RDFContainerFactory containerFactory = new RDFContainerFactoryImpl();
    URI id = rdf.getDescribedUri();
    RDFContainer result = containerFactory.getRDFContainer(id);
    extractor.extract(id, charset, stream, result);
    Model meta = result.getModel();
    // append metadata and full-text to a string buffer
    StringBuilder buffer = new StringBuilder(32 * 1024);
    append(buffer, extractor.getTitle(meta), "\n");
    append(buffer, extractor.getAuthor(meta), "\n");
    append(buffer, extractor.getDescription(meta), "\n");
    List<String> keywords = extractor.getKeywords(meta);
    for (String kw : keywords) {
        append(buffer, kw, " ");
    }
    append(buffer, extractor.getText(meta), " ");
    logger.debug("text extracted:\n{}", buffer);
    // return the buffer's content
    return buffer.toString();
}
Also used : RDFContainer(org.semanticdesktop.aperture.rdf.RDFContainer) ByteArrayInputStream( FileInputStream( InputStream( RDFContainerFactory(org.semanticdesktop.aperture.rdf.RDFContainerFactory) InitializationException(org.apache.stanbol.enhancer.engines.metaxa.core.html.InitializationException) URI(org.ontoware.rdf2go.model.node.URI) ByteArrayInputStream( HtmlTextExtractUtil(org.apache.stanbol.enhancer.engines.metaxa.core.html.HtmlTextExtractUtil) ExtractorException(org.semanticdesktop.aperture.extractor.ExtractorException) Model(org.ontoware.rdf2go.model.Model) RDFContainerFactoryImpl(org.semanticdesktop.aperture.rdf.impl.RDFContainerFactoryImpl)
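
In the method above, string.getBytes() encodes the text with the platform default charset, while the charset argument is forwarded separately to extract. If the two differ, non-ASCII characters may be decoded incorrectly. The sketch below shows one way to keep them aligned using only the JDK. It assumes the caller's charset name is the encoding the bytes should carry and falls back to UTF-8 when it is null; CharsetAwareStreams and its methods are hypothetical names, not part of Stanbol.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of Stanbol or Aperture.
final class CharsetAwareStreams {

    private CharsetAwareStreams() {
    }

    /** Resolves a charset name, using UTF-8 when none is given (an assumption of this sketch). */
    static Charset resolve(String charsetName) {
        // Charset.forName throws an unchecked exception for illegal or unsupported names.
        return charsetName == null ? StandardCharsets.UTF_8 : Charset.forName(charsetName);
    }

    /** Encodes the text with the named charset instead of the platform default. */
    static byte[] encode(String text, String charsetName) {
        return text.getBytes(resolve(charsetName));
    }

    /** Wraps the encoded bytes in a stream that can be handed to a parser. */
    static InputStream toStream(String text, String charsetName) {
        return new ByteArrayInputStream(encode(text, charsetName));
    }

    public static void main(String[] args) {
        String text = "caf\u00e9";
        // The same string yields different byte sequences under different charsets.
        System.out.println(encode(text, "ISO-8859-1").length);
        System.out.println(encode(text, "UTF-8").length);
    }
}
```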

Example 3 with ExtractorException

use of org.semanticdesktop.aperture.extractor.ExtractorException in project stanbol by apache.

the class MP3FileExtractor method performExtraction.

protected void performExtraction(URI arg0, File arg1, Charset arg2, String arg3, RDFContainer result) throws ExtractorException {
    try {
        Mp3File mp3File = new Mp3File(arg1.toString());
        ID3v1 id3v1 = mp3File.getId3v1Tag();
        ID3v2 id3v2 = mp3File.getId3v2Tag();
        ID3Wrapper wrapper = new ID3Wrapper(id3v1, id3v2);
        addId3Fields(wrapper, result);
        result.add(RDF.type, NID3.ID3Audio);
    } catch (UnsupportedTagException e) {
        throw new ExtractorException(e);
    } catch (InvalidDataException e) {
        throw new ExtractorException(e);
    } catch (IOException e) {
        throw new ExtractorException(e);
    }
}
Also used : ID3v2(com.mpatric.mp3agic.ID3v2) Mp3File(com.mpatric.mp3agic.Mp3File) ID3v1(com.mpatric.mp3agic.ID3v1) ID3Wrapper(com.mpatric.mp3agic.ID3Wrapper) ExtractorException(org.semanticdesktop.aperture.extractor.ExtractorException) InvalidDataException(com.mpatric.mp3agic.InvalidDataException) UnsupportedTagException(com.mpatric.mp3agic.UnsupportedTagException) IOException(
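
The three catch blocks above are identical and each passes the caught exception to the wrapper, which keeps the original stack trace available. A minimal sketch of the same idea with a single multi-catch clause follows. ExtractionFailure and Step are hypothetical stand-ins defined here so the example compiles without Aperture or mp3agic; the exception types caught are JDK types chosen only for illustration.

```java
import java.io.IOException;

// Hypothetical types for illustration; not part of Stanbol, Aperture, or mp3agic.
final class CauseWrapping {

    private CauseWrapping() {
    }

    /** Stand-in for a checked, domain-specific extraction exception. */
    static class ExtractionFailure extends Exception {
        private static final long serialVersionUID = 1L;

        ExtractionFailure(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** A unit of extraction work that may fail with an I/O error. */
    interface Step {
        void run() throws IOException;
    }

    /** Runs the step and converts low-level failures into one domain exception. */
    static void extract(Step step) throws ExtractionFailure {
        try {
            step.run();
        } catch (IOException | IllegalStateException e) {
            // Passing 'e' as the cause preserves the original stack trace.
            throw new ExtractionFailure("Extraction failed: " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        try {
            extract(() -> {
                throw new IOException("file unreadable");
            });
        } catch (ExtractionFailure e) {
            System.out.println(e.getMessage());
            System.out.println(e.getCause().getClass().getName());
        }
    }
}
```

Wrapping with only e.getMessage(), as some of the other snippets on this page do, drops the cause; passing the exception itself is usually the safer default.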

Example 4 with ExtractorException

use of org.semanticdesktop.aperture.extractor.ExtractorException in project stanbol by apache.

the class IksHtmlExtractor method extract.

public void extract(URI id, InputStream input, Charset charset, String mimeType, RDFContainer result) throws ExtractorException {
    if (registry == null)
    String encoding;
    if (charset == null) {
        if (!input.markSupported()) {
            input = new BufferedInputStream(input);
        }
        try {
            encoding = CharsetRecognizer.detect(input, "html", "UTF-8");
        } catch (IOException e) {
            LOG.error("Charset detection problem: " + e.getMessage());
            throw new ExtractorException("Charset detection problem: " + e.getMessage());
        }
    } else {
        encoding =;
    }
    Document doc = htmlParser.getDOM(input, encoding);
    /*
     * This solves namespace problem but makes it difficult to handle normal
     * HTML and namespaced XHTML documents on a par. Rather avoid namespaces
     * in transformers for HTML elements! Problem remains that scripts then
     * cannot be tested offline Way out might be to use disjunctions in
     * scripts or ignore namespace by checking local-name() only
     * (match=*[local-name() = 'xxx']) Are Microformats, RDFa, ... only used
     * in XHTML? That would make the decision easier! Also have to solve the
     * problem how to connect/map SemanticDesktop ontologies with those from
     * the extractors String docText = DOMUtils.getStringFromDoc(doc,
     * "UTF-8", null);; doc = DOMUtils.parse(docText,
     * "UTF-8");
     */
    HashMap<String, HtmlExtractionComponent> extractors = registry.getRegistry();
    List<String> formats = new ArrayList<String>();
    long modelSize = result.getModel().size();
    for (String s : registry.getActiveExtractors()) {
        LOG.debug("Extractor: {}", s);
        HtmlExtractionComponent extractor = extractors.get(s);
        // formats used also in other formats
        if (extractor != null) {
            extractor.extract(id.toString(), doc, null, result);
            long tmpSize = result.getModel().size();
            if (modelSize < tmpSize) {
                LOG.debug("{} Statements added: {}", (tmpSize - modelSize), s);
                modelSize = tmpSize;
            }
        }
    }
}
Also used : BufferedInputStream( ExtractorException(org.semanticdesktop.aperture.extractor.ExtractorException) ArrayList(java.util.ArrayList) IOException( Document(org.w3c.dom.Document)
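
The comment in the method above suggests matching on local-name() so that one expression handles both plain HTML and namespaced XHTML. The sketch below illustrates that idea with the JDK's XPath API rather than the XSLT templates the comment refers to. LocalNameMatch and its methods are hypothetical names; the inputs are assumed to be well-formed XML.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of Stanbol or Aperture.
final class LocalNameMatch {

    private LocalNameMatch() {
    }

    /** Parses well-formed XML into a namespace-aware DOM. */
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    /** Builds an expression that ignores namespaces; localName must not contain quotes. */
    static String byLocalName(String localName) {
        return "//*[local-name() = '" + localName + "']";
    }

    /** Counts the nodes selected by an XPath 1.0 expression. */
    static int count(Document doc, String xpath) throws Exception {
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(xpath, doc, XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws Exception {
        Document html = parse("<html><body><p>a</p></body></html>");
        Document xhtml = parse(
                "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body><p>a</p><p>b</p></body></html>");
        // An unprefixed name test only matches elements in no namespace.
        System.out.println(count(html, "//p"));
        System.out.println(count(xhtml, "//p"));
        // The local-name() test matches in both documents.
        System.out.println(count(html, byLocalName("p")));
        System.out.println(count(xhtml, byLocalName("p")));
    }
}
```

The trade-off is that local-name() also matches same-named elements from unrelated namespaces, so it is best suited to documents where such collisions are unlikely.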

Example 5 with ExtractorException

use of org.semanticdesktop.aperture.extractor.ExtractorException in project stanbol by apache.

the class SimpleMailExtractor method extract.

public void extract(URI id, InputStream stream, Charset charset, String mimeType, RDFContainer result) throws ExtractorException {
    try {
        // parse the stream
        MimeMessage message = new MimeMessage(null, stream);
        result.add(RDF.type, NMO.Email);
        // extract the full-text
        StringBuilder buffer = new StringBuilder(10000);
        processMessage(message, buffer, result);
        String text = buffer.toString().trim();
        if (text.length() > 0) {
            result.add(NMO.plainTextMessageContent, text);
            result.add(NIE.plainTextContent, text);
        }
        // extract other metadata
        String title = message.getSubject();
        if (title != null) {
            title = title.trim();
            if (title.length() > 0) {
                result.add(NMO.messageSubject, title);
            }
        }
        try {
            copyAddress(message.getFrom(), NMO.from, result);
        } catch (AddressException e) {
            // ignore
        }
        copyAddress(getRecipients(message, RecipientType.TO),, result);
        copyAddress(getRecipients(message, RecipientType.CC),, result);
        copyAddress(getRecipients(message, RecipientType.BCC), NMO.bcc, result);
        MailUtil.getDates(message, result);
    } catch (MessagingException e) {
        throw new ExtractorException(e);
    } catch (IOException e) {
        throw new ExtractorException(e);
    }
}
Also used : MimeMessage(javax.mail.internet.MimeMessage) MessagingException(javax.mail.MessagingException) AddressException(javax.mail.internet.AddressException) ExtractorException(org.semanticdesktop.aperture.extractor.ExtractorException) IOException(

