Example 1 with Extractor

Use of org.codelibs.fess.crawler.extractor.Extractor in project fess-crawler by codelibs.

Class TextTransformer, method transform.

/*
 * (non-Javadoc)
 *
 * @see org.codelibs.fess.crawler.transformer.impl.AbstractTransformer#transform(org.codelibs.fess.crawler.entity.ResponseData)
 */
@Override
public ResultData transform(final ResponseData responseData) {
    if (responseData == null || !responseData.hasResponseBody()) {
        throw new CrawlingAccessException("No response body.");
    }
    final ExtractorFactory extractorFactory = crawlerContainer.getComponent("extractorFactory");
    if (extractorFactory == null) {
        throw new CrawlerSystemException("Could not find extractorFactory.");
    }
    final Extractor extractor = extractorFactory.getExtractor(responseData.getMimeType());
    final Map<String, String> params = new HashMap<>();
    params.put(TikaMetadataKeys.RESOURCE_NAME_KEY, getResourceName(responseData));
    params.put(HttpHeaders.CONTENT_TYPE, responseData.getMimeType());
    String content = null;
    try (final InputStream in = responseData.getResponseBody()) {
        content = extractor.getText(in, params).getContent();
    } catch (final Exception e) {
        throw new CrawlingAccessException("Could not extract data.", e);
    }
    final ResultData resultData = new ResultData();
    resultData.setTransformerName(getName());
    try {
        resultData.setData(content.getBytes(charsetName));
    } catch (final UnsupportedEncodingException e) {
        if (logger.isInfoEnabled()) {
            logger.info("Invalid charsetName: " + charsetName + ". Changed to " + Constants.UTF_8, e);
        }
        charsetName = Constants.UTF_8_CHARSET.name();
        resultData.setData(content.getBytes(Constants.UTF_8_CHARSET));
    }
    resultData.setEncoding(charsetName);
    return resultData;
}
Also used : ResultData(org.codelibs.fess.crawler.entity.ResultData) AccessResultData(org.codelibs.fess.crawler.entity.AccessResultData) CrawlingAccessException(org.codelibs.fess.crawler.exception.CrawlingAccessException) HashMap(java.util.HashMap) ExtractorFactory(org.codelibs.fess.crawler.extractor.ExtractorFactory) InputStream(java.io.InputStream) CrawlerSystemException(org.codelibs.fess.crawler.exception.CrawlerSystemException) UnsupportedEncodingException(java.io.UnsupportedEncodingException) Extractor(org.codelibs.fess.crawler.extractor.Extractor)
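
The charset fallback at the end of transform can be tried in isolation. The following is a minimal sketch, not fess-crawler code: CharsetFallbackSketch and encode are hypothetical names, and StandardCharsets.UTF_8 stands in for Constants.UTF_8_CHARSET.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of fess-crawler.
final class CharsetFallbackSketch {

    private CharsetFallbackSketch() {
    }

    // Encodes content with the named charset and falls back to UTF-8
    // when the JVM does not know that charset name.
    static byte[] encode(final String content, final String charsetName) {
        try {
            return content.getBytes(charsetName);
        } catch (final UnsupportedEncodingException e) {
            // Same recovery idea as in transform: use UTF-8 instead of failing.
            return content.getBytes(StandardCharsets.UTF_8);
        }
    }
}
```

Calling CharsetFallbackSketch.encode(text, "no-such-charset") returns the UTF-8 bytes of text instead of propagating the exception.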

Example 2 with Extractor

Use of org.codelibs.fess.crawler.extractor.Extractor in project fess-crawler by codelibs.

Class EmlExtractor, method appendAttachment.

protected void appendAttachment(final StringBuilder buf, final BodyPart bodyPart) {
    final MimeTypeHelper mimeTypeHelper = getMimeTypeHelper();
    final ExtractorFactory extractorFactory = getExtractorFactory();
    try {
        final String filename = getDecodeText(bodyPart.getFileName());
        final String mimeType = mimeTypeHelper.getContentType(null, filename);
        if (mimeType != null) {
            final Extractor extractor = extractorFactory.getExtractor(mimeType);
            if (extractor != null) {
                try (final InputStream in = bodyPart.getInputStream()) {
                    final Map<String, String> map = new HashMap<>();
                    map.put(TikaMetadataKeys.RESOURCE_NAME_KEY, filename);
                    final String content = extractor.getText(in, map).getContent();
                    buf.append(content).append(' ');
                } catch (final Exception e) {
                    if (logger.isDebugEnabled()) {
                        logger.debug("Exception in an internal extractor.", e);
                    }
                }
            }
        }
    } catch (MessagingException e) {
        if (logger.isDebugEnabled()) {
            logger.debug("Exception in parsing BodyPart.", e);
        }
    }
}
Also used : HashMap(java.util.HashMap) MessagingException(javax.mail.MessagingException) MimeTypeHelper(org.codelibs.fess.crawler.helper.MimeTypeHelper) ExtractorFactory(org.codelibs.fess.crawler.extractor.ExtractorFactory) InputStream(java.io.InputStream) Extractor(org.codelibs.fess.crawler.extractor.Extractor) ParseException(java.text.ParseException) IOException(java.io.IOException) ExtractException(org.codelibs.fess.crawler.exception.ExtractException) UnsupportedEncodingException(java.io.UnsupportedEncodingException)

Example 3 with Extractor

Use of org.codelibs.fess.crawler.extractor.Extractor in project fess-crawler by codelibs.

Class TarExtractor, method getTextInternal.

protected String getTextInternal(final InputStream in, final MimeTypeHelper mimeTypeHelper, final ExtractorFactory extractorFactory) {
    final StringBuilder buf = new StringBuilder(1000);
    ArchiveInputStream ais = null;
    try {
        ais = archiveStreamFactory.createArchiveInputStream("tar", in);
        TarArchiveEntry entry = null;
        long contentSize = 0;
        while ((entry = (TarArchiveEntry) ais.getNextEntry()) != null) {
            contentSize += entry.getSize();
            if (maxContentSize != -1 && contentSize > maxContentSize) {
                throw new MaxLengthExceededException("Extracted size is " + contentSize + " > " + maxContentSize);
            }
            final String filename = entry.getName();
            final String mimeType = mimeTypeHelper.getContentType(null, filename);
            if (mimeType != null) {
                final Extractor extractor = extractorFactory.getExtractor(mimeType);
                if (extractor != null) {
                    try {
                        final Map<String, String> map = new HashMap<>();
                        map.put(TikaMetadataKeys.RESOURCE_NAME_KEY, filename);
                        buf.append(extractor.getText(new IgnoreCloseInputStream(ais), map).getContent());
                        buf.append('\n');
                    } catch (final Exception e) {
                        if (logger.isDebugEnabled()) {
                            logger.debug("Exception in an internal extractor.", e);
                        }
                    }
                }
            }
        }
    } catch (final MaxLengthExceededException e) {
        throw e;
    } catch (final Exception e) {
        if (buf.length() == 0) {
            throw new ExtractException("Could not extract a content.", e);
        }
    } finally {
        CloseableUtil.closeQuietly(ais);
    }
    return buf.toString().trim();
}
Also used : ArchiveInputStream(org.apache.commons.compress.archivers.ArchiveInputStream) ExtractException(org.codelibs.fess.crawler.exception.ExtractException) MaxLengthExceededException(org.codelibs.fess.crawler.exception.MaxLengthExceededException) HashMap(java.util.HashMap) Extractor(org.codelibs.fess.crawler.extractor.Extractor) IgnoreCloseInputStream(org.codelibs.fess.crawler.util.IgnoreCloseInputStream) TarArchiveEntry(org.apache.commons.compress.archivers.tar.TarArchiveEntry) CrawlerSystemException(org.codelibs.fess.crawler.exception.CrawlerSystemException)
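
The archive stream is wrapped in IgnoreCloseInputStream before being handed to a nested extractor. The implementation of that class is not shown here, so the following is only a sketch of the idea its name suggests: a wrapper whose close() does nothing, so a nested reader cannot close the shared archive stream. NonClosingInputStream, readAndClose, extractAll and sampleZip are hypothetical names, and java.util.zip is used instead of Commons Compress so the sketch runs on the JDK alone.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical sketch; not the fess-crawler implementation.
final class NonClosingStreamSketch {

    // Assumed behavior of a close-ignoring wrapper: reads are delegated, close() is a no-op.
    static final class NonClosingInputStream extends FilterInputStream {
        NonClosingInputStream(final InputStream in) {
            super(in);
        }

        @Override
        public void close() {
            // Intentionally empty: the owner of the archive stream closes it.
        }
    }

    // Simulates a nested extractor that closes the stream it receives.
    static String readAndClose(final InputStream in) throws IOException {
        try (InputStream s = in) {
            final ByteArrayOutputStream out = new ByteArrayOutputStream();
            final byte[] chunk = new byte[1024];
            int n;
            while ((n = s.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    // Reads every entry through the wrapper, one line of output per entry.
    static String extractAll(final byte[] zipBytes) {
        final StringBuilder buf = new StringBuilder();
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            while (zis.getNextEntry() != null) {
                buf.append(readAndClose(new NonClosingInputStream(zis))).append('\n');
            }
        } catch (final IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toString().trim();
    }

    // Builds a two-entry zip in memory for the demonstration.
    static byte[] sampleZip() {
        final ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            zos.putNextEntry(new ZipEntry("a.txt"));
            zos.write("alpha".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
            zos.putNextEntry(new ZipEntry("b.txt"));
            zos.write("beta".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        } catch (final IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }
}
```

If readAndClose received the ZipInputStream directly, its close() would close the archive and the next getNextEntry() call would fail; with the wrapper, iteration continues to the second entry.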

Example 4 with Extractor

Use of org.codelibs.fess.crawler.extractor.Extractor in project fess-crawler by codelibs.

Class ZipExtractor, method getText.

@Override
public ExtractData getText(final InputStream in, final Map<String, String> params) {
    if (in == null) {
        throw new CrawlerSystemException("The inputstream is null.");
    }
    final MimeTypeHelper mimeTypeHelper = getMimeTypeHelper();
    final ExtractorFactory extractorFactory = getExtractorFactory();
    final StringBuilder buf = new StringBuilder(1000);
    try (final ArchiveInputStream ais = archiveStreamFactory.createArchiveInputStream(in.markSupported() ? in : new BufferedInputStream(in))) {
        ZipArchiveEntry entry = null;
        long contentSize = 0;
        while ((entry = (ZipArchiveEntry) ais.getNextEntry()) != null) {
            contentSize += entry.getSize();
            if (maxContentSize != -1 && contentSize > maxContentSize) {
                throw new MaxLengthExceededException("Extracted size is " + contentSize + " > " + maxContentSize);
            }
            final String filename = entry.getName();
            final String mimeType = mimeTypeHelper.getContentType(null, filename);
            if (mimeType != null) {
                final Extractor extractor = extractorFactory.getExtractor(mimeType);
                if (extractor != null) {
                    try {
                        final Map<String, String> map = new HashMap<>();
                        map.put(TikaMetadataKeys.RESOURCE_NAME_KEY, filename);
                        buf.append(extractor.getText(new IgnoreCloseInputStream(ais), map).getContent());
                        buf.append('\n');
                    } catch (final Exception e) {
                        if (logger.isDebugEnabled()) {
                            logger.debug("Exception in an internal extractor.", e);
                        }
                    }
                }
            }
        }
    } catch (final MaxLengthExceededException e) {
        throw e;
    } catch (final Exception e) {
        if (buf.length() == 0) {
            throw new ExtractException("Could not extract a content.", e);
        }
    }
    return new ExtractData(buf.toString().trim());
}
Also used : ExtractException(org.codelibs.fess.crawler.exception.ExtractException) ExtractData(org.codelibs.fess.crawler.entity.ExtractData) MaxLengthExceededException(org.codelibs.fess.crawler.exception.MaxLengthExceededException) HashMap(java.util.HashMap) MimeTypeHelper(org.codelibs.fess.crawler.helper.MimeTypeHelper) ExtractorFactory(org.codelibs.fess.crawler.extractor.ExtractorFactory) CrawlerSystemException(org.codelibs.fess.crawler.exception.CrawlerSystemException) ArchiveInputStream(org.apache.commons.compress.archivers.ArchiveInputStream) BufferedInputStream(java.io.BufferedInputStream) ZipArchiveEntry(org.apache.commons.compress.archivers.zip.ZipArchiveEntry) Extractor(org.codelibs.fess.crawler.extractor.Extractor) IgnoreCloseInputStream(org.codelibs.fess.crawler.util.IgnoreCloseInputStream)
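
The cumulative size check used by the archive extractors can be looked at on its own. This is a rough sketch under stated assumptions: SizeLimitSketch and checkTotal are hypothetical names, IllegalStateException stands in for MaxLengthExceededException, and -1 means "no limit" as in the snippet above. Unlike the original loop, the sketch skips negative sizes, on the assumption that an archive API may report an unknown entry size as -1 (java.util.zip.ZipEntry.getSize() does).

```java
// Hypothetical helper for illustration only; not part of fess-crawler.
final class SizeLimitSketch {

    private SizeLimitSketch() {
    }

    // Sums the declared entry sizes and fails as soon as the running total
    // exceeds maxContentSize. A limit of -1 disables the check.
    static long checkTotal(final long[] entrySizes, final long maxContentSize) {
        long contentSize = 0;
        for (final long size : entrySizes) {
            if (size > 0) {
                // Design choice of this sketch: unknown (negative) sizes are not counted.
                contentSize += size;
            }
            if (maxContentSize != -1 && contentSize > maxContentSize) {
                throw new IllegalStateException("Extracted size is " + contentSize + " > " + maxContentSize);
            }
        }
        return contentSize;
    }
}
```

The check relies on sizes declared in the archive headers, so it bounds the work before any entry content is read; it does not measure the bytes actually extracted.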

Example 5 with Extractor

Use of org.codelibs.fess.crawler.extractor.Extractor in project fess by codelibs.

Class FessStandardTransformer, method getExtractor.

@Override
protected Extractor getExtractor(final ResponseData responseData) {
    final ExtractorFactory extractorFactory = ComponentUtil.getExtractorFactory();
    if (extractorFactory == null) {
        throw new FessSystemException("Could not find extractorFactory.");
    }
    Extractor extractor = extractorFactory.getExtractor(responseData.getMimeType());
    if (extractor == null) {
        extractor = ComponentUtil.getComponent("tikaExtractor");
        if (extractor == null) {
            throw new FessSystemException("Could not find tikaExtractor.");
        }
    }
    if (logger.isDebugEnabled()) {
        logger.debug("url={}, extractor={}", responseData.getUrl(), extractor);
    }
    return extractor;
}
Also used : ExtractorFactory(org.codelibs.fess.crawler.extractor.ExtractorFactory) Extractor(org.codelibs.fess.crawler.extractor.Extractor) FessSystemException(org.codelibs.fess.exception.FessSystemException)
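
The lookup-then-fallback shape of getExtractor can be sketched without the Fess container. Everything below is hypothetical: ExtractorFallbackSketch, register and resolve are illustrative names, extractors are represented by plain strings, and a HashMap stands in for ExtractorFactory.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical registry for illustration only; not the Fess ExtractorFactory.
final class ExtractorFallbackSketch {

    private final Map<String, String> byMimeType = new HashMap<>();

    private final String fallback;

    ExtractorFallbackSketch(final String fallback) {
        this.fallback = fallback;
    }

    void register(final String mimeType, final String extractorName) {
        byMimeType.put(mimeType, extractorName);
    }

    // Returns the extractor registered for the MIME type, otherwise the fallback;
    // fails loudly if neither is available, as getExtractor does.
    String resolve(final String mimeType) {
        String extractor = byMimeType.get(mimeType);
        if (extractor == null) {
            extractor = fallback;
            if (extractor == null) {
                throw new IllegalStateException("Could not find a fallback extractor.");
            }
        }
        return extractor;
    }
}
```

With "tikaExtractor" as the fallback and only "application/zip" registered, resolve("image/png") returns "tikaExtractor".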

Aggregations

Extractor (org.codelibs.fess.crawler.extractor.Extractor): 10
HashMap (java.util.HashMap): 7
CrawlerSystemException (org.codelibs.fess.crawler.exception.CrawlerSystemException): 6
ExtractException (org.codelibs.fess.crawler.exception.ExtractException): 6
InputStream (java.io.InputStream): 5
ExtractData (org.codelibs.fess.crawler.entity.ExtractData): 5
ExtractorFactory (org.codelibs.fess.crawler.extractor.ExtractorFactory): 5
IOException (java.io.IOException): 3
UnsupportedEncodingException (java.io.UnsupportedEncodingException): 3
BufferedInputStream (java.io.BufferedInputStream): 2
File (java.io.File): 2
FileInputStream (java.io.FileInputStream): 2
FileOutputStream (java.io.FileOutputStream): 2
Map (java.util.Map): 2
ArchiveInputStream (org.apache.commons.compress.archivers.ArchiveInputStream): 2
HttpHeaders (org.apache.tika.metadata.HttpHeaders): 2
TikaMetadataKeys (org.apache.tika.metadata.TikaMetadataKeys): 2
StringUtil (org.codelibs.core.lang.StringUtil): 2
MaxLengthExceededException (org.codelibs.fess.crawler.exception.MaxLengthExceededException): 2
MimeTypeHelper (org.codelibs.fess.crawler.helper.MimeTypeHelper): 2