Example 6 with ExtractorFactory

use of org.codelibs.fess.crawler.extractor.ExtractorFactory in project fess-crawler by codelibs.

the class ZipExtractor method getText.

@Override
public ExtractData getText(final InputStream in, final Map<String, String> params) {
    if (in == null) {
        throw new CrawlerSystemException("The inputstream is null.");
    }
    final MimeTypeHelper mimeTypeHelper = getMimeTypeHelper();
    final ExtractorFactory extractorFactory = getExtractorFactory();
    final StringBuilder buf = new StringBuilder(1000);
    try (final ArchiveInputStream ais = archiveStreamFactory.createArchiveInputStream(in.markSupported() ? in : new BufferedInputStream(in))) {
        ZipArchiveEntry entry = null;
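        long contentSize = 0;
        while ((entry = (ZipArchiveEntry) ais.getNextEntry()) != null) {
            // Track the cumulative declared size of all entries seen so far.
            contentSize += entry.getSize();
            // A maxContentSize of -1 disables the limit.
            if (maxContentSize != -1 && contentSize > maxContentSize) {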
                throw new MaxLengthExceededException("Extracted size is " + contentSize + " > " + maxContentSize);
            }
            final String filename = entry.getName();
            final String mimeType = mimeTypeHelper.getContentType(null, filename);
            if (mimeType != null) {
                final Extractor extractor = extractorFactory.getExtractor(mimeType);
                if (extractor != null) {
                    try {
                        final Map<String, String> map = new HashMap<>();
                        map.put(TikaMetadataKeys.RESOURCE_NAME_KEY, filename);
                        buf.append(extractor.getText(new IgnoreCloseInputStream(ais), map).getContent());
                        buf.append('\n');
                    } catch (final Exception e) {
                        // A failure in one entry is only logged; the remaining entries are still processed.
                        if (logger.isDebugEnabled()) {
                            logger.debug("Exception in an internal extractor.", e);
                        }
                    }
                }
            }
        }
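    } catch (final MaxLengthExceededException e) {
        // The size limit violation is always propagated to the caller.
        throw e;
    } catch (final Exception e) {
        // Fail only if nothing was extracted; otherwise return the partial text.
        if (buf.length() == 0) {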
            throw new ExtractException("Could not extract a content.", e);
        }
    }
    return new ExtractData(buf.toString().trim());
}
Also used : ExtractException(org.codelibs.fess.crawler.exception.ExtractException) ExtractData(org.codelibs.fess.crawler.entity.ExtractData) MaxLengthExceededException(org.codelibs.fess.crawler.exception.MaxLengthExceededException) HashMap(java.util.HashMap) MimeTypeHelper(org.codelibs.fess.crawler.helper.MimeTypeHelper) ExtractorFactory(org.codelibs.fess.crawler.extractor.ExtractorFactory) CrawlerSystemException(org.codelibs.fess.crawler.exception.CrawlerSystemException) ArchiveInputStream(org.apache.commons.compress.archivers.ArchiveInputStream) BufferedInputStream(java.io.BufferedInputStream) ZipArchiveEntry(org.apache.commons.compress.archivers.zip.ZipArchiveEntry) Extractor(org.codelibs.fess.crawler.extractor.Extractor) IgnoreCloseInputStream(org.codelibs.fess.crawler.util.IgnoreCloseInputStream)
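
The getText method above enforces its limit by summing each entry's declared size. As a point of comparison, here is a minimal sketch of the same cumulative budget that counts the bytes actually read instead. It is not the ZipExtractor implementation: it swaps in the JDK's java.util.zip classes for Commons Compress, and the names ZipBudgetSketch, readAll and sampleZip are hypothetical and exist only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical sketch; not part of fess-crawler.
class ZipBudgetSketch {

    /**
     * Concatenates every file entry as UTF-8 text, one entry per line.
     * maxBytes is a budget shared by all entries; -1 disables the limit.
     */
    static String readAll(final InputStream in, final long maxBytes) {
        final StringBuilder buf = new StringBuilder();
        final byte[] chunk = new byte[4096];
        long total = 0;
        try (ZipInputStream zis = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                final ByteArrayOutputStream content = new ByteArrayOutputStream();
                int n;
                // read() returns -1 at the end of the current entry.
                while ((n = zis.read(chunk)) != -1) {
                    total += n;
                    if (maxBytes != -1 && total > maxBytes) {
                        throw new IllegalStateException("Extracted size is " + total + " > " + maxBytes);
                    }
                    content.write(chunk, 0, n);
                }
                buf.append(new String(content.toByteArray(), StandardCharsets.UTF_8));
                buf.append('\n');
            }
        } catch (final IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toString().trim();
    }

    /** Builds a small in-memory archive with two text entries. */
    static byte[] sampleZip() {
        final ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(out)) {
            zos.putNextEntry(new ZipEntry("a.txt"));
            zos.write("alpha".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
            zos.putNextEntry(new ZipEntry("b.txt"));
            zos.write("beta".getBytes(StandardCharsets.UTF_8));
            zos.closeEntry();
        } catch (final IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(final String[] args) {
        System.out.println(readAll(new ByteArrayInputStream(sampleZip()), -1));
    }
}
```

Counting bytes as they are read keeps the budget meaningful even when an entry does not report its size up front, at the cost of buffering each entry before decoding it.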

Example 7 with ExtractorFactory

use of org.codelibs.fess.crawler.extractor.ExtractorFactory in project fess-crawler by codelibs.

the class AbstractExtractor method register.

public void register(final List<String> keyList) {
    final ExtractorFactory extractorFactory = crawlerContainer.getComponent("extractorFactory");
    extractorFactory.addExtractor(keyList, this);
}
Also used : ExtractorFactory(org.codelibs.fess.crawler.extractor.ExtractorFactory)
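
The register method above hands a list of keys and the extractor itself to the factory. A minimal sketch of what such a key-to-extractor registry could look like follows. It is an assumption for illustration only, not the real ExtractorFactory: the names RegistrySketch, Registry, TextExtractor, add and get are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch; not the fess-crawler ExtractorFactory.
class RegistrySketch {

    /** Stand-in for an extractor: turns raw input into text. */
    interface TextExtractor {
        String extract(String raw);
    }

    /** Maps each key (for example a MIME type) to one extractor. */
    static class Registry {
        private final Map<String, TextExtractor> byKey = new HashMap<>();

        /** Registers the same extractor under every key in the list. */
        void add(final List<String> keys, final TextExtractor extractor) {
            for (final String key : keys) {
                byKey.put(key, extractor);
            }
        }

        /** Returns the extractor for the key, or null if none is registered. */
        TextExtractor get(final String key) {
            return byKey.get(key);
        }
    }

    public static void main(final String[] args) {
        final Registry registry = new Registry();
        final TextExtractor upper = raw -> raw.toUpperCase(Locale.ROOT);
        registry.add(Arrays.asList("text/plain", "text/csv"), upper);
        System.out.println(registry.get("text/csv").extract("a,b"));
        System.out.println(registry.get("application/pdf"));
    }
}
```

Returning null for an unknown key matches how the callers in the other examples test the lookup result before using it.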

Example 8 with ExtractorFactory

use of org.codelibs.fess.crawler.extractor.ExtractorFactory in project fess by codelibs.

the class FessStandardTransformer method getExtractor.

@Override
protected Extractor getExtractor(final ResponseData responseData) {
    final ExtractorFactory extractorFactory = ComponentUtil.getExtractorFactory();
    if (extractorFactory == null) {
        throw new FessSystemException("Could not find extractorFactory.");
    }
    Extractor extractor = extractorFactory.getExtractor(responseData.getMimeType());
    if (extractor == null) {
        // No extractor is registered for this MIME type: fall back to the Tika extractor.
        extractor = ComponentUtil.getComponent("tikaExtractor");
        if (extractor == null) {
            throw new FessSystemException("Could not find tikaExtractor.");
        }
    }
    if (logger.isDebugEnabled()) {
        logger.debug("url={}, extractor={}", responseData.getUrl(), extractor);
    }
    return extractor;
}
Also used : ExtractorFactory(org.codelibs.fess.crawler.extractor.ExtractorFactory) Extractor(org.codelibs.fess.crawler.extractor.Extractor) FessSystemException(org.codelibs.fess.exception.FessSystemException)
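
The getExtractor method above looks up an extractor by MIME type, falls back to a default one, and fails only when neither is available. The same pattern can be sketched without any fess classes. Everything here is an assumption for illustration: FallbackLookupSketch and resolve are hypothetical names, and a plain Map with Function values stands in for the factory and its extractors.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch; not the fess implementation.
class FallbackLookupSketch {

    /**
     * Returns the extractor registered for mimeType, or the fallback when
     * none is registered. Throws if the fallback is needed but missing.
     */
    static Function<String, String> resolve(final Map<String, Function<String, String>> registry,
            final Function<String, String> fallback, final String mimeType) {
        Function<String, String> found = registry.get(mimeType);
        if (found == null) {
            found = fallback;
            if (found == null) {
                throw new IllegalStateException("Could not find a fallback extractor.");
            }
        }
        return found;
    }

    public static void main(final String[] args) {
        final Map<String, Function<String, String>> registry = new HashMap<>();
        registry.put("application/pdf", s -> "pdf:" + s);
        final Function<String, String> fallback = s -> "fallback:" + s;

        System.out.println(resolve(registry, fallback, "application/pdf").apply("doc"));
        System.out.println(resolve(registry, fallback, "image/png").apply("doc"));
    }
}
```

Example 9 below shows the variant without a fallback, where the caller receives whatever the lookup returns, including null.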

Example 9 with ExtractorFactory

use of org.codelibs.fess.crawler.extractor.ExtractorFactory in project fess by codelibs.

the class FessFileTransformer method getExtractor.

@Override
protected Extractor getExtractor(final ResponseData responseData) {
    final ExtractorFactory extractorFactory = ComponentUtil.getExtractorFactory();
    if (extractorFactory == null) {
        throw new FessSystemException("Could not find extractorFactory.");
    }
    final Extractor extractor = extractorFactory.getExtractor(responseData.getMimeType());
    if (logger.isDebugEnabled()) {
        logger.debug("url={}, extractor={}", responseData.getUrl(), extractor);
    }
    return extractor;
}
Also used : ExtractorFactory(org.codelibs.fess.crawler.extractor.ExtractorFactory) Extractor(org.codelibs.fess.crawler.extractor.Extractor) FessSystemException(org.codelibs.fess.exception.FessSystemException)

Example 10 with ExtractorFactory

use of org.codelibs.fess.crawler.extractor.ExtractorFactory in project fess-crawler by codelibs.

the class TarExtractor method getText.

@Override
public ExtractData getText(final InputStream in, final Map<String, String> params) {
    if (in == null) {
        throw new CrawlerSystemException("The inputstream is null.");
    }
    final MimeTypeHelper mimeTypeHelper = getMimeTypeHelper();
    final ExtractorFactory extractorFactory = getExtractorFactory();
    return new ExtractData(getTextInternal(in, mimeTypeHelper, extractorFactory));
}
Also used : ExtractData(org.codelibs.fess.crawler.entity.ExtractData) MimeTypeHelper(org.codelibs.fess.crawler.helper.MimeTypeHelper) ExtractorFactory(org.codelibs.fess.crawler.extractor.ExtractorFactory) CrawlerSystemException(org.codelibs.fess.crawler.exception.CrawlerSystemException)

Aggregations

ExtractorFactory (org.codelibs.fess.crawler.extractor.ExtractorFactory) 15
StandardCrawlerContainer (org.codelibs.fess.crawler.container.StandardCrawlerContainer) 7
Extractor (org.codelibs.fess.crawler.extractor.Extractor) 6
MimeTypeHelperImpl (org.codelibs.fess.crawler.helper.impl.MimeTypeHelperImpl) 5
HashMap (java.util.HashMap) 4
CrawlerSystemException (org.codelibs.fess.crawler.exception.CrawlerSystemException) 4
MimeTypeHelper (org.codelibs.fess.crawler.helper.MimeTypeHelper) 4
InputStream (java.io.InputStream) 3
ExtractData (org.codelibs.fess.crawler.entity.ExtractData) 3
ExtractException (org.codelibs.fess.crawler.exception.ExtractException) 3
File (java.io.File) 2
IOException (java.io.IOException) 2
UnsupportedEncodingException (java.io.UnsupportedEncodingException) 2
ArchiveStreamFactory (org.apache.commons.compress.archivers.ArchiveStreamFactory) 2
MaxLengthExceededException (org.codelibs.fess.crawler.exception.MaxLengthExceededException) 2
TikaExtractor (org.codelibs.fess.crawler.extractor.impl.TikaExtractor) 2
FessSystemException (org.codelibs.fess.exception.FessSystemException) 2
BufferedInputStream (java.io.BufferedInputStream) 1
FileOutputStream (java.io.FileOutputStream) 1
ParseException (java.text.ParseException) 1