Example 1 with LhaFile

Use of jp.gr.java_conf.dangan.util.lha.LhaFile in the project fess-crawler by codelibs.

From the class LhaExtractor, method getText.

@Override
public ExtractData getText(final InputStream in, final Map<String, String> params) {
    if (in == null) {
        throw new CrawlerSystemException("The inputstream is null.");
    }
    final MimeTypeHelper mimeTypeHelper = getMimeTypeHelper();
    final ExtractorFactory extractorFactory = getExtractorFactory();
    final StringBuilder buf = new StringBuilder(1000);
    File tempFile = null;
    LhaFile lhaFile = null;
    try {
        tempFile = File.createTempFile("crawler-", ".lzh");
        try (FileOutputStream fos = new FileOutputStream(tempFile)) {
            CopyUtil.copy(in, fos);
        }
        lhaFile = new LhaFile(tempFile);
        @SuppressWarnings("unchecked")
        final Enumeration<LhaHeader> entries = lhaFile.entries();
        long contentSize = 0;
        while (entries.hasMoreElements()) {
            final LhaHeader head = entries.nextElement();
            contentSize += head.getOriginalSize();
            if (maxContentSize != -1 && contentSize > maxContentSize) {
                throw new MaxLengthExceededException("Extracted size is " + contentSize + " > " + maxContentSize);
            }
            final String filename = head.getPath();
            final String mimeType = mimeTypeHelper.getContentType(null, filename);
            if (mimeType != null) {
                final Extractor extractor = extractorFactory.getExtractor(mimeType);
                if (extractor != null) {
                    InputStream is = null;
                    try {
                        is = lhaFile.getInputStream(head);
                        final Map<String, String> map = new HashMap<>();
                        map.put(TikaMetadataKeys.RESOURCE_NAME_KEY, filename);
                        buf.append(extractor.getText(new IgnoreCloseInputStream(is), map).getContent());
                        buf.append('\n');
                    } catch (final Exception e) {
                        if (logger.isDebugEnabled()) {
                            logger.debug("Exception in an internal extractor.", e);
                        }
                    } finally {
                        CloseableUtil.closeQuietly(is);
                    }
                }
            }
        }
    } catch (final MaxLengthExceededException e) {
        throw e;
    } catch (final Exception e) {
        throw new ExtractException("Could not extract a content.", e);
    } finally {
        if (lhaFile != null) {
            try {
                lhaFile.close();
            } catch (final IOException e) {
                // ignore
            }
        }
        if (tempFile != null && !tempFile.delete()) {
            logger.warn("Failed to delete " + tempFile.getAbsolutePath());
        }
    }
    return new ExtractData(buf.toString().trim());
}
Also used: ExtractException (org.codelibs.fess.crawler.exception.ExtractException), ExtractData (org.codelibs.fess.crawler.entity.ExtractData), MaxLengthExceededException (org.codelibs.fess.crawler.exception.MaxLengthExceededException), HashMap (java.util.HashMap), MimeTypeHelper (org.codelibs.fess.crawler.helper.MimeTypeHelper), ExtractorFactory (org.codelibs.fess.crawler.extractor.ExtractorFactory), IgnoreCloseInputStream (org.codelibs.fess.crawler.util.IgnoreCloseInputStream), InputStream (java.io.InputStream), LhaFile (jp.gr.java_conf.dangan.util.lha.LhaFile), IOException (java.io.IOException), CrawlerSystemException (org.codelibs.fess.crawler.exception.CrawlerSystemException), LhaHeader (jp.gr.java_conf.dangan.util.lha.LhaHeader), FileOutputStream (java.io.FileOutputStream), Extractor (org.codelibs.fess.crawler.extractor.Extractor), File (java.io.File)
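
The pattern in getText (spool the stream to a temporary file, walk the archive entries, stop once the cumulative declared size passes a limit, and always delete the temporary file) can be tried without the LHA library. The following is only a minimal sketch under stated assumptions: it swaps in java.util.zip.ZipFile for LhaFile, treats every entry as UTF-8 text instead of delegating to a per-MIME-type Extractor, and throws IllegalStateException where the original throws MaxLengthExceededException. The class name ArchiveTextSketch and its methods are hypothetical and are not part of fess-crawler.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical stand-in for LhaExtractor, using ZIP instead of LHA.
class ArchiveTextSketch {

    // Concatenates the text of all entries; maxContentSize == -1 disables the limit.
    static String extractText(final InputStream in, final long maxContentSize) throws IOException {
        if (in == null) {
            throw new IllegalArgumentException("The input stream is null.");
        }
        final StringBuilder buf = new StringBuilder(1000);
        final File tempFile = File.createTempFile("sketch-", ".zip");
        try {
            // ZipFile opens a file rather than a stream, so the input is spooled to disk first.
            Files.copy(in, tempFile.toPath(), StandardCopyOption.REPLACE_EXISTING);
            try (ZipFile zipFile = new ZipFile(tempFile)) {
                final Enumeration<? extends ZipEntry> entries = zipFile.entries();
                long contentSize = 0;
                while (entries.hasMoreElements()) {
                    final ZipEntry entry = entries.nextElement();
                    if (entry.isDirectory()) {
                        continue;
                    }
                    // Add the declared uncompressed size; an unknown size (-1) counts as 0.
                    contentSize += Math.max(0L, entry.getSize());
                    if (maxContentSize != -1 && contentSize > maxContentSize) {
                        throw new IllegalStateException(
                                "Extracted size is " + contentSize + " > " + maxContentSize);
                    }
                    try (InputStream is = zipFile.getInputStream(entry)) {
                        buf.append(readUtf8(is)).append('\n');
                    }
                }
            }
        } finally {
            // Runs on success and on failure, after the archive has been closed.
            if (!tempFile.delete()) {
                System.err.println("Failed to delete " + tempFile.getAbsolutePath());
            }
        }
        return buf.toString().trim();
    }

    // Reads the whole stream and decodes it as UTF-8 (an assumption of this sketch).
    private static String readUtf8(final InputStream is) throws IOException {
        final ByteArrayOutputStream out = new ByteArrayOutputStream();
        final byte[] chunk = new byte[8192];
        int n;
        while ((n = is.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

Using try-with-resources for the archive and a single finally block for the temporary file keeps the cleanup order the same as in the original method: the archive is closed before the file is deleted.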

Aggregations

File (java.io.File): 1
FileOutputStream (java.io.FileOutputStream): 1
IOException (java.io.IOException): 1
InputStream (java.io.InputStream): 1
HashMap (java.util.HashMap): 1
LhaFile (jp.gr.java_conf.dangan.util.lha.LhaFile): 1
LhaHeader (jp.gr.java_conf.dangan.util.lha.LhaHeader): 1
ExtractData (org.codelibs.fess.crawler.entity.ExtractData): 1
CrawlerSystemException (org.codelibs.fess.crawler.exception.CrawlerSystemException): 1
ExtractException (org.codelibs.fess.crawler.exception.ExtractException): 1
MaxLengthExceededException (org.codelibs.fess.crawler.exception.MaxLengthExceededException): 1
Extractor (org.codelibs.fess.crawler.extractor.Extractor): 1
ExtractorFactory (org.codelibs.fess.crawler.extractor.ExtractorFactory): 1
MimeTypeHelper (org.codelibs.fess.crawler.helper.MimeTypeHelper): 1
IgnoreCloseInputStream (org.codelibs.fess.crawler.util.IgnoreCloseInputStream): 1