
Example 21 with CrawlerSystemException

use of org.codelibs.fess.crawler.exception.CrawlerSystemException in project fess-crawler by codelibs.

the class PdfExtractor method getText.

/*
     * (non-Javadoc)
     *
     * @see org.codelibs.fess.crawler.extractor.Extractor#getText(java.io.InputStream,
     * java.util.Map)
     */
@Override
public ExtractData getText(final InputStream in, final Map<String, String> params) {
    if (in == null) {
        throw new CrawlerSystemException("The inputstream is null.");
    }
    synchronized (pdfBoxLockObj) {
        // PDFBox is not a thread-safe library
        final String password = getPassword(params);
        try (PDDocument document = PDDocument.load(in, password)) {
            final StringWriter output = new StringWriter();
            final PDFTextStripper stripper = new PDFTextStripper();
            final AtomicBoolean done = new AtomicBoolean(false);
            final PDDocument doc = document;
            final Set<Exception> exceptionSet = new HashSet<>();
            final Thread task = new Thread(() -> {
                try {
                    stripper.writeText(doc, output);
                } catch (final Exception e) {
                    exceptionSet.add(e);
                } finally {
                    done.set(true);
                }
            }, Thread.currentThread().getName() + "-pdf");
            task.setDaemon(isDaemonThread);
            task.start();
            task.join(timeout);
            if (!done.get()) {
                for (int i = 0; i < 100 && !done.get(); i++) {
                    task.interrupt();
                    Thread.sleep(100);
                }
                throw new ExtractException("PDFBox process cannot finish in " + timeout + " msec.");
            } else if (!exceptionSet.isEmpty()) {
                throw exceptionSet.iterator().next();
            }
            output.flush();
            final ExtractData extractData = new ExtractData(output.toString());
            extractMetadata(document, extractData);
            return extractData;
        } catch (final Exception e) {
            throw new ExtractException(e);
        }
    }
}
Also used : ExtractException(org.codelibs.fess.crawler.exception.ExtractException) ExtractData(org.codelibs.fess.crawler.entity.ExtractData) CrawlerSystemException(org.codelibs.fess.crawler.exception.CrawlerSystemException) AtomicBoolean(java.util.concurrent.atomic.AtomicBoolean) StringWriter(java.io.StringWriter) PDDocument(org.apache.pdfbox.pdmodel.PDDocument) PDFTextStripper(org.apache.pdfbox.text.PDFTextStripper) HashSet(java.util.HashSet)
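The timeout pattern in the snippet above (a worker thread, an `AtomicBoolean` done flag, `Thread.join(timeout)`, and a set that carries any failure back to the caller) can be sketched in isolation. This is a minimal, self-contained version under assumed names (`BoundedTask`, `runWithTimeout`); it illustrates the pattern, not PdfExtractor's actual code.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicBoolean;

public class BoundedTask {

    // Runs work on a daemon thread and waits at most timeoutMillis, mirroring
    // the PdfExtractor pattern: a done flag plus a set collecting any failure.
    public static String runWithTimeout(final Runnable work, final long timeoutMillis) throws Exception {
        final AtomicBoolean done = new AtomicBoolean(false);
        // A plain HashSet is safe here: the exceptionSet write happens before
        // done.set(true), and the AtomicBoolean read provides the needed
        // happens-before edge back to this thread.
        final Set<Exception> exceptionSet = new HashSet<>();
        final Thread task = new Thread(() -> {
            try {
                work.run();
            } catch (final Exception e) {
                exceptionSet.add(e);
            } finally {
                done.set(true);
            }
        }, Thread.currentThread().getName() + "-worker");
        task.setDaemon(true);
        task.start();
        task.join(timeoutMillis);
        if (!done.get()) {
            task.interrupt(); // best effort; the caller gives up either way
            return "timeout";
        }
        if (!exceptionSet.isEmpty()) {
            throw exceptionSet.iterator().next();
        }
        return "done";
    }

    public static void main(final String[] args) throws Exception {
        System.out.println(runWithTimeout(() -> { }, 1000L)); // prints "done"
    }
}
```

Marking the worker as a daemon thread matters: if the parse hangs past the interrupt attempts, the stuck thread must not keep the JVM alive.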

Example 22 with CrawlerSystemException

use of org.codelibs.fess.crawler.exception.CrawlerSystemException in project fess-crawler by codelibs.

the class TarExtractor method getText.

@Override
public ExtractData getText(final InputStream in, final Map<String, String> params) {
    if (in == null) {
        throw new CrawlerSystemException("The inputstream is null.");
    }
    final MimeTypeHelper mimeTypeHelper = getMimeTypeHelper();
    final ExtractorFactory extractorFactory = getExtractorFactory();
    return new ExtractData(getTextInternal(in, mimeTypeHelper, extractorFactory));
}
Also used : ExtractData(org.codelibs.fess.crawler.entity.ExtractData) MimeTypeHelper(org.codelibs.fess.crawler.helper.MimeTypeHelper) ExtractorFactory(org.codelibs.fess.crawler.extractor.ExtractorFactory) CrawlerSystemException(org.codelibs.fess.crawler.exception.CrawlerSystemException)

Example 23 with CrawlerSystemException

use of org.codelibs.fess.crawler.exception.CrawlerSystemException in project fess-crawler by codelibs.

the class TikaExtractor method getContent.

protected String getContent(final ContentWriter out, final String encoding) throws TikaException {
    File tempFile = null;
    try {
        tempFile = File.createTempFile("tika", ".tmp");
    } catch (final IOException e) {
        throw new CrawlerSystemException("Failed to create a temp file.", e);
    }
    final String enc = encoding == null ? Constants.UTF_8 : encoding;
    try (DeferredFileOutputStream dfos = new DeferredFileOutputStream(memorySize, tempFile)) {
        final BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(dfos, enc));
        out.accept(writer);
        writer.flush();
        try (Reader reader = new InputStreamReader(getContentStream(dfos), enc)) {
            return TextUtil.normalizeText(reader)
                    .initialCapacity(initialBufferSize)
                    .maxAlphanumTermSize(maxAlphanumTermSize)
                    .maxSymbolTermSize(maxSymbolTermSize)
                    .duplicateTermRemoved(replaceDuplication)
                    .execute();
        }
    } catch (final TikaException e) {
        throw e;
    } catch (final Exception e) {
        throw new ExtractException("Failed to read a content.", e);
    } finally {
        if (tempFile.exists() && !tempFile.delete()) {
            logger.warn("Failed to delete " + tempFile.getAbsolutePath());
        }
    }
}
Also used : ExtractException(org.codelibs.fess.crawler.exception.ExtractException) TikaException(org.apache.tika.exception.TikaException) InputStreamReader(java.io.InputStreamReader) CrawlerSystemException(org.codelibs.fess.crawler.exception.CrawlerSystemException) Reader(java.io.Reader) BufferedReader(java.io.BufferedReader) OutputStreamWriter(java.io.OutputStreamWriter) IOException(java.io.IOException) DeferredFileOutputStream(org.apache.commons.io.output.DeferredFileOutputStream) File(java.io.File) SAXException(org.xml.sax.SAXException) BufferedWriter(java.io.BufferedWriter)
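`DeferredFileOutputStream` keeps the written content in memory until it exceeds `memorySize` bytes, then spills to `tempFile`; `getContentStream(dfos)` then reads back from wherever the bytes ended up. A simplified, stdlib-only re-implementation of that idea (a sketch of the concept, not Commons IO's actual code; all names are illustrative):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Spill-to-disk buffer: stays in memory up to a threshold, then switches
// the underlying stream to a file after copying the bytes already written.
public class DeferredBuffer extends OutputStream {
    private final int threshold;
    private final File file;
    private final ByteArrayOutputStream memory = new ByteArrayOutputStream();
    private OutputStream current = null;
    private boolean onDisk = false;

    public DeferredBuffer(final int threshold, final File file) {
        this.threshold = threshold;
        this.file = file;
        this.current = memory;
    }

    @Override
    public void write(final int b) throws IOException {
        if (!onDisk && memory.size() + 1 > threshold) {
            // Spill: copy what we have to the file and switch streams.
            final OutputStream out = new FileOutputStream(file);
            memory.writeTo(out);
            current = out;
            onDisk = true;
        }
        current.write(b);
    }

    public boolean isInMemory() {
        return !onDisk;
    }

    // Closes the sink and returns a stream over the full content, reading
    // from memory or from the spill file depending on where we ended up.
    public InputStream getContentStream() throws IOException {
        current.close();
        return onDisk ? new FileInputStream(file)
                      : new ByteArrayInputStream(memory.toByteArray());
    }

    public static void main(final String[] args) throws Exception {
        final File tmp = File.createTempFile("deferred", ".tmp");
        tmp.deleteOnExit();
        final DeferredBuffer buf = new DeferredBuffer(4, tmp);
        buf.write("ab".getBytes());   // 2 bytes: still in memory
        System.out.println(buf.isInMemory()); // prints "true"
        buf.write("cdef".getBytes()); // exceeds the 4-byte threshold: spills
        System.out.println(buf.isInMemory()); // prints "false"
        System.out.println(new String(buf.getContentStream().readAllBytes())); // prints "abcdef"
    }
}
```

This is why `getContent` needs the `finally` block that deletes `tempFile`: whether the spill happened or not, the temp file was created eagerly and must be cleaned up.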

Example 24 with CrawlerSystemException

use of org.codelibs.fess.crawler.exception.CrawlerSystemException in project fess-crawler by codelibs.

the class TikaExtractor method getText.

@Override
public ExtractData getText(final InputStream inputStream, final Map<String, String> params) {
    if (inputStream == null) {
        throw new CrawlerSystemException("The inputstream is null.");
    }
    final File tempFile;
    final boolean isByteStream = inputStream instanceof ByteArrayInputStream;
    if (isByteStream) {
        inputStream.mark(0);
        tempFile = null;
    } else {
        try {
            tempFile = File.createTempFile("tikaExtractor-", ".out");
        } catch (final IOException e) {
            throw new ExtractException("Could not create a temp file.", e);
        }
    }
    try {
        final PrintStream originalOutStream = System.out;
        final ByteArrayOutputStream outStream = new ByteArrayOutputStream();
        System.setOut(new PrintStream(outStream, true));
        final PrintStream originalErrStream = System.err;
        final ByteArrayOutputStream errStream = new ByteArrayOutputStream();
        System.setErr(new PrintStream(errStream, true));
        try {
            final String resourceName = params == null ? null : params.get(TikaMetadataKeys.RESOURCE_NAME_KEY);
            final String contentType = params == null ? null : params.get(HttpHeaders.CONTENT_TYPE);
            String contentEncoding = params == null ? null : params.get(HttpHeaders.CONTENT_ENCODING);
            String pdfPassword = getPassword(params);
            final Metadata metadata = createMetadata(resourceName, contentType, contentEncoding, pdfPassword);
            final Parser parser = new TikaDetectParser();
            final ParseContext parseContext = createParseContext(parser, params);
            String content = getContent(writer -> {
                InputStream in = null;
                try {
                    if (!isByteStream) {
                        try (OutputStream out = new FileOutputStream(tempFile)) {
                            CopyUtil.copy(inputStream, out);
                        }
                        in = new FileInputStream(tempFile);
                    } else {
                        in = inputStream;
                    }
                    parser.parse(in, new BodyContentHandler(writer), metadata, parseContext);
                } finally {
                    CloseableUtil.closeQuietly(in);
                }
            }, contentEncoding);
            if (StringUtil.isBlank(content)) {
                if (resourceName != null) {
                    if (logger.isDebugEnabled()) {
                        logger.debug("retry without a resource name: {}", resourceName);
                    }
                    final Metadata metadata2 = createMetadata(null, contentType, contentEncoding, pdfPassword);
                    content = getContent(writer -> {
                        InputStream in = null;
                        try {
                            if (isByteStream) {
                                inputStream.reset();
                                in = inputStream;
                            } else {
                                in = new FileInputStream(tempFile);
                            }
                            parser.parse(in, new BodyContentHandler(writer), metadata2, parseContext);
                        } finally {
                            CloseableUtil.closeQuietly(in);
                        }
                    }, contentEncoding);
                }
                if (StringUtil.isBlank(content) && contentType != null) {
                    if (logger.isDebugEnabled()) {
                        logger.debug("retry without a content type: {}", contentType);
                    }
                    final Metadata metadata3 = createMetadata(null, null, contentEncoding, pdfPassword);
                    content = getContent(writer -> {
                        InputStream in = null;
                        try {
                            if (isByteStream) {
                                inputStream.reset();
                                in = inputStream;
                            } else {
                                in = new FileInputStream(tempFile);
                            }
                            parser.parse(in, new BodyContentHandler(writer), metadata3, parseContext);
                        } finally {
                            CloseableUtil.closeQuietly(in);
                        }
                    }, contentEncoding);
                }
                if (readAsTextIfFailed && StringUtil.isBlank(content)) {
                    if (logger.isDebugEnabled()) {
                        logger.debug("read the content as a text.");
                    }
                    if (contentEncoding == null) {
                        contentEncoding = Constants.UTF_8;
                    }
                    final String enc = contentEncoding;
                    content = getContent(writer -> {
                        BufferedReader br = null;
                        try {
                            if (isByteStream) {
                                inputStream.reset();
                                br = new BufferedReader(new InputStreamReader(inputStream, enc));
                            } else {
                                br = new BufferedReader(new InputStreamReader(new FileInputStream(tempFile), enc));
                            }
                            String line;
                            while ((line = br.readLine()) != null) {
                                writer.write(line);
                            }
                        } catch (final Exception e) {
                            logger.warn("Could not read " + (tempFile != null ? tempFile.getAbsolutePath() : "a byte stream"), e);
                        } finally {
                            CloseableUtil.closeQuietly(br);
                        }
                    }, contentEncoding);
                }
            }
            final ExtractData extractData = new ExtractData(content);
            final String[] names = metadata.names();
            Arrays.sort(names);
            for (final String name : names) {
                extractData.putValues(name, metadata.getValues(name));
            }
            if (logger.isDebugEnabled()) {
                logger.debug("Result: metadata: {}", metadata);
            }
            return extractData;
        } catch (final TikaException e) {
            if (e.getMessage() != null && e.getMessage().indexOf("bomb") >= 0) {
                throw e;
            }
            final Throwable cause = e.getCause();
            if (cause instanceof SAXException) {
                final Extractor xmlExtractor = crawlerContainer.getComponent("xmlExtractor");
                if (xmlExtractor != null) {
                    InputStream in = null;
                    try {
                        if (isByteStream) {
                            inputStream.reset();
                            in = inputStream;
                        } else {
                            in = new FileInputStream(tempFile);
                        }
                        return xmlExtractor.getText(in, params);
                    } finally {
                        CloseableUtil.closeQuietly(in);
                    }
                }
            }
            throw e;
        } finally {
            if (originalOutStream != null) {
                System.setOut(originalOutStream);
            }
            if (originalErrStream != null) {
                System.setErr(originalErrStream);
            }
            try {
                if (logger.isInfoEnabled()) {
                    final byte[] bs = outStream.toByteArray();
                    if (bs.length != 0) {
                        logger.info(new String(bs, outputEncoding));
                    }
                }
                if (logger.isWarnEnabled()) {
                    final byte[] bs = errStream.toByteArray();
                    if (bs.length != 0) {
                        logger.warn(new String(bs, outputEncoding));
                    }
                }
            } catch (final Exception e) {
            // NOP
            }
        }
    } catch (final Exception e) {
        throw new ExtractException("Could not extract a content.", e);
    } finally {
        if (tempFile != null && !tempFile.delete()) {
            logger.warn("Failed to delete " + tempFile.getAbsolutePath());
        }
    }
}
Also used : Arrays(java.util.Arrays) BufferedInputStream(java.io.BufferedInputStream) Parser(org.apache.tika.parser.Parser) TesseractOCRConfig(org.apache.tika.parser.ocr.TesseractOCRConfig) LoggerFactory(org.slf4j.LoggerFactory) TikaMetadataKeys(org.apache.tika.metadata.TikaMetadataKeys) CrawlerSystemException(org.codelibs.fess.crawler.exception.CrawlerSystemException) Metadata(org.apache.tika.metadata.Metadata) ByteArrayInputStream(java.io.ByteArrayInputStream) Map(java.util.Map) CopyUtil(org.codelibs.core.io.CopyUtil) ParsingEmbeddedDocumentExtractor(org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor) TemporaryResources(org.apache.tika.io.TemporaryResources) Extractor(org.codelibs.fess.crawler.extractor.Extractor) ConcurrentHashMap(java.util.concurrent.ConcurrentHashMap) ByteArrayOutputStream(org.apache.commons.io.output.ByteArrayOutputStream) CompositeParser(org.apache.tika.parser.CompositeParser) ExtractException(org.codelibs.fess.crawler.exception.ExtractException) Reader(java.io.Reader) SecureContentHandler(org.apache.tika.sax.SecureContentHandler) ParseContext(org.apache.tika.parser.ParseContext) SAXException(org.xml.sax.SAXException) Writer(java.io.Writer) PostConstruct(javax.annotation.PostConstruct) BodyContentHandler(org.apache.tika.sax.BodyContentHandler) TikaConfig(org.apache.tika.config.TikaConfig) EmbeddedDocumentExtractor(org.apache.tika.extractor.EmbeddedDocumentExtractor) MediaType(org.apache.tika.mime.MediaType) TikaException(org.apache.tika.exception.TikaException) ExtractData(org.codelibs.fess.crawler.entity.ExtractData) PasswordProvider(org.apache.tika.parser.PasswordProvider) OutputStreamWriter(java.io.OutputStreamWriter) TikaInputStream(org.apache.tika.io.TikaInputStream) ContentHandler(org.xml.sax.ContentHandler) OutputStream(java.io.OutputStream) PrintStream(java.io.PrintStream) Logger(org.slf4j.Logger) BufferedWriter(java.io.BufferedWriter) DeferredFileOutputStream(org.apache.commons.io.output.DeferredFileOutputStream) PDFParserConfig(org.apache.tika.parser.pdf.PDFParserConfig) StringUtil(org.codelibs.core.lang.StringUtil) FileOutputStream(java.io.FileOutputStream) IOException(java.io.IOException) FileInputStream(java.io.FileInputStream) Detector(org.apache.tika.detect.Detector) InputStreamReader(java.io.InputStreamReader) File(java.io.File) CloseableUtil(org.codelibs.core.io.CloseableUtil) TextUtil(org.codelibs.fess.crawler.util.TextUtil) Constants(org.codelibs.fess.crawler.Constants) BufferedReader(java.io.BufferedReader) HttpHeaders(org.apache.tika.metadata.HttpHeaders) InputStream(java.io.InputStream)
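The `System.setOut`/`System.setErr` dance in Example 24 silences a chatty parser during the parse and re-logs whatever it printed afterwards. The capture-and-restore core, extracted as a stdlib-only sketch (class and method names are illustrative, not fess-crawler's API):

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;

public class StdoutCapture {

    // Redirects System.out into a buffer while 'noisy' runs, then restores
    // the original stream in a finally block, as the extractor above does.
    public static String capture(final Runnable noisy) {
        final PrintStream original = System.out;
        final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        System.setOut(new PrintStream(buffer, true));
        try {
            noisy.run();
        } finally {
            System.setOut(original); // restore unconditionally
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(final String[] args) {
        final String captured = capture(() -> System.out.println("parser noise"));
        System.out.println("captured: " + captured.trim()); // prints "captured: parser noise"
    }
}
```

Note that `System.setOut` swaps a process-wide global, so this trick is only safe when, as in the snippet, nothing else is expected to write to stdout concurrently.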

Example 25 with CrawlerSystemException

use of org.codelibs.fess.crawler.exception.CrawlerSystemException in project fess-crawler by codelibs.

the class CommandExtractor method executeCommand.

private void executeCommand(final File inputFile, final File outputFile) {
    if (StringUtil.isBlank(command)) {
        throw new CrawlerSystemException("command is empty.");
    }
    final Map<String, String> params = new HashMap<>();
    params.put("$INPUT_FILE", inputFile.getAbsolutePath());
    params.put("$OUTPUT_FILE", outputFile.getAbsolutePath());
    final List<String> cmdList = parseCommand(command, params);
    if (logger.isInfoEnabled()) {
        logger.info("Command: " + cmdList);
    }
    final ProcessBuilder pb = new ProcessBuilder(cmdList);
    if (workingDirectory != null) {
        pb.directory(workingDirectory);
    }
    if (standardOutput) {
        pb.redirectOutput(outputFile);
    } else {
        pb.redirectErrorStream(true);
    }
    Process currentProcess = null;
    MonitorThread mt = null;
    try {
        currentProcess = pb.start();
        // monitoring
        mt = new MonitorThread(currentProcess, executionTimeout);
        mt.start();
        final InputStreamThread it = new InputStreamThread(currentProcess.getInputStream(), commandOutputEncoding, maxOutputLine);
        it.start();
        currentProcess.waitFor();
        it.join(5000);
        if (mt.isTeminated()) {
            throw new ExecutionTimeoutException("The command execution is timeout: " + cmdList);
        }
        final int exitValue = currentProcess.exitValue();
        if (logger.isInfoEnabled()) {
            if (standardOutput) {
                logger.info("Exit Code: " + exitValue);
            } else {
                logger.info("Exit Code: " + exitValue + " - Process Output:\n" + it.getOutput());
            }
        }
        if (exitValue == 143 && mt.isTeminated()) {
            throw new ExecutionTimeoutException("The command execution is timeout: " + cmdList);
        }
    } catch (final CrawlerSystemException e) {
        throw e;
    } catch (final InterruptedException e) {
        if (mt != null && mt.isTeminated()) {
            throw new ExecutionTimeoutException("The command execution is timeout: " + cmdList, e);
        }
        throw new CrawlerSystemException("Process terminated.", e);
    } catch (final Exception e) {
        throw new CrawlerSystemException("Process terminated.", e);
    } finally {
        if (mt != null) {
            mt.setFinished(true);
            try {
                mt.interrupt();
            } catch (final Exception e) {
                // ignore: best-effort interrupt during cleanup
            }
        }
        if (currentProcess != null) {
            try {
                currentProcess.destroy();
            } catch (final Exception e) {
                // ignore: best-effort process shutdown during cleanup
            }
        }
        currentProcess = null;
    }
}
Also used : HashMap(java.util.HashMap) ExecutionTimeoutException(org.codelibs.fess.crawler.exception.ExecutionTimeoutException) IOException(java.io.IOException) ExtractException(org.codelibs.fess.crawler.exception.ExtractException) CrawlerSystemException(org.codelibs.fess.crawler.exception.CrawlerSystemException) UnsupportedEncodingException(java.io.UnsupportedEncodingException)
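`parseCommand` itself is not shown in this example. A hypothetical, simplified version that only substitutes whole-token placeholders such as `$INPUT_FILE` could look like the sketch below; the real implementation presumably also handles quoted arguments, and the paths in the usage example are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class CommandTemplate {

    // Splits the command template on whitespace and replaces any token that
    // is exactly a known placeholder (e.g. "$INPUT_FILE") with its value.
    public static List<String> parseCommand(final String command, final Map<String, String> params) {
        final List<String> cmdList = new ArrayList<>();
        for (final String token : command.trim().split("\\s+")) {
            cmdList.add(params.getOrDefault(token, token));
        }
        return cmdList;
    }

    public static void main(final String[] args) {
        final List<String> cmdList = parseCommand("convert $INPUT_FILE $OUTPUT_FILE",
                Map.of("$INPUT_FILE", "/tmp/in.pdf", "$OUTPUT_FILE", "/tmp/out.txt"));
        System.out.println(cmdList); // prints "[convert, /tmp/in.pdf, /tmp/out.txt]"
    }
}
```

Building the command as a `List<String>` for `ProcessBuilder`, rather than one shell string, avoids shell injection through the substituted file paths.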

Aggregations

CrawlerSystemException (org.codelibs.fess.crawler.exception.CrawlerSystemException) 41
IOException (java.io.IOException) 16
CrawlingAccessException (org.codelibs.fess.crawler.exception.CrawlingAccessException) 13
File (java.io.File) 11
InputStream (java.io.InputStream) 11
UnsupportedEncodingException (java.io.UnsupportedEncodingException) 10
BufferedInputStream (java.io.BufferedInputStream) 9
ExtractException (org.codelibs.fess.crawler.exception.ExtractException) 9
ExtractData (org.codelibs.fess.crawler.entity.ExtractData) 8
ResponseData (org.codelibs.fess.crawler.entity.ResponseData) 8
Map (java.util.Map) 7
MaxLengthExceededException (org.codelibs.fess.crawler.exception.MaxLengthExceededException) 7
MalformedURLException (java.net.MalformedURLException) 6
HashMap (java.util.HashMap) 6
AccessResultDataImpl (org.codelibs.fess.crawler.entity.AccessResultDataImpl) 6
RequestData (org.codelibs.fess.crawler.entity.RequestData) 6
ResultData (org.codelibs.fess.crawler.entity.ResultData) 6
ChildUrlsException (org.codelibs.fess.crawler.exception.ChildUrlsException) 6
HashSet (java.util.HashSet) 5
TransformerException (javax.xml.transform.TransformerException) 5