Example 1 with WriteOutContentHandler

Use of org.apache.tika.sax.WriteOutContentHandler in project tika by apache.

From the class TXTParserTest, method testEBCDIC_CP500.

@Test
public void testEBCDIC_CP500() throws Exception {
    Metadata metadata = new Metadata();
    StringWriter writer = new StringWriter();
    parser.parse(TXTParserTest.class.getResourceAsStream("/test-documents/english.cp500.txt"), new WriteOutContentHandler(writer), metadata, new ParseContext());
    assertEquals("text/plain; charset=IBM500", metadata.get(Metadata.CONTENT_TYPE));
    // Additional check that it isn't too eager on short blocks of text
    metadata = new Metadata();
    writer = new StringWriter();
    parser.parse(new ByteArrayInputStream("<html><body>hello world</body></html>".getBytes(ISO_8859_1)), new WriteOutContentHandler(writer), metadata, new ParseContext());
    assertEquals("text/plain; charset=ISO-8859-1", metadata.get(Metadata.CONTENT_TYPE));
}
Also used: StringWriter (java.io.StringWriter), WriteOutContentHandler (org.apache.tika.sax.WriteOutContentHandler), ByteArrayInputStream (java.io.ByteArrayInputStream), Metadata (org.apache.tika.metadata.Metadata), ParseContext (org.apache.tika.parser.ParseContext), Test (org.junit.Test), TikaTest (org.apache.tika.TikaTest)

Example 2 with WriteOutContentHandler

Use of org.apache.tika.sax.WriteOutContentHandler in project tika by apache.

From the class RTFParserTest, method getResult.

private Result getResult(String filename) throws Exception {
    File file = getResourceAsFile("/test-documents/" + filename);
    Metadata metadata = new Metadata();
    StringWriter writer = new StringWriter();
    // Close the file stream once parsing finishes, even if the parser throws
    try (FileInputStream stream = new FileInputStream(file)) {
        tika.getParser().parse(stream, new WriteOutContentHandler(writer), metadata, new ParseContext());
    }
    String content = writer.toString();
    return new Result(content, metadata);
}
Also used: StringWriter (java.io.StringWriter), WriteOutContentHandler (org.apache.tika.sax.WriteOutContentHandler), Metadata (org.apache.tika.metadata.Metadata), RTFMetadata (org.apache.tika.metadata.RTFMetadata), ParseContext (org.apache.tika.parser.ParseContext), File (java.io.File), FileInputStream (java.io.FileInputStream)

Example 3 with WriteOutContentHandler

Use of org.apache.tika.sax.WriteOutContentHandler in project tika by apache.

From the class Tika, method parseToString.

/**
 * Parses the given document and returns the extracted text content.
 * The given input stream is closed by this method. This method lets
 * you control the maxStringLength per call.
 * <p>
 * To avoid unpredictable excess memory use, the returned string contains
 * only up to maxLength (parameter) first characters extracted
 * from the input document.
 * <p>
 * <strong>NOTE:</strong> Unlike most other Tika methods that take an
 * {@link InputStream}, this method will close the given stream for
 * you as a convenience. With other methods you are still responsible
 * for closing the stream or a wrapper instance returned by Tika.
 *
 * @param stream the document to be parsed
 * @param metadata document metadata
 * @param maxLength maximum length of the returned string
 * @return extracted text content
 * @throws IOException if the document can not be read
 * @throws TikaException if the document can not be parsed
 */
public String parseToString(InputStream stream, Metadata metadata, int maxLength) throws IOException, TikaException {
    WriteOutContentHandler handler = new WriteOutContentHandler(maxLength);
    try {
        ParseContext context = new ParseContext();
        context.set(Parser.class, parser);
        parser.parse(stream, new BodyContentHandler(handler), metadata, context);
    } catch (SAXException e) {
        if (!handler.isWriteLimitReached(e)) {
            // This should never happen with BodyContentHandler...
            throw new TikaException("Unexpected SAX processing failure", e);
        }
    } finally {
        stream.close();
    }
    return handler.toString();
}
Also used: BodyContentHandler (org.apache.tika.sax.BodyContentHandler), TikaException (org.apache.tika.exception.TikaException), WriteOutContentHandler (org.apache.tika.sax.WriteOutContentHandler), ParseContext (org.apache.tika.parser.ParseContext), SAXException (org.xml.sax.SAXException)
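
The write-limit pattern used by parseToString (collect text up to a maximum length, abort the parse with a SAXException, then check whether that exception was the limit signal) can be tried out without Tika on the classpath. The following is a minimal sketch built only on the JDK's SAX API. The names WriteLimitSketch, LimitedTextHandler, LimitReached and isLimitReached are hypothetical and chosen for illustration; this is not Tika's actual WriteOutContentHandler implementation, only the same general idea.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class WriteLimitSketch {

    /** Hypothetical marker exception that signals "limit reached, stop parsing". */
    static final class LimitReached extends SAXException {
        LimitReached() {
            super("write limit reached");
        }
    }

    /** Collects character data and aborts the parse once maxLength characters are stored. */
    static final class LimitedTextHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private final int maxLength;

        LimitedTextHandler(int maxLength) {
            this.maxLength = maxLength;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            int room = maxLength - text.length();
            if (length > room) {
                // Keep only what still fits, then stop the parser.
                text.append(ch, start, room);
                throw new LimitReached();
            }
            text.append(ch, start, length);
        }

        /** Walks the cause chain in case the parser wrapped the marker exception. */
        boolean isLimitReached(Throwable t) {
            for (; t != null; t = t.getCause()) {
                if (t instanceof LimitReached) {
                    return true;
                }
            }
            return false;
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    /** Returns at most maxLength characters of text content from the given XML. */
    static String extract(String xml, int maxLength) throws Exception {
        LimitedTextHandler handler = new LimitedTextHandler(maxLength);
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (SAXException e) {
            // Reaching the limit is normal termination; anything else is a real failure.
            if (!handler.isLimitReached(e)) {
                throw e;
            }
        }
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extract("<doc>hello world</doc>", 5));
    }
}
```

As in parseToString, the caller treats the limit signal as a normal way to end the parse and rethrows any other SAXException. The text collected before the abort is still available from the handler afterwards.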

Example 4 with WriteOutContentHandler

Use of org.apache.tika.sax.WriteOutContentHandler in project jackrabbit-oak by apache.

From the class BinaryTextExtractor, method parseStringValue0.

private String parseStringValue0(Blob v, Metadata metadata, String path) {
    WriteOutContentHandler handler = new WriteOutContentHandler(definition.getMaxExtractLength());
    long start = System.currentTimeMillis();
    long bytesRead = 0;
    long length = v.length();
    if (log.isDebugEnabled()) {
        log.debug("Extracting {}, {} bytes, id {}", path, length, v.getContentIdentity());
    }
    String oldThreadName = null;
    if (length > SMALL_BINARY) {
        Thread t = Thread.currentThread();
        oldThreadName = t.getName();
        t.setName(oldThreadName + ": Extracting " + path + ", " + length + " bytes");
    }
    try {
        CountingInputStream stream = new CountingInputStream(new LazyInputStream(new BlobByteSource(v)));
        try {
            getParser().parse(stream, handler, metadata, new ParseContext());
        } finally {
            bytesRead = stream.getCount();
            stream.close();
        }
    } catch (LinkageError e) {
        // Capture and ignore errors caused by extraction libraries
        // not being present. This is equivalent to disabling
        // selected media types in configuration, so we can simply
        // ignore these errors.
    } catch (Throwable t) {
        // The special STOP exception is used for normal termination.
        if (!handler.isWriteLimitReached(t)) {
            log.debug("[{}] Failed to extract text from a binary property: {}." + " This is a fairly common case, and nothing to" + " worry about. The stack trace is included to" + " help improve the text extraction feature.", getIndexName(), path, t);
            extractedTextCache.put(v, ExtractedText.ERROR);
            return TEXT_EXTRACTION_ERROR;
        }
    } finally {
        if (oldThreadName != null) {
            Thread.currentThread().setName(oldThreadName);
        }
    }
    String result = handler.toString();
    if (bytesRead > 0) {
        long time = System.currentTimeMillis() - start;
        int len = result.length();
        recordTextExtractionStats(time, bytesRead, len);
        if (log.isDebugEnabled()) {
            log.debug("Extracting {} took {} ms, {} bytes read, {} text size", path, time, bytesRead, len);
        }
    }
    extractedTextCache.put(v, new ExtractedText(ExtractedText.ExtractionResult.SUCCESS, result));
    return result;
}
Also used: WriteOutContentHandler (org.apache.tika.sax.WriteOutContentHandler), LazyInputStream (org.apache.jackrabbit.oak.commons.io.LazyInputStream), CountingInputStream (com.google.common.io.CountingInputStream), ParseContext (org.apache.tika.parser.ParseContext), ExtractedText (org.apache.jackrabbit.oak.plugins.index.fulltext.ExtractedText)

Example 5 with WriteOutContentHandler

Use of org.apache.tika.sax.WriteOutContentHandler in project jackrabbit-oak by apache.

From the class TextExtractor, method parseStringValue.

//~--------------------------------------< Tika >
private String parseStringValue(ByteSource byteSource, Metadata metadata, String path) {
    WriteOutContentHandler handler = new WriteOutContentHandler(maxExtractedLength);
    long start = System.currentTimeMillis();
    long size = 0;
    try {
        CountingInputStream stream = new CountingInputStream(new LazyInputStream(byteSource));
        try {
            tika.getParser().parse(stream, handler, metadata, new ParseContext());
        } finally {
            size = stream.getCount();
            stream.close();
        }
    } catch (LinkageError e) {
        // Capture and ignore errors caused by extraction libraries
        // not being present. This is equivalent to disabling
        // selected media types in configuration, so we can simply
        // ignore these errors.
    } catch (Throwable t) {
        // The special STOP exception is used for normal termination.
        if (!handler.isWriteLimitReached(t)) {
            parserErrorCount.incrementAndGet();
            parserError.debug("Failed to extract text from a binary property: " + path + ". This is a fairly common case, and nothing to" + " worry about. The stack trace is included to" + " help improve the text extraction feature.", t);
            return ERROR_TEXT;
        }
    }
    String result = handler.toString();
    timeTaken.addAndGet(System.currentTimeMillis() - start);
    if (size > 0) {
        extractedTextSize.addAndGet(result.length());
        extractionCount.incrementAndGet();
        totalSizeRead.addAndGet(size);
        return result;
    }
    return null;
}
Also used: WriteOutContentHandler (org.apache.tika.sax.WriteOutContentHandler), LazyInputStream (org.apache.jackrabbit.oak.commons.io.LazyInputStream), CountingInputStream (com.google.common.io.CountingInputStream), ParseContext (org.apache.tika.parser.ParseContext)

Aggregations

ParseContext (org.apache.tika.parser.ParseContext): 10
WriteOutContentHandler (org.apache.tika.sax.WriteOutContentHandler): 10
Metadata (org.apache.tika.metadata.Metadata): 6
StringWriter (java.io.StringWriter): 5
TikaTest (org.apache.tika.TikaTest): 4
Test (org.junit.Test): 4
ByteArrayInputStream (java.io.ByteArrayInputStream): 3
CountingInputStream (com.google.common.io.CountingInputStream): 2
File (java.io.File): 2
FileInputStream (java.io.FileInputStream): 2
LazyInputStream (org.apache.jackrabbit.oak.commons.io.LazyInputStream): 2
TikaException (org.apache.tika.exception.TikaException): 2
RTFMetadata (org.apache.tika.metadata.RTFMetadata): 2
BodyContentHandler (org.apache.tika.sax.BodyContentHandler): 2
SAXException (org.xml.sax.SAXException): 2
ExtractedText (org.apache.jackrabbit.oak.plugins.index.fulltext.ExtractedText): 1
Parser (org.apache.tika.parser.Parser): 1
ApplicationContext (org.springframework.context.ApplicationContext): 1
ClassPathXmlApplicationContext (org.springframework.context.support.ClassPathXmlApplicationContext): 1