
Example 6 with WriteOutContentHandler

use of org.apache.tika.sax.WriteOutContentHandler in project tika by apache.

the class SpringExample method main.

public static void main(String[] args) throws Exception {
    ApplicationContext context = new ClassPathXmlApplicationContext(new String[] { "org/apache/tika/example/spring.xml" });
    Parser parser = context.getBean("tika", Parser.class);
    parser.parse(new ByteArrayInputStream("Hello, World!".getBytes(UTF_8)), new WriteOutContentHandler(System.out), new Metadata(), new ParseContext());
}
Also used : ClassPathXmlApplicationContext(org.springframework.context.support.ClassPathXmlApplicationContext) ApplicationContext(org.springframework.context.ApplicationContext) WriteOutContentHandler(org.apache.tika.sax.WriteOutContentHandler) ByteArrayInputStream(java.io.ByteArrayInputStream) Metadata(org.apache.tika.metadata.Metadata) ParseContext(org.apache.tika.parser.ParseContext) Parser(org.apache.tika.parser.Parser)

Example 7 with WriteOutContentHandler

use of org.apache.tika.sax.WriteOutContentHandler in project tika by apache.

the class Tika method parseToString.

/**
 * Parses the given document and returns the extracted text content.
 * The given input stream is closed by this method.
 * <p>
 * To avoid unpredictable excess memory use, the returned string contains
 * only up to {@link #getMaxStringLength()} first characters extracted
 * from the input document. Use the {@link #setMaxStringLength(int)}
 * method to adjust this limitation.
 * <p>
 * <strong>NOTE:</strong> Unlike most other Tika methods that take an
 * {@link InputStream}, this method will close the given stream for
 * you as a convenience. With other methods you are still responsible
 * for closing the stream or a wrapper instance returned by Tika.
 *
 * @param stream the document to be parsed
 * @param metadata document metadata
 * @return extracted text content
 * @throws IOException if the document can not be read
 * @throws TikaException if the document can not be parsed
 */
public String parseToString(InputStream stream, Metadata metadata) throws IOException, TikaException {
    WriteOutContentHandler handler = new WriteOutContentHandler(maxStringLength);
    try {
        ParseContext context = new ParseContext();
        context.set(Parser.class, parser);
        parser.parse(stream, new BodyContentHandler(handler), metadata, context);
    } catch (SAXException e) {
        if (!handler.isWriteLimitReached(e)) {
            // This should never happen with BodyContentHandler...
            throw new TikaException("Unexpected SAX processing failure", e);
        }
    } finally {
        stream.close();
    }
    return handler.toString();
}
Also used : BodyContentHandler(org.apache.tika.sax.BodyContentHandler) TikaException(org.apache.tika.exception.TikaException) WriteOutContentHandler(org.apache.tika.sax.WriteOutContentHandler) ParseContext(org.apache.tika.parser.ParseContext) SAXException(org.xml.sax.SAXException)
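
The write-limit pattern in parseToString (collect text up to a maximum, abort the parse with a SAXException, then treat that particular exception as a normal outcome) can be illustrated with the JDK's own SAX classes. The following is only a minimal sketch under stated assumptions: LimitedTextSketch, LimitedTextHandler, LimitReachedException, isLimitReached and extract are hypothetical names invented for this illustration, and the code is not Tika's WriteOutContentHandler implementation.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical JDK-only stand-in for the write-limit idea; not Tika code.
public class LimitedTextSketch {

    /** Marker exception so callers can tell "limit reached" from a real failure. */
    static final class LimitReachedException extends SAXException {
        LimitReachedException(int limit) {
            super("Write limit of " + limit + " characters reached");
        }
    }

    /** Collects character events until the limit is hit, then aborts the parse. */
    static final class LimitedTextHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private final int limit;

        LimitedTextHandler(int limit) {
            this.limit = limit;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            int room = limit - text.length();
            if (length <= room) {
                text.append(ch, start, length);
            } else {
                // Keep only what still fits, then signal the parser to stop.
                text.append(ch, start, room);
                throw new LimitReachedException(limit);
            }
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    /** Walks the cause chain in case the marker exception was wrapped. */
    static boolean isLimitReached(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof LimitReachedException) {
                return true;
            }
        }
        return false;
    }

    /** Returns at most maxLength characters of the text found in the XML. */
    public static String extract(String xml, int maxLength) throws Exception {
        LimitedTextHandler handler = new LimitedTextHandler(maxLength);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException e) {
            if (!isLimitReached(e)) {
                throw e; // a genuine parse failure, not the limit
            }
            // Limit reached: the truncated text collected so far is returned.
        }
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        // prints "Hello"
        System.out.println(extract("<doc><p>Hello, World!</p></doc>", 5));
    }
}
```

As in the method above, the handler keeps whatever text was collected before the limit was hit, so the caller still gets a usable (truncated) string instead of an error.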

Example 8 with WriteOutContentHandler

use of org.apache.tika.sax.WriteOutContentHandler in project tika by apache.

the class RTFParserTest method testBasicExtraction.

@Test
public void testBasicExtraction() throws Exception {
    File file = getResourceAsFile("/test-documents/testRTF.rtf");
    Metadata metadata = new Metadata();
    StringWriter writer = new StringWriter();
    tika.getParser().parse(new FileInputStream(file), new WriteOutContentHandler(writer), metadata, new ParseContext());
    String content = writer.toString();
    assertEquals("application/rtf", metadata.get(Metadata.CONTENT_TYPE));
    assertEquals(1, metadata.getValues(Metadata.CONTENT_TYPE).length);
    assertContains("Test", content);
    assertContains("indexation Word", content);
}
Also used : StringWriter(java.io.StringWriter) WriteOutContentHandler(org.apache.tika.sax.WriteOutContentHandler) Metadata(org.apache.tika.metadata.Metadata) RTFMetadata(org.apache.tika.metadata.RTFMetadata) ParseContext(org.apache.tika.parser.ParseContext) File(java.io.File) FileInputStream(java.io.FileInputStream) Test(org.junit.Test) TikaTest(org.apache.tika.TikaTest)

Example 9 with WriteOutContentHandler

use of org.apache.tika.sax.WriteOutContentHandler in project tika by apache.

the class TXTParserTest method testEnglishText.

@Test
public void testEnglishText() throws Exception {
    String text = "Hello, World! This is simple UTF-8 text content written"
            + " in English to test autodetection of both the character"
            + " encoding and the language of the input stream.";
    Metadata metadata = new Metadata();
    StringWriter writer = new StringWriter();
    parser.parse(new ByteArrayInputStream(text.getBytes(ISO_8859_1)), new WriteOutContentHandler(writer), metadata, new ParseContext());
    String content = writer.toString();
    assertEquals("text/plain; charset=ISO-8859-1", metadata.get(Metadata.CONTENT_TYPE));
    // TIKA-501: Remove language detection from TXTParser
    assertNull(metadata.get(Metadata.CONTENT_LANGUAGE));
    assertNull(metadata.get(TikaCoreProperties.LANGUAGE));
    assertContains("Hello", content);
    assertContains("World", content);
    assertContains("autodetection", content);
    assertContains("stream", content);
}
Also used : StringWriter(java.io.StringWriter) WriteOutContentHandler(org.apache.tika.sax.WriteOutContentHandler) ByteArrayInputStream(java.io.ByteArrayInputStream) Metadata(org.apache.tika.metadata.Metadata) ParseContext(org.apache.tika.parser.ParseContext) Test(org.junit.Test) TikaTest(org.apache.tika.TikaTest)

Example 10 with WriteOutContentHandler

use of org.apache.tika.sax.WriteOutContentHandler in project tika by apache.

the class TXTParserTest method testCP866.

@Test
public void testCP866() throws Exception {
    Metadata metadata = new Metadata();
    StringWriter writer = new StringWriter();
    parser.parse(TXTParserTest.class.getResourceAsStream("/test-documents/russian.cp866.txt"), new WriteOutContentHandler(writer), metadata, new ParseContext());
    assertEquals("text/plain; charset=IBM866", metadata.get(Metadata.CONTENT_TYPE));
}
Also used : StringWriter(java.io.StringWriter) WriteOutContentHandler(org.apache.tika.sax.WriteOutContentHandler) Metadata(org.apache.tika.metadata.Metadata) ParseContext(org.apache.tika.parser.ParseContext) Test(org.junit.Test) TikaTest(org.apache.tika.TikaTest)

Aggregations

ParseContext (org.apache.tika.parser.ParseContext): 10
WriteOutContentHandler (org.apache.tika.sax.WriteOutContentHandler): 10
Metadata (org.apache.tika.metadata.Metadata): 6
StringWriter (java.io.StringWriter): 5
TikaTest (org.apache.tika.TikaTest): 4
Test (org.junit.Test): 4
ByteArrayInputStream (java.io.ByteArrayInputStream): 3
CountingInputStream (com.google.common.io.CountingInputStream): 2
File (java.io.File): 2
FileInputStream (java.io.FileInputStream): 2
LazyInputStream (org.apache.jackrabbit.oak.commons.io.LazyInputStream): 2
TikaException (org.apache.tika.exception.TikaException): 2
RTFMetadata (org.apache.tika.metadata.RTFMetadata): 2
BodyContentHandler (org.apache.tika.sax.BodyContentHandler): 2
SAXException (org.xml.sax.SAXException): 2
ExtractedText (org.apache.jackrabbit.oak.plugins.index.fulltext.ExtractedText): 1
Parser (org.apache.tika.parser.Parser): 1
ApplicationContext (org.springframework.context.ApplicationContext): 1
ClassPathXmlApplicationContext (org.springframework.context.support.ClassPathXmlApplicationContext): 1