use of org.apache.tika.sax.WriteOutContentHandler in project tika by apache.
the class SpringExample method main.
public static void main(String[] args) throws Exception {
    ApplicationContext context = new ClassPathXmlApplicationContext(
            new String[] { "org/apache/tika/example/spring.xml" });
    Parser parser = context.getBean("tika", Parser.class);
    // Parse an in-memory document and write the extracted text to stdout.
    parser.parse(
            new ByteArrayInputStream("Hello, World!".getBytes(UTF_8)),
            new WriteOutContentHandler(System.out),
            new Metadata(),
            new ParseContext());
}
use of org.apache.tika.sax.WriteOutContentHandler in project tika by apache.
the class Tika method parseToString.
/**
 * Parses the given document and returns the extracted text content.
 * The given input stream is closed by this method.
 * <p>
 * To avoid unpredictable excess memory use, the returned string contains
 * only up to {@link #getMaxStringLength()} first characters extracted
 * from the input document. Use the {@link #setMaxStringLength(int)}
 * method to adjust this limitation.
 * <p>
 * <strong>NOTE:</strong> Unlike most other Tika methods that take an
 * {@link InputStream}, this method will close the given stream for
 * you as a convenience. With other methods you are still responsible
 * for closing the stream or a wrapper instance returned by Tika.
 *
 * @param stream the document to be parsed
 * @param metadata document metadata
 * @return extracted text content
 * @throws IOException if the document can not be read
 * @throws TikaException if the document can not be parsed
 */
public String parseToString(InputStream stream, Metadata metadata) throws IOException, TikaException {
    WriteOutContentHandler handler = new WriteOutContentHandler(maxStringLength);
    try {
        ParseContext context = new ParseContext();
        context.set(Parser.class, parser);
        parser.parse(stream, new BodyContentHandler(handler), metadata, context);
    } catch (SAXException e) {
        if (!handler.isWriteLimitReached(e)) {
            // This should never happen with BodyContentHandler...
            throw new TikaException("Unexpected SAX processing failure", e);
        }
    } finally {
        stream.close();
    }
    return handler.toString();
}
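The write-limit pattern used by parseToString above can be sketched with only the JDK's SAX classes. This is an illustrative sketch, not Tika's implementation: LimitedTextHandler, LimitReachedException, isLimitReached and extract are hypothetical names, and the input is assumed to be well-formed XML rather than an arbitrary document. The handler collects text until the limit is hit, then throws a marker exception that the caller recognizes and swallows, keeping the text gathered so far.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch of a write-limited SAX handler; not Tika's actual class. */
final class LimitedTextHandler extends DefaultHandler {

    /** Marker exception so callers can tell "limit reached" from real failures. */
    static final class LimitReachedException extends SAXException {
        LimitReachedException() {
            super("write limit reached");
        }
    }

    private final StringBuilder text = new StringBuilder();
    private final int limit;

    LimitedTextHandler(int limit) {
        this.limit = limit;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        int remaining = limit - text.length();
        if (length <= remaining) {
            text.append(ch, start, length);
        } else {
            // Keep only what still fits, then abort the parse.
            text.append(ch, start, remaining);
            throw new LimitReachedException();
        }
    }

    /** Walks the cause chain looking for the marker exception. */
    static boolean isLimitReached(Throwable t) {
        for (; t != null; t = t.getCause()) {
            if (t instanceof LimitReachedException) {
                return true;
            }
        }
        return false;
    }

    @Override
    public String toString() {
        return text.toString();
    }

    /** Parses the XML string and returns at most {@code limit} characters of text. */
    static String extract(String xml, int limit) throws Exception {
        LimitedTextHandler handler = new LimitedTextHandler(limit);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException e) {
            if (!isLimitReached(e)) {
                throw e; // a genuine parse failure, not the limit
            }
        }
        return handler.toString();
    }
}
```

Calling LimitedTextHandler.extract("<p>Hello, World!</p>", 5) stops the parse early and returns only the first five characters, while a larger limit returns the full text.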
use of org.apache.tika.sax.WriteOutContentHandler in project tika by apache.
the class RTFParserTest method testBasicExtraction.
@Test
public void testBasicExtraction() throws Exception {
    File file = getResourceAsFile("/test-documents/testRTF.rtf");
    Metadata metadata = new Metadata();
    StringWriter writer = new StringWriter();
    // Close the file stream once parsing is done.
    try (FileInputStream stream = new FileInputStream(file)) {
        tika.getParser().parse(stream, new WriteOutContentHandler(writer), metadata, new ParseContext());
    }
    String content = writer.toString();
    assertEquals("application/rtf", metadata.get(Metadata.CONTENT_TYPE));
    assertEquals(1, metadata.getValues(Metadata.CONTENT_TYPE).length);
    assertContains("Test", content);
    assertContains("indexation Word", content);
}
use of org.apache.tika.sax.WriteOutContentHandler in project tika by apache.
the class TXTParserTest method testEnglishText.
@Test
public void testEnglishText() throws Exception {
    String text = "Hello, World! This is simple UTF-8 text content written"
            + " in English to test autodetection of both the character"
            + " encoding and the language of the input stream.";
    Metadata metadata = new Metadata();
    StringWriter writer = new StringWriter();
    parser.parse(
            new ByteArrayInputStream(text.getBytes(ISO_8859_1)),
            new WriteOutContentHandler(writer),
            metadata,
            new ParseContext());
    String content = writer.toString();
    assertEquals("text/plain; charset=ISO-8859-1", metadata.get(Metadata.CONTENT_TYPE));
    // TIKA-501: Remove language detection from TXTParser
    assertNull(metadata.get(Metadata.CONTENT_LANGUAGE));
    assertNull(metadata.get(TikaCoreProperties.LANGUAGE));
    assertContains("Hello", content);
    assertContains("World", content);
    assertContains("autodetection", content);
    assertContains("stream", content);
}
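The sample text in testEnglishText describes itself as UTF-8 but is encoded with ISO_8859_1, and the test expects ISO-8859-1 to be reported. A small JDK-only check shows why the two encodings cannot be told apart from this input: the text is pure ASCII, so both produce the same bytes. AsciiEncodingDemo and sameBytes are hypothetical names used only for this illustration; the sketch says nothing about how Tika's detector chooses between candidate charsets.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper: compares the UTF-8 and ISO-8859-1 encodings of a string. */
final class AsciiEncodingDemo {

    /** Returns true when both charsets encode the text to identical bytes. */
    static boolean sameBytes(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return Arrays.equals(utf8, latin1);
    }
}
```

For ASCII-only input such as "Hello, World!" the byte arrays match; as soon as the text contains a character outside ASCII (for example "caf\u00e9") they differ, because UTF-8 needs two bytes for it and ISO-8859-1 needs one.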
use of org.apache.tika.sax.WriteOutContentHandler in project tika by apache.
the class TXTParserTest method testCP866.
@Test
public void testCP866() throws Exception {
    Metadata metadata = new Metadata();
    StringWriter writer = new StringWriter();
    parser.parse(
            TXTParserTest.class.getResourceAsStream("/test-documents/russian.cp866.txt"),
            new WriteOutContentHandler(writer),
            metadata,
            new ParseContext());
    assertEquals("text/plain; charset=IBM866", metadata.get(Metadata.CONTENT_TYPE));
}