Example 6 with EmptyParser

Use of org.apache.tika.parser.EmptyParser in the Apache Tika project.

Class ParsingExample, method parseNoEmbeddedExample.

/**
 * If you don't want content from embedded documents, send in
 * a {@link org.apache.tika.parser.ParseContext} that contains an
 * {@link EmptyParser}.
 *
 * @return The content of a file.
 */
public String parseNoEmbeddedExample() throws IOException, SAXException, TikaException {
    AutoDetectParser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    ParseContext parseContext = new ParseContext();
    // Register EmptyParser as the parser for embedded documents so their content is skipped
    parseContext.set(Parser.class, new EmptyParser());
    try (InputStream stream = ParsingExample.class.getResourceAsStream("test_recursive_embedded.docx")) {
        parser.parse(stream, handler, metadata, parseContext);
        return handler.toString();
    }
}
Also used : BodyContentHandler(org.apache.tika.sax.BodyContentHandler) TikaInputStream(org.apache.tika.io.TikaInputStream) InputStream(java.io.InputStream) Metadata(org.apache.tika.metadata.Metadata) ParseContext(org.apache.tika.parser.ParseContext) EmptyParser(org.apache.tika.parser.EmptyParser) AutoDetectParser(org.apache.tika.parser.AutoDetectParser)
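
The idea behind the example above can be illustrated with a small toy model. This is only a sketch and not Tika's implementation: it assumes that a container parser looks up the parser for embedded documents in the context, and every name in it (Doc, ToyParser, ToyContext, ToyEmptyParser, ToyContainerParser, extract) is hypothetical and introduced purely for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: these types are hypothetical stand-ins, not Tika classes.
public class EmbeddedParsingSketch {

    /** A document with its own text and zero or more embedded documents. */
    static final class Doc {
        final String text;
        final List<Doc> embedded;

        Doc(String text, List<Doc> embedded) {
            this.text = text;
            this.embedded = embedded;
        }
    }

    /** Minimal parser contract: append extracted text to the output. */
    interface ToyParser {
        void parse(Doc doc, StringBuilder out, ToyContext context);
    }

    /** Type-keyed bag of settings, loosely modelled on the idea of a parse context. */
    static final class ToyContext {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key, T fallback) {
            Object value = entries.get(key);
            return value == null ? fallback : key.cast(value);
        }
    }

    /** Ignores its input entirely, so embedded documents contribute nothing. */
    static final class ToyEmptyParser implements ToyParser {
        @Override
        public void parse(Doc doc, StringBuilder out, ToyContext context) {
            // Intentionally empty.
        }
    }

    /** Extracts the container text, then hands each embedded document to the context's parser. */
    static final class ToyContainerParser implements ToyParser {
        @Override
        public void parse(Doc doc, StringBuilder out, ToyContext context) {
            out.append(doc.text).append('\n');
            ToyParser embeddedParser = context.get(ToyParser.class, new ToyEmptyParser());
            for (Doc child : doc.embedded) {
                embeddedParser.parse(child, out, context);
            }
        }
    }

    /** Parses a document with the given parser registered for embedded documents. */
    static String extract(Doc doc, ToyParser embeddedParser) {
        ToyContext context = new ToyContext();
        context.set(ToyParser.class, embeddedParser);
        StringBuilder out = new StringBuilder();
        new ToyContainerParser().parse(doc, out, context);
        return out.toString();
    }

    public static void main(String[] args) {
        Doc inner = new Doc("inner", Collections.<Doc>emptyList());
        Doc outer = new Doc("outer", Arrays.asList(inner));
        // Embedded text is skipped when the empty parser is registered.
        System.out.print(extract(outer, new ToyEmptyParser()));
        // Embedded text is included when a real parser is registered.
        System.out.print(extract(outer, new ToyContainerParser()));
    }
}
```

In this model, swapping the parser stored in the context is the only change needed to switch between skipping embedded content and extracting it recursively, which mirrors the role the EmptyParser entry plays in the ParseContext above.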

Example 7 with EmptyParser

Use of org.apache.tika.parser.EmptyParser in the Apache Tika project.

Class ForkParserIntegrationTest, method testForkedPDFParsing.

/**
 * TIKA-808 - Ensure that parsing of our test PDFs works under
 * the Fork Parser, so that complex parsing behaves correctly.
 */
@Test
public void testForkedPDFParsing() throws Exception {
    ForkParser parser = new ForkParser(ForkParserIntegrationTest.class.getClassLoader(), tika.getParser());
    try {
        ContentHandler output = new BodyContentHandler();
        ParseContext context = new ParseContext();
        // Register EmptyParser for embedded documents so their content is skipped
        context.set(Parser.class, new EmptyParser());
        // try-with-resources closes the test document stream once parsing is done
        try (InputStream stream = ForkParserIntegrationTest.class.getResourceAsStream("/test-documents/testPDF.pdf")) {
            parser.parse(stream, output, new Metadata(), context);
        }
        String content = output.toString();
        assertContains("Apache Tika", content);
        assertContains("Tika - Content Analysis Toolkit", content);
        assertContains("incubator", content);
        assertContains("Apache Software Foundation", content);
    } finally {
        // Always shut down the forked parser process
        parser.close();
    }
}
Also used : ForkParser(org.apache.tika.fork.ForkParser) BodyContentHandler(org.apache.tika.sax.BodyContentHandler) InputStream(java.io.InputStream) ParseContext(org.apache.tika.parser.ParseContext) EmptyParser(org.apache.tika.parser.EmptyParser) Metadata(org.apache.tika.metadata.Metadata) ContentHandler(org.xml.sax.ContentHandler) Test(org.junit.Test)

Aggregations

EmptyParser (org.apache.tika.parser.EmptyParser): 7
ParseContext (org.apache.tika.parser.ParseContext): 6
InputStream (java.io.InputStream): 4
Metadata (org.apache.tika.metadata.Metadata): 4
TikaInputStream (org.apache.tika.io.TikaInputStream): 3
AutoDetectParser (org.apache.tika.parser.AutoDetectParser): 3
BodyContentHandler (org.apache.tika.sax.BodyContentHandler): 3
Test (org.junit.Test): 3
TikaTest (org.apache.tika.TikaTest): 2
Parser (org.apache.tika.parser.Parser): 2
ContentHandler (org.xml.sax.ContentHandler): 2
ByteArrayInputStream (java.io.ByteArrayInputStream): 1
File (java.io.File): 1
Date (java.util.Date): 1
EncryptedDocumentException (org.apache.tika.exception.EncryptedDocumentException): 1
ForkParser (org.apache.tika.fork.ForkParser): 1
MediaType (org.apache.tika.mime.MediaType): 1
CompositeParser (org.apache.tika.parser.CompositeParser): 1
OfficeParserConfig (org.apache.tika.parser.microsoft.OfficeParserConfig): 1
ToXMLContentHandler (org.apache.tika.sax.ToXMLContentHandler): 1