
Example 21 with BasicContentHandlerFactory

Use of org.apache.tika.sax.BasicContentHandlerFactory in the Apache Tika project.

From the class ParsingExample, method recursiveParserWrapperExample.

/**
 * For documents that may contain embedded documents, it can be helpful
 * to create a list of metadata objects, one for the container document and
 * one for each embedded document. This allows easy access to both the
 * extracted content and the metadata of each embedded document.
 * Note that many document formats can contain embedded documents,
 * including traditional container formats (zip, tar and others) but also
 * common office document formats, including MSWord, MSExcel,
 * MSPowerPoint, RTF, PDF, MSG and several others.
 * <p>
 * The "content" format is determined by the ContentHandlerFactory, and
 * the content is stored in {@link org.apache.tika.parser.RecursiveParserWrapper#TIKA_CONTENT}.
 * <p>
 * The drawback of the RecursiveParserWrapper is that it caches metadata and contents
 * in memory. It should not be used on files whose contents are too big to be handled
 * in memory.
 *
 * @return a list of metadata objects, one for the container file and one for each embedded file
 * @throws IOException
 * @throws SAXException
 * @throws TikaException
 */
public List<Metadata> recursiveParserWrapperExample() throws IOException, SAXException, TikaException {
    Parser p = new AutoDetectParser();
    ContentHandlerFactory factory = new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.HTML, -1);
    RecursiveParserWrapper wrapper = new RecursiveParserWrapper(p, factory);
    Metadata metadata = new Metadata();
    metadata.set(Metadata.RESOURCE_NAME_KEY, "test_recursive_embedded.docx");
    ParseContext context = new ParseContext();
    try (InputStream stream = ParsingExample.class.getResourceAsStream("test_recursive_embedded.docx")) {
        wrapper.parse(stream, new DefaultHandler(), metadata, context);
    }
    return wrapper.getMetadata();
}
Also used : BasicContentHandlerFactory(org.apache.tika.sax.BasicContentHandlerFactory) ContentHandlerFactory(org.apache.tika.sax.ContentHandlerFactory) TikaInputStream(org.apache.tika.io.TikaInputStream) InputStream(java.io.InputStream) Metadata(org.apache.tika.metadata.Metadata) ParseContext(org.apache.tika.parser.ParseContext) AutoDetectParser(org.apache.tika.parser.AutoDetectParser) RecursiveParserWrapper(org.apache.tika.parser.RecursiveParserWrapper) Parser(org.apache.tika.parser.Parser) EmptyParser(org.apache.tika.parser.EmptyParser) DefaultHandler(org.xml.sax.helpers.DefaultHandler)
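
The Javadoc in Example 21 describes the result as a list with one metadata entry for the container and one per embedded document, with the extracted content stored under a dedicated key. The following is a minimal sketch of how a caller might walk such a list. It deliberately uses plain `Map<String, String>` objects as stand-ins for Tika's `Metadata` so that it runs without Tika on the classpath. The class name `MetadataListSketch`, the helper methods, and the sample file name `embedded1.zip` are hypothetical, and the two key strings are assumptions about the values behind `RecursiveParserWrapper.TIKA_CONTENT` and `Metadata.RESOURCE_NAME_KEY`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MetadataListSketch {
    // Assumption: mirrors the string behind RecursiveParserWrapper.TIKA_CONTENT.
    public static final String CONTENT_KEY = "X-TIKA:content";
    // Assumption: mirrors the string behind Metadata.RESOURCE_NAME_KEY.
    public static final String NAME_KEY = "resourceName";

    /** Builds one stand-in metadata entry; a real Metadata object carries many more keys. */
    public static Map<String, String> entry(String name, String content) {
        Map<String, String> m = new LinkedHashMap<>();
        if (name != null) {
            m.put(NAME_KEY, name);
        }
        m.put(CONTENT_KEY, content);
        return m;
    }

    /** Treats index 0 as the container and every later entry as an embedded document. */
    public static List<String> summarize(List<Map<String, String>> metadataList) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < metadataList.size(); i++) {
            Map<String, String> m = metadataList.get(i);
            String role = (i == 0) ? "container" : "embedded";
            String name = m.getOrDefault(NAME_KEY, "(unnamed)");
            String content = m.getOrDefault(CONTENT_KEY, "");
            lines.add(role + " " + name + ": " + content.length() + " chars");
        }
        return lines;
    }

    public static void main(String[] args) {
        List<Map<String, String>> list = new ArrayList<>();
        list.add(entry("test_recursive_embedded.docx", "<html>outer</html>"));
        list.add(entry("embedded1.zip", "<html>inner</html>")); // hypothetical embedded file
        summarize(list).forEach(System.out::println);
    }
}
```

With the real API, the same loop would iterate over the `List<Metadata>` returned by `wrapper.getMetadata()` and read each value with `Metadata.get(...)`. Because every entry holds its full extracted content, the memory caveat from the Javadoc applies to the whole list.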

Example 22 with BasicContentHandlerFactory

Use of org.apache.tika.sax.BasicContentHandlerFactory in the Apache Tika project.

From the class PDFParserTest, method testEmbeddedFilesInChildren.

// TIKA-1228, TIKA-1268
@Test
public void testEmbeddedFilesInChildren() throws Exception {
    String xml = getXML("/testPDF_childAttachments.pdf").xml;
    //"regressiveness" exists only in Unit10.doc not in the container pdf document
    assertTrue(xml.contains("regressiveness"));
    RecursiveParserWrapper p = new RecursiveParserWrapper(new AutoDetectParser(), new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.IGNORE, -1));
    ParseContext context = new ParseContext();
    PDFParserConfig config = new PDFParserConfig();
    config.setExtractInlineImages(true);
    config.setExtractUniqueInlineImagesOnly(false);
    context.set(org.apache.tika.parser.pdf.PDFParserConfig.class, config);
    context.set(org.apache.tika.parser.Parser.class, p);
    try (TikaInputStream tis = TikaInputStream.get(getResourceAsStream("/test-documents/testPDF_childAttachments.pdf"))) {
        p.parse(tis, new BodyContentHandler(-1), new Metadata(), context);
    }
    List<Metadata> metadatas = p.getMetadata();
    assertEquals(5, metadatas.size());
    assertNull(metadatas.get(0).get(Metadata.RESOURCE_NAME_KEY));
    assertEquals("image0.jpg", metadatas.get(1).get(Metadata.RESOURCE_NAME_KEY));
    assertEquals("Press Quality(1).joboptions", metadatas.get(3).get(Metadata.RESOURCE_NAME_KEY));
    assertEquals("Unit10.doc", metadatas.get(4).get(Metadata.RESOURCE_NAME_KEY));
    assertEquals(MediaType.image("jpeg").toString(), metadatas.get(1).get(Metadata.CONTENT_TYPE));
    assertEquals(MediaType.image("tiff").toString(), metadatas.get(2).get(Metadata.CONTENT_TYPE));
    assertEquals("text/plain; charset=ISO-8859-1", metadatas.get(3).get(Metadata.CONTENT_TYPE));
    assertEquals(TYPE_DOC.toString(), metadatas.get(4).get(Metadata.CONTENT_TYPE));
}
Also used : BodyContentHandler(org.apache.tika.sax.BodyContentHandler) BasicContentHandlerFactory(org.apache.tika.sax.BasicContentHandlerFactory) Metadata(org.apache.tika.metadata.Metadata) TikaInputStream(org.apache.tika.io.TikaInputStream) RecursiveParserWrapper(org.apache.tika.parser.RecursiveParserWrapper) ParseContext(org.apache.tika.parser.ParseContext) AutoDetectParser(org.apache.tika.parser.AutoDetectParser) Test(org.junit.Test) TikaTest(org.apache.tika.TikaTest)

Aggregations

BasicContentHandlerFactory (org.apache.tika.sax.BasicContentHandlerFactory) 22
Metadata (org.apache.tika.metadata.Metadata) 21
Test (org.junit.Test) 16
InputStream (java.io.InputStream) 10
TikaInputStream (org.apache.tika.io.TikaInputStream) 9
RecursiveParserWrapper (org.apache.tika.parser.RecursiveParserWrapper) 9
ParseContext (org.apache.tika.parser.ParseContext) 8
AutoDetectParser (org.apache.tika.parser.AutoDetectParser) 7
Parser (org.apache.tika.parser.Parser) 7
DefaultHandler (org.xml.sax.helpers.DefaultHandler) 7
TikaTest (org.apache.tika.TikaTest) 6
BodyContentHandler (org.apache.tika.sax.BodyContentHandler) 4
ByteArrayInputStream (java.io.ByteArrayInputStream) 3
IOException (java.io.IOException) 3
InputStreamReader (java.io.InputStreamReader) 2
ArrayBlockingQueue (java.util.concurrent.ArrayBlockingQueue) 2
RecursiveParserWrapperFSConsumer (org.apache.tika.batch.fs.RecursiveParserWrapperFSConsumer) 2
TikaConfig (org.apache.tika.config.TikaConfig) 2
EmptyParser (org.apache.tika.parser.EmptyParser) 2
ContentHandler (org.xml.sax.ContentHandler) 2