
Example 1 with RecursiveParserWrapperFSConsumer

Use of org.apache.tika.batch.fs.RecursiveParserWrapperFSConsumer in the Apache Tika project.

From the class RecursiveParserWrapperFSConsumerTest, method testEmbeddedWithNPE.

@Test
public void testEmbeddedWithNPE() throws Exception {
    final String path = "/test-documents/embedded_with_npe.xml";
    final Metadata metadata = new Metadata();
    metadata.add(Metadata.RESOURCE_NAME_KEY, "embedded_with_npe.xml");
    ArrayBlockingQueue<FileResource> queue = new ArrayBlockingQueue<FileResource>(2);
    queue.add(new FileResource() {

        @Override
        public String getResourceId() {
            return "testFile";
        }

        @Override
        public Metadata getMetadata() {
            return metadata;
        }

        @Override
        public InputStream openInputStream() throws IOException {
            return this.getClass().getResourceAsStream(path);
        }
    });
    // The poison resource is the sentinel that tells the consumer to stop.
    queue.add(new PoisonFileResource());
    MockOSFactory mockOSFactory = new MockOSFactory();
    RecursiveParserWrapperFSConsumer consumer = new RecursiveParserWrapperFSConsumer(
            queue,
            new AutoDetectParserFactory(),
            new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1),
            mockOSFactory,
            new TikaConfig());
    IFileProcessorFutureResult result = consumer.call();
    // Read back the JSON that the consumer wrote to the first recorded stream.
    mockOSFactory.getStreams().get(0).flush();
    byte[] bytes = mockOSFactory.getStreams().get(0).toByteArray();
    List<Metadata> results = JsonMetadataList.fromJson(new InputStreamReader(new ByteArrayInputStream(bytes), UTF_8));
    // One entry for the container document plus three embedded documents.
    assertEquals(4, results.size());
    assertContains("another null pointer", results.get(2).get(RecursiveParserWrapper.EMBEDDED_EXCEPTION));
    assertEquals("Nikolai Lobachevsky", results.get(0).get("author"));
    for (int i = 1; i < 4; i++) {
        assertEquals("embeddedAuthor" + i, results.get(i).get("author"));
        assertContains("some_embedded_content" + i, results.get(i).get(RecursiveParserWrapper.TIKA_CONTENT));
    }
}
Also used: RecursiveParserWrapperFSConsumer(org.apache.tika.batch.fs.RecursiveParserWrapperFSConsumer) BasicContentHandlerFactory(org.apache.tika.sax.BasicContentHandlerFactory) TikaConfig(org.apache.tika.config.TikaConfig) InputStreamReader(java.io.InputStreamReader) ByteArrayInputStream(java.io.ByteArrayInputStream) InputStream(java.io.InputStream) Metadata(org.apache.tika.metadata.Metadata) IOException(java.io.IOException) ArrayBlockingQueue(java.util.concurrent.ArrayBlockingQueue) Test(org.junit.Test) TikaTest(org.apache.tika.TikaTest)
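
The test above enqueues a PoisonFileResource after the real resource, which presumably signals the consumer to stop polling the queue. The general poison-pill pattern behind that can be sketched with java.util.concurrent alone. This is a minimal illustration, not Tika's implementation: the class PoisonPillSketch, the POISON sentinel, and the drain method are all hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical illustration of the poison-pill pattern; none of these names come from Tika.
final class PoisonPillSketch {

    // Sentinel object, compared by identity so that no real item can be mistaken for it.
    static final String POISON = new String("POISON");

    // Takes items from the queue until the sentinel is seen and returns what was consumed.
    static List<String> drain(BlockingQueue<String> queue) throws InterruptedException {
        List<String> consumed = new ArrayList<>();
        while (true) {
            String item = queue.take();
            if (item == POISON) {
                break; // the producer has signalled that no more work is coming
            }
            consumed.add(item);
        }
        return consumed;
    }

    public static void main(String[] args) throws InterruptedException {
        // Same shape as the test: one real item followed by the sentinel.
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(2);
        queue.add("testFile");
        queue.add(POISON);
        System.out.println(drain(queue)); // prints [testFile]
    }
}
```

The sentinel lets a blocking consumer terminate without timeouts or interrupts, which is why the queue in the test is sized for exactly two entries.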

Example 2 with RecursiveParserWrapperFSConsumer

Use of org.apache.tika.batch.fs.RecursiveParserWrapperFSConsumer in the Apache Tika project.

From the class RecursiveParserWrapperFSConsumerTest, method testEmbeddedThenNPE.

@Test
public void testEmbeddedThenNPE() throws Exception {
    final String path = "/test-documents/embedded_then_npe.xml";
    final Metadata metadata = new Metadata();
    metadata.add(Metadata.RESOURCE_NAME_KEY, "embedded_then_npe.xml");
    ArrayBlockingQueue<FileResource> queue = new ArrayBlockingQueue<FileResource>(2);
    queue.add(new FileResource() {

        @Override
        public String getResourceId() {
            return "testFile";
        }

        @Override
        public Metadata getMetadata() {
            return metadata;
        }

        @Override
        public InputStream openInputStream() throws IOException {
            return this.getClass().getResourceAsStream(path);
        }
    });
    // The poison resource is the sentinel that tells the consumer to stop.
    queue.add(new PoisonFileResource());
    MockOSFactory mockOSFactory = new MockOSFactory();
    RecursiveParserWrapperFSConsumer consumer = new RecursiveParserWrapperFSConsumer(
            queue,
            new AutoDetectParserFactory(),
            new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1),
            mockOSFactory,
            new TikaConfig());
    IFileProcessorFutureResult result = consumer.call();
    // Read back the JSON that the consumer wrote to the first recorded stream.
    mockOSFactory.getStreams().get(0).flush();
    byte[] bytes = mockOSFactory.getStreams().get(0).toByteArray();
    List<Metadata> results = JsonMetadataList.fromJson(new InputStreamReader(new ByteArrayInputStream(bytes), UTF_8));
    // One entry for the container document plus one embedded document.
    assertEquals(2, results.size());
    assertContains("another null pointer", results.get(0).get(TikaCoreProperties.TIKA_META_EXCEPTION_PREFIX + "runtime"));
    assertEquals("Nikolai Lobachevsky", results.get(0).get("author"));
    assertEquals("embeddedAuthor", results.get(1).get("author"));
    assertContains("some_embedded_content", results.get(1).get(RecursiveParserWrapper.TIKA_CONTENT));
}
Also used: RecursiveParserWrapperFSConsumer(org.apache.tika.batch.fs.RecursiveParserWrapperFSConsumer) BasicContentHandlerFactory(org.apache.tika.sax.BasicContentHandlerFactory) TikaConfig(org.apache.tika.config.TikaConfig) InputStreamReader(java.io.InputStreamReader) ByteArrayInputStream(java.io.ByteArrayInputStream) InputStream(java.io.InputStream) Metadata(org.apache.tika.metadata.Metadata) IOException(java.io.IOException) ArrayBlockingQueue(java.util.concurrent.ArrayBlockingQueue) Test(org.junit.Test) TikaTest(org.apache.tika.TikaTest)
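
Both tests read their results back through mockOSFactory.getStreams(), but the source of MockOSFactory is not shown here. One plausible shape for such a test double is a factory that hands out ByteArrayOutputStream instances and remembers each one. The sketch below assumes that design; StreamFactorySketch and RecordingStreamFactory are hypothetical names and do not reproduce Tika's OutputStreamFactory interface.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an output stream factory interface; not Tika's API.
interface StreamFactorySketch {
    OutputStream open(String resourceId) throws IOException;
}

// Test double that records every stream it hands out so a test can inspect the bytes.
final class RecordingStreamFactory implements StreamFactorySketch {

    private final List<ByteArrayOutputStream> streams = new ArrayList<>();

    @Override
    public OutputStream open(String resourceId) {
        ByteArrayOutputStream stream = new ByteArrayOutputStream();
        streams.add(stream);
        return stream;
    }

    // Streams in the order they were opened.
    List<ByteArrayOutputStream> getStreams() {
        return streams;
    }

    public static void main(String[] args) throws IOException {
        RecordingStreamFactory factory = new RecordingStreamFactory();
        // The code under test would write here; this stands in for it.
        try (OutputStream os = factory.open("testFile")) {
            os.write("[{\"author\":\"Nikolai Lobachevsky\"}]".getBytes(StandardCharsets.UTF_8));
        }
        // The test then reads the captured bytes back, as the examples above do.
        byte[] bytes = factory.getStreams().get(0).toByteArray();
        System.out.println(new String(bytes, StandardCharsets.UTF_8));
    }
}
```

Because ByteArrayOutputStream keeps its buffer after close(), the test can inspect the output once the consumer has finished, with no temporary files involved.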

Example 3 with RecursiveParserWrapperFSConsumer

Use of org.apache.tika.batch.fs.RecursiveParserWrapperFSConsumer in the Apache Tika project.

From the class BasicTikaFSConsumersBuilder, method build.

@Override
public ConsumersManager build(Node node, Map<String, String> runtimeAttributes, ArrayBlockingQueue<FileResource> queue) {
    //figure out if we're building a recursiveParserWrapper
    boolean recursiveParserWrapper = false;
    String recursiveParserWrapperString = runtimeAttributes.get("recursiveParserWrapper");
    if (recursiveParserWrapperString != null) {
        recursiveParserWrapper = PropsUtil.getBoolean(recursiveParserWrapperString, recursiveParserWrapper);
    } else {
        Node recursiveParserWrapperNode = node.getAttributes().getNamedItem("recursiveParserWrapper");
        if (recursiveParserWrapperNode != null) {
            recursiveParserWrapper = PropsUtil.getBoolean(recursiveParserWrapperNode.getNodeValue(), recursiveParserWrapper);
        }
    }
    //how long to let the consumersManager run on init() and shutdown()
    Long consumersManagerMaxMillis = null;
    String consumersManagerMaxMillisString = runtimeAttributes.get("consumersManagerMaxMillis");
    if (consumersManagerMaxMillisString != null) {
        consumersManagerMaxMillis = PropsUtil.getLong(consumersManagerMaxMillisString, null);
    } else {
        Node consumersManagerMaxMillisNode = node.getAttributes().getNamedItem("consumersManagerMaxMillis");
        // consumersManagerMaxMillis is always null in this branch, so only the node needs checking
        if (consumersManagerMaxMillisNode != null) {
            consumersManagerMaxMillis = PropsUtil.getLong(consumersManagerMaxMillisNode.getNodeValue(), null);
        }
    }
    TikaConfig config = null;
    String tikaConfigPath = runtimeAttributes.get("c");
    if (tikaConfigPath == null) {
        Node tikaConfigNode = node.getAttributes().getNamedItem("tikaConfig");
        if (tikaConfigNode != null) {
            tikaConfigPath = PropsUtil.getString(tikaConfigNode.getNodeValue(), null);
        }
    }
    if (tikaConfigPath != null) {
        try (InputStream is = Files.newInputStream(Paths.get(tikaConfigPath))) {
            config = new TikaConfig(is);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    } else {
        config = TikaConfig.getDefaultConfig();
    }
    List<FileResourceConsumer> consumers = new LinkedList<FileResourceConsumer>();
    int numConsumers = BatchProcessBuilder.getNumConsumers(runtimeAttributes);
    NodeList nodeList = node.getChildNodes();
    Node contentHandlerFactoryNode = null;
    Node parserFactoryNode = null;
    Node outputStreamFactoryNode = null;
    for (int i = 0; i < nodeList.getLength(); i++) {
        Node child = nodeList.item(i);
        String cn = child.getNodeName();
        if (cn.equals("parser")) {
            parserFactoryNode = child;
        } else if (cn.equals("contenthandler")) {
            contentHandlerFactoryNode = child;
        } else if (cn.equals("outputstream")) {
            outputStreamFactoryNode = child;
        }
    }
    if (contentHandlerFactoryNode == null || parserFactoryNode == null || outputStreamFactoryNode == null) {
        throw new RuntimeException("You must specify a ContentHandlerFactory, a ParserFactory and an OutputStreamFactory");
    }
    ContentHandlerFactory contentHandlerFactory = getContentHandlerFactory(contentHandlerFactoryNode, runtimeAttributes);
    ParserFactory parserFactory = getParserFactory(parserFactoryNode, runtimeAttributes);
    OutputStreamFactory outputStreamFactory = getOutputStreamFactory(outputStreamFactoryNode, runtimeAttributes, contentHandlerFactory, recursiveParserWrapper);
    if (recursiveParserWrapper) {
        for (int i = 0; i < numConsumers; i++) {
            FileResourceConsumer c = new RecursiveParserWrapperFSConsumer(queue, parserFactory, contentHandlerFactory, outputStreamFactory, config);
            consumers.add(c);
        }
    } else {
        for (int i = 0; i < numConsumers; i++) {
            FileResourceConsumer c = new BasicTikaFSConsumer(queue, parserFactory, contentHandlerFactory, outputStreamFactory, config);
            consumers.add(c);
        }
    }
    ConsumersManager manager = new FSConsumersManager(consumers);
    if (consumersManagerMaxMillis != null) {
        manager.setConsumersManagerMaxMillis(consumersManagerMaxMillis);
    }
    return manager;
}
Also used: ContentHandlerFactory(org.apache.tika.sax.ContentHandlerFactory) BasicContentHandlerFactory(org.apache.tika.sax.BasicContentHandlerFactory) FSConsumersManager(org.apache.tika.batch.fs.FSConsumersManager) RecursiveParserWrapperFSConsumer(org.apache.tika.batch.fs.RecursiveParserWrapperFSConsumer) TikaConfig(org.apache.tika.config.TikaConfig) InputStream(java.io.InputStream) Node(org.w3c.dom.Node) NodeList(org.w3c.dom.NodeList) FSOutputStreamFactory(org.apache.tika.batch.fs.FSOutputStreamFactory) OutputStreamFactory(org.apache.tika.batch.OutputStreamFactory) ParserFactory(org.apache.tika.batch.ParserFactory) LinkedList(java.util.LinkedList) ConsumersManager(org.apache.tika.batch.ConsumersManager) BasicTikaFSConsumer(org.apache.tika.batch.fs.BasicTikaFSConsumer) FileResourceConsumer(org.apache.tika.batch.FileResourceConsumer)
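
The build method above repeats one lookup pattern several times: a value in the runtimeAttributes map wins, otherwise the attribute on the XML node is used, otherwise a default applies. That precedence can be sketched in isolation using only JDK classes. The helper names resolve and parseRoot, the class AttributePrecedenceSketch, and the element name consumers are assumptions made for illustration and are not part of Tika.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;

// Hypothetical helper illustrating "runtime attribute overrides XML attribute"; not Tika code.
final class AttributePrecedenceSketch {

    // Returns the runtime value if present, else the XML attribute value, else the default.
    static String resolve(Map<String, String> runtimeAttributes, Node node,
                          String name, String defaultValue) {
        String runtimeValue = runtimeAttributes.get(name);
        if (runtimeValue != null) {
            return runtimeValue;
        }
        if (node.getAttributes() != null) {
            Node attribute = node.getAttributes().getNamedItem(name);
            if (attribute != null) {
                return attribute.getNodeValue();
            }
        }
        return defaultValue;
    }

    // Parses an XML string and returns its root element.
    static Node parseRoot(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
    }

    public static void main(String[] args) throws Exception {
        Node node = parseRoot("<consumers recursiveParserWrapper=\"false\"/>");
        Map<String, String> runtime = new HashMap<>();
        // No runtime override yet, so the XML attribute is used.
        System.out.println(resolve(runtime, node, "recursiveParserWrapper", "false")); // prints false
        // A runtime attribute takes precedence over the XML attribute.
        runtime.put("recursiveParserWrapper", "true");
        System.out.println(resolve(runtime, node, "recursiveParserWrapper", "false")); // prints true
    }
}
```

Note that the original method looks up the Tika config path under the runtime key "c" but under the node attribute "tikaConfig", so a helper like this would need separate key names to cover that case.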

Aggregations

InputStream (java.io.InputStream): 3
RecursiveParserWrapperFSConsumer (org.apache.tika.batch.fs.RecursiveParserWrapperFSConsumer): 3
TikaConfig (org.apache.tika.config.TikaConfig): 3
BasicContentHandlerFactory (org.apache.tika.sax.BasicContentHandlerFactory): 3
ByteArrayInputStream (java.io.ByteArrayInputStream): 2
IOException (java.io.IOException): 2
InputStreamReader (java.io.InputStreamReader): 2
ArrayBlockingQueue (java.util.concurrent.ArrayBlockingQueue): 2
TikaTest (org.apache.tika.TikaTest): 2
Metadata (org.apache.tika.metadata.Metadata): 2
Test (org.junit.Test): 2
LinkedList (java.util.LinkedList): 1
ConsumersManager (org.apache.tika.batch.ConsumersManager): 1
FileResourceConsumer (org.apache.tika.batch.FileResourceConsumer): 1
OutputStreamFactory (org.apache.tika.batch.OutputStreamFactory): 1
ParserFactory (org.apache.tika.batch.ParserFactory): 1
BasicTikaFSConsumer (org.apache.tika.batch.fs.BasicTikaFSConsumer): 1
FSConsumersManager (org.apache.tika.batch.fs.FSConsumersManager): 1
FSOutputStreamFactory (org.apache.tika.batch.fs.FSOutputStreamFactory): 1
ContentHandlerFactory (org.apache.tika.sax.ContentHandlerFactory): 1