Example 1 with BatchProcess

use of org.apache.tika.batch.BatchProcess in project tika by apache.

the class BatchProcessBuilder method build.

/**
 * Builds a BatchProcess from runtime arguments and the
 * document node of a configuration file.  With the exception of the QueueBuilder,
 * the builders choose how to adjudicate between
 * runtime arguments and the elements in the configuration file.
 *
 * @param docElement   document element of the xml config file
 * @param incomingRuntimeAttributes runtime arguments
 * @return BatchProcess
 */
public BatchProcess build(Node docElement, Map<String, String> incomingRuntimeAttributes) {
    //key components
    long timeoutThresholdMillis = XMLDOMUtil.getLong("timeoutThresholdMillis", incomingRuntimeAttributes, docElement);
    long timeoutCheckPulseMillis = XMLDOMUtil.getLong("timeoutCheckPulseMillis", incomingRuntimeAttributes, docElement);
    long pauseOnEarlyTerminationMillis = XMLDOMUtil.getLong("pauseOnEarlyTerminationMillis", incomingRuntimeAttributes, docElement);
    int maxAliveTimeSeconds = XMLDOMUtil.getInt("maxAliveTimeSeconds", incomingRuntimeAttributes, docElement);
    FileResourceCrawler crawler = null;
    ConsumersManager consumersManager = null;
    StatusReporter reporter = null;
    Interrupter interrupter = null;
    /*
     * TODO: This is a bit smelly.  numConsumers needs to be used by the crawler
     * and the consumers.  This copies the incomingRuntimeAttributes and then
     * supplies the numConsumers from the command line (if it exists) or from the config file.
     * At least this creates an unmodifiable defensive copy of incomingRuntimeAttributes...
     */
    Map<String, String> runtimeAttributes = setNumConsumersInRuntimeAttributes(docElement, incomingRuntimeAttributes);
    //build queue
    ArrayBlockingQueue<FileResource> queue = buildQueue(docElement, runtimeAttributes);
    NodeList children = docElement.getChildNodes();
    Map<String, Node> keyNodes = new HashMap<String, Node>();
    for (int i = 0; i < children.getLength(); i++) {
        Node child = children.item(i);
        if (child.getNodeType() != Node.ELEMENT_NODE) {
            continue;
        }
        String nodeName = child.getNodeName();
        keyNodes.put(nodeName, child);
    }
    //build consumers
    consumersManager = buildConsumersManager(keyNodes.get("consumers"), runtimeAttributes, queue);
    //build crawler
    crawler = buildCrawler(queue, keyNodes.get("crawler"), runtimeAttributes);
    reporter = buildReporter(crawler, consumersManager, keyNodes.get("reporter"), runtimeAttributes);
    interrupter = buildInterrupter(keyNodes.get("interrupter"), runtimeAttributes);
    BatchProcess proc = new BatchProcess(crawler, consumersManager, reporter, interrupter);
    if (timeoutThresholdMillis > -1) {
        proc.setTimeoutThresholdMillis(timeoutThresholdMillis);
    }
    if (pauseOnEarlyTerminationMillis > -1) {
        proc.setPauseOnEarlyTerminationMillis(pauseOnEarlyTerminationMillis);
    }
    if (timeoutCheckPulseMillis > -1) {
        proc.setTimeoutCheckPulseMillis(timeoutCheckPulseMillis);
    }
    proc.setMaxAliveTimeSeconds(maxAliveTimeSeconds);
    return proc;
}
Also used : Interrupter(org.apache.tika.batch.Interrupter) FileResourceCrawler(org.apache.tika.batch.FileResourceCrawler) HashMap(java.util.HashMap) NodeList(org.w3c.dom.NodeList) Node(org.w3c.dom.Node) BatchProcess(org.apache.tika.batch.BatchProcess) FileResource(org.apache.tika.batch.FileResource) ConsumersManager(org.apache.tika.batch.ConsumersManager) StatusReporter(org.apache.tika.batch.StatusReporter)
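
The Javadoc in Example 1 says the builders adjudicate between runtime arguments and elements of the configuration file, and the method indexes the element children of the document node by name. The following is a minimal sketch of one such policy, not Tika's implementation. It assumes that a runtime argument wins over an attribute on the config element and that -1 means "not set" (the `> -1` checks in build suggest this, but it is an assumption). The class ConfigPrecedenceSketch, its methods resolveLong, elementChildrenByName and parseRoot, and the `<config>` document are all hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class ConfigPrecedenceSketch {

    // Hypothetical helper: the runtime argument wins; otherwise fall back to the
    // attribute on the config element; -1 signals "not set" (an assumption).
    static long resolveLong(String name, Map<String, String> runtimeArgs, Node docElement) {
        String value = runtimeArgs.get(name);
        if (value == null && docElement.getAttributes() != null) {
            Node attr = docElement.getAttributes().getNamedItem(name);
            if (attr != null) {
                value = attr.getNodeValue();
            }
        }
        return value == null ? -1L : Long.parseLong(value.trim());
    }

    // Index the element children by node name, skipping text and comment nodes.
    static Map<String, Node> elementChildrenByName(Node parent) {
        Map<String, Node> byName = new HashMap<>();
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                byName.put(child.getNodeName(), child);
            }
        }
        return byName;
    }

    // Parse an XML string and return its document element.
    static Node parseRoot(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
    }

    public static void main(String[] args) throws Exception {
        // Made-up config: one attribute on the root, two element children.
        Node root = parseRoot("<config timeoutThresholdMillis=\"30000\">"
                + "<crawler/><consumers/></config>");
        Map<String, String> runtimeArgs = new HashMap<>();
        runtimeArgs.put("timeoutThresholdMillis", "5000");

        System.out.println(resolveLong("timeoutThresholdMillis", runtimeArgs, root));     // 5000
        System.out.println(resolveLong("timeoutThresholdMillis", new HashMap<>(), root)); // 30000
        System.out.println(resolveLong("timeoutCheckPulseMillis", runtimeArgs, root));    // -1
        System.out.println(elementChildrenByName(root).containsKey("crawler"));           // true
    }
}
```

Returning a sentinel instead of throwing lets the caller keep a default when neither source supplies a value, which matches how build only calls the setters for values above -1.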

Example 2 with BatchProcess

use of org.apache.tika.batch.BatchProcess in project tika by apache.

the class OutputStreamFactoryTest method testSkip.

@Test
public void testSkip() throws Exception {
    Path outputDir = getNewOutputDir("os-factory-skip-");
    Map<String, String> args = getDefaultArgs("basic", outputDir);
    args.put("handleExisting", "skip");
    BatchProcess runner = getNewBatchRunner("/tika-batch-config-test.xml", args);
    ParallelFileProcessingResult result = run(runner);
    assertEquals(1, countChildren(outputDir));
    runner = getNewBatchRunner("/tika-batch-config-test.xml", args);
    result = run(runner);
    assertEquals(1, countChildren(outputDir));
}
Also used : Path(java.nio.file.Path) ParallelFileProcessingResult(org.apache.tika.batch.ParallelFileProcessingResult) BatchProcess(org.apache.tika.batch.BatchProcess) Test(org.junit.Test)
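
Example 2 sets handleExisting to "skip" and checks that a second run still leaves exactly one file in the output directory. As a rough illustration of what a skip policy can look like at the file level, here is a self-contained sketch using only java.nio.file. It is not the tika-batch implementation; SkipExistingSketch and writeUnlessExists are hypothetical names, and the countChildren helper here only mirrors the name of the test helper above.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class SkipExistingSketch {

    // Hypothetical helper: write only when the target does not exist yet.
    // Returns true if the file was written, false if it was skipped.
    static boolean writeUnlessExists(Path target, String content) throws IOException {
        if (Files.exists(target)) {
            return false; // "skip": leave the existing output untouched
        }
        try (OutputStream os = Files.newOutputStream(target)) {
            os.write(content.getBytes(StandardCharsets.UTF_8));
        }
        return true;
    }

    // Count the direct children of a directory; the stream must be closed.
    static long countChildren(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.count();
        }
    }

    public static void main(String[] args) throws IOException {
        Path outputDir = Files.createTempDirectory("skip-sketch-");
        Path out = outputDir.resolve("test0.xml.xml");
        System.out.println(writeUnlessExists(out, "first run"));  // true
        System.out.println(writeUnlessExists(out, "second run")); // false
        System.out.println(countChildren(outputDir));             // 1
    }
}
```

Note that a file count of one after two runs is also what an overwrite policy would produce, so a stricter check would compare the file content or modification time between runs.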

Example 3 with BatchProcess

use of org.apache.tika.batch.BatchProcess in project tika by apache.

the class OutputStreamFactoryTest method testIllegalState.

@Test
public void testIllegalState() throws Exception {
    Path outputDir = getNewOutputDir("os-factory-illegal-state-");
    Map<String, String> args = getDefaultArgs("basic", outputDir);
    BatchProcess runner = getNewBatchRunner("/tika-batch-config-test.xml", args);
    run(runner);
    assertEquals(1, countChildren(outputDir));
    boolean illegalState = false;
    try {
        // running the same BatchProcess a second time is expected to fail
        ParallelFileProcessingResult result = run(runner);
    } catch (ExecutionException e) {
        if (e.getCause() instanceof IllegalStateException) {
            illegalState = true;
        }
    }
    assertTrue("Should have been an illegal state exception", illegalState);
}
Also used : Path(java.nio.file.Path) ParallelFileProcessingResult(org.apache.tika.batch.ParallelFileProcessingResult) BatchProcess(org.apache.tika.batch.BatchProcess) ExecutionException(java.util.concurrent.ExecutionException) Test(org.junit.Test)
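
Example 3 relies on a general java.util.concurrent behavior: when a task run through an executor throws, Future.get() raises an ExecutionException whose cause is the original exception. The sketch below shows that pattern in isolation with a single-use task. RunOnceSketch, OneShot and failsWithIllegalState are hypothetical names, and nothing here is taken from the BatchProcess source.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicBoolean;

public class RunOnceSketch {

    // Hypothetical single-use task: a second call() fails with IllegalStateException.
    static class OneShot implements Callable<String> {
        private final AtomicBoolean started = new AtomicBoolean(false);

        @Override
        public String call() {
            if (!started.compareAndSet(false, true)) {
                throw new IllegalStateException("already run");
            }
            return "done";
        }
    }

    // Submits the task and reports whether it failed with an IllegalStateException.
    static boolean failsWithIllegalState(Callable<String> task) throws InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            executor.submit(task).get();
            return false;
        } catch (ExecutionException e) {
            // Future.get() wraps whatever the task threw; the original is the cause.
            return e.getCause() instanceof IllegalStateException;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        OneShot task = new OneShot();
        System.out.println(failsWithIllegalState(task)); // false
        System.out.println(failsWithIllegalState(task)); // true
    }
}
```

Checking e.getCause() rather than catching IllegalStateException directly is what makes the test in Example 3 work, since the exception crosses a thread boundary before it reaches the caller.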

Example 4 with BatchProcess

use of org.apache.tika.batch.BatchProcess in project tika by apache.

the class FSBatchTestBase method getNewBatchRunner.

BatchProcess getNewBatchRunner(String testConfig, Map<String, String> args) throws IOException {
    // try-with-resources closes the config stream even if build() throws
    try (InputStream is = this.getClass().getResourceAsStream(testConfig)) {
        BatchProcessBuilder b = new BatchProcessBuilder();
        return b.build(is, args);
    }
}
Also used : BatchProcessBuilder(org.apache.tika.batch.builders.BatchProcessBuilder) InputStream(java.io.InputStream) BatchProcess(org.apache.tika.batch.BatchProcess)

Example 5 with BatchProcess

use of org.apache.tika.batch.BatchProcess in project tika by apache.

the class HandlerBuilderTest method testXML.

@Test
public void testXML() throws Exception {
    Path outputDir = getNewOutputDir("handler-xml-");
    Map<String, String> args = getDefaultArgs("basic", outputDir);
    args.put("basicHandlerType", "xml");
    BatchProcess runner = getNewBatchRunner("/tika-batch-config-test.xml", args);
    ParallelFileProcessingResult result = run(runner);
    Path outputFile = outputDir.resolve("test0.xml.xml");
    String resultString = readFileToString(outputFile, UTF_8);
    assertTrue(resultString.contains("<html xmlns=\"http://www.w3.org/1999/xhtml\">"));
    assertTrue(resultString.contains("<?xml version=\"1.0\" encoding=\"UTF-8\"?>"));
    assertTrue(resultString.contains("This is tika-batch's first test file"));
}
Also used : Path(java.nio.file.Path) ParallelFileProcessingResult(org.apache.tika.batch.ParallelFileProcessingResult) BatchProcess(org.apache.tika.batch.BatchProcess) Test(org.junit.Test)

Aggregations

BatchProcess (org.apache.tika.batch.BatchProcess) 10
ParallelFileProcessingResult (org.apache.tika.batch.ParallelFileProcessingResult) 8
Path (java.nio.file.Path) 7
Test (org.junit.Test) 7
HashMap (java.util.HashMap) 2
BatchProcessBuilder (org.apache.tika.batch.builders.BatchProcessBuilder) 2
InputStream (java.io.InputStream) 1
ExecutionException (java.util.concurrent.ExecutionException) 1
ExecutorService (java.util.concurrent.ExecutorService) 1
CommandLine (org.apache.commons.cli.CommandLine) 1
CommandLineParser (org.apache.commons.cli.CommandLineParser) 1
DefaultParser (org.apache.commons.cli.DefaultParser) 1
Option (org.apache.commons.cli.Option) 1
ConsumersManager (org.apache.tika.batch.ConsumersManager) 1
FileResource (org.apache.tika.batch.FileResource) 1
FileResourceCrawler (org.apache.tika.batch.FileResourceCrawler) 1
Interrupter (org.apache.tika.batch.Interrupter) 1
StatusReporter (org.apache.tika.batch.StatusReporter) 1
TikaInputStream (org.apache.tika.io.TikaInputStream) 1
Node (org.w3c.dom.Node) 1