Example 6 with BatchProcess

Use of org.apache.tika.batch.BatchProcess in project tika by apache.

Class HandlerBuilderTest, method testRecursiveParserWrapper.

@Test
public void testRecursiveParserWrapper() throws Exception {
    Path outputDir = getNewOutputDir("handler-recursive-parser");
    Map<String, String> args = getDefaultArgs("basic", outputDir);
    args.put("basicHandlerType", "txt");
    args.put("recursiveParserWrapper", "true");
    BatchProcess runner = getNewBatchRunner("/tika-batch-config-test.xml", args);
    ParallelFileProcessingResult result = run(runner);
    Path outputFile = outputDir.resolve("test0.xml.json");
    String resultString = readFileToString(outputFile, UTF_8);
    assertTrue(resultString.contains("\"author\":\"Nikolai Lobachevsky\""));
    assertTrue(resultString.contains("tika-batch\\u0027s first test file"));
}
Also used : Path(java.nio.file.Path) ParallelFileProcessingResult(org.apache.tika.batch.ParallelFileProcessingResult) BatchProcess(org.apache.tika.batch.BatchProcess) Test(org.junit.Test)

Example 7 with BatchProcess

Use of org.apache.tika.batch.BatchProcess in project tika by apache.

Class HandlerBuilderTest, method testText.

@Test
public void testText() throws Exception {
    Path outputDir = getNewOutputDir("handler-txt-");
    Map<String, String> args = getDefaultArgs("basic", outputDir);
    args.put("basicHandlerType", "txt");
    BatchProcess runner = getNewBatchRunner("/tika-batch-config-test.xml", args);
    ParallelFileProcessingResult result = run(runner);
    Path outputFile = outputDir.resolve("test0.xml.txt");
    String resultString = readFileToString(outputFile, UTF_8);
    assertFalse(resultString.contains("<html xmlns=\"http://www.w3.org/1999/xhtml\">"));
    assertFalse(resultString.contains("<?xml version=\"1.0\" encoding=\"UTF-8\"?>"));
    assertTrue(resultString.contains("This is tika-batch's first test file"));
}
Also used : Path(java.nio.file.Path) ParallelFileProcessingResult(org.apache.tika.batch.ParallelFileProcessingResult) BatchProcess(org.apache.tika.batch.BatchProcess) Test(org.junit.Test)

Example 8 with BatchProcess

Use of org.apache.tika.batch.BatchProcess in project tika by apache.

Class HandlerBuilderTest, method testXMLWithWriteLimit.

@Test
public void testXMLWithWriteLimit() throws Exception {
    Path outputDir = getNewOutputDir("handler-xml-write-limit-");
    Map<String, String> args = getDefaultArgs("basic", outputDir);
    args.put("writeLimit", "5");
    BatchProcess runner = getNewBatchRunner("/tika-batch-config-test.xml", args);
    ParallelFileProcessingResult result = run(runner);
    Path outputFile = outputDir.resolve("test0.xml.xml");
    String resultString = readFileToString(outputFile, UTF_8);
    // This is not ideal. How can we change handlers to write out whatever
    // they've gotten so far, up to the writeLimit?
    assertEquals("", resultString);
}
Also used : Path(java.nio.file.Path) ParallelFileProcessingResult(org.apache.tika.batch.ParallelFileProcessingResult) BatchProcess(org.apache.tika.batch.BatchProcess) Test(org.junit.Test)

Example 9 with BatchProcess

Use of org.apache.tika.batch.BatchProcess in project tika by apache.

Class HandlerBuilderTest, method testHTML.

@Test
public void testHTML() throws Exception {
    Path outputDir = getNewOutputDir("handler-html-");
    Map<String, String> args = getDefaultArgs("basic", outputDir);
    args.put("basicHandlerType", "html");
    BatchProcess runner = getNewBatchRunner("/tika-batch-config-test.xml", args);
    ParallelFileProcessingResult result = run(runner);
    Path outputFile = outputDir.resolve("test0.xml.html");
    String resultString = readFileToString(outputFile, UTF_8);
    assertTrue(resultString.contains("<html xmlns=\"http://www.w3.org/1999/xhtml\">"));
    assertFalse(resultString.contains("<?xml version=\"1.0\" encoding=\"UTF-8\"?>"));
    assertTrue(resultString.contains("This is tika-batch's first test file"));
}
Also used : Path(java.nio.file.Path) ParallelFileProcessingResult(org.apache.tika.batch.ParallelFileProcessingResult) BatchProcess(org.apache.tika.batch.BatchProcess) Test(org.junit.Test)
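
The four HandlerBuilderTest examples above each resolve a different output file for the same input (test0.xml.json, test0.xml.txt, test0.xml.xml, test0.xml.html). The sketch below only restates that observed naming pattern. The class OutputNameSketch and its methods are hypothetical and are not part of Tika; the rule that the recursive parser wrapper takes precedence and that XML is the fallback is an assumption inferred from these tests alone.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class OutputNameSketch {

    // Hypothetical helper: maps the arguments used in the tests to the
    // output suffix those tests expect. Inferred from the examples only.
    static String outputSuffix(String basicHandlerType, boolean recursiveParserWrapper) {
        if (recursiveParserWrapper) {
            // testRecursiveParserWrapper sets "txt" but still expects a .json file
            return ".json";
        }
        if (basicHandlerType == null) {
            // testXMLWithWriteLimit sets no handler type and expects .xml
            return ".xml";
        }
        switch (basicHandlerType) {
            case "txt":
                return ".txt";
            case "html":
                return ".html";
            default:
                return ".xml";
        }
    }

    // The suffix is appended to the full input file name, extension included.
    static Path expectedOutput(Path outputDir, String inputName,
                               String basicHandlerType, boolean recursiveParserWrapper) {
        return outputDir.resolve(inputName + outputSuffix(basicHandlerType, recursiveParserWrapper));
    }

    public static void main(String[] args) {
        Path dir = Paths.get("out");
        System.out.println(expectedOutput(dir, "test0.xml", "txt", true).getFileName());
        System.out.println(expectedOutput(dir, "test0.xml", "html", false).getFileName());
    }
}
```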

Example 10 with BatchProcess

Use of org.apache.tika.batch.BatchProcess in project tika by apache.

Class FSBatchProcessCLI, method execute.

private void execute(String[] args) throws Exception {
    CommandLineParser cliParser = new DefaultParser();
    CommandLine line = cliParser.parse(options, args);
    if (line.hasOption("help")) {
        usage();
        System.exit(BatchProcessDriverCLI.PROCESS_NO_RESTART_EXIT_CODE);
    }
    Map<String, String> mapArgs = new HashMap<>();
    for (Option option : line.getOptions()) {
        String v = option.getValue();
        // An option given without a value is treated as a boolean flag.
        if (v == null || v.isEmpty()) {
            v = "true";
        }
        mapArgs.put(option.getOpt(), v);
    }
    BatchProcessBuilder b = new BatchProcessBuilder();
    TikaInputStream is = null;
    BatchProcess process = null;
    try {
        is = getConfigInputStream(args, false);
        process = b.build(is, mapArgs);
    } finally {
        IOUtils.closeQuietly(is);
    }
    // The process is submitted as a Callable; get() blocks until it has finished.
    ExecutorService executor = Executors.newSingleThreadExecutor();
    Future<ParallelFileProcessingResult> futureResult = executor.submit(process);
    ParallelFileProcessingResult result = futureResult.get();
    System.out.println(FINISHED_STRING);
    System.out.println("\n");
    System.out.println(result.toString());
    System.exit(result.getExitStatus());
}
Also used : HashMap(java.util.HashMap) BatchProcess(org.apache.tika.batch.BatchProcess) TikaInputStream(org.apache.tika.io.TikaInputStream) CommandLine(org.apache.commons.cli.CommandLine) ParallelFileProcessingResult(org.apache.tika.batch.ParallelFileProcessingResult) BatchProcessBuilder(org.apache.tika.batch.builders.BatchProcessBuilder) ExecutorService(java.util.concurrent.ExecutorService) Option(org.apache.commons.cli.Option) CommandLineParser(org.apache.commons.cli.CommandLineParser) DefaultParser(org.apache.commons.cli.DefaultParser)
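
The execute method above hands the BatchProcess to a single-thread executor and blocks on the returned Future. A minimal, self-contained sketch of that submit-and-wait pattern follows, using only java.util.concurrent. The class SingleThreadCallableSketch and its nested Result type are hypothetical stand-ins for BatchProcess and ParallelFileProcessingResult. The explicit shutdown in the finally block is an addition for code paths that do not end in System.exit; it is not something the original method does.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SingleThreadCallableSketch {

    // Hypothetical stand-in for ParallelFileProcessingResult; only the exit status is modeled.
    static final class Result {
        private final int exitStatus;

        Result(int exitStatus) {
            this.exitStatus = exitStatus;
        }

        int getExitStatus() {
            return exitStatus;
        }
    }

    // Submits the task to a single-thread executor, blocks until it completes,
    // and shuts the executor down whether the task succeeded or threw.
    static Result runAndWait(Callable<Result> task)
            throws InterruptedException, ExecutionException {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<Result> future = executor.submit(task);
            return future.get();
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // The lambda plays the role of the batch process in this sketch.
        Result result = runAndWait(() -> new Result(0));
        System.out.println("exit status: " + result.getExitStatus());
    }
}
```

If the task throws, get() wraps the failure in an ExecutionException, so a caller that needs the original error would inspect getCause().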

Aggregations

BatchProcess (org.apache.tika.batch.BatchProcess): 10
ParallelFileProcessingResult (org.apache.tika.batch.ParallelFileProcessingResult): 8
Path (java.nio.file.Path): 7
Test (org.junit.Test): 7
HashMap (java.util.HashMap): 2
BatchProcessBuilder (org.apache.tika.batch.builders.BatchProcessBuilder): 2
InputStream (java.io.InputStream): 1
ExecutionException (java.util.concurrent.ExecutionException): 1
ExecutorService (java.util.concurrent.ExecutorService): 1
CommandLine (org.apache.commons.cli.CommandLine): 1
CommandLineParser (org.apache.commons.cli.CommandLineParser): 1
DefaultParser (org.apache.commons.cli.DefaultParser): 1
Option (org.apache.commons.cli.Option): 1
ConsumersManager (org.apache.tika.batch.ConsumersManager): 1
FileResource (org.apache.tika.batch.FileResource): 1
FileResourceCrawler (org.apache.tika.batch.FileResourceCrawler): 1
Interrupter (org.apache.tika.batch.Interrupter): 1
StatusReporter (org.apache.tika.batch.StatusReporter): 1
TikaInputStream (org.apache.tika.io.TikaInputStream): 1
Node (org.w3c.dom.Node): 1