
Example 1 with ForkParser

Use of org.apache.tika.fork.ForkParser in the Apache Tika project.

Class ForkParserIntegrationTest, method testParsingErrorInForkedParserShouldBeReported.

/**
 * TIKA-831 Parsers throwing errors should be caught and
 * properly reported
 */
@Test
public void testParsingErrorInForkedParserShouldBeReported() throws Exception {
    BrokenParser brokenParser = new BrokenParser();
    ForkParser parser = new ForkParser(ForkParser.class.getClassLoader(), brokenParser);
    InputStream stream = getClass().getResourceAsStream("/test-documents/testTXT.txt");
    // With a serializable error, we'll get that back
    try {
        ContentHandler output = new BodyContentHandler();
        ParseContext context = new ParseContext();
        parser.parse(stream, output, new Metadata(), context);
        fail("Expected TikaException caused by Error");
    } catch (TikaException e) {
        assertEquals(brokenParser.err, e.getCause());
    } finally {
        parser.close();
    }
    // With a non serializable one, we'll get something else
    // TODO Fix this test
    brokenParser = new BrokenParser();
    brokenParser.re = new WontBeSerializedError("Can't Serialize");
    parser = new ForkParser(ForkParser.class.getClassLoader(), brokenParser);
    // try {
    //     ContentHandler output = new BodyContentHandler();
    //     ParseContext context = new ParseContext();
    //     parser.parse(stream, output, new Metadata(), context);
    //     fail("Expected TikaException caused by Error");
    // } catch (TikaException e) {
    //     assertEquals(TikaException.class, e.getCause().getClass());
    //     assertEquals("Bang!", e.getCause().getMessage());
    // }
    // Release the second parser even though the check above is still disabled
    parser.close();
}
Also used: ForkParser (org.apache.tika.fork.ForkParser), BodyContentHandler (org.apache.tika.sax.BodyContentHandler), TikaException (org.apache.tika.exception.TikaException), InputStream (java.io.InputStream), ParseContext (org.apache.tika.parser.ParseContext), Metadata (org.apache.tika.metadata.Metadata), ContentHandler (org.xml.sax.ContentHandler), Test (org.junit.Test)

Example 2 with ForkParser

Use of org.apache.tika.fork.ForkParser in the Apache Tika project.

Class ForkParserIntegrationTest, method testParserHandlingOfNonSerializable.

/**
 * If we supply a non-serializable object on the ParseContext,
 * check we get a helpful exception back
 */
@Test
public void testParserHandlingOfNonSerializable() throws Exception {
    ForkParser parser = new ForkParser(ForkParserIntegrationTest.class.getClassLoader(), tika.getParser());
    ParseContext context = new ParseContext();
    // Anonymous classes like this one are not Serializable
    context.set(Detector.class, new Detector() {
        @Override
        public MediaType detect(InputStream input, Metadata metadata) {
            return MediaType.OCTET_STREAM;
        }
    });
    try {
        ContentHandler output = new BodyContentHandler();
        InputStream stream = ForkParserIntegrationTest.class.getResourceAsStream("/test-documents/testTXT.txt");
        parser.parse(stream, output, new Metadata(), context);
        fail("Should have blown up with a non serializable ParseContext");
    } catch (TikaException e) {
        // Check the right details
        assertNotNull(e.getCause());
        assertEquals(NotSerializableException.class, e.getCause().getClass());
        assertEquals("Unable to serialize ParseContext to pass to the Forked Parser", e.getMessage());
    } finally {
        parser.close();
    }
}
Also used: ForkParser (org.apache.tika.fork.ForkParser), BodyContentHandler (org.apache.tika.sax.BodyContentHandler), NotSerializableException (java.io.NotSerializableException), Detector (org.apache.tika.detect.Detector), TikaException (org.apache.tika.exception.TikaException), InputStream (java.io.InputStream), ParseContext (org.apache.tika.parser.ParseContext), Metadata (org.apache.tika.metadata.Metadata), MediaType (org.apache.tika.mime.MediaType), ContentHandler (org.xml.sax.ContentHandler), Test (org.junit.Test)
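
The test above expects a NotSerializableException as the cause, which suggests that the ParseContext reaches the forked process through standard Java serialization. As a minimal sketch under that assumption, the following standalone class shows how a value can be checked for serializability before it is placed on a context. The names SerializationCheck, canSerialize, SerializableValue and PlainValue are hypothetical and are not part of Tika.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

public class SerializationCheck {

    /** Hypothetical value that carries the Serializable marker interface. */
    static class SerializableValue implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name = "serializable";
    }

    /** Hypothetical value without the Serializable marker interface. */
    static class PlainValue {
        final String name = "plain";
    }

    /**
     * Tries to write the value with Java serialization into an in-memory buffer.
     * Returns false if the JDK rejects it with NotSerializableException.
     */
    static boolean canSerialize(Object value) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(value);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            // Not expected for an in-memory buffer; surface it unchecked
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(canSerialize(new SerializableValue()));
        System.out.println(canSerialize(new PlainValue()));
    }
}
```

A check like this could run before parse() is called, so that a non-serializable context entry is reported where it is set rather than when the forked parser is invoked.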

Example 3 with ForkParser

Use of org.apache.tika.fork.ForkParser in the Apache Tika project.

Class ForkParserIntegrationTest, method testAttachingADebuggerOnTheForkedParserShouldWork.

/**
 * TIKA-832
 */
@Test
public void testAttachingADebuggerOnTheForkedParserShouldWork() throws Exception {
    ParseContext context = new ParseContext();
    context.set(Parser.class, tika.getParser());
    ForkParser parser = new ForkParser(ForkParserIntegrationTest.class.getClassLoader(), tika.getParser());
    parser.setJavaCommand(Arrays.asList("java", "-Xmx32m", "-Xdebug", "-Xrunjdwp:transport=dt_socket,address=54321,server=y,suspend=n"));
    try {
        ContentHandler body = new BodyContentHandler();
        InputStream stream = ForkParserIntegrationTest.class.getResourceAsStream("/test-documents/testTXT.txt");
        parser.parse(stream, body, new Metadata(), context);
        String content = body.toString();
        assertContains("Test d'indexation", content);
        assertContains("http://www.apache.org", content);
    } finally {
        parser.close();
    }
}
Also used: ForkParser (org.apache.tika.fork.ForkParser), BodyContentHandler (org.apache.tika.sax.BodyContentHandler), InputStream (java.io.InputStream), ParseContext (org.apache.tika.parser.ParseContext), Metadata (org.apache.tika.metadata.Metadata), ContentHandler (org.xml.sax.ContentHandler), Test (org.junit.Test)
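
The command in this test uses the legacy `-Xdebug -Xrunjdwp:` option pair. Current JVMs usually take the same JDWP settings as a single `-agentlib:jdwp=` option. The sketch below builds such a command list in a form that could be passed to setJavaCommand. DebugCommand and javaCommandWithDebugger are hypothetical helper names, and it is an untested assumption that the forked parser behaves the same with this spelling.

```java
import java.util.Arrays;
import java.util.List;

public class DebugCommand {

    /**
     * Builds a java command line whose JVM listens for a debugger on the given port.
     * Hypothetical helper; maxHeap is a JVM size string such as "32m".
     */
    static List<String> javaCommandWithDebugger(int port, String maxHeap) {
        // server=y: the JVM waits for a debugger to attach; suspend=n: it starts running immediately
        String agent = "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=" + port;
        return Arrays.asList("java", "-Xmx" + maxHeap, agent);
    }

    public static void main(String[] args) {
        System.out.println(javaCommandWithDebugger(54321, "32m"));
    }
}
```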

Example 4 with ForkParser

Use of org.apache.tika.fork.ForkParser in the Apache Tika project.

Class BundleIT, method testForkParser.

@Test
public void testForkParser() throws Exception {
    ForkParser parser = new ForkParser(Activator.class.getClassLoader(), defaultParser);
    String data = "<!DOCTYPE html>\n<html><body><p>test <span>content</span></p></body></html>";
    InputStream stream = new ByteArrayInputStream(data.getBytes(UTF_8));
    Writer writer = new StringWriter();
    ContentHandler contentHandler = new BodyContentHandler(writer);
    Metadata metadata = new Metadata();
    MediaType type = contentTypeDetector.detect(stream, metadata);
    // JUnit expects (expected, actual); the original had the arguments swapped
    assertEquals("text/html", type.toString());
    metadata.add(Metadata.CONTENT_TYPE, type.toString());
    ParseContext parseCtx = new ParseContext();
    parser.parse(stream, contentHandler, metadata, parseCtx);
    writer.flush();
    String content = writer.toString();
    assertTrue(content.length() > 0);
    assertEquals("test content", content.trim());
}
Also used: ForkParser (org.apache.tika.fork.ForkParser), BodyContentHandler (org.apache.tika.sax.BodyContentHandler), StringWriter (java.io.StringWriter), Activator (org.apache.tika.parser.internal.Activator), ByteArrayInputStream (java.io.ByteArrayInputStream), JarInputStream (java.util.jar.JarInputStream), FileInputStream (java.io.FileInputStream), InputStream (java.io.InputStream), Metadata (org.apache.tika.metadata.Metadata), ParseContext (org.apache.tika.parser.ParseContext), MediaType (org.apache.tika.mime.MediaType), Writer (java.io.Writer), ContentHandler (org.xml.sax.ContentHandler), Test (org.junit.Test)

Example 5 with ForkParser

Use of org.apache.tika.fork.ForkParser in the Apache Jackrabbit project.

Class SearchIndex, method createParser.

private Parser createParser() {
    URL url = null;
    if (tikaConfigPath != null) {
        File file = new File(tikaConfigPath);
        if (file.exists()) {
            try {
                url = file.toURI().toURL();
            } catch (MalformedURLException e) {
                log.warn("Invalid Tika configuration path: " + file, e);
            }
        } else {
            ClassLoader loader = SearchIndex.class.getClassLoader();
            url = loader.getResource(tikaConfigPath);
        }
    }
    if (url == null) {
        url = SearchIndex.class.getResource("tika-config.xml");
    }
    TikaConfig config = null;
    if (url != null) {
        try {
            config = new TikaConfig(url);
        } catch (Exception e) {
            log.warn("Tika configuration not available: " + url, e);
        }
    }
    if (config == null) {
        config = TikaConfig.getDefaultConfig();
    }
    if (forkJavaCommand != null) {
        ForkParser forkParser = new ForkParser(SearchIndex.class.getClassLoader(), new AutoDetectParser(config));
        forkParser.setJavaCommand(forkJavaCommand);
        forkParser.setPoolSize(extractorPoolSize);
        return forkParser;
    } else {
        return new AutoDetectParser(config);
    }
}
Also used: ForkParser (org.apache.tika.fork.ForkParser), MalformedURLException (java.net.MalformedURLException), TikaConfig (org.apache.tika.config.TikaConfig), AutoDetectParser (org.apache.tika.parser.AutoDetectParser), File (java.io.File), URL (java.net.URL), FileSystemException (org.apache.jackrabbit.core.fs.FileSystemException), SAXException (org.xml.sax.SAXException), JournalException (org.apache.jackrabbit.core.journal.JournalException), NoSuchItemStateException (org.apache.jackrabbit.core.state.NoSuchItemStateException), RepositoryException (javax.jcr.RepositoryException), IOException (java.io.IOException), ItemStateException (org.apache.jackrabbit.core.state.ItemStateException), ParserConfigurationException (javax.xml.parsers.ParserConfigurationException), InvalidQueryException (javax.jcr.query.InvalidQueryException)
