Example 61 with ContentHandler

Use of org.xml.sax.ContentHandler in the Apache Tika project.

Class HtmlParserTest, method testParseEmpty.

@Test
public void testParseEmpty() throws Exception {
    ContentHandler handler = new BodyContentHandler();
    new HtmlParser().parse(new ByteArrayInputStream(new byte[0]), handler, new Metadata(), new ParseContext());
    assertEquals("", handler.toString());
}
Also used: BodyContentHandler(org.apache.tika.sax.BodyContentHandler) ByteArrayInputStream(java.io.ByteArrayInputStream) Metadata(org.apache.tika.metadata.Metadata) ParseContext(org.apache.tika.parser.ParseContext) LinkContentHandler(org.apache.tika.sax.LinkContentHandler) TeeContentHandler(org.apache.tika.sax.TeeContentHandler) ContentHandler(org.xml.sax.ContentHandler) Test(org.junit.Test) TikaTest(org.apache.tika.TikaTest)
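
The examples on this page all hand a ContentHandler to a parser, which then reports document events back to it. As a rough illustration of that callback contract, here is a minimal sketch that uses only the JDK's SAX parser instead of Tika. The class name TextCollector and its collect helper are hypothetical and exist only for this illustration; the sketch does not reproduce how BodyContentHandler is implemented.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler for illustration: it keeps only character data
// and ignores element boundaries. DefaultHandler implements ContentHandler.
final class TextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser may deliver text in several chunks; append each one.
        text.append(ch, start, length);
    }

    @Override
    public String toString() {
        return text.toString();
    }

    // Parses a well-formed XML string with the JDK SAX parser and
    // returns the text that reached the handler.
    static String collect(String xml) {
        try {
            TextCollector handler = new TextCollector();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(collect("<p>Hello <b>world</b></p>"));
    }
}
```

As in the tests here, the caller reads the result through the handler's toString() once parsing has finished.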

Example 62 with ContentHandler

Use of org.xml.sax.ContentHandler in the Apache Tika project.

Class HtmlParserTest, method testBoilerplateWithMarkup.

/**
 * Test case for TIKA-564. Support returning markup from BoilerpipeContentHandler.
 *
 * @see <a href="https://issues.apache.org/jira/browse/TIKA-564">TIKA-564</a>
 */
@Test
public void testBoilerplateWithMarkup() throws Exception {
    String path = "/test-documents/boilerplate.html";
    Metadata metadata = new Metadata();
    StringWriter sw = new StringWriter();
    ContentHandler ch = makeHtmlTransformer(sw);
    BoilerpipeContentHandler bpch = new BoilerpipeContentHandler(ch);
    bpch.setIncludeMarkup(true);
    new HtmlParser().parse(HtmlParserTest.class.getResourceAsStream(path), bpch, metadata, new ParseContext());
    String content = sw.toString();
    assertTrue("Has empty table elements", content.contains("<body><table><tr><td><table><tr><td>"));
    assertTrue("Has empty a element", content.contains("<a shape=\"rect\" href=\"Main.php\"/>"));
    assertTrue("Has real content", content.contains("<p>This is the real meat"));
    assertTrue("Ends with appropriate HTML", content.endsWith("</p></body></html>"));
    assertFalse(content.contains("boilerplate"));
    assertFalse(content.contains("footer"));
}
Also used: StringWriter(java.io.StringWriter) Metadata(org.apache.tika.metadata.Metadata) ParseContext(org.apache.tika.parser.ParseContext) LinkContentHandler(org.apache.tika.sax.LinkContentHandler) TeeContentHandler(org.apache.tika.sax.TeeContentHandler) BodyContentHandler(org.apache.tika.sax.BodyContentHandler) ContentHandler(org.xml.sax.ContentHandler) Test(org.junit.Test) TikaTest(org.apache.tika.TikaTest)

Example 63 with ContentHandler

Use of org.xml.sax.ContentHandler in the Apache Tika project.

Class Pkcs7ParserTest, method testDetachedSignature.

public void testDetachedSignature() throws Exception {
    try (InputStream input = Pkcs7ParserTest.class.getResourceAsStream("/test-documents/testDetached.p7s")) {
        ContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        new Pkcs7Parser().parse(input, handler, metadata, new ParseContext());
    } catch (NullPointerException npe) {
        fail("should not get NPE");
    } catch (TikaException te) {
        assertTrue(te.toString().contains("cannot parse detached pkcs7 signature"));
    }
}
Also used: BodyContentHandler(org.apache.tika.sax.BodyContentHandler) TikaException(org.apache.tika.exception.TikaException) InputStream(java.io.InputStream) Metadata(org.apache.tika.metadata.Metadata) ParseContext(org.apache.tika.parser.ParseContext) ContentHandler(org.xml.sax.ContentHandler)

Example 64 with ContentHandler

Use of org.xml.sax.ContentHandler in the Apache Tika project.

Class ForkParserIntegrationTest, method testParsingErrorInForkedParserShouldBeReported.

/**
 * TIKA-831 Parsers throwing errors should be caught and
 * properly reported
 */
@Test
public void testParsingErrorInForkedParserShouldBeReported() throws Exception {
    BrokenParser brokenParser = new BrokenParser();
    ForkParser parser = new ForkParser(ForkParser.class.getClassLoader(), brokenParser);
    InputStream stream = getClass().getResourceAsStream("/test-documents/testTXT.txt");
    // With a serializable error, we'll get that back
    try {
        ContentHandler output = new BodyContentHandler();
        ParseContext context = new ParseContext();
        parser.parse(stream, output, new Metadata(), context);
        fail("Expected TikaException caused by Error");
    } catch (TikaException e) {
        assertEquals(brokenParser.err, e.getCause());
    } finally {
        parser.close();
    }
    // With a non serializable one, we'll get something else
    // TODO Fix this test
    brokenParser = new BrokenParser();
    brokenParser.re = new WontBeSerializedError("Can't Serialize");
    parser = new ForkParser(ForkParser.class.getClassLoader(), brokenParser);
    // try {
    //     ContentHandler output = new BodyContentHandler();
    //     ParseContext context = new ParseContext();
    //     parser.parse(stream, output, new Metadata(), context);
    //     fail("Expected TikaException caused by Error");
    // } catch (TikaException e) {
    //     assertEquals(TikaException.class, e.getCause().getClass());
    //     assertEquals("Bang!", e.getCause().getMessage());
    // }
}
Also used: ForkParser(org.apache.tika.fork.ForkParser) BodyContentHandler(org.apache.tika.sax.BodyContentHandler) TikaException(org.apache.tika.exception.TikaException) InputStream(java.io.InputStream) ParseContext(org.apache.tika.parser.ParseContext) Metadata(org.apache.tika.metadata.Metadata) ContentHandler(org.xml.sax.ContentHandler) Test(org.junit.Test)

Example 65 with ContentHandler

Use of org.xml.sax.ContentHandler in the Apache Tika project.

Class ForkParserIntegrationTest, method testParserHandlingOfNonSerializable.

/**
 * If we supply a non serializable object on the ParseContext,
 * check we get a helpful exception back
 */
@Test
public void testParserHandlingOfNonSerializable() throws Exception {
    ForkParser parser = new ForkParser(ForkParserIntegrationTest.class.getClassLoader(), tika.getParser());
    ParseContext context = new ParseContext();
    context.set(Detector.class, new Detector() {

        public MediaType detect(InputStream input, Metadata metadata) {
            return MediaType.OCTET_STREAM;
        }
    });
    try {
        ContentHandler output = new BodyContentHandler();
        InputStream stream = ForkParserIntegrationTest.class.getResourceAsStream("/test-documents/testTXT.txt");
        parser.parse(stream, output, new Metadata(), context);
        fail("Should have blown up with a non serializable ParseContext");
    } catch (TikaException e) {
        // Check the right details
        assertNotNull(e.getCause());
        assertEquals(NotSerializableException.class, e.getCause().getClass());
        assertEquals("Unable to serialize ParseContext to pass to the Forked Parser", e.getMessage());
    } finally {
        parser.close();
    }
}
Also used: ForkParser(org.apache.tika.fork.ForkParser) BodyContentHandler(org.apache.tika.sax.BodyContentHandler) NotSerializableException(java.io.NotSerializableException) Detector(org.apache.tika.detect.Detector) TikaException(org.apache.tika.exception.TikaException) InputStream(java.io.InputStream) ParseContext(org.apache.tika.parser.ParseContext) Metadata(org.apache.tika.metadata.Metadata) MediaType(org.apache.tika.mime.MediaType) ContentHandler(org.xml.sax.ContentHandler) Test(org.junit.Test)
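
The test above expects a java.io.NotSerializableException as the cause when the ParseContext cannot be serialized. As a rough sketch of where that exception comes from in plain Java serialization, the JDK-only example below writes objects to an ObjectOutputStream and reports whether that succeeds. The class SerializationCheck and its canSerialize helper are hypothetical names for this illustration and are not part of Tika; the sketch makes no claim about how ForkParser serializes its arguments internally.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;

// Hypothetical helper for illustration only.
final class SerializationCheck {

    // Returns true if standard Java serialization accepts the value,
    // false if it is rejected with NotSerializableException.
    static boolean canSerialize(Object value) {
        try (ObjectOutputStream out =
                new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(value);
            return true;
        } catch (NotSerializableException e) {
            // Thrown when the object, or something it references,
            // does not implement java.io.Serializable.
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // String implements Serializable; a bare Object does not.
        System.out.println(canSerialize("plain text"));
        System.out.println(canSerialize(new Object()));
    }
}
```

A check along these lines can be run against an object before handing it to code that has to serialize it, which surfaces the problem earlier than a failure during parsing.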

Aggregations

ContentHandler (org.xml.sax.ContentHandler): 354
Metadata (org.apache.tika.metadata.Metadata): 229
BodyContentHandler (org.apache.tika.sax.BodyContentHandler): 229
InputStream (java.io.InputStream): 210
Test (org.junit.Test): 208
ParseContext (org.apache.tika.parser.ParseContext): 164
Parser (org.apache.tika.parser.Parser): 106
TikaTest (org.apache.tika.TikaTest): 103
AutoDetectParser (org.apache.tika.parser.AutoDetectParser): 102
TikaInputStream (org.apache.tika.io.TikaInputStream): 75
ByteArrayInputStream (java.io.ByteArrayInputStream): 64
SAXException (org.xml.sax.SAXException): 40
IOException (java.io.IOException): 34
TeeContentHandler (org.apache.tika.sax.TeeContentHandler): 28
TikaException (org.apache.tika.exception.TikaException): 24
ExcelParserTest (org.apache.tika.parser.microsoft.ExcelParserTest): 24
WordParserTest (org.apache.tika.parser.microsoft.WordParserTest): 24
XHTMLContentHandler (org.apache.tika.sax.XHTMLContentHandler): 21
AttributesImpl (org.xml.sax.helpers.AttributesImpl): 21
InputSource (org.xml.sax.InputSource): 20