Example 1 with TikaTest

Use of org.apache.tika.TikaTest in the tika project by apache.

From the class PDFParserTest, method testSkipBadPage.

@Test
public void testSkipBadPage() throws Exception {
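    // The test file comes from the govdocs1 corpus.
    // The TikaTest shortcut helpers can't be used here: the parse is expected to throw,
    // and the handler and metadata still need to be inspected afterwards.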
    Parser p = new AutoDetectParser();
    ContentHandler handler = new BodyContentHandler(-1);
    Metadata m = new Metadata();
    ParseContext context = new ParseContext();
    boolean tikaEx = false;
    try (InputStream is = getResourceAsStream("/test-documents/testPDF_bad_page_303226.pdf")) {
        p.parse(is, handler, m, context);
    } catch (TikaException e) {
        tikaEx = true;
    }
    String content = handler.toString();
    assertTrue("Should have thrown exception", tikaEx);
    assertEquals(1, m.getValues(TikaCoreProperties.TIKA_META_EXCEPTION_WARNING).length);
    assertContains("Unknown dir", m.get(TikaCoreProperties.TIKA_META_EXCEPTION_WARNING));
    assertContains("1309.61", content);
    // Now disable catching of intermediate IOExceptions so the parser throws immediately.
    PDFParserConfig config = new PDFParserConfig();
    config.setCatchIntermediateIOExceptions(false);
    context.set(PDFParserConfig.class, config);
    handler = new BodyContentHandler(-1);
    m = new Metadata();
    tikaEx = false;
    try (InputStream is = getResourceAsStream("/test-documents/testPDF_bad_page_303226.pdf")) {
        p.parse(is, handler, m, context);
    } catch (TikaException e) {
        tikaEx = true;
    }
    content = handler.toString();
    assertTrue("Should have thrown exception", tikaEx);
    assertEquals(0, m.getValues(TikaCoreProperties.TIKA_META_EXCEPTION_WARNING).length);
    assertNotContained("1309.61", content);
}
Also used : BodyContentHandler(org.apache.tika.sax.BodyContentHandler) TikaException(org.apache.tika.exception.TikaException) TikaInputStream(org.apache.tika.io.TikaInputStream) InputStream(java.io.InputStream) Metadata(org.apache.tika.metadata.Metadata) ParseContext(org.apache.tika.parser.ParseContext) AutoDetectParser(org.apache.tika.parser.AutoDetectParser) ContentHandler(org.xml.sax.ContentHandler) Parser(org.apache.tika.parser.Parser) CompositeParser(org.apache.tika.parser.CompositeParser) TesseractOCRParser(org.apache.tika.parser.ocr.TesseractOCRParser) Test(org.junit.Test) TikaTest(org.apache.tika.TikaTest)
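
The test above checks two behaviours: with intermediate exceptions caught, the parse still ends in an exception but leaves a warning in the metadata and keeps the text of the other pages; with catching disabled, it stops at the bad page. The following is a minimal, dependency-free sketch of that general pattern. It is not Tika's implementation: LenientPageParser, Page, Result and demo are hypothetical names invented for illustration, and the three sample "pages" only mimic the strings asserted in the test.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a page-by-page parser; this is NOT Tika's PDFParser.
final class LenientPageParser {

    /** Hypothetical page abstraction whose text extraction may fail. */
    interface Page {
        String extractText() throws IOException;
    }

    /** Collects extracted text, recorded warnings, and whether the parse threw. */
    static final class Result {
        final StringBuilder content = new StringBuilder();
        final List<String> warnings = new ArrayList<>();
        boolean threw;
    }

    /**
     * Parses pages in order. When catchIntermediate is true, a failing page is
     * recorded as a warning, later pages are still processed, and the first
     * failure is rethrown at the end. When false, the first failure propagates
     * immediately and no warning is recorded.
     */
    static void parse(List<Page> pages, boolean catchIntermediate, Result out)
            throws IOException {
        IOException first = null;
        for (Page page : pages) {
            try {
                out.content.append(page.extractText()).append('\n');
            } catch (IOException e) {
                if (!catchIntermediate) {
                    throw e; // fail fast: later pages are never read
                }
                out.warnings.add(e.getMessage());
                if (first == null) {
                    first = e; // remember the first failure for later
                }
            }
        }
        if (first != null) {
            throw first; // report the failure after all pages were attempted
        }
    }

    /** Runs the sketch on three made-up pages, the second of which fails. */
    static Result demo(boolean catchIntermediate) {
        List<Page> pages = Arrays.<Page>asList(
                () -> "page one",
                () -> { throw new IOException("Unknown dir"); },
                () -> "1309.61");
        Result result = new Result();
        try {
            parse(pages, catchIntermediate, result);
        } catch (IOException e) {
            result.threw = true;
        }
        return result;
    }

    public static void main(String[] args) {
        Result lenient = demo(true);
        // prints "true 1 true"
        System.out.println(lenient.threw + " " + lenient.warnings.size() + " "
                + lenient.content.toString().contains("1309.61"));
        Result strict = demo(false);
        // prints "true 0 false"
        System.out.println(strict.threw + " " + strict.warnings.size() + " "
                + strict.content.toString().contains("1309.61"));
    }
}
```

In this sketch, deferring the rethrow is what lets the caller see both the failure and the partial content, which is the combination the two halves of testSkipBadPage assert on.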
