
Example 1 with TesseractOCRParser

Use of org.apache.tika.parser.ocr.TesseractOCRParser in project tika by apache.

In the class TikaResourceTest, method testPDFOCRConfig.

//TIKA-2290
@Test
public void testPDFOCRConfig() throws Exception {
    // Tesseract is not installed: nothing to verify.
    if (!new TesseractOCRParser().hasTesseract(new TesseractOCRConfig())) {
        return;
    }
    // no_ocr: the response body is expected to be empty.
    Response response = WebClient.create(endPoint + TIKA_PATH)
            .type("application/pdf")
            .accept("text/plain")
            .header(TikaResource.X_TIKA_PDF_HEADER_PREFIX + "OcrStrategy", "no_ocr")
            .put(ClassLoader.getSystemResourceAsStream("testOCR.pdf"));
    String responseMsg = getStringFromInputStream((InputStream) response.getEntity());
    assertEquals("", responseMsg.trim());
    // ocr_only: the text rendered in the page image is expected in the response.
    response = WebClient.create(endPoint + TIKA_PATH)
            .type("application/pdf")
            .accept("text/plain")
            .header(TikaResource.X_TIKA_PDF_HEADER_PREFIX + "OcrStrategy", "ocr_only")
            .put(ClassLoader.getSystemResourceAsStream("testOCR.pdf"));
    responseMsg = getStringFromInputStream((InputStream) response.getEntity());
    assertContains("Happy New Year 2003!", responseMsg);
    //now try a bad value
    response = WebClient.create(endPoint + TIKA_PATH)
            .type("application/pdf")
            .accept("text/plain")
            .header(TikaResource.X_TIKA_PDF_HEADER_PREFIX + "OcrStrategy", "non-sense-value")
            .put(ClassLoader.getSystemResourceAsStream("testOCR.pdf"));
    assertEquals(500, response.getStatus());
}
Also used : TesseractOCRConfig(org.apache.tika.parser.ocr.TesseractOCRConfig) Response(javax.ws.rs.core.Response) InputStream(java.io.InputStream) TesseractOCRParser(org.apache.tika.parser.ocr.TesseractOCRParser) Test(org.junit.Test)
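
The test above returns early when hasTesseract reports that Tesseract is unavailable. As a rough illustration of what such an availability guard can look like, here is a minimal sketch that uses only the JDK. It is not Tika's implementation: the class CommandAvailabilitySketch, the method isCommandAvailable, the 5-second timeout, and the "--version" argument are all assumptions made for this example.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Tika: reports whether an external command can be run.
final class CommandAvailabilitySketch {

    // Returns true only if the command starts and exits with status 0 within 5 seconds.
    static boolean isCommandAvailable(String... command) {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.redirectErrorStream(true);
        // Discard all output so the child process cannot block on a full pipe.
        builder.redirectOutput(ProcessBuilder.Redirect.DISCARD);
        Process process = null;
        try {
            process = builder.start();
            if (!process.waitFor(5, TimeUnit.SECONDS)) {
                return false; // still running after the timeout: treat as unavailable
            }
            return process.exitValue() == 0;
        } catch (IOException e) {
            // Typically thrown when the executable cannot be found or started.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            if (process != null) {
                process.destroy();
            }
        }
    }

    public static void main(String[] args) {
        // Assumption: the installed tesseract binary accepts "--version".
        System.out.println("tesseract available: " + isCommandAvailable("tesseract", "--version"));
    }
}
```

A test could call such a helper once and skip its OCR assertions when it returns false, which mirrors the early return in testPDFOCRConfig.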

Example 2 with TesseractOCRParser

Use of org.apache.tika.parser.ocr.TesseractOCRParser in project tika by apache.

In the class AbstractPDF2XHTML, method doOCROnCurrentPage.

void doOCROnCurrentPage() throws IOException, TikaException, SAXException {
    if (config.getOcrStrategy().equals(NO_OCR)) {
        return;
    }
    TesseractOCRConfig tesseractConfig = context.get(TesseractOCRConfig.class, DEFAULT_TESSERACT_CONFIG);
    TesseractOCRParser tesseractOCRParser = new TesseractOCRParser();
    if (!tesseractOCRParser.hasTesseract(tesseractConfig)) {
        throw new TikaException("Tesseract is not available. Please set the OCR_STRATEGY to NO_OCR or configure Tesseract correctly");
    }
    PDFRenderer renderer = new PDFRenderer(pdDocument);
    TemporaryResources tmp = new TemporaryResources();
    try {
        BufferedImage image = renderer.renderImage(pageIndex, 2.0f, config.getOcrImageType());
        Path tmpFile = tmp.createTempFile();
        try (OutputStream os = Files.newOutputStream(tmpFile)) {
            //TODO: get output format from TesseractConfig
            ImageIOUtil.writeImage(image, config.getOcrImageFormatName(), os, config.getOcrDPI(), config.getOcrImageQuality());
        }
        try (InputStream is = TikaInputStream.get(tmpFile)) {
            tesseractOCRParser.parseInline(is, xhtml, tesseractConfig);
        }
    } catch (IOException e) {
        handleCatchableIOE(e);
    } catch (SAXException e) {
        throw new IOExceptionWithCause("error writing OCR content from PDF", e);
    } finally {
        tmp.dispose();
    }
}
Also used : TesseractOCRConfig(org.apache.tika.parser.ocr.TesseractOCRConfig) Path(java.nio.file.Path) IOExceptionWithCause(org.apache.commons.io.IOExceptionWithCause) TikaException(org.apache.tika.exception.TikaException) BufferedInputStream(java.io.BufferedInputStream) ByteArrayInputStream(java.io.ByteArrayInputStream) TikaInputStream(org.apache.tika.io.TikaInputStream) InputStream(java.io.InputStream) OutputStream(java.io.OutputStream) TemporaryResources(org.apache.tika.io.TemporaryResources) IOException(java.io.IOException) TesseractOCRParser(org.apache.tika.parser.ocr.TesseractOCRParser) PDFRenderer(org.apache.pdfbox.rendering.PDFRenderer) BufferedImage(java.awt.image.BufferedImage) SAXException(org.xml.sax.SAXException)
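
doOCROnCurrentPage writes the rendered page to a temporary file, reopens that file as a stream for the OCR parser, and disposes of the file in a finally block. The same lifecycle can be sketched with only java.nio.file, as below. This is an illustration under stated assumptions, not Tika code: TempFileRoundTripSketch, StreamConsumer, viaTempFile, and roundTrip are hypothetical names, and a plain byte array stands in for the rendered image.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of the write-then-read temp file pattern; not part of Tika.
final class TempFileRoundTripSketch {

    // Stands in for the step that consumes the stream (the OCR parser in the example above).
    interface StreamConsumer<T> {
        T consume(InputStream in) throws IOException;
    }

    // Writes the payload to a temp file, hands a fresh stream to the consumer,
    // and deletes the file whether or not the consumer succeeds.
    static <T> T viaTempFile(byte[] payload, StreamConsumer<T> consumer) throws IOException {
        Path tmpFile = Files.createTempFile("ocr-sketch-", ".bin");
        try {
            // Close the writer before reading so all bytes are flushed to disk.
            try (OutputStream os = Files.newOutputStream(tmpFile)) {
                os.write(payload);
            }
            try (InputStream is = Files.newInputStream(tmpFile)) {
                return consumer.consume(is);
            }
        } finally {
            Files.deleteIfExists(tmpFile);
        }
    }

    // Convenience wrapper used for demonstration: reads back the text that was written.
    static String roundTrip(String text) {
        try {
            return viaTempFile(text.getBytes(StandardCharsets.UTF_8),
                    in -> new String(in.readAllBytes(), StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("Happy New Year 2003!"));
    }
}
```

Closing the output stream before opening the input stream matters here: the consumer should only ever see a fully written file.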

Example 3 with TesseractOCRParser

Use of org.apache.tika.parser.ocr.TesseractOCRParser in project tika by apache.

In the class BundleIT, method testTesseractParser.

@Test
public void testTesseractParser() throws Exception {
    ContentHandler handler = new BodyContentHandler();
    ParseContext context = new ParseContext();
    Parser tesseractParser = new TesseractOCRParser();
    try (InputStream stream = new FileInputStream("src/test/resources/testOCR.jpg")) {
        // Smoke test: there are no assertions, so it passes as long as parse does not throw.
        tesseractParser.parse(stream, handler, new Metadata(), context);
    }
}
Also used : BodyContentHandler(org.apache.tika.sax.BodyContentHandler) ByteArrayInputStream(java.io.ByteArrayInputStream) JarInputStream(java.util.jar.JarInputStream) FileInputStream(java.io.FileInputStream) InputStream(java.io.InputStream) ParseContext(org.apache.tika.parser.ParseContext) Metadata(org.apache.tika.metadata.Metadata) ContentHandler(org.xml.sax.ContentHandler) TesseractOCRParser(org.apache.tika.parser.ocr.TesseractOCRParser) Parser(org.apache.tika.parser.Parser) CompositeParser(org.apache.tika.parser.CompositeParser) DefaultParser(org.apache.tika.parser.DefaultParser) ForkParser(org.apache.tika.fork.ForkParser) Test(org.junit.Test)

Aggregations

InputStream (java.io.InputStream) 3
TesseractOCRParser (org.apache.tika.parser.ocr.TesseractOCRParser) 3
ByteArrayInputStream (java.io.ByteArrayInputStream) 2
TesseractOCRConfig (org.apache.tika.parser.ocr.TesseractOCRConfig) 2
Test (org.junit.Test) 2
BufferedImage (java.awt.image.BufferedImage) 1
BufferedInputStream (java.io.BufferedInputStream) 1
FileInputStream (java.io.FileInputStream) 1
IOException (java.io.IOException) 1
OutputStream (java.io.OutputStream) 1
Path (java.nio.file.Path) 1
JarInputStream (java.util.jar.JarInputStream) 1
Response (javax.ws.rs.core.Response) 1
IOExceptionWithCause (org.apache.commons.io.IOExceptionWithCause) 1
PDFRenderer (org.apache.pdfbox.rendering.PDFRenderer) 1
TikaException (org.apache.tika.exception.TikaException) 1
ForkParser (org.apache.tika.fork.ForkParser) 1
TemporaryResources (org.apache.tika.io.TemporaryResources) 1
TikaInputStream (org.apache.tika.io.TikaInputStream) 1
Metadata (org.apache.tika.metadata.Metadata) 1