Examples with PDFTextStripper - org.apache.pdfbox.util.PDFTextStripper

Example 1 with PDFTextStripper

Use of org.apache.pdfbox.util.PDFTextStripper in the camel project by apache.

Class PdfProducer, method doExtractText.

private String doExtractText(Exchange exchange) throws IOException, CryptographyException, InvalidPasswordException, BadSecurityHandlerException {
    LOG.debug("Got {} operation, going to extract text from provided pdf.", pdfConfiguration.getOperation());
    PDDocument document = exchange.getIn().getBody(PDDocument.class);
    // Encrypted documents can only be read if the caller supplied decryption material in a header.
    if (document.isEncrypted()) {
        DecryptionMaterial decryptionMaterial = exchange.getIn().getHeader(DECRYPTION_MATERIAL_HEADER_NAME, DecryptionMaterial.class);
        if (decryptionMaterial == null) {
            throw new IllegalArgumentException(String.format("%s header is expected for %s operation " + "on encrypted document", DECRYPTION_MATERIAL_HEADER_NAME, pdfConfiguration.getOperation()));
        }
        document.openProtection(decryptionMaterial);
    }
    // Extract the text of the whole document as a single String.
    PDFTextStripper pdfTextStripper = new PDFTextStripper();
    return pdfTextStripper.getText(document);
}

Also used: DecryptionMaterial (org.apache.pdfbox.pdmodel.encryption.DecryptionMaterial), PDDocument (org.apache.pdfbox.pdmodel.PDDocument), PDFTextStripper (org.apache.pdfbox.util.PDFTextStripper)

Example 2 with PDFTextStripper

Use of org.apache.pdfbox.util.PDFTextStripper in the camel project by apache.

Class PdfAppendTest, method testAppend.

@Test
public void testAppend() throws Exception {
    final String originalText = "Test";
    final String textToAppend = "Append";
    PDDocument document = new PDDocument();
    PDPage page = new PDPage(PDPage.PAGE_SIZE_A4);
    document.addPage(page);
    PDPageContentStream contentStream = new PDPageContentStream(document, page);
    contentStream.setFont(PDType1Font.HELVETICA, 12);
    contentStream.beginText();
    contentStream.moveTextPositionByAmount(20, 400);
    contentStream.drawString(originalText);
    contentStream.endText();
    contentStream.close();
    template.sendBodyAndHeader("direct:start", textToAppend, PdfHeaderConstants.PDF_DOCUMENT_HEADER_NAME, document);
    resultEndpoint.setExpectedMessageCount(1);
    resultEndpoint.expectedMessagesMatches(new Predicate() {

        @Override
        public boolean matches(Exchange exchange) {
            Object body = exchange.getIn().getBody();
            assertThat(body, instanceOf(ByteArrayOutputStream.class));
            try {
                PDDocument doc = PDDocument.load(new ByteArrayInputStream(((ByteArrayOutputStream) body).toByteArray()));
                PDFTextStripper pdfTextStripper = new PDFTextStripper();
                String text = pdfTextStripper.getText(doc);
                assertEquals(2, doc.getNumberOfPages());
                assertThat(text, containsString(originalText));
                assertThat(text, containsString(textToAppend));
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
            return true;
        }
    });
    resultEndpoint.assertIsSatisfied();
}

Also used: Exchange (org.apache.camel.Exchange), PDPage (org.apache.pdfbox.pdmodel.PDPage), ByteArrayInputStream (java.io.ByteArrayInputStream), PDDocument (org.apache.pdfbox.pdmodel.PDDocument), PDPageContentStream (org.apache.pdfbox.pdmodel.edit.PDPageContentStream), Matchers.containsString (org.hamcrest.Matchers.containsString), IOException (java.io.IOException), Predicate (org.apache.camel.Predicate), PDFTextStripper (org.apache.pdfbox.util.PDFTextStripper), Test (org.junit.Test)

Example 3 with PDFTextStripper

Use of org.apache.pdfbox.util.PDFTextStripper in the camel project by apache.

Class PdfAppendTest, method testAppendEncrypted.

@Test
public void testAppendEncrypted() throws Exception {
    final String originalText = "Test";
    final String textToAppend = "Append";
    PDDocument document = new PDDocument();
    PDPage page = new PDPage(PDPage.PAGE_SIZE_A4);
    document.addPage(page);
    PDPageContentStream contentStream = new PDPageContentStream(document, page);
    contentStream.setFont(PDType1Font.HELVETICA, 12);
    contentStream.beginText();
    contentStream.moveTextPositionByAmount(20, 400);
    contentStream.drawString(originalText);
    contentStream.endText();
    contentStream.close();
    final String ownerPass = "ownerPass";
    final String userPass = "userPass";
    AccessPermission accessPermission = new AccessPermission();
    accessPermission.setCanExtractContent(false);
    StandardProtectionPolicy protectionPolicy = new StandardProtectionPolicy(ownerPass, userPass, accessPermission);
    protectionPolicy.setEncryptionKeyLength(128);
    document.protect(protectionPolicy);
    ByteArrayOutputStream output = new ByteArrayOutputStream();
    document.save(output);
    // Encryption happens after saving.
    PDDocument encryptedDocument = PDDocument.load(new ByteArrayInputStream(output.toByteArray()));
    Map<String, Object> headers = new HashMap<String, Object>();
    headers.put(PdfHeaderConstants.PDF_DOCUMENT_HEADER_NAME, encryptedDocument);
    headers.put(PdfHeaderConstants.DECRYPTION_MATERIAL_HEADER_NAME, new StandardDecryptionMaterial(userPass));
    template.sendBodyAndHeaders("direct:start", textToAppend, headers);
    resultEndpoint.setExpectedMessageCount(1);
    resultEndpoint.expectedMessagesMatches(new Predicate() {

        @Override
        public boolean matches(Exchange exchange) {
            Object body = exchange.getIn().getBody();
            assertThat(body, instanceOf(ByteArrayOutputStream.class));
            try {
                PDDocument doc = PDDocument.load(new ByteArrayInputStream(((ByteArrayOutputStream) body).toByteArray()));
                PDFTextStripper pdfTextStripper = new PDFTextStripper();
                String text = pdfTextStripper.getText(doc);
                assertEquals(2, doc.getNumberOfPages());
                assertThat(text, containsString(originalText));
                assertThat(text, containsString(textToAppend));
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
            return true;
        }
    });
    resultEndpoint.assertIsSatisfied();
}

Also used: PDPage (org.apache.pdfbox.pdmodel.PDPage), HashMap (java.util.HashMap), StandardProtectionPolicy (org.apache.pdfbox.pdmodel.encryption.StandardProtectionPolicy), AccessPermission (org.apache.pdfbox.pdmodel.encryption.AccessPermission), StandardDecryptionMaterial (org.apache.pdfbox.pdmodel.encryption.StandardDecryptionMaterial), Matchers.containsString (org.hamcrest.Matchers.containsString), ByteArrayOutputStream (java.io.ByteArrayOutputStream), IOException (java.io.IOException), Predicate (org.apache.camel.Predicate), Exchange (org.apache.camel.Exchange), ByteArrayInputStream (java.io.ByteArrayInputStream), PDDocument (org.apache.pdfbox.pdmodel.PDDocument), PDPageContentStream (org.apache.pdfbox.pdmodel.edit.PDPageContentStream), PDFTextStripper (org.apache.pdfbox.util.PDFTextStripper), Test (org.junit.Test)

Example 4 with PDFTextStripper

Use of org.apache.pdfbox.util.PDFTextStripper in the OpenOLAT project by OpenOLAT.

Class PdfBoxExtractor, method extractTextFromPdf.

private FileContent extractTextFromPdf(VFSLeaf leaf) throws IOException, DocumentAccessException {
    if (log.isDebug())
        log.debug("readContent from pdf starts...");
    PDDocument document = null;
    BufferedInputStream bis = null;
    try {
        bis = new BufferedInputStream(leaf.getInputStream());
        document = PDDocument.load(bis);
        if (document.isEncrypted()) {
            try {
                // Try the empty password first.
                document.decrypt("");
            } catch (Exception e) {
                // Decryption failed: fall back to returning only the file name as content.
                log.warn("PDF is encrypted. Can not read content file=" + leaf.getName());
                LimitedContentWriter writer = new LimitedContentWriter(128, FileDocumentFactory.getMaxFileSize());
                writer.append(leaf.getName());
                writer.close();
                return new FileContent(leaf.getName(), writer.toString());
            }
        }
        String title = getTitle(document);
        if (log.isDebug())
            log.debug("readContent PDDocument loaded");
        PDFTextStripper stripper = new PDFTextStripper();
        LimitedContentWriter writer = new LimitedContentWriter(50000, FileDocumentFactory.getMaxFileSize());
        stripper.writeText(document, writer);
        writer.close();
        return new FileContent(title, writer.toString());
    } finally {
        if (document != null) {
            document.close();
        }
        if (bis != null) {
            bis.close();
        }
    }
}

Also used: LimitedContentWriter (org.olat.core.util.io.LimitedContentWriter), FileContent (org.olat.search.service.document.file.FileContent), BufferedInputStream (java.io.BufferedInputStream), PDDocument (org.apache.pdfbox.pdmodel.PDDocument), IOException (java.io.IOException), DocumentAccessException (org.olat.search.service.document.file.DocumentAccessException), PDFTextStripper (org.apache.pdfbox.util.PDFTextStripper)
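
Example 4 passes a size-capped Writer to PDFTextStripper.writeText so that a very large PDF cannot produce an unbounded string. The source of LimitedContentWriter is not shown here, so the following is only a minimal sketch of that idea using java.io alone. BoundedWriterSketch, BoundedWriter, maxChars and isFull are hypothetical names, and the real OpenOLAT class (which takes two constructor arguments) may behave differently, for example by signalling when the limit is reached instead of dropping input silently.

```java
import java.io.Writer;

class BoundedWriterSketch {

    /**
     * Hypothetical stand-in for a size-limited writer: keeps at most
     * maxChars characters and silently drops everything after that.
     * This is an assumption, not the OpenOLAT implementation.
     */
    static final class BoundedWriter extends Writer {
        private final StringBuilder buffer = new StringBuilder();
        private final int maxChars;

        BoundedWriter(int maxChars) {
            if (maxChars < 0) {
                throw new IllegalArgumentException("maxChars must not be negative");
            }
            this.maxChars = maxChars;
        }

        @Override
        public void write(char[] cbuf, int off, int len) {
            int remaining = maxChars - buffer.length();
            if (remaining <= 0) {
                return; // limit reached, ignore further input
            }
            buffer.append(cbuf, off, Math.min(len, remaining));
        }

        @Override
        public void write(String str) {
            write(str.toCharArray(), 0, str.length());
        }

        @Override
        public void flush() {
            // nothing to flush, content is held in memory
        }

        @Override
        public void close() {
            // nothing to release; toString() stays usable after close()
        }

        /** True once the character limit has been reached. */
        boolean isFull() {
            return buffer.length() >= maxChars;
        }

        @Override
        public String toString() {
            return buffer.toString();
        }
    }

    public static void main(String[] args) {
        // In Example 4 the writer would be passed to stripper.writeText(document, writer).
        try (BoundedWriter writer = new BoundedWriter(10)) {
            writer.write("Extracted PDF text that is too long");
            System.out.println(writer + " (full: " + writer.isFull() + ")");
        }
    }
}
```

Because the sketch extends java.io.Writer, it could be handed to any API that accepts a Writer, in the same way Example 4 hands its writer to the stripper. Dropping the overflow silently keeps extraction running; whether that or an early abort is preferable depends on how expensive parsing the remaining pages is.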

Example 5 with PDFTextStripper

Use of org.apache.pdfbox.util.PDFTextStripper in the portfolio project by buchen.

Class PDFInputFile, method parse.

public void parse() throws IOException {
    try (PDDocument document = PDDocument.load(getFile())) {
        PDDocumentInformation pdd = document.getDocumentInformation();
        author = pdd.getAuthor() == null ? "" : pdd.getAuthor(); //$NON-NLS-1$
        PDFTextStripper textStripper = new PDFTextStripper();
        textStripper.setSortByPosition(true);
        text = textStripper.getText(document);
    }
}

Also used: PDDocument (org.apache.pdfbox.pdmodel.PDDocument), PDDocumentInformation (org.apache.pdfbox.pdmodel.PDDocumentInformation), PDFTextStripper (org.apache.pdfbox.util.PDFTextStripper)
