Example 6 with Table

use of technology.tabula.Table in project drill by apache.

the class PdfUtils method getSpecificTable.

/**
 * Returns a specific table from a PDF document. Returns null if the user
 * requests a table that does not exist. If the extractor cannot be closed,
 * the function throws a UserException.
 * @param document The source PDF document
 * @param tableIndex The zero-based index of the desired table
 * @param algorithm The extraction algorithm to use, or null to use DEFAULT_ALGORITHM
 * @return The desired Table, or null if the index is out of range or the document has no tables.
 */
public static Table getSpecificTable(PDDocument document, int tableIndex, ExtractionAlgorithm algorithm) {
    NurminenDetectionAlgorithm detectionAlgorithm = new NurminenDetectionAlgorithm();
    // Fall back to the default algorithm when the caller passes null.
    ExtractionAlgorithm algExtractor = (algorithm == null) ? DEFAULT_ALGORITHM : algorithm;
    ObjectExtractor objectExtractor = new ObjectExtractor(document);
    PageIterator pages = objectExtractor.extract();
    int tableCounter = 0;
    while (pages.hasNext()) {
        Page page = pages.next();
        List<Rectangle> rectanglesOnPage = detectionAlgorithm.detect(page);
        for (Rectangle guessRect : rectanglesOnPage) {
            Page guess = page.getArea(guessRect);
            // Count only the tables extracted from this detected area, so that
            // tables from earlier areas on the page are not counted twice.
            // An area that yields no tables is simply skipped.
            for (Table table : algExtractor.extract(guess)) {
                if (tableCounter == tableIndex) {
                    return table;
                }
                tableCounter++;
            }
        }
    }
    try {
        objectExtractor.close();
    } catch (Exception e) {
        throw UserException.parseError(e).message("Error extracting table: " + e.getMessage()).build(logger);
    }
    return null;
}
Also used : ExtractionAlgorithm(technology.tabula.extractors.ExtractionAlgorithm) SpreadsheetExtractionAlgorithm(technology.tabula.extractors.SpreadsheetExtractionAlgorithm) BasicExtractionAlgorithm(technology.tabula.extractors.BasicExtractionAlgorithm) PageIterator(technology.tabula.PageIterator) Table(technology.tabula.Table) Rectangle(technology.tabula.Rectangle) ArrayList(java.util.ArrayList) ObjectExtractor(technology.tabula.ObjectExtractor) Page(technology.tabula.Page) NurminenDetectionAlgorithm(technology.tabula.detectors.NurminenDetectionAlgorithm) UserException(org.apache.drill.common.exceptions.UserException)
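
The index lookup in getSpecificTable amounts to counting items across a nested iteration (pages, then detected areas, then tables) and stopping at the requested zero-based position. The sketch below isolates that counting logic with plain lists of strings as stand-ins for the PDF types. The class name NthTableSketch and the method findNth are hypothetical and not part of Drill or tabula, and the sketch returns an Optional where the original method returns null.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical stand-in: pages -> detected areas -> tables, modeled as nested lists.
public class NthTableSketch {

    /**
     * Returns the item at the given zero-based index, counting across all
     * pages and areas in document order, or empty if the index is out of range.
     */
    public static Optional<String> findNth(List<List<List<String>>> pages, int index) {
        int counter = 0;
        for (List<List<String>> areas : pages) {
            for (List<String> tables : areas) {
                // Each table is visited exactly once; an empty area is skipped.
                for (String table : tables) {
                    if (counter == index) {
                        return Optional.of(table);
                    }
                    counter++;
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<List<List<String>>> pages = List.of(
            List.of(List.of("t0"), List.<String>of(), List.of("t1", "t2")),
            List.of(List.of("t3")));
        System.out.println(findNth(pages, 2)); // prints Optional[t2]
        System.out.println(findNth(pages, 4)); // prints Optional.empty
    }
}
```

With four items in total, index 4 is out of range, which mirrors the case exercised by testGetSpecificTableOutSideOfBounds.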

Example 7 with Table

use of technology.tabula.Table in project drill by apache.

the class TestPdfUtils method testTableExtractorWithNoBoundingFrame.

@Test
public void testTableExtractorWithNoBoundingFrame() throws Exception {
    PDDocument document = getDocument("spreadsheet_no_bounding_frame.pdf");
    List<Table> tableList = PdfUtils.extractTablesFromPDF(document);
    document.close();
    // JUnit expects the arguments in (expected, actual) order.
    assertEquals(1, tableList.size());
}
Also used : Table(technology.tabula.Table) PDDocument(org.apache.pdfbox.pdmodel.PDDocument) Test(org.junit.Test)

Example 8 with Table

use of technology.tabula.Table in project drill by apache.

the class TestPdfUtils method testTableExtractorWitMultipage.

@Test
public void testTableExtractorWitMultipage() throws Exception {
    PDDocument document = getDocument("us-020.pdf");
    List<Table> tableList = PdfUtils.extractTablesFromPDF(document);
    document.close();
    // JUnit expects the arguments in (expected, actual) order.
    assertEquals(4, tableList.size());
}
Also used : Table(technology.tabula.Table) PDDocument(org.apache.pdfbox.pdmodel.PDDocument) Test(org.junit.Test)

Example 9 with Table

use of technology.tabula.Table in project drill by apache.

the class TestPdfUtils method testGetSpecificTableOutSideOfBounds.

@Test
public void testGetSpecificTableOutSideOfBounds() throws Exception {
    PDDocument document = getDocument("us-020.pdf");
    Table table = PdfUtils.getSpecificTable(document, 4, null);
    assertNull(table);
}
Also used : Table(technology.tabula.Table) PDDocument(org.apache.pdfbox.pdmodel.PDDocument) Test(org.junit.Test)

Aggregations

Table (technology.tabula.Table) 9
PDDocument (org.apache.pdfbox.pdmodel.PDDocument) 7
Test (org.junit.Test) 7
ArrayList (java.util.ArrayList) 2
UserException (org.apache.drill.common.exceptions.UserException) 2
ObjectExtractor (technology.tabula.ObjectExtractor) 2
Page (technology.tabula.Page) 2
PageIterator (technology.tabula.PageIterator) 2
Rectangle (technology.tabula.Rectangle) 2
NurminenDetectionAlgorithm (technology.tabula.detectors.NurminenDetectionAlgorithm) 2
BasicExtractionAlgorithm (technology.tabula.extractors.BasicExtractionAlgorithm) 2
ExtractionAlgorithm (technology.tabula.extractors.ExtractionAlgorithm) 2
SpreadsheetExtractionAlgorithm (technology.tabula.extractors.SpreadsheetExtractionAlgorithm) 2