Example 1 with SpreadsheetExtractionAlgorithm

Use of technology.tabula.extractors.SpreadsheetExtractionAlgorithm in project drill by apache.

The class PdfUtils, method extractTablesFromPDF.

/**
 * Returns a list of tables found in a given PDF document.  There are several extraction algorithms
 * available and this function allows the user to select which to use.
 * @param document The input PDF document to search for tables
 * @param algorithm The extraction algorithm
 * @return A list of tables found in the document.
 */
public static List<Table> extractTablesFromPDF(PDDocument document, ExtractionAlgorithm algorithm) {
    // Detects candidate table regions on each page.
    NurminenDetectionAlgorithm detectionAlgorithm = new NurminenDetectionAlgorithm();
    ExtractionAlgorithm algExtractor;
    // Instantiated here but never used below; the caller-supplied algorithm does the extraction.
    SpreadsheetExtractionAlgorithm extractor = new SpreadsheetExtractionAlgorithm();
    ObjectExtractor objectExtractor = new ObjectExtractor(document);
    PageIterator pages = objectExtractor.extract();
    List<Table> tables = new ArrayList<>();
    while (pages.hasNext()) {
        Page page = pages.next();
        // The same algorithm is reassigned on every page.
        algExtractor = algorithm;
        List<Rectangle> tablesOnPage = detectionAlgorithm.detect(page);
        for (Rectangle guessRect : tablesOnPage) {
            // Crop the page to the detected region, then extract tables from that area only.
            Page guess = page.getArea(guessRect);
            tables.addAll(algExtractor.extract(guess));
        }
    }
    try {
        objectExtractor.close();
    } catch (Exception e) {
        throw UserException.parseError(e).message("Error extracting table: " + e.getMessage()).build(logger);
    }
    return tables;
}
Also used: ExtractionAlgorithm (technology.tabula.extractors.ExtractionAlgorithm), SpreadsheetExtractionAlgorithm (technology.tabula.extractors.SpreadsheetExtractionAlgorithm), BasicExtractionAlgorithm (technology.tabula.extractors.BasicExtractionAlgorithm), PageIterator (technology.tabula.PageIterator), Table (technology.tabula.Table), ArrayList (java.util.ArrayList), Rectangle (technology.tabula.Rectangle), ObjectExtractor (technology.tabula.ObjectExtractor), Page (technology.tabula.Page), NurminenDetectionAlgorithm (technology.tabula.detectors.NurminenDetectionAlgorithm), UserException (org.apache.drill.common.exceptions.UserException)
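
The method above calls close() only after the page loop has finished, so an exception thrown during detection or extraction would skip it. One possible way to guarantee the close is try-with-resources. The sketch below illustrates the pattern with a hypothetical stand-in class (FakeExtractor) rather than the real tabula types; whether ObjectExtractor itself can be used in a try-with-resources statement depends on it implementing AutoCloseable in the tabula version in use, which is an assumption here and not something the example confirms.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class CloseSketch {

    // Hypothetical stand-in for a closeable extractor; this is NOT a tabula class.
    static class FakeExtractor implements AutoCloseable {
        private final List<String> log;
        private final boolean failOnExtract;

        FakeExtractor(List<String> log, boolean failOnExtract) {
            this.log = log;
            this.failOnExtract = failOnExtract;
        }

        // Simulates extraction, optionally failing part-way through.
        List<String> extract() {
            if (failOnExtract) {
                throw new IllegalStateException("extraction failed");
            }
            log.add("extracted");
            return Collections.singletonList("table-1");
        }

        @Override
        public void close() {
            log.add("closed");
        }
    }

    // Returns a log of events so the close-on-failure behaviour is observable.
    static List<String> run(boolean failOnExtract) {
        List<String> log = new ArrayList<>();
        // The resource is closed when the try block exits, before any catch clause runs.
        try (FakeExtractor extractor = new FakeExtractor(log, failOnExtract)) {
            extractor.extract();
        } catch (IllegalStateException e) {
            log.add("error: " + e.getMessage());
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run(false)); // [extracted, closed]
        System.out.println(run(true));  // [closed, error: extraction failed]
    }
}
```

In both runs the log contains "closed", which is the property the original try/catch placement does not give when extraction itself throws.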

Aggregations

ArrayList (java.util.ArrayList): 1
UserException (org.apache.drill.common.exceptions.UserException): 1
ObjectExtractor (technology.tabula.ObjectExtractor): 1
Page (technology.tabula.Page): 1
PageIterator (technology.tabula.PageIterator): 1
Rectangle (technology.tabula.Rectangle): 1
Table (technology.tabula.Table): 1
NurminenDetectionAlgorithm (technology.tabula.detectors.NurminenDetectionAlgorithm): 1
BasicExtractionAlgorithm (technology.tabula.extractors.BasicExtractionAlgorithm): 1
ExtractionAlgorithm (technology.tabula.extractors.ExtractionAlgorithm): 1
SpreadsheetExtractionAlgorithm (technology.tabula.extractors.SpreadsheetExtractionAlgorithm): 1