Example 1 with XSSFShape

Use of org.apache.poi.xssf.usermodel.XSSFShape in the Apache Tika project.

Class XSSFBExcelExtractorDecorator, method buildXHTML.

/**
 * @see org.apache.poi.xssf.extractor.XSSFBEventBasedExcelExtractor#getText()
 */
@Override
protected void buildXHTML(XHTMLContentHandler xhtml) throws SAXException, XmlException, IOException {
    OPCPackage container = extractor.getPackage();
    XSSFBSharedStringsTable strings;
    XSSFBReader.SheetIterator iter;
    XSSFBReader xssfReader;
    XSSFBStylesTable styles;
    try {
        xssfReader = new XSSFBReader(container);
        styles = xssfReader.getXSSFBStylesTable();
        iter = (XSSFBReader.SheetIterator) xssfReader.getSheetsData();
        strings = new XSSFBSharedStringsTable(container);
    } catch (InvalidFormatException e) {
        throw new XmlException(e);
    } catch (OpenXML4JException oe) {
        throw new XmlException(oe);
    }
    while (iter.hasNext()) {
        InputStream stream = iter.next();
        PackagePart sheetPart = iter.getSheetPart();
        addDrawingHyperLinks(sheetPart);
        sheetParts.add(sheetPart);
        SheetTextAsHTML sheetExtractor = new SheetTextAsHTML(xhtml);
        XSSFBCommentsTable comments = iter.getXSSFBSheetComments();
        // Start, and output the sheet name
        xhtml.startElement("div");
        xhtml.element("h1", iter.getSheetName());
        // Extract the main sheet contents
        xhtml.startElement("table");
        xhtml.startElement("tbody");
        processSheet(sheetExtractor, comments, styles, strings, stream);
        xhtml.endElement("tbody");
        xhtml.endElement("table");
        // Output any headers and footers. The sheet has to be processed
        // first to collect them, so they cannot be written before the contents.
        for (String header : sheetExtractor.headers) {
            extractHeaderFooter(header, xhtml);
        }
        for (String footer : sheetExtractor.footers) {
            extractHeaderFooter(footer, xhtml);
        }
        List<XSSFShape> shapes = iter.getShapes();
        processShapes(shapes, xhtml);
        //for now dump sheet hyperlinks at bottom of page
        //consider a double-pass of the inputstream to reunite hyperlinks with cells/textboxes
        //step 1: extract hyperlink info from bottom of page
        //step 2: process as we do now, but with cached hyperlink relationship info
        extractHyperLinks(sheetPart, xhtml);
        // All done with this sheet
        xhtml.endElement("div");
    }
}
Also used : XSSFBReader(org.apache.poi.xssf.eventusermodel.XSSFBReader) XSSFBCommentsTable(org.apache.poi.xssf.binary.XSSFBCommentsTable) InputStream(java.io.InputStream) PackagePart(org.apache.poi.openxml4j.opc.PackagePart) XSSFBStylesTable(org.apache.poi.xssf.binary.XSSFBStylesTable) InvalidFormatException(org.apache.poi.openxml4j.exceptions.InvalidFormatException) XSSFShape(org.apache.poi.xssf.usermodel.XSSFShape) OpenXML4JException(org.apache.poi.openxml4j.exceptions.OpenXML4JException) XmlException(org.apache.xmlbeans.XmlException) XSSFBSharedStringsTable(org.apache.poi.xssf.binary.XSSFBSharedStringsTable) OPCPackage(org.apache.poi.openxml4j.opc.OPCPackage)

Example 2 with XSSFShape

Use of org.apache.poi.xssf.usermodel.XSSFShape in the Apache Tika project.

Class XSSFExcelExtractorDecorator, method processShapes.

private void processShapes(List<XSSFShape> shapes, XHTMLContentHandler xhtml) throws SAXException {
    if (shapes == null) {
        return;
    }
    for (XSSFShape shape : shapes) {
        if (shape instanceof XSSFSimpleShape) {
            String sText = ((XSSFSimpleShape) shape).getText();
            if (sText != null && sText.length() > 0) {
                xhtml.element("p", sText);
            }
            extractHyperLinksFromShape(((XSSFSimpleShape) shape).getCTShape(), xhtml);
        }
    }
}
Also used : XSSFShape(org.apache.poi.xssf.usermodel.XSSFShape) XSSFSimpleShape(org.apache.poi.xssf.usermodel.XSSFSimpleShape)
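
The filtering logic in processShapes above (guard against a null list, keep only simple shapes, skip empty text) can be tried without POI on the classpath. The following is a minimal sketch under stated assumptions: `Shape`, `SimpleShape`, `Picture` and `collectText` are hypothetical stand-ins written for this illustration, not POI or Tika types, and the sketch collects the text into a list rather than writing XHTML.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for XSSFShape / XSSFSimpleShape; not part of POI or Tika.
final class ShapeTextSketch {

    interface Shape {}

    // Plays the role of XSSFSimpleShape: a shape that can carry text.
    static final class SimpleShape implements Shape {
        private final String text;
        SimpleShape(String text) { this.text = text; }
        String getText() { return text; }
    }

    // Plays the role of any shape type that carries no text.
    static final class Picture implements Shape {}

    // Returns the non-empty text of every simple shape, in list order.
    static List<String> collectText(List<? extends Shape> shapes) {
        List<String> out = new ArrayList<>();
        if (shapes == null) {
            return out; // same early exit as processShapes
        }
        for (Shape shape : shapes) {
            if (shape instanceof SimpleShape) {
                String text = ((SimpleShape) shape).getText();
                if (text != null && !text.isEmpty()) {
                    out.add(text);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Shape> shapes = Arrays.<Shape>asList(
                new SimpleShape("Hello"), new Picture(), new SimpleShape(""));
        System.out.println(collectText(shapes));
    }
}
```

Only shapes that pass the instanceof check are cast, so shape types without text never reach getText().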

Example 3 with XSSFShape

Use of org.apache.poi.xssf.usermodel.XSSFShape in the Apache Tika project.

Class XSSFExcelExtractorDecorator, method buildXHTML.

/**
 * @see org.apache.poi.xssf.extractor.XSSFExcelExtractor#getText()
 */
@Override
protected void buildXHTML(XHTMLContentHandler xhtml) throws SAXException, XmlException, IOException {
    OPCPackage container = extractor.getPackage();
    ReadOnlySharedStringsTable strings;
    XSSFReader.SheetIterator iter;
    XSSFReader xssfReader;
    StylesTable styles;
    try {
        xssfReader = new XSSFReader(container);
        styles = xssfReader.getStylesTable();
        iter = (XSSFReader.SheetIterator) xssfReader.getSheetsData();
        strings = new ReadOnlySharedStringsTable(container);
    } catch (InvalidFormatException e) {
        throw new XmlException(e);
    } catch (OpenXML4JException oe) {
        throw new XmlException(oe);
    }
    //temporary workaround for POI-61034
    //remove once POI 3.17-beta1 is released
    Set<String> seen = new HashSet<>();
    while (iter.hasNext()) {
        SheetTextAsHTML sheetExtractor = new SheetTextAsHTML(xhtml);
        PackagePart sheetPart = null;
        try (InputStream stream = iter.next()) {
            sheetPart = iter.getSheetPart();
            final String partName = sheetPart.getPartName().toString();
            if (seen.contains(partName)) {
                continue;
            }
            seen.add(partName);
            addDrawingHyperLinks(sheetPart);
            sheetParts.add(sheetPart);
            CommentsTable comments = iter.getSheetComments();
            // Start, and output the sheet name
            xhtml.startElement("div");
            xhtml.element("h1", iter.getSheetName());
            // Extract the main sheet contents
            xhtml.startElement("table");
            xhtml.startElement("tbody");
            processSheet(sheetExtractor, comments, styles, strings, stream);
        }
        xhtml.endElement("tbody");
        xhtml.endElement("table");
        // Output any headers and footers. The sheet has to be processed
        // first to collect them, so they cannot be written before the contents.
        for (String header : sheetExtractor.headers) {
            extractHeaderFooter(header, xhtml);
        }
        for (String footer : sheetExtractor.footers) {
            extractHeaderFooter(footer, xhtml);
        }
        // Do text held in shapes, if required
        if (config.getIncludeShapeBasedContent()) {
            List<XSSFShape> shapes = iter.getShapes();
            processShapes(shapes, xhtml);
        }
        //for now dump sheet hyperlinks at bottom of page
        //consider a double-pass of the inputstream to reunite hyperlinks with cells/textboxes
        //step 1: extract hyperlink info from bottom of page
        //step 2: process as we do now, but with cached hyperlink relationship info
        extractHyperLinks(sheetPart, xhtml);
        // All done with this sheet
        xhtml.endElement("div");
    }
}
Also used : ReadOnlySharedStringsTable(org.apache.poi.xssf.eventusermodel.ReadOnlySharedStringsTable) InputStream(java.io.InputStream) StylesTable(org.apache.poi.xssf.model.StylesTable) PackagePart(org.apache.poi.openxml4j.opc.PackagePart) InvalidFormatException(org.apache.poi.openxml4j.exceptions.InvalidFormatException) CommentsTable(org.apache.poi.xssf.model.CommentsTable) XSSFShape(org.apache.poi.xssf.usermodel.XSSFShape) OpenXML4JException(org.apache.poi.openxml4j.exceptions.OpenXML4JException) XmlException(org.apache.xmlbeans.XmlException) OPCPackage(org.apache.poi.openxml4j.opc.OPCPackage) XSSFReader(org.apache.poi.xssf.eventusermodel.XSSFReader) HashSet(java.util.HashSet)
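
The `seen` set in buildXHTML above skips any sheet part whose name has already been processed. The following is a minimal sketch of that de-duplication step on its own, using only the standard library. The class and method names (`SheetPartDedup`, `firstOccurrences`) and the sample part names are hypothetical, chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of Tika or POI.
final class SheetPartDedup {

    // Returns the part names in iteration order, dropping repeats.
    static List<String> firstOccurrences(Iterable<String> partNames) {
        Set<String> seen = new HashSet<>();
        List<String> unique = new ArrayList<>();
        for (String partName : partNames) {
            // Set.add returns false if the name was already present, so the
            // contains-then-add pair from the example collapses into one call.
            if (!seen.add(partName)) {
                continue;
            }
            unique.add(partName);
        }
        return unique;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
                "/xl/worksheets/sheet1.xml",
                "/xl/worksheets/sheet2.xml",
                "/xl/worksheets/sheet1.xml");
        System.out.println(firstOccurrences(names));
    }
}
```

In the example above the check has to happen before any XHTML is written for the sheet, otherwise a skipped duplicate would leave an unclosed div behind.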

Example 4 with XSSFShape

Use of org.apache.poi.xssf.usermodel.XSSFShape in the Apache Tika project.

Class XSSFBExcelExtractorDecorator, method processShapes.

private void processShapes(List<XSSFShape> shapes, XHTMLContentHandler xhtml) throws SAXException {
    if (shapes == null) {
        return;
    }
    for (XSSFShape shape : shapes) {
        if (shape instanceof XSSFSimpleShape) {
            String sText = ((XSSFSimpleShape) shape).getText();
            if (sText != null && sText.length() > 0) {
                xhtml.element("p", sText);
            }
            extractHyperLinksFromShape(((XSSFSimpleShape) shape).getCTShape(), xhtml);
        }
    }
}
Also used : XSSFShape(org.apache.poi.xssf.usermodel.XSSFShape) XSSFSimpleShape(org.apache.poi.xssf.usermodel.XSSFSimpleShape)

Example 5 with XSSFShape

Use of org.apache.poi.xssf.usermodel.XSSFShape in the Apache POI project.

Class XSSFExcelExtractor, method getText.

/**
 * Retrieves the text contents of the file
 */
public String getText() {
    DataFormatter formatter;
    if (locale == null) {
        formatter = new DataFormatter();
    } else {
        formatter = new DataFormatter(locale);
    }
    StringBuffer text = new StringBuffer();
    for (Sheet sh : workbook) {
        XSSFSheet sheet = (XSSFSheet) sh;
        if (includeSheetNames) {
            text.append(sheet.getSheetName()).append("\n");
        }
        // Header(s), if present
        if (includeHeadersFooters) {
            text.append(extractHeaderFooter(sheet.getFirstHeader()));
            text.append(extractHeaderFooter(sheet.getOddHeader()));
            text.append(extractHeaderFooter(sheet.getEvenHeader()));
        }
        // Rows and cells
        for (Object rawR : sheet) {
            Row row = (Row) rawR;
            for (Iterator<Cell> ri = row.cellIterator(); ri.hasNext(); ) {
                Cell cell = ri.next();
                // Is it a formula one?
                if (cell.getCellTypeEnum() == CellType.FORMULA) {
                    if (formulasNotResults) {
                        String contents = cell.getCellFormula();
                        checkMaxTextSize(text, contents);
                        text.append(contents);
                    } else {
                        if (cell.getCachedFormulaResultTypeEnum() == CellType.STRING) {
                            handleStringCell(text, cell);
                        } else {
                            handleNonStringCell(text, cell, formatter);
                        }
                    }
                } else if (cell.getCellTypeEnum() == CellType.STRING) {
                    handleStringCell(text, cell);
                } else {
                    handleNonStringCell(text, cell, formatter);
                }
                // Output the comment, if requested and exists
                Comment comment = cell.getCellComment();
                if (includeCellComments && comment != null) {
                    // Replace any newlines with spaces, otherwise it
                    //  breaks the output
                    String commentText = comment.getString().getString().replace('\n', ' ');
                    checkMaxTextSize(text, commentText);
                    text.append(" Comment by ").append(comment.getAuthor()).append(": ").append(commentText);
                }
                if (ri.hasNext()) {
                    text.append("\t");
                }
            }
            text.append("\n");
        }
        // add textboxes
        if (includeTextBoxes) {
            XSSFDrawing drawing = sheet.getDrawingPatriarch();
            if (drawing != null) {
                for (XSSFShape shape : drawing.getShapes()) {
                    if (shape instanceof XSSFSimpleShape) {
                        String boxText = ((XSSFSimpleShape) shape).getText();
                        if (boxText.length() > 0) {
                            text.append(boxText);
                            text.append('\n');
                        }
                    }
                }
            }
        }
        // Finally footer(s), if present
        if (includeHeadersFooters) {
            text.append(extractHeaderFooter(sheet.getFirstFooter()));
            text.append(extractHeaderFooter(sheet.getOddFooter()));
            text.append(extractHeaderFooter(sheet.getEvenFooter()));
        }
    }
    return text.toString();
}
Also used : Comment(org.apache.poi.ss.usermodel.Comment) XSSFSimpleShape(org.apache.poi.xssf.usermodel.XSSFSimpleShape) XSSFShape(org.apache.poi.xssf.usermodel.XSSFShape) XSSFSheet(org.apache.poi.xssf.usermodel.XSSFSheet) Row(org.apache.poi.ss.usermodel.Row) Sheet(org.apache.poi.ss.usermodel.Sheet) Cell(org.apache.poi.ss.usermodel.Cell) XSSFCell(org.apache.poi.xssf.usermodel.XSSFCell) XSSFDrawing(org.apache.poi.xssf.usermodel.XSSFDrawing) DataFormatter(org.apache.poi.ss.usermodel.DataFormatter)

Aggregations

XSSFShape (org.apache.poi.xssf.usermodel.XSSFShape): 6
XSSFSimpleShape (org.apache.poi.xssf.usermodel.XSSFSimpleShape): 4
InputStream (java.io.InputStream): 2
InvalidFormatException (org.apache.poi.openxml4j.exceptions.InvalidFormatException): 2
OpenXML4JException (org.apache.poi.openxml4j.exceptions.OpenXML4JException): 2
OPCPackage (org.apache.poi.openxml4j.opc.OPCPackage): 2
PackagePart (org.apache.poi.openxml4j.opc.PackagePart): 2
XmlException (org.apache.xmlbeans.XmlException): 2
HashSet (java.util.HashSet): 1
Cell (org.apache.poi.ss.usermodel.Cell): 1
Comment (org.apache.poi.ss.usermodel.Comment): 1
DataFormatter (org.apache.poi.ss.usermodel.DataFormatter): 1
Row (org.apache.poi.ss.usermodel.Row): 1
Sheet (org.apache.poi.ss.usermodel.Sheet): 1
XSSFBCommentsTable (org.apache.poi.xssf.binary.XSSFBCommentsTable): 1
XSSFBSharedStringsTable (org.apache.poi.xssf.binary.XSSFBSharedStringsTable): 1
XSSFBStylesTable (org.apache.poi.xssf.binary.XSSFBStylesTable): 1
ReadOnlySharedStringsTable (org.apache.poi.xssf.eventusermodel.ReadOnlySharedStringsTable): 1
XSSFBReader (org.apache.poi.xssf.eventusermodel.XSSFBReader): 1
XSSFReader (org.apache.poi.xssf.eventusermodel.XSSFReader): 1