Example 1 with XSSFBEventBasedExcelExtractor

Use of org.apache.poi.xssf.extractor.XSSFBEventBasedExcelExtractor in the Apache POI project.

Class XSSFBFileHandler, method testOne.

private void testOne(OPCPackage pkg) throws Exception {
    XSSFBEventBasedExcelExtractor ex = new XSSFBEventBasedExcelExtractor(pkg);
    String txt = ex.getText();
    // An extractor that yields no text at all indicates a failed extraction
    if (txt.length() < 1) {
        throw new RuntimeException("Should have gotten some text.");
    }
}
Also used : XSSFBEventBasedExcelExtractor(org.apache.poi.xssf.extractor.XSSFBEventBasedExcelExtractor)

Example 2 with XSSFBEventBasedExcelExtractor

Use of org.apache.poi.xssf.extractor.XSSFBEventBasedExcelExtractor in the Apache POI project.

Class ExtractorFactory, method createExtractor.

/**
 * Tries to determine the actual type of file and produces a matching text-extractor for it.
 *
 * @param pkg An {@link OPCPackage}.
 * @return A {@link POIXMLTextExtractor} for the given file.
 * @throws IOException If an error occurs while reading the file
 * @throws OpenXML4JException If an error parsing the OpenXML file format is found.
 * @throws XmlException If an XML parsing error occurs.
 * @throws IllegalArgumentException If no matching file type could be found.
 */
public static POIXMLTextExtractor createExtractor(OPCPackage pkg) throws IOException, OpenXML4JException, XmlException {
    try {
        // Check for the normal Office core document
        PackageRelationshipCollection core;
        core = pkg.getRelationshipsByType(CORE_DOCUMENT_REL);
        // If nothing was found, try some of the other OOXML-based core types
        if (core.size() == 0) {
            // Could it be an OOXML-Strict one?
            core = pkg.getRelationshipsByType(STRICT_DOCUMENT_REL);
        }
        if (core.size() == 0) {
            // Could it be a visio one?
            core = pkg.getRelationshipsByType(VISIO_DOCUMENT_REL);
            if (core.size() == 1) {
                return new XDGFVisioExtractor(pkg);
            }
        }
        // Should just be a single core document, complain if not
        if (core.size() != 1) {
            throw new IllegalArgumentException("Invalid OOXML Package received - expected 1 core document, found " + core.size());
        }
        // Grab the core document part, and try to identify from that
        final PackagePart corePart = pkg.getPart(core.getRelationship(0));
        final String contentType = corePart.getContentType();
        // Is it XSSF?
        for (XSSFRelation rel : XSSFExcelExtractor.SUPPORTED_TYPES) {
            if (rel.getContentType().equals(contentType)) {
                if (getPreferEventExtractor()) {
                    return new XSSFEventBasedExcelExtractor(pkg);
                }
                return new XSSFExcelExtractor(pkg);
            }
        }
        // Is it XWPF?
        for (XWPFRelation rel : XWPFWordExtractor.SUPPORTED_TYPES) {
            if (rel.getContentType().equals(contentType)) {
                return new XWPFWordExtractor(pkg);
            }
        }
        // Is it XSLF?
        for (XSLFRelation rel : XSLFPowerPointExtractor.SUPPORTED_TYPES) {
            if (rel.getContentType().equals(contentType)) {
                return new XSLFPowerPointExtractor(pkg);
            }
        }
        // Special handling for SlideShow theme files
        if (XSLFRelation.THEME_MANAGER.getContentType().equals(contentType)) {
            return new XSLFPowerPointExtractor(new XSLFSlideShow(pkg));
        }
        // How about xlsb?
        for (XSSFRelation rel : XSSFBEventBasedExcelExtractor.SUPPORTED_TYPES) {
            if (rel.getContentType().equals(contentType)) {
                return new XSSFBEventBasedExcelExtractor(pkg);
            }
        }
        throw new IllegalArgumentException("No supported documents found in the OOXML package (found " + contentType + ")");
    } catch (IOException e) {
        // ensure that we close the package again if there is an error opening it, however
        // we need to revert the package to not re-write the file via close(), which is very likely not wanted for a TextExtractor!
        pkg.revert();
        throw e;
    } catch (OpenXML4JException e) {
        // ensure that we close the package again if there is an error opening it, however
        // we need to revert the package to not re-write the file via close(), which is very likely not wanted for a TextExtractor!
        pkg.revert();
        throw e;
    } catch (XmlException e) {
        // ensure that we close the package again if there is an error opening it, however
        // we need to revert the package to not re-write the file via close(), which is very likely not wanted for a TextExtractor!
        pkg.revert();
        throw e;
    } catch (RuntimeException e) {
        // ensure that we close the package again if there is an error opening it, however
        // we need to revert the package to not re-write the file via close(), which is very likely not wanted for a TextExtractor!
        pkg.revert();
        throw e;
    }
}
Also used : XSSFRelation(org.apache.poi.xssf.usermodel.XSSFRelation) XDGFVisioExtractor(org.apache.poi.xdgf.extractor.XDGFVisioExtractor) XSSFBEventBasedExcelExtractor(org.apache.poi.xssf.extractor.XSSFBEventBasedExcelExtractor) PackageRelationshipCollection(org.apache.poi.openxml4j.opc.PackageRelationshipCollection) XSSFExcelExtractor(org.apache.poi.xssf.extractor.XSSFExcelExtractor) XWPFWordExtractor(org.apache.poi.xwpf.extractor.XWPFWordExtractor) IOException PackagePart(org.apache.poi.openxml4j.opc.PackagePart) XSLFSlideShow(org.apache.poi.xslf.usermodel.XSLFSlideShow) XWPFRelation(org.apache.poi.xwpf.usermodel.XWPFRelation) OpenXML4JException(org.apache.poi.openxml4j.exceptions.OpenXML4JException) XSSFEventBasedExcelExtractor(org.apache.poi.xssf.extractor.XSSFEventBasedExcelExtractor) XSLFPowerPointExtractor(org.apache.poi.xslf.extractor.XSLFPowerPointExtractor) XmlException(org.apache.xmlbeans.XmlException) XSLFRelation(org.apache.poi.xslf.usermodel.XSLFRelation)
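
The dispatch in createExtractor boils down to matching the core part's content type against each extractor's supported types and failing with IllegalArgumentException when nothing matches. The following is a minimal, library-free sketch of that idea only. The class ExtractorDispatchSketch, the method pickExtractor, and the choice to return extractor names as strings are hypothetical and not part of POI; the content-type strings are illustrative examples, whereas the real lists live in each extractor's SUPPORTED_TYPES.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for the content-type dispatch; not POI code. */
final class ExtractorDispatchSketch {

    // Illustrative content types mapped to the extractor that would handle them.
    private static final Map<String, String> EXTRACTOR_BY_TYPE = new LinkedHashMap<>();

    static {
        EXTRACTOR_BY_TYPE.put(
            "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main+xml",
            "XSSFExcelExtractor");
        EXTRACTOR_BY_TYPE.put(
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml",
            "XWPFWordExtractor");
        EXTRACTOR_BY_TYPE.put(
            "application/vnd.openxmlformats-officedocument.presentationml.presentation.main+xml",
            "XSLFPowerPointExtractor");
        EXTRACTOR_BY_TYPE.put(
            "application/vnd.ms-excel.sheet.binary.macroEnabled.main",
            "XSSFBEventBasedExcelExtractor");
    }

    /**
     * Returns the name of the extractor for a content type.
     * preferEvent mirrors the getPreferEventExtractor() switch for xlsx files.
     */
    static String pickExtractor(String contentType, boolean preferEvent) {
        String name = EXTRACTOR_BY_TYPE.get(contentType);
        if (name == null) {
            // Same failure mode as the quoted method: no matching type
            throw new IllegalArgumentException(
                "No supported documents found in the OOXML package (found " + contentType + ")");
        }
        if (preferEvent && name.equals("XSSFExcelExtractor")) {
            return "XSSFEventBasedExcelExtractor";
        }
        return name;
    }

    public static void main(String[] args) {
        System.out.println(
            pickExtractor("application/vnd.ms-excel.sheet.binary.macroEnabled.main", false));
    }
}
```

Note that in the quoted method the xlsb check comes last, so an xlsb package only reaches XSSFBEventBasedExcelExtractor after the XSSF, XWPF and XSLF checks have failed to match.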

Example 3 with XSSFBEventBasedExcelExtractor

Use of org.apache.poi.xssf.extractor.XSSFBEventBasedExcelExtractor in the Apache Tika project.

Class OOXMLExtractorFactory, method parse.

public static void parse(InputStream stream, ContentHandler baseHandler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException {
    Locale locale = context.get(Locale.class, Locale.getDefault());
    try {
        OOXMLExtractor extractor;
        OPCPackage pkg;
        // Locate or Open the OPCPackage for the file
        TikaInputStream tis = TikaInputStream.cast(stream);
        if (tis != null && tis.getOpenContainer() instanceof OPCPackage) {
            pkg = (OPCPackage) tis.getOpenContainer();
        } else if (tis != null && tis.hasFile()) {
            pkg =, PackageAccess.READ);
        } else {
            InputStream shield = new CloseShieldInputStream(stream);
            pkg =;
        }
        // Get the type, and ensure it's one we handle
        MediaType type = ZipContainerDetector.detectOfficeOpenXML(pkg);
        if (type == null || OOXMLParser.UNSUPPORTED_OOXML_TYPES.contains(type)) {
            // Not a supported type, delegate to Empty Parser
            EmptyParser.INSTANCE.parse(stream, baseHandler, metadata, context);
            return;
        }
        metadata.set(Metadata.CONTENT_TYPE, type.toString());
        // Have the appropriate OOXML text extractor picked
        POIXMLTextExtractor poiExtractor = null;
        // This has already been set by OOXMLParser's call to configure()
        // We can rely on this being non-null.
        OfficeParserConfig config = context.get(OfficeParserConfig.class);
        if (config.getUseSAXDocxExtractor()) {
            poiExtractor = trySXWPF(pkg);
        }
        if (poiExtractor == null && config.getUseSAXPptxExtractor()) {
            poiExtractor = trySXSLF(pkg);
        }
        if (poiExtractor == null) {
            poiExtractor = ExtractorFactory.createExtractor(pkg);
        }
        POIXMLDocument document = poiExtractor.getDocument();
        // Wrap the POI extractor in the matching Tika decorator
        if (poiExtractor instanceof XSSFBEventBasedExcelExtractor) {
            extractor = new XSSFBExcelExtractorDecorator(context, poiExtractor, locale);
        } else if (poiExtractor instanceof XSSFEventBasedExcelExtractor) {
            extractor = new XSSFExcelExtractorDecorator(context, poiExtractor, locale);
        } else if (poiExtractor instanceof XWPFEventBasedWordExtractor) {
            extractor = new SXWPFWordExtractorDecorator(metadata, context, (XWPFEventBasedWordExtractor) poiExtractor);
            metadata.add("X-Parsed-By", XWPFEventBasedWordExtractor.class.getCanonicalName());
        } else if (poiExtractor instanceof XSLFEventBasedPowerPointExtractor) {
            extractor = new SXSLFPowerPointExtractorDecorator(metadata, context, (XSLFEventBasedPowerPointExtractor) poiExtractor);
            metadata.add("X-Parsed-By", XSLFEventBasedPowerPointExtractor.class.getCanonicalName());
        } else if (document == null) {
            throw new TikaException("Expecting UserModel based POI OOXML extractor with a document, but none found. " + "The extractor returned was a " + poiExtractor);
        } else if (document instanceof XMLSlideShow) {
            extractor = new XSLFPowerPointExtractorDecorator(context, (org.apache.poi.xslf.extractor.XSLFPowerPointExtractor) poiExtractor);
        } else if (document instanceof XWPFDocument) {
            extractor = new XWPFWordExtractorDecorator(context, (XWPFWordExtractor) poiExtractor);
        } else {
            extractor = new POIXMLTextExtractorDecorator(context, poiExtractor);
        }
        // Get the bulk of the metadata first, so that it's accessible during
        //  parsing if desired by the client (see TIKA-1109)
        // Extract the text, along with any in-document metadata
        extractor.getXHTML(baseHandler, metadata, context);
    } catch (IllegalArgumentException e) {
        if (e.getMessage() != null && e.getMessage().startsWith("No supported documents found")) {
            throw new TikaException("TIKA-418: RuntimeException while getting content" + " for thmx and xps file types", e);
        } else {
            throw new TikaException("Error creating OOXML extractor", e);
        }
    } catch (InvalidFormatException e) {
        throw new TikaException("Error creating OOXML extractor", e);
    } catch (OpenXML4JException e) {
        throw new TikaException("Error creating OOXML extractor", e);
    } catch (XmlException e) {
        throw new TikaException("Error creating OOXML extractor", e);
    }
}
Also used : Locale(java.util.Locale) TikaInputStream XWPFEventBasedWordExtractor InvalidFormatException(org.apache.poi.openxml4j.exceptions.InvalidFormatException) OpenXML4JException(org.apache.poi.openxml4j.exceptions.OpenXML4JException) XSSFEventBasedExcelExtractor(org.apache.poi.xssf.extractor.XSSFEventBasedExcelExtractor) OfficeParserConfig MediaType(org.apache.tika.mime.MediaType) XWPFDocument(org.apache.poi.xwpf.usermodel.XWPFDocument) XSLFEventBasedPowerPointExtractor TikaException(org.apache.tika.exception.TikaException) XSSFBEventBasedExcelExtractor(org.apache.poi.xssf.extractor.XSSFBEventBasedExcelExtractor) CloseShieldInputStream InputStream XWPFWordExtractor(org.apache.poi.xwpf.extractor.XWPFWordExtractor) POIXMLDocument(org.apache.poi.POIXMLDocument) POIXMLTextExtractor(org.apache.poi.POIXMLTextExtractor) XmlException(org.apache.xmlbeans.XmlException) XMLSlideShow(org.apache.poi.xslf.usermodel.XMLSlideShow) OPCPackage(org.apache.poi.openxml4j.opc.OPCPackage)

Example 4 with XSSFBEventBasedExcelExtractor

Use of org.apache.poi.xssf.extractor.XSSFBEventBasedExcelExtractor in the Apache POI project.

Class TestSecureTempZip, method protectedXLSBZip.

/**
 * Now try with xlsb.
 */
@Test
public void protectedXLSBZip() throws IOException, GeneralSecurityException, XmlException, OpenXML4JException {
    // The test file requires that JCE unlimited be installed.
    // If it isn't installed, skip this test.
    int maxKeyLen = Cipher.getMaxAllowedKeyLength("AES");
    Assume.assumeTrue("Please install JCE Unlimited Strength Jurisdiction Policy files for AES 256", maxKeyLen == Integer.MAX_VALUE);
    File tikaProt = XSSFTestDataSamples.getSampleFile("protected_passtika.xlsb");
    FileInputStream fis = new FileInputStream(tikaProt);
    POIFSFileSystem poifs = new POIFSFileSystem(fis);
    EncryptionInfo ei = new EncryptionInfo(poifs);
    Decryptor dec = ei.getDecryptor();
    boolean passOk = dec.verifyPassword("tika");
    // extract encrypted ooxml file and write to custom encrypted zip file
    InputStream is = dec.getDataStream(poifs);
    // provide ZipEntrySource to poi which decrypts on the fly
    ZipEntrySource source = AesZipFileZipEntrySource.createZipEntrySource(is);
    // test the source
    OPCPackage opc =;
    String expected = "You can't see me";
    XSSFBEventBasedExcelExtractor extractor = new XSSFBEventBasedExcelExtractor(opc);
    String txt = extractor.getText();
    assertEquals(expected, txt.trim());
}
Also used : XSSFBEventBasedExcelExtractor(org.apache.poi.xssf.extractor.XSSFBEventBasedExcelExtractor) POIFSFileSystem(org.apache.poi.poifs.filesystem.POIFSFileSystem) FileInputStream InputStream File ZipEntrySource(org.apache.poi.openxml4j.util.ZipEntrySource) AesZipFileZipEntrySource(org.apache.poi.poifs.crypt.temp.AesZipFileZipEntrySource) OPCPackage(org.apache.poi.openxml4j.opc.OPCPackage) Test(org.junit.Test)
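
The test above is skipped unless the JVM allows unrestricted AES key sizes, which it detects by comparing Cipher.getMaxAllowedKeyLength("AES") with the largest int value. A small standalone sketch of that check, using only the JDK, might look like the following. The class JcePolicyCheck and its method names are hypothetical helpers introduced for illustration and are not part of POI.

```java
import java.security.NoSuchAlgorithmException;
import javax.crypto.Cipher;

/** Hypothetical helper that reports the AES key length permitted by the installed JCE policy. */
final class JcePolicyCheck {

    /** Returns the maximum AES key length in bits; Integer.MAX_VALUE means unrestricted. */
    static int maxAesKeyLength() {
        try {
            return Cipher.getMaxAllowedKeyLength("AES");
        } catch (NoSuchAlgorithmException e) {
            // AES is required on every Java platform, so this is not expected in practice
            throw new IllegalStateException("AES is not available", e);
        }
    }

    /** True when the policy imposes no limit, the condition the test above requires. */
    static boolean isUnlimited() {
        return maxAesKeyLength() == Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        int max = maxAesKeyLength();
        System.out.println("Max AES key length: "
            + (max == Integer.MAX_VALUE ? "unlimited" : max + " bits"));
        System.out.println("Unlimited policy installed: " + isUnlimited());
    }
}
```

Under a restricted policy the reported limit is a finite bit count instead, in which case the Assume call in the test skips the test rather than failing it.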

