
Example 61 with NPOIFSFileSystem

use of org.apache.poi.poifs.filesystem.NPOIFSFileSystem in project tika by apache.

the class TestContainerAwareDetector method testOpenContainer.

@Test
public void testOpenContainer() throws Exception {
    try (TikaInputStream stream = TikaInputStream.get(TestContainerAwareDetector.class.getResource("/test-documents/testPPT.ppt"))) {
        assertNull(stream.getOpenContainer());
        assertEquals(MediaType.parse("application/vnd.ms-powerpoint"), detector.detect(stream, new Metadata()));
        assertTrue(stream.getOpenContainer() instanceof NPOIFSFileSystem);
    }
}
Also used : NPOIFSFileSystem(org.apache.poi.poifs.filesystem.NPOIFSFileSystem) Metadata(org.apache.tika.metadata.Metadata) TikaInputStream(org.apache.tika.io.TikaInputStream) Test(org.junit.Test)

Example 62 with NPOIFSFileSystem

use of org.apache.poi.poifs.filesystem.NPOIFSFileSystem in project tika by apache.

the class OfficeParser method parse.

/**
 * Extracts properties and text from an MS Document input stream.
 */
public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException {
    configure(context);
    XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
    xhtml.startDocument();
    final DirectoryNode root;
    TikaInputStream tstream = TikaInputStream.cast(stream);
    NPOIFSFileSystem mustCloseFs = null;
    try {
        if (tstream == null) {
            mustCloseFs = new NPOIFSFileSystem(new CloseShieldInputStream(stream));
            root = mustCloseFs.getRoot();
        } else {
            final Object container = tstream.getOpenContainer();
            if (container instanceof NPOIFSFileSystem) {
                root = ((NPOIFSFileSystem) container).getRoot();
            } else if (container instanceof DirectoryNode) {
                root = (DirectoryNode) container;
            } else {
                NPOIFSFileSystem fs = null;
                if (tstream.hasFile()) {
                    fs = new NPOIFSFileSystem(tstream.getFile(), true);
                } else {
                    fs = new NPOIFSFileSystem(new CloseShieldInputStream(tstream));
                }
                // tstream now owns fs and will close it, so it must not be closed below
                tstream.setOpenContainer(fs);
                root = fs.getRoot();
            }
        }
        parse(root, context, metadata, xhtml);
        OfficeParserConfig officeParserConfig = context.get(OfficeParserConfig.class);
        if (officeParserConfig.getExtractMacros()) {
            //now try to get macros
            extractMacros(root.getNFileSystem(), xhtml, EmbeddedDocumentUtil.getEmbeddedDocumentExtractor(context));
        }
    } finally {
        IOUtils.closeQuietly(mustCloseFs);
    }
    xhtml.endDocument();
}
Also used : NPOIFSFileSystem(org.apache.poi.poifs.filesystem.NPOIFSFileSystem) TikaInputStream(org.apache.tika.io.TikaInputStream) DirectoryNode(org.apache.poi.poifs.filesystem.DirectoryNode) XHTMLContentHandler(org.apache.tika.sax.XHTMLContentHandler) CloseShieldInputStream(org.apache.commons.io.input.CloseShieldInputStream)
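
The parse method above wraps a caller-supplied stream in a CloseShieldInputStream before handing it to NPOIFSFileSystem, so that closing the file system does not close a stream the parser does not own. The following is a minimal, stdlib-only sketch of that idea. The names NonClosingInputStream, TrackingInputStream and readAllAndClose are hypothetical and exist only for illustration; the real Commons IO class does more than ignore close(), so treat this as an approximation of the pattern rather than the library's implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class CloseShieldSketch {

    /** Hypothetical stand-in for a close shield: close() never reaches the wrapped stream. */
    static final class NonClosingInputStream extends FilterInputStream {
        NonClosingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public void close() {
            // Intentionally empty: the caller still owns the wrapped stream.
        }
    }

    /** Test helper that records whether close() reached the underlying stream. */
    static final class TrackingInputStream extends ByteArrayInputStream {
        boolean closed = false;

        TrackingInputStream(byte[] data) {
            super(data);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    /** Hypothetical consumer that, like many parsers, closes whatever it is given. */
    static int readAllAndClose(InputStream in) throws IOException {
        try (InputStream s = in) {
            int count = 0;
            while (s.read() != -1) {
                count++;
            }
            return count;
        }
    }

    public static void main(String[] args) throws IOException {
        TrackingInputStream source = new TrackingInputStream(new byte[] {1, 2, 3});
        int n = readAllAndClose(new NonClosingInputStream(source));
        System.out.println(n + " bytes read, source closed: " + source.closed);
    }
}
```

The consumer's try-with-resources closes only the shield, so the original stream stays usable by whoever created it. That matches the ownership split in parse, where only mustCloseFs is closed in the finally block.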

Example 63 with NPOIFSFileSystem

use of org.apache.poi.poifs.filesystem.NPOIFSFileSystem in project tika by apache.

the class POIFSContainerDetector method getTopLevelNames.

private static Set<String> getTopLevelNames(TikaInputStream stream) throws IOException {
    // Force the document stream to a (possibly temporary) file
    // so we don't modify the current position of the stream
    File file = stream.getFile();
    try {
        NPOIFSFileSystem fs = new NPOIFSFileSystem(file, true);
        // Optimize a possible later parsing process by keeping
        // a reference to the already opened POI file system
        stream.setOpenContainer(fs);
        return getTopLevelNames(fs.getRoot());
    } catch (IOException | RuntimeException e) {
        // Parse error or another problem in POI, so we don't know the file type
        return Collections.emptySet();
    }
}
Also used : NPOIFSFileSystem(org.apache.poi.poifs.filesystem.NPOIFSFileSystem) IOException(java.io.IOException) File(java.io.File)
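
The comments in getTopLevelNames stress that detection should not change the current position of the stream. A cheaper check that respects the same constraint is to peek at the 8-byte OLE2 signature using mark/reset before opening a full file system. The sketch below uses only the JDK; the class and method names (Ole2HeaderSketch, looksLikeOle2) are hypothetical, and it is not presented as the way POIFSContainerDetector itself is implemented.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class Ole2HeaderSketch {

    // Signature at the start of every OLE2 compound document.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    /**
     * Returns true if the stream starts with the OLE2 signature.
     * The stream position is restored before returning.
     */
    static boolean looksLikeOle2(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(OLE2_MAGIC.length);
        try {
            byte[] header = new byte[OLE2_MAGIC.length];
            int read = 0;
            // read() may return fewer bytes than requested, so loop until full or EOF.
            while (read < header.length) {
                int n = in.read(header, read, header.length - read);
                if (n == -1) {
                    break;
                }
                read += n;
            }
            return read == header.length && Arrays.equals(header, OLE2_MAGIC);
        } finally {
            // Rewind so later consumers see the stream from its original position.
            in.reset();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = Arrays.copyOf(OLE2_MAGIC, 16);
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(data));
        System.out.println("OLE2 header: " + looksLikeOle2(in));
        System.out.println("first byte still readable: " + (in.read() == 0xD0));
    }
}
```

A header match only says the file is some OLE2 container. Telling a .ppt from a .doc or .xls still requires listing the top-level entries, which is what getTopLevelNames does.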

Example 64 with NPOIFSFileSystem

use of org.apache.poi.poifs.filesystem.NPOIFSFileSystem in project tika by apache.

the class HSLFExtractor method handleSlideEmbeddedResources.

private void handleSlideEmbeddedResources(HSLFSlide slide, XHTMLContentHandler xhtml) throws TikaException, SAXException, IOException {
    List<HSLFShape> shapes;
    try {
        shapes = slide.getShapes();
    } catch (NullPointerException e) {
        // Sometimes HSLF hits problems
        // Please open POI bugs for any you come across!
        EmbeddedDocumentUtil.recordEmbeddedStreamException(e, parentMetadata);
        return;
    }
    for (HSLFShape shape : shapes) {
        if (shape instanceof OLEShape) {
            OLEShape oleShape = (OLEShape) shape;
            HSLFObjectData data = null;
            try {
                data = oleShape.getObjectData();
            } catch (NullPointerException e) {
                // getObjectData sometimes throws an NPE.
                EmbeddedDocumentUtil.recordEmbeddedStreamException(e, parentMetadata);
                continue;
            }
            if (data != null) {
                String objID = Integer.toString(oleShape.getObjectID());
                // Embedded object: add a <div class="embedded" id="X"/>
                // so the consumer can see where in the main text
                // each embedded document occurred.
                AttributesImpl attributes = new AttributesImpl();
                attributes.addAttribute("", "class", "class", "CDATA", "embedded");
                attributes.addAttribute("", "id", "id", "CDATA", objID);
                xhtml.startElement("div", attributes);
                xhtml.endElement("div");
                InputStream dataStream = null;
                try {
                    dataStream = data.getData();
                } catch (Exception e) {
                    EmbeddedDocumentUtil.recordEmbeddedStreamException(e, parentMetadata);
                    continue;
                }
                try (TikaInputStream stream = TikaInputStream.get(dataStream)) {
                    String mediaType = null;
                    if ("Excel.Chart.8".equals(oleShape.getProgID())) {
                        mediaType = "application/vnd.ms-excel";
                    } else {
                        MediaType mt = getTikaConfig().getDetector().detect(stream, new Metadata());
                        mediaType = mt.toString();
                    }
                    if (mediaType.equals("application/x-tika-msoffice-embedded; format=comp_obj")) {
                        try (NPOIFSFileSystem npoifs = new NPOIFSFileSystem(new CloseShieldInputStream(stream))) {
                            handleEmbeddedOfficeDoc(npoifs.getRoot(), objID, xhtml);
                        }
                    } else {
                        handleEmbeddedResource(stream, objID, objID, mediaType, xhtml, false);
                    }
                } catch (IOException e) {
                    EmbeddedDocumentUtil.recordEmbeddedStreamException(e, parentMetadata);
                }
            }
        }
    }
}
Also used : TikaInputStream(org.apache.tika.io.TikaInputStream) CloseShieldInputStream(org.apache.tika.io.CloseShieldInputStream) InputStream(java.io.InputStream) Metadata(org.apache.tika.metadata.Metadata) IOException(java.io.IOException) HSLFObjectData(org.apache.poi.hslf.usermodel.HSLFObjectData) OLEShape(org.apache.poi.hslf.model.OLEShape) TikaException(org.apache.tika.exception.TikaException) SAXException(org.xml.sax.SAXException) NPOIFSFileSystem(org.apache.poi.poifs.filesystem.NPOIFSFileSystem) HSLFShape(org.apache.poi.hslf.usermodel.HSLFShape) AttributesImpl(org.xml.sax.helpers.AttributesImpl) MediaType(org.apache.tika.mime.MediaType)

Example 65 with NPOIFSFileSystem

use of org.apache.poi.poifs.filesystem.NPOIFSFileSystem in project tika by apache.

the class JackcessExtractor method handleCompoundContent.

private void handleCompoundContent(OleBlob.CompoundContent cc, XHTMLContentHandler xhtml) throws IOException, SAXException, TikaException {
    InputStream is = null;
    NPOIFSFileSystem nfs = null;
    try {
        try {
            is = cc.getStream();
        } catch (IOException e) {
            EmbeddedDocumentUtil.recordEmbeddedStreamException(e, parentMetadata);
            return;
        }
        try {
            nfs = new NPOIFSFileSystem(is);
        } catch (Exception e) {
            EmbeddedDocumentUtil.recordEmbeddedStreamException(e, parentMetadata);
            return;
        }
        handleEmbeddedOfficeDoc(nfs.getRoot(), xhtml);
    } finally {
        if (nfs != null) {
            try {
                nfs.close();
            } catch (IOException e) {
                // swallow: nothing useful can be done if close fails
            }
        }
        if (is != null) {
            IOUtils.closeQuietly(is);
        }
    }
}
Also used : NPOIFSFileSystem(org.apache.poi.poifs.filesystem.NPOIFSFileSystem) ByteArrayInputStream(java.io.ByteArrayInputStream) TikaInputStream(org.apache.tika.io.TikaInputStream) InputStream(java.io.InputStream) IOException(java.io.IOException) TikaException(org.apache.tika.exception.TikaException) SAXException(org.xml.sax.SAXException)

Aggregations

NPOIFSFileSystem (org.apache.poi.poifs.filesystem.NPOIFSFileSystem) 101
Test (org.junit.Test) 57
File (java.io.File) 35
InputStream (java.io.InputStream) 26
ByteArrayInputStream (java.io.ByteArrayInputStream) 19
ByteArrayOutputStream (java.io.ByteArrayOutputStream) 14
MAPIMessage (org.apache.poi.hsmf.MAPIMessage) 14
FileOutputStream (java.io.FileOutputStream) 12
TempFile (org.apache.poi.util.TempFile) 12
FileInputStream (java.io.FileInputStream) 11
OPOIFSFileSystem (org.apache.poi.poifs.filesystem.OPOIFSFileSystem) 10
POIFSFileSystem (org.apache.poi.poifs.filesystem.POIFSFileSystem) 10
DocumentSummaryInformation (org.apache.poi.hpsf.DocumentSummaryInformation) 9
DirectoryNode (org.apache.poi.poifs.filesystem.DirectoryNode) 9
IOException (java.io.IOException) 8
OutputStream (java.io.OutputStream) 8
SummaryInformation (org.apache.poi.hpsf.SummaryInformation) 7
TikaInputStream (org.apache.tika.io.TikaInputStream) 6
AgileDecryptor (org.apache.poi.poifs.crypt.agile.AgileDecryptor) 5
DirectoryEntry (org.apache.poi.poifs.filesystem.DirectoryEntry) 5