
Example 96 with NPOIFSFileSystem

use of org.apache.poi.poifs.filesystem.NPOIFSFileSystem in project poi by apache.

the class TestExtractor method testDifferentPOIFS.

/**
 * Tests that we can work with both {@link POIFSFileSystem}
 * and {@link NPOIFSFileSystem}
 */
@SuppressWarnings("resource")
@Test
public void testDifferentPOIFS() throws IOException {
    // Open the two filesystems
    File pptFile = slTests.getFile("basic_test_ppt_file.ppt");
    InputStream is1 = new FileInputStream(pptFile);
    OPOIFSFileSystem opoifs = new OPOIFSFileSystem(is1);
    is1.close();
    NPOIFSFileSystem npoifs = new NPOIFSFileSystem(pptFile);
    DirectoryNode[] files = { opoifs.getRoot(), npoifs.getRoot() };
    // Open directly
    for (DirectoryNode dir : files) {
        PowerPointExtractor extractor = new PowerPointExtractor(dir);
        assertEquals(expectText, extractor.getText());
    }
    // Open via a HSLFSlideShow
    for (DirectoryNode dir : files) {
        HSLFSlideShowImpl slideshow = new HSLFSlideShowImpl(dir);
        PowerPointExtractor extractor = new PowerPointExtractor(slideshow);
        assertEquals(expectText, extractor.getText());
        extractor.close();
        slideshow.close();
    }
    npoifs.close();
}
Also used : NPOIFSFileSystem(org.apache.poi.poifs.filesystem.NPOIFSFileSystem) FileInputStream(java.io.FileInputStream) InputStream(java.io.InputStream) DirectoryNode(org.apache.poi.poifs.filesystem.DirectoryNode) OPOIFSFileSystem(org.apache.poi.poifs.filesystem.OPOIFSFileSystem) File(java.io.File) HSLFSlideShowImpl(org.apache.poi.hslf.usermodel.HSLFSlideShowImpl) Test(org.junit.Test)

Example 97 with NPOIFSFileSystem

use of org.apache.poi.poifs.filesystem.NPOIFSFileSystem in project tika by apache.

the class TestContainerAwareDetector method testOpenContainer.

@Test
public void testOpenContainer() throws Exception {
    try (TikaInputStream stream = TikaInputStream.get(TestContainerAwareDetector.class.getResource("/test-documents/testPPT.ppt"))) {
        assertNull(stream.getOpenContainer());
        assertEquals(MediaType.parse("application/vnd.ms-powerpoint"), detector.detect(stream, new Metadata()));
        assertTrue(stream.getOpenContainer() instanceof NPOIFSFileSystem);
    }
}
Also used : NPOIFSFileSystem(org.apache.poi.poifs.filesystem.NPOIFSFileSystem) Metadata(org.apache.tika.metadata.Metadata) TikaInputStream(org.apache.tika.io.TikaInputStream) Test(org.junit.Test)

Example 98 with NPOIFSFileSystem

use of org.apache.poi.poifs.filesystem.NPOIFSFileSystem in project tika by apache.

the class OfficeParser method parse.

/**
 * Extracts properties and text from an MS Document input stream
 */
public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException {
    configure(context);
    XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
    xhtml.startDocument();
    final DirectoryNode root;
    TikaInputStream tstream = TikaInputStream.cast(stream);
    NPOIFSFileSystem mustCloseFs = null;
    try {
        if (tstream == null) {
            mustCloseFs = new NPOIFSFileSystem(new CloseShieldInputStream(stream));
            root = mustCloseFs.getRoot();
        } else {
            final Object container = tstream.getOpenContainer();
            if (container instanceof NPOIFSFileSystem) {
                root = ((NPOIFSFileSystem) container).getRoot();
            } else if (container instanceof DirectoryNode) {
                root = (DirectoryNode) container;
            } else {
                NPOIFSFileSystem fs = null;
                if (tstream.hasFile()) {
                    fs = new NPOIFSFileSystem(tstream.getFile(), true);
                } else {
                    fs = new NPOIFSFileSystem(new CloseShieldInputStream(tstream));
                }
                //tstream will close the fs, no need to close this below
                tstream.setOpenContainer(fs);
                root = fs.getRoot();
            }
        }
        parse(root, context, metadata, xhtml);
        OfficeParserConfig officeParserConfig = context.get(OfficeParserConfig.class);
        if (officeParserConfig.getExtractMacros()) {
            //now try to get macros
            extractMacros(root.getNFileSystem(), xhtml, EmbeddedDocumentUtil.getEmbeddedDocumentExtractor(context));
        }
    } finally {
        IOUtils.closeQuietly(mustCloseFs);
    }
    xhtml.endDocument();
}
Also used : NPOIFSFileSystem(org.apache.poi.poifs.filesystem.NPOIFSFileSystem) TikaInputStream(org.apache.tika.io.TikaInputStream) DirectoryNode(org.apache.poi.poifs.filesystem.DirectoryNode) XHTMLContentHandler(org.apache.tika.sax.XHTMLContentHandler) CloseShieldInputStream(org.apache.commons.io.input.CloseShieldInputStream)
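
The parse method above wraps the caller's stream in a CloseShieldInputStream before handing it to NPOIFSFileSystem, so that closing the filesystem it created (mustCloseFs) does not also close a stream it does not own. A minimal stdlib-only sketch of that idea follows. It is not the Commons IO or Tika implementation: the names CloseShieldSketch, TrackingStream, Shield and consumeFirstByte are hypothetical, and a plain FilterInputStream subclass stands in for CloseShieldInputStream.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

final class CloseShieldSketch {

    /** Records whether close() reached it, so the effect of the shield is observable. */
    static final class TrackingStream extends ByteArrayInputStream {
        boolean closed = false;

        TrackingStream(byte[] data) {
            super(data);
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    /** Hypothetical stand-in for CloseShieldInputStream: close() is not forwarded. */
    static final class Shield extends FilterInputStream {
        Shield(InputStream in) {
            super(in);
        }

        @Override
        public void close() {
            // Intentionally empty: the caller still owns the wrapped stream.
        }
    }

    /** Reads one byte through a shield, closes the shield, and returns the byte. */
    static int consumeFirstByte(InputStream callerOwned) throws IOException {
        try (InputStream shielded = new Shield(callerOwned)) {
            return shielded.read();
        }
    }

    public static void main(String[] args) throws IOException {
        TrackingStream source = new TrackingStream(new byte[] {1, 2, 3});
        int first = consumeFirstByte(source);
        // The source stream was never closed and can still be read by its owner.
        System.out.println("first=" + first + " closed=" + source.closed + " next=" + source.read());
    }
}
```

The same split of responsibility appears in the parse method: a resource the method opened itself is closed in the finally block, while a container attached to the TikaInputStream is left for that stream to close.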

Example 99 with NPOIFSFileSystem

use of org.apache.poi.poifs.filesystem.NPOIFSFileSystem in project tika by apache.

the class POIFSContainerDetector method getTopLevelNames.

private static Set<String> getTopLevelNames(TikaInputStream stream) throws IOException {
    // Force the document stream to a (possibly temporary) file
    // so we don't modify the current position of the stream
    File file = stream.getFile();
    try {
        NPOIFSFileSystem fs = new NPOIFSFileSystem(file, true);
        // Optimize a possible later parsing process by keeping
        // a reference to the already opened POI file system
        stream.setOpenContainer(fs);
        return getTopLevelNames(fs.getRoot());
    } catch (IOException | RuntimeException e) {
        // Parse error or another problem in POI, so we don't know the file type
        return Collections.emptySet();
    }
}
Also used : NPOIFSFileSystem(org.apache.poi.poifs.filesystem.NPOIFSFileSystem) IOException(java.io.IOException) File(java.io.File)
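
getTopLevelNames above avoids disturbing the stream position by spooling the stream to a file and opening that file instead. For comparison, here is a hedged sketch of a different, lighter technique that only answers "does this look like an OLE2 container at all?": it peeks at the 8-byte OLE2 signature using mark/reset and restores the position afterwards. The class and method names (Ole2HeaderPeek, looksLikeOle2) are hypothetical and are not part of POI or Tika, and the sketch assumes the stream supports mark/reset.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

final class Ole2HeaderPeek {

    // The 8-byte signature at the start of an OLE2 compound document.
    private static final int[] OLE2_SIGNATURE =
            {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};

    /**
     * Returns true if the stream starts with the OLE2 signature.
     * The stream must support mark/reset; its position is restored before returning.
     */
    static boolean looksLikeOle2(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(OLE2_SIGNATURE.length);
        try {
            for (int expected : OLE2_SIGNATURE) {
                if (in.read() != expected) {
                    return false;
                }
            }
            return true;
        } finally {
            // Rewind so later consumers see the stream from the same position.
            in.reset();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
                       (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1, 0x00};
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(data));
        System.out.println(looksLikeOle2(in));
        // The first byte is still available after the peek.
        System.out.println(Integer.toHexString(in.read()));
    }
}
```

A signature check like this cannot list directory entries, which is why the detector above still opens a full NPOIFSFileSystem to read the top-level names.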

Example 100 with NPOIFSFileSystem

use of org.apache.poi.poifs.filesystem.NPOIFSFileSystem in project tika by apache.

the class HSLFExtractor method handleSlideEmbeddedResources.

private void handleSlideEmbeddedResources(HSLFSlide slide, XHTMLContentHandler xhtml) throws TikaException, SAXException, IOException {
    List<HSLFShape> shapes;
    try {
        shapes = slide.getShapes();
    } catch (NullPointerException e) {
        // Sometimes HSLF hits problems
        // Please open POI bugs for any you come across!
        EmbeddedDocumentUtil.recordEmbeddedStreamException(e, parentMetadata);
        return;
    }
    for (HSLFShape shape : shapes) {
        if (shape instanceof OLEShape) {
            OLEShape oleShape = (OLEShape) shape;
            HSLFObjectData data = null;
            try {
                data = oleShape.getObjectData();
            } catch (NullPointerException e) {
                /* getObjectData sometimes throws an NPE. */
                EmbeddedDocumentUtil.recordEmbeddedStreamException(e, parentMetadata);
                continue;
            }
            if (data != null) {
                String objID = Integer.toString(oleShape.getObjectID());
                // Embedded Object: add a <div
                // class="embedded" id="X"/> so consumer can see where
                // in the main text each embedded document
                // occurred:
                AttributesImpl attributes = new AttributesImpl();
                attributes.addAttribute("", "class", "class", "CDATA", "embedded");
                attributes.addAttribute("", "id", "id", "CDATA", objID);
                xhtml.startElement("div", attributes);
                xhtml.endElement("div");
                InputStream dataStream = null;
                try {
                    dataStream = data.getData();
                } catch (Exception e) {
                    EmbeddedDocumentUtil.recordEmbeddedStreamException(e, parentMetadata);
                    continue;
                }
                try (TikaInputStream stream = TikaInputStream.get(dataStream)) {
                    String mediaType = null;
                    if ("Excel.Chart.8".equals(oleShape.getProgID())) {
                        mediaType = "application/vnd.ms-excel";
                    } else {
                        MediaType mt = getTikaConfig().getDetector().detect(stream, new Metadata());
                        mediaType = mt.toString();
                    }
                    if (mediaType.equals("application/x-tika-msoffice-embedded; format=comp_obj")) {
                        try (NPOIFSFileSystem npoifs = new NPOIFSFileSystem(new CloseShieldInputStream(stream))) {
                            handleEmbeddedOfficeDoc(npoifs.getRoot(), objID, xhtml);
                        }
                    } else {
                        handleEmbeddedResource(stream, objID, objID, mediaType, xhtml, false);
                    }
                } catch (IOException e) {
                    EmbeddedDocumentUtil.recordEmbeddedStreamException(e, parentMetadata);
                }
            }
        }
    }
}
Also used : TikaInputStream(org.apache.tika.io.TikaInputStream) CloseShieldInputStream(org.apache.tika.io.CloseShieldInputStream) InputStream(java.io.InputStream) Metadata(org.apache.tika.metadata.Metadata) IOException(java.io.IOException) HSLFObjectData(org.apache.poi.hslf.usermodel.HSLFObjectData) OLEShape(org.apache.poi.hslf.model.OLEShape) TikaException(org.apache.tika.exception.TikaException) SAXException(org.xml.sax.SAXException) NPOIFSFileSystem(org.apache.poi.poifs.filesystem.NPOIFSFileSystem) HSLFShape(org.apache.poi.hslf.usermodel.HSLFShape) AttributesImpl(org.xml.sax.helpers.AttributesImpl) MediaType(org.apache.tika.mime.MediaType)

Aggregations

NPOIFSFileSystem (org.apache.poi.poifs.filesystem.NPOIFSFileSystem) 101
Test (org.junit.Test) 57
File (java.io.File) 35
InputStream (java.io.InputStream) 26
ByteArrayInputStream (java.io.ByteArrayInputStream) 19
ByteArrayOutputStream (java.io.ByteArrayOutputStream) 14
MAPIMessage (org.apache.poi.hsmf.MAPIMessage) 14
FileOutputStream (java.io.FileOutputStream) 12
TempFile (org.apache.poi.util.TempFile) 12
FileInputStream (java.io.FileInputStream) 11
OPOIFSFileSystem (org.apache.poi.poifs.filesystem.OPOIFSFileSystem) 10
POIFSFileSystem (org.apache.poi.poifs.filesystem.POIFSFileSystem) 10
DocumentSummaryInformation (org.apache.poi.hpsf.DocumentSummaryInformation) 9
DirectoryNode (org.apache.poi.poifs.filesystem.DirectoryNode) 9
IOException (java.io.IOException) 8
OutputStream (java.io.OutputStream) 8
SummaryInformation (org.apache.poi.hpsf.SummaryInformation) 7
TikaInputStream (org.apache.tika.io.TikaInputStream) 6
AgileDecryptor (org.apache.poi.poifs.crypt.agile.AgileDecryptor) 5
DirectoryEntry (org.apache.poi.poifs.filesystem.DirectoryEntry) 5