Search in sources :

Example 1 with OutlookTextExtactor

use of org.apache.poi.hsmf.extractor.OutlookTextExtactor in project poi by apache.

the class ExtractorFactory method getEmbededDocsTextExtractors.

/**
     * Returns an array of text extractors, one for each of
     *  the embedded documents in the file (if there are any).
     * If there are no embedded documents, you'll get back an
     *  empty array. Otherwise, you'll get one open
     *  {@link POITextExtractor} for each embedded file.
     */
public static POITextExtractor[] getEmbededDocsTextExtractors(POIOLE2TextExtractor ext) throws IOException, OpenXML4JException, XmlException {
    // All the embedded directories we spotted
    ArrayList<Entry> dirs = new ArrayList<Entry>();
    // For anything else not directly held in as a POIFS directory
    ArrayList<InputStream> nonPOIFS = new ArrayList<InputStream>();
    // Find all the embedded directories
    DirectoryEntry root = ext.getRoot();
    if (root == null) {
        throw new IllegalStateException("The extractor didn't know which POIFS it came from!");
    }
    if (ext instanceof ExcelExtractor) {
        // These are in MBD... under the root
        Iterator<Entry> it = root.getEntries();
        while (it.hasNext()) {
            Entry entry = it.next();
            if (entry.getName().startsWith("MBD")) {
                dirs.add(entry);
            }
        }
    } else if (ext instanceof WordExtractor) {
        // These are in ObjectPool -> _... under the root
        try {
            DirectoryEntry op = (DirectoryEntry) root.getEntry("ObjectPool");
            Iterator<Entry> it = op.getEntries();
            while (it.hasNext()) {
                Entry entry = it.next();
                if (entry.getName().startsWith("_")) {
                    dirs.add(entry);
                }
            }
        } catch (FileNotFoundException e) {
            logger.log(POILogger.INFO, "Ignoring FileNotFoundException while extracting Word document", e.getLocalizedMessage());
        // ignored here
        }
    //} else if(ext instanceof PowerPointExtractor) {
    // Tricky, not stored directly in poifs
    // TODO
    } else if (ext instanceof OutlookTextExtactor) {
        // Stored in the Attachment blocks
        MAPIMessage msg = ((OutlookTextExtactor) ext).getMAPIMessage();
        for (AttachmentChunks attachment : msg.getAttachmentFiles()) {
            if (attachment.getAttachData() != null) {
                byte[] data = attachment.getAttachData().getValue();
                nonPOIFS.add(new ByteArrayInputStream(data));
            } else if (attachment.getAttachmentDirectory() != null) {
                dirs.add(attachment.getAttachmentDirectory().getDirectory());
            }
        }
    }
    // Create the extractors
    if (dirs.size() == 0 && nonPOIFS.size() == 0) {
        return new POITextExtractor[0];
    }
    ArrayList<POITextExtractor> textExtractors = new ArrayList<POITextExtractor>();
    for (Entry dir : dirs) {
        textExtractors.add(createExtractor((DirectoryNode) dir));
    }
    for (InputStream nonPOIF : nonPOIFS) {
        try {
            textExtractors.add(createExtractor(nonPOIF));
        } catch (IllegalArgumentException e) {
            // Ignore, just means it didn't contain
            //  a format we support as yet
            logger.log(POILogger.INFO, "Format not supported yet", e.getLocalizedMessage());
        } catch (XmlException e) {
            throw new IOException(e.getMessage(), e);
        } catch (OpenXML4JException e) {
            throw new IOException(e.getMessage(), e);
        }
    }
    return textExtractors.toArray(new POITextExtractor[textExtractors.size()]);
}
Also used : PushbackInputStream(java.io.PushbackInputStream) ByteArrayInputStream(java.io.ByteArrayInputStream) InputStream(java.io.InputStream) ArrayList(java.util.ArrayList) FileNotFoundException(java.io.FileNotFoundException) DirectoryNode(org.apache.poi.poifs.filesystem.DirectoryNode) IOException(java.io.IOException) DirectoryEntry(org.apache.poi.poifs.filesystem.DirectoryEntry) WordExtractor(org.apache.poi.hwpf.extractor.WordExtractor) XWPFWordExtractor(org.apache.poi.xwpf.extractor.XWPFWordExtractor) MAPIMessage(org.apache.poi.hsmf.MAPIMessage) Entry(org.apache.poi.poifs.filesystem.Entry) DirectoryEntry(org.apache.poi.poifs.filesystem.DirectoryEntry) OutlookTextExtactor(org.apache.poi.hsmf.extractor.OutlookTextExtactor) OpenXML4JException(org.apache.poi.openxml4j.exceptions.OpenXML4JException) ByteArrayInputStream(java.io.ByteArrayInputStream) POITextExtractor(org.apache.poi.POITextExtractor) XSSFExcelExtractor(org.apache.poi.xssf.extractor.XSSFExcelExtractor) ExcelExtractor(org.apache.poi.hssf.extractor.ExcelExtractor) XSSFEventBasedExcelExtractor(org.apache.poi.xssf.extractor.XSSFEventBasedExcelExtractor) XSSFBEventBasedExcelExtractor(org.apache.poi.xssf.extractor.XSSFBEventBasedExcelExtractor) XmlException(org.apache.xmlbeans.XmlException) Iterator(java.util.Iterator) AttachmentChunks(org.apache.poi.hsmf.datatypes.AttachmentChunks)

Example 2 with OutlookTextExtactor

use of org.apache.poi.hsmf.extractor.OutlookTextExtactor in project poi by apache.

the class TestExtractorFactory method testOPOIFS.

@Test
public void testOPOIFS() throws Exception {
    // Excel
    assertTrue(ExtractorFactory.createExtractor(new OPOIFSFileSystem(new FileInputStream(xls))) instanceof ExcelExtractor);
    assertTrue(ExtractorFactory.createExtractor(new OPOIFSFileSystem(new FileInputStream(xls))).getText().length() > 200);
    // Word
    assertTrue(ExtractorFactory.createExtractor(new OPOIFSFileSystem(new FileInputStream(doc))) instanceof WordExtractor);
    assertTrue(ExtractorFactory.createExtractor(new OPOIFSFileSystem(new FileInputStream(doc))).getText().length() > 120);
    assertTrue(ExtractorFactory.createExtractor(new OPOIFSFileSystem(new FileInputStream(doc6))) instanceof Word6Extractor);
    assertTrue(ExtractorFactory.createExtractor(new OPOIFSFileSystem(new FileInputStream(doc6))).getText().length() > 20);
    assertTrue(ExtractorFactory.createExtractor(new OPOIFSFileSystem(new FileInputStream(doc95))) instanceof Word6Extractor);
    assertTrue(ExtractorFactory.createExtractor(new OPOIFSFileSystem(new FileInputStream(doc95))).getText().length() > 120);
    // PowerPoint
    assertTrue(ExtractorFactory.createExtractor(new OPOIFSFileSystem(new FileInputStream(ppt))) instanceof PowerPointExtractor);
    assertTrue(ExtractorFactory.createExtractor(new OPOIFSFileSystem(new FileInputStream(ppt))).getText().length() > 120);
    // Visio
    assertTrue(ExtractorFactory.createExtractor(new OPOIFSFileSystem(new FileInputStream(vsd))) instanceof VisioTextExtractor);
    assertTrue(ExtractorFactory.createExtractor(new OPOIFSFileSystem(new FileInputStream(vsd))).getText().length() > 50);
    // Publisher
    assertTrue(ExtractorFactory.createExtractor(new OPOIFSFileSystem(new FileInputStream(pub))) instanceof PublisherTextExtractor);
    assertTrue(ExtractorFactory.createExtractor(new OPOIFSFileSystem(new FileInputStream(pub))).getText().length() > 50);
    // Outlook msg
    assertTrue(ExtractorFactory.createExtractor(new OPOIFSFileSystem(new FileInputStream(msg))) instanceof OutlookTextExtactor);
    assertTrue(ExtractorFactory.createExtractor(new OPOIFSFileSystem(new FileInputStream(msg))).getText().length() > 50);
    // Text
    try {
        ExtractorFactory.createExtractor(new OPOIFSFileSystem(new FileInputStream(txt)));
        fail();
    } catch (IOException e) {
    // Good
    }
}
Also used : OutlookTextExtactor(org.apache.poi.hsmf.extractor.OutlookTextExtactor) XSSFExcelExtractor(org.apache.poi.xssf.extractor.XSSFExcelExtractor) ExcelExtractor(org.apache.poi.hssf.extractor.ExcelExtractor) XSSFEventBasedExcelExtractor(org.apache.poi.xssf.extractor.XSSFEventBasedExcelExtractor) EventBasedExcelExtractor(org.apache.poi.hssf.extractor.EventBasedExcelExtractor) Word6Extractor(org.apache.poi.hwpf.extractor.Word6Extractor) PowerPointExtractor(org.apache.poi.hslf.extractor.PowerPointExtractor) XSLFPowerPointExtractor(org.apache.poi.xslf.extractor.XSLFPowerPointExtractor) PublisherTextExtractor(org.apache.poi.hpbf.extractor.PublisherTextExtractor) OPOIFSFileSystem(org.apache.poi.poifs.filesystem.OPOIFSFileSystem) IOException(java.io.IOException) VisioTextExtractor(org.apache.poi.hdgf.extractor.VisioTextExtractor) FileInputStream(java.io.FileInputStream) WordExtractor(org.apache.poi.hwpf.extractor.WordExtractor) XWPFWordExtractor(org.apache.poi.xwpf.extractor.XWPFWordExtractor) Test(org.junit.Test)

Example 3 with OutlookTextExtactor

use of org.apache.poi.hsmf.extractor.OutlookTextExtactor in project poi by apache.

the class TestExtractorFactory method testInputStream.

@Test
public void testInputStream() throws Exception {
    // Excel
    POITextExtractor extractor = ExtractorFactory.createExtractor(new FileInputStream(xls));
    assertTrue(extractor instanceof ExcelExtractor);
    assertTrue(extractor.getText().length() > 200);
    extractor.close();
    extractor = ExtractorFactory.createExtractor(new FileInputStream(xlsx));
    assertTrue(extractor.getClass().getName(), extractor instanceof XSSFExcelExtractor);
    assertTrue(extractor.getText().length() > 200);
    // TODO Support OOXML-Strict, see bug #57699
    //        assertTrue(
    //                ExtractorFactory.createExtractor(new FileInputStream(xlsxStrict))
    //                instanceof XSSFExcelExtractor
    //        );
    //        assertTrue(
    //                ExtractorFactory.createExtractor(new FileInputStream(xlsxStrict)).getText().length() > 200
    //        );
    extractor.close();
    // Word
    extractor = ExtractorFactory.createExtractor(new FileInputStream(doc));
    assertTrue(extractor.getClass().getName(), extractor instanceof WordExtractor);
    assertTrue(extractor.getText().length() > 120);
    extractor.close();
    extractor = ExtractorFactory.createExtractor(new FileInputStream(doc6));
    assertTrue(extractor.getClass().getName(), extractor instanceof Word6Extractor);
    assertTrue(extractor.getText().length() > 20);
    extractor.close();
    extractor = ExtractorFactory.createExtractor(new FileInputStream(doc95));
    assertTrue(extractor.getClass().getName(), extractor instanceof Word6Extractor);
    assertTrue(extractor.getText().length() > 120);
    extractor.close();
    extractor = ExtractorFactory.createExtractor(new FileInputStream(docx));
    assertTrue(extractor instanceof XWPFWordExtractor);
    assertTrue(extractor.getText().length() > 120);
    extractor.close();
    // PowerPoint
    extractor = ExtractorFactory.createExtractor(new FileInputStream(ppt));
    assertTrue(extractor instanceof PowerPointExtractor);
    assertTrue(extractor.getText().length() > 120);
    extractor.close();
    extractor = ExtractorFactory.createExtractor(new FileInputStream(pptx));
    assertTrue(extractor instanceof XSLFPowerPointExtractor);
    assertTrue(extractor.getText().length() > 120);
    extractor.close();
    // Visio
    extractor = ExtractorFactory.createExtractor(new FileInputStream(vsd));
    assertTrue(extractor instanceof VisioTextExtractor);
    assertTrue(extractor.getText().length() > 50);
    extractor.close();
    // Visio - vsdx
    extractor = ExtractorFactory.createExtractor(new FileInputStream(vsdx));
    assertTrue(extractor instanceof XDGFVisioExtractor);
    assertTrue(extractor.getText().length() > 20);
    extractor.close();
    // Publisher
    extractor = ExtractorFactory.createExtractor(new FileInputStream(pub));
    assertTrue(extractor instanceof PublisherTextExtractor);
    assertTrue(extractor.getText().length() > 50);
    extractor.close();
    // Outlook msg
    extractor = ExtractorFactory.createExtractor(new FileInputStream(msg));
    assertTrue(extractor instanceof OutlookTextExtactor);
    assertTrue(extractor.getText().length() > 50);
    extractor.close();
    // Text
    try {
        FileInputStream stream = new FileInputStream(txt);
        try {
            ExtractorFactory.createExtractor(stream);
            fail();
        } finally {
            IOUtils.closeQuietly(stream);
        }
    } catch (IllegalArgumentException e) {
    // Good
    }
}
Also used : XDGFVisioExtractor(org.apache.poi.xdgf.extractor.XDGFVisioExtractor) XSSFExcelExtractor(org.apache.poi.xssf.extractor.XSSFExcelExtractor) Word6Extractor(org.apache.poi.hwpf.extractor.Word6Extractor) PowerPointExtractor(org.apache.poi.hslf.extractor.PowerPointExtractor) XSLFPowerPointExtractor(org.apache.poi.xslf.extractor.XSLFPowerPointExtractor) XWPFWordExtractor(org.apache.poi.xwpf.extractor.XWPFWordExtractor) PublisherTextExtractor(org.apache.poi.hpbf.extractor.PublisherTextExtractor) FileInputStream(java.io.FileInputStream) WordExtractor(org.apache.poi.hwpf.extractor.WordExtractor) XWPFWordExtractor(org.apache.poi.xwpf.extractor.XWPFWordExtractor) OutlookTextExtactor(org.apache.poi.hsmf.extractor.OutlookTextExtactor) XSLFPowerPointExtractor(org.apache.poi.xslf.extractor.XSLFPowerPointExtractor) POITextExtractor(org.apache.poi.POITextExtractor) XSSFExcelExtractor(org.apache.poi.xssf.extractor.XSSFExcelExtractor) ExcelExtractor(org.apache.poi.hssf.extractor.ExcelExtractor) XSSFEventBasedExcelExtractor(org.apache.poi.xssf.extractor.XSSFEventBasedExcelExtractor) EventBasedExcelExtractor(org.apache.poi.hssf.extractor.EventBasedExcelExtractor) VisioTextExtractor(org.apache.poi.hdgf.extractor.VisioTextExtractor) Test(org.junit.Test)

Example 4 with OutlookTextExtactor

use of org.apache.poi.hsmf.extractor.OutlookTextExtactor in project poi by apache.

the class TestExtractorFactory method testEmbeded.

/**
     * Test embeded docs text extraction. For now, only
     *  does poifs embeded, but will do ooxml ones 
     *  at some point.
     */
@Test
public void testEmbeded() throws Exception {
    POIOLE2TextExtractor ext;
    POITextExtractor[] embeds;
    // No embedings
    ext = (POIOLE2TextExtractor) ExtractorFactory.createExtractor(xls);
    embeds = ExtractorFactory.getEmbededDocsTextExtractors(ext);
    assertEquals(0, embeds.length);
    ext.close();
    // Excel
    ext = (POIOLE2TextExtractor) ExtractorFactory.createExtractor(xlsEmb);
    embeds = ExtractorFactory.getEmbededDocsTextExtractors(ext);
    assertEquals(6, embeds.length);
    int numWord = 0, numXls = 0, numPpt = 0, numMsg = 0, numWordX;
    for (POITextExtractor embed : embeds) {
        assertTrue(embed.getText().length() > 20);
        if (embed instanceof PowerPointExtractor)
            numPpt++;
        else if (embed instanceof ExcelExtractor)
            numXls++;
        else if (embed instanceof WordExtractor)
            numWord++;
        else if (embed instanceof OutlookTextExtactor)
            numMsg++;
    }
    assertEquals(2, numPpt);
    assertEquals(2, numXls);
    assertEquals(2, numWord);
    assertEquals(0, numMsg);
    ext.close();
    // Word
    ext = (POIOLE2TextExtractor) ExtractorFactory.createExtractor(docEmb);
    embeds = ExtractorFactory.getEmbededDocsTextExtractors(ext);
    numWord = 0;
    numXls = 0;
    numPpt = 0;
    numMsg = 0;
    assertEquals(4, embeds.length);
    for (POITextExtractor embed : embeds) {
        assertTrue(embed.getText().length() > 20);
        if (embed instanceof PowerPointExtractor)
            numPpt++;
        else if (embed instanceof ExcelExtractor)
            numXls++;
        else if (embed instanceof WordExtractor)
            numWord++;
        else if (embed instanceof OutlookTextExtactor)
            numMsg++;
    }
    assertEquals(1, numPpt);
    assertEquals(2, numXls);
    assertEquals(1, numWord);
    assertEquals(0, numMsg);
    ext.close();
    // Word which contains an OOXML file
    ext = (POIOLE2TextExtractor) ExtractorFactory.createExtractor(docEmbOOXML);
    embeds = ExtractorFactory.getEmbededDocsTextExtractors(ext);
    numWord = 0;
    numXls = 0;
    numPpt = 0;
    numMsg = 0;
    numWordX = 0;
    assertEquals(3, embeds.length);
    for (POITextExtractor embed : embeds) {
        assertTrue(embed.getText().length() > 20);
        if (embed instanceof PowerPointExtractor)
            numPpt++;
        else if (embed instanceof ExcelExtractor)
            numXls++;
        else if (embed instanceof WordExtractor)
            numWord++;
        else if (embed instanceof OutlookTextExtactor)
            numMsg++;
        else if (embed instanceof XWPFWordExtractor)
            numWordX++;
    }
    assertEquals(1, numPpt);
    assertEquals(1, numXls);
    assertEquals(0, numWord);
    assertEquals(1, numWordX);
    assertEquals(0, numMsg);
    ext.close();
    // Outlook
    ext = (OutlookTextExtactor) ExtractorFactory.createExtractor(msgEmb);
    embeds = ExtractorFactory.getEmbededDocsTextExtractors(ext);
    numWord = 0;
    numXls = 0;
    numPpt = 0;
    numMsg = 0;
    assertEquals(1, embeds.length);
    for (POITextExtractor embed : embeds) {
        assertTrue(embed.getText().length() > 20);
        if (embed instanceof PowerPointExtractor)
            numPpt++;
        else if (embed instanceof ExcelExtractor)
            numXls++;
        else if (embed instanceof WordExtractor)
            numWord++;
        else if (embed instanceof OutlookTextExtactor)
            numMsg++;
    }
    assertEquals(0, numPpt);
    assertEquals(0, numXls);
    assertEquals(1, numWord);
    assertEquals(0, numMsg);
    ext.close();
    // Outlook with another outlook file in it
    ext = (OutlookTextExtactor) ExtractorFactory.createExtractor(msgEmbMsg);
    embeds = ExtractorFactory.getEmbededDocsTextExtractors(ext);
    numWord = 0;
    numXls = 0;
    numPpt = 0;
    numMsg = 0;
    assertEquals(1, embeds.length);
    for (POITextExtractor embed : embeds) {
        assertTrue(embed.getText().length() > 20);
        if (embed instanceof PowerPointExtractor)
            numPpt++;
        else if (embed instanceof ExcelExtractor)
            numXls++;
        else if (embed instanceof WordExtractor)
            numWord++;
        else if (embed instanceof OutlookTextExtactor)
            numMsg++;
    }
    assertEquals(0, numPpt);
    assertEquals(0, numXls);
    assertEquals(0, numWord);
    assertEquals(1, numMsg);
    ext.close();
// TODO - PowerPoint
// TODO - Publisher
// TODO - Visio
}
Also used : OutlookTextExtactor(org.apache.poi.hsmf.extractor.OutlookTextExtactor) POITextExtractor(org.apache.poi.POITextExtractor) PowerPointExtractor(org.apache.poi.hslf.extractor.PowerPointExtractor) XSLFPowerPointExtractor(org.apache.poi.xslf.extractor.XSLFPowerPointExtractor) XSSFExcelExtractor(org.apache.poi.xssf.extractor.XSSFExcelExtractor) ExcelExtractor(org.apache.poi.hssf.extractor.ExcelExtractor) XSSFEventBasedExcelExtractor(org.apache.poi.xssf.extractor.XSSFEventBasedExcelExtractor) EventBasedExcelExtractor(org.apache.poi.hssf.extractor.EventBasedExcelExtractor) XWPFWordExtractor(org.apache.poi.xwpf.extractor.XWPFWordExtractor) POIOLE2TextExtractor(org.apache.poi.POIOLE2TextExtractor) WordExtractor(org.apache.poi.hwpf.extractor.WordExtractor) XWPFWordExtractor(org.apache.poi.xwpf.extractor.XWPFWordExtractor) Test(org.junit.Test)

Example 5 with OutlookTextExtactor

use of org.apache.poi.hsmf.extractor.OutlookTextExtactor in project poi by apache.

the class TestFixedSizedProperties method testReadMessageDateSucceedsWithOutlookTextExtractor.

/**
    * Test to see if we can read the Date Chunk with OutlookTextExtractor.
    */
@Test
public // @Ignore("TODO Work out why the Fri 22nd vs Monday 25th problem is occurring and fix")
void testReadMessageDateSucceedsWithOutlookTextExtractor() throws Exception {
    OutlookTextExtactor ext = new OutlookTextExtactor(mapiMessageSucceeds);
    // Don't close re-used test resources here
    ext.setFilesystem(null);
    String text = ext.getText();
    assertContains(text, "Date: Fri, 22 Jun 2012 18:32:54 +0000\n");
    ext.close();
}
Also used : OutlookTextExtactor(org.apache.poi.hsmf.extractor.OutlookTextExtactor) Test(org.junit.Test)

Aggregations

OutlookTextExtactor (org.apache.poi.hsmf.extractor.OutlookTextExtactor)9 WordExtractor (org.apache.poi.hwpf.extractor.WordExtractor)7 Test (org.junit.Test)7 ExcelExtractor (org.apache.poi.hssf.extractor.ExcelExtractor)6 XSSFEventBasedExcelExtractor (org.apache.poi.xssf.extractor.XSSFEventBasedExcelExtractor)6 XSSFExcelExtractor (org.apache.poi.xssf.extractor.XSSFExcelExtractor)6 XWPFWordExtractor (org.apache.poi.xwpf.extractor.XWPFWordExtractor)6 PowerPointExtractor (org.apache.poi.hslf.extractor.PowerPointExtractor)5 EventBasedExcelExtractor (org.apache.poi.hssf.extractor.EventBasedExcelExtractor)5 XSLFPowerPointExtractor (org.apache.poi.xslf.extractor.XSLFPowerPointExtractor)5 POITextExtractor (org.apache.poi.POITextExtractor)4 VisioTextExtractor (org.apache.poi.hdgf.extractor.VisioTextExtractor)4 PublisherTextExtractor (org.apache.poi.hpbf.extractor.PublisherTextExtractor)4 Word6Extractor (org.apache.poi.hwpf.extractor.Word6Extractor)4 FileInputStream (java.io.FileInputStream)3 IOException (java.io.IOException)3 ByteArrayInputStream (java.io.ByteArrayInputStream)2 FileNotFoundException (java.io.FileNotFoundException)2 MAPIMessage (org.apache.poi.hsmf.MAPIMessage)2 AttachmentChunks (org.apache.poi.hsmf.datatypes.AttachmentChunks)2