
Example 6 with VisioTextExtractor

Use of org.apache.poi.hdgf.extractor.VisioTextExtractor in the Apache POI project.

From class TestHDGFCore, method testUtf16LE.

public void testUtf16LE() throws Exception {
    fs = new POIFSFileSystem(_dgTests.openResourceAsStream("Test_Visio-Some_Random_Text.vsd"));
    hdgf = new HDGFDiagram(fs);
    assertNotNull(hdgf);
    textExtractor = new VisioTextExtractor(hdgf);
    String text = textExtractor.getText().trim();
    assertEquals("text\nView\nTest View\nI am a test view\nSome random text, on a page", text);
}
Also used : POIFSFileSystem(org.apache.poi.poifs.filesystem.POIFSFileSystem) VisioTextExtractor(org.apache.poi.hdgf.extractor.VisioTextExtractor)

Example 7 with VisioTextExtractor

Use of org.apache.poi.hdgf.extractor.VisioTextExtractor in the Apache POI project.

From class TestHDGFCore, method testV6NonUtf16LE.

public void testV6NonUtf16LE() throws Exception {
    fs = new POIFSFileSystem(_dgTests.openResourceAsStream("v6-non-utf16le.vsd"));
    hdgf = new HDGFDiagram(fs);
    assertNotNull(hdgf);
    textExtractor = new VisioTextExtractor(hdgf);
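    // Strip NUL characters before trimming (the search string appears to have been lost in extraction)
    String text = textExtractor.getText().replace("\u0000", "").trim();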
    assertEquals("Table\n\n\nPropertySheet\n\n\n\nPropertySheetField", text);
}
Also used : POIFSFileSystem(org.apache.poi.poifs.filesystem.POIFSFileSystem) VisioTextExtractor(org.apache.poi.hdgf.extractor.VisioTextExtractor)
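
The test above strips a character out of the extracted text before trimming it. Assuming that character is NUL (U+0000), the normalization can be sketched on its own, without POI, roughly as follows. The class name NulStripSketch, the helper stripNuls, and the sample string are hypothetical and are not part of POI.

```java
public class NulStripSketch {
    // Hypothetical helper: removes every NUL (U+0000) character,
    // then trims leading and trailing whitespace.
    public static String stripNuls(String raw) {
        return raw.replace("\u0000", "").trim();
    }

    public static void main(String[] args) {
        // Made-up sample, not real extractor output.
        String raw = "Table\u0000\n\u0000PropertySheet\u0000\n";
        System.out.println(stripNuls(raw));
    }
}
```

Removing the NULs before comparing keeps the expected string in the assertion readable, since only the visible text and its line breaks remain.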

Example 8 with VisioTextExtractor

Use of org.apache.poi.hdgf.extractor.VisioTextExtractor in the Apache POI project.

From class TestExtractorFactory, method testPOIFS.

@Test
public void testPOIFS() throws Exception {
    // Excel
    assertTrue(ExtractorFactory.createExtractor(new POIFSFileSystem(new FileInputStream(xls))) instanceof ExcelExtractor);
    assertTrue(ExtractorFactory.createExtractor(new POIFSFileSystem(new FileInputStream(xls))).getText().length() > 200);
    // Word
    assertTrue(ExtractorFactory.createExtractor(new POIFSFileSystem(new FileInputStream(doc))) instanceof WordExtractor);
    assertTrue(ExtractorFactory.createExtractor(new POIFSFileSystem(new FileInputStream(doc))).getText().length() > 120);
    assertTrue(ExtractorFactory.createExtractor(new POIFSFileSystem(new FileInputStream(doc6))) instanceof Word6Extractor);
    assertTrue(ExtractorFactory.createExtractor(new POIFSFileSystem(new FileInputStream(doc6))).getText().length() > 20);
    assertTrue(ExtractorFactory.createExtractor(new POIFSFileSystem(new FileInputStream(doc95))) instanceof Word6Extractor);
    assertTrue(ExtractorFactory.createExtractor(new POIFSFileSystem(new FileInputStream(doc95))).getText().length() > 120);
    // PowerPoint
    assertTrue(ExtractorFactory.createExtractor(new POIFSFileSystem(new FileInputStream(ppt))) instanceof PowerPointExtractor);
    assertTrue(ExtractorFactory.createExtractor(new POIFSFileSystem(new FileInputStream(ppt))).getText().length() > 120);
    // Visio
    assertTrue(ExtractorFactory.createExtractor(new POIFSFileSystem(new FileInputStream(vsd))) instanceof VisioTextExtractor);
    assertTrue(ExtractorFactory.createExtractor(new POIFSFileSystem(new FileInputStream(vsd))).getText().length() > 50);
    // Publisher
    assertTrue(ExtractorFactory.createExtractor(new POIFSFileSystem(new FileInputStream(pub))) instanceof PublisherTextExtractor);
    assertTrue(ExtractorFactory.createExtractor(new POIFSFileSystem(new FileInputStream(pub))).getText().length() > 50);
    // Outlook msg
    assertTrue(ExtractorFactory.createExtractor(new POIFSFileSystem(new FileInputStream(msg))) instanceof OutlookTextExtactor);
    assertTrue(ExtractorFactory.createExtractor(new POIFSFileSystem(new FileInputStream(msg))).getText().length() > 50);
    // Text
    try {
        ExtractorFactory.createExtractor(new POIFSFileSystem(new FileInputStream(txt)));
        fail();
    } catch (IOException e) {
        // Good
    }
}
Also used : OutlookTextExtactor(org.apache.poi.hsmf.extractor.OutlookTextExtactor) XSSFExcelExtractor(org.apache.poi.xssf.extractor.XSSFExcelExtractor) ExcelExtractor(org.apache.poi.hssf.extractor.ExcelExtractor) XSSFEventBasedExcelExtractor(org.apache.poi.xssf.extractor.XSSFEventBasedExcelExtractor) EventBasedExcelExtractor(org.apache.poi.hssf.extractor.EventBasedExcelExtractor) OPOIFSFileSystem(org.apache.poi.poifs.filesystem.OPOIFSFileSystem) POIFSFileSystem(org.apache.poi.poifs.filesystem.POIFSFileSystem) Word6Extractor(org.apache.poi.hwpf.extractor.Word6Extractor) PowerPointExtractor(org.apache.poi.hslf.extractor.PowerPointExtractor) XSLFPowerPointExtractor(org.apache.poi.xslf.extractor.XSLFPowerPointExtractor) PublisherTextExtractor(org.apache.poi.hpbf.extractor.PublisherTextExtractor) IOException(java.io.IOException) VisioTextExtractor(org.apache.poi.hdgf.extractor.VisioTextExtractor) FileInputStream(java.io.FileInputStream) WordExtractor(org.apache.poi.hwpf.extractor.WordExtractor) XWPFWordExtractor(org.apache.poi.xwpf.extractor.XWPFWordExtractor) Test(org.junit.Test)
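
The final block of testPOIFS uses the call / fail() / catch pattern to assert that a plain-text file is rejected with an IOException. The sketch below reproduces that pattern without POI or JUnit. It assumes, without the source confirming it, that the rejection happens because the input lacks the 8-byte OLE2 header signature. ExpectedFailureSketch, requireOle2, and rejects are hypothetical names, and requireOle2 is only a stand-in for the real factory call.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

public class ExpectedFailureSketch {
    // The 8-byte signature at the start of an OLE2 compound document.
    public static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Hypothetical stand-in for the factory call: throws IOException
    // unless the stream starts with the OLE2 signature.
    public static void requireOle2(InputStream in) throws IOException {
        byte[] header = new byte[OLE2_MAGIC.length];
        int total = 0;
        while (total < header.length) {
            int n = in.read(header, total, header.length - total);
            if (n < 0) {
                break; // stream ended before a full header was read
            }
            total += n;
        }
        if (total < header.length || !Arrays.equals(header, OLE2_MAGIC)) {
            throw new IOException("Not an OLE2 stream");
        }
    }

    // Same shape as the test: make the call, and treat "no exception"
    // as the failing outcome.
    public static boolean rejects(byte[] data) {
        try {
            requireOle2(new ByteArrayInputStream(data));
            return false; // corresponds to fail() in the test
        } catch (IOException e) {
            return true;  // corresponds to the "Good" branch
        }
    }

    public static void main(String[] args) {
        System.out.println(rejects("just some plain text".getBytes()));
        System.out.println(rejects(OLE2_MAGIC));
    }
}
```

Placing fail() directly after the call is what makes the test fail when no exception is thrown, so the catch block can stay empty apart from a comment.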

Example 9 with VisioTextExtractor

Use of org.apache.poi.hdgf.extractor.VisioTextExtractor in the Apache POI project.

From class TestExtractorFactory, method testFile.

@Test
public void testFile() throws Exception {
    // Excel
    POITextExtractor xlsExtractor = ExtractorFactory.createExtractor(xls);
    assertNotNull("Had empty extractor for " + xls, xlsExtractor);
    assertTrue("Expected instanceof ExcelExtractor, but had: " + xlsExtractor.getClass(), xlsExtractor instanceof ExcelExtractor);
    assertTrue(xlsExtractor.getText().length() > 200);
    xlsExtractor.close();
    POITextExtractor extractor = ExtractorFactory.createExtractor(xlsx);
    assertTrue(extractor.getClass().getName(), extractor instanceof XSSFExcelExtractor);
    extractor.close();
    extractor = ExtractorFactory.createExtractor(xlsx);
    assertTrue(extractor.getText().length() > 200);
    extractor.close();
    extractor = ExtractorFactory.createExtractor(xltx);
    assertTrue(extractor.getClass().getName(), extractor instanceof XSSFExcelExtractor);
    extractor.close();
    extractor = ExtractorFactory.createExtractor(xlsb);
    assertContains(extractor.getText(), "test");
    extractor.close();
    extractor = ExtractorFactory.createExtractor(xltx);
    assertContains(extractor.getText(), "test");
    extractor.close();
    // TODO Support OOXML-Strict, see bug #57699
    try {
        /*extractor =*/
        ExtractorFactory.createExtractor(xlsxStrict);
        fail("OOXML-Strict isn't yet supported");
    } catch (POIXMLException e) {
        // Expected, for now
    }
    //        extractor = ExtractorFactory.createExtractor(xlsxStrict);
    //        assertTrue(
    //                extractor
    //                instanceof XSSFExcelExtractor
    //        );
    //        extractor.close();
    //
    //        extractor = ExtractorFactory.createExtractor(xlsxStrict);
    //        assertTrue(
    //                extractor.getText().contains("test")
    //        );
    //        extractor.close();
    // Word
    extractor = ExtractorFactory.createExtractor(doc);
    assertTrue(extractor instanceof WordExtractor);
    assertTrue(extractor.getText().length() > 120);
    extractor.close();
    extractor = ExtractorFactory.createExtractor(doc6);
    assertTrue(extractor instanceof Word6Extractor);
    assertTrue(extractor.getText().length() > 20);
    extractor.close();
    extractor = ExtractorFactory.createExtractor(doc95);
    assertTrue(extractor instanceof Word6Extractor);
    assertTrue(extractor.getText().length() > 120);
    extractor.close();
    extractor = ExtractorFactory.createExtractor(docx);
    assertTrue(extractor instanceof XWPFWordExtractor);
    extractor.close();
    extractor = ExtractorFactory.createExtractor(docx);
    assertTrue(extractor.getText().length() > 120);
    extractor.close();
    extractor = ExtractorFactory.createExtractor(dotx);
    assertTrue(extractor instanceof XWPFWordExtractor);
    extractor.close();
    extractor = ExtractorFactory.createExtractor(dotx);
    assertContains(extractor.getText(), "Test");
    extractor.close();
    // PowerPoint (PPT)
    extractor = ExtractorFactory.createExtractor(ppt);
    assertTrue(extractor instanceof PowerPointExtractor);
    assertTrue(extractor.getText().length() > 120);
    extractor.close();
    // PowerPoint (PPTX)
    extractor = ExtractorFactory.createExtractor(pptx);
    assertTrue(extractor instanceof XSLFPowerPointExtractor);
    assertTrue(extractor.getText().length() > 120);
    extractor.close();
    // Visio - binary
    extractor = ExtractorFactory.createExtractor(vsd);
    assertTrue(extractor instanceof VisioTextExtractor);
    assertTrue(extractor.getText().length() > 50);
    extractor.close();
    // Visio - vsdx
    extractor = ExtractorFactory.createExtractor(vsdx);
    assertTrue(extractor instanceof XDGFVisioExtractor);
    assertTrue(extractor.getText().length() > 20);
    extractor.close();
    // Publisher
    extractor = ExtractorFactory.createExtractor(pub);
    assertTrue(extractor instanceof PublisherTextExtractor);
    assertTrue(extractor.getText().length() > 50);
    extractor.close();
    // Outlook msg
    extractor = ExtractorFactory.createExtractor(msg);
    assertTrue(extractor instanceof OutlookTextExtactor);
    assertTrue(extractor.getText().length() > 50);
    extractor.close();
    // Text
    try {
        ExtractorFactory.createExtractor(txt);
        fail();
    } catch (IllegalArgumentException e) {
        // Good
    }
}
Also used : XDGFVisioExtractor(org.apache.poi.xdgf.extractor.XDGFVisioExtractor) XSSFExcelExtractor(org.apache.poi.xssf.extractor.XSSFExcelExtractor) Word6Extractor(org.apache.poi.hwpf.extractor.Word6Extractor) PowerPointExtractor(org.apache.poi.hslf.extractor.PowerPointExtractor) XSLFPowerPointExtractor(org.apache.poi.xslf.extractor.XSLFPowerPointExtractor) XWPFWordExtractor(org.apache.poi.xwpf.extractor.XWPFWordExtractor) PublisherTextExtractor(org.apache.poi.hpbf.extractor.PublisherTextExtractor) POIXMLException(org.apache.poi.POIXMLException) WordExtractor(org.apache.poi.hwpf.extractor.WordExtractor) OutlookTextExtactor(org.apache.poi.hsmf.extractor.OutlookTextExtactor) POITextExtractor(org.apache.poi.POITextExtractor) ExcelExtractor(org.apache.poi.hssf.extractor.ExcelExtractor) XSSFEventBasedExcelExtractor(org.apache.poi.xssf.extractor.XSSFEventBasedExcelExtractor) EventBasedExcelExtractor(org.apache.poi.hssf.extractor.EventBasedExcelExtractor) VisioTextExtractor(org.apache.poi.hdgf.extractor.VisioTextExtractor) Test(org.junit.Test)
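
testFile closes every extractor by hand after its assertions, so a failing assertion would skip the close() call. Since the test calls close() on each extractor, one alternative is try-with-resources, sketched below with a hypothetical stand-in class rather than a real POI extractor. FakeExtractor, CloseSketch, and textLength are illustration names only, and the sketch assumes the real extractor type can be used as a try-with-resources resource.

```java
import java.io.Closeable;

public class CloseSketch {
    // Hypothetical stand-in for a text extractor; records whether close() ran.
    public static class FakeExtractor implements Closeable {
        public boolean closed = false;

        public String getText() {
            return "some extracted text";
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    // Reads the text length and closes the extractor, even if getText() throws.
    public static int textLength(FakeExtractor extractor) {
        try (FakeExtractor e = extractor) {
            return e.getText().length();
        }
    }

    public static void main(String[] args) {
        FakeExtractor extractor = new FakeExtractor();
        System.out.println(textLength(extractor));
        System.out.println(extractor.closed);
    }
}
```

The trade-off is one nested block per extractor instead of a reused local variable, in exchange for cleanup that no longer depends on every assertion passing.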

Aggregations

VisioTextExtractor (org.apache.poi.hdgf.extractor.VisioTextExtractor): 9
PublisherTextExtractor (org.apache.poi.hpbf.extractor.PublisherTextExtractor): 5
Test (org.junit.Test): 5
FileInputStream (java.io.FileInputStream): 4
PowerPointExtractor (org.apache.poi.hslf.extractor.PowerPointExtractor): 4
OutlookTextExtactor (org.apache.poi.hsmf.extractor.OutlookTextExtactor): 4
EventBasedExcelExtractor (org.apache.poi.hssf.extractor.EventBasedExcelExtractor): 4
ExcelExtractor (org.apache.poi.hssf.extractor.ExcelExtractor): 4
Word6Extractor (org.apache.poi.hwpf.extractor.Word6Extractor): 4
WordExtractor (org.apache.poi.hwpf.extractor.WordExtractor): 4
POIFSFileSystem (org.apache.poi.poifs.filesystem.POIFSFileSystem): 4
XSLFPowerPointExtractor (org.apache.poi.xslf.extractor.XSLFPowerPointExtractor): 4
XSSFEventBasedExcelExtractor (org.apache.poi.xssf.extractor.XSSFEventBasedExcelExtractor): 4
XSSFExcelExtractor (org.apache.poi.xssf.extractor.XSSFExcelExtractor): 4
XWPFWordExtractor (org.apache.poi.xwpf.extractor.XWPFWordExtractor): 4
IOException (java.io.IOException): 2
POITextExtractor (org.apache.poi.POITextExtractor): 2
OPOIFSFileSystem (org.apache.poi.poifs.filesystem.OPOIFSFileSystem): 2
XDGFVisioExtractor (org.apache.poi.xdgf.extractor.XDGFVisioExtractor): 2
File (java.io.File): 1