Example 11 with OfficeParserConfig

use of org.apache.tika.parser.microsoft.OfficeParserConfig in project tika by apache.

the class SXWPFExtractorTest method testTurningOffTextBoxExtraction.

//TIKA-2346
@Test
public void testTurningOffTextBoxExtraction() throws Exception {
    ParseContext pc = new ParseContext();
    OfficeParserConfig officeParserConfig = new OfficeParserConfig();
    officeParserConfig.setIncludeShapeBasedContent(false);
    officeParserConfig.setUseSAXDocxExtractor(true);
    pc.set(OfficeParserConfig.class, officeParserConfig);
    String xml = getXML("testWORD_text_box.docx", pc).xml;
    assertContains("This text is directly in the body of the document.", xml);
    assertNotContained("This text is inside of a text box in the body of the document.", xml);
    assertNotContained("This text is inside of a text box in the header of the document.", xml);
    assertNotContained("This text is inside of a text box in the footer of the document.", xml);
}
Also used : ParseContext(org.apache.tika.parser.ParseContext) OfficeParserConfig(org.apache.tika.parser.microsoft.OfficeParserConfig) Test(org.junit.Test) TikaTest(org.apache.tika.TikaTest)
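
The examples on this page pass options to the parser by registering an OfficeParserConfig in a ParseContext under its class. A minimal, dependency-free sketch of that type-keyed lookup pattern follows. Context and ShapeOptions are hypothetical stand-ins written for illustration, not Tika classes, and the default value of true is an assumption of the sketch.

```java
import java.util.HashMap;
import java.util.Map;

public class TypeKeyedContextSketch {

    /** Hypothetical options object, standing in for something like OfficeParserConfig. */
    public static class ShapeOptions {
        private boolean includeShapeBasedContent = true; // assumed default for this sketch

        public void setIncludeShapeBasedContent(boolean value) {
            this.includeShapeBasedContent = value;
        }

        public boolean isIncludeShapeBasedContent() {
            return includeShapeBasedContent;
        }
    }

    /** Simplified stand-in for a parse context: at most one value per class key. */
    public static class Context {
        private final Map<Class<?>, Object> values = new HashMap<>();

        public <T> void set(Class<T> key, T value) {
            if (value == null) {
                values.remove(key);
            } else {
                values.put(key, value);
            }
        }

        public <T> T get(Class<T> key, T defaultValue) {
            Object value = values.get(key);
            return value == null ? defaultValue : key.cast(value);
        }
    }

    /** A consumer looks up its options by type and falls back to defaults. */
    public static String extract(Context context) {
        ShapeOptions options = context.get(ShapeOptions.class, new ShapeOptions());
        StringBuilder out = new StringBuilder("body text");
        if (options.isIncludeShapeBasedContent()) {
            out.append(" | text box text");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Context context = new Context();
        System.out.println(extract(context)); // nothing registered: defaults apply

        ShapeOptions options = new ShapeOptions();
        options.setIncludeShapeBasedContent(false);
        context.set(ShapeOptions.class, options);
        System.out.println(extract(context)); // registered options are picked up
    }
}
```

Keying on the class means the caller and the consumer only need to agree on a type, which is why the tests here can switch behavior without changing any method signature.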

Example 12 with OfficeParserConfig

use of org.apache.tika.parser.microsoft.OfficeParserConfig in project tika by apache.

the class OOXMLParserTest method testMacroinXlsm.

@Test
public void testMacroinXlsm() throws Exception {
    //test default is "don't extract macros"
    for (Metadata metadata : getRecursiveMetadata("testEXCEL_macro.xlsm")) {
        // constant-first equals avoids a NullPointerException when no content type is set
        if ("text/x-vbasic".equals(metadata.get(Metadata.CONTENT_TYPE))) {
            fail("Shouldn't have extracted macros as default");
        }
    }
    //now test that macros are extracted when extraction is enabled
    ParseContext context = new ParseContext();
    OfficeParserConfig officeParserConfig = new OfficeParserConfig();
    officeParserConfig.setExtractMacros(true);
    context.set(OfficeParserConfig.class, officeParserConfig);
    Metadata minExpected = new Metadata();
    minExpected.add(RecursiveParserWrapper.TIKA_CONTENT.getName(), "Sub Dirty()");
    minExpected.add(RecursiveParserWrapper.TIKA_CONTENT.getName(), "dirty dirt dirt");
    minExpected.add(Metadata.CONTENT_TYPE, "text/x-vbasic");
    minExpected.add(TikaCoreProperties.EMBEDDED_RESOURCE_TYPE, TikaCoreProperties.EmbeddedResourceType.MACRO.toString());
    assertContainsAtLeast(minExpected, getRecursiveMetadata("testEXCEL_macro.xlsm", context));
    //test configuring via config file
    TikaConfig tikaConfig = new TikaConfig(this.getClass().getResourceAsStream("tika-config-dom-macros.xml"));
    AutoDetectParser parser = new AutoDetectParser(tikaConfig);
    assertContainsAtLeast(minExpected, getRecursiveMetadata("testEXCEL_macro.xlsm", parser));
}
Also used : TikaConfig(org.apache.tika.config.TikaConfig) Metadata(org.apache.tika.metadata.Metadata) ParseContext(org.apache.tika.parser.ParseContext) OfficeParserConfig(org.apache.tika.parser.microsoft.OfficeParserConfig) AutoDetectParser(org.apache.tika.parser.AutoDetectParser) ExcelParserTest(org.apache.tika.parser.microsoft.ExcelParserTest) Test(org.junit.Test) TikaTest(org.apache.tika.TikaTest) WordParserTest(org.apache.tika.parser.microsoft.WordParserTest)

Example 13 with OfficeParserConfig

use of org.apache.tika.parser.microsoft.OfficeParserConfig in project tika by apache.

the class OOXMLParserTest method testTurningOffTextBoxExtraction.

//TIKA-2346
@Test
public void testTurningOffTextBoxExtraction() throws Exception {
    ParseContext pc = new ParseContext();
    OfficeParserConfig officeParserConfig = new OfficeParserConfig();
    officeParserConfig.setIncludeShapeBasedContent(false);
    pc.set(OfficeParserConfig.class, officeParserConfig);
    String xml = getXML("testWORD_text_box.docx", pc).xml;
    assertContains("This text is directly in the body of the document.", xml);
    assertNotContained("This text is inside of a text box in the body of the document.", xml);
    assertNotContained("This text is inside of a text box in the header of the document.", xml);
    assertNotContained("This text is inside of a text box in the footer of the document.", xml);
}
Also used : ParseContext(org.apache.tika.parser.ParseContext) OfficeParserConfig(org.apache.tika.parser.microsoft.OfficeParserConfig) ExcelParserTest(org.apache.tika.parser.microsoft.ExcelParserTest) Test(org.junit.Test) TikaTest(org.apache.tika.TikaTest) WordParserTest(org.apache.tika.parser.microsoft.WordParserTest)

Example 14 with OfficeParserConfig

use of org.apache.tika.parser.microsoft.OfficeParserConfig in project tika by apache.

the class OOXMLParserTest method testBatch.

//@Test //use this for lightweight benchmarking to compare xwpf options
public void testBatch() throws Exception {
    OfficeParserConfig officeParserConfig = new OfficeParserConfig();
    officeParserConfig.setUseSAXDocxExtractor(true);
    long started = new Date().getTime();
    int ex = 0;
    for (int i = 0; i < 100; i++) {
        for (File f : getResourceAsFile("/test-documents").listFiles()) {
            if (!f.getName().endsWith(".docx")) {
                continue;
            }
            try (InputStream is = TikaInputStream.get(f)) {
                ParseContext parseContext = new ParseContext();
                parseContext.set(OfficeParserConfig.class, officeParserConfig);
                //test only the extraction of the main docx content, not embedded docs
                parseContext.set(Parser.class, new EmptyParser());
                Metadata metadata = new Metadata();
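                //"parser" is defined outside this excerpt; the result is discarded because only timing matters
                getXML(is, parser, metadata, parseContext);
            } catch (Exception e) {
                //count the failure and keep going so one bad file does not abort the batch
                ex++;
            }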
        }
    }
    System.out.println("elapsed: " + (new Date().getTime() - started) + " with " + ex + " exceptions");
}
Also used : TikaInputStream(org.apache.tika.io.TikaInputStream) InputStream(java.io.InputStream) OfficeParserConfig(org.apache.tika.parser.microsoft.OfficeParserConfig) ParseContext(org.apache.tika.parser.ParseContext) EmptyParser(org.apache.tika.parser.EmptyParser) Metadata(org.apache.tika.metadata.Metadata) File(java.io.File) Date(java.util.Date) EncryptedDocumentException(org.apache.tika.exception.EncryptedDocumentException)
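
The batch loop above follows a common lightweight timing pattern: run every task a fixed number of times, count failures instead of aborting, and report elapsed time once at the end. A self-contained sketch of that pattern is given here. BatchTimingSketch, Result, and timeBatch are hypothetical names introduced for illustration, and the sketch uses System.nanoTime() where the original uses Date.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

public class BatchTimingSketch {

    /** Outcome of one timed batch: elapsed time and number of failed task runs. */
    public static final class Result {
        public final long elapsedNanos;
        public final int exceptions;

        Result(long elapsedNanos, int exceptions) {
            this.elapsedNanos = elapsedNanos;
            this.exceptions = exceptions;
        }
    }

    /** Runs every task once per iteration, counting exceptions rather than rethrowing them. */
    public static Result timeBatch(List<Callable<Object>> tasks, int iterations) {
        long started = System.nanoTime();
        int exceptions = 0;
        for (int i = 0; i < iterations; i++) {
            for (Callable<Object> task : tasks) {
                try {
                    task.call();
                } catch (Exception e) {
                    exceptions++; // keep going so one failing task does not end the run
                }
            }
        }
        return new Result(System.nanoTime() - started, exceptions);
    }

    public static void main(String[] args) {
        List<Callable<Object>> tasks = new ArrayList<>();
        tasks.add(() -> "parsed");
        tasks.add(() -> {
            throw new IllegalStateException("simulated parse failure");
        });

        Result result = timeBatch(tasks, 100);
        System.out.println("elapsed: " + (result.elapsedNanos / 1_000_000)
                + " ms with " + result.exceptions + " exceptions");
    }
}
```

A loop like this gives only a rough comparison between options; it does no warm-up and does not isolate JIT or I/O effects, so a harness such as JMH is the better fit when precise numbers matter.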

Example 15 with OfficeParserConfig

use of org.apache.tika.parser.microsoft.OfficeParserConfig in project tika by apache.

the class SXWPFExtractorTest method testSkipDeleted.

@Test
public void testSkipDeleted() throws Exception {
    ParseContext pc = new ParseContext();
    OfficeParserConfig officeParserConfig = new OfficeParserConfig();
    officeParserConfig.setIncludeDeletedContent(true);
    officeParserConfig.setUseSAXDocxExtractor(true);
    officeParserConfig.setIncludeMoveFromContent(true);
    pc.set(OfficeParserConfig.class, officeParserConfig);
    XMLResult r = getXML("testWORD_2006ml.docx", pc);
    assertContains("frog", r.xml);
    assertContainsCount("Second paragraph", r.xml, 2);
}
Also used : ParseContext(org.apache.tika.parser.ParseContext) OfficeParserConfig(org.apache.tika.parser.microsoft.OfficeParserConfig) Test(org.junit.Test) TikaTest(org.apache.tika.TikaTest)

Aggregations

OfficeParserConfig (org.apache.tika.parser.microsoft.OfficeParserConfig) 16
ParseContext (org.apache.tika.parser.ParseContext) 15
TikaTest (org.apache.tika.TikaTest) 13
Test (org.junit.Test) 13
Metadata (org.apache.tika.metadata.Metadata) 9
AutoDetectParser (org.apache.tika.parser.AutoDetectParser) 6
ExcelParserTest (org.apache.tika.parser.microsoft.ExcelParserTest) 6
WordParserTest (org.apache.tika.parser.microsoft.WordParserTest) 6
TikaConfig (org.apache.tika.config.TikaConfig) 5
InputStream (java.io.InputStream) 2
EncryptedDocumentException (org.apache.tika.exception.EncryptedDocumentException) 2
TikaInputStream (org.apache.tika.io.TikaInputStream) 2
File (java.io.File) 1
Date (java.util.Date) 1
HashMap (java.util.HashMap) 1
Locale (java.util.Locale) 1
Map (java.util.Map) 1
CloseShieldInputStream (org.apache.commons.io.input.CloseShieldInputStream) 1
POIXMLDocument (org.apache.poi.POIXMLDocument) 1
POIXMLTextExtractor (org.apache.poi.POIXMLTextExtractor) 1