Example 6 with OfficeParser

Use of org.apache.tika.parser.microsoft.OfficeParser in the Apache Tika project.

Class OOXMLParserTest, method testExcelXLSB.

@Test
public void testExcelXLSB() throws Exception {
    Detector detector = new DefaultDetector();
    AutoDetectParser parser = new AutoDetectParser();
    Metadata m = new Metadata();
    m.add(Metadata.RESOURCE_NAME_KEY, "excel.xlsb");
    // Should be detected correctly
    MediaType type;
    try (InputStream input = ExcelParserTest.class.getResourceAsStream("/test-documents/testEXCEL.xlsb")) {
        type = detector.detect(input, m);
        assertEquals("application/vnd.ms-excel.sheet.binary.macroenabled.12", type.toString());
    }
    // OfficeParser won't handle it
    assertFalse(new OfficeParser().getSupportedTypes(new ParseContext()).contains(type));
    // OOXMLParser will (soon) handle it
    assertTrue((new OOXMLParser()).getSupportedTypes(new ParseContext()).contains(type));
    // AutoDetectParser doesn't break on it
    try (InputStream input = ExcelParserTest.class.getResourceAsStream("/test-documents/testEXCEL.xlsb")) {
        ContentHandler handler = new BodyContentHandler(-1);
        ParseContext context = new ParseContext();
        context.set(Locale.class, Locale.US);
        parser.parse(input, handler, m, context);
        String content = handler.toString();
        assertContains("This is an example spreadsheet", content);
    }
}
Also used : DefaultDetector(org.apache.tika.detect.DefaultDetector) BodyContentHandler(org.apache.tika.sax.BodyContentHandler) Detector(org.apache.tika.detect.Detector) OfficeParser(org.apache.tika.parser.microsoft.OfficeParser) TikaInputStream(org.apache.tika.io.TikaInputStream) InputStream(java.io.InputStream) Metadata(org.apache.tika.metadata.Metadata) ParseContext(org.apache.tika.parser.ParseContext) AutoDetectParser(org.apache.tika.parser.AutoDetectParser) MediaType(org.apache.tika.mime.MediaType) ContentHandler(org.xml.sax.ContentHandler) ExcelParserTest(org.apache.tika.parser.microsoft.ExcelParserTest) Test(org.junit.Test) TikaTest(org.apache.tika.TikaTest) WordParserTest(org.apache.tika.parser.microsoft.WordParserTest)
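
The test above expects OfficeParser to reject the XLSB type while OOXMLParser accepts it. One way to see why: an .xlsb workbook is a ZIP package, whereas OfficeParser targets OLE2 compound documents. The following is only a minimal sketch of telling the two containers apart by their leading bytes. The class ContainerSniffSketch and its method names are hypothetical, and this is not how Tika's detectors are implemented.

```java
// Hypothetical illustration only; Tika's real detection is considerably more involved.
final class ContainerSniffSketch {

    // Signature at the start of an OLE2 compound document (legacy .doc, .xls, .ppt).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Local file header signature at the start of a ZIP archive (.xlsx, .xlsb, .docx).
    private static final byte[] ZIP_MAGIC = {'P', 'K', 0x03, 0x04};

    // Classifies the first bytes of a file as "ole2", "zip" or "unknown".
    static String sniff(byte[] head) {
        if (startsWith(head, OLE2_MAGIC)) {
            return "ole2";
        }
        if (startsWith(head, ZIP_MAGIC)) {
            return "zip";
        }
        return "unknown";
    }

    // Returns true when data begins with every byte of prefix.
    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] zipHead = {'P', 'K', 0x03, 0x04, 0x14, 0x00};
        System.out.println(sniff(zipHead));
    }
}
```

A ZIP signature alone cannot distinguish .xlsb from .xlsx or any other archive, which is presumably why the test also supplies the resource name through the Metadata object.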

Example 7 with OfficeParser

Use of org.apache.tika.parser.microsoft.OfficeParser in the Apache Tika project.

Class SolidworksParserTest, method testAssembly2014SP0Parser.

/**
 * Tests the parsing of a SolidWorks assembly saved in version 2014 SP0.
 */
@Test
public void testAssembly2014SP0Parser() throws Exception {
    // try-with-resources closes the stream even if an assertion fails
    try (InputStream input = SolidworksParserTest.class.getResourceAsStream("/test-documents/testsolidworksAssembly2014SP0.SLDASM")) {
        ContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        new OfficeParser().parse(input, handler, metadata, new ParseContext());
        // Check content type
        assertEquals("application/sldworks", metadata.get(Metadata.CONTENT_TYPE));
        // Check properties
        assertEquals("2012-04-25T09:51:38Z", metadata.get(TikaCoreProperties.CREATED));
        assertNull(metadata.get(TikaCoreProperties.CONTRIBUTOR));
        assertEquals("2013-11-28T12:41:49Z", metadata.get(Metadata.MODIFIED));
        assertEquals("solidworks-dcom_dev", metadata.get(TikaCoreProperties.MODIFIER));
        assertNull(metadata.get(TikaCoreProperties.RELATION));
        assertNull(metadata.get(TikaCoreProperties.RIGHTS));
        assertNull(metadata.get(TikaCoreProperties.SOURCE));
        assertEquals("", metadata.get(TikaCoreProperties.TITLE));
        assertEquals("", metadata.get(TikaCoreProperties.KEYWORDS));
    }
}
Also used : BodyContentHandler(org.apache.tika.sax.BodyContentHandler) OfficeParser(org.apache.tika.parser.microsoft.OfficeParser) InputStream(java.io.InputStream) Metadata(org.apache.tika.metadata.Metadata) ParseContext(org.apache.tika.parser.ParseContext) ContentHandler(org.xml.sax.ContentHandler) Test(org.junit.Test) TikaTest(org.apache.tika.TikaTest)
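
The creation and modification values asserted above are ISO 8601 UTC strings, so code that consumes them can work with real timestamps instead of comparing text. The following is a minimal sketch using only java.time. The class MetadataDateSketch and its methods are hypothetical helpers and not part of Tika.

```java
import java.time.Instant;

// Hypothetical helper for illustration; it does not use or mirror any Tika API.
final class MetadataDateSketch {

    // Parses an ISO 8601 UTC timestamp such as "2012-04-25T09:51:38Z".
    static Instant parseUtc(String value) {
        return Instant.parse(value);
    }

    // Returns true when both values are present and created is not after modified.
    static boolean createdNotAfterModified(String created, String modified) {
        if (created == null || modified == null) {
            return false;
        }
        return !parseUtc(created).isAfter(parseUtc(modified));
    }

    public static void main(String[] args) {
        // The two values asserted in the test above.
        String created = "2012-04-25T09:51:38Z";
        String modified = "2013-11-28T12:41:49Z";
        System.out.println(createdNotAfterModified(created, modified));
    }
}
```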

Example 8 with OfficeParser

Use of org.apache.tika.parser.microsoft.OfficeParser in the Apache Tika project.

Class TikaToXMP, method initialize.

/**
 * Initializes the map with supported converters.
 */
private static void initialize() {
    // No particular parsing context is needed
    ParseContext parseContext = new ParseContext();
    // MS Office Binary File Format
    addConverter(new OfficeParser().getSupportedTypes(parseContext), MSOfficeBinaryConverter.class);
    // Rich Text Format
    addConverter(new RTFParser().getSupportedTypes(parseContext), RTFConverter.class);
    // MS Open XML Format
    addConverter(new OOXMLParser().getSupportedTypes(parseContext), MSOfficeXMLConverter.class);
    // Open document format
    addConverter(new OpenDocumentParser().getSupportedTypes(parseContext), OpenDocumentConverter.class);
}
Also used : RTFParser(org.apache.tika.parser.rtf.RTFParser) OOXMLParser(org.apache.tika.parser.microsoft.ooxml.OOXMLParser) OpenDocumentParser(org.apache.tika.parser.odf.OpenDocumentParser) OfficeParser(org.apache.tika.parser.microsoft.OfficeParser) ParseContext(org.apache.tika.parser.ParseContext)
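
The excerpt calls addConverter but does not show it. The following is a minimal sketch of what such a registry might look like, assuming it stores one converter per media type. Media types and converters are modelled as plain strings here, and the class ConverterRegistrySketch and the method converterFor are hypothetical; the real TikaToXMP code may differ.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical stand-in for the map behind addConverter(...); not Tika's code.
final class ConverterRegistrySketch {

    // Maps a media type string to the name of the converter that handles it.
    private static final Map<String, String> CONVERTERS = new TreeMap<>();

    // Registers the same converter for every media type in the given set.
    static void addConverter(Set<String> supportedTypes, String converterName) {
        for (String type : supportedTypes) {
            CONVERTERS.put(type, converterName);
        }
    }

    // Returns the converter registered for the type, or null if there is none.
    static String converterFor(String mediaType) {
        return CONVERTERS.get(mediaType);
    }

    public static void main(String[] args) {
        Set<String> officeTypes = new HashSet<>(
                Arrays.asList("application/msword", "application/vnd.ms-excel"));
        addConverter(officeTypes, "MSOfficeBinaryConverter");
        addConverter(new HashSet<>(Arrays.asList("application/rtf")), "RTFConverter");
        System.out.println(converterFor("application/msword"));
        System.out.println(converterFor("text/plain"));
    }
}
```

In this sketch a later registration for the same media type overwrites an earlier one, so the order of the addConverter calls matters when the supported sets of two parsers overlap.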

Aggregations

ParseContext (org.apache.tika.parser.ParseContext) 8
OfficeParser (org.apache.tika.parser.microsoft.OfficeParser) 8
InputStream (java.io.InputStream) 7
TikaTest (org.apache.tika.TikaTest) 7
Metadata (org.apache.tika.metadata.Metadata) 7
BodyContentHandler (org.apache.tika.sax.BodyContentHandler) 7
Test (org.junit.Test) 7
ContentHandler (org.xml.sax.ContentHandler) 7
DefaultDetector (org.apache.tika.detect.DefaultDetector) 1
Detector (org.apache.tika.detect.Detector) 1
TikaInputStream (org.apache.tika.io.TikaInputStream) 1
MediaType (org.apache.tika.mime.MediaType) 1
AutoDetectParser (org.apache.tika.parser.AutoDetectParser) 1
ExcelParserTest (org.apache.tika.parser.microsoft.ExcelParserTest) 1
WordParserTest (org.apache.tika.parser.microsoft.WordParserTest) 1
OOXMLParser (org.apache.tika.parser.microsoft.ooxml.OOXMLParser) 1
OpenDocumentParser (org.apache.tika.parser.odf.OpenDocumentParser) 1
RTFParser (org.apache.tika.parser.rtf.RTFParser) 1