
Example 41 with TikaConfig

use of org.apache.tika.config.TikaConfig in project tika by apache.

the class EmbeddedDocumentUtil method getExtension.

public String getExtension(TikaInputStream is, Metadata metadata) {
    String mimeString = metadata.get(Metadata.CONTENT_TYPE);
    TikaConfig config = getConfig();
    MimeType mimeType = null;
    MimeTypes types = config.getMimeRepository();
    boolean detected = false;
    if (mimeString != null) {
        try {
            mimeType = types.forName(mimeString);
        } catch (MimeTypeException e) {
            // Declared content type is not a valid MIME name; fall back to detection below.
        }
    }
    if (mimeType == null) {
        Detector detector = config.getDetector();
        try {
            MediaType mediaType = detector.detect(is, metadata);
            mimeType = types.forName(mediaType.toString());
            detected = true;
            is.reset();
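        } catch (IOException e) {
            // I/O failure during detection or reset; continue with whatever mimeType holds.
        } catch (MimeTypeException e) {
            // Detected type is not a valid MIME name; mimeType stays null.
        }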
    }
    if (mimeType != null) {
        if (detected) {
            //set or correct the mime type
            metadata.set(Metadata.CONTENT_TYPE, mimeType.toString());
        }
        return mimeType.getExtension();
    }
    return ".bin";
}
Also used : Detector(org.apache.tika.detect.Detector) TikaConfig(org.apache.tika.config.TikaConfig) MimeTypeException(org.apache.tika.mime.MimeTypeException) MediaType(org.apache.tika.mime.MediaType) IOException(java.io.IOException) MimeTypes(org.apache.tika.mime.MimeTypes) MimeType(org.apache.tika.mime.MimeType)
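
The fallback order in getExtension (declared content type first, then content detection, then ".bin") can be sketched without Tika on the classpath. This is only an illustration under stated assumptions: ExtensionFallbackSketch, EXTENSIONS, and extensionFor are hypothetical names, the Map stands in for the MimeTypes repository, and the Supplier stands in for Detector.detect.

```java
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-in, not Tika API: it mirrors only the fallback order.
final class ExtensionFallbackSketch {

    // Tiny illustrative registry; in Tika the MimeTypes repository plays this role.
    private static final Map<String, String> EXTENSIONS = Map.of(
            "application/pdf", ".pdf",
            "text/plain", ".txt");

    /**
     * @param declaredType content type already recorded in metadata, may be null
     * @param detector     stand-in for content-based detection; consulted only
     *                     when the declared type is missing or unknown
     */
    static String extensionFor(String declaredType, Supplier<String> detector) {
        // Step 1: trust the declared type if the registry knows it.
        String ext = declaredType == null ? null : EXTENSIONS.get(declaredType);
        if (ext == null) {
            // Step 2: fall back to detection.
            String detected = detector.get();
            if (detected != null) {
                ext = EXTENSIONS.get(detected);
            }
        }
        // Step 3: generic default when nothing could be resolved.
        return ext != null ? ext : ".bin";
    }

    public static void main(String[] args) {
        System.out.println(extensionFor("application/pdf", () -> "text/plain")); // .pdf
        System.out.println(extensionFor(null, () -> "text/plain"));              // .txt
        System.out.println(extensionFor("x/unknown", () -> null));               // .bin
    }
}
```

Note that the sketch leaves out one detail of the original method: when the type comes from detection, getExtension also writes it back into the metadata.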

Example 42 with TikaConfig

use of org.apache.tika.config.TikaConfig in project tika by apache.

the class EmbeddedDocumentUtil method getEmbeddedDocumentExtractor.

/**
 * This offers a uniform way to get an EmbeddedDocumentExtractor from a ParseContext.
 * As of Tika 1.15, an AutoDetectParser will automatically be added to parse
 * embedded documents if no Parser.class is specified in the ParseContext.
 * <p>
 * If you'd prefer not to parse embedded documents, set Parser.class
 * to {@link org.apache.tika.parser.EmptyParser} in the ParseContext.
 *
 * @param context the ParseContext to consult; a default Parser is registered in it if none is set
 * @return the EmbeddedDocumentExtractor from the context, or a new ParsingEmbeddedDocumentExtractor if none was set
 */
public static EmbeddedDocumentExtractor getEmbeddedDocumentExtractor(ParseContext context) {
    EmbeddedDocumentExtractor extractor = context.get(EmbeddedDocumentExtractor.class);
    if (extractor == null) {
        //ensure that an AutoDetectParser is
        //available for parsing embedded docs TIKA-2096
        Parser embeddedParser = context.get(Parser.class);
        if (embeddedParser == null) {
            TikaConfig tikaConfig = context.get(TikaConfig.class);
            if (tikaConfig == null) {
                context.set(Parser.class, new AutoDetectParser());
            } else {
                context.set(Parser.class, new AutoDetectParser(tikaConfig));
            }
        }
        extractor = new ParsingEmbeddedDocumentExtractor(context);
    }
    return extractor;
}
Also used : TikaConfig(org.apache.tika.config.TikaConfig) AutoDetectParser(org.apache.tika.parser.AutoDetectParser) Parser(org.apache.tika.parser.Parser) CompositeParser(org.apache.tika.parser.CompositeParser)
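
The method above relies on a context object keyed by class: look up an entry by its type and, if it is absent, register a default so later lookups find it. A minimal stdlib-only sketch of that pattern follows. TypeKeyedContextSketch and getOrInstall are hypothetical names chosen for illustration; this is not how ParseContext is implemented, only the same idea in miniature.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-in for a class-keyed context; not Tika's ParseContext.
final class TypeKeyedContextSketch {

    private final Map<Class<?>, Object> entries = new HashMap<>();

    // Register a value under its type key, replacing any earlier value.
    <T> void set(Class<T> key, T value) {
        entries.put(key, value);
    }

    // Returns the registered value, or null if nothing was set for this key.
    <T> T get(Class<T> key) {
        return key.cast(entries.get(key));
    }

    // Returns the registered value; if none exists, installs the fallback first,
    // so that subsequent lookups see the same default.
    <T> T getOrInstall(Class<T> key, Supplier<T> fallback) {
        T value = get(key);
        if (value == null) {
            value = fallback.get();
            set(key, value);
        }
        return value;
    }

    public static void main(String[] args) {
        TypeKeyedContextSketch ctx = new TypeKeyedContextSketch();
        CharSequence first = ctx.getOrInstall(CharSequence.class, () -> "default");
        System.out.println(first);                       // default
        ctx.set(CharSequence.class, "explicit");
        System.out.println(ctx.get(CharSequence.class)); // explicit
    }
}
```

An explicit set always wins over the fallback, which is the same reason that registering an EmptyParser under Parser.class, as the Javadoc suggests, prevents the AutoDetectParser default from being installed.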

Example 43 with TikaConfig

use of org.apache.tika.config.TikaConfig in project tika by apache.

the class EnviHeaderParser method parse.

public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException {
    // Only outputting the MIME type as metadata
    metadata.set(Metadata.CONTENT_TYPE, ENVI_MIME_TYPE);
    // The following code was taken from the TXTParser
    // Automatically detect the character encoding
    TikaConfig tikaConfig = context.get(TikaConfig.class);
    if (tikaConfig == null) {
        tikaConfig = TikaConfig.getDefaultConfig();
    }
    try (AutoDetectReader reader = new AutoDetectReader(new CloseShieldInputStream(stream), metadata, getEncodingDetector(context))) {
        Charset charset = reader.getCharset();
        MediaType type = new MediaType(MediaType.TEXT_PLAIN, charset);
        // deprecated, see TIKA-431
        metadata.set(Metadata.CONTENT_ENCODING, charset.name());
        XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
        xhtml.startDocument();
        // text contents of the xhtml
        String line;
        while ((line = reader.readLine()) != null) {
            xhtml.startElement("p");
            xhtml.characters(line);
            xhtml.endElement("p");
        }
        xhtml.endDocument();
    }
}
Also used : TikaConfig(org.apache.tika.config.TikaConfig) AutoDetectReader(org.apache.tika.detect.AutoDetectReader) Charset(java.nio.charset.Charset) MediaType(org.apache.tika.mime.MediaType) XHTMLContentHandler(org.apache.tika.sax.XHTMLContentHandler) CloseShieldInputStream(org.apache.commons.io.input.CloseShieldInputStream)

Example 44 with TikaConfig

use of org.apache.tika.config.TikaConfig in project tika by apache.

the class OOXMLParserTest method testMacroinXlsm.

@Test
public void testMacroinXlsm() throws Exception {
    //test default is "don't extract macros"
    for (Metadata metadata : getRecursiveMetadata("testEXCEL_macro.xlsm")) {
        if (metadata.get(Metadata.CONTENT_TYPE).equals("text/x-vbasic")) {
            fail("Shouldn't have extracted macros as default");
        }
    }
    //now test that they were extracted
    ParseContext context = new ParseContext();
    OfficeParserConfig officeParserConfig = new OfficeParserConfig();
    officeParserConfig.setExtractMacros(true);
    context.set(OfficeParserConfig.class, officeParserConfig);
    Metadata minExpected = new Metadata();
    minExpected.add(RecursiveParserWrapper.TIKA_CONTENT.getName(), "Sub Dirty()");
    minExpected.add(RecursiveParserWrapper.TIKA_CONTENT.getName(), "dirty dirt dirt");
    minExpected.add(Metadata.CONTENT_TYPE, "text/x-vbasic");
    minExpected.add(TikaCoreProperties.EMBEDDED_RESOURCE_TYPE, TikaCoreProperties.EmbeddedResourceType.MACRO.toString());
    assertContainsAtLeast(minExpected, getRecursiveMetadata("testEXCEL_macro.xlsm", context));
    //test configuring via config file
    TikaConfig tikaConfig = new TikaConfig(this.getClass().getResourceAsStream("tika-config-dom-macros.xml"));
    AutoDetectParser parser = new AutoDetectParser(tikaConfig);
    assertContainsAtLeast(minExpected, getRecursiveMetadata("testEXCEL_macro.xlsm", parser));
}
Also used : TikaConfig(org.apache.tika.config.TikaConfig) Metadata(org.apache.tika.metadata.Metadata) ParseContext(org.apache.tika.parser.ParseContext) OfficeParserConfig(org.apache.tika.parser.microsoft.OfficeParserConfig) AutoDetectParser(org.apache.tika.parser.AutoDetectParser) ExcelParserTest(org.apache.tika.parser.microsoft.ExcelParserTest) Test(org.junit.Test) TikaTest(org.apache.tika.TikaTest) WordParserTest(org.apache.tika.parser.microsoft.WordParserTest)

Example 45 with TikaConfig

use of org.apache.tika.config.TikaConfig in project tika by apache.

the class OOXMLParserTest method testInitializationViaConfig.

@Test
public void testInitializationViaConfig() throws Exception {
    //NOTE: this test relies on a bug in the DOM extractor that
    //is passing over the title information.
    //once we fix that, this test will no longer be meaningful!
    InputStream is = getClass().getResourceAsStream("/org/apache/tika/parser/microsoft/tika-config-sax-docx.xml");
    assertNotNull(is);
    TikaConfig tikaConfig = new TikaConfig(is);
    AutoDetectParser p = new AutoDetectParser(tikaConfig);
    XMLResult xml = getXML("testWORD_2006ml.docx", p, new Metadata());
    assertContains("engaging title", xml.xml);
}
Also used : TikaConfig(org.apache.tika.config.TikaConfig) TikaInputStream(org.apache.tika.io.TikaInputStream) InputStream(java.io.InputStream) Metadata(org.apache.tika.metadata.Metadata) AutoDetectParser(org.apache.tika.parser.AutoDetectParser) ExcelParserTest(org.apache.tika.parser.microsoft.ExcelParserTest) Test(org.junit.Test) TikaTest(org.apache.tika.TikaTest) WordParserTest(org.apache.tika.parser.microsoft.WordParserTest)

Aggregations

TikaConfig (org.apache.tika.config.TikaConfig) 62
Test (org.junit.Test) 32
Metadata (org.apache.tika.metadata.Metadata) 26
AutoDetectParser (org.apache.tika.parser.AutoDetectParser) 20
TikaTest (org.apache.tika.TikaTest) 16
InputStream (java.io.InputStream) 12
Tika (org.apache.tika.Tika) 12
IOException (java.io.IOException) 10
URL (java.net.URL) 10
TikaException (org.apache.tika.exception.TikaException) 9
TikaInputStream (org.apache.tika.io.TikaInputStream) 9
ParseContext (org.apache.tika.parser.ParseContext) 9
Parser (org.apache.tika.parser.Parser) 9
MediaType (org.apache.tika.mime.MediaType) 8
CompositeParser (org.apache.tika.parser.CompositeParser) 8
ByteArrayInputStream (java.io.ByteArrayInputStream) 7
File (java.io.File) 6
TikaConfigTest (org.apache.tika.config.TikaConfigTest) 6
HashSet (java.util.HashSet) 5
SAXException (org.xml.sax.SAXException) 5