Example 26 with TikaConfig

Use of org.apache.tika.config.TikaConfig in project jackrabbit-oak by apache.

Class BinaryTextExtractor, method initializeTikaConfig.

private static TikaConfigHolder initializeTikaConfig(@Nullable IndexDefinition definition) {
    ClassLoader current = Thread.currentThread().getContextClassLoader();
    InputStream configStream = null;
    String configSource = null;
    try {
        Thread.currentThread().setContextClassLoader(LuceneIndexEditorContext.class.getClassLoader());
        if (definition != null && definition.hasCustomTikaConfig()) {
            log.debug("[{}] Using custom tika config", definition.getIndexName());
            configSource = "Custom config at " + definition.getIndexPath();
            configStream = definition.getTikaConfig();
        } else {
            URL configUrl = LuceneIndexEditorContext.class.getResource("tika-config.xml");
            if (configUrl != null) {
                configSource = configUrl.toString();
                configStream = configUrl.openStream();
            }
        }
        if (configStream != null) {
            return new TikaConfigHolder(new TikaConfig(configStream), configSource);
        }
    } catch (TikaException | IOException | SAXException e) {
        log.warn("Tika configuration not available: {}", configSource, e);
    } finally {
        IOUtils.closeQuietly(configStream);
        Thread.currentThread().setContextClassLoader(current);
    }
    return new TikaConfigHolder(TikaConfig.getDefaultConfig(), "Default Config");
}
Also used : TikaException(org.apache.tika.exception.TikaException) TikaConfig(org.apache.tika.config.TikaConfig) LazyInputStream(org.apache.jackrabbit.oak.commons.io.LazyInputStream) CountingInputStream(com.google.common.io.CountingInputStream) InputStream(java.io.InputStream) LuceneIndexEditorContext(org.apache.jackrabbit.oak.plugins.index.lucene.LuceneIndexEditorContext) IOException(java.io.IOException) URL(java.net.URL) SAXException(org.xml.sax.SAXException)
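
The snippet above temporarily replaces the thread context class loader while the Tika configuration is parsed, then restores the previous loader in a finally block. A minimal, library-free sketch of that swap-and-restore pattern follows. The class name ContextClassLoaderSwap and the method callWith are hypothetical names chosen for illustration; they are not part of jackrabbit-oak or Tika.

```java
import java.util.function.Supplier;

// Hypothetical helper illustrating the swap-and-restore pattern; not an Oak or Tika API.
final class ContextClassLoaderSwap {

    private ContextClassLoaderSwap() {
    }

    /**
     * Runs the task with the given loader installed as the thread context
     * class loader, and restores the previous loader even if the task throws.
     */
    static <T> T callWith(ClassLoader loader, Supplier<T> task) {
        Thread thread = Thread.currentThread();
        ClassLoader previous = thread.getContextClassLoader();
        thread.setContextClassLoader(loader);
        try {
            return task.get();
        } finally {
            // Mirrors the finally block in initializeTikaConfig.
            thread.setContextClassLoader(previous);
        }
    }

    public static void main(String[] args) {
        ClassLoader before = Thread.currentThread().getContextClassLoader();
        // An empty anonymous subclass stands in for a plugin or bundle class loader.
        ClassLoader temporary = new ClassLoader(before) { };
        ClassLoader seen = callWith(temporary,
                () -> Thread.currentThread().getContextClassLoader());
        System.out.println("swapped inside task: " + (seen == temporary));
        System.out.println("restored afterwards: "
                + (Thread.currentThread().getContextClassLoader() == before));
    }
}
```

Wrapping the swap in a helper keeps the restore step in one place, so a caller cannot forget it on an early return or an exception.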

Example 27 with TikaConfig

Use of org.apache.tika.config.TikaConfig in project gate-core by GateNLP.

Class TikaFormat, method unpackMarkup.

@Override
public void unpackMarkup(Document doc, RepositioningInfo repInfo, RepositioningInfo ampCodingInfo) throws DocumentFormatException {
    if (doc == null || doc.getSourceUrl() == null) {
        throw new DocumentFormatException("GATE document is null or no content found. Nothing to parse!");
    }
    // Create a status listener
    StatusListener statusListener = new StatusListener() {

        @Override
        public void statusChanged(String text) {
            // This is implemented in DocumentFormat.java and inherited here
            fireStatusChanged(text);
        }
    };
    XmlDocumentHandler ch = new XmlDocumentHandler(doc, this.markupElementsMap, this.element2StringMap);
    Metadata metadata = extractParserTips(doc);
    ch.addStatusListener(statusListener);
    ch.setRepositioningInfo(repInfo);
    // set the object with ampersand coding positions
    ch.setAmpCodingInfo(ampCodingInfo);
    InputStream input = null;
    try {
        Parser tikaParser = new TikaConfig().getParser();
        input = doc.getSourceUrl().openStream();
        tikaParser.parse(input, ch, metadata, new ParseContext());
        setDocumentFeatures(metadata, doc);
    } catch (IOException | SAXException | TikaException e) {
        throw new DocumentFormatException(e);
    } finally {
        // null safe
        IOUtils.closeQuietly(input);
        ch.removeStatusListener(statusListener);
    }
    if (doc instanceof DocumentImpl) {
        ((DocumentImpl) doc).setNextAnnotationId(ch.getCustomObjectsId());
    }
}
Also used : TikaException(org.apache.tika.exception.TikaException) TikaConfig(org.apache.tika.config.TikaConfig) InputStream(java.io.InputStream) XmlDocumentHandler(gate.xml.XmlDocumentHandler) Metadata(org.apache.tika.metadata.Metadata) IOException(java.io.IOException) Parser(org.apache.tika.parser.Parser) SAXException(org.xml.sax.SAXException) DocumentFormatException(gate.util.DocumentFormatException) ParseContext(org.apache.tika.parser.ParseContext) StatusListener(gate.event.StatusListener)

Example 28 with TikaConfig

Use of org.apache.tika.config.TikaConfig in project camel by apache.

Class TikaComponent, method createEndpoint.

@Override
protected Endpoint createEndpoint(String uri, String remaining, Map<String, Object> parameters) throws Exception {
    TikaConfiguration tikaConfiguration = new TikaConfiguration();
    setProperties(tikaConfiguration, parameters);
    TikaConfig config = resolveAndRemoveReferenceParameter(parameters, TIKA_CONFIG, TikaConfig.class);
    if (config != null) {
        tikaConfiguration.setTikaConfig(config);
    }
    tikaConfiguration.setOperation(new URI(uri).getHost());
    return new TikaEndpoint(uri, this, tikaConfiguration);
}
Also used : TikaConfig(org.apache.tika.config.TikaConfig) URI(java.net.URI)

Example 29 with TikaConfig

Use of org.apache.tika.config.TikaConfig in project camel by apache.

Class TikaConfiguration, method setTikaConfigUri.

/**
 * Tika Config Uri: The URI of tika-config.xml
 */
public void setTikaConfigUri(String tikaConfigUri) throws TikaException, IOException, SAXException {
    this.tikaConfigUri = tikaConfigUri;
    this.tikaConfig = new TikaConfig(tikaConfigUri);
}
Also used : TikaConfig(org.apache.tika.config.TikaConfig)

Example 30 with TikaConfig

Use of org.apache.tika.config.TikaConfig in project jackrabbit-oak by apache.

Class TikaHelper, method getTikaConfig.

private static TikaConfig getTikaConfig(File tikaConfig) throws TikaException, IOException, SAXException {
    TikaConfig config;
    if (tikaConfig == null) {
        URL configUrl = TextExtractor.class.getResource(DEFAULT_TIKA_CONFIG);
        if (configUrl != null) {
            log.info("Loading default Tika config from {}", configUrl);
            config = new TikaConfig(configUrl);
        } else {
            log.info("Using default Tika config");
            config = TikaConfig.getDefaultConfig();
        }
    } else {
        log.info("Loading external Tika config from {}", tikaConfig);
        config = new TikaConfig(tikaConfig);
    }
    return config;
}
Also used : TikaConfig(org.apache.tika.config.TikaConfig) URL(java.net.URL)
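
The method above resolves its configuration in a fixed order: an external file if one is supplied, otherwise a bundled classpath resource, otherwise the library default. The sketch below isolates that precedence without depending on Tika. ConfigSourceResolver, Kind, and resolve are hypothetical names used only for illustration, and java.net.URI stands in for the URL of the original so that the demo needs no checked exceptions.

```java
import java.io.File;
import java.net.URI;

// Hypothetical sketch of the lookup order in getTikaConfig; not an Oak or Tika API.
final class ConfigSourceResolver {

    /** Which source wins the lookup. */
    enum Kind { EXTERNAL_FILE, BUNDLED_RESOURCE, BUILT_IN_DEFAULT }

    private ConfigSourceResolver() {
    }

    /**
     * Returns the source to load from: an explicit file takes precedence,
     * then a bundled resource, and the built-in default is the last resort.
     */
    static Kind resolve(File external, URI bundled) {
        if (external != null) {
            return Kind.EXTERNAL_FILE;
        }
        if (bundled != null) {
            return Kind.BUNDLED_RESOURCE;
        }
        return Kind.BUILT_IN_DEFAULT;
    }

    public static void main(String[] args) {
        URI bundled = URI.create("file:///opt/app/tika-config.xml"); // illustrative location
        System.out.println(resolve(new File("custom-tika.xml"), bundled));
        System.out.println(resolve(null, bundled));
        System.out.println(resolve(null, null));
    }
}
```

Keeping the decision separate from the loading makes the precedence easy to check without real configuration files.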
