Example 6 with MediaType

Use of org.apache.tika.mime.MediaType in the Apache Tika project.

In class AutoDetectParser, method parse:

public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException {
    TemporaryResources tmp = new TemporaryResources();
    try {
        TikaInputStream tis = TikaInputStream.get(stream, tmp);
        // Automatically detect the MIME type of the document
        MediaType type = detector.detect(tis, metadata);
        metadata.set(Metadata.CONTENT_TYPE, type.toString());
        // TIKA-216: Zip bomb prevention
        SecureContentHandler sch = handler != null ? new SecureContentHandler(handler, tis) : null;
        // Register a default embedded document extractor if the caller hasn't specified one.
        if (context.get(EmbeddedDocumentExtractor.class) == null) {
            Parser p = context.get(Parser.class);
            if (p == null) {
                context.set(Parser.class, this);
            }
            context.set(EmbeddedDocumentExtractor.class, new ParsingEmbeddedDocumentExtractor(context));
        }
        try {
            // Parse the document
            super.parse(tis, sch, metadata, context);
        } catch (SAXException e) {
            // Convert zip bomb exceptions to TikaExceptions
            // (sch is null when no handler was supplied, so guard the call)
            if (sch != null) {
                sch.throwIfCauseOf(e);
            }
            throw e;
        }
    } finally {
        tmp.dispose();
    }
}
Also used : ParsingEmbeddedDocumentExtractor(org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor) EmbeddedDocumentExtractor(org.apache.tika.extractor.EmbeddedDocumentExtractor) TemporaryResources(org.apache.tika.io.TemporaryResources) TikaInputStream(org.apache.tika.io.TikaInputStream) MediaType(org.apache.tika.mime.MediaType) SecureContentHandler(org.apache.tika.sax.SecureContentHandler) SAXException(org.xml.sax.SAXException)

Example 7 with MediaType

Use of org.apache.tika.mime.MediaType in the Apache Tika project.

In class CompositeParser, method findDuplicateParsers:

/**
 * Utility method that goes through all the component parsers and finds
 * all media types for which more than one parser declares support. This
 * is useful in tracking down conflicting parser definitions.
 *
 * @since Apache Tika 0.10
 * @see <a href="https://issues.apache.org/jira/browse/TIKA-660">TIKA-660</a>
 * @param context parsing context
 * @return media types that are supported by at least two component parsers
 */
public Map<MediaType, List<Parser>> findDuplicateParsers(ParseContext context) {
    Map<MediaType, Parser> types = new HashMap<MediaType, Parser>();
    Map<MediaType, List<Parser>> duplicates = new HashMap<MediaType, List<Parser>>();
    for (Parser parser : parsers) {
        for (MediaType type : parser.getSupportedTypes(context)) {
            MediaType canonicalType = registry.normalize(type);
            if (types.containsKey(canonicalType)) {
                List<Parser> list = duplicates.get(canonicalType);
                if (list == null) {
                    list = new ArrayList<Parser>();
                    list.add(types.get(canonicalType));
                    duplicates.put(canonicalType, list);
                }
                list.add(parser);
            } else {
                types.put(canonicalType, parser);
            }
        }
    }
    return duplicates;
}
Also used : HashMap(java.util.HashMap) MediaType(org.apache.tika.mime.MediaType) ArrayList(java.util.ArrayList) List(java.util.List)
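
The same bookkeeping can be tried without Tika on the classpath. The following is only a minimal sketch, assuming plain strings in place of MediaType and Parser; the class name DuplicateTypeSketch, the handler names, and the lower-casing step (a stand-in for registry.normalize, which in Tika resolves aliases rather than changing case) are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class DuplicateTypeSketch {

    /** Maps each type claimed by two or more handlers to every handler that claims it. */
    static Map<String, List<String>> findDuplicates(Map<String, List<String>> typesByHandler) {
        Map<String, String> firstOwner = new LinkedHashMap<>();
        Map<String, List<String>> duplicates = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : typesByHandler.entrySet()) {
            String handler = entry.getKey();
            for (String type : entry.getValue()) {
                // Assumption: lower-casing stands in for real media type normalization.
                String canonical = type.trim().toLowerCase(Locale.ROOT);
                String owner = firstOwner.putIfAbsent(canonical, handler);
                if (owner != null) {
                    // Seed the list with the first owner, then record the newcomer.
                    duplicates.computeIfAbsent(canonical, k -> new ArrayList<>(List.of(owner)))
                              .add(handler);
                }
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        Map<String, List<String>> handlers = new LinkedHashMap<>();
        handlers.put("pdfA", List.of("application/pdf"));
        handlers.put("pdfB", List.of("Application/PDF", "text/plain"));
        handlers.put("txt", List.of("text/plain"));
        // prints {application/pdf=[pdfA, pdfB], text/plain=[pdfB, txt]}
        System.out.println(findDuplicates(handlers));
    }
}
```

As in the method above, the first handler seen for a type is added to the duplicate list only when a second handler turns up, so types with a single owner never appear in the result.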

Example 8 with MediaType

Use of org.apache.tika.mime.MediaType in the Apache Tika project.

In class ExternalParsersConfigReader, method readMimeTypes:

private static Set<MediaType> readMimeTypes(Element mimeTypes) {
    Set<MediaType> types = new HashSet<MediaType>();
    NodeList children = mimeTypes.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
        Node node = children.item(i);
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element child = (Element) node;
            if (child.getTagName().equals(MIMETYPE_TAG)) {
                types.add(MediaType.parse(getString(child)));
            }
        }
    }
    return types;
}
Also used : NodeList(org.w3c.dom.NodeList) Node(org.w3c.dom.Node) Element(org.w3c.dom.Element) MediaType(org.apache.tika.mime.MediaType) HashSet(java.util.HashSet)
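
The child-element walk can be exercised with only the JDK's DOM parser. This is a rough sketch under stated assumptions: the tag name "mime-type" is a guess at the value of MIMETYPE_TAG, strings stand in for MediaType, and the class name MimeTypeListSketch and the sample XML are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

final class MimeTypeListSketch {

    // Assumption: the real MIMETYPE_TAG constant may hold a different value.
    static final String MIMETYPE_TAG = "mime-type";

    /** Collects the trimmed text of every direct child element named MIMETYPE_TAG. */
    static Set<String> readMimeTypes(String xml) {
        Element root;
        try {
            root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        Set<String> types = new LinkedHashSet<>();
        NodeList children = root.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node node = children.item(i);
            // Skip whitespace text nodes and unrelated elements.
            if (node.getNodeType() == Node.ELEMENT_NODE
                    && MIMETYPE_TAG.equals(((Element) node).getTagName())) {
                types.add(node.getTextContent().trim());
            }
        }
        return types;
    }

    public static void main(String[] args) {
        String xml = "<mime-types>\n"
                + "  <mime-type>video/avi</mime-type>\n"
                + "  <other>ignored</other>\n"
                + "  <mime-type> video/mpeg </mime-type>\n"
                + "</mime-types>";
        // prints [video/avi, video/mpeg]
        System.out.println(readMimeTypes(xml));
    }
}
```

The node-type check matters because the whitespace between elements shows up as text nodes in getChildNodes().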

Example 9 with MediaType

Use of org.apache.tika.mime.MediaType in the Apache Tika project.

In class ExternalParsersFactory, method attachExternalParsers:

public static void attachExternalParsers(List<ExternalParser> parsers, TikaConfig config) {
    Parser parser = config.getParser();
    if (parser instanceof CompositeParser) {
        CompositeParser cParser = (CompositeParser) parser;
        Map<MediaType, Parser> parserMap = cParser.getParsers();
    }
    // TODO
}
Also used : CompositeParser(org.apache.tika.parser.CompositeParser) MediaType(org.apache.tika.mime.MediaType) Parser(org.apache.tika.parser.Parser)

Example 10 with MediaType

Use of org.apache.tika.mime.MediaType in the Apache Tika project.

In class ZipContainerDetector, method detectZipFormat:

private static MediaType detectZipFormat(TikaInputStream tis) {
    try {
        //try opc first because opening a package
        //will not necessarily throw an exception for
        //truncated files.
        MediaType type = detectOPCBased(tis);
        if (type != null) {
            return type;
        }
        // TODO: hasFile()?
        ZipFile zip = new ZipFile(tis.getFile());
        try {
            type = detectOpenDocument(zip);
            if (type == null) {
                type = detectIWork13(zip);
            }
            if (type == null) {
                type = detectIWork(zip);
            }
            if (type == null) {
                type = detectJar(zip);
            }
            if (type == null) {
                type = detectKmz(zip);
            }
            if (type == null) {
                type = detectIpa(zip);
            }
            if (type != null) {
                return type;
            }
        } finally {
            // tis.setOpenContainer(zip);
            try {
                zip.close();
            } catch (IOException e) {
                // ignore
            }
        }
    } catch (IOException e) {
        // ignore
    }
    // Fallback: it's still a zip file, we just don't know what kind of one
    return MediaType.APPLICATION_ZIP;
}
Also used : ZipFile(org.apache.commons.compress.archivers.zip.ZipFile) MediaType(org.apache.tika.mime.MediaType) IOException(java.io.IOException)
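
The cascade of "if (type == null)" checks is a first-match-wins chain with a generic fallback. One way to express that pattern is an ordered list of probes, sketched below with only the JDK. Everything here is an illustrative assumption: the class name DetectionChainSketch, the two simplified probes, and the use of a set of entry names instead of a real ZipFile are not taken from Tika.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.function.Function;

final class DetectionChainSketch {

    static final String FALLBACK = "application/zip";

    // Hypothetical, simplified probe: a manifest entry suggests a JAR.
    static Optional<String> probeJar(Set<String> names) {
        return names.contains("META-INF/MANIFEST.MF")
                ? Optional.of("application/java-archive")
                : Optional.empty();
    }

    // Hypothetical, simplified probe: any .kml entry suggests a KMZ.
    static Optional<String> probeKmz(Set<String> names) {
        boolean hasKml = names.stream().anyMatch(n -> n.endsWith(".kml"));
        return hasKml ? Optional.of("application/vnd.google-earth.kmz") : Optional.empty();
    }

    // Probes run in list order; earlier entries take precedence.
    static final List<Function<Set<String>, Optional<String>>> PROBES =
            List.of(DetectionChainSketch::probeJar, DetectionChainSketch::probeKmz);

    /** Returns the first probe result, or the generic zip type if no probe matches. */
    static String detect(Set<String> entryNames) {
        for (Function<Set<String>, Optional<String>> probe : PROBES) {
            Optional<String> type = probe.apply(entryNames);
            if (type.isPresent()) {
                return type.get();
            }
        }
        return FALLBACK;
    }

    public static void main(String[] args) {
        System.out.println(detect(Set.of("META-INF/MANIFEST.MF", "A.class"))); // application/java-archive
        System.out.println(detect(Set.of("readme.txt")));                      // application/zip
    }
}
```

Keeping the probes in a list makes their precedence explicit, which matters when one container could satisfy more than one probe.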

Aggregations

MediaType (org.apache.tika.mime.MediaType): 95
Metadata (org.apache.tika.metadata.Metadata): 29
Test (org.junit.Test): 28
InputStream (java.io.InputStream): 26
IOException (java.io.IOException): 18
Parser (org.apache.tika.parser.Parser): 18
TikaInputStream (org.apache.tika.io.TikaInputStream): 17
ParseContext (org.apache.tika.parser.ParseContext): 17
TikaException (org.apache.tika.exception.TikaException): 14
AutoDetectParser (org.apache.tika.parser.AutoDetectParser): 14
CompositeParser (org.apache.tika.parser.CompositeParser): 13
ContentHandler (org.xml.sax.ContentHandler): 13
BodyContentHandler (org.apache.tika.sax.BodyContentHandler): 12
Detector (org.apache.tika.detect.Detector): 11
TikaTest (org.apache.tika.TikaTest): 10
HashSet (java.util.HashSet): 8
ByteArrayInputStream (java.io.ByteArrayInputStream): 7
ArrayList (java.util.ArrayList): 7
TikaConfig (org.apache.tika.config.TikaConfig): 7
MediaTypeRegistry (org.apache.tika.mime.MediaTypeRegistry): 7