Example 1 with TextDetector

Use of org.apache.tika.detect.TextDetector in the Apache Tika project.

In class MimeTypes, method getMimeType:

/**
 * Returns the MIME types that best match the given first few bytes
 * of a document stream. Returns application/octet-stream if no better
 * match is found.
 * <p>
 * If multiple matches are found, the best (highest priority) matching
 * type is returned. If multiple matches are found with the same priority,
 * then all of these are returned.
 * <p>
 * The given byte array is expected to be at least {@link #getMinLength()}
 * long, or shorter only if the document stream itself is shorter.
 *
 * @param data first few bytes of a document stream
 * @return matching MIME types, as a list
 */
List<MimeType> getMimeType(byte[] data) {
    if (data == null) {
        throw new IllegalArgumentException("Data is missing");
    } else if (data.length == 0) {
        // See https://issues.apache.org/jira/browse/TIKA-483
        return rootMimeTypeL;
    }
    // Then, check for magic bytes
    List<MimeType> result = new ArrayList<MimeType>(1);
    int currentPriority = -1;
    for (Magic magic : magics) {
        if (currentPriority > 0 && currentPriority > magic.getPriority()) {
            break;
        }
        if (magic.eval(data)) {
            result.add(magic.getType());
            currentPriority = magic.getPriority();
        }
    }
    if (!result.isEmpty()) {
        for (int i = 0; i < result.size(); i++) {
            final MimeType matched = result.get(i);
            // extract the root element and match it against known types
            if ("application/xml".equals(matched.getName()) || "text/html".equals(matched.getName())) {
                XmlRootExtractor extractor = new XmlRootExtractor();
                QName rootElement = extractor.extractRootElement(data);
                if (rootElement != null) {
                    for (MimeType type : xmls) {
                        if (type.matchesXML(rootElement.getNamespaceURI(), rootElement.getLocalPart())) {
                            result.set(i, type);
                            break;
                        }
                    }
                } else if ("application/xml".equals(matched.getName())) {
                    // Downgrade from application/xml to text/plain since
                    // the document seems not to be well-formed.
                    result.set(i, textMimeType);
                }
            }
        }
        return result;
    }
    // Finally, assume plain text if no control bytes are found
    try {
        TextDetector detector = new TextDetector(getMinLength());
        ByteArrayInputStream stream = new ByteArrayInputStream(data);
        MimeType type = forName(detector.detect(stream, new Metadata()).toString());
        return Collections.singletonList(type);
    } catch (Exception e) {
        return rootMimeTypeL;
    }
}
Also used: ByteArrayInputStream (java.io.ByteArrayInputStream), QName (javax.xml.namespace.QName), ArrayList (java.util.ArrayList), Metadata (org.apache.tika.metadata.Metadata), XmlRootExtractor (org.apache.tika.detect.XmlRootExtractor), TextDetector (org.apache.tika.detect.TextDetector), URISyntaxException (java.net.URISyntaxException), IOException (java.io.IOException)
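
The priority rule in the Javadoc of getMimeType (return every match that shares the best matching priority, stop at the first lower one) can be sketched without Tika. The following is a minimal, self-contained illustration, not Tika's implementation: the class MagicPrioritySketch, the Rule type, demoRules() and the example/* type names are all hypothetical. It also assumes the rules are already ordered from highest to lowest priority, which the early break in getMimeType appears to rely on.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; none of these names come from Tika.
class MagicPrioritySketch {

    /** Stand-in for a magic rule: a type name, a priority, and a byte prefix to match. */
    static final class Rule {
        final String type;
        final int priority;
        final byte[] prefix;

        Rule(String type, int priority, String asciiPrefix) {
            this.type = type;
            this.priority = priority;
            this.prefix = asciiPrefix.getBytes(StandardCharsets.US_ASCII);
        }

        boolean matches(byte[] data) {
            if (data.length < prefix.length) {
                return false;
            }
            for (int i = 0; i < prefix.length; i++) {
                if (data[i] != prefix[i]) {
                    return false;
                }
            }
            return true;
        }
    }

    /** Made-up rules, listed from highest to lowest priority (assumed ordering). */
    static List<Rule> demoRules() {
        return Arrays.asList(
                new Rule("example/a", 50, "%P"),
                new Rule("example/b", 50, "%"),
                new Rule("example/c", 40, "%"));
    }

    /** Collects all matches at the best matching priority, mirroring the loop in getMimeType. */
    static List<String> match(List<Rule> rules, byte[] data) {
        List<String> result = new ArrayList<>();
        int currentPriority = -1;
        for (Rule rule : rules) {
            // Once something has matched, stop at the first rule with a lower priority.
            if (currentPriority > 0 && currentPriority > rule.priority) {
                break;
            }
            if (rule.matches(data)) {
                result.add(rule.type);
                currentPriority = rule.priority;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] data = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        System.out.println(match(demoRules(), data)); // [example/a, example/b]
    }
}
```

Both priority-50 rules match the sample bytes, so both are returned; the priority-40 rule is never evaluated because the loop breaks first.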

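
The last step of getMimeType ("assume plain text if no control bytes are found") delegates to TextDetector. As a rough illustration of that idea only, here is a simplified, Tika-free sketch. It is not TextDetector's actual logic: the class PlainTextGuessSketch and its methods are hypothetical, and the set of bytes treated as control bytes is an assumption made for this example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; a simplification, not Tika's TextDetector.
class PlainTextGuessSketch {

    /**
     * Returns true if none of the first {@code limit} bytes is a control byte.
     * Assumption: a control byte is any value below 0x20 other than tab,
     * line feed, form feed, carriage return or escape, plus 0x7F.
     */
    static boolean looksLikeText(byte[] data, int limit) {
        int n = Math.min(data.length, limit);
        for (int i = 0; i < n; i++) {
            int b = data[i] & 0xFF; // treat the byte as unsigned
            boolean allowed = b == '\t' || b == '\n' || b == '\f' || b == '\r' || b == 0x1B;
            if ((b < 0x20 && !allowed) || b == 0x7F) {
                return false;
            }
        }
        return true;
    }

    /** Guesses text/plain or application/octet-stream from the leading bytes. */
    static String guessType(byte[] data, int limit) {
        if (data.length == 0) {
            // Same choice as the empty-input branch of getMimeType.
            return "application/octet-stream";
        }
        return looksLikeText(data, limit) ? "text/plain" : "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] text = "hello\nworld".getBytes(StandardCharsets.US_ASCII);
        byte[] binary = {0x00, 0x01, 0x02};
        System.out.println(guessType(text, 512));   // text/plain
        System.out.println(guessType(binary, 512)); // application/octet-stream
    }
}
```

The limit parameter plays the role that getMinLength() plays in the method above: only a bounded prefix of the input is inspected.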