Example 6 with EncodingDetector

Use of org.apache.tika.detect.EncodingDetector in the Apache Tika project.

Class DBFParser, method getCharset.

private Charset getCharset(List<DBFRow> firstRows, DBFFileHeader header) throws IOException, TikaException {
    //TODO: potentially use codepage info in the header
    Charset charset = DEFAULT_CHARSET;
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
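    // Collect bytes from character ("C") columns until there is enough for detection.
    rows:
    for (DBFRow row : firstRows) {
        for (DBFCell cell : row.cells) {
            if (cell.getColType().equals(DBFColumnHeader.ColType.C)) {
                byte[] bytes = cell.getBytes();
                bos.write(bytes);
                if (bos.size() > MAX_CHARS_FOR_CHARSET_DETECTION) {
                    // Stop scanning all rows, not just the current one.
                    break rows;
                }
            }
        }
    }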
    byte[] bytes = bos.toByteArray();
    if (bytes.length > 20) {
        EncodingDetector detector = new Icu4jEncodingDetector();
        // A single detection pass is enough; detect() may return null when nothing matches.
        Charset detected = detector.detect(new ByteArrayInputStream(bytes), new Metadata());
        if (detected != null) {
            charset = detected;
        }
    }
    return charset;
}
Also used : Icu4jEncodingDetector(org.apache.tika.parser.txt.Icu4jEncodingDetector) EncodingDetector(org.apache.tika.detect.EncodingDetector) ByteArrayInputStream(java.io.ByteArrayInputStream) Metadata(org.apache.tika.metadata.Metadata) Charset(java.nio.charset.Charset) ByteArrayOutputStream(java.io.ByteArrayOutputStream)
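
The pattern in getCharset (sample some bytes, ask a detector, and fall back to a default when the sample is too small or nothing is detected) can be sketched with the JDK alone. This is only an illustration under stated assumptions: CharsetFallbackSketch, detectUtf8, chooseCharset, and the ISO-8859-1 default are hypothetical names and choices, and the strict UTF-8 check is a simple stand-in for a real detector, not the algorithm Icu4jEncodingDetector uses.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class CharsetFallbackSketch {

    // Assumed default; the real DEFAULT_CHARSET is not shown in the snippet above.
    static final Charset DEFAULT_CHARSET = StandardCharsets.ISO_8859_1;
    // Mirrors the "bytes.length > 20" guard in getCharset.
    static final int MIN_BYTES_FOR_DETECTION = 20;

    /** Stand-in detector: UTF-8 if the sample decodes strictly as UTF-8, otherwise null. */
    static Charset detectUtf8(byte[] sample) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(sample));
            return StandardCharsets.UTF_8;
        } catch (CharacterCodingException e) {
            // Also reached when the sample was cut in the middle of a multi-byte sequence.
            return null;
        }
    }

    /** Uses the detector only when there is enough evidence; otherwise keeps the default. */
    static Charset chooseCharset(byte[] sample) {
        if (sample.length <= MIN_BYTES_FOR_DETECTION) {
            return DEFAULT_CHARSET;
        }
        Charset detected = detectUtf8(sample);
        return detected != null ? detected : DEFAULT_CHARSET;
    }

    public static void main(String[] args) {
        byte[] sample = "plain ascii text that is long enough".getBytes(StandardCharsets.US_ASCII);
        System.out.println(chooseCharset(sample));
    }
}
```

Keeping the null check next to the detector call means a detector that cannot decide never overwrites the default.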

Example 7 with EncodingDetector

Use of org.apache.tika.detect.EncodingDetector in the Apache Tika project.

Class HtmlParserTest, method testMultiThreadingEncodingDetection.

@Test
public void testMultiThreadingEncodingDetection() throws Exception {
    List<EncodingDetector> detectors = new ArrayList<>();
    ServiceLoader loader = new ServiceLoader(AutoDetectReader.class.getClassLoader());
    detectors.addAll(loader.loadServiceProviders(EncodingDetector.class));
    for (EncodingDetector detector : detectors) {
        testDetector(detector);
    }
}
Also used : ServiceLoader(org.apache.tika.config.ServiceLoader) EncodingDetector(org.apache.tika.detect.EncodingDetector) AutoDetectReader(org.apache.tika.detect.AutoDetectReader) ArrayList(java.util.ArrayList) Test(org.junit.Test) TikaTest(org.apache.tika.TikaTest)

Example 8 with EncodingDetector

Use of org.apache.tika.detect.EncodingDetector in the Apache Tika project.

Class TikaConfigSerializer, method addEncodingDetectors.

private static void addEncodingDetectors(Mode mode, Element rootElement, Document doc, TikaConfig config) throws Exception {
    EncodingDetector encDetector = config.getEncodingDetector();
    if (mode == Mode.MINIMAL && encDetector instanceof DefaultEncodingDetector) {
        // Everything is using defaults, so emit only an explanatory comment
        Node detComment = doc.createComment("for example: <encodingDetectors><encodingDetector class=\"" + "org.apache.tika.detect.DefaultEncodingDetector\"/></encodingDetectors>");
        rootElement.appendChild(detComment);
        return;
    }
    Element encDetectorsElement = doc.createElement("encodingDetectors");
    if ((mode == Mode.CURRENT && encDetector instanceof DefaultEncodingDetector) || !(encDetector instanceof CompositeEncodingDetector)) {
        Element encDetectorElement = doc.createElement("encodingDetector");
        encDetectorElement.setAttribute("class", encDetector.getClass().getCanonicalName());
        encDetectorsElement.appendChild(encDetectorElement);
    } else {
        List<EncodingDetector> children = ((CompositeEncodingDetector) encDetector).getDetectors();
        for (EncodingDetector d : children) {
            Element encDetectorElement = doc.createElement("encodingDetector");
            encDetectorElement.setAttribute("class", d.getClass().getCanonicalName());
            encDetectorsElement.appendChild(encDetectorElement);
        }
    }
    rootElement.appendChild(encDetectorsElement);
}
Also used : DefaultEncodingDetector(org.apache.tika.detect.DefaultEncodingDetector) CompositeEncodingDetector(org.apache.tika.detect.CompositeEncodingDetector) EncodingDetector(org.apache.tika.detect.EncodingDetector) Node(org.w3c.dom.Node) Element(org.w3c.dom.Element)
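
The DOM calls that addEncodingDetectors relies on (createElement, setAttribute, appendChild) can be tried without Tika. The following is a minimal sketch using only the JDK's XML APIs; EncodingDetectorsXmlSketch, newDocument, and buildEncodingDetectors are hypothetical names, and plain class-name strings stand in for the detector objects.

```java
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class EncodingDetectorsXmlSketch {

    /** Creates an empty DOM document with the JDK's default parser implementation. */
    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No usable XML parser configuration", e);
        }
    }

    /** Builds an encodingDetectors element with one encodingDetector child per class name. */
    static Element buildEncodingDetectors(Document doc, List<String> classNames) {
        Element parent = doc.createElement("encodingDetectors");
        for (String className : classNames) {
            Element child = doc.createElement("encodingDetector");
            child.setAttribute("class", className);
            parent.appendChild(child);
        }
        return parent;
    }

    public static void main(String[] args) {
        Document doc = newDocument();
        Element detectors = buildEncodingDetectors(doc,
                List.of("org.apache.tika.detect.DefaultEncodingDetector"));
        doc.appendChild(detectors);
        System.out.println(detectors.getChildNodes().getLength());
    }
}
```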

Example 9 with EncodingDetector

Use of org.apache.tika.detect.EncodingDetector in the Apache Tika project.

Class TikaEncodingDetectorTest, method testEncodingDetectorsAreLoaded.

@Test
public void testEncodingDetectorsAreLoaded() {
    EncodingDetector encodingDetector = ((AbstractEncodingDetectorParser) new TXTParser()).getEncodingDetector();
    assertTrue(encodingDetector instanceof CompositeEncodingDetector);
}
Also used : Icu4jEncodingDetector(org.apache.tika.parser.txt.Icu4jEncodingDetector) NonDetectingEncodingDetector(org.apache.tika.detect.NonDetectingEncodingDetector) UniversalEncodingDetector(org.apache.tika.parser.txt.UniversalEncodingDetector) CompositeEncodingDetector(org.apache.tika.detect.CompositeEncodingDetector) EncodingDetector(org.apache.tika.detect.EncodingDetector) HtmlEncodingDetector(org.apache.tika.parser.html.HtmlEncodingDetector) TXTParser(org.apache.tika.parser.txt.TXTParser) AbstractEncodingDetectorParser(org.apache.tika.parser.AbstractEncodingDetectorParser) Test(org.junit.Test)

Aggregations

EncodingDetector (org.apache.tika.detect.EncodingDetector): 9
CompositeEncodingDetector (org.apache.tika.detect.CompositeEncodingDetector): 7
Icu4jEncodingDetector (org.apache.tika.parser.txt.Icu4jEncodingDetector): 7
Test (org.junit.Test): 7
NonDetectingEncodingDetector (org.apache.tika.detect.NonDetectingEncodingDetector): 6
HtmlEncodingDetector (org.apache.tika.parser.html.HtmlEncodingDetector): 6
UniversalEncodingDetector (org.apache.tika.parser.txt.UniversalEncodingDetector): 6
ArrayList (java.util.ArrayList): 3
AbstractEncodingDetectorParser (org.apache.tika.parser.AbstractEncodingDetectorParser): 3
TXTParser (org.apache.tika.parser.txt.TXTParser): 3
Metadata (org.apache.tika.metadata.Metadata): 2
AutoDetectParser (org.apache.tika.parser.AutoDetectParser): 2
CompositeParser (org.apache.tika.parser.CompositeParser): 2
Parser (org.apache.tika.parser.Parser): 2
ByteArrayInputStream (java.io.ByteArrayInputStream): 1
ByteArrayOutputStream (java.io.ByteArrayOutputStream): 1
Charset (java.nio.charset.Charset): 1
TikaTest (org.apache.tika.TikaTest): 1
ServiceLoader (org.apache.tika.config.ServiceLoader): 1
AutoDetectReader (org.apache.tika.detect.AutoDetectReader): 1