Example 1 with CompositeEncodingDetector

Use of org.apache.tika.detect.CompositeEncodingDetector in project tika by apache.

From the class TikaEncodingDetectorTest, method testParameterization.

@Test
public void testParameterization() throws Exception {
    TikaConfig config = getConfig("TIKA-2273-parameterize-encoding-detector.xml");
    EncodingDetector detector = config.getEncodingDetector();
    assertTrue(detector instanceof CompositeEncodingDetector);
    List<EncodingDetector> detectors = ((CompositeEncodingDetector) detector).getDetectors();
    assertEquals(2, detectors.size());
    assertTrue(((Icu4jEncodingDetector) detectors.get(0)).getStripMarkup());
    assertTrue(detectors.get(1) instanceof NonDetectingEncodingDetector);
}
Also used : Icu4jEncodingDetector(org.apache.tika.parser.txt.Icu4jEncodingDetector) NonDetectingEncodingDetector(org.apache.tika.detect.NonDetectingEncodingDetector) UniversalEncodingDetector(org.apache.tika.parser.txt.UniversalEncodingDetector) CompositeEncodingDetector(org.apache.tika.detect.CompositeEncodingDetector) EncodingDetector(org.apache.tika.detect.EncodingDetector) HtmlEncodingDetector(org.apache.tika.parser.html.HtmlEncodingDetector) Test(org.junit.Test)

Example 2 with CompositeEncodingDetector

Use of org.apache.tika.detect.CompositeEncodingDetector in project tika by apache.

From the class TikaEncodingDetectorTest, method testConfigurabilityOfUserSpecified.

@Test
public void testConfigurabilityOfUserSpecified() throws Exception {
    TikaConfig tikaConfig = new TikaConfig(getResourceAsStream("/org/apache/tika/config/TIKA-2273-encoding-detector-outside-static-init.xml"));
    AutoDetectParser p = new AutoDetectParser(tikaConfig);
    //make sure that all static and non-static parsers are using the same encoding detector!
    List<Parser> parsers = new ArrayList<>();
    findEncodingDetectionParsers(p, parsers);
    assertEquals(3, parsers.size());
    for (Parser encodingDetectingParser : parsers) {
        EncodingDetector encodingDetector = ((AbstractEncodingDetectorParser) encodingDetectingParser).getEncodingDetector();
        assertTrue(encodingDetector instanceof CompositeEncodingDetector);
        assertEquals(2, ((CompositeEncodingDetector) encodingDetector).getDetectors().size());
        for (EncodingDetector child : ((CompositeEncodingDetector) encodingDetector).getDetectors()) {
            // "cu4j" matches both "Icu4j" and "icu4j", so no ICU4J-based detector may appear here
            assertNotContained("cu4j", child.getClass().getCanonicalName());
        }
    }
    // With no ICU4J detector in the chain, parsing the cp500 file must still fail
    try {
        getXML("english.cp500.txt", p);
        fail("can't detect w/out ICU");
    } catch (TikaException e) {
        assertContains("Failed to detect", e.getMessage());
    }
}
Also used : Icu4jEncodingDetector(org.apache.tika.parser.txt.Icu4jEncodingDetector) NonDetectingEncodingDetector(org.apache.tika.detect.NonDetectingEncodingDetector) UniversalEncodingDetector(org.apache.tika.parser.txt.UniversalEncodingDetector) CompositeEncodingDetector(org.apache.tika.detect.CompositeEncodingDetector) EncodingDetector(org.apache.tika.detect.EncodingDetector) HtmlEncodingDetector(org.apache.tika.parser.html.HtmlEncodingDetector) TikaException(org.apache.tika.exception.TikaException) ArrayList(java.util.ArrayList) Metadata(org.apache.tika.metadata.Metadata) AutoDetectParser(org.apache.tika.parser.AutoDetectParser) AbstractEncodingDetectorParser(org.apache.tika.parser.AbstractEncodingDetectorParser) Parser(org.apache.tika.parser.Parser) CompositeParser(org.apache.tika.parser.CompositeParser) TXTParser(org.apache.tika.parser.txt.TXTParser) Test(org.junit.Test)

Example 3 with CompositeEncodingDetector

Use of org.apache.tika.detect.CompositeEncodingDetector in project tika by apache.

From the class TikaEncodingDetectorTest, method testDefault.

@Test
public void testDefault() {
    EncodingDetector detector = TikaConfig.getDefaultConfig().getEncodingDetector();
    assertTrue(detector instanceof CompositeEncodingDetector);
    List<EncodingDetector> detectors = ((CompositeEncodingDetector) detector).getDetectors();
    assertEquals(3, detectors.size());
    assertTrue(detectors.get(0) instanceof HtmlEncodingDetector);
    assertTrue(detectors.get(1) instanceof UniversalEncodingDetector);
    assertTrue(detectors.get(2) instanceof Icu4jEncodingDetector);
}
Also used : Icu4jEncodingDetector(org.apache.tika.parser.txt.Icu4jEncodingDetector) NonDetectingEncodingDetector(org.apache.tika.detect.NonDetectingEncodingDetector) UniversalEncodingDetector(org.apache.tika.parser.txt.UniversalEncodingDetector) CompositeEncodingDetector(org.apache.tika.detect.CompositeEncodingDetector) EncodingDetector(org.apache.tika.detect.EncodingDetector) HtmlEncodingDetector(org.apache.tika.parser.html.HtmlEncodingDetector) Test(org.junit.Test)

Example 4 with CompositeEncodingDetector

Use of org.apache.tika.detect.CompositeEncodingDetector in project tika by apache.

From the class TikaEncodingDetectorTest, method testNonDetectingDetectorParams.

@Test
public void testNonDetectingDetectorParams() throws Exception {
    TikaConfig tikaConfig = new TikaConfig(getResourceAsStream("/org/apache/tika/config/TIKA-2273-non-detecting-params.xml"));
    AutoDetectParser p = new AutoDetectParser(tikaConfig);
    List<Parser> parsers = new ArrayList<>();
    findEncodingDetectionParsers(p, parsers);
    assertEquals(3, parsers.size());
    EncodingDetector encodingDetector = ((AbstractEncodingDetectorParser) parsers.get(0)).getEncodingDetector();
    assertTrue(encodingDetector instanceof CompositeEncodingDetector);
    assertEquals(1, ((CompositeEncodingDetector) encodingDetector).getDetectors().size());
    EncodingDetector child = ((CompositeEncodingDetector) encodingDetector).getDetectors().get(0);
    assertTrue(child instanceof NonDetectingEncodingDetector);
    assertEquals(StandardCharsets.UTF_16LE, ((NonDetectingEncodingDetector) child).getCharset());
}
Also used : Icu4jEncodingDetector(org.apache.tika.parser.txt.Icu4jEncodingDetector) NonDetectingEncodingDetector(org.apache.tika.detect.NonDetectingEncodingDetector) UniversalEncodingDetector(org.apache.tika.parser.txt.UniversalEncodingDetector) CompositeEncodingDetector(org.apache.tika.detect.CompositeEncodingDetector) EncodingDetector(org.apache.tika.detect.EncodingDetector) HtmlEncodingDetector(org.apache.tika.parser.html.HtmlEncodingDetector) ArrayList(java.util.ArrayList) AutoDetectParser(org.apache.tika.parser.AutoDetectParser) AbstractEncodingDetectorParser(org.apache.tika.parser.AbstractEncodingDetectorParser) Parser(org.apache.tika.parser.Parser) CompositeParser(org.apache.tika.parser.CompositeParser) TXTParser(org.apache.tika.parser.txt.TXTParser) Test(org.junit.Test)

Example 5 with CompositeEncodingDetector

Use of org.apache.tika.detect.CompositeEncodingDetector in project tika by apache.

From the class TikaEncodingDetectorTest, method testBlackList.

@Test
public void testBlackList() throws Exception {
    TikaConfig config = getConfig("TIKA-2273-blacklist-encoding-detector-default.xml");
    EncodingDetector detector = config.getEncodingDetector();
    assertTrue(detector instanceof CompositeEncodingDetector);
    List<EncodingDetector> detectors = ((CompositeEncodingDetector) detector).getDetectors();
    assertEquals(2, detectors.size());
    EncodingDetector detector1 = detectors.get(0);
    assertTrue(detector1 instanceof CompositeEncodingDetector);
    List<EncodingDetector> detectors1Children = ((CompositeEncodingDetector) detector1).getDetectors();
    assertEquals(2, detectors1Children.size());
    assertTrue(detectors1Children.get(0) instanceof UniversalEncodingDetector);
    assertTrue(detectors1Children.get(1) instanceof Icu4jEncodingDetector);
    assertTrue(detectors.get(1) instanceof NonDetectingEncodingDetector);
}
Also used : Icu4jEncodingDetector(org.apache.tika.parser.txt.Icu4jEncodingDetector) NonDetectingEncodingDetector(org.apache.tika.detect.NonDetectingEncodingDetector) UniversalEncodingDetector(org.apache.tika.parser.txt.UniversalEncodingDetector) CompositeEncodingDetector(org.apache.tika.detect.CompositeEncodingDetector) EncodingDetector(org.apache.tika.detect.EncodingDetector) HtmlEncodingDetector(org.apache.tika.parser.html.HtmlEncodingDetector) Test(org.junit.Test)
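
Sketch: ordered detector chain with a fixed fallback

The tests above all treat a CompositeEncodingDetector as an ordered list of child detectors, sometimes ending in a NonDetectingEncodingDetector that carries a fixed charset. The sketch below illustrates why that order could matter, assuming the composite asks each child in turn and returns the first non-null answer. It does not use the Tika classes: SketchDetector, Utf8BomDetector, FixedCharsetDetector, FirstMatchCompositeDetector and CompositeSketch are hypothetical names introduced only for illustration, and the real Tika implementation may behave differently.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for an encoding detector; this is NOT the Tika interface.
interface SketchDetector {
    // Returns the detected charset, or null if this detector cannot decide.
    Charset detect(InputStream input) throws IOException;
}

// Reports UTF-8 only when the stream starts with the UTF-8 byte order mark.
class Utf8BomDetector implements SketchDetector {
    public Charset detect(InputStream input) throws IOException {
        input.mark(3);
        try {
            boolean bom = input.read() == 0xEF && input.read() == 0xBB && input.read() == 0xBF;
            return bom ? StandardCharsets.UTF_8 : null;
        } finally {
            input.reset(); // leave the stream untouched for the next detector
        }
    }
}

// Never inspects the stream; always answers with the configured charset.
class FixedCharsetDetector implements SketchDetector {
    private final Charset charset;

    FixedCharsetDetector(Charset charset) {
        this.charset = charset;
    }

    public Charset detect(InputStream input) {
        return charset;
    }
}

// Assumed composite behavior: children are tried in order, first non-null result wins.
class FirstMatchCompositeDetector implements SketchDetector {
    private final List<SketchDetector> detectors;

    FirstMatchCompositeDetector(List<SketchDetector> detectors) {
        this.detectors = detectors;
    }

    List<SketchDetector> getDetectors() {
        return detectors;
    }

    public Charset detect(InputStream input) throws IOException {
        for (SketchDetector detector : detectors) {
            Charset charset = detector.detect(input);
            if (charset != null) {
                return charset;
            }
        }
        return null;
    }
}

class CompositeSketch {
    // Runs a two-element chain: a real detector first, a fixed fallback second.
    static Charset detect(byte[] bytes) {
        SketchDetector chain = new FirstMatchCompositeDetector(Arrays.asList(
                new Utf8BomDetector(),
                new FixedCharsetDetector(StandardCharsets.UTF_16LE)));
        try {
            return chain.detect(new ByteArrayInputStream(bytes));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        byte[] withoutBom = {'h', 'i'};
        System.out.println(detect(withBom));    // answered by the first detector
        System.out.println(detect(withoutBom)); // falls through to the fixed fallback
    }
}
```

Under this assumption, placing the fixed-charset detector last makes it a fallback, while placing it first would short-circuit every other detector in the list.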

Aggregations

CompositeEncodingDetector (org.apache.tika.detect.CompositeEncodingDetector) 7
EncodingDetector (org.apache.tika.detect.EncodingDetector) 7
NonDetectingEncodingDetector (org.apache.tika.detect.NonDetectingEncodingDetector) 6
HtmlEncodingDetector (org.apache.tika.parser.html.HtmlEncodingDetector) 6
Icu4jEncodingDetector (org.apache.tika.parser.txt.Icu4jEncodingDetector) 6
UniversalEncodingDetector (org.apache.tika.parser.txt.UniversalEncodingDetector) 6
Test (org.junit.Test) 6
AbstractEncodingDetectorParser (org.apache.tika.parser.AbstractEncodingDetectorParser) 3
TXTParser (org.apache.tika.parser.txt.TXTParser) 3
ArrayList (java.util.ArrayList) 2
AutoDetectParser (org.apache.tika.parser.AutoDetectParser) 2
CompositeParser (org.apache.tika.parser.CompositeParser) 2
Parser (org.apache.tika.parser.Parser) 2
DefaultEncodingDetector (org.apache.tika.detect.DefaultEncodingDetector) 1
TikaException (org.apache.tika.exception.TikaException) 1
Metadata (org.apache.tika.metadata.Metadata) 1
Element (org.w3c.dom.Element) 1
Node (org.w3c.dom.Node) 1