Example 1 with AnalyzerManager

Use of org.apache.tika.eval.tokens.AnalyzerManager in the Apache Tika project.

From the class AnalyzerManagerTest, method testTokenCountFilter.

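@Test
public void testTokenCountFilter() throws Exception {
    // The assertion below implies the argument is the maximum token count.
    AnalyzerManager analyzerManager = AnalyzerManager.newInstance(1000000);
    // Build input with 1,001,000 tokens, 1,000 more than the limit.
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 1001000; i++) {
        sb.append("the ");
    }
    TokenStream ts = analyzerManager.getGeneralAnalyzer().tokenStream("f", sb.toString());
    ts.reset();
    CharTermAttribute termAtt = ts.getAttribute(CharTermAttribute.class);
    int tokens = 0;
    while (ts.incrementToken()) {
        tokens++;
    }
    // Finish and release the stream so the analyzer can reuse it.
    ts.end();
    ts.close();
    // Only the first 1,000,000 tokens should have been emitted.
    assertEquals(1000000, tokens);
}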
Also used: TokenStream (org.apache.lucene.analysis.TokenStream), CharTermAttribute (org.apache.lucene.analysis.tokenattributes.CharTermAttribute), AnalyzerManager (org.apache.tika.eval.tokens.AnalyzerManager), Test (org.junit.Test)
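
The test above checks that a stream of 1,001,000 tokens is cut off at the configured limit. The idea of capping a token stream can be sketched with the JDK alone. This is an illustrative sketch only, not the Tika or Lucene implementation: the class TokenLimitSketch and the method limitTokens are hypothetical names, and whitespace splitting stands in for a real tokenizer.

```java
import java.util.ArrayList;
import java.util.List;

class TokenLimitSketch {

    // Hypothetical helper: split on whitespace and keep at most maxTokens tokens.
    static List<String> limitTokens(String text, int maxTokens) {
        List<String> kept = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // splitting an empty string yields one empty token
            }
            if (kept.size() >= maxTokens) {
                break; // limit reached: ignore the rest of the input
            }
            kept.add(token);
        }
        return kept;
    }

    public static void main(String[] args) {
        // Same shape as the test: more tokens in the input than the limit allows.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1010; i++) {
            sb.append("the ");
        }
        System.out.println(limitTokens(sb.toString(), 1000).size()); // prints 1000
    }
}
```

Stopping at the limit rather than counting everything keeps the work bounded for very large inputs, which appears to be the behavior the test is guarding.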

Example 2 with AnalyzerManager

Use of org.apache.tika.eval.tokens.AnalyzerManager in the Apache Tika project.

From the class AnalyzerManagerTest, method testCommon.

@Test
public void testCommon() throws Exception {
    AnalyzerManager analyzerManager = AnalyzerManager.newInstance(100000);
    Analyzer common = analyzerManager.getCommonTokensAnalyzer();
    TokenStream ts = common.tokenStream("f", "the 5,000.12 and dirty dog");
    ts.reset();
    CharTermAttribute termAtt = ts.getAttribute(CharTermAttribute.class);
    // Collect every term the analyzer emits.
    Set<String> seen = new HashSet<>();
    while (ts.incrementToken()) {
        String t = termAtt.toString();
        // A token reported as alphabetic must not contain a digit.
        if (AlphaIdeographFilterFactory.isAlphabetic(t.toCharArray()) && t.contains("5")) {
            fail("Shouldn't have found a numeric");
        }
        seen.add(t);
    }
    ts.end();
    ts.close();
    // "dirty" is kept; "the" is filtered out by the common-tokens analyzer.
    assertTrue(seen.contains("dirty"));
    assertFalse(seen.contains("the"));
}
Also used: TokenStream (org.apache.lucene.analysis.TokenStream), CharTermAttribute (org.apache.lucene.analysis.tokenattributes.CharTermAttribute), Analyzer (org.apache.lucene.analysis.Analyzer), AnalyzerManager (org.apache.tika.eval.tokens.AnalyzerManager), HashSet (java.util.HashSet), Test (org.junit.Test)

Example 3 with AnalyzerManager

Use of org.apache.tika.eval.tokens.AnalyzerManager in the Apache Tika project.

From the class AnalyzerManagerTest, method testGeneral.

@Test
public void testGeneral() throws Exception {
    AnalyzerManager analyzerManager = AnalyzerManager.newInstance(100000);
    Analyzer general = analyzerManager.getGeneralAnalyzer();
    // Mixed-case input: the assertions below expect lowercased terms.
    TokenStream ts = general.tokenStream("f", "tHe quick aaaa aaa anD dirty dog");
    ts.reset();
    CharTermAttribute termAtt = ts.getAttribute(CharTermAttribute.class);
    // Collect every term the analyzer emits.
    Set<String> seen = new HashSet<>();
    while (ts.incrementToken()) {
        seen.add(termAtt.toString());
    }
    ts.end();
    ts.close();
    // Unlike the common-tokens analyzer, the general analyzer keeps "the" and "and".
    assertTrue(seen.contains("the"));
    assertTrue(seen.contains("and"));
    assertTrue(seen.contains("dog"));
}
Also used: TokenStream (org.apache.lucene.analysis.TokenStream), CharTermAttribute (org.apache.lucene.analysis.tokenattributes.CharTermAttribute), Analyzer (org.apache.lucene.analysis.Analyzer), AnalyzerManager (org.apache.tika.eval.tokens.AnalyzerManager), HashSet (java.util.HashSet), Test (org.junit.Test)

Aggregations

TokenStream (org.apache.lucene.analysis.TokenStream): 3
CharTermAttribute (org.apache.lucene.analysis.tokenattributes.CharTermAttribute): 3
AnalyzerManager (org.apache.tika.eval.tokens.AnalyzerManager): 3
Test (org.junit.Test): 3
HashSet (java.util.HashSet): 2
Analyzer (org.apache.lucene.analysis.Analyzer): 2