Example 16 with WhitespaceTokenizer

Use of org.apache.lucene.analysis.core.WhitespaceTokenizer in the lucene-solr project (apache).

From the class EdgeNGramTokenFilterTest, method testReset:

public void testReset() throws Exception {
    WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
    tokenizer.setReader(new StringReader("abcde"));
    EdgeNGramTokenFilter filter = new EdgeNGramTokenFilter(tokenizer, 1, 3);
    assertTokenStreamContents(filter, new String[] { "a", "ab", "abc" }, new int[] { 0, 0, 0 }, new int[] { 5, 5, 5 });
    // Consuming the stream a second time, after giving the tokenizer a fresh reader,
    // verifies that the filter chain can be reset and reused.
    tokenizer.setReader(new StringReader("abcde"));
    assertTokenStreamContents(filter, new String[] { "a", "ab", "abc" }, new int[] { 0, 0, 0 }, new int[] { 5, 5, 5 });
}
Also used: WhitespaceTokenizer (org.apache.lucene.analysis.core.WhitespaceTokenizer), StringReader (java.io.StringReader)
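The expected tokens above follow from the edge n-gram rule: for each input token, the filter emits that token's prefixes of length minGram through maxGram. A minimal dependency-free sketch of the rule (class and method names here are illustrative, not part of the Lucene API):

```java
import java.util.ArrayList;
import java.util.List;

public class EdgeNGramSketch {
    // Emit the prefixes of token with lengths minGram..maxGram,
    // as EdgeNGramTokenFilter does for the front edge of each token.
    static List<String> edgeNGrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        int upper = Math.min(maxGram, token.length());
        for (int len = minGram; len <= upper; len++) {
            grams.add(token.substring(0, len));
        }
        return grams;
    }

    public static void main(String[] args) {
        // Mirrors the test input: "abcde" with minGram=1, maxGram=3.
        System.out.println(edgeNGrams("abcde", 1, 3)); // [a, ab, abc]
    }
}
```

Note that the test also asserts identical start offsets (0) and end offsets (5) for every gram: the filter reports the offsets of the whole original token, not of the prefix.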

Example 17 with WhitespaceTokenizer

Use of org.apache.lucene.analysis.core.WhitespaceTokenizer in the lucene-solr project (apache).

From the class NGramTokenFilterTest, method testReset:

public void testReset() throws Exception {
    WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
    tokenizer.setReader(new StringReader("abcde"));
    NGramTokenFilter filter = new NGramTokenFilter(tokenizer, 1, 1);
    assertTokenStreamContents(filter, new String[] { "a", "b", "c", "d", "e" }, new int[] { 0, 0, 0, 0, 0 }, new int[] { 5, 5, 5, 5, 5 }, new int[] { 1, 0, 0, 0, 0 });
    tokenizer.setReader(new StringReader("abcde"));
    assertTokenStreamContents(filter, new String[] { "a", "b", "c", "d", "e" }, new int[] { 0, 0, 0, 0, 0 }, new int[] { 5, 5, 5, 5, 5 }, new int[] { 1, 0, 0, 0, 0 });
}
Also used: WhitespaceTokenizer (org.apache.lucene.analysis.core.WhitespaceTokenizer), StringReader (java.io.StringReader)
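NGramTokenFilter generalizes the edge variant to all substrings of lengths minGram..maxGram, so (1, 1) yields the individual characters asserted above. A dependency-free sketch (names are illustrative, not Lucene API; emission order has differed across Lucene versions, and this sketch orders by start offset, then length):

```java
import java.util.ArrayList;
import java.util.List;

public class NGramSketch {
    // All character n-grams of token with lengths minGram..maxGram,
    // ordered by start offset, then by gram length.
    static List<String> nGrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < token.length(); start++) {
            for (int len = minGram;
                 len <= maxGram && start + len <= token.length();
                 len++) {
                grams.add(token.substring(start, start + len));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // Mirrors the test input: "abcde" with minGram=1, maxGram=1.
        System.out.println(nGrams("abcde", 1, 1)); // [a, b, c, d, e]
        System.out.println(nGrams("abc", 1, 2));   // [a, ab, b, bc, c]
    }
}
```

The position-increment array in the test, { 1, 0, 0, 0, 0 }, shows that all grams of one input token occupy a single position: only the first gram advances the position.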

Example 18 with WhitespaceTokenizer

Use of org.apache.lucene.analysis.core.WhitespaceTokenizer in the lucene-solr project (apache).

From the class CommonGramsFilterTest, method testReset:

public void testReset() throws Exception {
    final String input = "How the s a brown s cow d like A B thing?";
    WhitespaceTokenizer wt = new WhitespaceTokenizer();
    wt.setReader(new StringReader(input));
    CommonGramsFilter cgf = new CommonGramsFilter(wt, commonWords);
    CharTermAttribute term = cgf.addAttribute(CharTermAttribute.class);
    cgf.reset();
    assertTrue(cgf.incrementToken());
    assertEquals("How", term.toString());
    assertTrue(cgf.incrementToken());
    assertEquals("How_the", term.toString());
    assertTrue(cgf.incrementToken());
    assertEquals("the", term.toString());
    assertTrue(cgf.incrementToken());
    assertEquals("the_s", term.toString());
    cgf.close();
    // After closing, supply fresh input and reset; the stream restarts from the beginning.
    wt.setReader(new StringReader(input));
    cgf.reset();
    assertTrue(cgf.incrementToken());
    assertEquals("How", term.toString());
}
Also used: WhitespaceTokenizer (org.apache.lucene.analysis.core.WhitespaceTokenizer), CharTermAttribute (org.apache.lucene.analysis.tokenattributes.CharTermAttribute), StringReader (java.io.StringReader)
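The interleaved assertions (How, How_the, the, the_s, ...) come from CommonGramsFilter emitting every original token plus, wherever a token or its predecessor is a common word, a joined bigram such as "How_the". A dependency-free sketch of that joining rule (names are illustrative, not the Lucene API; position increments are omitted, and the common-words set here is an assumption standing in for the test's commonWords):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

public class CommonGramsSketch {
    // Emit every token and, for each adjacent pair in which either
    // token is a common word, also emit the joined "common gram"
    // (left + "_" + right) just before the right-hand token.
    static List<String> commonGrams(List<String> tokens, Set<String> common) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (i > 0 && (common.contains(tokens.get(i - 1))
                       || common.contains(tokens.get(i)))) {
                out.add(tokens.get(i - 1) + "_" + tokens.get(i));
            }
            out.add(tokens.get(i));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("How", "the", "s", "a", "brown");
        Set<String> common = Set.of("the", "s", "a"); // assumed common words
        System.out.println(commonGrams(tokens, common));
        // [How, How_the, the, the_s, s, s_a, a, a_brown, brown]
    }
}
```

In the real filter the bigram is emitted with a position increment of 0, so it occupies the same position as the token it overlaps; the sketch only reproduces the term sequence.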

Example 19 with WhitespaceTokenizer

Use of org.apache.lucene.analysis.core.WhitespaceTokenizer in the Anserini project (castorini).

From the class TweetAnalyzer, method createComponents:

@Override
protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new WhitespaceTokenizer();
    TokenStream filter = new TweetLowerCaseEntityPreservingFilter(source);
    if (stemming) {
        // Porter stemmer ignores words which are marked as keywords
        filter = new PorterStemFilter(filter);
    }
    return new TokenStreamComponents(source, filter);
}
Also used: WhitespaceTokenizer (org.apache.lucene.analysis.core.WhitespaceTokenizer), TokenStream (org.apache.lucene.analysis.TokenStream), PorterStemFilter (org.apache.lucene.analysis.en.PorterStemFilter), Tokenizer (org.apache.lucene.analysis.Tokenizer)
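createComponents wires a source Tokenizer into a chain of TokenFilters and returns both; the analyzer's output is the outermost filter, so conceptually the chain is function composition over the token sequence. A dependency-free sketch of that composition, with a plain lowercase step standing in for TweetLowerCaseEntityPreservingFilter (whose real behavior also preserves entities such as @mentions, #hashtags, and URLs; all names here are illustrative, not Lucene or Anserini API):

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

public class AnalyzerChainSketch {
    // Stand-in for WhitespaceTokenizer: split on runs of whitespace.
    static List<String> whitespaceTokenize(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // Simplified stand-in for a lowercasing TokenFilter; the real
    // TweetLowerCaseEntityPreservingFilter applies extra rules to
    // keep Twitter entities intact.
    static Function<List<String>, List<String>> lowerCaseFilter =
        tokens -> tokens.stream()
                        .map(String::toLowerCase)
                        .collect(Collectors.toList());

    public static void main(String[] args) {
        // source -> filter, as in createComponents(fieldName)
        List<String> out = lowerCaseFilter.apply(whitespaceTokenize("Hello World FOO"));
        System.out.println(out); // [hello, world, foo]
    }
}
```

When stemming is enabled, the real analyzer simply wraps one more stage (PorterStemFilter) around the existing filter, which in this model is just another function applied to the list.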

Example 20 with WhitespaceTokenizer

Use of org.apache.lucene.analysis.core.WhitespaceTokenizer in the Anserini project (castorini).

From the class TRECAnalyzer, method createComponents:

@Override
protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new WhitespaceTokenizer();
    TokenStream filter = new TweetLowerCaseEntityPreservingFilter(source);
    return new TokenStreamComponents(source, filter);
}
Also used: WhitespaceTokenizer (org.apache.lucene.analysis.core.WhitespaceTokenizer), TweetLowerCaseEntityPreservingFilter (io.anserini.analysis.TweetLowerCaseEntityPreservingFilter), TokenStream (org.apache.lucene.analysis.TokenStream), Tokenizer (org.apache.lucene.analysis.Tokenizer)

Aggregations

WhitespaceTokenizer (org.apache.lucene.analysis.core.WhitespaceTokenizer): 44
Tokenizer (org.apache.lucene.analysis.Tokenizer): 38
StringReader (java.io.StringReader): 37
ESTestCase (org.elasticsearch.test.ESTestCase): 25
TokenStream (org.apache.lucene.analysis.TokenStream): 16
Settings (org.elasticsearch.common.settings.Settings): 8
Analyzer (org.apache.lucene.analysis.Analyzer): 4
KeywordTokenizer (org.apache.lucene.analysis.core.KeywordTokenizer): 4
CharTermAttribute (org.apache.lucene.analysis.tokenattributes.CharTermAttribute): 4
IOException (java.io.IOException): 3
HashMap (java.util.HashMap): 3
LowerCaseFilter (org.apache.lucene.analysis.core.LowerCaseFilter): 3
PorterStemFilter (org.apache.lucene.analysis.en.PorterStemFilter): 3
ParseException (java.text.ParseException): 2
LowerCaseFilter (org.apache.lucene.analysis.LowerCaseFilter): 2
MockTokenizer (org.apache.lucene.analysis.MockTokenizer): 2
StopFilter (org.apache.lucene.analysis.StopFilter): 2
TokenizerFactory (org.apache.lucene.analysis.util.TokenizerFactory): 2
SuggestStopFilter (org.apache.lucene.search.suggest.analyzing.SuggestStopFilter): 2
Version (org.elasticsearch.Version): 2