
Example 1 with CharArraySet

Use of org.apache.lucene.analysis.CharArraySet in the elastic/elasticsearch project.

From the class AnalysisTests, method testParseStemExclusion:

public void testParseStemExclusion() {
    /* Comma separated list */
    Settings settings = Settings.builder().put("stem_exclusion", "foo,bar").build();
    CharArraySet set = Analysis.parseStemExclusion(settings, CharArraySet.EMPTY_SET);
    assertThat(set.contains("foo"), is(true));
    assertThat(set.contains("bar"), is(true));
    assertThat(set.contains("baz"), is(false));
    /* Array */
    settings = Settings.builder().putArray("stem_exclusion", "foo", "bar").build();
    set = Analysis.parseStemExclusion(settings, CharArraySet.EMPTY_SET);
    assertThat(set.contains("foo"), is(true));
    assertThat(set.contains("bar"), is(true));
    assertThat(set.contains("baz"), is(false));
}
Also used: CharArraySet (org.apache.lucene.analysis.CharArraySet), Settings (org.elasticsearch.common.settings.Settings)
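Outside of Settings parsing, a CharArraySet can also be constructed directly; the boolean constructor argument controls case sensitivity, which is what makes the class useful for stem-exclusion lists. A minimal sketch, assuming Lucene's analysis module is on the classpath:

```java
import java.util.Arrays;

import org.apache.lucene.analysis.CharArraySet;

public class StemExclusionDemo {
    public static void main(String[] args) {
        // ignoreCase = true: lookups match regardless of case,
        // without allocating a String per lookup.
        CharArraySet set = new CharArraySet(Arrays.asList("foo", "bar"), true);
        System.out.println(set.contains("foo"));  // true
        System.out.println(set.contains("FOO"));  // true, because ignoreCase
        System.out.println(set.contains("baz"));  // false
    }
}
```

Analysis.parseStemExclusion above produces an equivalent set from either the comma-separated or the array form of the `stem_exclusion` setting.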

Example 2 with CharArraySet

Use of org.apache.lucene.analysis.CharArraySet in the apache/lucene-solr project.

From the class TestFreeTextSuggester, method testEndingHole:

// With one ending hole, ShingleFilter produces "of _" and
// we should properly predict from that:
public void testEndingHole() throws Exception {
    // Just deletes "of"
    Analyzer a = new Analyzer() {

        @Override
        public TokenStreamComponents createComponents(String field) {
            Tokenizer tokenizer = new MockTokenizer();
            CharArraySet stopSet = StopFilter.makeStopSet("of");
            return new TokenStreamComponents(tokenizer, new StopFilter(tokenizer, stopSet));
        }
    };
    Iterable<Input> keys = AnalyzingSuggesterTest.shuffle(new Input("wizard of oz", 50));
    FreeTextSuggester sug = new FreeTextSuggester(a, a, 3, (byte) 0x20);
    sug.build(new InputArrayIterator(keys));
    assertEquals("wizard _ oz/1.00", toString(sug.lookup("wizard of", 10)));
    // Falls back to unigram model, with backoff 0.4 times
    // prob 0.5:
    assertEquals("oz/0.20", toString(sug.lookup("wizard o", 10)));
    a.close();
}
Also used: MockTokenizer (org.apache.lucene.analysis.MockTokenizer), CharArraySet (org.apache.lucene.analysis.CharArraySet), Input (org.apache.lucene.search.suggest.Input), InputArrayIterator (org.apache.lucene.search.suggest.InputArrayIterator), StopFilter (org.apache.lucene.analysis.StopFilter), Analyzer (org.apache.lucene.analysis.Analyzer), MockAnalyzer (org.apache.lucene.analysis.MockAnalyzer), Tokenizer (org.apache.lucene.analysis.Tokenizer)
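The stop-filtering chain in the test above can be exercised outside the test framework by pulling tokens from the analyzer directly. A sketch using WhitespaceTokenizer in place of the test-only MockTokenizer (assumed equivalent for whitespace-separated input), following the standard reset/incrementToken/end contract:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class StopFilterDemo {
    public static void main(String[] args) throws IOException {
        // Just deletes "of", as in the test above.
        Analyzer a = new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String field) {
                Tokenizer tokenizer = new WhitespaceTokenizer();
                CharArraySet stopSet = StopFilter.makeStopSet("of");
                return new TokenStreamComponents(tokenizer, new StopFilter(tokenizer, stopSet));
            }
        };
        List<String> terms = new ArrayList<>();
        try (TokenStream ts = a.tokenStream("body", "wizard of oz")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();                        // mandatory before incrementToken()
            while (ts.incrementToken()) {
                terms.add(term.toString());
            }
            ts.end();
        }
        System.out.println(terms);             // "of" is dropped, leaving a position hole
        a.close();
    }
}
```

The dropped stopword leaves a position increment gap, which is the "hole" that ShingleFilter turns into the `of _` shingle the comment above describes.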

Example 3 with CharArraySet

Use of org.apache.lucene.analysis.CharArraySet in the apache/lucene-solr project.

From the class TestFreeTextSuggester, method testTwoEndingHoles:

// If the number of ending holes exceeds the ngrams window
// then there are no predictions, because ShingleFilter
// does not produce e.g. a hole only "_ _" token:
public void testTwoEndingHoles() throws Exception {
    // Just deletes "of"
    Analyzer a = new Analyzer() {

        @Override
        public TokenStreamComponents createComponents(String field) {
            Tokenizer tokenizer = new MockTokenizer();
            CharArraySet stopSet = StopFilter.makeStopSet("of");
            return new TokenStreamComponents(tokenizer, new StopFilter(tokenizer, stopSet));
        }
    };
    Iterable<Input> keys = AnalyzingSuggesterTest.shuffle(new Input("wizard of of oz", 50));
    FreeTextSuggester sug = new FreeTextSuggester(a, a, 3, (byte) 0x20);
    sug.build(new InputArrayIterator(keys));
    assertEquals("", toString(sug.lookup("wizard of of", 10)));
    a.close();
}
Also used: MockTokenizer (org.apache.lucene.analysis.MockTokenizer), CharArraySet (org.apache.lucene.analysis.CharArraySet), Input (org.apache.lucene.search.suggest.Input), InputArrayIterator (org.apache.lucene.search.suggest.InputArrayIterator), StopFilter (org.apache.lucene.analysis.StopFilter), Analyzer (org.apache.lucene.analysis.Analyzer), MockAnalyzer (org.apache.lucene.analysis.MockAnalyzer), Tokenizer (org.apache.lucene.analysis.Tokenizer)

Example 4 with CharArraySet

Use of org.apache.lucene.analysis.CharArraySet in the apache/lucene-solr project.

From the class TestSuggestStopFilter, method testMultipleStopWordsEnd2:

public void testMultipleStopWordsEnd2() throws Exception {
    CharArraySet stopWords = StopFilter.makeStopSet("to", "the", "a");
    Tokenizer stream = new MockTokenizer();
    stream.setReader(new StringReader("go to a the "));
    TokenStream filter = new SuggestStopFilter(stream, stopWords);
    assertTokenStreamContents(filter, new String[] { "go" }, new int[] { 0 }, new int[] { 2 }, null, new int[] { 1 }, null, 12, new boolean[] { false }, true);
}
Also used: MockTokenizer (org.apache.lucene.analysis.MockTokenizer), CharArraySet (org.apache.lucene.analysis.CharArraySet), TokenStream (org.apache.lucene.analysis.TokenStream), StringReader (java.io.StringReader), Tokenizer (org.apache.lucene.analysis.Tokenizer)

Example 5 with CharArraySet

Use of org.apache.lucene.analysis.CharArraySet in the apache/lucene-solr project.

From the class TestSuggestStopFilter, method testMultipleStopWords:

public void testMultipleStopWords() throws Exception {
    CharArraySet stopWords = StopFilter.makeStopSet("to", "the", "a");
    Tokenizer stream = new MockTokenizer();
    stream.setReader(new StringReader("go to a the school"));
    TokenStream filter = new SuggestStopFilter(stream, stopWords);
    assertTokenStreamContents(filter, new String[] { "go", "school" }, new int[] { 0, 12 }, new int[] { 2, 18 }, null, new int[] { 1, 4 }, null, 18, new boolean[] { false, false }, true);
}
Also used: MockTokenizer (org.apache.lucene.analysis.MockTokenizer), CharArraySet (org.apache.lucene.analysis.CharArraySet), TokenStream (org.apache.lucene.analysis.TokenStream), StringReader (java.io.StringReader), Tokenizer (org.apache.lucene.analysis.Tokenizer)
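The point these two tests exercise is where SuggestStopFilter differs from plain StopFilter: a stopword in final position with no trailing separator is kept, since the user may still be typing a longer word that merely starts with it. A sketch of that contrast, with WhitespaceTokenizer assumed in place of the test-only MockTokenizer:

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.search.suggest.analyzing.SuggestStopFilter;

public class SuggestStopFilterDemo {
    static List<String> tokens(String text) throws IOException {
        CharArraySet stopWords = StopFilter.makeStopSet("to", "the", "a");
        Tokenizer stream = new WhitespaceTokenizer();
        stream.setReader(new StringReader(text));
        List<String> out = new ArrayList<>();
        try (TokenStream filter = new SuggestStopFilter(stream, stopWords)) {
            CharTermAttribute term = filter.addAttribute(CharTermAttribute.class);
            filter.reset();
            while (filter.incrementToken()) {
                out.add(term.toString());
            }
            filter.end();
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        // Trailing space: "to" is a complete word, so it is dropped
        // like any stopword.
        System.out.println(tokens("go to "));
        // No trailing space: "to" may be the prefix of a longer word
        // still being typed, so it is kept.
        System.out.println(tokens("go to"));
    }
}
```

This is why the "go to a the " input above, which ends with a separator, yields only "go".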

Aggregations

CharArraySet (org.apache.lucene.analysis.CharArraySet): 153
Analyzer (org.apache.lucene.analysis.Analyzer): 57
MockTokenizer (org.apache.lucene.analysis.MockTokenizer): 46
Tokenizer (org.apache.lucene.analysis.Tokenizer): 44
TokenStream (org.apache.lucene.analysis.TokenStream): 38
KeywordTokenizer (org.apache.lucene.analysis.core.KeywordTokenizer): 34
SetKeywordMarkerFilter (org.apache.lucene.analysis.miscellaneous.SetKeywordMarkerFilter): 26
StringReader (java.io.StringReader): 24
StandardAnalyzer (org.apache.lucene.analysis.standard.StandardAnalyzer): 12
Test (org.junit.Test): 10
StopFilter (org.apache.lucene.analysis.StopFilter): 8
TokenFilter (org.apache.lucene.analysis.TokenFilter): 6
WordDelimiterFilter (org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter): 5
WordDelimiterGraphFilter (org.apache.lucene.analysis.miscellaneous.WordDelimiterGraphFilter): 5
Reader (java.io.Reader): 4
ArrayList (java.util.ArrayList): 4
MockAnalyzer (org.apache.lucene.analysis.MockAnalyzer): 4
HyphenationTree (org.apache.lucene.analysis.compound.hyphenation.HyphenationTree): 4
ClasspathResourceLoader (org.apache.lucene.analysis.util.ClasspathResourceLoader): 4
ResourceLoader (org.apache.lucene.analysis.util.ResourceLoader): 4