
Example 71 with KeywordTokenizer

use of org.apache.lucene.analysis.core.KeywordTokenizer in project lucene-solr by apache.

the class TestStemmerOverrideFilter method testIgnoreCase.

public void testIgnoreCase() throws IOException {
    // let's make "booked" stem to "books"
    // the override filter will convert "booked" to "books",
    // but also mark it with KeywordAttribute so Porter will not change it.
    StemmerOverrideFilter.Builder builder = new StemmerOverrideFilter.Builder(true);
    builder.add("boOkEd", "books");
    Tokenizer tokenizer = keywordTokenizer("BooKeD");
    TokenStream stream = new PorterStemFilter(new StemmerOverrideFilter(tokenizer, builder.build()));
    assertTokenStreamContents(stream, new String[] { "books" });
}
Also used : TokenStream(org.apache.lucene.analysis.TokenStream) PorterStemFilter(org.apache.lucene.analysis.en.PorterStemFilter) WhitespaceTokenizer(org.apache.lucene.analysis.core.WhitespaceTokenizer) Tokenizer(org.apache.lucene.analysis.Tokenizer) KeywordTokenizer(org.apache.lucene.analysis.core.KeywordTokenizer)
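
For context, here is a minimal standalone sketch (not part of the Lucene test suite; the class name and the analyze helper are made up for illustration) contrasting the chain with and without the override map: without it, PorterStemFilter reduces "booked" to "book"; with it, the token is rewritten to "books" and flagged via KeywordAttribute, so the Porter stemmer leaves it untouched.

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.KeywordTokenizer;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.miscellaneous.StemmerOverrideFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class StemmerOverrideSketch {

    // Runs the input through KeywordTokenizer -> (optional StemmerOverrideFilter) -> PorterStemFilter
    // and collects the resulting terms.
    static List<String> analyze(String text, boolean useOverride) throws IOException {
        Tokenizer tokenizer = new KeywordTokenizer();
        tokenizer.setReader(new StringReader(text));
        TokenStream stream = tokenizer;
        if (useOverride) {
            // ignoreCase = true, exactly as in the test above
            StemmerOverrideFilter.Builder builder = new StemmerOverrideFilter.Builder(true);
            builder.add("booked", "books");
            stream = new StemmerOverrideFilter(stream, builder.build());
        }
        stream = new PorterStemFilter(stream);
        List<String> terms = new ArrayList<>();
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        stream.reset();
        while (stream.incrementToken()) {
            terms.add(term.toString());
        }
        stream.end();
        stream.close();
        return terms;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(analyze("booked", false)); // [book]  -> Porter stems it
        System.out.println(analyze("BooKeD", true));  // [books] -> the override wins and Porter skips the keyword
    }
}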

Example 72 with KeywordTokenizer

use of org.apache.lucene.analysis.core.KeywordTokenizer in project lucene-solr by apache.

the class TestStemmerOverrideFilter method testNoOverrides.

public void testNoOverrides() throws IOException {
    StemmerOverrideFilter.Builder builder = new StemmerOverrideFilter.Builder(true);
    Tokenizer tokenizer = keywordTokenizer("book");
    TokenStream stream = new PorterStemFilter(new StemmerOverrideFilter(tokenizer, builder.build()));
    assertTokenStreamContents(stream, new String[] { "book" });
}
Also used : TokenStream(org.apache.lucene.analysis.TokenStream) PorterStemFilter(org.apache.lucene.analysis.en.PorterStemFilter) WhitespaceTokenizer(org.apache.lucene.analysis.core.WhitespaceTokenizer) Tokenizer(org.apache.lucene.analysis.Tokenizer) KeywordTokenizer(org.apache.lucene.analysis.core.KeywordTokenizer)
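
As a small hypothetical aside (a fragment, assuming the same imports as the sketch after Example 71): with an empty override map nothing is marked as a keyword, so the Porter stemmer applies normally to every token.

// An empty override map leaves every token unprotected, so PorterStemFilter stems as usual.
StemmerOverrideFilter.Builder builder = new StemmerOverrideFilter.Builder(true);
Tokenizer tokenizer = new KeywordTokenizer();
tokenizer.setReader(new StringReader("running"));
TokenStream stream = new PorterStemFilter(new StemmerOverrideFilter(tokenizer, builder.build()));
CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
stream.reset();
while (stream.incrementToken()) {
    System.out.println(term.toString()); // prints "run": with no overrides, nothing is flagged as a keyword
}
stream.end();
stream.close();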

Example 73 with KeywordTokenizer

use of org.apache.lucene.analysis.core.KeywordTokenizer in project lucene-solr by apache.

the class TestStemmerOverrideFilter method keywordTokenizer.

private KeywordTokenizer keywordTokenizer(String data) throws IOException {
    KeywordTokenizer tokenizer = new KeywordTokenizer();
    tokenizer.setReader(new StringReader(data));
    return tokenizer;
}
Also used : StringReader(java.io.StringReader) KeywordTokenizer(org.apache.lucene.analysis.core.KeywordTokenizer)
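
The helper exists because KeywordTokenizer, unlike WhitespaceTokenizer, emits the entire reader contents as a single token. A quick hypothetical sketch (same imports as above) showing that behaviour:

KeywordTokenizer tokenizer = new KeywordTokenizer();
tokenizer.setReader(new StringReader("hello keyword world"));
CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
tokenizer.reset();
while (tokenizer.incrementToken()) {
    System.out.println(term.toString()); // prints "hello keyword world" once: the whole input is one token
}
tokenizer.end();
tokenizer.close();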

Example 74 with KeywordTokenizer

use of org.apache.lucene.analysis.core.KeywordTokenizer in project lucene-solr by apache.

the class TestTrimFilter method testEmptyTerm.

public void testEmptyTerm() throws IOException {
    Analyzer a = new Analyzer() {

        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer tokenizer = new KeywordTokenizer();
            return new TokenStreamComponents(tokenizer, new TrimFilter(tokenizer));
        }
    };
    checkOneTerm(a, "", "");
    a.close();
}
Also used : Analyzer(org.apache.lucene.analysis.Analyzer) KeywordTokenizer(org.apache.lucene.analysis.core.KeywordTokenizer) Tokenizer(org.apache.lucene.analysis.Tokenizer) MockTokenizer(org.apache.lucene.analysis.MockTokenizer)
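
For readers unfamiliar with TrimFilter: it strips leading and trailing whitespace from each token, which only matters when the tokenizer does not already split on whitespace, hence the KeywordTokenizer here. A minimal hypothetical fragment (same imports as above, plus org.apache.lucene.analysis.miscellaneous.TrimFilter):

Tokenizer tokenizer = new KeywordTokenizer();
tokenizer.setReader(new StringReader("  padded term  "));
TokenStream stream = new TrimFilter(tokenizer);
CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
stream.reset();
while (stream.incrementToken()) {
    System.out.println("[" + term.toString() + "]"); // prints "[padded term]": surrounding whitespace is trimmed
}
stream.end();
stream.close();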

Example 75 with KeywordTokenizer

use of org.apache.lucene.analysis.core.KeywordTokenizer in project lucene-solr by apache.

the class TestWordDelimiterGraphFilter method testEmptyTerm.

public void testEmptyTerm() throws IOException {
    Random random = random();
    for (int i = 0; i < 512; i++) {
        final int flags = i;
        final CharArraySet protectedWords;
        if (random.nextBoolean()) {
            protectedWords = new CharArraySet(new HashSet<>(Arrays.asList("a", "b", "cd")), false);
        } else {
            protectedWords = null;
        }
        Analyzer a = new Analyzer() {

            @Override
            protected TokenStreamComponents createComponents(String fieldName) {
                Tokenizer tokenizer = new KeywordTokenizer();
                return new TokenStreamComponents(tokenizer, new WordDelimiterGraphFilter(tokenizer, flags, protectedWords));
            }
        };
        // depending upon options, this thing may or may not preserve the empty term
        checkAnalysisConsistency(random, a, random.nextBoolean(), "");
        a.close();
    }
}
Also used : CharArraySet(org.apache.lucene.analysis.CharArraySet) WordDelimiterGraphFilter(org.apache.lucene.analysis.miscellaneous.WordDelimiterGraphFilter) StandardAnalyzer(org.apache.lucene.analysis.standard.StandardAnalyzer) KeywordTokenizer(org.apache.lucene.analysis.core.KeywordTokenizer)
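
To make the flags concrete, here is a hypothetical fragment (not from the test; same imports as above, plus java.util.Arrays, CharArraySet and WordDelimiterGraphFilter) using one fixed flag combination instead of the exhaustive 0..511 loop: GENERATE_WORD_PARTS splits the single keyword token on the hyphen, unless the whole token appears in the protected-word set.

CharArraySet protectedWords = new CharArraySet(Arrays.asList("Wi-Fi"), false);
int flags = WordDelimiterGraphFilter.GENERATE_WORD_PARTS;

Tokenizer tokenizer = new KeywordTokenizer();
tokenizer.setReader(new StringReader("Wi-Fi"));
TokenStream stream = new WordDelimiterGraphFilter(tokenizer, flags, protectedWords);
CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
stream.reset();
while (stream.incrementToken()) {
    System.out.println(term.toString()); // prints "Wi-Fi" unchanged, because the token is protected
}
stream.end();
stream.close();
// With protectedWords = null (or a set not containing "Wi-Fi"), the same input
// would instead be emitted as the two word parts "Wi" and "Fi".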

Aggregations

KeywordTokenizer (org.apache.lucene.analysis.core.KeywordTokenizer): 95
Tokenizer (org.apache.lucene.analysis.Tokenizer): 86
Analyzer (org.apache.lucene.analysis.Analyzer): 75
MockTokenizer (org.apache.lucene.analysis.MockTokenizer): 64
TokenStream (org.apache.lucene.analysis.TokenStream): 14
StringReader (java.io.StringReader): 11
WhitespaceTokenizer (org.apache.lucene.analysis.core.WhitespaceTokenizer): 11
LowerCaseFilter (org.apache.lucene.analysis.core.LowerCaseFilter): 4
PorterStemFilter (org.apache.lucene.analysis.en.PorterStemFilter): 4
Random (java.util.Random): 3
CharArraySet (org.apache.lucene.analysis.CharArraySet): 3
LetterTokenizer (org.apache.lucene.analysis.core.LetterTokenizer): 3
StandardAnalyzer (org.apache.lucene.analysis.standard.StandardAnalyzer): 3
StandardTokenizer (org.apache.lucene.analysis.standard.StandardTokenizer): 3
CharTermAttribute (org.apache.lucene.analysis.tokenattributes.CharTermAttribute): 3
Transliterator (com.ibm.icu.text.Transliterator): 2
UnicodeSet (com.ibm.icu.text.UnicodeSet): 2
MockAnalyzer (org.apache.lucene.analysis.MockAnalyzer): 2
LowerCaseTokenizer (org.apache.lucene.analysis.core.LowerCaseTokenizer): 2
RemoveDuplicatesTokenFilter (org.apache.lucene.analysis.miscellaneous.RemoveDuplicatesTokenFilter): 2