Example 1 with ICUNormalizer2Filter

Use of org.apache.lucene.analysis.icu.ICUNormalizer2Filter in project lucene-solr by apache.

Class TestICUTokenizer, method setUp.

@Override
public void setUp() throws Exception {
    super.setUp();
    a = new Analyzer() {

        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
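            // DefaultICUTokenizerConfig(cjkAsWords = false, myanmarAsWords = true)
            Tokenizer tokenizer = new ICUTokenizer(newAttributeFactory(), new DefaultICUTokenizerConfig(false, true));
            // Single-argument constructor: normalizes with the default nfkc_casefold form.
            TokenFilter filter = new ICUNormalizer2Filter(tokenizer);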
            return new TokenStreamComponents(tokenizer, filter);
        }
    };
}
Also used: ICUNormalizer2Filter (org.apache.lucene.analysis.icu.ICUNormalizer2Filter), Analyzer (org.apache.lucene.analysis.Analyzer), Tokenizer (org.apache.lucene.analysis.Tokenizer), TokenFilter (org.apache.lucene.analysis.TokenFilter)

Example 2 with ICUNormalizer2Filter

Use of org.apache.lucene.analysis.icu.ICUNormalizer2Filter in project lucene-solr by apache.

Class TestWithCJKBigramFilter, method setUp.

@Override
public void setUp() throws Exception {
    super.setUp();
    /*
     * ICUTokenizer+CJKBigramFilter
     */
    analyzer = new Analyzer() {

        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
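            // cjkAsWords = false: no dictionary-based CJK word segmentation, leaving the bigramming to CJKBigramFilter.
            Tokenizer source = new ICUTokenizer(newAttributeFactory(), new DefaultICUTokenizerConfig(false, true));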
            TokenStream result = new CJKBigramFilter(source);
            return new TokenStreamComponents(source, new StopFilter(result, CharArraySet.EMPTY_SET));
        }
    };
    /*
     * ICUTokenizer+ICUNormalizer2Filter+CJKBigramFilter.
     * 
     * ICUNormalizer2Filter uses nfkc_casefold by default, so this is a language-independent
     * superset of CJKWidthFilter's foldings.
     */
    analyzer2 = new Analyzer() {

        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer source = new ICUTokenizer(newAttributeFactory(), new DefaultICUTokenizerConfig(false, true));
            // we put this before the CJKBigramFilter, because the normalization might combine
            // some halfwidth katakana forms, which will affect the bigramming.
            TokenStream result = new ICUNormalizer2Filter(source);
            result = new CJKBigramFilter(result);
            return new TokenStreamComponents(source, new StopFilter(result, CharArraySet.EMPTY_SET));
        }
    };
}
Also used: TokenStream (org.apache.lucene.analysis.TokenStream), StopFilter (org.apache.lucene.analysis.StopFilter), CJKBigramFilter (org.apache.lucene.analysis.cjk.CJKBigramFilter), ICUNormalizer2Filter (org.apache.lucene.analysis.icu.ICUNormalizer2Filter), Analyzer (org.apache.lucene.analysis.Analyzer), Tokenizer (org.apache.lucene.analysis.Tokenizer)
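
The comment in Example 2 says normalization has to run before bigramming because it can combine halfwidth katakana forms. The following is a minimal sketch of that effect using only the JDK, not Lucene. It makes several assumptions: java.text.Normalizer with plain NFKC stands in for ICUNormalizer2Filter (whose default nfkc_casefold additionally case-folds and drops default-ignorable code points), and the class HalfwidthBigramSketch and its helpers nfkc and bigrams are hypothetical names. The bigrams helper builds plain overlapping character pairs and is not CJKBigramFilter's actual logic.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

public class HalfwidthBigramSketch {

    // Assumption: plain NFKC from the JDK as a stand-in for the filter's
    // default nfkc_casefold (no case folding is applied here).
    static String nfkc(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    // Hypothetical helper: overlapping two-character windows over the input.
    // Only illustrates the idea; CJKBigramFilter works on token streams.
    static List<String> bigrams(String text) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            out.add(text.substring(i, i + 2));
        }
        return out;
    }

    public static void main(String[] args) {
        // Halfwidth KA + voiced mark, halfwidth KI + voiced mark (four code units).
        String halfwidth = "\uFF76\uFF9E\uFF77\uFF9E";
        // NFKC composes each base + mark pair into one fullwidth katakana letter.
        String normalized = nfkc(halfwidth);

        System.out.println("length before: " + halfwidth.length()
                + ", bigrams: " + bigrams(halfwidth).size());
        System.out.println("length after:  " + normalized.length()
                + ", bigrams: " + bigrams(normalized).size());
    }
}
```

Because the four halfwidth code units collapse to two letters, bigramming the raw text would pair base letters with their detached voiced marks, while bigramming the normalized text pairs whole letters. That is the ordering concern the analyzer2 comment describes.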

Aggregations

Analyzer (org.apache.lucene.analysis.Analyzer): 2
Tokenizer (org.apache.lucene.analysis.Tokenizer): 2
ICUNormalizer2Filter (org.apache.lucene.analysis.icu.ICUNormalizer2Filter): 2
StopFilter (org.apache.lucene.analysis.StopFilter): 1
TokenFilter (org.apache.lucene.analysis.TokenFilter): 1
TokenStream (org.apache.lucene.analysis.TokenStream): 1
CJKBigramFilter (org.apache.lucene.analysis.cjk.CJKBigramFilter): 1