
Example 6 with NormalizeCharMap

Use of org.apache.lucene.analysis.charfilter.NormalizeCharMap in the apache/lucene-solr project.

The method testInvalidOffsets of the class TestCompoundWordTokenFilter:

// SOLR-2891
// *CompoundWordTokenFilter blindly adds term length to offset, but this can take things out of bounds
// wrt original text if a previous filter increases the length of the word (in this case ü -> ue)
// so in this case we behave like WDF, and preserve any modified offsets
public void testInvalidOffsets() throws Exception {
    final CharArraySet dict = makeDictionary("fall");
    final NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
    builder.add("ü", "ue");
    final NormalizeCharMap normMap = builder.build();
    Analyzer analyzer = new Analyzer() {

        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer tokenizer = new MockTokenizer(MockTokenizer.WHITESPACE, false);
            TokenFilter filter = new DictionaryCompoundWordTokenFilter(tokenizer, dict);
            return new TokenStreamComponents(tokenizer, filter);
        }

        @Override
        protected Reader initReader(String fieldName, Reader reader) {
            return new MappingCharFilter(normMap, reader);
        }
    };
    assertAnalyzesTo(analyzer, "banküberfall", new String[] { "bankueberfall", "fall" }, new int[] { 0, 0 }, new int[] { 12, 12 });
    analyzer.close();
}
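The offset behavior the comment above describes can be seen directly at the CharFilter level: mapping ü to ue makes the output one character longer than the input, and CharFilter.correctOffset maps offsets in the filtered text back to the original text (which is why the test expects end offset 12 for the 13-character token "bankueberfall"). A minimal standalone sketch, assuming Lucene's analysis-common module is on the classpath; the class name OffsetCorrectionDemo is made up for illustration:

```java
import java.io.StringReader;
import org.apache.lucene.analysis.CharFilter;
import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;

public class OffsetCorrectionDemo {
    public static void main(String[] args) throws Exception {
        NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
        builder.add("ü", "ue");  // one input char expands to two output chars
        CharFilter filter =
            new MappingCharFilter(builder.build(), new StringReader("banküberfall"));

        // Drain the filter so the mapping's offset corrections are recorded.
        StringBuilder out = new StringBuilder();
        int c;
        while ((c = filter.read()) != -1) {
            out.append((char) c);
        }
        System.out.println(out);  // bankueberfall (13 chars from a 12-char input)

        // correctOffset maps an output offset back to the original text:
        // the end of the 13-char output corresponds to offset 12 in the input.
        System.out.println(filter.correctOffset(out.length()));
    }
}
```

This is the same correction the compound filter relies on when it preserves the modified offsets instead of recomputing them from term lengths.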
Also used:
- CharArraySet (org.apache.lucene.analysis.CharArraySet)
- Reader (java.io.Reader)
- StringReader (java.io.StringReader)
- Analyzer (org.apache.lucene.analysis.Analyzer)
- MockTokenizer (org.apache.lucene.analysis.MockTokenizer)
- MappingCharFilter (org.apache.lucene.analysis.charfilter.MappingCharFilter)
- NormalizeCharMap (org.apache.lucene.analysis.charfilter.NormalizeCharMap)
- Tokenizer (org.apache.lucene.analysis.Tokenizer)
- KeywordTokenizer (org.apache.lucene.analysis.core.KeywordTokenizer)
- TokenFilter (org.apache.lucene.analysis.TokenFilter)

Example 7 with NormalizeCharMap

Use of org.apache.lucene.analysis.charfilter.NormalizeCharMap in the apache/lucene-solr project.

The method initReader of the class UkrainianMorfologikAnalyzer:

@Override
protected Reader initReader(String fieldName, Reader reader) {
    NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
    // different apostrophes
    builder.add("’", "'");
    builder.add("‘", "'");
    builder.add("ʼ", "'");
    builder.add("`", "'");
    builder.add("´", "'");
    // ignored characters: combining acute accent (U+0301) and soft hyphen (U+00AD),
    // written as escapes here because both are invisible in source
    builder.add("\u0301", "");
    builder.add("\u00AD", "");
    // ghe with upturn is folded to plain ghe
    builder.add("ґ", "г");
    builder.add("Ґ", "Г");
    NormalizeCharMap normMap = builder.build();
    reader = new MappingCharFilter(normMap, reader);
    return reader;
}
Also used:
- MappingCharFilter (org.apache.lucene.analysis.charfilter.MappingCharFilter)
- NormalizeCharMap (org.apache.lucene.analysis.charfilter.NormalizeCharMap)
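The apostrophe mappings above can be exercised in isolation by reading through a MappingCharFilter directly. A minimal sketch, assuming Lucene's analysis-common module is on the classpath; the class name ApostropheMappingDemo is made up for illustration:

```java
import java.io.Reader;
import java.io.StringReader;
import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;

public class ApostropheMappingDemo {
    public static void main(String[] args) throws Exception {
        NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
        builder.add("’", "'");  // right single quotation mark -> ASCII apostrophe

        // Wrap the raw input before it ever reaches a tokenizer.
        Reader filtered =
            new MappingCharFilter(builder.build(), new StringReader("п’ять"));
        StringBuilder out = new StringBuilder();
        int c;
        while ((c = filtered.read()) != -1) {
            out.append((char) c);
        }
        System.out.println(out);  // prints п'ять (curly apostrophe replaced)
    }
}
```

Because the substitution happens in initReader, every tokenizer and filter downstream sees only the normalized apostrophe, so the morphological dictionary needs just one spelling.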

Example 8 with NormalizeCharMap

Use of org.apache.lucene.analysis.charfilter.NormalizeCharMap in the apache/lucene-solr project.

The method testNormalizeWinDelimToLinuxDelim of the class TestPathHierarchyTokenizer:

public void testNormalizeWinDelimToLinuxDelim() throws Exception {
    NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
    builder.add("\\", "/");
    NormalizeCharMap normMap = builder.build();
    String path = "c:\\a\\b\\c";
    Reader cs = new MappingCharFilter(normMap, new StringReader(path));
    PathHierarchyTokenizer t = new PathHierarchyTokenizer(newAttributeFactory(), DEFAULT_DELIMITER, DEFAULT_DELIMITER, DEFAULT_SKIP);
    t.setReader(cs);
    assertTokenStreamContents(t, new String[] { "c:", "c:/a", "c:/a/b", "c:/a/b/c" }, new int[] { 0, 0, 0, 0 }, new int[] { 2, 4, 6, 8 }, new int[] { 1, 0, 0, 0 }, path.length());
}
Also used:
- StringReader (java.io.StringReader)
- Reader (java.io.Reader)
- MappingCharFilter (org.apache.lucene.analysis.charfilter.MappingCharFilter)
- NormalizeCharMap (org.apache.lucene.analysis.charfilter.NormalizeCharMap)

Aggregations

- MappingCharFilter (org.apache.lucene.analysis.charfilter.MappingCharFilter): 8 uses
- NormalizeCharMap (org.apache.lucene.analysis.charfilter.NormalizeCharMap): 8 uses
- StringReader (java.io.StringReader): 6 uses
- Tokenizer (org.apache.lucene.analysis.Tokenizer): 6 uses
- Reader (java.io.Reader): 4 uses
- ArrayList (java.util.ArrayList): 3 uses
- Analyzer (org.apache.lucene.analysis.Analyzer): 3 uses
- CharFilter (org.apache.lucene.analysis.CharFilter): 3 uses
- MockTokenizer (org.apache.lucene.analysis.MockTokenizer): 3 uses
- CharArraySet (org.apache.lucene.analysis.CharArraySet): 2 uses
- TokenFilter (org.apache.lucene.analysis.TokenFilter): 2 uses
- KeywordTokenizer (org.apache.lucene.analysis.core.KeywordTokenizer): 2 uses
- MockCharFilter (org.apache.lucene.analysis.MockCharFilter): 1 use
- MockTokenFilter (org.apache.lucene.analysis.MockTokenFilter): 1 use
- CommonGramsFilter (org.apache.lucene.analysis.commongrams.CommonGramsFilter): 1 use
- EdgeNGramTokenizer (org.apache.lucene.analysis.ngram.EdgeNGramTokenizer): 1 use
- NGramTokenFilter (org.apache.lucene.analysis.ngram.NGramTokenFilter): 1 use
- StandardTokenizer (org.apache.lucene.analysis.standard.StandardTokenizer): 1 use
- WikipediaTokenizer (org.apache.lucene.analysis.wikipedia.WikipediaTokenizer): 1 use