Example 1 with NormalizeCharMap

Use of org.apache.lucene.analysis.charfilter.NormalizeCharMap in the apache/lucene-solr project.

From class TestSimplePatternTokenizer, method testOffsetCorrection:

public void testOffsetCorrection() throws Exception {
    final String INPUT = "G&uuml;nther G&uuml;nther is here";
    // create MappingCharFilter (the mappingRules list is unused in this test)
    List<String> mappingRules = new ArrayList<>();
    mappingRules.add("\"&uuml;\" => \"ü\"");
    NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
    builder.add("&uuml;", "ü");
    NormalizeCharMap normMap = builder.build();
    CharFilter charStream = new MappingCharFilter(normMap, new StringReader(INPUT));
    // create SimplePatternTokenizer, which keeps the runs of text matching the pattern
    Tokenizer stream = new SimplePatternTokenizer("Günther");
    stream.setReader(charStream);
    // offsets point back into the raw INPUT ("G&uuml;nther" spans 0..12), not the filtered text
    assertTokenStreamContents(stream, new String[] { "Günther", "Günther" }, new int[] { 0, 13 }, new int[] { 12, 25 }, INPUT.length());
}
Also used : CharFilter(org.apache.lucene.analysis.CharFilter) ArrayList(java.util.ArrayList) StringReader(java.io.StringReader) MappingCharFilter(org.apache.lucene.analysis.charfilter.MappingCharFilter) NormalizeCharMap(org.apache.lucene.analysis.charfilter.NormalizeCharMap) Tokenizer(org.apache.lucene.analysis.Tokenizer)
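
The asserted offsets (end 12 and 25 for 7-character tokens) come from CharFilter.correctOffset, which maps an offset in the filtered text back to the corresponding position in the raw input. A minimal standalone sketch of that mechanism, using only the classes above (the wrapping class CorrectOffsetDemo is ours, not part of the test; corrections are recorded as the stream is read, so we consume it first):

import java.io.StringReader;
import org.apache.lucene.analysis.CharFilter;
import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;

public class CorrectOffsetDemo {
    public static void main(String[] args) throws Exception {
        NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
        builder.add("&uuml;", "ü");
        CharFilter filtered = new MappingCharFilter(builder.build(), new StringReader("G&uuml;nther"));
        // consume the filtered stream so the offset corrections get recorded
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[64];
        int len;
        while ((len = filtered.read(buf, 0, buf.length)) != -1) {
            sb.append(buf, 0, len);
        }
        System.out.println(sb);                        // Günther (7 chars)
        System.out.println(filtered.correctOffset(0)); // 0: token start in the raw input
        System.out.println(filtered.correctOffset(7)); // 12: end of "G&uuml;nther" in the raw input
        filtered.close();
    }
}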

Example 2 with NormalizeCharMap

Use of org.apache.lucene.analysis.charfilter.NormalizeCharMap in the apache/lucene-solr project.

From class TestSimplePatternSplitTokenizer, method testOffsetCorrection:

public void testOffsetCorrection() throws Exception {
    final String INPUT = "G&uuml;nther G&uuml;nther is here";
    // create MappingCharFilter (the mappingRules list is unused in this test)
    List<String> mappingRules = new ArrayList<>();
    mappingRules.add("\"&uuml;\" => \"ü\"");
    NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
    builder.add("&uuml;", "ü");
    NormalizeCharMap normMap = builder.build();
    CharFilter charStream = new MappingCharFilter(normMap, new StringReader(INPUT));
    // create SimplePatternSplitTokenizer, which emits the text between pattern matches
    Tokenizer stream = new SimplePatternSplitTokenizer("Günther");
    stream.setReader(charStream);
    // the two matches of "Günther" leave " " and " is here"; offsets are corrected back into INPUT
    assertTokenStreamContents(stream, new String[] { " ", " is here" }, new int[] { 12, 25 }, new int[] { 13, 33 }, INPUT.length());
}
Also used : CharFilter(org.apache.lucene.analysis.CharFilter) ArrayList(java.util.ArrayList) StringReader(java.io.StringReader) MappingCharFilter(org.apache.lucene.analysis.charfilter.MappingCharFilter) NormalizeCharMap(org.apache.lucene.analysis.charfilter.NormalizeCharMap) Tokenizer(org.apache.lucene.analysis.Tokenizer)
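
Examples 1 and 2 are duals: SimplePatternTokenizer keeps the matches, SimplePatternSplitTokenizer keeps what lies between them (empty tokens are dropped, which is why there is no leading empty token above). A short sketch of the contrast, our own simplification without the char filter, assuming it runs inside a test extending BaseTokenStreamTestCase so assertTokenStreamContents is in scope:

Tokenizer keep = new SimplePatternTokenizer("Günther");
keep.setReader(new StringReader("Günther Günther is here"));
assertTokenStreamContents(keep, new String[] { "Günther", "Günther" });

Tokenizer split = new SimplePatternSplitTokenizer("Günther");
split.setReader(new StringReader("Günther Günther is here"));
assertTokenStreamContents(split, new String[] { " ", " is here" });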

Example 3 with NormalizeCharMap

Use of org.apache.lucene.analysis.charfilter.NormalizeCharMap in the apache/lucene-solr project.

From class TestPatternTokenizer, method testOffsetCorrection:

public void testOffsetCorrection() throws Exception {
    final String INPUT = "G&uuml;nther G&uuml;nther is here";
    // create MappingCharFilter (the mappingRules list is unused in this test)
    List<String> mappingRules = new ArrayList<>();
    mappingRules.add("\"&uuml;\" => \"ü\"");
    NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
    builder.add("&uuml;", "ü");
    NormalizeCharMap normMap = builder.build();
    CharFilter charStream = new MappingCharFilter(normMap, new StringReader(INPUT));
    // create PatternTokenizer; group -1 means "split on the pattern"
    Tokenizer stream = new PatternTokenizer(newAttributeFactory(), Pattern.compile("[,;/\\s]+"), -1);
    stream.setReader(charStream);
    assertTokenStreamContents(stream, new String[] { "Günther", "Günther", "is", "here" }, new int[] { 0, 13, 26, 29 }, new int[] { 12, 25, 28, 33 }, INPUT.length());
    // second pass: group 0 means "emit each whole match of the pattern"
    charStream = new MappingCharFilter(normMap, new StringReader(INPUT));
    stream = new PatternTokenizer(newAttributeFactory(), Pattern.compile("Günther"), 0);
    stream.setReader(charStream);
    assertTokenStreamContents(stream, new String[] { "Günther", "Günther" }, new int[] { 0, 13 }, new int[] { 12, 25 }, INPUT.length());
}
Also used : CharFilter(org.apache.lucene.analysis.CharFilter) ArrayList(java.util.ArrayList) StringReader(java.io.StringReader) MappingCharFilter(org.apache.lucene.analysis.charfilter.MappingCharFilter) NormalizeCharMap(org.apache.lucene.analysis.charfilter.NormalizeCharMap) Tokenizer(org.apache.lucene.analysis.Tokenizer)
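
PatternTokenizer's third argument selects what to emit: -1 splits on the pattern, 0 emits each whole match, and N >= 1 emits capture group N of each match. A sketch of the capture-group case, our own example rather than part of the test, again assuming a test class extending BaseTokenStreamTestCase:

Tokenizer quoted = new PatternTokenizer(newAttributeFactory(), Pattern.compile("'([^']+)'"), 1);
quoted.setReader(new StringReader("say 'hello' and 'goodbye'"));
assertTokenStreamContents(quoted, new String[] { "hello", "goodbye" });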

Example 4 with NormalizeCharMap

Use of org.apache.lucene.analysis.charfilter.NormalizeCharMap in the apache/lucene-solr project.

From class TestCJKAnalyzer, method testChangedOffsets:

/** test that offsets are correct when a MappingCharFilter is applied first */
public void testChangedOffsets() throws IOException {
    final NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
    builder.add("a", "一二");
    builder.add("b", "二三");
    final NormalizeCharMap norm = builder.build();
    Analyzer analyzer = new Analyzer() {

        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer tokenizer = new StandardTokenizer();
            return new TokenStreamComponents(tokenizer, new CJKBigramFilter(tokenizer));
        }

        @Override
        protected Reader initReader(String fieldName, Reader reader) {
            return new MappingCharFilter(norm, reader);
        }
    };
    assertAnalyzesTo(analyzer, "ab", new String[] { "一二", "二二", "二三" }, new int[] { 0, 0, 1 }, new int[] { 1, 1, 2 });
    // note: offsets are strange since this is how the charfilter maps them...
    // before bigramming, the 4 unigram tokens (一, 二, 二, 三) look like:
    //   start offsets: { 0, 0, 1, 1 },
    //   end offsets:   { 0, 1, 1, 2 }
    analyzer.close();
}
Also used : StandardTokenizer(org.apache.lucene.analysis.standard.StandardTokenizer) Reader(java.io.Reader) MappingCharFilter(org.apache.lucene.analysis.charfilter.MappingCharFilter) NormalizeCharMap(org.apache.lucene.analysis.charfilter.NormalizeCharMap) Analyzer(org.apache.lucene.analysis.Analyzer) Tokenizer(org.apache.lucene.analysis.Tokenizer) MockTokenizer(org.apache.lucene.analysis.MockTokenizer) KeywordTokenizer(org.apache.lucene.analysis.core.KeywordTokenizer)
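
The collapsed offsets follow directly from how MappingCharFilter corrects them when one input character expands to two output characters. A standalone sketch of just that mapping, our own code with the expected values derived from what the test asserts:

NormalizeCharMap.Builder b = new NormalizeCharMap.Builder();
b.add("a", "一二");
b.add("b", "二三");
CharFilter cf = new MappingCharFilter(b.build(), new StringReader("ab"));
char[] buf = new char[16];
while (cf.read(buf, 0, buf.length) != -1) {
    // consume the stream ("一二二三") so offset corrections get recorded
}
System.out.println(cf.correctOffset(0)); // 0
System.out.println(cf.correctOffset(1)); // 0: still inside the expansion of "a"
System.out.println(cf.correctOffset(2)); // 1
System.out.println(cf.correctOffset(4)); // 2
cf.close();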

Example 5 with NormalizeCharMap

Use of org.apache.lucene.analysis.charfilter.NormalizeCharMap in the apache/lucene-solr project.

From class UkrainianMorfologikAnalyzer, method initReader:

@Override
protected Reader initReader(String fieldName, Reader reader) {
    NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
    // normalize different apostrophes to the ASCII apostrophe
    builder.add("’", "'"); // U+2019 right single quotation mark
    builder.add("‘", "'"); // U+2018 left single quotation mark
    builder.add("ʼ", "'"); // U+02BC modifier letter apostrophe
    builder.add("`", "'"); // U+0060 grave accent
    builder.add("´", "'"); // U+00B4 acute accent
    // ignored characters (mapped to the empty string)
    builder.add("́", ""); // U+0301 combining acute accent (stress mark)
    builder.add("­", ""); // U+00AD soft hyphen
    // fold ghe with upturn into ghe
    builder.add("ґ", "г");
    builder.add("Ґ", "Г");
    NormalizeCharMap normMap = builder.build();
    reader = new MappingCharFilter(normMap, reader);
    return reader;
}
Also used : MappingCharFilter(org.apache.lucene.analysis.charfilter.MappingCharFilter) NormalizeCharMap(org.apache.lucene.analysis.charfilter.NormalizeCharMap)
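
Unlike the test examples, this one wires the char filter into an Analyzer by overriding initReader, so every field is normalized before tokenization. A minimal sketch of the same pattern with a single mapping; the anonymous analyzer and its one rule are ours, not from UkrainianMorfologikAnalyzer:

Analyzer analyzer = new Analyzer() {

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        return new TokenStreamComponents(new StandardTokenizer());
    }

    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
        builder.add("’", "'"); // U+2019 -> ASCII apostrophe, as above
        return new MappingCharFilter(builder.build(), reader);
    }
};
// analyzer.tokenStream("field", "don’t") now tokenizes the normalized text "don't"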

Aggregations

MappingCharFilter (org.apache.lucene.analysis.charfilter.MappingCharFilter): 8 usages
NormalizeCharMap (org.apache.lucene.analysis.charfilter.NormalizeCharMap): 8 usages
StringReader (java.io.StringReader): 6 usages
Tokenizer (org.apache.lucene.analysis.Tokenizer): 6 usages
Reader (java.io.Reader): 4 usages
ArrayList (java.util.ArrayList): 3 usages
Analyzer (org.apache.lucene.analysis.Analyzer): 3 usages
CharFilter (org.apache.lucene.analysis.CharFilter): 3 usages
MockTokenizer (org.apache.lucene.analysis.MockTokenizer): 3 usages
CharArraySet (org.apache.lucene.analysis.CharArraySet): 2 usages
TokenFilter (org.apache.lucene.analysis.TokenFilter): 2 usages
KeywordTokenizer (org.apache.lucene.analysis.core.KeywordTokenizer): 2 usages
MockCharFilter (org.apache.lucene.analysis.MockCharFilter): 1 usage
MockTokenFilter (org.apache.lucene.analysis.MockTokenFilter): 1 usage
CommonGramsFilter (org.apache.lucene.analysis.commongrams.CommonGramsFilter): 1 usage
EdgeNGramTokenizer (org.apache.lucene.analysis.ngram.EdgeNGramTokenizer): 1 usage
NGramTokenFilter (org.apache.lucene.analysis.ngram.NGramTokenFilter): 1 usage
StandardTokenizer (org.apache.lucene.analysis.standard.StandardTokenizer): 1 usage
WikipediaTokenizer (org.apache.lucene.analysis.wikipedia.WikipediaTokenizer): 1 usage