
Example 6 with LowerCaseTokenizer

use of org.apache.lucene.analysis.core.LowerCaseTokenizer in project lucene-solr by apache.

the class TestCharTokenizers method testReadSupplementaryChars.

/*
   * Test reading surrogate pairs without losing the pairing
   * if the surrogate pair is at the border of the internal IO buffer.
   */
public void testReadSupplementaryChars() throws IOException {
    StringBuilder builder = new StringBuilder();
    // create random input
    int num = 1024 + random().nextInt(1024);
    num *= RANDOM_MULTIPLIER;
    for (int i = 1; i < num; i++) {
        builder.append("𐐜abc");
        if ((i % 10) == 0)
            builder.append(" ");
    }
    // internal buffer size is 1024; make sure we have a surrogate pair right at the border
    builder.insert(1023, "𐐜");
    Tokenizer tokenizer = new LowerCaseTokenizer(newAttributeFactory());
    tokenizer.setReader(new StringReader(builder.toString()));
    assertTokenStreamContents(tokenizer, builder.toString().toLowerCase(Locale.ROOT).split(" "));
}
Also used: LowerCaseTokenizer (org.apache.lucene.analysis.core.LowerCaseTokenizer), StringReader (java.io.StringReader), WhitespaceTokenizer (org.apache.lucene.analysis.core.WhitespaceTokenizer), Tokenizer (org.apache.lucene.analysis.Tokenizer), KeywordTokenizer (org.apache.lucene.analysis.core.KeywordTokenizer), LetterTokenizer (org.apache.lucene.analysis.core.LetterTokenizer)
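
The test above checks that a surrogate pair is not split when its two halves fall on either side of an internal buffer boundary. The following is a minimal plain-Java sketch of that idea, not Lucene's implementation: the class name SurrogateBoundarySketch, the method readCodePoints and the bufferSize parameter are hypothetical names chosen for illustration. It reads fixed-size chunks and, when a chunk ends in a high surrogate, carries that char over to the next chunk so the pair is decoded as one code point.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not how Lucene's tokenizers are written.
public class SurrogateBoundarySketch {

    /** Reads all code points from the reader in chunks of bufferSize chars, keeping surrogate pairs intact. */
    static List<Integer> readCodePoints(Reader reader, int bufferSize) {
        if (bufferSize < 2) {
            // A pair needs two chars, so a smaller buffer could never hold one.
            throw new IllegalArgumentException("bufferSize must be at least 2");
        }
        List<Integer> codePoints = new ArrayList<>();
        char[] buffer = new char[bufferSize];
        int carried = 0; // 1 if a trailing high surrogate was moved to buffer[0]
        try {
            while (true) {
                int read = reader.read(buffer, carried, bufferSize - carried);
                if (read == -1) {
                    if (carried == 1) {
                        // Input ended with an unpaired high surrogate; emit it as-is.
                        codePoints.add((int) buffer[0]);
                    }
                    return codePoints;
                }
                int length = carried + read;
                carried = 0;
                int i = 0;
                while (i < length) {
                    if (i == length - 1 && Character.isHighSurrogate(buffer[i])) {
                        // The low surrogate may arrive with the next chunk: carry the high one over.
                        buffer[0] = buffer[i];
                        carried = 1;
                        break;
                    }
                    int cp = Character.codePointAt(buffer, i, length);
                    codePoints.add(cp);
                    i += Character.charCount(cp);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Place a supplementary character (U+1041C) so its high surrogate sits at index 1023.
        StringBuilder builder = new StringBuilder();
        for (int i = 0; i < 1023; i++) {
            builder.append('a');
        }
        builder.appendCodePoint(0x1041C).append("bc");
        String input = builder.toString();
        List<Integer> codePoints = readCodePoints(new StringReader(input), 1024);
        System.out.println(codePoints.size() == input.codePointCount(0, input.length()));
    }
}
```

The buffer size of 1024 in main mirrors the size mentioned in the test comment; any size of at least 2 works with this sketch.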

Example 7 with LowerCaseTokenizer

use of org.apache.lucene.analysis.core.LowerCaseTokenizer in project lucene-solr by apache.

the class TestBrazilianAnalyzer method testWithKeywordAttribute.

public void testWithKeywordAttribute() throws IOException {
    CharArraySet set = new CharArraySet(1, true);
    set.add("Brasília");
    Tokenizer tokenizer = new LowerCaseTokenizer();
    tokenizer.setReader(new StringReader("Brasília Brasilia"));
    BrazilianStemFilter filter = new BrazilianStemFilter(new SetKeywordMarkerFilter(tokenizer, set));
    assertTokenStreamContents(filter, new String[] { "brasília", "brasil" });
}
Also used: CharArraySet (org.apache.lucene.analysis.CharArraySet), LowerCaseTokenizer (org.apache.lucene.analysis.core.LowerCaseTokenizer), SetKeywordMarkerFilter (org.apache.lucene.analysis.miscellaneous.SetKeywordMarkerFilter), StringReader (java.io.StringReader), Tokenizer (org.apache.lucene.analysis.Tokenizer), KeywordTokenizer (org.apache.lucene.analysis.core.KeywordTokenizer)
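
The test above expects the keyword-marked token to pass through the stemmer unchanged while the unmarked token is stemmed. The following plain-Java sketch imitates that flow without Lucene, under loud assumptions: KeywordMarkerSketch, lowerCaseLetterTokens, toyStem and analyze are hypothetical names, and toyStem is a toy suffix stripper, not the Brazilian stemming algorithm. The case-insensitive TreeSet stands in for the CharArraySet created with ignoreCase set to true.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; none of these names are Lucene APIs.
public class KeywordMarkerSketch {

    /** Splits the text at non-letters and lower-cases every code point of each token. */
    static List<String> lowerCaseLetterTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isLetter(cp)) {
                current.appendCodePoint(Character.toLowerCase(cp));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    /** Toy stand-in for a stemmer: strips a trailing "ia" from longer tokens. */
    static String toyStem(String token) {
        if (token.length() > 4 && token.endsWith("ia")) {
            return token.substring(0, token.length() - 2);
        }
        return token;
    }

    /** Tokenizes the text and stems every token that is not in the keyword set. */
    static List<String> analyze(String text, Set<String> keywords) {
        List<String> result = new ArrayList<>();
        for (String token : lowerCaseLetterTokens(text)) {
            // Keywords are protected from stemming and pass through unchanged.
            result.add(keywords.contains(token) ? token : toyStem(token));
        }
        return result;
    }

    public static void main(String[] args) {
        // Case-insensitive set, so the mixed-case entry also matches the lower-cased token.
        Set<String> keywords = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        keywords.add("Bras\u00EDlia");
        System.out.println(analyze("Bras\u00EDlia Brasilia", keywords));
    }
}
```

In the Lucene test the same separation of concerns applies: the marker filter only flags tokens, and the stem filter decides to skip the flagged ones.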

Aggregations

StringReader (java.io.StringReader): 7
LowerCaseTokenizer (org.apache.lucene.analysis.core.LowerCaseTokenizer): 7
Tokenizer (org.apache.lucene.analysis.Tokenizer): 6
KeywordTokenizer (org.apache.lucene.analysis.core.KeywordTokenizer): 6
LetterTokenizer (org.apache.lucene.analysis.core.LetterTokenizer): 5
WhitespaceTokenizer (org.apache.lucene.analysis.core.WhitespaceTokenizer): 5
CharArraySet (org.apache.lucene.analysis.CharArraySet): 2
SetKeywordMarkerFilter (org.apache.lucene.analysis.miscellaneous.SetKeywordMarkerFilter): 2
IOException (java.io.IOException): 1