Example 21 with TokenFilter

Use of org.apache.lucene.analysis.TokenFilter in project lucene-solr by apache.

From the class TestCompoundWordTokenFilter, method testInvalidOffsets.

// SOLR-2891
// *CompoundWordTokenFilter blindly adds term length to offset, but this can take things out of bounds
// wrt original text if a previous filter increases the length of the word (in this case ü -> ue)
// so in this case we behave like WDF, and preserve any modified offsets
public void testInvalidOffsets() throws Exception {
    final CharArraySet dict = makeDictionary("fall");
    final NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
    builder.add("ü", "ue");
    final NormalizeCharMap normMap = builder.build();
    Analyzer analyzer = new Analyzer() {

        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer tokenizer = new MockTokenizer(MockTokenizer.WHITESPACE, false);
            TokenFilter filter = new DictionaryCompoundWordTokenFilter(tokenizer, dict);
            return new TokenStreamComponents(tokenizer, filter);
        }

        @Override
        protected Reader initReader(String fieldName, Reader reader) {
            return new MappingCharFilter(normMap, reader);
        }
    };
    assertAnalyzesTo(analyzer, "banküberfall", new String[] { "bankueberfall", "fall" }, new int[] { 0, 0 }, new int[] { 12, 12 });
    analyzer.close();
}
Also used: CharArraySet (org.apache.lucene.analysis.CharArraySet), Reader (java.io.Reader), StringReader (java.io.StringReader), Analyzer (org.apache.lucene.analysis.Analyzer), MockTokenizer (org.apache.lucene.analysis.MockTokenizer), MappingCharFilter (org.apache.lucene.analysis.charfilter.MappingCharFilter), NormalizeCharMap (org.apache.lucene.analysis.charfilter.NormalizeCharMap), Tokenizer (org.apache.lucene.analysis.Tokenizer), KeywordTokenizer (org.apache.lucene.analysis.core.KeywordTokenizer), TokenFilter (org.apache.lucene.analysis.TokenFilter)
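
The ü -> ue expansion that trips up the offsets can be reproduced with MappingCharFilter on its own, before any tokenizer is involved. A minimal sketch (the class name MappingCharFilterDemo is invented for illustration):

import java.io.Reader;
import java.io.StringReader;
import org.apache.lucene.analysis.charfilter.MappingCharFilter;
import org.apache.lucene.analysis.charfilter.NormalizeCharMap;

public class MappingCharFilterDemo {
    public static void main(String[] args) throws Exception {
        NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
        builder.add("ü", "ue");
        // The char filter rewrites the character stream before tokenization.
        Reader filtered = new MappingCharFilter(builder.build(), new StringReader("banküberfall"));
        StringBuilder out = new StringBuilder();
        char[] buf = new char[64];
        for (int n; (n = filtered.read(buf)) != -1; ) {
            out.append(buf, 0, n);
        }
        // Prints "bankueberfall": 13 chars produced from 12 chars of input,
        // so a filter that computes end offsets as start + term length can
        // point past the end of the original text.
        System.out.println(out);
    }
}

Because the filtered text is one character longer than the input, the compound filter preserves the offsets set by the upstream filter instead of recomputing them, which is exactly what the test asserts with the end offsets of 12.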

Example 22 with TokenFilter

Use of org.apache.lucene.analysis.TokenFilter in project lucene-solr by apache.

From the class TestCompoundWordTokenFilter, method testEmptyTerm.

public void testEmptyTerm() throws Exception {
    // Both compound filters must pass a zero-length term through unchanged.
    final CharArraySet dict = makeDictionary("a", "e", "i", "o", "u", "y", "bc", "def");
    Analyzer a = new Analyzer() {

        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer tokenizer = new KeywordTokenizer();
            return new TokenStreamComponents(tokenizer, new DictionaryCompoundWordTokenFilter(tokenizer, dict));
        }
    };
    checkOneTerm(a, "", "");
    a.close();
    // Load the Danish hyphenation grammar shipped with the test resources.
    InputSource is = new InputSource(getClass().getResource("da_UTF8.xml").toExternalForm());
    final HyphenationTree hyphenator = HyphenationCompoundWordTokenFilter.getHyphenationTree(is);
    Analyzer b = new Analyzer() {

        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer tokenizer = new KeywordTokenizer();
            TokenFilter filter = new HyphenationCompoundWordTokenFilter(tokenizer, hyphenator);
            return new TokenStreamComponents(tokenizer, filter);
        }
    };
    checkOneTerm(b, "", "");
    b.close();
}
Also used: CharArraySet (org.apache.lucene.analysis.CharArraySet), InputSource (org.xml.sax.InputSource), HyphenationTree (org.apache.lucene.analysis.compound.hyphenation.HyphenationTree), Analyzer (org.apache.lucene.analysis.Analyzer), KeywordTokenizer (org.apache.lucene.analysis.core.KeywordTokenizer), Tokenizer (org.apache.lucene.analysis.Tokenizer), MockTokenizer (org.apache.lucene.analysis.MockTokenizer), TokenFilter (org.apache.lucene.analysis.TokenFilter)
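
Outside the test harness, the same hyphenation setup can back a reusable Analyzer. A minimal sketch, assuming the Danish grammar file da_UTF8.xml is available as a classpath resource next to the class (HyphenationAnalyzerSketch is an invented name):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter;
import org.apache.lucene.analysis.compound.hyphenation.HyphenationTree;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.xml.sax.InputSource;

public class HyphenationAnalyzerSketch {
    public static Analyzer build() throws Exception {
        // Same resource the test loads: a Danish hyphenation grammar.
        InputSource is = new InputSource(
                HyphenationAnalyzerSketch.class.getResource("da_UTF8.xml").toExternalForm());
        final HyphenationTree hyphenator =
                HyphenationCompoundWordTokenFilter.getHyphenationTree(is);
        return new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName) {
                Tokenizer tokenizer = new WhitespaceTokenizer();
                // Grammar-only constructor: hyphenation points alone decide the subwords.
                TokenFilter filter = new HyphenationCompoundWordTokenFilter(tokenizer, hyphenator);
                return new TokenStreamComponents(tokenizer, filter);
            }
        };
    }
}

The two-argument constructor used here decompounds from the hyphenation grammar alone; the overloads that also take a CharArraySet dictionary restrict the output to subwords found in the dictionary.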

Example 23 with TokenFilter

Use of org.apache.lucene.analysis.TokenFilter in project lucene-solr by apache.

From the class TestElision, method testElision.

public void testElision() throws Exception {
    String test = "Plop, juste pour voir l'embrouille avec O'brian. M'enfin.";
    Tokenizer tokenizer = new StandardTokenizer(newAttributeFactory());
    tokenizer.setReader(new StringReader(test));
    // "l" and "M" are the elided articles to strip; ignoreCase is false,
    // so the "O" in O'brian is left alone.
    CharArraySet articles = new CharArraySet(asSet("l", "M"), false);
    TokenFilter filter = new ElisionFilter(tokenizer, articles);
    // filter(...) is a TestElision helper that drains the stream and
    // collects the term texts.
    List<String> tas = filter(filter);
    assertEquals("embrouille", tas.get(4));
    assertEquals("O'brian", tas.get(6));
    assertEquals("enfin", tas.get(7));
}
Also used: CharArraySet (org.apache.lucene.analysis.CharArraySet), StandardTokenizer (org.apache.lucene.analysis.standard.StandardTokenizer), StringReader (java.io.StringReader), Tokenizer (org.apache.lucene.analysis.Tokenizer), KeywordTokenizer (org.apache.lucene.analysis.core.KeywordTokenizer), TokenFilter (org.apache.lucene.analysis.TokenFilter)
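
Doing the same consumption by hand, without the test helper, shows the standard TokenStream loop (reset, incrementToken, end). A sketch assuming the lucene-solr package org.apache.lucene.analysis.util.ElisionFilter (newer Lucene versions may package the class elsewhere) and an invented class name ElisionDemo:

import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.ElisionFilter;

public class ElisionDemo {
    public static void main(String[] args) throws Exception {
        Tokenizer tokenizer = new StandardTokenizer();
        tokenizer.setReader(new StringReader("juste pour voir l'embrouille avec O'brian"));
        CharArraySet articles = new CharArraySet(Arrays.asList("l"), false);
        List<String> terms = new ArrayList<>();
        try (TokenStream ts = new ElisionFilter(tokenizer, articles)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();                        // mandatory before incrementToken()
            while (ts.incrementToken()) {
                terms.add(term.toString());    // "l'embrouille" arrives as "embrouille"
            }
            ts.end();
        }
        System.out.println(terms);
    }
}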

Aggregations (classes used across the 23 examples, with usage counts):

TokenFilter (org.apache.lucene.analysis.TokenFilter): 23
Tokenizer (org.apache.lucene.analysis.Tokenizer): 19
Analyzer (org.apache.lucene.analysis.Analyzer): 17
MockTokenizer (org.apache.lucene.analysis.MockTokenizer): 12
KeywordTokenizer (org.apache.lucene.analysis.core.KeywordTokenizer): 9
StringReader (java.io.StringReader): 8
CharArraySet (org.apache.lucene.analysis.CharArraySet): 6
Document (org.apache.lucene.document.Document): 6
StandardTokenizer (org.apache.lucene.analysis.standard.StandardTokenizer): 5
IndexWriterConfig (org.apache.lucene.index.IndexWriterConfig): 5
HashMap (java.util.HashMap): 4
LowerCaseFilter (org.apache.lucene.analysis.LowerCaseFilter): 4
Field (org.apache.lucene.document.Field): 4
TextField (org.apache.lucene.document.TextField): 4
IndexWriter (org.apache.lucene.index.IndexWriter): 4
Directory (org.apache.lucene.store.Directory): 4
RAMDirectory (org.apache.lucene.store.RAMDirectory): 4
BytesRef (org.apache.lucene.util.BytesRef): 4
IOException (java.io.IOException): 3
MockTokenFilter (org.apache.lucene.analysis.MockTokenFilter): 3