Example 36 with WhitespaceTokenizer

Use of org.apache.lucene.analysis.core.WhitespaceTokenizer in project elasticsearch by elastic.

From the class AnalysisModuleTests, method assertTokenFilter.

private void assertTokenFilter(String name, Class<?> clazz) throws IOException {
    Settings settings = Settings.builder()
        .put(IndexMetaData.SETTING_VERSION_CREATED, Version.CURRENT)
        .put(Environment.PATH_HOME_SETTING.getKey(), createTempDir().toString())
        .build();
    TestAnalysis analysis = AnalysisTestsHelper.createTestAnalysisFromSettings(settings);
    TokenFilterFactory tokenFilter = analysis.tokenFilter.get(name);
    Tokenizer tokenizer = new WhitespaceTokenizer();
    tokenizer.setReader(new StringReader("foo bar"));
    TokenStream stream = tokenFilter.create(tokenizer);
    assertThat(stream, instanceOf(clazz));
}
Also used : WhitespaceTokenizer (org.apache.lucene.analysis.core.WhitespaceTokenizer), TokenStream (org.apache.lucene.analysis.TokenStream), StringReader (java.io.StringReader), Tokenizer (org.apache.lucene.analysis.Tokenizer), Settings (org.elasticsearch.common.settings.Settings), IndexSettings (org.elasticsearch.index.IndexSettings), MyFilterTokenFilterFactory (org.elasticsearch.index.analysis.filter1.MyFilterTokenFilterFactory), StopTokenFilterFactory (org.elasticsearch.index.analysis.StopTokenFilterFactory), TokenFilterFactory (org.elasticsearch.index.analysis.TokenFilterFactory)

Example 37 with WhitespaceTokenizer

Use of org.apache.lucene.analysis.core.WhitespaceTokenizer in project elasticsearch by elastic.

From the class KeywordFieldTypeTests, method testTermQueryWithNormalizer.

public void testTermQueryWithNormalizer() {
    MappedFieldType ft = createDefaultFieldType();
    ft.setName("field");
    ft.setIndexOptions(IndexOptions.DOCS);
    Analyzer normalizer = new Analyzer() {

        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer in = new WhitespaceTokenizer();
            TokenFilter out = new LowerCaseFilter(in);
            return new TokenStreamComponents(in, out);
        }

        @Override
        protected TokenStream normalize(String fieldName, TokenStream in) {
            return new LowerCaseFilter(in);
        }
    };
    ft.setSearchAnalyzer(new NamedAnalyzer("my_normalizer", AnalyzerScope.INDEX, normalizer));
    assertEquals(new TermQuery(new Term("field", "foo bar")), ft.termQuery("fOo BaR", null));
    ft.setIndexOptions(IndexOptions.NONE);
    IllegalArgumentException e = expectThrows(IllegalArgumentException.class, () -> ft.termQuery("bar", null));
    assertEquals("Cannot search on field [field] since it is not indexed.", e.getMessage());
}
Also used : WhitespaceTokenizer (org.apache.lucene.analysis.core.WhitespaceTokenizer), TermQuery (org.apache.lucene.search.TermQuery), TokenStream (org.apache.lucene.analysis.TokenStream), NamedAnalyzer (org.elasticsearch.index.analysis.NamedAnalyzer), Term (org.apache.lucene.index.Term), Analyzer (org.apache.lucene.analysis.Analyzer), Tokenizer (org.apache.lucene.analysis.Tokenizer), LowerCaseFilter (org.apache.lucene.analysis.LowerCaseFilter), TokenFilter (org.apache.lucene.analysis.TokenFilter)
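
Example 37 expects the query term "foo bar" to stay a single term, even though the analyzer's createComponents would split on whitespace. The test passes because the term query goes through normalize, which only lowercases. The following is a dependency-free sketch of that difference in plain Java; it does not use Lucene. The class name WhitespaceLowercaseSketch and its methods are hypothetical. It assumes WhitespaceTokenizer splits on Character.isWhitespace, and it approximates LowerCaseFilter with String.toLowerCase(Locale.ROOT), which can differ from Lucene's per-code-point lowercasing for a few characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical, Lucene-free illustration of "tokenize + lowercase" versus "normalize only".
final class WhitespaceLowercaseSketch {

    // Splits text on runs of whitespace; empty tokens are never emitted.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                if (current.length() > 0) {
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.appendCodePoint(cp);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Analogue of the createComponents chain: whitespace tokenizer, then lowercase filter.
    static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        for (String token : tokenize(text)) {
            out.add(token.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Analogue of normalize: lowercase only, no splitting, so the input stays one term.
    static String normalize(String text) {
        return text.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(analyze("fOo BaR"));   // prints [foo, bar]
        System.out.println(normalize("fOo BaR")); // prints foo bar
    }
}
```

Under these assumptions, the analysis path yields two tokens while the normalization path keeps one string, which matches the single Term("field", "foo bar") asserted in the test.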

Example 38 with WhitespaceTokenizer

Use of org.apache.lucene.analysis.core.WhitespaceTokenizer in project lucene-solr-analysis-turkish by iorixxx.

From the class Zemberek2DeASCIIfyFilterFactory, method main.

public static void main(String[] args) throws IOException {
    StringReader reader = new StringReader("kus asisi ortaklar çekişme masali");
    Map<String, String> map = new HashMap<>();
    Zemberek2DeASCIIfyFilterFactory factory = new Zemberek2DeASCIIfyFilterFactory(map);
    WhitespaceTokenizer whitespaceTokenizer = new WhitespaceTokenizer();
    whitespaceTokenizer.setReader(reader);
    TokenStream stream = factory.create(whitespaceTokenizer);
    CharTermAttribute termAttribute = stream.getAttribute(CharTermAttribute.class);
    stream.reset();
    while (stream.incrementToken()) {
        String term = termAttribute.toString();
        System.out.println(term);
    }
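    stream.end();
    // Close the stream to complete the TokenStream workflow (reset, incrementToken, end, close).
    stream.close();
    reader.close();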
}
Also used : WhitespaceTokenizer (org.apache.lucene.analysis.core.WhitespaceTokenizer), TokenStream (org.apache.lucene.analysis.TokenStream), CharTermAttribute (org.apache.lucene.analysis.tokenattributes.CharTermAttribute), HashMap (java.util.HashMap), StringReader (java.io.StringReader)

Example 39 with WhitespaceTokenizer

Use of org.apache.lucene.analysis.core.WhitespaceTokenizer in project lucene-solr-analysis-turkish by iorixxx.

From the class Zemberek2StemFilterFactory, method main.

public static void main(String[] args) throws IOException {
    StringReader reader = new StringReader("elması utansın ortaklar çekişme ile");
    Map<String, String> map = new HashMap<>();
    map.put("strategy", "frequency");
    Zemberek2StemFilterFactory factory = new Zemberek2StemFilterFactory(map);
    WhitespaceTokenizer whitespaceTokenizer = new WhitespaceTokenizer();
    whitespaceTokenizer.setReader(reader);
    TokenStream stream = factory.create(whitespaceTokenizer);
    CharTermAttribute termAttribute = stream.getAttribute(CharTermAttribute.class);
    stream.reset();
    while (stream.incrementToken()) {
        String term = termAttribute.toString();
        System.out.println(term);
    }
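    stream.end();
    // Close the stream to complete the TokenStream workflow (reset, incrementToken, end, close).
    stream.close();
    reader.close();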
}
Also used : WhitespaceTokenizer (org.apache.lucene.analysis.core.WhitespaceTokenizer), TokenStream (org.apache.lucene.analysis.TokenStream), CharTermAttribute (org.apache.lucene.analysis.tokenattributes.CharTermAttribute), HashMap (java.util.HashMap), StringReader (java.io.StringReader)

Example 40 with WhitespaceTokenizer

Use of org.apache.lucene.analysis.core.WhitespaceTokenizer in project neo4j by neo4j.

From the class CustomAnalyzer, method createComponents.

@Override
protected TokenStreamComponents createComponents(String fieldName) {
    called = true;
    Tokenizer source = new WhitespaceTokenizer();
    return new TokenStreamComponents(source, new LowerCaseFilter(source));
}
Also used : WhitespaceTokenizer (org.apache.lucene.analysis.core.WhitespaceTokenizer), Tokenizer (org.apache.lucene.analysis.Tokenizer), LowerCaseFilter (org.apache.lucene.analysis.core.LowerCaseFilter)

Aggregations

WhitespaceTokenizer (org.apache.lucene.analysis.core.WhitespaceTokenizer): 44
Tokenizer (org.apache.lucene.analysis.Tokenizer): 38
StringReader (java.io.StringReader): 37
ESTestCase (org.elasticsearch.test.ESTestCase): 25
TokenStream (org.apache.lucene.analysis.TokenStream): 16
Settings (org.elasticsearch.common.settings.Settings): 8
Analyzer (org.apache.lucene.analysis.Analyzer): 4
KeywordTokenizer (org.apache.lucene.analysis.core.KeywordTokenizer): 4
CharTermAttribute (org.apache.lucene.analysis.tokenattributes.CharTermAttribute): 4
IOException (java.io.IOException): 3
HashMap (java.util.HashMap): 3
LowerCaseFilter (org.apache.lucene.analysis.core.LowerCaseFilter): 3
PorterStemFilter (org.apache.lucene.analysis.en.PorterStemFilter): 3
ParseException (java.text.ParseException): 2
LowerCaseFilter (org.apache.lucene.analysis.LowerCaseFilter): 2
MockTokenizer (org.apache.lucene.analysis.MockTokenizer): 2
StopFilter (org.apache.lucene.analysis.StopFilter): 2
TokenizerFactory (org.apache.lucene.analysis.util.TokenizerFactory): 2
SuggestStopFilter (org.apache.lucene.search.suggest.analyzing.SuggestStopFilter): 2
Version (org.elasticsearch.Version): 2