Example 21 with WhitespaceTokenizer

Use of org.apache.lucene.analysis.core.WhitespaceTokenizer in the elasticsearch project by elastic.

From the class ASCIIFoldingTokenFilterFactoryTests, method testPreserveOriginal.

public void testPreserveOriginal() throws IOException {
    ESTestCase.TestAnalysis analysis = AnalysisTestsHelper.createTestAnalysisFromSettings(Settings.builder()
            .put(Environment.PATH_HOME_SETTING.getKey(), createTempDir().toString())
            .put("index.analysis.filter.my_ascii_folding.type", "asciifolding")
            .put("index.analysis.filter.my_ascii_folding.preserve_original", true)
            .build());
    TokenFilterFactory tokenFilter = analysis.tokenFilter.get("my_ascii_folding");
    String source = "Ansprüche";
    String[] expected = new String[] { "Anspruche", "Ansprüche" };
    Tokenizer tokenizer = new WhitespaceTokenizer();
    tokenizer.setReader(new StringReader(source));
    assertTokenStreamContents(tokenFilter.create(tokenizer), expected);
    // the multi-term-aware variant, by contrast, emits only the folded token
    tokenFilter = (TokenFilterFactory) ((MultiTermAwareComponent) tokenFilter).getMultiTermComponent();
    tokenizer = new WhitespaceTokenizer();
    tokenizer.setReader(new StringReader(source));
    expected = new String[] { "Anspruche" };
    assertTokenStreamContents(tokenFilter.create(tokenizer), expected);
}
Also used : WhitespaceTokenizer(org.apache.lucene.analysis.core.WhitespaceTokenizer) Tokenizer(org.apache.lucene.analysis.Tokenizer) StringReader(java.io.StringReader) ESTestCase(org.elasticsearch.test.ESTestCase)
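Lucene's ASCIIFoldingFilter applies a large hand-built mapping table, but for inputs that differ only by diacritics the preserve_original behavior above can be approximated with stdlib Unicode NFD decomposition. A minimal sketch (AsciiFoldSketch and its fold helper are illustrative, not Lucene classes):

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

// Stdlib-only approximation of asciifolding with preserve_original:
// decompose to NFD, strip combining marks, and optionally keep the
// original token alongside the folded one.
public class AsciiFoldSketch {

    static List<String> fold(String token, boolean preserveOriginal) {
        String folded = Normalizer.normalize(token, Normalizer.Form.NFD)
                .replaceAll("\\p{M}", ""); // drop combining diacritical marks
        List<String> out = new ArrayList<>();
        out.add(folded);
        if (preserveOriginal && !folded.equals(token)) {
            out.add(token); // emit the unfolded original as a second token
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(fold("Ansprüche", true));  // [Anspruche, Ansprüche]
        System.out.println(fold("Ansprüche", false)); // [Anspruche]
    }
}
```

Note this only covers combining-mark folding; the real filter also maps ligatures, ASCII look-alikes, and many other characters.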

Example 22 with WhitespaceTokenizer

Use of org.apache.lucene.analysis.core.WhitespaceTokenizer in the elasticsearch project by elastic.

From the class ASCIIFoldingTokenFilterFactoryTests, method testDefault.

public void testDefault() throws IOException {
    ESTestCase.TestAnalysis analysis = AnalysisTestsHelper.createTestAnalysisFromSettings(Settings.builder()
            .put(Environment.PATH_HOME_SETTING.getKey(), createTempDir().toString())
            .put("index.analysis.filter.my_ascii_folding.type", "asciifolding")
            .build());
    TokenFilterFactory tokenFilter = analysis.tokenFilter.get("my_ascii_folding");
    String source = "Ansprüche";
    String[] expected = new String[] { "Anspruche" };
    Tokenizer tokenizer = new WhitespaceTokenizer();
    tokenizer.setReader(new StringReader(source));
    assertTokenStreamContents(tokenFilter.create(tokenizer), expected);
}
Also used : WhitespaceTokenizer(org.apache.lucene.analysis.core.WhitespaceTokenizer) Tokenizer(org.apache.lucene.analysis.Tokenizer) StringReader(java.io.StringReader) ESTestCase(org.elasticsearch.test.ESTestCase)

Example 23 with WhitespaceTokenizer

Use of org.apache.lucene.analysis.core.WhitespaceTokenizer in the elasticsearch project by elastic.

From the class ShingleTokenFilterFactoryTests, method testDefault.

public void testDefault() throws IOException {
    ESTestCase.TestAnalysis analysis = AnalysisTestsHelper.createTestAnalysisFromClassPath(createTempDir(), RESOURCE);
    TokenFilterFactory tokenFilter = analysis.tokenFilter.get("shingle");
    String source = "the quick brown fox";
    String[] expected = new String[] { "the", "the quick", "quick", "quick brown", "brown", "brown fox", "fox" };
    Tokenizer tokenizer = new WhitespaceTokenizer();
    tokenizer.setReader(new StringReader(source));
    assertTokenStreamContents(tokenFilter.create(tokenizer), expected);
}
Also used : WhitespaceTokenizer(org.apache.lucene.analysis.core.WhitespaceTokenizer) Tokenizer(org.apache.lucene.analysis.Tokenizer) StringReader(java.io.StringReader) ESTestCase(org.elasticsearch.test.ESTestCase)
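The expected array above interleaves unigrams with bigram shingles in position order, which is the default ShingleFilter output (shingle size 2, unigrams kept). A minimal stdlib sketch of that ordering (ShingleSketch is illustrative, not Lucene code):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of default shingling: at each position emit the unigram,
// then the bigram starting there, mirroring ShingleFilter's order.
public class ShingleSketch {

    static List<String> shingles(String text) {
        String[] words = text.split("\\s+");
        List<String> out = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            out.add(words[i]); // unigram at this position
            if (i + 1 < words.length) {
                out.add(words[i] + " " + words[i + 1]); // bigram shingle
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // [the, the quick, quick, quick brown, brown, brown fox, fox]
        System.out.println(shingles("the quick brown fox"));
    }
}
```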

Example 24 with WhitespaceTokenizer

Use of org.apache.lucene.analysis.core.WhitespaceTokenizer in the elasticsearch project by elastic.

From the class ShingleTokenFilterFactoryTests, method testFillerToken.

public void testFillerToken() throws IOException {
    ESTestCase.TestAnalysis analysis = AnalysisTestsHelper.createTestAnalysisFromClassPath(createTempDir(), RESOURCE);
    TokenFilterFactory tokenFilter = analysis.tokenFilter.get("shingle_filler");
    String source = "simon the sorcerer";
    String[] expected = new String[] { "simon FILLER", "simon FILLER sorcerer", "FILLER sorcerer" };
    Tokenizer tokenizer = new WhitespaceTokenizer();
    tokenizer.setReader(new StringReader(source));
    TokenStream stream = new StopFilter(tokenizer, StopFilter.makeStopSet("the"));
    assertTokenStreamContents(tokenFilter.create(stream), expected);
}
Also used : WhitespaceTokenizer(org.apache.lucene.analysis.core.WhitespaceTokenizer) Tokenizer(org.apache.lucene.analysis.Tokenizer) TokenStream(org.apache.lucene.analysis.TokenStream) StopFilter(org.apache.lucene.analysis.StopFilter) StringReader(java.io.StringReader) ESTestCase(org.elasticsearch.test.ESTestCase)
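Here the StopFilter removes "the", and the shingle filter fills the resulting position gap with a filler token before building 2- and 3-word shingles, with unigram output disabled. A rough sketch that substitutes the stopword textually rather than tracking position increments as Lucene does (the FillerShingleSketch class and its hard-coded "FILLER" string are assumptions of this sketch):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

// Sketch of filler-token shingling: stopwords become "FILLER", then
// all 2- and 3-word shingles are emitted (no unigrams).
public class FillerShingleSketch {

    static List<String> fillerShingles(String text, Set<String> stopwords) {
        String[] words = text.split("\\s+");
        for (int i = 0; i < words.length; i++) {
            if (stopwords.contains(words[i])) {
                words[i] = "FILLER"; // stand-in for the removed stopword
            }
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            for (int size = 2; size <= 3 && i + size <= words.length; size++) {
                out.add(String.join(" ", Arrays.copyOfRange(words, i, i + size)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // [simon FILLER, simon FILLER sorcerer, FILLER sorcerer]
        System.out.println(fillerShingles("simon the sorcerer", Set.of("the")));
    }
}
```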

Example 25 with WhitespaceTokenizer

Use of org.apache.lucene.analysis.core.WhitespaceTokenizer in the elasticsearch project by elastic.

From the class ShingleTokenFilterFactoryTests, method testInverseMapping.

public void testInverseMapping() throws IOException {
    ESTestCase.TestAnalysis analysis = AnalysisTestsHelper.createTestAnalysisFromClassPath(createTempDir(), RESOURCE);
    TokenFilterFactory tokenFilter = analysis.tokenFilter.get("shingle_inverse");
    assertThat(tokenFilter, instanceOf(ShingleTokenFilterFactory.class));
    String source = "the quick brown fox";
    String[] expected = new String[] { "the_quick_brown", "quick_brown_fox" };
    Tokenizer tokenizer = new WhitespaceTokenizer();
    tokenizer.setReader(new StringReader(source));
    assertTokenStreamContents(tokenFilter.create(tokenizer), expected);
}
Also used : WhitespaceTokenizer(org.apache.lucene.analysis.core.WhitespaceTokenizer) Tokenizer(org.apache.lucene.analysis.Tokenizer) StringReader(java.io.StringReader) ESTestCase(org.elasticsearch.test.ESTestCase)
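Judging by the expected output, the shingle_inverse filter emits only 3-word shingles joined with "_" and no unigrams. Assuming that configuration (the test's RESOURCE settings file is not shown here), the behavior can be sketched as:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of trigram-only shingling with "_" as the token separator
// and unigram output disabled.
public class TrigramShingleSketch {

    static List<String> trigrams(String text) {
        String[] w = text.split("\\s+");
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 3 <= w.length; i++) {
            out.add(w[i] + "_" + w[i + 1] + "_" + w[i + 2]);
        }
        return out;
    }

    public static void main(String[] args) {
        // [the_quick_brown, quick_brown_fox]
        System.out.println(trigrams("the quick brown fox"));
    }
}
```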

Aggregations

WhitespaceTokenizer (org.apache.lucene.analysis.core.WhitespaceTokenizer): 44
Tokenizer (org.apache.lucene.analysis.Tokenizer): 38
StringReader (java.io.StringReader): 37
ESTestCase (org.elasticsearch.test.ESTestCase): 25
TokenStream (org.apache.lucene.analysis.TokenStream): 16
Settings (org.elasticsearch.common.settings.Settings): 8
Analyzer (org.apache.lucene.analysis.Analyzer): 4
KeywordTokenizer (org.apache.lucene.analysis.core.KeywordTokenizer): 4
CharTermAttribute (org.apache.lucene.analysis.tokenattributes.CharTermAttribute): 4
IOException (java.io.IOException): 3
HashMap (java.util.HashMap): 3
LowerCaseFilter (org.apache.lucene.analysis.core.LowerCaseFilter): 3
PorterStemFilter (org.apache.lucene.analysis.en.PorterStemFilter): 3
ParseException (java.text.ParseException): 2
LowerCaseFilter (org.apache.lucene.analysis.LowerCaseFilter): 2
MockTokenizer (org.apache.lucene.analysis.MockTokenizer): 2
StopFilter (org.apache.lucene.analysis.StopFilter): 2
TokenizerFactory (org.apache.lucene.analysis.util.TokenizerFactory): 2
SuggestStopFilter (org.apache.lucene.search.suggest.analyzing.SuggestStopFilter): 2
Version (org.elasticsearch.Version): 2