
Example 41 with WhitespaceTokenizer

Use of org.apache.lucene.analysis.core.WhitespaceTokenizer in project lucene-solr by apache.

From class TestCharTokenizers, method testCustomMaxTokenLength.

/*
   * Tests the max token length passed as a parameter: the tokenizer must split
   * at that position, no matter what the input contains.
   */
public void testCustomMaxTokenLength() throws IOException {
    StringBuilder builder = new StringBuilder();
    for (int i = 0; i < 100; i++) {
        builder.append("A");
    }
    Tokenizer tokenizer = new LowerCaseTokenizer(newAttributeFactory(), 100);
    // Tricky: pass two concatenated copies to the reader; the 200-char run must split at the 100-char limit.
    tokenizer.setReader(new StringReader(builder.toString() + builder.toString()));
    assertTokenStreamContents(tokenizer, new String[] { builder.toString().toLowerCase(Locale.ROOT), builder.toString().toLowerCase(Locale.ROOT) });
    Exception e = expectThrows(IllegalArgumentException.class, () -> new LowerCaseTokenizer(newAttributeFactory(), -1));
    assertEquals("maxTokenLen must be greater than 0 and less than 1048576 passed: -1", e.getMessage());
    tokenizer = new LetterTokenizer(newAttributeFactory(), 100);
    tokenizer.setReader(new StringReader(builder.toString() + builder.toString()));
    assertTokenStreamContents(tokenizer, new String[] { builder.toString(), builder.toString() });
    // Let's test that we can get a token longer than 255 through.
    builder.setLength(0);
    for (int i = 0; i < 500; i++) {
        builder.append("Z");
    }
    tokenizer = new LetterTokenizer(newAttributeFactory(), 500);
    tokenizer.setReader(new StringReader(builder.toString()));
    assertTokenStreamContents(tokenizer, new String[] { builder.toString() });
    // Edge cases: a maxTokenLen of zero or beyond the limit must be rejected,
    // and a token longer than the I/O buffer (4096 chars) must come through intact.
    builder.setLength(0);
    for (int i = 0; i < 600; i++) {
        // 600 * 8 = 4800 chars.
        builder.append("aUrOkIjq");
    }
    e = expectThrows(IllegalArgumentException.class, () -> new LowerCaseTokenizer(newAttributeFactory(), 0));
    assertEquals("maxTokenLen must be greater than 0 and less than 1048576 passed: 0", e.getMessage());
    e = expectThrows(IllegalArgumentException.class, () -> new LowerCaseTokenizer(newAttributeFactory(), 10_000_000));
    assertEquals("maxTokenLen must be greater than 0 and less than 1048576 passed: 10000000", e.getMessage());
    tokenizer = new LowerCaseTokenizer(newAttributeFactory(), 4800);
    tokenizer.setReader(new StringReader(builder.toString()));
    assertTokenStreamContents(tokenizer, new String[] { builder.toString().toLowerCase(Locale.ROOT) });
    e = expectThrows(IllegalArgumentException.class, () -> new KeywordTokenizer(newAttributeFactory(), 0));
    assertEquals("maxTokenLen must be greater than 0 and less than 1048576 passed: 0", e.getMessage());
    e = expectThrows(IllegalArgumentException.class, () -> new KeywordTokenizer(newAttributeFactory(), 10_000_000));
    assertEquals("maxTokenLen must be greater than 0 and less than 1048576 passed: 10000000", e.getMessage());
    tokenizer = new KeywordTokenizer(newAttributeFactory(), 4800);
    tokenizer.setReader(new StringReader(builder.toString()));
    assertTokenStreamContents(tokenizer, new String[] { builder.toString() });
    e = expectThrows(IllegalArgumentException.class, () -> new LetterTokenizer(newAttributeFactory(), 0));
    assertEquals("maxTokenLen must be greater than 0 and less than 1048576 passed: 0", e.getMessage());
    e = expectThrows(IllegalArgumentException.class, () -> new LetterTokenizer(newAttributeFactory(), 2_000_000));
    assertEquals("maxTokenLen must be greater than 0 and less than 1048576 passed: 2000000", e.getMessage());
    tokenizer = new LetterTokenizer(newAttributeFactory(), 4800);
    tokenizer.setReader(new StringReader(builder.toString()));
    assertTokenStreamContents(tokenizer, new String[] { builder.toString() });
    e = expectThrows(IllegalArgumentException.class, () -> new WhitespaceTokenizer(newAttributeFactory(), 0));
    assertEquals("maxTokenLen must be greater than 0 and less than 1048576 passed: 0", e.getMessage());
    e = expectThrows(IllegalArgumentException.class, () -> new WhitespaceTokenizer(newAttributeFactory(), 3_000_000));
    assertEquals("maxTokenLen must be greater than 0 and less than 1048576 passed: 3000000", e.getMessage());
    tokenizer = new WhitespaceTokenizer(newAttributeFactory(), 4800);
    tokenizer.setReader(new StringReader(builder.toString()));
    assertTokenStreamContents(tokenizer, new String[] { builder.toString() });
}
Also used : Tokenizer (org.apache.lucene.analysis.Tokenizer) WhitespaceTokenizer (org.apache.lucene.analysis.core.WhitespaceTokenizer) LowerCaseTokenizer (org.apache.lucene.analysis.core.LowerCaseTokenizer) LetterTokenizer (org.apache.lucene.analysis.core.LetterTokenizer) KeywordTokenizer (org.apache.lucene.analysis.core.KeywordTokenizer) StringReader (java.io.StringReader) IOException (java.io.IOException)
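
To see the splitting behavior outside the test harness, here is a minimal sketch; the 5-char limit, the sample input, and the MaxTokenLenDemo class name are invented for illustration, and it assumes a Lucene version that has the maxTokenLen constructors this test covers:

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.AttributeFactory;

public class MaxTokenLenDemo {
    public static void main(String[] args) throws IOException {
        // maxTokenLen of 5: any run longer than 5 chars is split, whitespace or not.
        Tokenizer tokenizer = new WhitespaceTokenizer(
                AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY, 5);
        tokenizer.setReader(new StringReader("abcdefghij kl"));
        CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
        tokenizer.reset();
        while (tokenizer.incrementToken()) {
            System.out.println(term.toString()); // prints: abcde, fghij, kl
        }
        tokenizer.end();
        tokenizer.close();
    }
}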

Example 42 with WhitespaceTokenizer

Use of org.apache.lucene.analysis.core.WhitespaceTokenizer in project lucene-solr by apache.

From class TestDaitchMokotoffSoundexFilter, method assertAlgorithm.

static void assertAlgorithm(boolean inject, String input, String[] expected) throws Exception {
    Tokenizer tokenizer = new WhitespaceTokenizer();
    tokenizer.setReader(new StringReader(input));
    DaitchMokotoffSoundexFilter filter = new DaitchMokotoffSoundexFilter(tokenizer, inject);
    assertTokenStreamContents(filter, expected);
}
Also used : Tokenizer (org.apache.lucene.analysis.Tokenizer) WhitespaceTokenizer (org.apache.lucene.analysis.core.WhitespaceTokenizer) MockTokenizer (org.apache.lucene.analysis.MockTokenizer) KeywordTokenizer (org.apache.lucene.analysis.core.KeywordTokenizer) StringReader (java.io.StringReader)
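
A hedged usage sketch for the helper above; the SoundexDemo class and the sample names are invented, and no specific Daitch-Mokotoff codes are asserted since they depend on the encoder's rules:

import java.io.StringReader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.phonetic.DaitchMokotoffSoundexFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class SoundexDemo {
    public static void main(String[] args) throws Exception {
        Tokenizer tokenizer = new WhitespaceTokenizer();
        tokenizer.setReader(new StringReader("Moskowitz Peters"));
        // inject=true keeps the original tokens alongside the soundex codes
        DaitchMokotoffSoundexFilter filter =
                new DaitchMokotoffSoundexFilter(tokenizer, true);
        CharTermAttribute term = filter.addAttribute(CharTermAttribute.class);
        filter.reset();
        while (filter.incrementToken()) {
            System.out.println(term.toString());
        }
        filter.end();
        filter.close();
    }
}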

Example 43 with WhitespaceTokenizer

Use of org.apache.lucene.analysis.core.WhitespaceTokenizer in project lucene-solr by apache.

From class TestStemmerOverrideFilter, method testRandomRealisticWhiteSpace.

public void testRandomRealisticWhiteSpace() throws IOException {
    Map<String, String> map = new HashMap<>();
    Set<String> seen = new HashSet<>();
    int numTerms = atLeast(50);
    boolean ignoreCase = random().nextBoolean();
    for (int i = 0; i < numTerms; i++) {
        String randomRealisticUnicodeString = TestUtil.randomRealisticUnicodeString(random());
        char[] charArray = randomRealisticUnicodeString.toCharArray();
        StringBuilder builder = new StringBuilder();
        for (int j = 0; j < charArray.length; ) {
            int cp = Character.codePointAt(charArray, j, charArray.length);
            if (!Character.isWhitespace(cp)) {
                builder.appendCodePoint(cp);
            }
            j += Character.charCount(cp);
        }
        if (builder.length() > 0) {
            String inputValue = builder.toString();
            // Make sure we don't try to add two inputs that vary only by case:
            String seenInputValue;
            if (ignoreCase) {
                // TODO: can we simply use inputValue.toLowerCase(Locale.ROOT)???
                char[] buffer = inputValue.toCharArray();
                CharacterUtils.toLowerCase(buffer, 0, buffer.length);
                // new String(buffer): char[].toString() would return the array's
                // identity, not its contents, breaking the dedup check below
                seenInputValue = new String(buffer);
            } else {
                seenInputValue = inputValue;
            }
            if (seen.contains(seenInputValue) == false) {
                seen.add(seenInputValue);
                String value = TestUtil.randomSimpleString(random());
                map.put(inputValue, value.isEmpty() ? "a" : value);
            }
        }
    }
    if (map.isEmpty()) {
        map.put("booked", "books");
    }
    StemmerOverrideFilter.Builder builder = new StemmerOverrideFilter.Builder(ignoreCase);
    Set<Entry<String, String>> entrySet = map.entrySet();
    StringBuilder input = new StringBuilder();
    List<String> output = new ArrayList<>();
    for (Entry<String, String> entry : entrySet) {
        builder.add(entry.getKey(), entry.getValue());
        if (random().nextBoolean() || output.isEmpty()) {
            input.append(entry.getKey()).append(" ");
            output.add(entry.getValue());
        }
    }
    Tokenizer tokenizer = new WhitespaceTokenizer();
    tokenizer.setReader(new StringReader(input.toString()));
    TokenStream stream = new PorterStemFilter(new StemmerOverrideFilter(tokenizer, builder.build()));
    assertTokenStreamContents(stream, output.toArray(new String[0]));
}
Also used : WhitespaceTokenizer (org.apache.lucene.analysis.core.WhitespaceTokenizer) Tokenizer (org.apache.lucene.analysis.Tokenizer) TokenStream (org.apache.lucene.analysis.TokenStream) KeywordTokenizer (org.apache.lucene.analysis.core.KeywordTokenizer) PorterStemFilter (org.apache.lucene.analysis.en.PorterStemFilter) HashMap (java.util.HashMap) HashSet (java.util.HashSet) ArrayList (java.util.ArrayList) Entry (java.util.Map.Entry) StringReader (java.io.StringReader)
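
Because the test above is randomized, a deterministic sketch may be clearer; the StemmerOverrideDemo class, the mapping, and the input are invented, and it relies on StemmerOverrideFilter marking overridden tokens as keywords so that PorterStemFilter leaves them untouched:

import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.miscellaneous.StemmerOverrideFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class StemmerOverrideDemo {
    public static void main(String[] args) throws Exception {
        StemmerOverrideFilter.Builder builder =
                new StemmerOverrideFilter.Builder(false); // case-sensitive
        builder.add("booked", "books"); // force "booked" -> "books"

        Tokenizer tokenizer = new WhitespaceTokenizer();
        tokenizer.setReader(new StringReader("booked cats"));
        TokenStream stream = new PorterStemFilter(
                new StemmerOverrideFilter(tokenizer, builder.build()));
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        stream.reset();
        while (stream.incrementToken()) {
            // expected: "books" (override wins, Porter skips it), then "cat"
            System.out.println(term.toString());
        }
        stream.end();
        stream.close();
    }
}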

Example 44 with WhitespaceTokenizer

Use of org.apache.lucene.analysis.core.WhitespaceTokenizer in project Anserini by castorini.

From class TRECAnalyzer, method createComponents.

@Override
protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new WhitespaceTokenizer();
    TokenStream filter = new TweetLowerCaseEntityPreservingFilter(source);
    return new TokenStreamComponents(source, filter);
}
Also used : WhitespaceTokenizer (org.apache.lucene.analysis.core.WhitespaceTokenizer) Tokenizer (org.apache.lucene.analysis.Tokenizer) TokenStream (org.apache.lucene.analysis.TokenStream) TweetLowerCaseEntityPreservingFilter (io.anserini.analysis.TweetLowerCaseEntityPreservingFilter)
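
A hedged sketch of how such an analyzer might be exercised; the TRECAnalyzerDemo class, the field name, and the tweet text are invented, and the exact tokens depend on TweetLowerCaseEntityPreservingFilter's rules (judging by its name, lowercasing plain terms while preserving entities such as @mentions and #hashtags):

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TRECAnalyzerDemo {
    public static void main(String[] args) throws IOException {
        // assumes the TRECAnalyzer class shown above is on the classpath
        Analyzer analyzer = new TRECAnalyzer();
        try (TokenStream stream =
                 analyzer.tokenStream("text", "Hello @TwitterDev #Lucene")) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                System.out.println(term.toString());
            }
            stream.end();
        }
        analyzer.close();
    }
}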

Aggregations

WhitespaceTokenizer (org.apache.lucene.analysis.core.WhitespaceTokenizer)44 Tokenizer (org.apache.lucene.analysis.Tokenizer)38 StringReader (java.io.StringReader)37 ESTestCase (org.elasticsearch.test.ESTestCase)25 TokenStream (org.apache.lucene.analysis.TokenStream)16 Settings (org.elasticsearch.common.settings.Settings)8 Analyzer (org.apache.lucene.analysis.Analyzer)4 KeywordTokenizer (org.apache.lucene.analysis.core.KeywordTokenizer)4 CharTermAttribute (org.apache.lucene.analysis.tokenattributes.CharTermAttribute)4 IOException (java.io.IOException)3 HashMap (java.util.HashMap)3 LowerCaseFilter (org.apache.lucene.analysis.core.LowerCaseFilter)3 PorterStemFilter (org.apache.lucene.analysis.en.PorterStemFilter)3 ParseException (java.text.ParseException)2 LowerCaseFilter (org.apache.lucene.analysis.LowerCaseFilter)2 MockTokenizer (org.apache.lucene.analysis.MockTokenizer)2 StopFilter (org.apache.lucene.analysis.StopFilter)2 TokenizerFactory (org.apache.lucene.analysis.util.TokenizerFactory)2 SuggestStopFilter (org.apache.lucene.search.suggest.analyzing.SuggestStopFilter)2 Version (org.elasticsearch.Version)2