
Example 6 with ShingleFilter

Use of org.apache.lucene.analysis.shingle.ShingleFilter in project cogcomp-nlp by CogComp.

Class CharacterShingleAnalyzer, method createComponents.

@Override
protected TokenStreamComponents createComponents(String fieldName) {
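    // Project-specific tokenizer that supplies the base tokens.
    final Tokenizer source = new CharacterShingleTokenizer();
    TokenStream result = new StandardFilter(source);
    // Fold non-ASCII characters to ASCII equivalents, then lowercase.
    result = new ASCIIFoldingFilter(result);
    result = new LowerCaseFilter(result);
    // Combine adjacent tokens into shingles of up to 3 tokens.
    result = new ShingleFilter(result, 3);
    return new TokenStreamComponents(source, result);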
}
Also used: TokenStream (org.apache.lucene.analysis.TokenStream), ShingleFilter (org.apache.lucene.analysis.shingle.ShingleFilter), StandardFilter (org.apache.lucene.analysis.standard.StandardFilter), ASCIIFoldingFilter (org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter), CharTokenizer (org.apache.lucene.analysis.util.CharTokenizer), Tokenizer (org.apache.lucene.analysis.Tokenizer), LowerCaseFilter (org.apache.lucene.analysis.core.LowerCaseFilter)

Example 7 with ShingleFilter

Use of org.apache.lucene.analysis.shingle.ShingleFilter in project lucene-solr by apache.

Class FreeTextSuggester, method addShingles.

private Analyzer addShingles(final Analyzer other) {
    if (grams == 1) {
        return other;
    } else {
        // Tack on ShingleFilter to the end, to generate token ngrams:
        return new AnalyzerWrapper(other.getReuseStrategy()) {

            @Override
            protected Analyzer getWrappedAnalyzer(String fieldName) {
                return other;
            }

            @Override
            protected TokenStreamComponents wrapComponents(String fieldName, TokenStreamComponents components) {
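                // Shingle sizes range from 2 tokens up to grams tokens.
                ShingleFilter shingles = new ShingleFilter(components.getTokenStream(), 2, grams);
                // Join the tokens of each shingle with the configured separator character.
                shingles.setTokenSeparator(Character.toString((char) separator));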
                return new TokenStreamComponents(components.getTokenizer(), shingles);
            }
        };
    }
}
Also used: ShingleFilter (org.apache.lucene.analysis.shingle.ShingleFilter), AnalyzerWrapper (org.apache.lucene.analysis.AnalyzerWrapper)
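
To make the effect of the minimum size, maximum size, and token separator in Example 7 concrete, here is a minimal plain-Java sketch of shingle generation. It does not use Lucene and is not Lucene's implementation; the class name ShingleSketch, the method shingles, and the "_" separator are hypothetical names chosen for illustration. The output order (the unigram first, then progressively longer shingles starting at the same token) is assumed to mirror ShingleFilter when unigram output is enabled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: a plain-Java approximation of shingling, not Lucene code.
public class ShingleSketch {

    // Builds shingles of minSize..maxSize tokens starting at every position.
    // Assumes minSize >= 2, so a unigram is never emitted twice.
    static List<String> shingles(List<String> tokens, int minSize, int maxSize,
                                 String separator, boolean outputUnigrams) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            if (outputUnigrams) {
                out.add(tokens.get(start));
            }
            // Stop growing the shingle once it would run past the last token.
            for (int size = minSize; size <= maxSize && start + size <= tokens.size(); size++) {
                out.add(String.join(separator, tokens.subList(start, start + size)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("please", "divide", "this", "sentence");
        // Sizes 2..3 with a hypothetical "_" separator.
        System.out.println(shingles(tokens, 2, 3, "_", true));
    }
}
```

In addShingles the maximum size comes from the grams field and the separator from the separator field, so the equivalent call in this sketch would pass those values instead of 3 and "_".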

Example 8 with ShingleFilter

Use of org.apache.lucene.analysis.shingle.ShingleFilter in project lucene-solr by apache.

Class EdgeNGramTokenFilterTest, method testGraphs.

public void testGraphs() throws IOException {
    TokenStream tk = new LetterTokenizer();
    ((Tokenizer) tk).setReader(new StringReader("abc d efgh ij klmno p q"));
    tk = new ShingleFilter(tk);
    tk = new EdgeNGramTokenFilter(tk, 7, 10);
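    assertTokenStreamContents(tk,
        new String[] { "efgh ij", "ij klmn", "ij klmno", "klmno p" }, // expected terms
        new int[] { 6, 11, 11, 14 },  // start offsets
        new int[] { 13, 19, 19, 21 }, // end offsets
        new int[] { 3, 1, 0, 1 },     // position increments
        new int[] { 2, 2, 2, 2 },     // position lengths
        23);                          // final offset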
}
Also used: TokenStream (org.apache.lucene.analysis.TokenStream), ShingleFilter (org.apache.lucene.analysis.shingle.ShingleFilter), StringReader (java.io.StringReader), LetterTokenizer (org.apache.lucene.analysis.core.LetterTokenizer), WhitespaceTokenizer (org.apache.lucene.analysis.core.WhitespaceTokenizer), Tokenizer (org.apache.lucene.analysis.Tokenizer), MockTokenizer (org.apache.lucene.analysis.MockTokenizer), KeywordTokenizer (org.apache.lucene.analysis.core.KeywordTokenizer)
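
The expected terms in Example 8 can be traced by hand: the default shingle step emits every token plus every two-token shingle, and the edge n-gram step then keeps only prefixes of 7 to 10 characters. The following plain-Java sketch reproduces just the term strings under those assumptions. It does not use Lucene and does not model offsets, position increments, or position lengths; the class EdgeNGramOverShinglesSketch and all of its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: approximates the term output of the chain in testGraphs.
public class EdgeNGramOverShinglesSketch {

    // Split on runs of non-letters, roughly what a letter tokenizer does.
    static List<String> letterTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Each token followed by the two-token shingle that starts at it.
    static List<String> unigramsAndBigrams(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            out.add(tokens.get(i));
            if (i + 1 < tokens.size()) {
                out.add(tokens.get(i) + " " + tokens.get(i + 1));
            }
        }
        return out;
    }

    // Prefixes of length minGram..maxGram; shorter terms produce nothing.
    // Uses char indices, which is sufficient for the ASCII input here.
    static List<String> edgeNGrams(List<String> terms, int minGram, int maxGram) {
        List<String> out = new ArrayList<>();
        for (String term : terms) {
            int longest = Math.min(maxGram, term.length());
            for (int n = minGram; n <= longest; n++) {
                out.add(term.substring(0, n));
            }
        }
        return out;
    }

    static List<String> analyze(String text, int minGram, int maxGram) {
        return edgeNGrams(unigramsAndBigrams(letterTokens(text)), minGram, maxGram);
    }

    public static void main(String[] args) {
        // prints [efgh ij, ij klmn, ij klmno, klmno p]
        System.out.println(analyze("abc d efgh ij klmno p q", 7, 10));
    }
}
```

Only three shingles reach 7 characters ("efgh ij", "ij klmno", "klmno p"), and "ij klmno" is the only one long enough to yield two prefixes, which is why the test expects exactly four terms.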

Aggregations

ShingleFilter (org.apache.lucene.analysis.shingle.ShingleFilter): 8
Tokenizer (org.apache.lucene.analysis.Tokenizer): 6
Analyzer (org.apache.lucene.analysis.Analyzer): 4
TokenStream (org.apache.lucene.analysis.TokenStream): 4
StringReader (java.io.StringReader): 3
HashMap (java.util.HashMap): 3
LowerCaseFilter (org.apache.lucene.analysis.LowerCaseFilter): 3
WhitespaceAnalyzer (org.apache.lucene.analysis.core.WhitespaceAnalyzer): 3
PerFieldAnalyzerWrapper (org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper): 3
StandardTokenizer (org.apache.lucene.analysis.standard.StandardTokenizer): 3
Document (org.apache.lucene.document.Document): 3
Field (org.apache.lucene.document.Field): 3
TextField (org.apache.lucene.document.TextField): 3
DirectoryReader (org.apache.lucene.index.DirectoryReader): 3
IndexWriter (org.apache.lucene.index.IndexWriter): 3
IndexWriterConfig (org.apache.lucene.index.IndexWriterConfig): 3
DirectSpellChecker (org.apache.lucene.search.spell.DirectSpellChecker): 3
RAMDirectory (org.apache.lucene.store.RAMDirectory): 3
BytesRef (org.apache.lucene.util.BytesRef): 3
IOException (java.io.IOException): 2