
Example 21 with LowerCaseFilter

Use of org.apache.lucene.analysis.core.LowerCaseFilter in the project cogcomp-nlp by CogComp.

From the class WikiURLAnalyzer, method createComponents:

@Override
protected TokenStreamComponents createComponents(final String fieldName) {
    // KeywordTokenizer emits the entire input (here a URL) as a single token.
    final Tokenizer source = new KeywordTokenizer();
    TokenStream result = new StandardFilter(source);
    // CharacterFilter is a project-specific filter defined in cogcomp-nlp.
    result = new CharacterFilter(result);
    // Fold accented and other non-ASCII characters to ASCII equivalents.
    result = new ASCIIFoldingFilter(result);
    // Lowercase the token text.
    result = new LowerCaseFilter(result);
    return new TokenStreamComponents(source, result);
}
Also used: TokenStream (org.apache.lucene.analysis.TokenStream), StandardFilter (org.apache.lucene.analysis.standard.StandardFilter), ASCIIFoldingFilter (org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter), KeywordTokenizer (org.apache.lucene.analysis.core.KeywordTokenizer), Tokenizer (org.apache.lucene.analysis.Tokenizer), LowerCaseFilter (org.apache.lucene.analysis.core.LowerCaseFilter)
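Outside Lucene, the net effect of the last two stages of this chain (ASCIIFoldingFilter followed by LowerCaseFilter, applied to the single keyword token) can be approximated with the JDK's java.text.Normalizer. This is only a rough sketch for Latin letters with combining diacritics — the real ASCIIFoldingFilter handles many more character mappings:

```java
import java.text.Normalizer;
import java.util.Locale;

public class FoldLower {
    // Approximates ASCIIFoldingFilter + LowerCaseFilter for the common
    // case of accented Latin letters.
    static String foldAndLower(String input) {
        // NFD decomposes accented letters into base letter + combining mark...
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // ...then strip all combining marks (Unicode category M).
        String folded = decomposed.replaceAll("\\p{M}+", "");
        return folded.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(foldAndLower("Café_Müller"));  // cafe_muller
    }
}
```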

Example 22 with LowerCaseFilter

Use of org.apache.lucene.analysis.core.LowerCaseFilter in the project vertigo by KleeGroup.

From the class DefaultAnalyzer, method createComponents:

/**
 * Creates a TokenStream which tokenizes all the text in the provided Reader.
 *
 * @return A TokenStream built from a StandardTokenizer filtered with
 *         ElisionFilter, StopFilter, ASCIIFoldingFilter and LowerCaseFilter
 */
@Override
protected TokenStreamComponents createComponents(final String fieldName) {
    /* initialize the tokenizer */
    final Tokenizer source = new StandardTokenizer();
    /* strip elisions */
    final CharArraySet elisionSet = new CharArraySet(Arrays.asList(LuceneConstants.ELISION_ARTICLES), true);
    TokenStream filter = new ElisionFilter(source, elisionSet);
    /* remove stop words (articles, adjectives) */
    filter = new StopFilter(filter, stopWords);
    /* fold accents to plain ASCII */
    filter = new ASCIIFoldingFilter(filter);
    /* lowercase the tokens */
    filter = new LowerCaseFilter(filter);
    return new TokenStreamComponents(source, filter);
}
Also used: CharArraySet (org.apache.lucene.analysis.CharArraySet), TokenStream (org.apache.lucene.analysis.TokenStream), ElisionFilter (org.apache.lucene.analysis.util.ElisionFilter), StandardTokenizer (org.apache.lucene.analysis.standard.StandardTokenizer), StopFilter (org.apache.lucene.analysis.StopFilter), ASCIIFoldingFilter (org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter), Tokenizer (org.apache.lucene.analysis.Tokenizer), LowerCaseFilter (org.apache.lucene.analysis.core.LowerCaseFilter)
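The ElisionFilter step has no JDK equivalent, but its effect on French text can be approximated with a regular expression over a typical elision list. This is an illustrative sketch only: the real filter works token-by-token against the configured CharArraySet (here the project-specific LuceneConstants.ELISION_ARTICLES, whose contents are not shown, so the article list below is an assumption):

```java
import java.util.regex.Pattern;

public class ElisionSketch {
    // A commonly used French elision list (hypothetical here; the actual
    // LuceneConstants.ELISION_ARTICLES in vertigo may differ).
    private static final Pattern ELISION =
            Pattern.compile("\\b(?:l|d|c|j|m|n|s|t|qu)'", Pattern.CASE_INSENSITIVE);

    // Drop elided articles so "l'avion" normalizes to "avion".
    static String stripElisions(String text) {
        return ELISION.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripElisions("L'analyse de l'avion"));  // analyse de avion
    }
}
```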

Aggregations

- LowerCaseFilter (org.apache.lucene.analysis.core.LowerCaseFilter): 22
- TokenStream (org.apache.lucene.analysis.TokenStream): 17
- Tokenizer (org.apache.lucene.analysis.Tokenizer): 15
- StandardTokenizer (org.apache.lucene.analysis.standard.StandardTokenizer): 10
- ASCIIFoldingFilter (org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter): 7
- StopFilter (org.apache.lucene.analysis.core.StopFilter): 6
- WhitespaceTokenizer (org.apache.lucene.analysis.core.WhitespaceTokenizer): 5
- StandardFilter (org.apache.lucene.analysis.standard.StandardFilter): 5
- KeywordTokenizer (org.apache.lucene.analysis.core.KeywordTokenizer): 4
- PorterStemFilter (org.apache.lucene.analysis.en.PorterStemFilter): 3
- WordDelimiterFilter (org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter): 3
- Analyzer (org.apache.lucene.analysis.Analyzer): 2
- TokenStreamComponents (org.apache.lucene.analysis.Analyzer.TokenStreamComponents): 2
- EnglishPossessiveFilter (org.apache.lucene.analysis.en.EnglishPossessiveFilter): 2
- ShingleFilter (org.apache.lucene.analysis.shingle.ShingleFilter): 2
- ClassicTokenizer (org.apache.lucene.analysis.standard.ClassicTokenizer): 2
- ElisionFilter (org.apache.lucene.analysis.util.ElisionFilter): 2
- IOException (java.io.IOException): 1
- Reader (java.io.Reader): 1
- StringReader (java.io.StringReader): 1