
Example 16 with LowerCaseFilter

Use of org.apache.lucene.analysis.core.LowerCaseFilter in project jena by apache.

In class LowerCaseKeywordAnalyzer, method createComponents.

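@Override
protected TokenStreamComponents createComponents(String fieldName) {
    // KeywordTokenizer emits the entire input as a single token.
    KeywordTokenizer source = new KeywordTokenizer();
    // Lowercase that single token so matching is case-insensitive.
    LowerCaseFilter filter = new LowerCaseFilter(source);
    return new TokenStreamComponents(source, filter);
}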
Also used : KeywordTokenizer(org.apache.lucene.analysis.core.KeywordTokenizer) LowerCaseFilter(org.apache.lucene.analysis.core.LowerCaseFilter)

Example 17 with LowerCaseFilter

Use of org.apache.lucene.analysis.core.LowerCaseFilter in project nutch by apache.

In class LuceneTokenizer, method createNGramTokenStream.

private TokenStream createNGramTokenStream(String content, int mingram, int maxgram) {
    Tokenizer tokenizer = new StandardTokenizer();
    tokenizer.setReader(new StringReader(content));
    // Lowercase before stemming and shingling so later filters see normalized tokens.
    tokenStream = new LowerCaseFilter(tokenizer);
    tokenStream = applyStemmer(stemFilterType);
    // Emit word shingles of mingram..maxgram tokens and suppress single-token output.
    ShingleFilter shingleFilter = new ShingleFilter(tokenStream, mingram, maxgram);
    shingleFilter.setOutputUnigrams(false);
    // ShingleFilter is already a TokenStream, so no cast is needed.
    tokenStream = shingleFilter;
    return tokenStream;
}
Also used : ShingleFilter(org.apache.lucene.analysis.shingle.ShingleFilter) StandardTokenizer(org.apache.lucene.analysis.standard.StandardTokenizer) StringReader(java.io.StringReader) Tokenizer(org.apache.lucene.analysis.Tokenizer) ClassicTokenizer(org.apache.lucene.analysis.standard.ClassicTokenizer) LowerCaseFilter(org.apache.lucene.analysis.core.LowerCaseFilter)
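
The chain in this example lowercases the tokens and then emits word shingles between mingram and maxgram tokens long, with single tokens suppressed by setOutputUnigrams(false). As a rough, dependency-free illustration of that idea, the shingling can be sketched in plain Java. This is not Lucene's implementation: the class name ShingleSketch is hypothetical, a whitespace split stands in for StandardTokenizer, and the stemming step is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: mimics lowercasing followed by word shingling, without Lucene.
final class ShingleSketch {

    // Returns word n-grams of minSize..maxSize tokens, grouped by start position.
    static List<String> shingles(String content, int minSize, int maxSize) {
        String trimmed = content.trim();
        // Assumption: a whitespace split stands in for a real tokenizer.
        String[] tokens = trimmed.isEmpty()
                ? new String[0]
                : trimmed.toLowerCase(Locale.ROOT).split("\\s+");
        List<String> out = new ArrayList<>();
        for (int start = 0; start < tokens.length; start++) {
            // Grow the shingle from minSize up to maxSize while it still fits.
            for (int size = minSize; size <= maxSize && start + size <= tokens.length; size++) {
                out.add(String.join(" ", Arrays.copyOfRange(tokens, start, start + size)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [the quick, the quick brown, quick brown, quick brown fox, brown fox]
        System.out.println(shingles("The Quick brown Fox", 2, 3));
    }
}
```

Because minSize starts at 2 in the usage above, no single-word entries appear, which is the effect setOutputUnigrams(false) is used for in the original method.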

Example 18 with LowerCaseFilter

Use of org.apache.lucene.analysis.core.LowerCaseFilter in project nutch by apache.

In class LuceneTokenizer, method createTokenStream.

private TokenStream createTokenStream(String content) {
    tokenStream = generateTokenStreamFromText(content, tokenizer);
    tokenStream = new LowerCaseFilter(tokenStream);
    if (stopSet != null) {
        tokenStream = applyStopFilter(stopSet);
    }
    tokenStream = applyStemmer(stemFilterType);
    return tokenStream;
}
Also used : LowerCaseFilter(org.apache.lucene.analysis.core.LowerCaseFilter)

Example 19 with LowerCaseFilter

Use of org.apache.lucene.analysis.core.LowerCaseFilter in project nutch by apache.

In class LuceneAnalyzerUtil, method createComponents.

@Override
protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new ClassicTokenizer();
    TokenStream filter = new LowerCaseFilter(source);
    if (stopSet != null) {
        filter = new StopFilter(filter, stopSet);
    }
    switch(stemFilterType) {
        case PORTERSTEM_FILTER:
            filter = new PorterStemFilter(filter);
            break;
        case ENGLISHMINIMALSTEM_FILTER:
            filter = new EnglishMinimalStemFilter(filter);
            break;
        default:
            break;
    }
    return new TokenStreamComponents(source, filter);
}
Also used : TokenStream(org.apache.lucene.analysis.TokenStream) StopFilter(org.apache.lucene.analysis.core.StopFilter) PorterStemFilter(org.apache.lucene.analysis.en.PorterStemFilter) ClassicTokenizer(org.apache.lucene.analysis.standard.ClassicTokenizer) Tokenizer(org.apache.lucene.analysis.Tokenizer) LowerCaseFilter(org.apache.lucene.analysis.core.LowerCaseFilter) EnglishMinimalStemFilter(org.apache.lucene.analysis.en.EnglishMinimalStemFilter)
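
The order of the filters in this chain matters: lowercasing runs before stop-word removal, so a stop set that holds only lowercase entries still matches capitalized input. A small plain-Java sketch of that ordering follows. It does not use Lucene: the class name StopOrderSketch, the whitespace split, and the sample stop set are illustrative assumptions, and stemming is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: lowercase first, then drop stop words, without Lucene.
final class StopOrderSketch {

    static List<String> analyze(String text, Set<String> stopSet) {
        List<String> out = new ArrayList<>();
        // Assumption: a whitespace split stands in for a real tokenizer.
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // blank input yields one empty token; skip it
            }
            String lower = token.toLowerCase(Locale.ROOT);
            // Like the original method, skip stop-word removal when no set is configured.
            if (stopSet == null || !stopSet.contains(lower)) {
                out.add(lower);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [fox, hound]
        System.out.println(analyze("The Fox and THE hound", Set.of("the", "and")));
    }
}
```

If the stop check ran on the raw tokens instead, "The" and "THE" would pass through this lowercase stop set and only be lowercased afterwards.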

Example 20 with LowerCaseFilter

Use of org.apache.lucene.analysis.core.LowerCaseFilter in project cogcomp-nlp by CogComp.

In class CharacterShingleAnalyzer, method createComponents.

@Override
protected TokenStreamComponents createComponents(String fieldName) {
    final Tokenizer source = new CharacterShingleTokenizer();
    TokenStream result = new StandardFilter(source);
    result = new ASCIIFoldingFilter(result);
    result = new LowerCaseFilter(result);
    result = new ShingleFilter(result, 3);
    return new TokenStreamComponents(source, result);
}
Also used : TokenStream(org.apache.lucene.analysis.TokenStream) ShingleFilter(org.apache.lucene.analysis.shingle.ShingleFilter) StandardFilter(org.apache.lucene.analysis.standard.StandardFilter) ASCIIFoldingFilter(org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter) CharTokenizer(org.apache.lucene.analysis.util.CharTokenizer) Tokenizer(org.apache.lucene.analysis.Tokenizer) LowerCaseFilter(org.apache.lucene.analysis.core.LowerCaseFilter)

Aggregations

LowerCaseFilter (org.apache.lucene.analysis.core.LowerCaseFilter) 22
TokenStream (org.apache.lucene.analysis.TokenStream) 17
Tokenizer (org.apache.lucene.analysis.Tokenizer) 15
StandardTokenizer (org.apache.lucene.analysis.standard.StandardTokenizer) 10
ASCIIFoldingFilter (org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter) 7
StopFilter (org.apache.lucene.analysis.core.StopFilter) 6
WhitespaceTokenizer (org.apache.lucene.analysis.core.WhitespaceTokenizer) 5
StandardFilter (org.apache.lucene.analysis.standard.StandardFilter) 5
KeywordTokenizer (org.apache.lucene.analysis.core.KeywordTokenizer) 4
PorterStemFilter (org.apache.lucene.analysis.en.PorterStemFilter) 3
WordDelimiterFilter (org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter) 3
Analyzer (org.apache.lucene.analysis.Analyzer) 2
TokenStreamComponents (org.apache.lucene.analysis.Analyzer.TokenStreamComponents) 2
EnglishPossessiveFilter (org.apache.lucene.analysis.en.EnglishPossessiveFilter) 2
ShingleFilter (org.apache.lucene.analysis.shingle.ShingleFilter) 2
ClassicTokenizer (org.apache.lucene.analysis.standard.ClassicTokenizer) 2
ElisionFilter (org.apache.lucene.analysis.util.ElisionFilter) 2
IOException (java.io.IOException) 1
Reader (java.io.Reader) 1
StringReader (java.io.StringReader) 1