Example 1 with MinHashFilter

Use of org.apache.lucene.analysis.minhash.MinHashFilter in project Anserini by castorini.

From the createComponents method of the LexicalLshAnalyzer class.

@Override
protected TokenStreamComponents createComponents(String fieldName) {
    // Base chain: tokenizer -> truncate filter -> feature position filter.
    Tokenizer source = new FeatureVectorsTokenizer();
    TokenFilter truncate = new LexicalLshTruncateTokenFilter(source, decimals);
    TokenFilter featurePos = new LexicalLshFeaturePositionTokenFilter(truncate);
    TokenStream filter;
    if (min > 1) {
        // Build shingles of min..max tokens joined by a space, and emit only
        // shingles (no unigrams, even when no shingle can be formed).
        ShingleFilter shingleFilter = new ShingleFilter(featurePos, min, max);
        shingleFilter.setTokenSeparator(" ");
        shingleFilter.setOutputUnigrams(false);
        shingleFilter.setOutputUnigramsIfNoShingles(false);
        // Last argument (withRotation) is enabled only when there is more than one bucket.
        filter = new MinHashFilter(shingleFilter, hashCount, bucketCount, hashSetSize, bucketCount > 1);
    } else {
        // No shingling: hash the position-tagged tokens directly.
        filter = new MinHashFilter(featurePos, hashCount, bucketCount, hashSetSize, bucketCount > 1);
    }
    // Drop duplicate tokens from the min-hash output.
    return new TokenStreamComponents(source, new RemoveDuplicatesTokenFilter(filter));
}
Also used: RemoveDuplicatesTokenFilter (org.apache.lucene.analysis.miscellaneous.RemoveDuplicatesTokenFilter), TokenStream (org.apache.lucene.analysis.TokenStream), ShingleFilter (org.apache.lucene.analysis.shingle.ShingleFilter), MinHashFilter (org.apache.lucene.analysis.minhash.MinHashFilter), FeatureVectorsTokenizer (io.anserini.ann.FeatureVectorsTokenizer), Tokenizer (org.apache.lucene.analysis.Tokenizer), TokenFilter (org.apache.lucene.analysis.TokenFilter)
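
The analyzer above shingles its tokens and then passes them through a min-hash step. As a rough illustration of that idea only, here is a minimal plain-Java sketch that builds word shingles and keeps, for each of several seeded hash functions, the smallest hash value seen. Everything in it is an assumption made for illustration: the class name MinHashSketch, its helper methods, and the FNV-1a style hash are hypothetical, and it does not reproduce Lucene's hashing, bucketing (bucketCount, hashSetSize), or rotation.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; not part of Lucene or Anserini.
final class MinHashSketch {

    // Build word shingles of exactly `size` tokens, joined by a single space.
    static List<String> shingles(String[] tokens, int size) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + size <= tokens.length; i++) {
            out.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + size)));
        }
        return out;
    }

    // Seeded FNV-1a style 32-bit hash; a simple stand-in for a real hash family.
    static int hash(String s, int seed) {
        int h = 0x811C9DC5 ^ (seed * 0x9E3779B9);
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xFF);
            h *= 0x01000193;
        }
        return h;
    }

    // For each of `hashCount` seeded hashes, keep the minimum value over all shingles.
    static int[] signature(List<String> shingles, int hashCount) {
        int[] sig = new int[hashCount];
        Arrays.fill(sig, Integer.MAX_VALUE);
        for (String s : shingles) {
            for (int i = 0; i < hashCount; i++) {
                sig[i] = Math.min(sig[i], hash(s, i));
            }
        }
        return sig;
    }

    // Fraction of signature positions that agree; estimates Jaccard similarity.
    static double similarity(int[] a, int[] b) {
        int same = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] == b[i]) {
                same++;
            }
        }
        return (double) same / a.length;
    }

    public static void main(String[] args) {
        List<String> x = shingles("the quick brown fox jumps".split(" "), 2);
        List<String> y = shingles("the quick brown dog jumps".split(" "), 2);
        double estimate = similarity(signature(x, 64), signature(y, 64));
        System.out.println("Estimated similarity: " + estimate);
    }
}
```

Two inputs that share many shingles tend to agree in many signature positions, which is why near-duplicate inputs can be matched on their min-hash tokens. Raising the number of hash functions in this sketch gives a steadier estimate at the cost of more hashing work.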
