Example 16 with FeatureExtractors

Use of io.anserini.ltr.feature.FeatureExtractors in project Anserini by castorini.

Class LoadFeatureExtractorFromFileTest, method testMultipleExtractorNoParam.

@Test
public void testMultipleExtractorNoParam() throws Exception {
    String jsonFile = "./resources/MixedFeatureExtractor.txt";
    String docText = "document missing token";
    String queryText = "document test";
    float[] expected = { 0.836985f, 1f };
    FeatureExtractors chain = FeatureExtractors.loadExtractor(jsonFile);
    assertFeatureValues(expected, queryText, docText, chain);
}
Also used: FeatureExtractors (io.anserini.ltr.feature.FeatureExtractors), Test (org.junit.Test)

Example 17 with FeatureExtractors

Use of io.anserini.ltr.feature.FeatureExtractors in project Anserini by castorini.

Class BigramFeaturesTest, method getOrderedChain.

private FeatureExtractors getOrderedChain() {
    FeatureExtractors chain = new FeatureExtractors();
    chain.add(new OrderedSequentialPairsFeatureExtractor(2));
    chain.add(new OrderedSequentialPairsFeatureExtractor(4));
    chain.add(new OrderedSequentialPairsFeatureExtractor(6));
    return chain;
}
Also used: FeatureExtractors (io.anserini.ltr.feature.FeatureExtractors), OrderedSequentialPairsFeatureExtractor (io.anserini.ltr.feature.OrderedSequentialPairsFeatureExtractor)
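
The chain built in getOrderedChain can be pictured as an ordered list of extractors, each contributing one value to the feature vector. The sketch below is not Anserini's implementation: `Extractor`, `Chain`, and `orderedPairsWithin` are hypothetical names, and the feature itself (counting adjacent query-term pairs that occur in the same order in the document within a maximum gap) is only an assumption about what the constructor argument 2, 4, or 6 might control.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ExtractorChainSketch {

    /** Hypothetical stand-in for one feature extractor (not Anserini's interface). */
    interface Extractor {
        float extract(List<String> queryTokens, List<String> docTokens);
    }

    /** Hypothetical chain: one output value per extractor, in insertion order. */
    static final class Chain {
        private final List<Extractor> extractors = new ArrayList<>();

        Chain add(Extractor extractor) {
            extractors.add(extractor);
            return this;
        }

        float[] extractAll(List<String> queryTokens, List<String> docTokens) {
            float[] values = new float[extractors.size()];
            for (int i = 0; i < values.length; i++) {
                values[i] = extractors.get(i).extract(queryTokens, docTokens);
            }
            return values;
        }
    }

    /**
     * Assumed feature: for each adjacent query pair (a, b), count document
     * positions i < j with doc[i] = a, doc[j] = b and j - i <= maxGap.
     */
    static Extractor orderedPairsWithin(int maxGap) {
        return (query, doc) -> {
            int count = 0;
            for (int k = 0; k + 1 < query.size(); k++) {
                String first = query.get(k);
                String second = query.get(k + 1);
                for (int i = 0; i < doc.size(); i++) {
                    if (!doc.get(i).equals(first)) {
                        continue;
                    }
                    // Look ahead at most maxGap positions for the second term.
                    int last = Math.min(doc.size() - 1, i + maxGap);
                    for (int j = i + 1; j <= last; j++) {
                        if (doc.get(j).equals(second)) {
                            count++;
                        }
                    }
                }
            }
            return count;
        };
    }

    /** Builds a three-extractor chain with gaps 2, 4, 6 and runs it on a toy input. */
    static float[] demo() {
        Chain chain = new Chain()
            .add(orderedPairsWithin(2))
            .add(orderedPairsWithin(4))
            .add(orderedPairsWithin(6));
        return chain.extractAll(
            Arrays.asList("a", "b"),
            Arrays.asList("a", "x", "b", "a", "b"));
    }
}
```

Because the chain preserves insertion order, position i of the returned array always belongs to the i-th extractor added, which is what lets a test compare the output against a fixed expected array.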

Example 18 with FeatureExtractors

Use of io.anserini.ltr.feature.FeatureExtractors in project Anserini by castorini.

Class DumpTweetsLtrData, method main.

public static void main(String[] argv) throws Exception {
    long curTime = System.nanoTime();
    LtrArgs args = new LtrArgs();
    CmdLineParser parser = new CmdLineParser(args, ParserProperties.defaults().withUsageWidth(90));
    try {
        parser.parseArgument(argv);
    } catch (CmdLineException e) {
        System.err.println(e.getMessage());
        parser.printUsage(System.err);
        System.err.println("Example: DumpTweetsLtrData" + parser.printExample(OptionHandlerFilter.REQUIRED));
        return;
    }
    LOG.info("Reading index at " + args.index);
    Directory dir = FSDirectory.open(Paths.get(args.index));
    IndexReader reader = DirectoryReader.open(dir);
    IndexSearcher searcher = new IndexSearcher(reader);
    if (args.ql) {
        LOG.info("Using QL scoring model");
        searcher.setSimilarity(new LMDirichletSimilarity(args.mu));
    } else if (args.bm25) {
        LOG.info("Using BM25 scoring model");
        searcher.setSimilarity(new BM25Similarity(args.k1, args.b));
    } else {
        LOG.error("Error: Must specify scoring model!");
        System.exit(-1);
    }
    Qrels qrels = new Qrels(args.qrels);
    FeatureExtractors extractors = null;
    if (args.extractors != null) {
        extractors = FeatureExtractors.loadExtractor(args.extractors);
    }
    PrintStream out = new PrintStream(new FileOutputStream(new File(args.output)));
    RerankerCascade cascade = new RerankerCascade();
    cascade.add(new RemoveRetweetsTemporalTiebreakReranker());
    cascade.add(new TweetsLtrDataGenerator(out, qrels, extractors));
    MicroblogTopicSet topics = MicroblogTopicSet.fromFile(new File(args.topics));
    LOG.info("Initialization complete! (elapsed time = " + (System.nanoTime() - curTime) / 1000000 + "ms)");
    long totalTime = 0;
    int cnt = 0;
    for (MicroblogTopic topic : topics) {
        long curQueryTime = System.nanoTime();
        Query filter = LongPoint.newRangeQuery(StatusField.ID.name, 0L, topic.getQueryTweetTime());
        Query query = AnalyzerUtils.buildBagOfWordsQuery(StatusField.TEXT.name, IndexTweets.ANALYZER, topic.getQuery());
        BooleanQuery.Builder builder = new BooleanQuery.Builder();
        builder.add(filter, BooleanClause.Occur.FILTER);
        builder.add(query, BooleanClause.Occur.MUST);
        Query q = builder.build();
        TopDocs rs = searcher.search(q, args.hits);
        List<String> queryTokens = AnalyzerUtils.tokenize(IndexTweets.ANALYZER, topic.getQuery());
        RerankerContext context = new RerankerContext(searcher, query, topic.getId(), topic.getQuery(), queryTokens, StatusField.TEXT.name, filter);
        cascade.run(ScoredDocuments.fromTopDocs(rs, searcher), context);
        long qtime = (System.nanoTime() - curQueryTime) / 1000000;
        LOG.info("Query " + topic.getId() + " (elapsed time = " + qtime + "ms)");
        totalTime += qtime;
        cnt++;
    }
    LOG.info("All queries completed!");
    LOG.info("Total elapsed time = " + totalTime + "ms");
    // Guard against an empty topic set, which would otherwise divide by zero.
    if (cnt > 0) {
        LOG.info("Average query latency = " + (totalTime / cnt) + "ms");
    }
    reader.close();
    out.close();
}
Also used: RemoveRetweetsTemporalTiebreakReranker (io.anserini.rerank.twitter.RemoveRetweetsTemporalTiebreakReranker), RerankerCascade (io.anserini.rerank.RerankerCascade), MicroblogTopicSet (io.anserini.search.MicroblogTopicSet), Directory (org.apache.lucene.store.Directory), FSDirectory (org.apache.lucene.store.FSDirectory), Qrels (io.anserini.util.Qrels), PrintStream (java.io.PrintStream), LongPoint (org.apache.lucene.document.LongPoint), FeatureExtractors (io.anserini.ltr.feature.FeatureExtractors), FileOutputStream (java.io.FileOutputStream), IndexReader (org.apache.lucene.index.IndexReader), BM25Similarity (org.apache.lucene.search.similarities.BM25Similarity), MicroblogTopic (io.anserini.search.MicroblogTopic), LMDirichletSimilarity (org.apache.lucene.search.similarities.LMDirichletSimilarity), File (java.io.File), RerankerContext (io.anserini.rerank.RerankerContext)
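
The main method above times each query with System.nanoTime(), truncates the difference to whole milliseconds, and reports a total and an average. A minimal sketch of that bookkeeping follows; `LatencySketch` and its methods are hypothetical names, not part of Anserini. It also returns 0 for the average when no query has been recorded, since integer division by a zero count would throw ArithmeticException.

```java
final class LatencySketch {
    private long totalMillis = 0;
    private int queries = 0;

    /** Records one query from two System.nanoTime() readings, truncated to milliseconds. */
    void record(long startNanos, long endNanos) {
        totalMillis += (endNanos - startNanos) / 1_000_000;
        queries++;
    }

    /** Sum of the recorded per-query latencies in milliseconds. */
    long totalMillis() {
        return totalMillis;
    }

    /** Mean latency in whole milliseconds; 0 when no query has been recorded. */
    long averageMillis() {
        return queries == 0 ? 0 : totalMillis / queries;
    }
}
```

In a loop like the one in main, `record(curQueryTime, System.nanoTime())` would replace the manual `totalTime` and `cnt` updates.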

Aggregations

FeatureExtractors (io.anserini.ltr.feature.FeatureExtractors): 18
Test (org.junit.Test): 6
JsonObject (com.google.gson.JsonObject): 5
RerankerContext (io.anserini.rerank.RerankerContext): 4
Qrels (io.anserini.util.Qrels): 4
PrintStream (java.io.PrintStream): 4
Directory (org.apache.lucene.store.Directory): 4
FSDirectory (org.apache.lucene.store.FSDirectory): 4
RerankerCascade (io.anserini.rerank.RerankerCascade): 3
File (java.io.File): 3
FileOutputStream (java.io.FileOutputStream): 3
IndexReader (org.apache.lucene.index.IndexReader): 3
BM25Similarity (org.apache.lucene.search.similarities.BM25Similarity): 3
LMDirichletSimilarity (org.apache.lucene.search.similarities.LMDirichletSimilarity): 3
CmdLineException (org.kohsuke.args4j.CmdLineException): 3
CmdLineParser (org.kohsuke.args4j.CmdLineParser): 3
OrderedSequentialPairsFeatureExtractor (io.anserini.ltr.feature.OrderedSequentialPairsFeatureExtractor): 2
UnorderedSequentialPairsFeatureExtractor (io.anserini.ltr.feature.UnorderedSequentialPairsFeatureExtractor): 2
Rm3Reranker (io.anserini.rerank.rm3.Rm3Reranker): 2
RemoveRetweetsTemporalTiebreakReranker (io.anserini.rerank.twitter.RemoveRetweetsTemporalTiebreakReranker): 2