Example 1 with RerankerContext

Use of io.anserini.rerank.RerankerContext in project Anserini by castorini.

From the class BaseFeatureExtractor, method buildRerankerContextMap.

// Build all the reranker contexts because they will be reused once per query
private Map<String, RerankerContext> buildRerankerContextMap() throws IOException {
    Map<String, RerankerContext> queryContextMap = new HashMap<>();
    IndexSearcher searcher = new IndexSearcher(reader);
    for (String qid : qrels.getQids()) {
        // Construct the reranker context
        LOG.debug(String.format("Constructing context for QID: %s", qid));
        String queryText = topics.get(qid);
        // We will not be checking for nulls here because the input should be correct,
        // and if not, it signals other issues
        Query q = parseQuery(queryText);
        List<String> queryTokens = AnalyzerUtils.tokenize(queryAnalyzer, queryText);
        RerankerContext context = new RerankerContext(searcher, q, qid, queryText, queryTokens, getTermVectorField(), null);
        queryContextMap.put(qid, context);
    }
    LOG.debug("Completed constructing context for all qrels");
    return queryContextMap;
}
Also used: IndexSearcher(org.apache.lucene.search.IndexSearcher) Query(org.apache.lucene.search.Query) RerankerContext(io.anserini.rerank.RerankerContext)
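
For readers who want to see what the tokenize helper does, here is a minimal standalone sketch of analyzer-based tokenization with plain Lucene, along the lines of what AnalyzerUtils.tokenize presumably does internally (the class name, the field name "contents", and the sample text are illustrative, not taken from Anserini):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenizeSketch {
    // Run the full analysis chain (tokenization, stopwords, stemming) and
    // collect the resulting terms in order.
    public static List<String> tokenize(Analyzer analyzer, String text) throws IOException {
        List<String> tokens = new ArrayList<>();
        try (TokenStream stream = analyzer.tokenStream("contents", text)) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                tokens.add(term.toString());
            }
            stream.end();
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        // Prints [rerank, tweet, rm3]: stopwords removed, terms stemmed and lowercased.
        System.out.println(tokenize(new EnglishAnalyzer(), "Reranking tweets with RM3"));
    }
}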

Example 2 with RerankerContext

Use of io.anserini.rerank.RerankerContext in project Anserini by castorini.

From the class BaseFeatureExtractor, method printFeatureForAllDocs.

/**
 * Iterates through all the documents and prints the features for each of the queries.
 * This way we avoid iterating over the entire index once per query, saving disk access.
 * @param out the stream to print feature vectors to
 * @throws IOException
 */
public void printFeatureForAllDocs(PrintStream out) throws IOException {
    Map<String, RerankerContext> queryContextMap = buildRerankerContextMap();
    FeatureExtractors extractors = getExtractors();
    Bits liveDocs = MultiFields.getLiveDocs(reader);
    Set<String> fieldsToLoad = getFieldsToLoad();
    this.printHeader(out, extractors);
    for (int docId = 0; docId < reader.maxDoc(); docId++) {
        // Only check live docs if we have some
        if (reader.hasDeletions() && (liveDocs == null || !liveDocs.get(docId))) {
            LOG.warn(String.format("Document %d not in live docs", docId));
            continue;
        }
        Document doc = reader.document(docId, fieldsToLoad);
        String docIdString = doc.get(getIdField());
        // NOTE: doc frequencies should not be retrieved from a per-document term vector
        // (reader.getTermVector(docId, getTermVectorField())); the Terms it returns
        // behave as if the document were a single-document index.
        Terms terms = MultiFields.getTerms(reader, getTermVectorField());
        if (terms == null) {
            continue;
        }
        for (Map.Entry<String, RerankerContext> entry : queryContextMap.entrySet()) {
            float[] featureValues = extractors.extractAll(doc, terms, entry.getValue());
            writeFeatureVector(out, entry.getKey(), qrels.getRelevanceGrade(entry.getKey(), docIdString), docIdString, featureValues);
        }
        out.flush();
        LOG.debug(String.format("Completed computing feature vectors for doc %d", docId));
    }
}
Also used: FeatureExtractors(io.anserini.ltr.feature.FeatureExtractors) Terms(org.apache.lucene.index.Terms) Bits(org.apache.lucene.util.Bits) Document(org.apache.lucene.document.Document) RerankerContext(io.anserini.rerank.RerankerContext)
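
The NOTE in the snippet above is worth illustrating: a per-document term vector and index-wide statistics answer different questions. Here is a minimal sketch of the contrast (the field name "contents" and sample term "tweet" are illustrative):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

public class TermStatsSketch {
    public static void show(IndexReader reader, int docId) throws IOException {
        // Per-document view: a term vector behaves like a one-document index,
        // so frequencies here are counts within this document only.
        Terms docTerms = reader.getTermVector(docId, "contents");
        if (docTerms != null) {
            TermsEnum it = docTerms.iterator();
            BytesRef term;
            while ((term = it.next()) != null) {
                System.out.printf("%s tf=%d%n", term.utf8ToString(), it.totalTermFreq());
            }
        }
        // Index-wide view: collection statistics must come from the reader itself.
        Term t = new Term("contents", "tweet");
        System.out.printf("docFreq=%d totalTermFreq=%d%n",
                reader.docFreq(t), reader.totalTermFreq(t));
    }
}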

Example 3 with RerankerContext

Use of io.anserini.rerank.RerankerContext in project Anserini by castorini.

From the class BaseFeatureExtractor, method printFeatures.

/**
 * Prints feature vectors with respect to the qrels, one vector per qrel entry
 * @param out the stream to print feature vectors to
 * @throws IOException
 */
public void printFeatures(PrintStream out) throws IOException {
    Map<String, RerankerContext> queryContextMap = buildRerankerContextMap();
    FeatureExtractors extractors = getExtractors();
    Bits liveDocs = MultiFields.getLiveDocs(reader);
    Set<String> fieldsToLoad = getFieldsToLoad();
    // We need to open a searcher
    IndexSearcher searcher = new IndexSearcher(reader);
    this.printHeader(out, extractors);
    // Iterate through all the qrels and for each document id we have for them
    LOG.debug("Processing queries");
    for (String qid : this.qrels.getQids()) {
        LOG.debug(String.format("Processing qid: %s", qid));
        // Get the map of documents
        RerankerContext context = queryContextMap.get(qid);
        for (Map.Entry<String, Integer> entry : this.qrels.getDocMap(qid).entrySet()) {
            String docId = entry.getKey();
            int qrelScore = entry.getValue();
            // We issue a specific query
            TopDocs topDocs = searcher.search(docIdQuery(docId), 1);
            if (topDocs.totalHits == 0) {
                LOG.warn(String.format("Document Id %s expected but not found in index, skipping...", docId));
                continue;
            }
            ScoreDoc hit = topDocs.scoreDocs[0];
            Document doc = reader.document(hit.doc, fieldsToLoad);
            // TODO factor for test
            Terms terms = reader.getTermVector(hit.doc, getTermVectorField());
            if (terms == null) {
                LOG.debug(String.format("No term vectors found for doc %s, qid %s", docId, qid));
                continue;
            }
            float[] featureValues = extractors.extractAll(doc, terms, context);
            writeFeatureVector(out, qid, qrelScore, docId, featureValues);
        }
        LOG.debug(String.format("Finished processing for qid: %s", qid));
        out.flush();
    }
}
Also used: IndexSearcher(org.apache.lucene.search.IndexSearcher) Terms(org.apache.lucene.index.Terms) Document(org.apache.lucene.document.Document) ScoreDoc(org.apache.lucene.search.ScoreDoc) TopDocs(org.apache.lucene.search.TopDocs) FeatureExtractors(io.anserini.ltr.feature.FeatureExtractors) Bits(org.apache.lucene.util.Bits) RerankerContext(io.anserini.rerank.RerankerContext)
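
The helper docIdQuery(docId) used above is defined elsewhere in BaseFeatureExtractor and is not shown; presumably it matches a single document by its external id. A minimal sketch under that assumption (the field name "id" and both method names are illustrative):

import java.io.IOException;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

public class DocIdLookupSketch {
    // Match exactly one document by its external (collection) id.
    static Query docIdQuery(String externalId) {
        return new TermQuery(new Term("id", externalId));
    }

    // Resolve an external id to Lucene's internal document number, or -1 if absent.
    static int findInternalDocId(IndexSearcher searcher, String externalId) throws IOException {
        TopDocs hits = searcher.search(docIdQuery(externalId), 1);
        return hits.scoreDocs.length > 0 ? hits.scoreDocs[0].doc : -1;
    }
}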

Example 4 with RerankerContext

Use of io.anserini.rerank.RerankerContext in project Anserini by castorini.

From the class SearchTweets, method main.

public static void main(String[] args) throws Exception {
    long initializationTime = System.currentTimeMillis();
    SearchArgs searchArgs = new SearchArgs();
    CmdLineParser parser = new CmdLineParser(searchArgs, ParserProperties.defaults().withUsageWidth(90));
    try {
        parser.parseArgument(args);
    } catch (CmdLineException e) {
        System.err.println(e.getMessage());
        parser.printUsage(System.err);
        System.err.println("Example: SearchTweets" + parser.printExample(OptionHandlerFilter.REQUIRED));
        return;
    }
    LOG.info("Reading index at " + searchArgs.index);
    Directory dir;
    if (searchArgs.inmem) {
        LOG.info("Using MMapDirectory with preload");
        dir = new MMapDirectory(Paths.get(searchArgs.index));
        ((MMapDirectory) dir).setPreload(true);
    } else {
        LOG.info("Using default FSDirectory");
        dir = FSDirectory.open(Paths.get(searchArgs.index));
    }
    IndexReader reader = DirectoryReader.open(dir);
    IndexSearcher searcher = new IndexSearcher(reader);
    if (searchArgs.ql) {
        LOG.info("Using QL scoring model");
        searcher.setSimilarity(new LMDirichletSimilarity(searchArgs.mu));
    } else if (searchArgs.bm25) {
        LOG.info("Using BM25 scoring model");
        searcher.setSimilarity(new BM25Similarity(searchArgs.k1, searchArgs.b));
    } else {
        LOG.error("Error: Must specify scoring model!");
        System.exit(-1);
    }
    RerankerCascade cascade = new RerankerCascade();
    EnglishAnalyzer englishAnalyzer = new EnglishAnalyzer();
    if (searchArgs.rm3) {
        cascade.add(new Rm3Reranker(englishAnalyzer, FIELD_BODY, "src/main/resources/io/anserini/rerank/rm3/rm3-stoplist.twitter.txt"));
        cascade.add(new RemoveRetweetsTemporalTiebreakReranker());
    } else {
        cascade.add(new RemoveRetweetsTemporalTiebreakReranker());
    }
    if (!searchArgs.model.isEmpty() && searchArgs.extractors != null) {
        LOG.debug(String.format("Ranklib model used, modeled loaded from %s", searchArgs.model));
        cascade.add(new RankLibReranker(searchArgs.model, FIELD_BODY, searchArgs.extractors));
    }
    FeatureExtractors extractorChain = null;
    if (searchArgs.extractors != null) {
        extractorChain = FeatureExtractors.loadExtractor(searchArgs.extractors);
    }
    if (searchArgs.dumpFeatures) {
        PrintStream out = new PrintStream(searchArgs.featureFile);
        Qrels qrels = new Qrels(searchArgs.qrels);
        cascade.add(new TweetsLtrDataGenerator(out, qrels, extractorChain));
    }
    MicroblogTopicSet topics = MicroblogTopicSet.fromFile(new File(searchArgs.topics));
    PrintStream out = new PrintStream(new FileOutputStream(new File(searchArgs.output)));
    LOG.info("Writing output to " + searchArgs.output);
    LOG.info("Initialized complete! (elapsed time = " + (System.currentTimeMillis() - initializationTime) + "ms)");
    long totalTime = 0;
    int cnt = 0;
    for (MicroblogTopic topic : topics) {
        long curQueryTime = System.currentTimeMillis();
        // Do not consider tweets with ids beyond the queryTweetTime: the
        // <querytweettime> tag contains the timestamp of the query in terms of the
        // chronologically nearest tweet id within the corpus.
        Query filter = TermRangeQuery.newStringRange(FIELD_ID, "0", String.valueOf(topic.getQueryTweetTime()), true, true);
        Query query = AnalyzerUtils.buildBagOfWordsQuery(FIELD_BODY, englishAnalyzer, topic.getQuery());
        BooleanQuery.Builder builder = new BooleanQuery.Builder();
        builder.add(filter, BooleanClause.Occur.FILTER);
        builder.add(query, BooleanClause.Occur.MUST);
        Query q = builder.build();
        TopDocs rs = searcher.search(q, searchArgs.hits);
        List<String> queryTokens = AnalyzerUtils.tokenize(englishAnalyzer, topic.getQuery());
        RerankerContext context = new RerankerContext(searcher, query, topic.getId(), topic.getQuery(), queryTokens, FIELD_BODY, filter);
        ScoredDocuments docs = cascade.run(ScoredDocuments.fromTopDocs(rs, searcher), context);
        long queryTime = (System.currentTimeMillis() - curQueryTime);
        for (int i = 0; i < docs.documents.length; i++) {
            String qid = topic.getId().replaceFirst("^MB0*", "");
            out.println(String.format("%s Q0 %s %d %f %s", qid, docs.documents[i].getField(FIELD_ID).stringValue(), (i + 1), docs.scores[i], searchArgs.runtag));
        }
        LOG.info("Query " + topic.getId() + " (elapsed time = " + queryTime + "ms)");
        totalTime += queryTime;
        cnt++;
    }
    LOG.info("All queries completed!");
    LOG.info("Total elapsed time = " + totalTime + "ms");
    LOG.info("Average query latency = " + (totalTime / cnt) + "ms");
    reader.close();
    out.close();
}
Also used: RemoveRetweetsTemporalTiebreakReranker(io.anserini.rerank.twitter.RemoveRetweetsTemporalTiebreakReranker) ScoredDocuments(io.anserini.rerank.ScoredDocuments) RerankerCascade(io.anserini.rerank.RerankerCascade) Rm3Reranker(io.anserini.rerank.rm3.Rm3Reranker) RankLibReranker(io.anserini.rerank.RankLibReranker) MMapDirectory(org.apache.lucene.store.MMapDirectory) Directory(org.apache.lucene.store.Directory) FSDirectory(org.apache.lucene.store.FSDirectory) PrintStream(java.io.PrintStream) Qrels(io.anserini.util.Qrels) CmdLineParser(org.kohsuke.args4j.CmdLineParser) EnglishAnalyzer(org.apache.lucene.analysis.en.EnglishAnalyzer) MMapDirectory(org.apache.lucene.store.MMapDirectory) FeatureExtractors(io.anserini.ltr.feature.FeatureExtractors) TweetsLtrDataGenerator(io.anserini.ltr.TweetsLtrDataGenerator) FileOutputStream(java.io.FileOutputStream) IndexReader(org.apache.lucene.index.IndexReader) BM25Similarity(org.apache.lucene.search.similarities.BM25Similarity) LMDirichletSimilarity(org.apache.lucene.search.similarities.LMDirichletSimilarity) File(java.io.File) CmdLineException(org.kohsuke.args4j.CmdLineException) RerankerContext(io.anserini.rerank.RerankerContext)
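
The out.println inside the result loop emits one line per hit in the standard TREC run format (qid Q0 docid rank score runtag), which trec_eval and similar tools consume. A minimal sketch of a dedicated helper for this (class and method names are illustrative):

import java.io.PrintStream;

public class TrecRunWriter {
    // Ranks are 1-based; "Q0" is a literal field required by the format.
    public static void writeLine(PrintStream out, String qid, String docid,
                                 int rank, float score, String runtag) {
        out.println(String.format("%s Q0 %s %d %f %s", qid, docid, rank, score, runtag));
    }
}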

Example 5 with RerankerContext

Use of io.anserini.rerank.RerankerContext in project Anserini by castorini.

From the class SearchCollection, method searchBackgroundLinking.

public <K> ScoredDocuments searchBackgroundLinking(IndexSearcher searcher, K qid, String docid, RerankerCascade cascade) throws IOException {
    // Extract a list of analyzed terms from the document to compose a query.
    List<String> terms = BackgroundLinkingTopicReader.extractTerms(reader, docid, args.backgroundlinking_k, analyzer);
    // Since the terms are already analyzed, we just join them together and use the StandardQueryParser.
    Query docQuery;
    try {
        docQuery = new StandardQueryParser().parse(StringUtils.join(terms, " "), IndexArgs.CONTENTS);
    } catch (QueryNodeException e) {
        throw new RuntimeException("Unable to create a Lucene query from the terms extracted from the query document!");
    }
    // Per track guidelines, no opinion or editorials. Filter out articles of these types.
    Query filter = new TermInSetQuery(WashingtonPostGenerator.WashingtonPostField.KICKER.name, new BytesRef("Opinions"), new BytesRef("Letters to the Editor"), new BytesRef("The Post's View"));
    BooleanQuery.Builder builder = new BooleanQuery.Builder();
    builder.add(filter, BooleanClause.Occur.MUST_NOT);
    builder.add(docQuery, BooleanClause.Occur.MUST);
    Query query = builder.build();
    // Search using constructed query.
    TopDocs rs;
    if (args.arbitraryScoreTieBreak) {
        rs = searcher.search(query, (isRerank && args.rf_qrels == null) ? args.rerankcutoff : args.hits);
    } else {
        rs = searcher.search(query, (isRerank && args.rf_qrels == null) ? args.rerankcutoff : args.hits, BREAK_SCORE_TIES_BY_DOCID, true);
    }
    // Join the analyzed terms back into a display string for the context.
    RerankerContext context = new RerankerContext<>(searcher, qid, query, docid, StringUtils.join(terms, ", "), terms, null, args);
    // Run the existing cascade.
    ScoredDocuments docs = cascade.run(ScoredDocuments.fromTopDocs(rs, searcher), context);
    // Perform post-processing (e.g., date filtering, deduplication) as a final step.
    return new NewsBackgroundLinkingReranker().rerank(docs, context);
}
Also used: QueryNodeException(org.apache.lucene.queryparser.flexible.core.QueryNodeException) NewsBackgroundLinkingReranker(io.anserini.rerank.lib.NewsBackgroundLinkingReranker) BooleanQuery(org.apache.lucene.search.BooleanQuery) Query(org.apache.lucene.search.Query) TermInSetQuery(org.apache.lucene.search.TermInSetQuery) BooleanQuery(org.apache.lucene.search.BooleanQuery) ScoredDocuments(io.anserini.rerank.ScoredDocuments) TopDocs(org.apache.lucene.search.TopDocs) TermInSetQuery(org.apache.lucene.search.TermInSetQuery) StandardQueryParser(org.apache.lucene.queryparser.flexible.standard.StandardQueryParser) BytesRef(org.apache.lucene.util.BytesRef) RerankerContext(io.anserini.rerank.RerankerContext)
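
BREAK_SCORE_TIES_BY_DOCID is defined elsewhere in SearchCollection and is not shown above. A minimal sketch of what such a Sort presumably looks like, assuming the external docid is indexed as a sorted doc-values string field named "id" (both are assumptions for illustration, not Anserini's exact definition):

import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

public class TieBreakSortSketch {
    // Score descending first (FIELD_SCORE), then the external docid ascending,
    // so that equal-scoring documents always come back in the same order.
    // Used as in the snippet above: searcher.search(query, n, BREAK_SCORE_TIES_BY_DOCID, true)
    static final Sort BREAK_SCORE_TIES_BY_DOCID = new Sort(
            SortField.FIELD_SCORE,
            new SortField("id", SortField.Type.STRING, false));
}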

Aggregations

RerankerContext (io.anserini.rerank.RerankerContext): 15 usages
ScoredDocuments (io.anserini.rerank.ScoredDocuments): 9 usages
TopDocs (org.apache.lucene.search.TopDocs): 9 usages
IndexSearcher (org.apache.lucene.search.IndexSearcher): 8 usages
Query (org.apache.lucene.search.Query): 7 usages
Document (org.apache.lucene.document.Document): 6 usages
FeatureExtractors (io.anserini.ltr.feature.FeatureExtractors): 4 usages
RerankerCascade (io.anserini.rerank.RerankerCascade): 4 usages
IndexReader (org.apache.lucene.index.IndexReader): 4 usages
BooleanQuery (org.apache.lucene.search.BooleanQuery): 4 usages
EnglishAnalyzer (org.apache.lucene.analysis.en.EnglishAnalyzer): 3 usages
IndexableField (org.apache.lucene.index.IndexableField): 3 usages
Terms (org.apache.lucene.index.Terms): 3 usages
QueryNodeException (org.apache.lucene.queryparser.flexible.core.QueryNodeException): 3 usages
ScoreDoc (org.apache.lucene.search.ScoreDoc): 3 usages
TermInSetQuery (org.apache.lucene.search.TermInSetQuery): 3 usages
CmdLineException (org.kohsuke.args4j.CmdLineException): 3 usages
ScoreTiesAdjusterReranker (io.anserini.rerank.lib.ScoreTiesAdjusterReranker): 2 usages
RemoveRetweetsTemporalTiebreakReranker (io.anserini.rerank.twitter.RemoveRetweetsTemporalTiebreakReranker): 2 usages
QueryGenerator (io.anserini.search.query.QueryGenerator): 2 usages