Example 1 with Value

Use of org.jbibtex.Value in project Anserini by castorini.

Class BibtexGenerator, method createDocument.

@Override
public Document createDocument(BibtexCollection.Document bibtexDoc) throws GeneratorException {
    String id = bibtexDoc.id();
    String content = bibtexDoc.contents();
    String type = bibtexDoc.type();
    BibTeXEntry bibtexEntry = bibtexDoc.bibtexEntry();
    if (content == null || content.trim().isEmpty()) {
        throw new EmptyDocumentException();
    }
    Document doc = new Document();
    // Store the collection docid.
    doc.add(new StringField(IndexArgs.ID, id, Field.Store.YES));
    // This is needed to break score ties by docid.
    doc.add(new SortedDocValuesField(IndexArgs.ID, new BytesRef(id)));
    // Store the collection's bibtex type
    doc.add(new StringField(TYPE, type, Field.Store.YES));
    if (args.storeRaw) {
        doc.add(new StoredField(IndexArgs.RAW, bibtexDoc.raw()));
    }
    FieldType fieldType = new FieldType();
    fieldType.setStored(args.storeContents);
    // Are we storing document vectors?
    if (args.storeDocvectors) {
        fieldType.setStoreTermVectors(true);
        fieldType.setStoreTermVectorPositions(true);
    }
    // Are we building a "positional" or "count" index?
    if (args.storePositions) {
        fieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
    } else {
        fieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS);
    }
    doc.add(new Field(IndexArgs.CONTENTS, content, fieldType));
    for (Map.Entry<Key, Value> fieldEntry : bibtexEntry.getFields().entrySet()) {
        String fieldKey = fieldEntry.getKey().toString();
        String fieldValue = fieldEntry.getValue().toUserString();
        // not worth trying to parse/normalize all numbers at the moment
        if (fieldKey.equals(BibtexField.NUMBER.name)) {
            continue;
        }
        if (STRING_FIELD_NAMES.contains(fieldKey)) {
            // index field as single token
            doc.add(new StringField(fieldKey, fieldValue, Field.Store.YES));
        } else if (FIELDS_WITHOUT_STEMMING.contains(fieldKey)) {
            // index field without stemming but store original string value
            FieldType nonStemmedType = new FieldType(fieldType);
            nonStemmedType.setStored(true);
            // token stream to be indexed
            Analyzer nonStemmingAnalyzer = DefaultEnglishAnalyzer.newNonStemmingInstance(CharArraySet.EMPTY_SET);
            StringReader reader = new StringReader(fieldValue);
            TokenStream stream = nonStemmingAnalyzer.tokenStream(null, reader);
            Field field = new Field(fieldKey, fieldValue, nonStemmedType);
            field.setTokenStream(stream);
            doc.add(field);
            nonStemmingAnalyzer.close();
        } else if (fieldKey.equals(BibtexField.YEAR.name)) {
            // compare by content; != on strings only compares references
            if (!fieldValue.isEmpty()) {
                // index as numeric value to allow range queries
                doc.add(new IntPoint(fieldKey, Integer.parseInt(fieldValue)));
            }
            doc.add(new StoredField(fieldKey, fieldValue));
        } else {
            // default to normal Field with tokenization and stemming
            doc.add(new Field(fieldKey, fieldValue, fieldType));
        }
    }
    return doc;
}
Also used : TokenStream(org.apache.lucene.analysis.TokenStream) BibTeXEntry(org.jbibtex.BibTeXEntry) Document(org.apache.lucene.document.Document) Analyzer(org.apache.lucene.analysis.Analyzer) DefaultEnglishAnalyzer(io.anserini.analysis.DefaultEnglishAnalyzer) FieldType(org.apache.lucene.document.FieldType) StringField(org.apache.lucene.document.StringField) StoredField(org.apache.lucene.document.StoredField) SortedDocValuesField(org.apache.lucene.document.SortedDocValuesField) Field(org.apache.lucene.document.Field) IntPoint(org.apache.lucene.document.IntPoint) Value(org.jbibtex.Value) StringReader(java.io.StringReader) Map(java.util.Map) BytesRef(org.apache.lucene.util.BytesRef) Key(org.jbibtex.Key)
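
The year branch in createDocument passes the field value straight to Integer.parseInt, which throws NumberFormatException for values such as "2019a" or "in press". The following is a minimal sketch of a more defensive parse using only the standard library. The class name YearFields and the method parseYear are hypothetical and are not part of Anserini or jbibtex.

```java
import java.util.OptionalInt;

/** Hypothetical helper (not part of Anserini): parses a BibTeX year field defensively. */
final class YearFields {
    private YearFields() {
    }

    /**
     * Returns the year as an int, or an empty OptionalInt when the value is
     * null, blank, or not a plain integer (for example "2019a").
     */
    static OptionalInt parseYear(String fieldValue) {
        if (fieldValue == null) {
            return OptionalInt.empty();
        }
        String trimmed = fieldValue.trim();
        if (trimmed.isEmpty()) {
            return OptionalInt.empty();
        }
        try {
            return OptionalInt.of(Integer.parseInt(trimmed));
        } catch (NumberFormatException e) {
            // Non-numeric years are skipped for range indexing rather than failing the document.
            return OptionalInt.empty();
        }
    }
}
```

With a helper like this, the IntPoint could be added only when parseYear(fieldValue).isPresent(), while the StoredField keeps the original string either way.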

Example 2 with Value

Use of org.jbibtex.Value in project Anserini by castorini.

Class BibtexCollectionTest, method checkDocument.

@Override
void checkDocument(SourceDocument doc, Map<String, String> expected) {
    assertTrue(doc.indexable());
    Map<Key, Value> parsedFields = ((BibtexCollection.Document) doc).bibtexEntry().getFields();
    for (Map.Entry<String, String> entry : expected.entrySet()) {
        String expectedKey = entry.getKey();
        String expectedValue = entry.getValue();
        if (expectedKey.equals("id")) {
            assertEquals(expectedValue, doc.id());
        } else if (expectedKey.equals("type")) {
            assertEquals(expectedValue, ((BibtexCollection.Document) doc).type());
        } else if (expectedKey.equals("contents")) {
            assertEquals(expectedValue, doc.contents());
            assertEquals(expectedValue, doc.raw());
        } else {
            Value parsedValue = parsedFields.get(new Key(expectedKey));
            assertNotNull(parsedValue);
            assertEquals(expectedValue, parsedValue.toUserString());
        }
    }
}
Also used : Value(org.jbibtex.Value) Map(java.util.Map) HashMap(java.util.HashMap) Key(org.jbibtex.Key)

Aggregations

Map (java.util.Map) 2
Key (org.jbibtex.Key) 2
Value (org.jbibtex.Value) 2
DefaultEnglishAnalyzer (io.anserini.analysis.DefaultEnglishAnalyzer) 1
StringReader (java.io.StringReader) 1
HashMap (java.util.HashMap) 1
Analyzer (org.apache.lucene.analysis.Analyzer) 1
TokenStream (org.apache.lucene.analysis.TokenStream) 1
Document (org.apache.lucene.document.Document) 1
Field (org.apache.lucene.document.Field) 1
FieldType (org.apache.lucene.document.FieldType) 1
IntPoint (org.apache.lucene.document.IntPoint) 1
SortedDocValuesField (org.apache.lucene.document.SortedDocValuesField) 1
StoredField (org.apache.lucene.document.StoredField) 1
StringField (org.apache.lucene.document.StringField) 1
BytesRef (org.apache.lucene.util.BytesRef) 1
BibTeXEntry (org.jbibtex.BibTeXEntry) 1