
Example 1 with SourceDocument

Use of io.anserini.document.SourceDocument in the Anserini project by castorini.

From the class LuceneDocumentGenerator, method createDocument.

public Document createDocument(SourceDocument src) {
    String id = src.id();
    String contents;
    try {
        // If there's a transform, use it.
        contents = transform != null ? transform.apply(src.content()) : src.content();
    } catch (Exception e) {
        LOG.error("Error extracting document text, skipping document: " + id, e);
        counters.errors.incrementAndGet();
        return null;
    }
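    // Treat null or whitespace-only content as an empty document (avoids an NPE on trim()).
    if (contents == null || contents.trim().isEmpty()) {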
        LOG.info("Empty document: " + id);
        counters.emptyDocuments.incrementAndGet();
        return null;
    }
    // make a new, empty document
    Document document = new Document();
    // document id
    document.add(new StringField(FIELD_ID, id, Field.Store.YES));
    if (args.storeRawDocs) {
        document.add(new StoredField(FIELD_RAW, src.content()));
    }
    FieldType fieldType = new FieldType();
    fieldType.setStored(args.storeTransformedDocs);
    // Are we storing document vectors?
    if (args.storeDocvectors) {
        fieldType.setStoreTermVectors(true);
        fieldType.setStoreTermVectorPositions(true);
    }
    // Are we building a "positional" or "count" index?
    if (args.storePositions) {
        fieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
    } else {
        fieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS);
    }
    document.add(new Field(FIELD_BODY, contents, fieldType));
    return document;
}
Also used: Field (org.apache.lucene.document.Field), StringField (org.apache.lucene.document.StringField), StoredField (org.apache.lucene.document.StoredField), Document (org.apache.lucene.document.Document), SourceDocument (io.anserini.document.SourceDocument), FieldType (org.apache.lucene.document.FieldType)
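
The skip logic at the top of createDocument (apply an optional transform, count failures, drop empty results) can be tried in isolation without Lucene on the classpath. The following is a minimal sketch under that assumption, not Anserini code: the class ContentExtractor, its extract method, and its two counter fields are hypothetical names introduced only for illustration.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Function;

// Hypothetical stand-in for the extraction step of createDocument; not part of Anserini.
final class ContentExtractor {
    // Counters mirroring counters.errors and counters.emptyDocuments in the snippet above.
    final AtomicLong errors = new AtomicLong();
    final AtomicLong emptyDocuments = new AtomicLong();

    // Optional transform; null means "index the raw content unchanged".
    private final Function<String, String> transform;

    ContentExtractor(Function<String, String> transform) {
        this.transform = transform;
    }

    /** Returns the text to index, or null if the document should be skipped. */
    String extract(String rawContent) {
        String contents;
        try {
            // If there is a transform, use it; otherwise keep the raw content.
            contents = transform != null ? transform.apply(rawContent) : rawContent;
        } catch (Exception e) {
            // A failing transform skips the document and is counted as an error.
            errors.incrementAndGet();
            return null;
        }
        // Null or whitespace-only text is counted as an empty document and skipped.
        if (contents == null || contents.trim().isEmpty()) {
            emptyDocuments.incrementAndGet();
            return null;
        }
        return contents;
    }

    public static void main(String[] args) {
        // Example transform (an assumption for the demo): strip anything that looks like a tag.
        ContentExtractor extractor = new ContentExtractor(s -> s.replaceAll("<[^>]*>", ""));
        System.out.println(extractor.extract("<b>hello</b>"));
        System.out.println(extractor.extract("<p> </p>"));
        System.out.println("empty documents: " + extractor.emptyDocuments.get());
    }
}
```

Returning null for skipped documents follows the convention of the original method, so a caller has to check the result before adding it to an index.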

Aggregations

SourceDocument (io.anserini.document.SourceDocument): 1
Document (org.apache.lucene.document.Document): 1
Field (org.apache.lucene.document.Field): 1
FieldType (org.apache.lucene.document.FieldType): 1
StoredField (org.apache.lucene.document.StoredField): 1
StringField (org.apache.lucene.document.StringField): 1