Example 31 with DataBag

Use of org.apache.pig.data.DataBag in project varaha by thedatachef.

Class TermVectorCentroid, method exec.

public DataBag exec(Tuple input) throws IOException {
    if (input == null || input.size() < 1 || input.isNull(0))
        return null;
    DataBag bagOfVectors = (DataBag) input.get(0);
    DataBag centroid = BagFactory.getInstance().newDefaultBag();
    Map<String, Double> termSums = new HashMap<String, Double>();
    // Sum the weight of each term across all vectors, skipping null terms or weights
    for (Tuple t : bagOfVectors) {
        DataBag v = (DataBag) t.get(0);
        for (Tuple v_i : v) {
            if (!(v_i.isNull(0) || v_i.isNull(1))) {
                String term = v_i.get(0).toString();
                Double weight = (Double) v_i.get(1);
                Double currentValue = termSums.get(term);
                termSums.put(term, currentValue == null ? weight : currentValue + weight);
            }
        }
    }
    // Go back through the map and turn each sum into an average over all vectors
    long numVectors = bagOfVectors.size();
    Iterator<Map.Entry<String, Double>> mapIterator = termSums.entrySet().iterator();
    while (mapIterator.hasNext()) {
        Map.Entry<String, Double> pair = mapIterator.next();
        Tuple termWeightPair = tupleFactory.newTuple(2);
        termWeightPair.set(0, pair.getKey());
        termWeightPair.set(1, pair.getValue() / numVectors);
        centroid.add(termWeightPair);
    }
    return centroid;
}
Also used : DataBag(org.apache.pig.data.DataBag) HashMap(java.util.HashMap) Iterator(java.util.Iterator) Map(java.util.Map) Tuple(org.apache.pig.data.Tuple)
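
The averaging logic in exec can be tried without a Pig runtime. The following is a minimal sketch, not the project's code: the class name CentroidSketch and the method centroid are hypothetical, and plain java.util maps stand in for Pig's DataBag and Tuple. As in the method above, each term's sum is divided by the total number of vectors, so a term absent from a vector effectively contributes zero.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the UDF: plain Maps replace Pig's DataBag/Tuple.
class CentroidSketch {

    // Averages term weights over all vectors; a term missing from a vector counts as 0.
    static Map<String, Double> centroid(List<Map<String, Double>> vectors) {
        Map<String, Double> sums = new HashMap<>();
        for (Map<String, Double> vector : vectors) {
            for (Map.Entry<String, Double> e : vector.entrySet()) {
                // Mirror the isNull checks: skip entries with a null term or weight.
                if (e.getKey() == null || e.getValue() == null) continue;
                sums.merge(e.getKey(), e.getValue(), Double::sum);
            }
        }
        Map<String, Double> result = new HashMap<>();
        for (Map.Entry<String, Double> e : sums.entrySet()) {
            result.put(e.getKey(), e.getValue() / vectors.size());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Double> v1 = new HashMap<>();
        v1.put("pig", 1.0);
        v1.put("hadoop", 2.0);
        Map<String, Double> v2 = new HashMap<>();
        v2.put("pig", 3.0);
        // "pig" averages to 2.0 and "hadoop" to 1.0; map print order is unspecified.
        System.out.println(centroid(Arrays.asList(v1, v2)));
    }
}
```

Map.merge replaces the explicit null check on the running sum; it inserts the weight for a new term and adds to the existing sum otherwise.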

Example 32 with DataBag

Use of org.apache.pig.data.DataBag in project varaha by thedatachef.

Class TokenizeText, method fillBag.

/**
 * Fills a DataBag with tokens from a TokenStream.
 */
public DataBag fillBag(TokenStream stream) throws IOException {
    DataBag result = bagFactory.newDefaultBag();
    CharTermAttribute termAttribute = stream.addAttribute(CharTermAttribute.class);
    try {
        stream.reset();
        while (stream.incrementToken()) {
            if (termAttribute.length() > 0) {
                Tuple termText = tupleFactory.newTuple(termAttribute.toString());
                result.add(termText);
            }
        }
        stream.end();
    } finally {
        stream.close();
    }
    return result;
}
Also used : DataBag(org.apache.pig.data.DataBag) CharTermAttribute(org.apache.lucene.analysis.tokenattributes.CharTermAttribute) Tuple(org.apache.pig.data.Tuple)
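
The consume-and-always-close pattern in fillBag can be illustrated with the standard library alone. This is only an analogue under stated assumptions: java.util.Scanner stands in for Lucene's TokenStream, a List<String> stands in for the DataBag, and the names FillTokensSketch and fillTokens are hypothetical. Scanner has no counterpart to reset() or end(), so only the token loop and the guaranteed close are mirrored.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical analogue of fillBag: Scanner stands in for Lucene's TokenStream.
class FillTokensSketch {

    // Collects the whitespace-separated, non-empty tokens of the given text.
    static List<String> fillTokens(String text) {
        List<String> result = new ArrayList<>();
        // try-with-resources closes the scanner even if the loop throws,
        // which is what the try/finally around stream.close() achieves.
        try (Scanner scanner = new Scanner(text)) {
            while (scanner.hasNext()) {
                String token = scanner.next();
                if (token.length() > 0) {
                    result.add(token);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(fillTokens("fills a bag  with tokens"));
    }
}
```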

Aggregations

DataBag (org.apache.pig.data.DataBag): 32
Tuple (org.apache.pig.data.Tuple): 27
Test (org.junit.Test): 10
Map (java.util.Map): 7
IOException (java.io.IOException): 6
HashMap (java.util.HashMap): 6
BasicBSONObject (org.bson.BasicBSONObject): 6
ArrayList (java.util.ArrayList): 5
BasicDBList (com.mongodb.BasicDBList): 3
BasicDBObject (com.mongodb.BasicDBObject): 3
List (java.util.List): 3
Properties (java.util.Properties): 3
DefaultDataBag (org.apache.pig.data.DefaultDataBag): 3
UDFContext (org.apache.pig.impl.util.UDFContext): 3
DateTime (org.joda.time.DateTime): 3
HCatFieldSchema (org.apache.hive.hcatalog.data.schema.HCatFieldSchema): 2
ResourceSchema (org.apache.pig.ResourceSchema): 2
ResourceFieldSchema (org.apache.pig.ResourceSchema.ResourceFieldSchema): 2
DefaultTuple (org.apache.pig.data.DefaultTuple): 2
ParallelTopicModel (cc.mallet.topics.ParallelTopicModel): 1