Example 1 with WebPage

Use of org.apache.gora.examples.generated.WebPage in project gora by apache.

In class SparkWordCount, method wordCount.

public int wordCount(DataStore<String, WebPage> inStore, DataStore<String, TokenDatum> outStore) throws IOException {
    //Spark engine initialization
    GoraSparkEngine<String, WebPage> goraSparkEngine = new GoraSparkEngine<>(String.class, WebPage.class);
    SparkConf sparkConf = new SparkConf().setAppName("Gora Spark Word Count Application").setMaster("local");
    // Register the persistent class with Kryo so Spark can serialize it
    Class<?>[] kryoClasses = new Class<?>[] { inStore.getPersistentClass() };
    sparkConf.registerKryoClasses(kryoClasses);
    JavaSparkContext sc = new JavaSparkContext(sparkConf);
    JavaPairRDD<String, WebPage> goraRDD = goraSparkEngine.initialize(sc, inStore);
    long count = goraRDD.count();
    log.info("Total Web page count: {}", count);
    JavaRDD<Tuple2<String, Long>> mappedGoraRdd = goraRDD.values().map(mapFunc);
    JavaPairRDD<String, Long> reducedGoraRdd = JavaPairRDD.fromJavaRDD(mappedGoraRdd).reduceByKey(redFunc);
    // Print the reduced counts for debugging
    log.info("SparkWordCount debug purpose TokenDatum print starts:");
    Map<String, Long> tokenDatumMap = reducedGoraRdd.collectAsMap();
    for (Map.Entry<String, Long> entry : tokenDatumMap.entrySet()) {
        log.info(entry.getKey());
        log.info(entry.getValue().toString());
    }
    log.info("SparkWordCount debug purpose TokenDatum print ends:");
    // Write the output to the datastore
    Configuration sparkHadoopConf = goraSparkEngine.generateOutputConf(outStore);
    reducedGoraRdd.saveAsNewAPIHadoopDataset(sparkHadoopConf);
    return 1;
}
Also used: WebPage (org.apache.gora.examples.generated.WebPage), Configuration (org.apache.hadoop.conf.Configuration), GoraSparkEngine (org.apache.gora.spark.GoraSparkEngine), Tuple2 (scala.Tuple2), JavaSparkContext (org.apache.spark.api.java.JavaSparkContext), SparkConf (org.apache.spark.SparkConf)
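
The mapFunc and redFunc passed to map and reduceByKey in Example 1 are not shown in this excerpt. As a rough illustration of what a word-count map and reduce pair computes, here is a minimal sketch in plain Java collections, without Spark or Gora. The class name WordCountSketch, the method countTokens, and the whitespace tokenization are all assumptions made for illustration; the actual SparkWordCount functions may differ.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only: not the SparkWordCount implementation.
final class WordCountSketch {

    // "Map" step: treat every whitespace-separated token as a (token, 1) pair.
    // "Reduce" step: sum the counts per token, as reduceByKey would per key.
    static Map<String, Long> countTokens(List<String> contents) {
        Map<String, Long> counts = new TreeMap<>();
        for (String content : contents) {
            for (String token : content.trim().split("\\s+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1L, Long::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Stand-in for the text content of two web pages (assumed input).
        List<String> pages = Arrays.asList("foo bar foo", "bar baz");
        for (Map.Entry<String, Long> entry : countTokens(pages).entrySet()) {
            System.out.println(entry.getKey() + " " + entry.getValue());
        }
    }
}
```

In the Spark version, the per-token summation is distributed across partitions by reduceByKey; the sketch does the same aggregation in a single in-memory map.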

Example 2 with WebPage

Use of org.apache.gora.examples.generated.WebPage in project gora by apache.

In class SparkWordCount, method run.

public int run(String[] args) throws Exception {
    DataStore<String, WebPage> inStore;
    DataStore<String, TokenDatum> outStore;
    Configuration hadoopConf = new Configuration();
    if (args.length > 0) {
        String dataStoreClass = args[0];
        inStore = DataStoreFactory.getDataStore(dataStoreClass, String.class, WebPage.class, hadoopConf);
        if (args.length > 1) {
            dataStoreClass = args[1];
        }
        outStore = DataStoreFactory.getDataStore(dataStoreClass, String.class, TokenDatum.class, hadoopConf);
    } else {
        inStore = DataStoreFactory.getDataStore(String.class, WebPage.class, hadoopConf);
        outStore = DataStoreFactory.getDataStore(String.class, TokenDatum.class, hadoopConf);
    }
    return wordCount(inStore, outStore);
}
Also used: WebPage (org.apache.gora.examples.generated.WebPage), Configuration (org.apache.hadoop.conf.Configuration), TokenDatum (org.apache.gora.examples.generated.TokenDatum)
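
The run method in Example 2 picks the data store classes from the command-line arguments: the first argument names the input store class, an optional second argument names the output store class (falling back to the first), and with no arguments the factory default is used for both. A minimal sketch of just that selection logic, with no Gora dependency, might look as follows. The class StoreClassArgs and the method resolve are hypothetical names, and using null to mean "factory default" is an assumption of this sketch.

```java
// Hypothetical helper that isolates the argument handling from run(String[] args).
final class StoreClassArgs {

    // Returns {inputStoreClass, outputStoreClass}.
    // A null entry stands for "let the factory choose its default store".
    static String[] resolve(String[] args) {
        if (args.length == 0) {
            return new String[] { null, null };
        }
        String inputStoreClass = args[0];
        // The output store class defaults to the input store class.
        String outputStoreClass = args.length > 1 ? args[1] : inputStoreClass;
        return new String[] { inputStoreClass, outputStoreClass };
    }

    public static void main(String[] args) {
        String[] resolved = resolve(args);
        System.out.println("input store:  " + resolved[0]);
        System.out.println("output store: " + resolved[1]);
    }
}
```

Keeping this logic separate from the DataStoreFactory calls makes the fallback rule easy to test without a running store backend.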

Example 3 with WebPage

Use of org.apache.gora.examples.generated.WebPage in project gora by apache.

In class MapReduceSerialization, method mapReduceSerialization.

public int mapReduceSerialization(DataStore<String, WebPage> inStore, DataStore<String, WebPage> outStore) throws IOException, InterruptedException, ClassNotFoundException {
    Query<String, WebPage> query = inStore.newQuery();
    query.setFields("url");
    Job job = createJob(inStore, query, outStore);
    return job.waitForCompletion(true) ? 0 : 1;
}
Also used: WebPage (org.apache.gora.examples.generated.WebPage), Job (org.apache.hadoop.mapreduce.Job)

Example 4 with WebPage

Use of org.apache.gora.examples.generated.WebPage in project gora by apache.

In class WordCount, method run.

@Override
public int run(String[] args) throws Exception {
    DataStore<String, WebPage> inStore;
    DataStore<String, TokenDatum> outStore;
    Configuration conf = new Configuration();
    if (args.length > 0) {
        String dataStoreClass = args[0];
        inStore = DataStoreFactory.getDataStore(dataStoreClass, String.class, WebPage.class, conf);
        if (args.length > 1) {
            dataStoreClass = args[1];
        }
        outStore = DataStoreFactory.getDataStore(dataStoreClass, String.class, TokenDatum.class, conf);
    } else {
        inStore = DataStoreFactory.getDataStore(String.class, WebPage.class, conf);
        outStore = DataStoreFactory.getDataStore(String.class, TokenDatum.class, conf);
    }
    return wordCount(inStore, outStore);
}
Also used: WebPage (org.apache.gora.examples.generated.WebPage), Configuration (org.apache.hadoop.conf.Configuration), TokenDatum (org.apache.gora.examples.generated.TokenDatum)

Example 5 with WebPage

Use of org.apache.gora.examples.generated.WebPage in project gora by apache.

In class WebPageDataCreator, method run.

public int run(String[] args) throws Exception {
    String dataStoreClass = "org.apache.gora.hbase.store.HBaseStore";
    if (args.length > 0) {
        dataStoreClass = args[0];
    }
    DataStore<String, WebPage> store = DataStoreFactory.getDataStore(dataStoreClass, String.class, WebPage.class, new Configuration());
    createWebPageData(store);
    return 0;
}
Also used: WebPage (org.apache.gora.examples.generated.WebPage), Configuration (org.apache.hadoop.conf.Configuration)

Aggregations

WebPage (org.apache.gora.examples.generated.WebPage): 67
Test (org.junit.Test): 33
Utf8 (org.apache.avro.util.Utf8): 32
DBObject (com.mongodb.DBObject): 7
Configuration (org.apache.hadoop.conf.Configuration): 6
Employee (org.apache.gora.examples.generated.Employee): 5
Metadata (org.apache.gora.examples.generated.Metadata): 4
BeanFactoryImpl (org.apache.gora.persistency.impl.BeanFactoryImpl): 4
ByteBuffer (java.nio.ByteBuffer): 3
org.apache.hadoop.hbase.client (org.apache.hadoop.hbase.client): 3
ByteArrayInputStream (java.io.ByteArrayInputStream): 2
DataInputStream (java.io.DataInputStream): 2
IOException (java.io.IOException): 2
ArrayList (java.util.ArrayList): 2
Field (org.apache.avro.Schema.Field): 2
TokenDatum (org.apache.gora.examples.generated.TokenDatum): 2
FilterList (org.apache.gora.filter.FilterList): 2
TableName (org.apache.hadoop.hbase.TableName): 2
Job (org.apache.hadoop.mapreduce.Job): 2
Properties (java.util.Properties): 1