
Example 1 with PCollectionTuple

Use of com.google.cloud.dataflow.sdk.values.PCollectionTuple in project spark-dataflow by cloudera.

From the class TransformTranslator, method multiDo.

private static <I, O> TransformEvaluator<ParDo.BoundMulti<I, O>> multiDo() {
    return new TransformEvaluator<ParDo.BoundMulti<I, O>>() {

        @Override
        public void evaluate(ParDo.BoundMulti<I, O> transform, EvaluationContext context) {
            TupleTag<O> mainOutputTag = MULTIDO_FG.get("mainOutputTag", transform);
            MultiDoFnFunction<I, O> multifn = new MultiDoFnFunction<>(
                transform.getFn(),
                context.getRuntimeContext(),
                mainOutputTag,
                getSideInputs(transform.getSideInputs(), context));
            @SuppressWarnings("unchecked")
            JavaRDDLike<WindowedValue<I>, ?> inRDD = (JavaRDDLike<WindowedValue<I>, ?>) context.getInputRDD(transform);
            // Run the DoFn once over the input, emitting (tag, value) pairs; cache the result because it is filtered once per output below.
            JavaPairRDD<TupleTag<?>, WindowedValue<?>> all = inRDD.mapPartitionsToPair(multifn).cache();
            PCollectionTuple pct = context.getOutput(transform);
            for (Map.Entry<TupleTag<?>, PCollection<?>> e : pct.getAll().entrySet()) {
                // Keep only the pairs emitted under this output's tag.
                @SuppressWarnings("unchecked")
                JavaPairRDD<TupleTag<?>, WindowedValue<?>> filtered = all.filter(new TupleTagFilter(e.getKey()));
                // Object is the best we can do since different outputs can have different tags
                @SuppressWarnings("unchecked")
                JavaRDD<WindowedValue<Object>> values = (JavaRDD<WindowedValue<Object>>) (JavaRDD<?>) filtered.values();
                context.setRDD(e.getValue(), values);
            }
        }
    };
}
Also used : TupleTag(com.google.cloud.dataflow.sdk.values.TupleTag) TextIO(com.google.cloud.dataflow.sdk.io.TextIO) AvroIO(com.google.cloud.dataflow.sdk.io.AvroIO) HadoopIO(com.cloudera.dataflow.hadoop.HadoopIO) JavaRDD(org.apache.spark.api.java.JavaRDD) JavaRDDLike(org.apache.spark.api.java.JavaRDDLike) PCollection(com.google.cloud.dataflow.sdk.values.PCollection) WindowedValue(com.google.cloud.dataflow.sdk.util.WindowedValue) ParDo(com.google.cloud.dataflow.sdk.transforms.ParDo) PCollectionTuple(com.google.cloud.dataflow.sdk.values.PCollectionTuple) Map(java.util.Map) ImmutableMap(com.google.common.collect.ImmutableMap)
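
The multiDo evaluator above runs the function once, producing (tag, value) pairs, and then derives each output by filtering on its tag and dropping the tag. The following is a minimal plain-Java sketch of that tag-then-filter idea, with no Spark or Dataflow SDK involved. The class name TagAndFilterSketch, the methods tagAll and valuesFor, and the "even"/"odd" string tags are hypothetical stand-ins for illustration, not part of spark-dataflow.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class TagAndFilterSketch {

    // Hypothetical stand-in for the multi-output function: a single pass over
    // the input emits (tag, value) pairs, tagging each number "even" or "odd".
    static List<Map.Entry<String, Integer>> tagAll(List<Integer> input) {
        List<Map.Entry<String, Integer>> tagged = new ArrayList<>();
        for (int n : input) {
            tagged.add(new SimpleEntry<>(n % 2 == 0 ? "even" : "odd", n));
        }
        return tagged;
    }

    // Mirrors the filter-then-values step: keep the pairs for one tag,
    // then drop the tag so only the values remain.
    static List<Integer> valuesFor(String tag, List<Map.Entry<String, Integer>> tagged) {
        return tagged.stream()
                .filter(e -> e.getKey().equals(tag))
                .map(Map.Entry::getValue)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> all = tagAll(Arrays.asList(1, 2, 3, 4, 5));
        System.out.println(valuesFor("even", all)); // prints [2, 4]
        System.out.println(valuesFor("odd", all));  // prints [1, 3, 5]
    }
}
```

The tagged list is computed once and read once per tag, which is the same reason the evaluator caches its pair RDD before looping over the outputs.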

Example 2 with PCollectionTuple

Use of com.google.cloud.dataflow.sdk.values.PCollectionTuple in project spark-dataflow by cloudera.

From the class MultiOutputWordCountTest, method testRun.

@Test
public void testRun() throws Exception {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.create());
    PCollection<String> regex = p.apply(Create.of("[^a-zA-Z']+"));
    PCollection<String> w1 = p.apply(Create.of("Here are some words to count", "and some others"));
    PCollection<String> w2 = p.apply(Create.of("Here are some more words", "and even more words"));
    PCollectionList<String> list = PCollectionList.of(w1).and(w2);
    PCollection<String> union = list.apply(Flatten.<String>pCollections());
    PCollectionView<String> regexView = regex.apply(View.<String>asSingleton());
    CountWords countWords = new CountWords(regexView);
    // lowerCnts and upperCnts are TupleTags declared elsewhere in the test class (not shown here).
    PCollectionTuple luc = union.apply(countWords);
    PCollection<Long> unique = luc.get(lowerCnts).apply(ApproximateUnique.<KV<String, Long>>globally(16));
    EvaluationResult res = SparkPipelineRunner.create().run(p);
    Iterable<KV<String, Long>> actualLower = res.get(luc.get(lowerCnts));
    Assert.assertEquals("are", actualLower.iterator().next().getKey());
    Iterable<KV<String, Long>> actualUpper = res.get(luc.get(upperCnts));
    Assert.assertEquals("Here", actualUpper.iterator().next().getKey());
    Iterable<Long> actualUniqCount = res.get(unique);
    Assert.assertEquals(9, (long) actualUniqCount.iterator().next());
    int actualTotalWords = res.getAggregatorValue("totalWords", Integer.class);
    Assert.assertEquals(18, actualTotalWords);
    int actualMaxWordLength = res.getAggregatorValue("maxWordLength", Integer.class);
    Assert.assertEquals(6, actualMaxWordLength);
    AggregatorValues<Integer> aggregatorValues = res.getAggregatorValues(countWords.getTotalWordsAggregator());
    Assert.assertEquals(18, Iterables.getOnlyElement(aggregatorValues.getValues()).intValue());
    res.close();
}
Also used : KV(com.google.cloud.dataflow.sdk.values.KV) Pipeline(com.google.cloud.dataflow.sdk.Pipeline) PCollectionTuple(com.google.cloud.dataflow.sdk.values.PCollectionTuple) Test(org.junit.Test)
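
The expected values in testRun (18 total words, a maximum word length of 6, 9 distinct lower-case words, "Here" among the upper-case ones) can be checked by hand with a plain-Java sketch. It assumes that CountWords, which is not shown above, splits each line on the regex and routes a word to the upper or lower output by the case of its first letter; that routing rule is an inference from the tag names and assertions, and the class MultiOutputWordCountSketch with its fields and methods is hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MultiOutputWordCountSketch {
    // Per-output word counts, standing in for the two tagged outputs.
    final Map<String, Long> lowerCounts = new TreeMap<>();
    final Map<String, Long> upperCounts = new TreeMap<>();
    // Stand-ins for the "totalWords" and "maxWordLength" aggregators.
    int totalWords = 0;
    int maxWordLength = 0;

    // Split one line on the regex and route each word by its first letter
    // (assumed routing rule, see the note above).
    void add(String line, String regex) {
        for (String word : line.split(regex)) {
            if (word.isEmpty()) {
                continue; // split can yield an empty leading token
            }
            totalWords++;
            maxWordLength = Math.max(maxWordLength, word.length());
            Map<String, Long> target =
                    Character.isUpperCase(word.charAt(0)) ? upperCounts : lowerCounts;
            target.merge(word, 1L, Long::sum);
        }
    }

    // Convenience factory: feed every line through add().
    static MultiOutputWordCountSketch countAll(List<String> lines, String regex) {
        MultiOutputWordCountSketch sketch = new MultiOutputWordCountSketch();
        for (String line : lines) {
            sketch.add(line, regex);
        }
        return sketch;
    }

    public static void main(String[] args) {
        // Same input lines and regex as the test above (w1 and w2 flattened).
        List<String> lines = Arrays.asList(
                "Here are some words to count", "and some others",
                "Here are some more words", "and even more words");
        MultiOutputWordCountSketch s = countAll(lines, "[^a-zA-Z']+");
        System.out.println(s.totalWords);         // prints 18
        System.out.println(s.maxWordLength);      // prints 6
        System.out.println(s.lowerCounts.size()); // prints 9
        System.out.println(s.upperCounts);        // prints {Here=2}
    }
}
```

The sketch counts exactly, whereas the test uses ApproximateUnique, which returns an estimate; for an input this small the test expects the two to agree.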

Aggregations

PCollectionTuple (com.google.cloud.dataflow.sdk.values.PCollectionTuple) 2
HadoopIO (com.cloudera.dataflow.hadoop.HadoopIO) 1
Pipeline (com.google.cloud.dataflow.sdk.Pipeline) 1
AvroIO (com.google.cloud.dataflow.sdk.io.AvroIO) 1
TextIO (com.google.cloud.dataflow.sdk.io.TextIO) 1
ParDo (com.google.cloud.dataflow.sdk.transforms.ParDo) 1
WindowedValue (com.google.cloud.dataflow.sdk.util.WindowedValue) 1
KV (com.google.cloud.dataflow.sdk.values.KV) 1
PCollection (com.google.cloud.dataflow.sdk.values.PCollection) 1
TupleTag (com.google.cloud.dataflow.sdk.values.TupleTag) 1
ImmutableMap (com.google.common.collect.ImmutableMap) 1
Map (java.util.Map) 1
JavaRDD (org.apache.spark.api.java.JavaRDD) 1
JavaRDDLike (org.apache.spark.api.java.JavaRDDLike) 1
Test (org.junit.Test) 1