Example 1 with WriteOneFilePerWindow

Use of com.google.cloud.dataflow.examples.common.WriteOneFilePerWindow in the project DataflowJavaSDK-examples by GoogleCloudPlatform.

From the main method of the class WindowedWordCount.

public static void main(String[] args) throws IOException {
    Options options = PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
    final String output = options.getOutput();
    final Instant minTimestamp = new Instant(options.getMinTimestampMillis());
    final Instant maxTimestamp = new Instant(options.getMaxTimestampMillis());
    Pipeline pipeline = Pipeline.create(options);
    /**
     * Concept #1: the Beam SDK lets us run the same pipeline with either a bounded or
     * unbounded input source.
     */
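    PCollection<String> input = pipeline
            // Read lines from the input file.
            .apply(TextIO.read().from(options.getInputFile()))
            // Concept #2: Add an element timestamp, using an artificial time just to show windowing.
            .apply(ParDo.of(new AddTimestampFn(minTimestamp, maxTimestamp)));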
    /**
     * Concept #3: Window into fixed windows. The fixed window size for this example defaults to 1
     * minute (you can change this with a command-line option). See the documentation for more
     * information on how fixed windows work, and for information on the other types of windowing
     * available (e.g., sliding windows).
     */
    PCollection<String> windowedWords = input.apply(
            Window.<String>into(FixedWindows.of(Duration.standardMinutes(options.getWindowSize()))));
    /**
     * Concept #4: Re-use our existing CountWords transform that does not have knowledge of
     * windows over a PCollection containing windowed values.
     */
    PCollection<KV<String, Long>> wordCounts = windowedWords.apply(new WordCount.CountWords());
    /**
     * Concept #5: Format the results and write to a sharded file partitioned by window, using a
     * simple ParDo operation. Because there may be failures followed by retries, the
     * writes must be idempotent, but the details of writing to files are elided here.
     */
    wordCounts
            .apply(MapElements.via(new WordCount.FormatAsTextFn()))
            .apply(new WriteOneFilePerWindow(output, options.getNumShards()));
    PipelineResult result = pipeline.run();
    try {
        result.waitUntilFinish();
    } catch (Exception exc) {
        result.cancel();
    }
}
Also used: ExampleBigQueryTableOptions (com.google.cloud.dataflow.examples.common.ExampleBigQueryTableOptions), ExampleOptions (com.google.cloud.dataflow.examples.common.ExampleOptions), PipelineOptions (org.apache.beam.sdk.options.PipelineOptions), WriteOneFilePerWindow (com.google.cloud.dataflow.examples.common.WriteOneFilePerWindow), Instant (org.joda.time.Instant), PipelineResult (org.apache.beam.sdk.PipelineResult), KV (org.apache.beam.sdk.values.KV), IOException (java.io.IOException), Pipeline (org.apache.beam.sdk.Pipeline)
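
The comments in the example mention fixed windows and idempotent, per-window file writes without showing how either works. The following is a minimal sketch in plain Java, not the Beam SDK's or WriteOneFilePerWindow's actual implementation. The class FixedWindowSketch, both method names, and the file-name pattern are hypothetical and chosen only for illustration. It shows how a timestamp can be mapped to the start of its fixed window, and how an output name derived only from the window and shard index stays the same when a write is retried.

```java
import java.time.Instant;

// Hypothetical illustration only; not part of Beam or the examples project.
final class FixedWindowSketch {
    private FixedWindowSketch() {}

    /**
     * Returns the start (in epoch millis) of the fixed window containing the timestamp,
     * assuming windows are aligned to the epoch with no offset.
     */
    static long windowStartMillis(long timestampMillis, long windowSizeMillis) {
        if (windowSizeMillis <= 0) {
            throw new IllegalArgumentException("window size must be positive");
        }
        // floorMod keeps the result correct for timestamps before the epoch.
        return timestampMillis - Math.floorMod(timestampMillis, windowSizeMillis);
    }

    /**
     * Builds a file name from the window bounds and shard index only (assumed pattern).
     * Because the name does not depend on when or how often the write runs,
     * a retried write targets the same file.
     */
    static String shardFileName(
            String prefix, long windowStartMillis, long windowSizeMillis, int shard, int numShards) {
        Instant start = Instant.ofEpochMilli(windowStartMillis);
        Instant end = Instant.ofEpochMilli(windowStartMillis + windowSizeMillis);
        return String.format("%s-%s-%s-%05d-of-%05d", prefix, start, end, shard, numShards);
    }

    public static void main(String[] args) {
        long oneMinute = 60_000L;
        // An element stamped 125 seconds after the epoch falls in the window starting at 120 seconds.
        long start = windowStartMillis(125_000L, oneMinute);
        System.out.println(start);
        System.out.println(shardFileName("counts", start, oneMinute, 0, 3));
    }
}
```

Calling shardFileName twice with the same arguments yields the same string, which is the property the "writes must be idempotent" comment relies on: a retry overwrites the earlier attempt instead of creating a second file.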
