
Example 26 with PCollectionView

use of org.apache.beam.sdk.values.PCollectionView in project beam by apache.

the class WriteFiles method createWrite.

/**
   * A write is performed as a sequence of three {@link ParDo}'s.
   *
   * <p>A singleton collection containing the WriteOperation is used as a side
   * input to a ParDo over the PCollection of elements to write. In this bundle-writing phase,
   * {@link WriteOperation#createWriter} is called to obtain a {@link Writer}.
   * {@link Writer#open} and {@link Writer#close} are called in
   * {@link DoFn.StartBundle} and {@link DoFn.FinishBundle}, respectively, and the
   * {@link Writer#write} method is called for every element in the bundle. The output
   * of this ParDo is a PCollection of <i>writer result</i> objects (see {@link FileBasedSink}
   * for a description of writer results), one for each bundle.
   *
   * <p>The final do-once ParDo uses a singleton collection as input and the collection of writer
   * results as a side-input. In this ParDo, {@link WriteOperation#finalize} is called
   * to finalize the write.
   *
   * <p>If the write of any element in the PCollection fails, {@link Writer#close} will be
   * called before the exception that caused the write to fail is propagated and the write result
   * will be discarded.
   *
   * <p>Since the {@link WriteOperation} is serialized after the initialization ParDo and
   * deserialized in the bundle-writing and finalization phases, any state change to the
   * WriteOperation object that occurs during initialization is visible in the latter
   * phases. However, the WriteOperation is not serialized after the bundle-writing
   * phase. This is why implementations should guarantee that
   * {@link WriteOperation#createWriter} does not mutate the WriteOperation.
   */
private PDone createWrite(PCollection<T> input) {
    Pipeline p = input.getPipeline();
    if (!windowedWrites) {
        // Re-window the data into the global window and remove any existing triggers.
        input = input.apply(Window.<T>into(new GlobalWindows()).triggering(DefaultTrigger.of()).discardingFiredPanes());
    }
    // Perform the per-bundle writes as a ParDo on the input PCollection (with the
    // WriteOperation as a side input) and collect the results of the writes in a
    // PCollection. There is a dependency between this ParDo and the first (the
    // WriteOperation PCollection as a side input), so this will happen after the
    // initial ParDo.
    PCollection<FileResult> results;
    final PCollectionView<Integer> numShardsView;
    Coder<BoundedWindow> shardedWindowCoder = (Coder<BoundedWindow>) input.getWindowingStrategy().getWindowFn().windowCoder();
    if (computeNumShards == null && numShardsProvider == null) {
        numShardsView = null;
        results = input.apply("WriteBundles", ParDo.of(windowedWrites ? new WriteWindowedBundles() : new WriteUnwindowedBundles()));
    } else {
        List<PCollectionView<?>> sideInputs = Lists.newArrayList();
        if (computeNumShards != null) {
            numShardsView = input.apply(computeNumShards);
            sideInputs.add(numShardsView);
        } else {
            numShardsView = null;
        }
        PCollection<KV<Integer, Iterable<T>>> sharded = input
            .apply("ApplyShardLabel", ParDo.of(
                    new ApplyShardingKey<T>(numShardsView, (numShardsView != null) ? null : numShardsProvider))
                .withSideInputs(sideInputs))
            .apply("GroupIntoShards", GroupByKey.<Integer, T>create());
        shardedWindowCoder = (Coder<BoundedWindow>) sharded.getWindowingStrategy().getWindowFn().windowCoder();
        results = sharded.apply("WriteShardedBundles", ParDo.of(new WriteShardedBundles()));
    }
    results.setCoder(FileResultCoder.of(shardedWindowCoder));
    if (windowedWrites) {
        // When processing streaming windowed writes, results will arrive multiple times. This
        // means we can't share the below implementation that turns the results into a side input,
        // as new data arriving into a side input does not trigger the listening DoFn. Instead
        // we aggregate the result set using a singleton GroupByKey, so the DoFn will be triggered
        // whenever new data arrives.
        PCollection<KV<Void, FileResult>> keyedResults = results.apply("AttachSingletonKey", WithKeys.<Void, FileResult>of((Void) null));
        keyedResults.setCoder(KvCoder.of(VoidCoder.of(), FileResultCoder.of(shardedWindowCoder)));
        // Is the continuation trigger sufficient?
        keyedResults.apply("FinalizeGroupByKey", GroupByKey.<Void, FileResult>create()).apply("Finalize", ParDo.of(new DoFn<KV<Void, Iterable<FileResult>>, Integer>() {

            @ProcessElement
            public void processElement(ProcessContext c) throws Exception {
                LOG.info("Finalizing write operation {}.", writeOperation);
                List<FileResult> results = Lists.newArrayList(c.element().getValue());
                writeOperation.finalize(results);
                LOG.debug("Done finalizing write operation");
            }
        }));
    } else {
        final PCollectionView<Iterable<FileResult>> resultsView = results.apply(View.<FileResult>asIterable());
        ImmutableList.Builder<PCollectionView<?>> sideInputs = ImmutableList.<PCollectionView<?>>builder().add(resultsView);
        if (numShardsView != null) {
            sideInputs.add(numShardsView);
        }
        // Finalize the write in another do-once ParDo on the singleton collection containing the
        // Writer. The results from the per-bundle writes are given as an Iterable side input.
        // The WriteOperation's state is the same as after its initialization in the first
        // do-once ParDo. There is a dependency between this ParDo and the parallel write (the writer
        // results collection as a side input), so it will happen after the parallel write.
        // For the non-windowed case, we guarantee that if no data is written but the user has
        // set numShards, then all shards will be written out as empty files. For this reason we
        // use a side input here.
        PCollection<Void> singletonCollection = p.apply(Create.of((Void) null));
        singletonCollection.apply("Finalize", ParDo.of(new DoFn<Void, Integer>() {

            @ProcessElement
            public void processElement(ProcessContext c) throws Exception {
                LOG.info("Finalizing write operation {}.", writeOperation);
                List<FileResult> results = Lists.newArrayList(c.sideInput(resultsView));
                LOG.debug("Side input initialized to finalize write operation {}.", writeOperation);
                // We must always output at least 1 shard, and honor user-specified numShards if
                // set.
                int minShardsNeeded;
                if (numShardsView != null) {
                    minShardsNeeded = c.sideInput(numShardsView);
                } else if (numShardsProvider != null) {
                    minShardsNeeded = numShardsProvider.get();
                } else {
                    minShardsNeeded = 1;
                }
                int extraShardsNeeded = minShardsNeeded - results.size();
                if (extraShardsNeeded > 0) {
                    LOG.info("Creating {} empty output shards in addition to {} written for a total of {}.", extraShardsNeeded, results.size(), minShardsNeeded);
                    for (int i = 0; i < extraShardsNeeded; ++i) {
                        Writer<T> writer = writeOperation.createWriter();
                        writer.openUnwindowed(UUID.randomUUID().toString(), UNKNOWN_SHARDNUM);
                        FileResult emptyWrite = writer.close();
                        results.add(emptyWrite);
                    }
                    LOG.debug("Done creating extra shards.");
                }
                writeOperation.finalize(results);
                LOG.debug("Done finalizing write operation {}", writeOperation);
            }
        }).withSideInputs(sideInputs.build()));
    }
    return PDone.in(input.getPipeline());
}
Also used : ImmutableList(com.google.common.collect.ImmutableList) BoundedWindow(org.apache.beam.sdk.transforms.windowing.BoundedWindow) ImmutableList(com.google.common.collect.ImmutableList) List(java.util.List) Coder(org.apache.beam.sdk.coders.Coder) KvCoder(org.apache.beam.sdk.coders.KvCoder) FileResultCoder(org.apache.beam.sdk.io.FileBasedSink.FileResultCoder) VoidCoder(org.apache.beam.sdk.coders.VoidCoder) GlobalWindows(org.apache.beam.sdk.transforms.windowing.GlobalWindows) KV(org.apache.beam.sdk.values.KV) Pipeline(org.apache.beam.sdk.Pipeline) PCollectionView(org.apache.beam.sdk.values.PCollectionView) DoFn(org.apache.beam.sdk.transforms.DoFn) FileResult(org.apache.beam.sdk.io.FileBasedSink.FileResult) Writer(org.apache.beam.sdk.io.FileBasedSink.Writer)
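To make the side-input pattern described in the Javadoc above concrete, here is a minimal sketch of materializing a PCollection as a singleton PCollectionView and reading it inside a ParDo. The pipeline, element values, and the countView/PairWithTotal names are illustrative and are not part of WriteFiles.

// Minimal sketch of the view-as-side-input pattern (illustrative names, not WriteFiles code).
PCollection<String> words = p.apply(Create.of("a", "b", "c"));

// Materialize a single value (here the element count) as a PCollectionView.
final PCollectionView<Long> countView =
    words.apply(Count.<String>globally()).apply(View.<Long>asSingleton());

// Read the view as a side input inside a ParDo; Beam supplies the value for the element's window.
words.apply("PairWithTotal", ParDo.of(new DoFn<String, KV<Long, String>>() {

    @ProcessElement
    public void processElement(ProcessContext c) {
        Long total = c.sideInput(countView);
        c.output(KV.of(total, c.element()));
    }
}).withSideInputs(countView));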

Example 27 with PCollectionView

use of org.apache.beam.sdk.values.PCollectionView in project beam by apache.

the class DoFnOperatorTest method testSideInputs.

public void testSideInputs(boolean keyed) throws Exception {
    WindowedValue.ValueOnlyWindowedValueCoder<String> windowedValueCoder = WindowedValue.getValueOnlyCoder(StringUtf8Coder.of());
    TupleTag<String> outputTag = new TupleTag<>("main-output");
    ImmutableMap<Integer, PCollectionView<?>> sideInputMapping = ImmutableMap.<Integer, PCollectionView<?>>builder().put(1, view1).put(2, view2).build();
    Coder<String> keyCoder = null;
    if (keyed) {
        keyCoder = StringUtf8Coder.of();
    }
    DoFnOperator<String, String, String> doFnOperator = new DoFnOperator<>(
        new IdentityDoFn<String>(),
        "stepName",
        windowedValueCoder,
        outputTag,
        Collections.<TupleTag<?>>emptyList(),
        new DoFnOperator.DefaultOutputManagerFactory<String>(),
        WindowingStrategy.globalDefault(),
        sideInputMapping, /* side-input mapping */
        ImmutableList.<PCollectionView<?>>of(view1, view2), /* side inputs */
        PipelineOptionsFactory.as(FlinkPipelineOptions.class),
        keyCoder);
    TwoInputStreamOperatorTestHarness<WindowedValue<String>, RawUnionValue, String> testHarness = new TwoInputStreamOperatorTestHarness<>(doFnOperator);
    if (keyed) {
        // we use a dummy key for the second input since it is considered to be broadcast
        testHarness = new KeyedTwoInputStreamOperatorTestHarness<>(doFnOperator, new StringKeySelector(), new DummyKeySelector(), BasicTypeInfo.STRING_TYPE_INFO);
    }
    testHarness.open();
    IntervalWindow firstWindow = new IntervalWindow(new Instant(0), new Instant(100));
    IntervalWindow secondWindow = new IntervalWindow(new Instant(0), new Instant(500));
    // push in side-input events for both windows; the operator keeps them for later main-input elements
    testHarness.processElement2(new StreamRecord<>(new RawUnionValue(1, valuesInWindow(ImmutableList.of("hello", "ciao"), new Instant(0), firstWindow))));
    testHarness.processElement2(new StreamRecord<>(new RawUnionValue(2, valuesInWindow(ImmutableList.of("foo", "bar"), new Instant(0), secondWindow))));
    // push in regular main-input elements
    WindowedValue<String> helloElement = valueInWindow("Hello", new Instant(0), firstWindow);
    WindowedValue<String> worldElement = valueInWindow("World", new Instant(1000), firstWindow);
    testHarness.processElement1(new StreamRecord<>(helloElement));
    testHarness.processElement1(new StreamRecord<>(worldElement));
    // push in further side-input events at a later timestamp
    testHarness.processElement2(new StreamRecord<>(new RawUnionValue(1, valuesInWindow(ImmutableList.of("hello", "ciao"), new Instant(1000), firstWindow))));
    testHarness.processElement2(new StreamRecord<>(new RawUnionValue(2, valuesInWindow(ImmutableList.of("foo", "bar"), new Instant(1000), secondWindow))));
    assertThat(this.<String>stripStreamRecordFromWindowedValue(testHarness.getOutput()), contains(helloElement, worldElement));
    testHarness.close();
}
Also used : TwoInputStreamOperatorTestHarness(org.apache.flink.streaming.util.TwoInputStreamOperatorTestHarness) KeyedTwoInputStreamOperatorTestHarness(org.apache.flink.streaming.util.KeyedTwoInputStreamOperatorTestHarness) RawUnionValue(org.apache.beam.sdk.transforms.join.RawUnionValue) Instant(org.joda.time.Instant) TupleTag(org.apache.beam.sdk.values.TupleTag) FlinkPipelineOptions(org.apache.beam.runners.flink.FlinkPipelineOptions) DoFnOperator(org.apache.beam.runners.flink.translation.wrappers.streaming.DoFnOperator) PCollectionView(org.apache.beam.sdk.values.PCollectionView) WindowedValue(org.apache.beam.sdk.util.WindowedValue) IntervalWindow(org.apache.beam.sdk.transforms.windowing.IntervalWindow)
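The view1 and view2 fields used above are built elsewhere in DoFnOperatorTest; as a rough sketch (pipeline and element values assumed, not taken from the test), iterable side-input views of that shape are typically created like this:

// Illustrative construction of two iterable side-input views; the actual test
// builds its view1/view2 fields elsewhere, so treat names and values as placeholders.
Pipeline p = Pipeline.create();
PCollectionView<Iterable<String>> view1 =
    p.apply("SideInput1", Create.of("hello", "ciao")).apply(View.<String>asIterable());
PCollectionView<Iterable<String>> view2 =
    p.apply("SideInput2", Create.of("foo", "bar")).apply(View.<String>asIterable());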

Example 28 with PCollectionView

use of org.apache.beam.sdk.values.PCollectionView in project beam by apache.

the class SideInputContainerTest method getReturnsLatestPaneInWindow.

@Test
public void getReturnsLatestPaneInWindow() throws Exception {
    WindowedValue<KV<String, Integer>> one = WindowedValue.of(KV.of("one", 1), new Instant(1L), SECOND_WINDOW, PaneInfo.createPane(true, false, Timing.EARLY));
    WindowedValue<KV<String, Integer>> two = WindowedValue.of(KV.of("two", 2), new Instant(20L), SECOND_WINDOW, PaneInfo.createPane(true, false, Timing.EARLY));
    container.write(mapView, ImmutableList.<WindowedValue<?>>of(one, two));
    Map<String, Integer> viewContents = container.createReaderForViews(ImmutableList.<PCollectionView<?>>of(mapView)).get(mapView, SECOND_WINDOW);
    assertThat(viewContents, hasEntry("one", 1));
    assertThat(viewContents, hasEntry("two", 2));
    assertThat(viewContents.size(), is(2));
    WindowedValue<KV<String, Integer>> three = WindowedValue.of(KV.of("three", 3), new Instant(300L), SECOND_WINDOW, PaneInfo.createPane(false, false, Timing.EARLY, 1, -1));
    container.write(mapView, ImmutableList.<WindowedValue<?>>of(three));
    Map<String, Integer> overwrittenViewContents = container.createReaderForViews(ImmutableList.<PCollectionView<?>>of(mapView)).get(mapView, SECOND_WINDOW);
    assertThat(overwrittenViewContents, hasEntry("three", 3));
    assertThat(overwrittenViewContents.size(), is(1));
}
Also used : PCollectionView(org.apache.beam.sdk.values.PCollectionView) Instant(org.joda.time.Instant) KV(org.apache.beam.sdk.values.KV) Test(org.junit.Test)
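In pipeline code, the mapView exercised by this test would normally be produced with View.asMap and consumed through ProcessContext.sideInput; a minimal sketch under those assumptions (element values illustrative):

// Sketch: build a map-shaped side input and look elements up in it from a ParDo.
PCollection<KV<String, Integer>> pairs =
    p.apply("CreatePairs", Create.of(KV.of("one", 1), KV.of("two", 2)));
final PCollectionView<Map<String, Integer>> mapView =
    pairs.apply(View.<String, Integer>asMap());

p.apply("CreateKeys", Create.of("one", "two")).apply("Lookup", ParDo.of(new DoFn<String, Integer>() {

    @ProcessElement
    public void processElement(ProcessContext c) {
        // The reader sees the contents of the latest pane written for the element's window,
        // which is the behavior getReturnsLatestPaneInWindow asserts above.
        c.output(c.sideInput(mapView).get(c.element()));
    }
}).withSideInputs(mapView));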

Example 29 with PCollectionView

use of org.apache.beam.sdk.values.PCollectionView in project beam by apache.

the class ParDoTest method testParDoWithTaggedOutputName.

@Test
public void testParDoWithTaggedOutputName() {
    pipeline.enableAbandonedNodeEnforcement(false);
    TupleTag<String> mainOutputTag = new TupleTag<String>("main") {
    };
    TupleTag<String> additionalOutputTag1 = new TupleTag<String>("output1") {
    };
    TupleTag<String> additionalOutputTag2 = new TupleTag<String>("output2") {
    };
    TupleTag<String> additionalOutputTag3 = new TupleTag<String>("output3") {
    };
    TupleTag<String> additionalOutputTagUnwritten = new TupleTag<String>("unwrittenOutput") {
    };
    PCollectionTuple outputs = pipeline
        .apply(Create.of(Arrays.asList(3, -42, 666)))
        .setName("MyInput")
        .apply("MyParDo", ParDo.of(
                new TestDoFn(
                    Arrays.<PCollectionView<Integer>>asList(),
                    Arrays.asList(additionalOutputTag1, additionalOutputTag2, additionalOutputTag3)))
            .withOutputTags(
                mainOutputTag,
                TupleTagList.of(additionalOutputTag3)
                    .and(additionalOutputTag1)
                    .and(additionalOutputTagUnwritten)
                    .and(additionalOutputTag2)));
    assertEquals("MyParDo.main", outputs.get(mainOutputTag).getName());
    assertEquals("MyParDo.output1", outputs.get(additionalOutputTag1).getName());
    assertEquals("MyParDo.output2", outputs.get(additionalOutputTag2).getName());
    assertEquals("MyParDo.output3", outputs.get(additionalOutputTag3).getName());
    assertEquals("MyParDo.unwrittenOutput", outputs.get(additionalOutputTagUnwritten).getName());
}
Also used : PCollectionView(org.apache.beam.sdk.values.PCollectionView) TupleTag(org.apache.beam.sdk.values.TupleTag) PCollectionTuple(org.apache.beam.sdk.values.PCollectionTuple) StringUtils.byteArrayToJsonString(org.apache.beam.sdk.util.StringUtils.byteArrayToJsonString) Matchers.containsString(org.hamcrest.Matchers.containsString) Test(org.junit.Test)
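The producing side of these tagged outputs lives inside TestDoFn, which is not shown here; the usual shape of such a DoFn is sketched below, reusing the tag names from the test (the Fanout transform name and output values are assumptions):

// Sketch of a DoFn writing to a main output plus one named additional output.
final TupleTag<String> mainOutputTag = new TupleTag<String>("main") {};
final TupleTag<String> additionalOutputTag1 = new TupleTag<String>("output1") {};

PCollectionTuple outputs = pipeline
    .apply(Create.of(Arrays.asList(3, -42, 666)))
    .apply("Fanout", ParDo.of(new DoFn<Integer, String>() {

        @ProcessElement
        public void processElement(ProcessContext c) {
            c.output("main: " + c.element());                        // main output
            c.output(additionalOutputTag1, "extra: " + c.element()); // additional output "output1"
        }
    }).withOutputTags(mainOutputTag, TupleTagList.of(additionalOutputTag1)));

// outputs.get(additionalOutputTag1) is the PCollection named "Fanout.output1",
// mirroring the "MyParDo.output1" naming asserted above.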

Example 30 with PCollectionView

use of org.apache.beam.sdk.values.PCollectionView in project beam by apache.

the class ParDoTest method testParDoWithTaggedOutput.

@Test
@Category(ValidatesRunner.class)
public void testParDoWithTaggedOutput() {
    List<Integer> inputs = Arrays.asList(3, -42, 666);
    TupleTag<String> mainOutputTag = new TupleTag<String>("main") {
    };
    TupleTag<String> additionalOutputTag1 = new TupleTag<String>("additional1") {
    };
    TupleTag<String> additionalOutputTag2 = new TupleTag<String>("additional2") {
    };
    TupleTag<String> additionalOutputTag3 = new TupleTag<String>("additional3") {
    };
    TupleTag<String> additionalOutputTagUnwritten = new TupleTag<String>("unwrittenOutput") {
    };
    PCollectionTuple outputs = pipeline
        .apply(Create.of(inputs))
        .apply(ParDo.of(
                new TestDoFn(
                    Arrays.<PCollectionView<Integer>>asList(),
                    Arrays.asList(additionalOutputTag1, additionalOutputTag2, additionalOutputTag3)))
            .withOutputTags(
                mainOutputTag,
                TupleTagList.of(additionalOutputTag3)
                    .and(additionalOutputTag1)
                    .and(additionalOutputTagUnwritten)
                    .and(additionalOutputTag2)));
    PAssert.that(outputs.get(mainOutputTag)).satisfies(ParDoTest.HasExpectedOutput.forInput(inputs));
    PAssert.that(outputs.get(additionalOutputTag1)).satisfies(ParDoTest.HasExpectedOutput.forInput(inputs).fromOutput(additionalOutputTag1));
    PAssert.that(outputs.get(additionalOutputTag2)).satisfies(ParDoTest.HasExpectedOutput.forInput(inputs).fromOutput(additionalOutputTag2));
    PAssert.that(outputs.get(additionalOutputTag3)).satisfies(ParDoTest.HasExpectedOutput.forInput(inputs).fromOutput(additionalOutputTag3));
    PAssert.that(outputs.get(additionalOutputTagUnwritten)).empty();
    pipeline.run();
}
Also used : PCollectionView(org.apache.beam.sdk.values.PCollectionView) TupleTag(org.apache.beam.sdk.values.TupleTag) PCollectionTuple(org.apache.beam.sdk.values.PCollectionTuple) StringUtils.byteArrayToJsonString(org.apache.beam.sdk.util.StringUtils.byteArrayToJsonString) Matchers.containsString(org.hamcrest.Matchers.containsString) Category(org.junit.experimental.categories.Category) Test(org.junit.Test)

Aggregations

PCollectionView (org.apache.beam.sdk.values.PCollectionView): 67 uses
Map (java.util.Map): 29 uses
HashMap (java.util.HashMap): 28 uses
Test (org.junit.Test): 28 uses
TupleTag (org.apache.beam.sdk.values.TupleTag): 27 uses
BoundedWindow (org.apache.beam.sdk.transforms.windowing.BoundedWindow): 22 uses
Coder (org.apache.beam.sdk.coders.Coder): 21 uses
KV (org.apache.beam.sdk.values.KV): 20 uses
Instant (org.joda.time.Instant): 20 uses
KvCoder (org.apache.beam.sdk.coders.KvCoder): 18 uses
WindowedValue (org.apache.beam.sdk.util.WindowedValue): 18 uses
PCollection (org.apache.beam.sdk.values.PCollection): 18 uses
DoFn (org.apache.beam.sdk.transforms.DoFn): 16 uses
ArrayList (java.util.ArrayList): 15 uses
IntervalWindow (org.apache.beam.sdk.transforms.windowing.IntervalWindow): 14 uses
List (java.util.List): 13 uses
ImmutableMap (org.apache.beam.vendor.guava.v26_0_jre.com.google.common.collect.ImmutableMap): 13 uses
IOException (java.io.IOException): 12 uses
RunnerApi (org.apache.beam.model.pipeline.v1.RunnerApi): 12 uses
ByteString (org.apache.beam.vendor.grpc.v1p43p2.com.google.protobuf.ByteString): 10 uses