Search in sources :

Example 11 with DataflowPackage

use of com.google.api.services.dataflow.model.DataflowPackage in project beam by apache.

the class DataflowPipelineTranslatorTest method testTaggedNamesOverridden.

/**
   * Test that in translation the name for collections of a multi-output ParDo - a special case
   * because the user can name tags - are overridden to be what the Dataflow service expects.
   */
@Test
public void testTaggedNamesOverridden() throws Exception {
    DataflowPipelineOptions options = buildPipelineOptions();
    DataflowRunner runner = DataflowRunner.fromOptions(options);
    options.setStreaming(false);
    DataflowPipelineTranslator translator = DataflowPipelineTranslator.fromOptions(options);
    Pipeline pipeline = Pipeline.create(options);
    TupleTag<Integer> tag1 = new TupleTag<Integer>("frazzle") {
    };
    TupleTag<Integer> tag2 = new TupleTag<Integer>("bazzle") {
    };
    TupleTag<Integer> tag3 = new TupleTag<Integer>() {
    };
    PCollectionTuple outputs = pipeline.apply(Create.of(3)).apply(ParDo.of(new DoFn<Integer, Integer>() {

        @ProcessElement
        public void drop() {
        }
    }).withOutputTags(tag1, TupleTagList.of(tag2).and(tag3)));
    outputs.get(tag1).setName("bizbazzle");
    outputs.get(tag2).setName("gonzaggle");
    outputs.get(tag3).setName("froonazzle");
    runner.replaceTransforms(pipeline);
    Job job = translator.translate(pipeline, runner, Collections.<DataflowPackage>emptyList()).getJob();
    // The ParDo step
    Step step = job.getSteps().get(1);
    String stepName = Structs.getString(step.getProperties(), PropertyNames.USER_NAME);
    List<Map<String, Object>> outputInfos = Structs.getListOfMaps(step.getProperties(), PropertyNames.OUTPUT_INFO, null);
    assertThat(outputInfos.size(), equalTo(3));
    // The names set by the user _and_ the tags _must_ be ignored, or metrics will not show up.
    for (int i = 0; i < outputInfos.size(); ++i) {
        assertThat(Structs.getString(outputInfos.get(i), PropertyNames.USER_NAME), equalTo(String.format("%s.out%s", stepName, i)));
    }
}
Also used : DataflowPipelineOptions(org.apache.beam.runners.dataflow.options.DataflowPipelineOptions) TupleTag(org.apache.beam.sdk.values.TupleTag) Step(com.google.api.services.dataflow.model.Step) Structs.getString(org.apache.beam.runners.dataflow.util.Structs.getString) Pipeline(org.apache.beam.sdk.Pipeline) PCollectionTuple(org.apache.beam.sdk.values.PCollectionTuple) Job(com.google.api.services.dataflow.model.Job) Map(java.util.Map) ImmutableMap(com.google.common.collect.ImmutableMap) DataflowPackage(com.google.api.services.dataflow.model.DataflowPackage) Test(org.junit.Test)

Example 12 with DataflowPackage

use of com.google.api.services.dataflow.model.DataflowPackage in project beam by apache.

the class PackageUtil method stageClasspathElements.

// Visible for testing.
static List<DataflowPackage> stageClasspathElements(Collection<String> classpathElements, final String stagingPath, final Sleeper retrySleeper, ListeningExecutorService executorService, final CreateOptions createOptions) {
    LOG.info("Uploading {} files from PipelineOptions.filesToStage to staging location to " + "prepare for execution.", classpathElements.size());
    if (classpathElements.size() > SANE_CLASSPATH_SIZE) {
        LOG.warn("Your classpath contains {} elements, which Google Cloud Dataflow automatically " + "copies to all workers. Having this many entries on your classpath may be indicative " + "of an issue in your pipeline. You may want to consider trimming the classpath to " + "necessary dependencies only, using --filesToStage pipeline option to override " + "what files are being staged, or bundling several dependencies into one.", classpathElements.size());
    }
    checkArgument(stagingPath != null, "Can't stage classpath elements because no staging location has been provided");
    // Inline a copy here because the inner code returns an immutable list and we want to mutate it.
    List<PackageAttributes> packageAttributes = new LinkedList<>(computePackageAttributes(classpathElements, stagingPath, executorService));
    // Compute the returned list of DataflowPackage objects here so that they are returned in the
    // same order as on the classpath.
    List<DataflowPackage> packages = Lists.newArrayListWithExpectedSize(packageAttributes.size());
    for (final PackageAttributes attributes : packageAttributes) {
        packages.add(attributes.getDataflowPackage());
    }
    // Order package attributes in descending size order so that we upload the largest files first.
    Collections.sort(packageAttributes, new PackageUploadOrder());
    final AtomicInteger numUploaded = new AtomicInteger(0);
    final AtomicInteger numCached = new AtomicInteger(0);
    List<ListenableFuture<?>> futures = new LinkedList<>();
    for (final PackageAttributes attributes : packageAttributes) {
        futures.add(executorService.submit(new Runnable() {

            @Override
            public void run() {
                stageOnePackage(attributes, numUploaded, numCached, retrySleeper, createOptions);
            }
        }));
    }
    try {
        Futures.allAsList(futures).get();
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new RuntimeException("Interrupted while staging packages", e);
    } catch (ExecutionException e) {
        throw new RuntimeException("Error while staging packages", e.getCause());
    }
    LOG.info("Staging files complete: {} files cached, {} files newly uploaded", numCached.get(), numUploaded.get());
    return packages;
}
Also used : LinkedList(java.util.LinkedList) AtomicInteger(java.util.concurrent.atomic.AtomicInteger) ListenableFuture(com.google.common.util.concurrent.ListenableFuture) ExecutionException(java.util.concurrent.ExecutionException) DataflowPackage(com.google.api.services.dataflow.model.DataflowPackage)

Example 13 with DataflowPackage

use of com.google.api.services.dataflow.model.DataflowPackage in project beam by apache.

the class DataflowPipelineTranslatorTest method testScalingAlgorithmMissing.

@Test
public void testScalingAlgorithmMissing() throws IOException {
    DataflowPipelineOptions options = buildPipelineOptions();
    Pipeline p = buildPipeline(options);
    p.traverseTopologically(new RecordingPipelineVisitor());
    Job job = DataflowPipelineTranslator.fromOptions(options).translate(p, DataflowRunner.fromOptions(options), Collections.<DataflowPackage>emptyList()).getJob();
    assertEquals(1, job.getEnvironment().getWorkerPools().size());
    // Autoscaling settings are always set.
    assertNull(job.getEnvironment().getWorkerPools().get(0).getAutoscalingSettings().getAlgorithm());
    assertEquals(0, job.getEnvironment().getWorkerPools().get(0).getAutoscalingSettings().getMaxNumWorkers().intValue());
}
Also used : DataflowPipelineOptions(org.apache.beam.runners.dataflow.options.DataflowPipelineOptions) Job(com.google.api.services.dataflow.model.Job) DataflowPackage(com.google.api.services.dataflow.model.DataflowPackage) Pipeline(org.apache.beam.sdk.Pipeline) Test(org.junit.Test)

Example 14 with DataflowPackage

use of com.google.api.services.dataflow.model.DataflowPackage in project beam by apache.

the class DataflowPipelineTranslatorTest method testStepDisplayData.

@Test
public void testStepDisplayData() throws Exception {
    DataflowPipelineOptions options = buildPipelineOptions();
    DataflowPipelineTranslator translator = DataflowPipelineTranslator.fromOptions(options);
    Pipeline pipeline = Pipeline.create(options);
    DoFn<Integer, Integer> fn1 = new DoFn<Integer, Integer>() {

        @ProcessElement
        public void processElement(ProcessContext c) throws Exception {
            c.output(c.element());
        }

        @Override
        public void populateDisplayData(DisplayData.Builder builder) {
            builder.add(DisplayData.item("foo", "bar")).add(DisplayData.item("foo2", DataflowPipelineTranslatorTest.class).withLabel("Test Class").withLinkUrl("http://www.google.com"));
        }
    };
    DoFn<Integer, Integer> fn2 = new DoFn<Integer, Integer>() {

        @ProcessElement
        public void processElement(ProcessContext c) throws Exception {
            c.output(c.element());
        }

        @Override
        public void populateDisplayData(DisplayData.Builder builder) {
            builder.add(DisplayData.item("foo3", 1234));
        }
    };
    ParDo.SingleOutput<Integer, Integer> parDo1 = ParDo.of(fn1);
    ParDo.SingleOutput<Integer, Integer> parDo2 = ParDo.of(fn2);
    pipeline.apply(Create.of(1, 2, 3)).apply(parDo1).apply(parDo2);
    DataflowRunner runner = DataflowRunner.fromOptions(options);
    runner.replaceTransforms(pipeline);
    Job job = translator.translate(pipeline, runner, Collections.<DataflowPackage>emptyList()).getJob();
    assertAllStepOutputsHaveUniqueIds(job);
    List<Step> steps = job.getSteps();
    assertEquals(3, steps.size());
    Map<String, Object> parDo1Properties = steps.get(1).getProperties();
    Map<String, Object> parDo2Properties = steps.get(2).getProperties();
    assertThat(parDo1Properties, hasKey("display_data"));
    @SuppressWarnings("unchecked") Collection<Map<String, String>> fn1displayData = (Collection<Map<String, String>>) parDo1Properties.get("display_data");
    @SuppressWarnings("unchecked") Collection<Map<String, String>> fn2displayData = (Collection<Map<String, String>>) parDo2Properties.get("display_data");
    ImmutableSet<ImmutableMap<String, Object>> expectedFn1DisplayData = ImmutableSet.of(ImmutableMap.<String, Object>builder().put("key", "foo").put("type", "STRING").put("value", "bar").put("namespace", fn1.getClass().getName()).build(), ImmutableMap.<String, Object>builder().put("key", "fn").put("label", "Transform Function").put("type", "JAVA_CLASS").put("value", fn1.getClass().getName()).put("shortValue", fn1.getClass().getSimpleName()).put("namespace", parDo1.getClass().getName()).build(), ImmutableMap.<String, Object>builder().put("key", "foo2").put("type", "JAVA_CLASS").put("value", DataflowPipelineTranslatorTest.class.getName()).put("shortValue", DataflowPipelineTranslatorTest.class.getSimpleName()).put("namespace", fn1.getClass().getName()).put("label", "Test Class").put("linkUrl", "http://www.google.com").build());
    ImmutableSet<ImmutableMap<String, Object>> expectedFn2DisplayData = ImmutableSet.of(ImmutableMap.<String, Object>builder().put("key", "fn").put("label", "Transform Function").put("type", "JAVA_CLASS").put("value", fn2.getClass().getName()).put("shortValue", fn2.getClass().getSimpleName()).put("namespace", parDo2.getClass().getName()).build(), ImmutableMap.<String, Object>builder().put("key", "foo3").put("type", "INTEGER").put("value", 1234L).put("namespace", fn2.getClass().getName()).build());
    assertEquals(expectedFn1DisplayData, ImmutableSet.copyOf(fn1displayData));
    assertEquals(expectedFn2DisplayData, ImmutableSet.copyOf(fn2displayData));
}
Also used : Step(com.google.api.services.dataflow.model.Step) Structs.getString(org.apache.beam.runners.dataflow.util.Structs.getString) Job(com.google.api.services.dataflow.model.Job) DataflowPipelineOptions(org.apache.beam.runners.dataflow.options.DataflowPipelineOptions) ImmutableMap(com.google.common.collect.ImmutableMap) Pipeline(org.apache.beam.sdk.Pipeline) DoFn(org.apache.beam.sdk.transforms.DoFn) ParDo(org.apache.beam.sdk.transforms.ParDo) Collection(java.util.Collection) PCollection(org.apache.beam.sdk.values.PCollection) Structs.addObject(org.apache.beam.runners.dataflow.util.Structs.addObject) Map(java.util.Map) ImmutableMap(com.google.common.collect.ImmutableMap) DataflowPackage(com.google.api.services.dataflow.model.DataflowPackage) Test(org.junit.Test)

Example 15 with DataflowPackage

use of com.google.api.services.dataflow.model.DataflowPackage in project beam by apache.

the class DataflowPipelineTranslatorTest method testToSingletonTranslationWithIsmSideInput.

@Test
public void testToSingletonTranslationWithIsmSideInput() throws Exception {
    // A "change detector" test that makes sure the translation
    // of getting a PCollectionView<T> does not change
    // in bad ways during refactor
    DataflowPipelineOptions options = buildPipelineOptions();
    DataflowPipelineTranslator translator = DataflowPipelineTranslator.fromOptions(options);
    Pipeline pipeline = Pipeline.create(options);
    pipeline.apply(Create.of(1)).apply(View.<Integer>asSingleton());
    DataflowRunner runner = DataflowRunner.fromOptions(options);
    runner.replaceTransforms(pipeline);
    Job job = translator.translate(pipeline, runner, Collections.<DataflowPackage>emptyList()).getJob();
    assertAllStepOutputsHaveUniqueIds(job);
    List<Step> steps = job.getSteps();
    assertEquals(5, steps.size());
    @SuppressWarnings("unchecked") List<Map<String, Object>> toIsmRecordOutputs = (List<Map<String, Object>>) steps.get(3).getProperties().get(PropertyNames.OUTPUT_INFO);
    assertTrue(Structs.getBoolean(Iterables.getOnlyElement(toIsmRecordOutputs), "use_indexed_format"));
    Step collectionToSingletonStep = steps.get(4);
    assertEquals("CollectionToSingleton", collectionToSingletonStep.getKind());
}
Also used : DataflowPipelineOptions(org.apache.beam.runners.dataflow.options.DataflowPipelineOptions) Step(com.google.api.services.dataflow.model.Step) Structs.getString(org.apache.beam.runners.dataflow.util.Structs.getString) Pipeline(org.apache.beam.sdk.Pipeline) List(java.util.List) TupleTagList(org.apache.beam.sdk.values.TupleTagList) ArrayList(java.util.ArrayList) ImmutableList(com.google.common.collect.ImmutableList) LinkedList(java.util.LinkedList) Structs.addObject(org.apache.beam.runners.dataflow.util.Structs.addObject) Job(com.google.api.services.dataflow.model.Job) Map(java.util.Map) ImmutableMap(com.google.common.collect.ImmutableMap) DataflowPackage(com.google.api.services.dataflow.model.DataflowPackage) Test(org.junit.Test)

Aggregations

DataflowPackage (com.google.api.services.dataflow.model.DataflowPackage)34 Test (org.junit.Test)30 Job (com.google.api.services.dataflow.model.Job)22 DataflowPipelineOptions (org.apache.beam.runners.dataflow.options.DataflowPipelineOptions)22 Pipeline (org.apache.beam.sdk.Pipeline)21 Structs.getString (org.apache.beam.runners.dataflow.util.Structs.getString)13 File (java.io.File)11 Step (com.google.api.services.dataflow.model.Step)8 ImmutableMap (com.google.common.collect.ImmutableMap)6 Map (java.util.Map)6 Structs.addObject (org.apache.beam.runners.dataflow.util.Structs.addObject)6 FileNotFoundException (java.io.FileNotFoundException)5 GcsPath (org.apache.beam.sdk.util.gcsfs.GcsPath)4 Matchers.anyString (org.mockito.Matchers.anyString)4 Pipe (java.nio.channels.Pipe)3 LinkedList (java.util.LinkedList)3 WorkerPool (com.google.api.services.dataflow.model.WorkerPool)2 ImmutableList (com.google.common.collect.ImmutableList)2 IOException (java.io.IOException)2 ArrayList (java.util.ArrayList)2