Example 61 with DoFn

Use of org.apache.beam.sdk.transforms.DoFn in project beam by apache.

From class UserParDoFnFactory, method create:

@Override
public ParDoFn create(
    PipelineOptions options,
    CloudObject cloudUserFn,
    @Nullable List<SideInputInfo> sideInputInfos,
    TupleTag<?> mainOutputTag,
    Map<TupleTag<?>, Integer> outputTupleTagsToReceiverIndices,
    DataflowExecutionContext<?> executionContext,
    DataflowOperationContext operationContext)
    throws Exception {
    DoFnInstanceManager instanceManager =
        fnCache.get(
            operationContext.nameContext().systemName(),
            () -> DoFnInstanceManagers.cloningPool(doFnExtractor.getDoFnInfo(cloudUserFn), options));
    DoFnInfo<?, ?> doFnInfo = instanceManager.peek();
    DataflowExecutionContext.DataflowStepContext stepContext =
        executionContext.getStepContext(operationContext);
    Iterable<PCollectionView<?>> sideInputViews = doFnInfo.getSideInputViews();
    SideInputReader sideInputReader =
        executionContext.getSideInputReader(sideInputInfos, sideInputViews, operationContext);
    if (doFnInfo.getDoFn() instanceof BatchStatefulParDoOverrides.BatchStatefulDoFn) {
        // HACK: BatchStatefulDoFn is a class from DataflowRunner's overrides
        // that just instructs the worker to execute it differently. This will
        // be replaced by metadata in the Runner API payload.
        BatchStatefulParDoOverrides.BatchStatefulDoFn fn =
            (BatchStatefulParDoOverrides.BatchStatefulDoFn) doFnInfo.getDoFn();
        DoFn underlyingFn = fn.getUnderlyingDoFn();
        return new BatchModeUngroupingParDoFn(
            (BatchModeExecutionContext.StepContext) stepContext,
            new SimpleParDoFn(
                options,
                DoFnInstanceManagers.singleInstance(doFnInfo.withFn(underlyingFn)),
                sideInputReader,
                doFnInfo.getMainOutput(),
                outputTupleTagsToReceiverIndices,
                stepContext,
                operationContext,
                doFnInfo.getDoFnSchemaInformation(),
                doFnInfo.getSideInputMapping(),
                runnerFactory));
    } else if (doFnInfo.getDoFn() instanceof StreamingPCollectionViewWriterFn) {
        // HACK: StreamingPCollectionViewWriterFn is a class from
        // DataflowPipelineTranslator. Using the class as an indicator is a migration path
        // to simply having an indicator string.
        checkArgument(
            stepContext instanceof StreamingModeExecutionContext.StreamingModeStepContext,
            "stepContext must be a StreamingModeStepContext to use StreamingPCollectionViewWriterFn");
        DataflowRunner.StreamingPCollectionViewWriterFn<Object> writerFn =
            (StreamingPCollectionViewWriterFn<Object>) doFnInfo.getDoFn();
        return new StreamingPCollectionViewWriterParDoFn(
            (StreamingModeExecutionContext.StreamingModeStepContext) stepContext,
            writerFn.getView().getTagInternal(),
            writerFn.getDataCoder(),
            (Coder<BoundedWindow>) doFnInfo.getWindowingStrategy().getWindowFn().windowCoder());
    } else {
        return new SimpleParDoFn(
            options,
            instanceManager,
            sideInputReader,
            doFnInfo.getMainOutput(),
            outputTupleTagsToReceiverIndices,
            stepContext,
            operationContext,
            doFnInfo.getDoFnSchemaInformation(),
            doFnInfo.getSideInputMapping(),
            runnerFactory);
    }
}
Also used: Coder (org.apache.beam.sdk.coders.Coder), SideInputReader (org.apache.beam.runners.core.SideInputReader), BatchStatefulParDoOverrides (org.apache.beam.runners.dataflow.BatchStatefulParDoOverrides), StreamingPCollectionViewWriterFn (org.apache.beam.runners.dataflow.DataflowRunner.StreamingPCollectionViewWriterFn), PCollectionView (org.apache.beam.sdk.values.PCollectionView), DoFn (org.apache.beam.sdk.transforms.DoFn), ParDoFn (org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoFn), CloudObject (org.apache.beam.runners.dataflow.util.CloudObject)
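The factory above hands user DoFns to DoFnInstanceManagers (cloningPool / singleInstance), which reuse instances across work items. On the user side, instance reuse is what makes DoFn's lifecycle methods matter. A minimal illustrative DoFn using those hooks (not part of the factory code above):

import org.apache.beam.sdk.transforms.DoFn;

// Illustrative DoFn: expensive resources go in @Setup/@Teardown, which run
// once per DoFn instance, so pooled/cloned instances pay the cost once each.
class ParsingDoFn extends DoFn<String, Integer> {
    private transient StringBuilder scratch; // stand-in for an expensive resource

    @Setup
    public void setup() {
        scratch = new StringBuilder();
    }

    @StartBundle
    public void startBundle() {
        scratch.setLength(0); // reset per-bundle state
    }

    @ProcessElement
    public void processElement(ProcessContext c) {
        c.output(c.element().length());
    }

    @Teardown
    public void teardown() {
        scratch = null; // release the resource
    }
}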

Example 62 with DoFn

Use of org.apache.beam.sdk.transforms.DoFn in project beam by apache.

From class Query13, method expand:

@Override
public PCollection<Event> expand(PCollection<Event> events) {
    final Coder<Event> coder = events.getCoder();
    return events
        .apply("Pair with random key", ParDo.of(new AssignShardFn<>(configuration.numKeyBuckets)))
        .apply(GroupByKey.create())
        .apply(
            "ExpandIterable",
            ParDo.of(
                new DoFn<KV<Integer, Iterable<Event>>, Event>() {
                    @ProcessElement
                    public void processElement(
                        @Element KV<Integer, Iterable<Event>> element, OutputReceiver<Event> r) {
                        for (Event value : element.getValue()) {
                            r.output(value);
                        }
                    }
                }))
        .apply(
            name + ".Serialize",
            ParDo.of(
                new DoFn<Event, Event>() {
                    private final Counter bytesMetric = Metrics.counter(name, "serde-bytes");

                    private final Random random = new Random();

                    private double pardoCPUFactor =
                        (configuration.pardoCPUFactor >= 0.0 && configuration.pardoCPUFactor <= 1.0)
                            ? configuration.pardoCPUFactor
                            : 1.0;

                    @ProcessElement
                    public void processElement(ProcessContext c) throws CoderException, IOException {
                        Event event;
                        if (random.nextDouble() <= pardoCPUFactor) {
                            event = encodeDecode(coder, c.element(), bytesMetric);
                        } else {
                            event = c.element();
                        }
                        c.output(event);
                    }
                }));
}
Also used: KV (org.apache.beam.sdk.values.KV), DoFn (org.apache.beam.sdk.transforms.DoFn), Counter (org.apache.beam.sdk.metrics.Counter), Random (java.util.Random), Event (org.apache.beam.sdk.nexmark.model.Event)
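The encodeDecode helper called above is defined elsewhere in the Nexmark suite and is not shown here. A minimal sketch of such a coder round-trip, assuming Beam's CoderUtils and byte accounting via the Counter (the body below is an assumption, not the actual Nexmark implementation):

import java.io.IOException;
import org.apache.beam.sdk.coders.Coder;
import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.util.CoderUtils;

class SerdeUtil {
    // Hypothetical helper: encodes the value with its coder, counts the bytes,
    // then decodes it back, exercising the full serialization path.
    static <T> T encodeDecode(Coder<T> coder, T value, Counter bytesMetric) throws IOException {
        byte[] encoded = CoderUtils.encodeToByteArray(coder, value);
        bytesMetric.inc(encoded.length);
        return CoderUtils.decodeFromByteArray(coder, encoded);
    }
}

This is why pardoCPUFactor works as a CPU knob: each round-trip forces a full encode/decode of the element, and the factor controls what fraction of elements pay that cost.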

Example 63 with DoFn

Use of org.apache.beam.sdk.transforms.DoFn in project beam by apache.

From class Query3, method expand:

@Override
public PCollection<NameCityStateId> expand(PCollection<Event> events) {
    PCollection<KV<Long, Event>> auctionsBySellerId =
        events
            .apply(NexmarkQueryUtil.JUST_NEW_AUCTIONS)
            .apply(name + ".InCategory", Filter.by(auction -> auction.category == 10))
            .apply(
                "EventByAuctionSeller",
                ParDo.of(
                    new DoFn<Auction, KV<Long, Event>>() {
                        @ProcessElement
                        public void processElement(ProcessContext c) {
                            Event e = new Event();
                            e.newAuction = c.element();
                            c.output(KV.of(c.element().seller, e));
                        }
                    }));
    PCollection<KV<Long, Event>> personsById =
        events
            .apply(NexmarkQueryUtil.JUST_NEW_PERSONS)
            .apply(
                name + ".InState",
                Filter.by(
                    person ->
                        "OR".equals(person.state)
                            || "ID".equals(person.state)
                            || "CA".equals(person.state)))
            .apply(
                "EventByPersonId",
                ParDo.of(
                    new DoFn<Person, KV<Long, Event>>() {
                        @ProcessElement
                        public void processElement(ProcessContext c) {
                            Event e = new Event();
                            e.newPerson = c.element();
                            c.output(KV.of(c.element().id, e));
                        }
                    }));
    // Join auctions and people.
    return PCollectionList.of(auctionsBySellerId)
        .and(personsById)
        .apply(Flatten.pCollections())
        .apply(name + ".Join", ParDo.of(joinDoFn))
        .apply(
            name + ".Project",
            ParDo.of(
                new DoFn<KV<Auction, Person>, NameCityStateId>() {
                    @ProcessElement
                    public void processElement(ProcessContext c) {
                        Auction auction = c.element().getKey();
                        Person person = c.element().getValue();
                        c.output(new NameCityStateId(person.name, person.city, person.state, auction.id));
                    }
                }));
}
Also used: DoFn (org.apache.beam.sdk.transforms.DoFn), NameCityStateId (org.apache.beam.sdk.nexmark.model.NameCityStateId), Auction (org.apache.beam.sdk.nexmark.model.Auction), Event (org.apache.beam.sdk.nexmark.model.Event), KV (org.apache.beam.sdk.values.KV), Person (org.apache.beam.sdk.nexmark.model.Person)
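The joinDoFn applied above is defined elsewhere in Query3 and is not shown. For comparison, two keyed collections like these can also be joined with Beam's CoGroupByKey; a minimal sketch, assuming inputs already typed as PCollection<KV<Long, Auction>> and PCollection<KV<Long, Person>> (names assumed) and ignoring Query3's incremental stateful-join semantics:

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.join.CoGbkResult;
import org.apache.beam.sdk.transforms.join.CoGroupByKey;
import org.apache.beam.sdk.transforms.join.KeyedPCollectionTuple;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TupleTag;

final TupleTag<Auction> auctionTag = new TupleTag<>();
final TupleTag<Person> personTag = new TupleTag<>();

PCollection<KV<Auction, Person>> joined =
    KeyedPCollectionTuple.of(auctionTag, auctionsBySeller) // PCollection<KV<Long, Auction>>
        .and(personTag, personsById)                       // PCollection<KV<Long, Person>>
        .apply(CoGroupByKey.create())
        .apply(
            ParDo.of(
                new DoFn<KV<Long, CoGbkResult>, KV<Auction, Person>>() {
                    @ProcessElement
                    public void processElement(ProcessContext c) {
                        // Emit every (auction, person) pair sharing the same seller key.
                        for (Person person : c.element().getValue().getAll(personTag)) {
                            for (Auction auction : c.element().getValue().getAll(auctionTag)) {
                                c.output(KV.of(auction, person));
                            }
                        }
                    }
                }));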

Example 64 with DoFn

Use of org.apache.beam.sdk.transforms.DoFn in project beam by apache.

From class BigQueryIOReadTest, method checkSetsProject:

private void checkSetsProject(String projectId) throws Exception {
    fakeDatasetService.createDataset(projectId, "dataset-id", "", "", null);
    String tableId = "sometable";
    TableReference tableReference =
        new TableReference().setProjectId(projectId).setDatasetId("dataset-id").setTableId(tableId);
    fakeDatasetService.createTable(
        new Table()
            .setTableReference(tableReference)
            .setSchema(
                new TableSchema()
                    .setFields(
                        ImmutableList.of(
                            new TableFieldSchema().setName("name").setType("STRING"),
                            new TableFieldSchema().setName("number").setType("INTEGER")))));
    FakeBigQueryServices fakeBqServices =
        new FakeBigQueryServices()
            .withJobService(new FakeJobService())
            .withDatasetService(fakeDatasetService);
    List<TableRow> expected =
        ImmutableList.of(
            new TableRow().set("name", "a").set("number", 1L),
            new TableRow().set("name", "b").set("number", 2L),
            new TableRow().set("name", "c").set("number", 3L),
            new TableRow().set("name", "d").set("number", 4L),
            new TableRow().set("name", "e").set("number", 5L),
            new TableRow().set("name", "f").set("number", 6L));
    fakeDatasetService.insertAll(tableReference, expected, null);
    TableReference tableRef = new TableReference().setDatasetId("dataset-id").setTableId(tableId);
    PCollection<KV<String, Long>> output =
        p.apply(BigQueryIO.read().from(tableRef).withTestServices(fakeBqServices))
            .apply(
                ParDo.of(
                    new DoFn<TableRow, KV<String, Long>>() {
                        @ProcessElement
                        public void processElement(ProcessContext c) throws Exception {
                            c.output(
                                KV.of(
                                    (String) c.element().get("name"),
                                    Long.valueOf((String) c.element().get("number"))));
                        }
                    }));
    PAssert.that(output)
        .containsInAnyOrder(
            ImmutableList.of(
                KV.of("a", 1L), KV.of("b", 2L), KV.of("c", 3L),
                KV.of("d", 4L), KV.of("e", 5L), KV.of("f", 6L)));
    p.run();
}
Also used: Table (com.google.api.services.bigquery.model.Table), TableSchema (com.google.api.services.bigquery.model.TableSchema), ByteString (com.google.protobuf.ByteString), KV (org.apache.beam.sdk.values.KV), TableFieldSchema (com.google.api.services.bigquery.model.TableFieldSchema), TableReference (com.google.api.services.bigquery.model.TableReference), BigQueryResourceNaming.createTempTableReference (org.apache.beam.sdk.io.gcp.bigquery.BigQueryResourceNaming.createTempTableReference), DoFn (org.apache.beam.sdk.transforms.DoFn), FakeJobService (org.apache.beam.sdk.io.gcp.testing.FakeJobService), TableRow (com.google.api.services.bigquery.model.TableRow), FakeBigQueryServices (org.apache.beam.sdk.io.gcp.testing.FakeBigQueryServices)
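The anonymous DoFn above is a stateless 1:1 reshape of each TableRow, so it could equally be written with Beam's MapElements and a lambda. A sketch of that alternative formulation (rows stands for the PCollection<TableRow> produced by the read; this is not what the test itself uses):

import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

// MapElements needs an explicit output TypeDescriptor so Beam can infer a coder,
// and an explicit lambda parameter type so the input type is known.
PCollection<KV<String, Long>> output =
    rows.apply(
        MapElements.into(TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.longs()))
            .via(
                (TableRow row) ->
                    KV.of((String) row.get("name"), Long.valueOf((String) row.get("number")))));

A full DoFn remains the right choice when the step needs state, timers, side inputs, or multiple outputs; for a pure element-wise mapping the two are equivalent.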

Example 65 with DoFn

Use of org.apache.beam.sdk.transforms.DoFn in project beam by apache.

From class BigQueryIOReadTest, method testReadFromTable:

private void testReadFromTable(boolean useTemplateCompatibility, boolean useReadTableRows)
    throws IOException, InterruptedException {
    Table sometable = new Table();
    sometable.setSchema(
        new TableSchema()
            .setFields(
                ImmutableList.of(
                    new TableFieldSchema().setName("name").setType("STRING"),
                    new TableFieldSchema().setName("number").setType("INTEGER"))));
    sometable.setTableReference(
        new TableReference()
            .setProjectId("non-executing-project")
            .setDatasetId("somedataset")
            .setTableId("sometable"));
    sometable.setNumBytes(1024L * 1024L);
    FakeDatasetService fakeDatasetService = new FakeDatasetService();
    fakeDatasetService.createDataset("non-executing-project", "somedataset", "", "", null);
    fakeDatasetService.createTable(sometable);
    List<TableRow> records =
        Lists.newArrayList(
            new TableRow().set("name", "a").set("number", 1L),
            new TableRow().set("name", "b").set("number", 2L),
            new TableRow().set("name", "c").set("number", 3L));
    fakeDatasetService.insertAll(sometable.getTableReference(), records, null);
    FakeBigQueryServices fakeBqServices =
        new FakeBigQueryServices()
            .withJobService(new FakeJobService())
            .withDatasetService(fakeDatasetService);
    PTransform<PBegin, PCollection<TableRow>> readTransform;
    if (useReadTableRows) {
        BigQueryIO.Read read =
            BigQueryIO.read()
                .from("non-executing-project:somedataset.sometable")
                .withTestServices(fakeBqServices)
                .withoutValidation();
        readTransform = useTemplateCompatibility ? read.withTemplateCompatibility() : read;
    } else {
        BigQueryIO.TypedRead<TableRow> read =
            BigQueryIO.readTableRows()
                .from("non-executing-project:somedataset.sometable")
                .withTestServices(fakeBqServices)
                .withoutValidation();
        readTransform = useTemplateCompatibility ? read.withTemplateCompatibility() : read;
    }
    PCollection<KV<String, Long>> output =
        p.apply(readTransform)
            .apply(
                ParDo.of(
                    new DoFn<TableRow, KV<String, Long>>() {
                        @ProcessElement
                        public void processElement(ProcessContext c) throws Exception {
                            c.output(
                                KV.of(
                                    (String) c.element().get("name"),
                                    Long.valueOf((String) c.element().get("number"))));
                        }
                    }));
    PAssert.that(output)
        .containsInAnyOrder(ImmutableList.of(KV.of("a", 1L), KV.of("b", 2L), KV.of("c", 3L)));
    p.run();
}
Also used: Table (com.google.api.services.bigquery.model.Table), TableSchema (com.google.api.services.bigquery.model.TableSchema), KV (org.apache.beam.sdk.values.KV), ByteString (com.google.protobuf.ByteString), PBegin (org.apache.beam.sdk.values.PBegin), TableFieldSchema (com.google.api.services.bigquery.model.TableFieldSchema), PCollection (org.apache.beam.sdk.values.PCollection), TableReference (com.google.api.services.bigquery.model.TableReference), BigQueryResourceNaming.createTempTableReference (org.apache.beam.sdk.io.gcp.bigquery.BigQueryResourceNaming.createTempTableReference), FakeDatasetService (org.apache.beam.sdk.io.gcp.testing.FakeDatasetService), DoFn (org.apache.beam.sdk.transforms.DoFn), FakeJobService (org.apache.beam.sdk.io.gcp.testing.FakeJobService), TableRow (com.google.api.services.bigquery.model.TableRow), FakeBigQueryServices (org.apache.beam.sdk.io.gcp.testing.FakeBigQueryServices)
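Both test methods lean on a shared TestPipeline field p and PAssert. A minimal self-contained version of that test pattern, independent of BigQuery (an illustrative sketch, not taken from BigQueryIOReadTest):

import org.apache.beam.sdk.testing.PAssert;
import org.apache.beam.sdk.testing.TestPipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.junit.Rule;
import org.junit.Test;

public class UppercaseDoFnTest {
    // TestPipeline is a JUnit rule; it validates that the pipeline is actually run.
    @Rule public final transient TestPipeline p = TestPipeline.create();

    @Test
    public void testUppercase() {
        PCollection<String> output =
            p.apply(Create.of("a", "b"))
                .apply(
                    ParDo.of(
                        new DoFn<String, String>() {
                            @ProcessElement
                            public void processElement(ProcessContext c) {
                                c.output(c.element().toUpperCase());
                            }
                        }));
        // PAssert registers assertions that are checked when the pipeline runs.
        PAssert.that(output).containsInAnyOrder("A", "B");
        p.run().waitUntilFinish();
    }
}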

Aggregations

DoFn (org.apache.beam.sdk.transforms.DoFn): 154
Test (org.junit.Test): 98
Pipeline (org.apache.beam.sdk.Pipeline): 60
KV (org.apache.beam.sdk.values.KV): 45
TupleTag (org.apache.beam.sdk.values.TupleTag): 28
StateSpec (org.apache.beam.sdk.state.StateSpec): 26
Instant (org.joda.time.Instant): 26
ArrayList (java.util.ArrayList): 23
TestPipeline (org.apache.beam.sdk.testing.TestPipeline): 23
BoundedWindow (org.apache.beam.sdk.transforms.windowing.BoundedWindow): 22
PCollection (org.apache.beam.sdk.values.PCollection): 21
TimerSpec (org.apache.beam.sdk.state.TimerSpec): 19
WindowedValue (org.apache.beam.sdk.util.WindowedValue): 18
PCollectionView (org.apache.beam.sdk.values.PCollectionView): 18
HashMap (java.util.HashMap): 17
Coder (org.apache.beam.sdk.coders.Coder): 17
List (java.util.List): 16
Map (java.util.Map): 14
ValueState (org.apache.beam.sdk.state.ValueState): 14
RunnerApi (org.apache.beam.model.pipeline.v1.RunnerApi): 13