Search in sources :

Example 6 with BatchJoiner

use of io.cdap.cdap.etl.api.batch.BatchJoiner in project cdap by cdapio.

the class PipelinePluginContext method wrapPlugin.

private Object wrapPlugin(String pluginId, Object plugin) {
    Caller caller = getCaller(pluginId);
    StageMetrics stageMetrics = new DefaultStageMetrics(metrics, pluginId);
    OperationTimer operationTimer = processTimingEnabled ? new MetricsOperationTimer(stageMetrics) : NoOpOperationTimer.INSTANCE;
    if (plugin instanceof Action) {
        return new WrappedAction((Action) plugin, caller);
    } else if (plugin instanceof BatchSource) {
        return new WrappedBatchSource<>((BatchSource) plugin, caller, operationTimer);
    } else if (plugin instanceof BatchSink) {
        return new WrappedBatchSink<>((BatchSink) plugin, caller, operationTimer);
    } else if (plugin instanceof ErrorTransform) {
        return new WrappedErrorTransform<>((ErrorTransform) plugin, caller, operationTimer);
    } else if (plugin instanceof Transform) {
        return new WrappedTransform<>((Transform) plugin, caller, operationTimer);
    } else if (plugin instanceof BatchReducibleAggregator) {
        return new WrappedReduceAggregator<>((BatchReducibleAggregator) plugin, caller, operationTimer);
    } else if (plugin instanceof BatchAggregator) {
        return new WrappedBatchAggregator<>((BatchAggregator) plugin, caller, operationTimer);
    } else if (plugin instanceof BatchJoiner) {
        return new WrappedBatchJoiner<>((BatchJoiner) plugin, caller, operationTimer);
    } else if (plugin instanceof PostAction) {
        return new WrappedPostAction((PostAction) plugin, caller);
    } else if (plugin instanceof SplitterTransform) {
        return new WrappedSplitterTransform<>((SplitterTransform) plugin, caller, operationTimer);
    }
    return wrapUnknownPlugin(pluginId, plugin, caller);
}
Also used : Action(io.cdap.cdap.etl.api.action.Action) PostAction(io.cdap.cdap.etl.api.batch.PostAction) BatchSource(io.cdap.cdap.etl.api.batch.BatchSource) BatchAggregator(io.cdap.cdap.etl.api.batch.BatchAggregator) BatchSink(io.cdap.cdap.etl.api.batch.BatchSink) DefaultStageMetrics(io.cdap.cdap.etl.common.DefaultStageMetrics) StageMetrics(io.cdap.cdap.etl.api.StageMetrics) BatchReducibleAggregator(io.cdap.cdap.etl.api.batch.BatchReducibleAggregator) SplitterTransform(io.cdap.cdap.etl.api.SplitterTransform) BatchJoiner(io.cdap.cdap.etl.api.batch.BatchJoiner) ErrorTransform(io.cdap.cdap.etl.api.ErrorTransform) PostAction(io.cdap.cdap.etl.api.batch.PostAction) ErrorTransform(io.cdap.cdap.etl.api.ErrorTransform) SplitterTransform(io.cdap.cdap.etl.api.SplitterTransform) Transform(io.cdap.cdap.etl.api.Transform) DefaultStageMetrics(io.cdap.cdap.etl.common.DefaultStageMetrics)

Example 7 with BatchJoiner

use of io.cdap.cdap.etl.api.batch.BatchJoiner in project cdap by cdapio.

the class JoinOnFunction method createInitializedJoinOnTransform.

private JoinOnTransform<INPUT_RECORD, JOIN_KEY> createInitializedJoinOnTransform() throws Exception {
    Object plugin = pluginFunctionContext.createPlugin();
    BatchJoiner<JOIN_KEY, INPUT_RECORD, Object> joiner;
    boolean filterNullKeys = false;
    if (plugin instanceof BatchAutoJoiner) {
        BatchAutoJoiner autoJoiner = (BatchAutoJoiner) plugin;
        AutoJoinerContext autoJoinerContext = pluginFunctionContext.createAutoJoinerContext();
        JoinDefinition joinDefinition = autoJoiner.define(autoJoinerContext);
        autoJoinerContext.getFailureCollector().getOrThrowException();
        String stageName = pluginFunctionContext.getStageName();
        if (joinDefinition == null) {
            throw new IllegalStateException(String.format("Join stage '%s' did not specify a join definition. " + "Check with the plugin developer to ensure it is implemented correctly.", stageName));
        }
        JoinCondition condition = joinDefinition.getCondition();
        /*
         Filter out the record if it comes from an optional stage
         and the key is null, or if any of the fields in the key is null.
         For example, suppose we are performing a left outer join on:

          A (id, name) = (0, alice), (null, bob)
          B (id, email) = (0, alice@example.com), (null, placeholder@example.com)

         The final output should be:

         joined (A.id, A.name, B.email) = (0, alice, alice@example.com), (null, bob, null, null)

         that is, the bob record should not be joined to the placeholder@example email, even though both their
         ids are null.
       */
        if (condition.getOp() == JoinCondition.Op.KEY_EQUALITY && !((JoinCondition.OnKeys) condition).isNullSafe()) {
            filterNullKeys = joinDefinition.getStages().stream().filter(s -> !s.isRequired()).map(JoinStage::getStageName).anyMatch(s -> s.equals(inputStageName));
        }
        joiner = new JoinerBridge(stageName, autoJoiner, joinDefinition);
    } else {
        joiner = (BatchJoiner<JOIN_KEY, INPUT_RECORD, Object>) plugin;
        BatchJoinerRuntimeContext context = pluginFunctionContext.createBatchRuntimeContext();
        joiner.initialize(context);
    }
    return new JoinOnTransform<>(joiner, inputStageName, filterNullKeys);
}
Also used : BatchJoiner(io.cdap.cdap.etl.api.batch.BatchJoiner) Transformation(io.cdap.cdap.etl.api.Transformation) BatchJoinerRuntimeContext(io.cdap.cdap.etl.api.batch.BatchJoinerRuntimeContext) PairFlatMapFunction(org.apache.spark.api.java.function.PairFlatMapFunction) Iterator(java.util.Iterator) JoinStage(io.cdap.cdap.etl.api.join.JoinStage) JoinerBridge(io.cdap.cdap.etl.common.plugin.JoinerBridge) Tuple2(scala.Tuple2) Schema(io.cdap.cdap.api.data.schema.Schema) Constants(io.cdap.cdap.etl.common.Constants) StructuredRecord(io.cdap.cdap.api.data.format.StructuredRecord) TrackedTransform(io.cdap.cdap.etl.common.TrackedTransform) Emitter(io.cdap.cdap.etl.api.Emitter) DefaultEmitter(io.cdap.cdap.etl.common.DefaultEmitter) JoinDefinition(io.cdap.cdap.etl.api.join.JoinDefinition) AutoJoinerContext(io.cdap.cdap.etl.api.join.AutoJoinerContext) JoinCondition(io.cdap.cdap.etl.api.join.JoinCondition) BatchAutoJoiner(io.cdap.cdap.etl.api.batch.BatchAutoJoiner) BatchJoinerRuntimeContext(io.cdap.cdap.etl.api.batch.BatchJoinerRuntimeContext) JoinStage(io.cdap.cdap.etl.api.join.JoinStage) JoinCondition(io.cdap.cdap.etl.api.join.JoinCondition) BatchAutoJoiner(io.cdap.cdap.etl.api.batch.BatchAutoJoiner) AutoJoinerContext(io.cdap.cdap.etl.api.join.AutoJoinerContext) JoinDefinition(io.cdap.cdap.etl.api.join.JoinDefinition) JoinerBridge(io.cdap.cdap.etl.common.plugin.JoinerBridge)

Example 8 with BatchJoiner

use of io.cdap.cdap.etl.api.batch.BatchJoiner in project cdap by cdapio.

the class SparkStreamingPipelineRunner method handleJoin.

@Override
protected SparkCollection<Object> handleJoin(Map<String, SparkCollection<Object>> inputDataCollections, PipelinePhase pipelinePhase, PluginFunctionContext pluginFunctionContext, StageSpec stageSpec, FunctionCache.Factory functionCacheFactory, Object plugin, Integer numPartitions, StageStatisticsCollector collector, Set<String> shufflers) throws Exception {
    String stageName = stageSpec.getName();
    BatchJoiner<?, ?, ?> joiner;
    if (plugin instanceof BatchAutoJoiner) {
        BatchAutoJoiner autoJoiner = (BatchAutoJoiner) plugin;
        Map<String, Schema> inputSchemas = new HashMap<>();
        for (String inputStageName : pipelinePhase.getStageInputs(stageName)) {
            StageSpec inputStageSpec = pipelinePhase.getStage(inputStageName);
            inputSchemas.put(inputStageName, inputStageSpec.getOutputSchema());
        }
        FailureCollector failureCollector = new LoggingFailureCollector(stageName, inputSchemas);
        AutoJoinerContext autoJoinerContext = DefaultAutoJoinerContext.from(inputSchemas, failureCollector);
        failureCollector.getOrThrowException();
        JoinDefinition joinDefinition = autoJoiner.define(autoJoinerContext);
        if (joinDefinition == null) {
            throw new IllegalStateException(String.format("Joiner stage '%s' did not specify a join definition. " + "Check with the plugin developer to ensure it is implemented correctly.", stageName));
        }
        joiner = new JoinerBridge(stageName, autoJoiner, joinDefinition);
    } else if (plugin instanceof BatchJoiner) {
        joiner = (BatchJoiner) plugin;
    } else {
        // should never happen unless there is a bug in the code. should have failed during deployment
        throw new IllegalStateException(String.format("Stage '%s' is an unknown joiner type %s", stageName, plugin.getClass().getName()));
    }
    BatchJoinerRuntimeContext joinerRuntimeContext = pluginFunctionContext.createBatchRuntimeContext();
    joiner.initialize(joinerRuntimeContext);
    shufflers.add(stageName);
    return handleJoin(joiner, inputDataCollections, stageSpec, functionCacheFactory, numPartitions, collector);
}
Also used : BatchJoinerRuntimeContext(io.cdap.cdap.etl.api.batch.BatchJoinerRuntimeContext) LoggingFailureCollector(io.cdap.cdap.etl.validation.LoggingFailureCollector) HashMap(java.util.HashMap) Schema(io.cdap.cdap.api.data.schema.Schema) BatchJoiner(io.cdap.cdap.etl.api.batch.BatchJoiner) BatchAutoJoiner(io.cdap.cdap.etl.api.batch.BatchAutoJoiner) DefaultAutoJoinerContext(io.cdap.cdap.etl.common.DefaultAutoJoinerContext) AutoJoinerContext(io.cdap.cdap.etl.api.join.AutoJoinerContext) JoinDefinition(io.cdap.cdap.etl.api.join.JoinDefinition) StageSpec(io.cdap.cdap.etl.proto.v2.spec.StageSpec) LoggingFailureCollector(io.cdap.cdap.etl.validation.LoggingFailureCollector) FailureCollector(io.cdap.cdap.etl.api.FailureCollector) JoinerBridge(io.cdap.cdap.etl.common.plugin.JoinerBridge)

Example 9 with BatchJoiner

use of io.cdap.cdap.etl.api.batch.BatchJoiner in project cdap by cdapio.

the class MapReduceTransformExecutorFactory method getTransformation.

@SuppressWarnings("unchecked")
@Override
protected <IN, OUT> TrackedTransform<IN, OUT> getTransformation(StageSpec stageSpec) throws Exception {
    String stageName = stageSpec.getName();
    String pluginType = stageSpec.getPluginType();
    StageMetrics stageMetrics = new DefaultStageMetrics(metrics, stageName);
    TaskAttemptContext taskAttemptContext = (TaskAttemptContext) taskContext.getHadoopContext();
    StageStatisticsCollector collector = collectStageStatistics ? new MapReduceStageStatisticsCollector(stageName, taskAttemptContext) : new NoopStageStatisticsCollector();
    if (BatchAggregator.PLUGIN_TYPE.equals(pluginType)) {
        Object plugin = pluginInstantiator.newPluginInstance(stageName, macroEvaluator);
        BatchAggregator<?, ?, ?> batchAggregator;
        if (plugin instanceof BatchReducibleAggregator) {
            BatchReducibleAggregator<?, ?, ?, ?> reducibleAggregator = (BatchReducibleAggregator<?, ?, ?, ?>) plugin;
            batchAggregator = new AggregatorBridge<>(reducibleAggregator);
        } else {
            batchAggregator = (BatchAggregator<?, ?, ?>) plugin;
        }
        BatchRuntimeContext runtimeContext = createRuntimeContext(stageSpec);
        batchAggregator.initialize(runtimeContext);
        if (isMapPhase) {
            return getTrackedEmitKeyStep(new MapperAggregatorTransformation(batchAggregator, mapOutputKeyClassName, mapOutputValClassName), stageMetrics, getDataTracer(stageName), collector);
        } else {
            return getTrackedAggregateStep(new ReducerAggregatorTransformation(batchAggregator, mapOutputKeyClassName, mapOutputValClassName), stageMetrics, getDataTracer(stageName), collector);
        }
    } else if (BatchJoiner.PLUGIN_TYPE.equals(pluginType)) {
        Object plugin = pluginInstantiator.newPluginInstance(stageName, macroEvaluator);
        BatchJoiner<?, ?, ?> batchJoiner;
        Set<String> filterNullKeyStages = new HashSet<>();
        if (plugin instanceof BatchAutoJoiner) {
            BatchAutoJoiner autoJoiner = (BatchAutoJoiner) plugin;
            FailureCollector failureCollector = new LoggingFailureCollector(stageName, stageSpec.getInputSchemas());
            DefaultAutoJoinerContext context = DefaultAutoJoinerContext.from(stageSpec.getInputSchemas(), failureCollector);
            // definition will be non-null due to validate by PipelinePhasePreparer at the start of the run
            JoinDefinition joinDefinition = autoJoiner.define(context);
            JoinCondition condition = joinDefinition.getCondition();
            // should never happen as it's checked at deployment time, but add this to be safe.
            if (condition.getOp() != JoinCondition.Op.KEY_EQUALITY) {
                failureCollector.addFailure(String.format("Join stage '%s' uses a %s condition, which is not supported with the MapReduce engine.", stageName, condition.getOp()), "Switch to a different execution engine.");
            }
            failureCollector.getOrThrowException();
            batchJoiner = new JoinerBridge(stageName, autoJoiner, joinDefinition);
            // this is the same as filtering out records that have a null key if they are from an optional stage
            if (condition.getOp() == JoinCondition.Op.KEY_EQUALITY && !((JoinCondition.OnKeys) condition).isNullSafe()) {
                filterNullKeyStages = joinDefinition.getStages().stream().filter(s -> !s.isRequired()).map(JoinStage::getStageName).collect(Collectors.toSet());
            }
        } else {
            batchJoiner = (BatchJoiner<?, ?, ?>) plugin;
        }
        BatchJoinerRuntimeContext runtimeContext = createRuntimeContext(stageSpec);
        batchJoiner.initialize(runtimeContext);
        if (isMapPhase) {
            return getTrackedEmitKeyStep(new MapperJoinerTransformation(batchJoiner, mapOutputKeyClassName, mapOutputValClassName, filterNullKeyStages), stageMetrics, getDataTracer(stageName), collector);
        } else {
            return getTrackedMergeStep(new ReducerJoinerTransformation(batchJoiner, mapOutputKeyClassName, mapOutputValClassName, runtimeContext.getInputSchemas().size()), stageMetrics, getDataTracer(stageName), collector);
        }
    }
    return super.getTransformation(stageSpec);
}
Also used : LoggingFailureCollector(io.cdap.cdap.etl.validation.LoggingFailureCollector) Set(java.util.Set) HashSet(java.util.HashSet) DefaultAutoJoinerContext(io.cdap.cdap.etl.common.DefaultAutoJoinerContext) BatchAutoJoiner(io.cdap.cdap.etl.api.batch.BatchAutoJoiner) BatchRuntimeContext(io.cdap.cdap.etl.api.batch.BatchRuntimeContext) JoinDefinition(io.cdap.cdap.etl.api.join.JoinDefinition) StageMetrics(io.cdap.cdap.etl.api.StageMetrics) DefaultStageMetrics(io.cdap.cdap.etl.common.DefaultStageMetrics) BatchReducibleAggregator(io.cdap.cdap.etl.api.batch.BatchReducibleAggregator) JoinerBridge(io.cdap.cdap.etl.common.plugin.JoinerBridge) BatchJoinerRuntimeContext(io.cdap.cdap.etl.api.batch.BatchJoinerRuntimeContext) NoopStageStatisticsCollector(io.cdap.cdap.etl.common.NoopStageStatisticsCollector) TaskAttemptContext(org.apache.hadoop.mapreduce.TaskAttemptContext) BatchJoiner(io.cdap.cdap.etl.api.batch.BatchJoiner) JoinCondition(io.cdap.cdap.etl.api.join.JoinCondition) StageStatisticsCollector(io.cdap.cdap.etl.common.StageStatisticsCollector) NoopStageStatisticsCollector(io.cdap.cdap.etl.common.NoopStageStatisticsCollector) DefaultStageMetrics(io.cdap.cdap.etl.common.DefaultStageMetrics) LoggingFailureCollector(io.cdap.cdap.etl.validation.LoggingFailureCollector) FailureCollector(io.cdap.cdap.etl.api.FailureCollector)

Example 10 with BatchJoiner

use of io.cdap.cdap.etl.api.batch.BatchJoiner in project cdap by caskdata.

the class PipelinePluginContext method wrapPlugin.

private Object wrapPlugin(String pluginId, Object plugin) {
    Caller caller = getCaller(pluginId);
    StageMetrics stageMetrics = new DefaultStageMetrics(metrics, pluginId);
    OperationTimer operationTimer = processTimingEnabled ? new MetricsOperationTimer(stageMetrics) : NoOpOperationTimer.INSTANCE;
    if (plugin instanceof Action) {
        return new WrappedAction((Action) plugin, caller);
    } else if (plugin instanceof BatchSource) {
        return new WrappedBatchSource<>((BatchSource) plugin, caller, operationTimer);
    } else if (plugin instanceof BatchSink) {
        return new WrappedBatchSink<>((BatchSink) plugin, caller, operationTimer);
    } else if (plugin instanceof ErrorTransform) {
        return new WrappedErrorTransform<>((ErrorTransform) plugin, caller, operationTimer);
    } else if (plugin instanceof Transform) {
        return new WrappedTransform<>((Transform) plugin, caller, operationTimer);
    } else if (plugin instanceof BatchReducibleAggregator) {
        return new WrappedReduceAggregator<>((BatchReducibleAggregator) plugin, caller, operationTimer);
    } else if (plugin instanceof BatchAggregator) {
        return new WrappedBatchAggregator<>((BatchAggregator) plugin, caller, operationTimer);
    } else if (plugin instanceof BatchJoiner) {
        return new WrappedBatchJoiner<>((BatchJoiner) plugin, caller, operationTimer);
    } else if (plugin instanceof PostAction) {
        return new WrappedPostAction((PostAction) plugin, caller);
    } else if (plugin instanceof SplitterTransform) {
        return new WrappedSplitterTransform<>((SplitterTransform) plugin, caller, operationTimer);
    }
    return wrapUnknownPlugin(pluginId, plugin, caller);
}
Also used : Action(io.cdap.cdap.etl.api.action.Action) PostAction(io.cdap.cdap.etl.api.batch.PostAction) BatchSource(io.cdap.cdap.etl.api.batch.BatchSource) BatchAggregator(io.cdap.cdap.etl.api.batch.BatchAggregator) BatchSink(io.cdap.cdap.etl.api.batch.BatchSink) DefaultStageMetrics(io.cdap.cdap.etl.common.DefaultStageMetrics) StageMetrics(io.cdap.cdap.etl.api.StageMetrics) BatchReducibleAggregator(io.cdap.cdap.etl.api.batch.BatchReducibleAggregator) SplitterTransform(io.cdap.cdap.etl.api.SplitterTransform) BatchJoiner(io.cdap.cdap.etl.api.batch.BatchJoiner) ErrorTransform(io.cdap.cdap.etl.api.ErrorTransform) PostAction(io.cdap.cdap.etl.api.batch.PostAction) ErrorTransform(io.cdap.cdap.etl.api.ErrorTransform) SplitterTransform(io.cdap.cdap.etl.api.SplitterTransform) Transform(io.cdap.cdap.etl.api.Transform) DefaultStageMetrics(io.cdap.cdap.etl.common.DefaultStageMetrics)

Aggregations

BatchJoiner (io.cdap.cdap.etl.api.batch.BatchJoiner)12 BatchAutoJoiner (io.cdap.cdap.etl.api.batch.BatchAutoJoiner)8 BatchJoinerRuntimeContext (io.cdap.cdap.etl.api.batch.BatchJoinerRuntimeContext)8 JoinDefinition (io.cdap.cdap.etl.api.join.JoinDefinition)8 FailureCollector (io.cdap.cdap.etl.api.FailureCollector)6 BatchReducibleAggregator (io.cdap.cdap.etl.api.batch.BatchReducibleAggregator)6 AutoJoinerContext (io.cdap.cdap.etl.api.join.AutoJoinerContext)6 DefaultAutoJoinerContext (io.cdap.cdap.etl.common.DefaultAutoJoinerContext)6 JoinerBridge (io.cdap.cdap.etl.common.plugin.JoinerBridge)6 StageSpec (io.cdap.cdap.etl.proto.v2.spec.StageSpec)6 LoggingFailureCollector (io.cdap.cdap.etl.validation.LoggingFailureCollector)6 Schema (io.cdap.cdap.api.data.schema.Schema)4 SplitterTransform (io.cdap.cdap.etl.api.SplitterTransform)4 StageMetrics (io.cdap.cdap.etl.api.StageMetrics)4 BatchAggregator (io.cdap.cdap.etl.api.batch.BatchAggregator)4 JoinCondition (io.cdap.cdap.etl.api.join.JoinCondition)4 DefaultStageMetrics (io.cdap.cdap.etl.common.DefaultStageMetrics)4 HashMap (java.util.HashMap)4 StructuredRecord (io.cdap.cdap.api.data.format.StructuredRecord)2 Emitter (io.cdap.cdap.etl.api.Emitter)2