Example 1 with SparkTransformExecutorFactory

Use of io.cdap.cdap.etl.spark.SparkTransformExecutorFactory in project cdap by caskdata.

From the class MultiSinkFunction, the method initializeBranchExecutors:

private void initializeBranchExecutors() {
    emitter = new DefaultEmitter<>();
    PipelinePluginInstantiator pluginInstantiator = new PipelinePluginInstantiator(
        pipelineRuntime.getPluginContext(), pipelineRuntime.getMetrics(), phaseSpec,
        new SingleConnectorFactory());
    MacroEvaluator macroEvaluator = new DefaultMacroEvaluator(
        pipelineRuntime.getArguments(), pipelineRuntime.getLogicalStartTime(),
        pipelineRuntime.getSecureStore(), pipelineRuntime.getServiceDiscoverer(),
        pipelineRuntime.getNamespace());
    executorFactory = new SparkTransformExecutorFactory(
        pluginInstantiator, macroEvaluator, null, collectors, dataTracers, pipelineRuntime, emitter);
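    // the factory created here is used below to build one transform executor per branch of the group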
    /*
       If the dag is:

            |--> t1 --> k1
       s1 --|
            |--> k2
                 ^
           s2 ---|

       the group is t1, k1, and k2.
     */
    PipelinePhase pipelinePhase = phaseSpec.getPhase();
    branchExecutors = new HashMap<>();
    inputConnections = new HashMap<>();
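    // branchExecutors: group "source" stage -> executor for the branch rooted at that stage
    // inputConnections: (input stage, record type, port) -> group "sources" fed by that input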
    for (String groupSource : group) {
        // group "sources" are stages in the group that don't have an input from another stage in the group.
        if (Sets.difference(pipelinePhase.getStageInputs(groupSource), group).isEmpty()) {
            continue;
        }
        // get the branch by taking a subset of the pipeline starting from the "source".
        // with the example above, the two branches are t1 -> k1, and k2.
        PipelinePhase branch;
        if (pipelinePhase.getSinks().contains(groupSource)) {
            // pipelinePhase.subsetFrom() throws an exception if the new "source" is also a sink,
            // since a Dag cannot be a single node. so build it manually.
            branch = PipelinePhase.builder(pipelinePhase.getPluginTypes()).addStage(pipelinePhase.getStage(groupSource)).build();
        } else {
            branch = pipelinePhase.subsetFrom(Collections.singleton(groupSource));
        }
        try {
            branchExecutors.put(groupSource, executorFactory.create(branch));
        } catch (Exception e) {
            throw new IllegalStateException(String.format(
                "Unable to get subset of pipeline starting from stage %s. "
                    + "This indicates a planning error. Please report this bug and turn off stage "
                    + "consolidation by setting %s to false in the runtime arguments.",
                groupSource, Constants.CONSOLIDATE_STAGES), e);
        }
        /*
          create a mapping from possible inputs to "group sources". This will help identify which incoming
          records should be sent to which branch executor.

          for example, the pipeline may look like:

                           |port a --> k1
             s --> split --|
                           |port b --> k2

          In this scenario, k1 and k2 are in the same group, so the map contains:

            { stageName: split, port: a, type: output } -> [k1]
            { stageName: split, port: b, type: output } -> [k2]

          A slightly more complicated example:

                               |--> k1
            s1 --> transform --|
                      |        |--> k2
                      |
                      |--> error collector --> k3

          In this scenario, k1, k2, k3, and the error collector are all in the same group. The
          error collector (not k3, whose only input comes from inside the group) is the group
          "source" for transform's error output, so the map contains:

            { stageName: transform, type: output } -> [k1, k2]
            { stageName: transform, type: error } -> [error collector]
       */
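        // an ErrorTransform group source consumes its input stages' error records;
        // any other plugin type consumes their normal output records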
        String groupSourceType = pipelinePhase.getStage(groupSource).getPluginType();
        RecordType recordType = ErrorTransform.PLUGIN_TYPE.equals(groupSourceType) ? RecordType.ERROR : RecordType.OUTPUT;
        for (String inputStage : pipelinePhase.getStageInputs(groupSource)) {
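            // look up which port (if any) the input stage uses to reach this group source;
            // the port may be null for plain connections, or a name like "a"/"b" for splitters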
            Map<String, StageSpec.Port> ports = pipelinePhase.getStage(inputStage).getOutputPorts();
            String port = ports.get(groupSource).getPort();
            InputInfo inputInfo = new InputInfo(inputStage, recordType, port);
            Set<String> groupSources = inputConnections.computeIfAbsent(inputInfo, key -> new HashSet<>());
            groupSources.add(groupSource);
        }
    }
}
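As a quick illustration, here is a minimal, self-contained sketch of the group-"source" check used in the loop above, applied to the example dag from the first comment with group = {t1, k1, k2}. This is not code from the project: the class name GroupSourceCheck and the hard-coded stage-input map are made up for the example; only Guava's Sets.difference mirrors the real code.

import com.google.common.collect.ImmutableMap;
import com.google.common.collect.ImmutableSet;
import com.google.common.collect.Sets;
import java.util.Map;
import java.util.Set;

public class GroupSourceCheck {
    public static void main(String[] args) {
        Set<String> group = ImmutableSet.of("t1", "k1", "k2");
        // stage -> stages that feed into it, hard-coded to match the example dag above
        Map<String, Set<String>> stageInputs = ImmutableMap.of(
            "t1", ImmutableSet.of("s1"),
            "k1", ImmutableSet.of("t1"),
            "k2", ImmutableSet.of("s1", "s2"));
        for (String stage : group) {
            // a stage is a group "source" when at least one input comes from outside the group
            boolean isGroupSource = !Sets.difference(stageInputs.get(stage), group).isEmpty();
            System.out.println(stage + " is a group source: " + isGroupSource);
        }
    }
}

Running it reports t1 and k2 as group sources and skips k1, which is why initializeBranchExecutors builds exactly two branch executors (t1 -> k1, and k2) for that dag.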
Also used: MacroEvaluator (io.cdap.cdap.api.macro.MacroEvaluator), DefaultMacroEvaluator (io.cdap.cdap.etl.common.DefaultMacroEvaluator), SingleConnectorFactory (io.cdap.cdap.etl.batch.connector.SingleConnectorFactory), SparkTransformExecutorFactory (io.cdap.cdap.etl.spark.SparkTransformExecutorFactory), RecordType (io.cdap.cdap.etl.common.RecordType), PipelinePhase (io.cdap.cdap.etl.common.PipelinePhase), PipelinePluginInstantiator (io.cdap.cdap.etl.batch.PipelinePluginInstantiator)

Aggregations

MacroEvaluator (io.cdap.cdap.api.macro.MacroEvaluator): 1
PipelinePluginInstantiator (io.cdap.cdap.etl.batch.PipelinePluginInstantiator): 1
SingleConnectorFactory (io.cdap.cdap.etl.batch.connector.SingleConnectorFactory): 1
DefaultMacroEvaluator (io.cdap.cdap.etl.common.DefaultMacroEvaluator): 1
PipelinePhase (io.cdap.cdap.etl.common.PipelinePhase): 1
RecordType (io.cdap.cdap.etl.common.RecordType): 1
SparkTransformExecutorFactory (io.cdap.cdap.etl.spark.SparkTransformExecutorFactory): 1