Example 1 with SparkPipelineRuntime

Use of io.cdap.cdap.etl.spark.SparkPipelineRuntime in project cdap by caskdata.

From class BaseRDDCollection, method createStoreTask.

@Override
public Runnable createStoreTask(final StageSpec stageSpec, final SparkSink<T> sink) throws Exception {
    return new Runnable() {

        @Override
        public void run() {
            String stageName = stageSpec.getName();
            // Build the per-run pipeline runtime and a plugin context scoped to this stage.
            PipelineRuntime pipelineRuntime = new SparkPipelineRuntime(sec);
            SparkExecutionPluginContext sparkPluginContext = new BasicSparkExecutionPluginContext(sec, jsc, datasetContext, pipelineRuntime, stageSpec);
            // Wrap the RDD so each record is counted against the stage's RECORDS_IN metric.
            JavaRDD<T> countedRDD = rdd.map(new CountingFunction<T>(stageName, sec.getMetrics(), Constants.Metrics.RECORDS_IN, null));
            try {
                sink.run(sparkPluginContext, countedRDD);
            } catch (Exception e) {
                throw Throwables.propagate(e);
            }
        }
    };
}
Also used: SparkExecutionPluginContext(io.cdap.cdap.etl.api.batch.SparkExecutionPluginContext), PipelineRuntime(io.cdap.cdap.etl.common.PipelineRuntime), SparkPipelineRuntime(io.cdap.cdap.etl.spark.SparkPipelineRuntime), AccessException(io.cdap.cdap.api.security.AccessException), DatasetManagementException(io.cdap.cdap.api.dataset.DatasetManagementException)
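
The pattern above is: build a SparkPipelineRuntime from the Spark execution context, derive a stage-scoped plugin context from it, and hand a metered RDD to the SparkSink. A minimal sketch of the consuming side, where collection and mySink are illustrative stand-ins rather than CDAP identifiers:

// Hypothetical driver-side usage; `collection` and `mySink` are illustrative names.
Runnable storeTask = collection.createStoreTask(stageSpec, mySink);
// The Runnable is deferred: the RDD is not touched until run() executes the sink.
storeTask.run();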

Example 2 with SparkPipelineRuntime

Use of io.cdap.cdap.etl.spark.SparkPipelineRuntime in project cdap by caskdata.

From class BaseRDDCollection, method publishAlerts.

@Override
public void publishAlerts(StageSpec stageSpec, StageStatisticsCollector collector) throws Exception {
    PluginFunctionContext pluginFunctionContext = new PluginFunctionContext(stageSpec, sec, collector);
    AlertPublisher alertPublisher = pluginFunctionContext.createPlugin();
    PipelineRuntime pipelineRuntime = new SparkPipelineRuntime(sec);
    // Initialize the publisher with a context that exposes runtime arguments, messaging, and admin operations.
    AlertPublisherContext alertPublisherContext = new DefaultAlertPublisherContext(pipelineRuntime, stageSpec, sec.getMessagingContext(), sec.getAdmin());
    alertPublisher.initialize(alertPublisherContext);
    // Collect alerts to the driver; the tracked iterator counts each alert as the publisher consumes it.
    StageMetrics stageMetrics = new DefaultStageMetrics(sec.getMetrics(), stageSpec.getName());
    TrackedIterator<Alert> trackedAlerts = new TrackedIterator<>(((JavaRDD<Alert>) rdd).collect().iterator(), stageMetrics, Constants.Metrics.RECORDS_IN);
    alertPublisher.publish(trackedAlerts);
    alertPublisher.destroy();
}
Also used: PluginFunctionContext(io.cdap.cdap.etl.spark.function.PluginFunctionContext), AlertPublisher(io.cdap.cdap.etl.api.AlertPublisher), PipelineRuntime(io.cdap.cdap.etl.common.PipelineRuntime), SparkPipelineRuntime(io.cdap.cdap.etl.spark.SparkPipelineRuntime), TrackedIterator(io.cdap.cdap.etl.common.TrackedIterator), Alert(io.cdap.cdap.etl.api.Alert), AlertPublisherContext(io.cdap.cdap.etl.api.AlertPublisherContext), DefaultAlertPublisherContext(io.cdap.cdap.etl.common.DefaultAlertPublisherContext), StageMetrics(io.cdap.cdap.etl.api.StageMetrics), DefaultStageMetrics(io.cdap.cdap.etl.common.DefaultStageMetrics), JavaRDD(org.apache.spark.api.java.JavaRDD)
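
The lifecycle the method drives is initialize, publish, destroy. As a rough illustration, a publisher that simply logs each alert might look like the following; this is a sketch assuming publish(Iterator<Alert>) is the publisher's main abstract method, as the call sites above suggest, and LoggingAlertPublisher is a made-up name:

import java.util.Iterator;

import io.cdap.cdap.etl.api.Alert;
import io.cdap.cdap.etl.api.AlertPublisher;
import io.cdap.cdap.etl.api.AlertPublisherContext;

// Illustrative sketch, not a CDAP-provided class.
public class LoggingAlertPublisher extends AlertPublisher {

    @Override
    public void initialize(AlertPublisherContext context) throws Exception {
        // Called once before publish(), as in publishAlerts() above.
    }

    @Override
    public void publish(Iterator<Alert> alerts) throws Exception {
        // The TrackedIterator passed in by publishAlerts() increments RECORDS_IN as this drains.
        while (alerts.hasNext()) {
            System.out.println("alert: " + alerts.next());
        }
    }

    @Override
    public void destroy() {
        // Release anything acquired in initialize().
    }
}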

Example 3 with SparkPipelineRuntime

Use of io.cdap.cdap.etl.spark.SparkPipelineRuntime in project cdap by caskdata.

From class BaseRDDCollection, method compute.

@Override
public <U> SparkCollection<U> compute(StageSpec stageSpec, SparkCompute<T, U> compute) throws Exception {
    String stageName = stageSpec.getName();
    PipelineRuntime pipelineRuntime = new SparkPipelineRuntime(sec);
    SparkExecutionPluginContext sparkPluginContext = new BasicSparkExecutionPluginContext(sec, jsc, datasetContext, pipelineRuntime, stageSpec);
    compute.initialize(sparkPluginContext);
    // Count records in, apply the user transform, then count records out with data tracing enabled.
    JavaRDD<T> countedInput = rdd.map(new CountingFunction<T>(stageName, sec.getMetrics(), Constants.Metrics.RECORDS_IN, null));
    return wrap(compute.transform(sparkPluginContext, countedInput).map(new CountingFunction<U>(stageName, sec.getMetrics(), Constants.Metrics.RECORDS_OUT, sec.getDataTracer(stageName))));
}
Also used: SparkExecutionPluginContext(io.cdap.cdap.etl.api.batch.SparkExecutionPluginContext), PipelineRuntime(io.cdap.cdap.etl.common.PipelineRuntime), SparkPipelineRuntime(io.cdap.cdap.etl.spark.SparkPipelineRuntime), CountingFunction(io.cdap.cdap.etl.spark.function.CountingFunction)
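
The contract a compute plugin fulfills here is initialize(SparkExecutionPluginContext) followed by transform(SparkExecutionPluginContext, JavaRDD<T>) returning the output RDD, which is exactly how compute() above calls it. A minimal sketch of such a plugin, with a made-up name and trivial filtering logic:

import org.apache.spark.api.java.JavaRDD;

import io.cdap.cdap.etl.api.batch.SparkCompute;
import io.cdap.cdap.etl.api.batch.SparkExecutionPluginContext;

// Illustrative sketch, not a CDAP-provided class.
public class DropEmptyCompute extends SparkCompute<String, String> {

    @Override
    public void initialize(SparkExecutionPluginContext context) throws Exception {
        // Invoked once on the driver by compute() above, before any transform.
    }

    @Override
    public JavaRDD<String> transform(SparkExecutionPluginContext context, JavaRDD<String> input) throws Exception {
        // The framework has already wrapped `input` with RECORDS_IN counting,
        // and will wrap the returned RDD with RECORDS_OUT counting.
        return input.filter(s -> s != null && !s.isEmpty());
    }
}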

Example 4 with SparkPipelineRuntime

Use of io.cdap.cdap.etl.spark.SparkPipelineRuntime in project cdap by caskdata.

From class DStreamCollection, method compute.

@Override
public <U> SparkCollection<U> compute(StageSpec stageSpec, SparkCompute<T, U> compute) throws Exception {
    SparkCompute<T, U> wrappedCompute = new DynamicSparkCompute<>(new DynamicDriverContext(stageSpec, sec, new NoopStageStatisticsCollector()), compute);
    // Run initialization in a transaction so the plugin can access datasets during setup.
    Transactionals.execute(sec, new TxRunnable() {

        @Override
        public void run(DatasetContext datasetContext) throws Exception {
            PipelineRuntime pipelineRuntime = new SparkPipelineRuntime(sec);
            SparkExecutionPluginContext sparkPluginContext = new BasicSparkExecutionPluginContext(sec, JavaSparkContext.fromSparkContext(stream.context().sparkContext()), datasetContext, pipelineRuntime, stageSpec);
            wrappedCompute.initialize(sparkPluginContext);
        }
    }, Exception.class);
    // The transform itself is applied lazily to every micro-batch of the DStream.
    return wrap(stream.transform(new ComputeTransformFunction<>(sec, stageSpec, wrappedCompute)));
}
Also used: DynamicSparkCompute(io.cdap.cdap.etl.spark.streaming.function.DynamicSparkCompute), NoopStageStatisticsCollector(io.cdap.cdap.etl.common.NoopStageStatisticsCollector), ComputeTransformFunction(io.cdap.cdap.etl.spark.streaming.function.ComputeTransformFunction), PipelineRuntime(io.cdap.cdap.etl.common.PipelineRuntime), SparkPipelineRuntime(io.cdap.cdap.etl.spark.SparkPipelineRuntime), BasicSparkExecutionPluginContext(io.cdap.cdap.etl.spark.batch.BasicSparkExecutionPluginContext), SparkExecutionPluginContext(io.cdap.cdap.etl.api.batch.SparkExecutionPluginContext), TxRunnable(io.cdap.cdap.api.TxRunnable), DatasetContext(io.cdap.cdap.api.data.DatasetContext)
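
What distinguishes this streaming variant from Example 3 is that initialize() runs inside a transaction, giving the plugin dataset access during setup, while the per-batch work is deferred through stream.transform. The transactional wrapper in isolation, where tx stands in for whatever Transactional-capable context the overload accepts (sec above) and the dataset name is hypothetical:

// The same TxRunnable pattern in isolation; `tx` stands in for `sec` above,
// and "myDataset" is a hypothetical dataset name.
Transactionals.execute(tx, new TxRunnable() {

    @Override
    public void run(DatasetContext datasetContext) throws Exception {
        // All dataset access in here shares one transaction: it commits or
        // rolls back as a unit when this block finishes.
        Object table = datasetContext.getDataset("myDataset");
    }
}, Exception.class);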

Example 5 with SparkPipelineRuntime

Use of io.cdap.cdap.etl.spark.SparkPipelineRuntime in project cdap by caskdata.

From class SparkStreamingPipelineDriver, method run.

private JavaStreamingContext run(DataStreamsPipelineSpec pipelineSpec, PipelinePhase pipelinePhase, JavaSparkExecutionContext sec, @Nullable String checkpointDir, @Nullable JavaSparkContext context) throws Exception {
    PipelinePluginContext pluginContext = new PipelinePluginContext(sec.getPluginContext(), sec.getMetrics(), pipelineSpec.isStageLoggingEnabled(), pipelineSpec.isProcessTimingEnabled());
    PipelineRuntime pipelineRuntime = new SparkPipelineRuntime(sec);
    MacroEvaluator evaluator = new DefaultMacroEvaluator(pipelineRuntime.getArguments(), sec.getLogicalStartTime(), sec.getSecureStore(), sec.getServiceDiscoverer(), sec.getNamespace());
    SparkStreamingPreparer preparer = new SparkStreamingPreparer(pluginContext, sec.getMetrics(), evaluator, pipelineRuntime, sec);
    try {
        SparkFieldLineageRecorder recorder = new SparkFieldLineageRecorder(sec, pipelinePhase, pipelineSpec, preparer);
        recorder.record();
    } catch (Exception e) {
        LOG.warn("Failed to emit field lineage operations for streaming pipeline", e);
    }
    Set<String> uncombinableSinks = preparer.getUncombinableSinks();
    // Spark may restore this context from a checkpoint, in which case the function body
    // below never runs; the lineage logic above therefore executes before anything else.
    Function0<JavaStreamingContext> contextFunction = (Function0<JavaStreamingContext>) () -> {
        JavaSparkContext javaSparkContext = context == null ? new JavaSparkContext() : context;
        JavaStreamingContext jssc = new JavaStreamingContext(javaSparkContext, Durations.milliseconds(pipelineSpec.getBatchIntervalMillis()));
        SparkStreamingPipelineRunner runner = new SparkStreamingPipelineRunner(sec, jssc, pipelineSpec, pipelineSpec.isCheckpointsDisabled());
        // Ideally these flags would be set at configure time rather than read from runtime arguments, but that would require an API change.
        try {
            PhaseSpec phaseSpec = new PhaseSpec(sec.getApplicationSpecification().getName(), pipelinePhase, Collections.emptyMap(), pipelineSpec.isStageLoggingEnabled(), pipelineSpec.isProcessTimingEnabled());
            boolean shouldConsolidateStages = Boolean.parseBoolean(sec.getRuntimeArguments().getOrDefault(Constants.CONSOLIDATE_STAGES, Boolean.TRUE.toString()));
            boolean shouldCacheFunctions = Boolean.parseBoolean(sec.getRuntimeArguments().getOrDefault(Constants.CACHE_FUNCTIONS, Boolean.TRUE.toString()));
            runner.runPipeline(phaseSpec, StreamingSource.PLUGIN_TYPE, sec, Collections.emptyMap(), pluginContext, Collections.emptyMap(), uncombinableSinks, shouldConsolidateStages, shouldCacheFunctions);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        if (checkpointDir != null) {
            jssc.checkpoint(checkpointDir);
            jssc.sparkContext().hadoopConfiguration().set("fs.defaultFS", checkpointDir);
        }
        return jssc;
    };
    // Recover from an existing checkpoint if there is one; otherwise call the factory directly.
    return checkpointDir == null ? contextFunction.call() : JavaStreamingContext.getOrCreate(checkpointDir, contextFunction, context.hadoopConfiguration());
}
Also used: PipelineRuntime(io.cdap.cdap.etl.common.PipelineRuntime), SparkPipelineRuntime(io.cdap.cdap.etl.spark.SparkPipelineRuntime), MacroEvaluator(io.cdap.cdap.api.macro.MacroEvaluator), DefaultMacroEvaluator(io.cdap.cdap.etl.common.DefaultMacroEvaluator), SparkStreamingPreparer(io.cdap.cdap.etl.spark.streaming.SparkStreamingPreparer), Function0(org.apache.spark.api.java.function.Function0), IOException(java.io.IOException), JavaStreamingContext(org.apache.spark.streaming.api.java.JavaStreamingContext), JavaSparkContext(org.apache.spark.api.java.JavaSparkContext), PhaseSpec(io.cdap.cdap.etl.common.PhaseSpec), PipelinePluginContext(io.cdap.cdap.etl.common.plugin.PipelinePluginContext)
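
The tail of run() is the standard Spark Streaming checkpoint-recovery idiom: JavaStreamingContext.getOrCreate either rebuilds the context from the checkpoint directory or calls the factory function to create a fresh one. A stripped-down sketch of that idiom outside CDAP, with an illustrative master, app name, and batch interval:

import org.apache.spark.api.java.function.Function0;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

// Minimal sketch of checkpoint-based recovery; all values are illustrative.
public class CheckpointRecoverySketch {

    public static JavaStreamingContext getContext(String checkpointDir) throws Exception {
        Function0<JavaStreamingContext> factory = () -> {
            JavaStreamingContext jssc = new JavaStreamingContext("local[2]", "sketch", Durations.seconds(10));
            // Define sources and transformations here, before checkpointing is enabled.
            jssc.checkpoint(checkpointDir);
            return jssc;
        };
        // The factory only runs when no usable checkpoint exists; on recovery the
        // DStream graph is deserialized from the checkpoint instead.
        return checkpointDir == null ? factory.call() : JavaStreamingContext.getOrCreate(checkpointDir, factory);
    }
}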

Aggregations (usage counts across the matched sources)

PipelineRuntime (io.cdap.cdap.etl.common.PipelineRuntime): 10 usages
SparkPipelineRuntime (io.cdap.cdap.etl.spark.SparkPipelineRuntime): 10 usages
TxRunnable (io.cdap.cdap.api.TxRunnable): 5 usages
DatasetContext (io.cdap.cdap.api.data.DatasetContext): 5 usages
MacroEvaluator (io.cdap.cdap.api.macro.MacroEvaluator): 5 usages
SparkExecutionPluginContext (io.cdap.cdap.etl.api.batch.SparkExecutionPluginContext): 5 usages
DefaultMacroEvaluator (io.cdap.cdap.etl.common.DefaultMacroEvaluator): 5 usages
PluginContext (io.cdap.cdap.api.plugin.PluginContext): 4 usages
BasicArguments (io.cdap.cdap.etl.common.BasicArguments): 4 usages
SparkPipelinePluginContext (io.cdap.cdap.etl.spark.plugin.SparkPipelinePluginContext): 4 usages
PluginFunctionContext (io.cdap.cdap.etl.spark.function.PluginFunctionContext): 3 usages
Alert (io.cdap.cdap.etl.api.Alert): 2 usages
AlertPublisher (io.cdap.cdap.etl.api.AlertPublisher): 2 usages
AlertPublisherContext (io.cdap.cdap.etl.api.AlertPublisherContext): 2 usages
StageMetrics (io.cdap.cdap.etl.api.StageMetrics): 2 usages
DefaultAlertPublisherContext (io.cdap.cdap.etl.common.DefaultAlertPublisherContext): 2 usages
DefaultStageMetrics (io.cdap.cdap.etl.common.DefaultStageMetrics): 2 usages
NoopStageStatisticsCollector (io.cdap.cdap.etl.common.NoopStageStatisticsCollector): 2 usages
TrackedIterator (io.cdap.cdap.etl.common.TrackedIterator): 2 usages
StageSpec (io.cdap.cdap.etl.proto.v2.spec.StageSpec): 2 usages