
Example 46 with JavaPairRDD

Use of org.apache.spark.api.java.JavaPairRDD in project beam by apache.

The class SparkCompat, method extractOutput.

/**
 * Extracts the output for a given collection of WindowedAccumulators.
 *
 * <p>This is required because the API of JavaPairRDD.flatMapValues is different among Spark
 * versions. See https://issues.apache.org/jira/browse/SPARK-19287
 */
public static <K, InputT, AccumT, OutputT> JavaPairRDD<K, WindowedValue<OutputT>> extractOutput(
        JavaPairRDD<K, SparkCombineFn.WindowedAccumulator<KV<K, InputT>, InputT, AccumT, ?>> accumulatePerKey,
        SparkCombineFn<KV<K, InputT>, InputT, AccumT, OutputT> sparkCombineFn) {
    try {
        if (accumulatePerKey.context().version().startsWith("3")) {
            FlatMapFunction<SparkCombineFn.WindowedAccumulator<KV<K, InputT>, InputT, AccumT, ?>, WindowedValue<OutputT>> flatMapFunction =
                (FlatMapFunction<SparkCombineFn.WindowedAccumulator<KV<K, InputT>, InputT, AccumT, ?>, WindowedValue<OutputT>>)
                    windowedAccumulator -> sparkCombineFn.extractOutputStream(windowedAccumulator).iterator();
            // This invokes by reflection the equivalent of:
            // return accumulatePerKey.flatMapValues(flatMapFunction);
            Method method = accumulatePerKey.getClass().getDeclaredMethod("flatMapValues", FlatMapFunction.class);
            Object result = method.invoke(accumulatePerKey, flatMapFunction);
            return (JavaPairRDD<K, WindowedValue<OutputT>>) result;
        }
        Function<SparkCombineFn.WindowedAccumulator<KV<K, InputT>, InputT, AccumT, ?>, Iterable<WindowedValue<OutputT>>> flatMapFunction =
            windowedAccumulator -> sparkCombineFn.extractOutputStream(windowedAccumulator).collect(Collectors.toList());
        // This invokes by reflection the equivalent of:
        // return accumulatePerKey.flatMapValues(flatMapFunction);
        Method method = accumulatePerKey.getClass().getDeclaredMethod("flatMapValues", Function.class);
        Object result = method.invoke(accumulatePerKey, flatMapFunction);
        return (JavaPairRDD<K, WindowedValue<OutputT>>) result;
    } catch (NoSuchMethodException | IllegalAccessException | InvocationTargetException e) {
        throw new RuntimeException("Error invoking Spark flatMapValues", e);
    }
}
Also used : SparkListenerApplicationStart(org.apache.spark.scheduler.SparkListenerApplicationStart) SparkCombineFn(org.apache.beam.runners.spark.translation.SparkCombineFn) KV(org.apache.beam.sdk.values.KV) WindowedValue(org.apache.beam.sdk.util.WindowedValue) JavaStreamingContext(org.apache.spark.streaming.api.java.JavaStreamingContext) PipelineResult(org.apache.beam.sdk.PipelineResult) ApplicationNameOptions(org.apache.beam.sdk.options.ApplicationNameOptions) JavaSparkContext(org.apache.spark.api.java.JavaSparkContext) Option(scala.Option) Constructor(java.lang.reflect.Constructor) Collectors(java.util.stream.Collectors) JavaPairRDD(org.apache.spark.api.java.JavaPairRDD) InvocationTargetException(java.lang.reflect.InvocationTargetException) SparkBeamMetric(org.apache.beam.runners.spark.metrics.SparkBeamMetric) List(java.util.List) JavaConverters(scala.collection.JavaConverters) JavaDStream(org.apache.spark.streaming.api.java.JavaDStream) Function(org.apache.spark.api.java.function.Function) Method(java.lang.reflect.Method) SparkPipelineOptions(org.apache.beam.runners.spark.SparkPipelineOptions) FlatMapFunction(org.apache.spark.api.java.function.FlatMapFunction)
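
For comparison, here is a minimal, self-contained sketch of calling flatMapValues directly, without reflection, assuming a Spark 3.x dependency where flatMapValues takes a FlatMapFunction (the change tracked by SPARK-19287); the class name, sample data, and local master are illustrative only.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class FlatMapValuesSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setAppName("flatMapValues-sketch").setMaster("local[*]"));
        JavaPairRDD<String, Integer> pairs = sc.parallelizePairs(
            Arrays.asList(new Tuple2<>("a", 1), new Tuple2<>("b", 2)));
        // On Spark 3.x the lambda is a FlatMapFunction, so it must return an Iterator;
        // on Spark 2.x it would be a Function returning an Iterable instead.
        JavaPairRDD<String, Integer> expanded =
            pairs.flatMapValues(v -> Arrays.asList(v, v * 10).iterator());
        // Prints (a,1), (a,10), (b,2), (b,20) in some order
        expanded.collect().forEach(System.out::println);
        sc.stop();
    }
}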

Example 47 with JavaPairRDD

Use of org.apache.spark.api.java.JavaPairRDD in project hive by apache.

The class SparkPlanGenerator, method generateMapInput.

@SuppressWarnings("unchecked")
private MapInput generateMapInput(SparkPlan sparkPlan, MapWork mapWork) throws Exception {
    JobConf jobConf = cloneJobConf(mapWork);
    Class ifClass = getInputFormat(jobConf, mapWork);
    sc.sc().setCallSite(CallSite.apply(mapWork.getName(), ""));
    JavaPairRDD<WritableComparable, Writable> hadoopRDD;
    if (mapWork.getNumMapTasks() != null) {
        jobConf.setNumMapTasks(mapWork.getNumMapTasks());
        hadoopRDD = sc.hadoopRDD(jobConf, ifClass, WritableComparable.class, Writable.class, mapWork.getNumMapTasks());
    } else {
        hadoopRDD = sc.hadoopRDD(jobConf, ifClass, WritableComparable.class, Writable.class);
    }
    boolean toCache = false;
    String tables = mapWork.getAllRootOperators().stream()
        .filter(op -> op instanceof TableScanOperator)
        .map(ts -> ((TableScanDesc) ts.getConf()).getAlias())
        .collect(Collectors.joining(", "));
    String rddName = mapWork.getName() + " (" + tables + ", " + hadoopRDD.getNumPartitions() + (toCache ? ", cached)" : ")");
    // Caching is disabled for MapInput due to HIVE-8920
    MapInput result = new MapInput(sparkPlan, hadoopRDD, toCache, rddName, mapWork);
    return result;
}
Also used : StatsPublisher(org.apache.hadoop.hive.ql.stats.StatsPublisher) FileSystem(org.apache.hadoop.fs.FileSystem) CallSite(org.apache.spark.util.CallSite) LoggerFactory(org.slf4j.LoggerFactory) StatsCollectionContext(org.apache.hadoop.hive.ql.stats.StatsCollectionContext) JavaSparkContext(org.apache.spark.api.java.JavaSparkContext) WritableComparable(org.apache.hadoop.io.WritableComparable) HashMap(java.util.HashMap) Writable(org.apache.hadoop.io.Writable) TableScanDesc(org.apache.hadoop.hive.ql.plan.TableScanDesc) FileSinkOperator(org.apache.hadoop.hive.ql.exec.FileSinkOperator) Utilities(org.apache.hadoop.hive.ql.exec.Utilities) ExecReducer(org.apache.hadoop.hive.ql.exec.mr.ExecReducer) ReduceWork(org.apache.hadoop.hive.ql.plan.ReduceWork) Map(java.util.Map) Path(org.apache.hadoop.fs.Path) Context(org.apache.hadoop.hive.ql.Context) BaseWork(org.apache.hadoop.hive.ql.plan.BaseWork) PerfLogger(org.apache.hadoop.hive.ql.log.PerfLogger) BucketizedHiveInputFormat(org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat) Logger(org.slf4j.Logger) HiveConf(org.apache.hadoop.hive.conf.HiveConf) Set(java.util.Set) IOException(java.io.IOException) Collectors(java.util.stream.Collectors) SessionState(org.apache.hadoop.hive.ql.session.SessionState) JavaPairRDD(org.apache.spark.api.java.JavaPairRDD) TableScanOperator(org.apache.hadoop.hive.ql.exec.TableScanOperator) JavaUtils(org.apache.hadoop.hive.common.JavaUtils) MergeFileWork(org.apache.hadoop.hive.ql.io.merge.MergeFileWork) Operator(org.apache.hadoop.hive.ql.exec.Operator) ExecMapper(org.apache.hadoop.hive.ql.exec.mr.ExecMapper) JobConf(org.apache.hadoop.mapred.JobConf) StatsFactory(org.apache.hadoop.hive.ql.stats.StatsFactory) List(java.util.List) SparkEdgeProperty(org.apache.hadoop.hive.ql.plan.SparkEdgeProperty) MergeFileOutputFormat(org.apache.hadoop.hive.ql.io.merge.MergeFileOutputFormat) MapWork(org.apache.hadoop.hive.ql.plan.MapWork) SparkWork(org.apache.hadoop.hive.ql.plan.SparkWork) Preconditions(com.google.common.base.Preconditions) FileOutputFormat(org.apache.hadoop.mapred.FileOutputFormat) MergeFileMapper(org.apache.hadoop.hive.ql.io.merge.MergeFileMapper) ErrorMsg(org.apache.hadoop.hive.ql.ErrorMsg) HiveException(org.apache.hadoop.hive.ql.metadata.HiveException)
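
As a simpler point of reference, here is a minimal sketch of building a JavaPairRDD through JavaSparkContext.hadoopRDD with a JobConf, assuming the plain mapred TextInputFormat and an illustrative local input path rather than Hive's input formats and MapWork plumbing.

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class HadoopRDDSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setAppName("hadoopRDD-sketch").setMaster("local[*]"));
        JobConf jobConf = new JobConf();
        // Placeholder input path; any text files readable by TextInputFormat would do.
        FileInputFormat.setInputPaths(jobConf, "/tmp/hadoop-rdd-sketch-input");
        JavaPairRDD<LongWritable, Text> hadoopRDD =
            sc.hadoopRDD(jobConf, TextInputFormat.class, LongWritable.class, Text.class);
        // Naming the RDD (as the Hive code does) makes it easier to spot in the Spark UI.
        hadoopRDD.setName("sketch input (" + hadoopRDD.getNumPartitions() + " partitions)");
        System.out.println(hadoopRDD.count());
        sc.stop();
    }
}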

Example 48 with JavaPairRDD

Use of org.apache.spark.api.java.JavaPairRDD in project incubator-systemml by apache.

The class MLContextConversionUtil, method matrixObjectToBinaryBlockMatrix.

/**
	 * Convert a {@code MatrixObject} to a {@code BinaryBlockMatrix}.
	 * 
	 * @param matrixObject
	 *            the {@code MatrixObject}
	 * @param sparkExecutionContext
	 *            the Spark execution context
	 * @return the {@code MatrixObject} converted to a {@code BinaryBlockMatrix}
	 */
public static BinaryBlockMatrix matrixObjectToBinaryBlockMatrix(MatrixObject matrixObject, SparkExecutionContext sparkExecutionContext) {
    try {
        @SuppressWarnings("unchecked")
        JavaPairRDD<MatrixIndexes, MatrixBlock> binaryBlock = (JavaPairRDD<MatrixIndexes, MatrixBlock>)
            sparkExecutionContext.getRDDHandleForMatrixObject(matrixObject, InputInfo.BinaryBlockInputInfo);
        MatrixCharacteristics matrixCharacteristics = matrixObject.getMatrixCharacteristics();
        return new BinaryBlockMatrix(binaryBlock, matrixCharacteristics);
    } catch (DMLRuntimeException e) {
        throw new MLContextException("DMLRuntimeException while converting matrix object to BinaryBlockMatrix", e);
    }
}
Also used : MatrixBlock(org.apache.sysml.runtime.matrix.data.MatrixBlock) MatrixIndexes(org.apache.sysml.runtime.matrix.data.MatrixIndexes) JavaPairRDD(org.apache.spark.api.java.JavaPairRDD) MatrixCharacteristics(org.apache.sysml.runtime.matrix.MatrixCharacteristics) DMLRuntimeException(org.apache.sysml.runtime.DMLRuntimeException)
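
Setting the SystemML-specific types aside, the handle being cast above is just a keyed RDD of (block index, block) pairs. A minimal sketch of such a pair RDD, using Long keys and double[] values as stand-ins for MatrixIndexes and MatrixBlock:

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class BinaryBlockPairSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setAppName("binary-block-sketch").setMaster("local[*]"));
        // Two "blocks" keyed by a block index; the real code keys by (rowIndex, colIndex).
        JavaPairRDD<Long, double[]> blocks = sc.parallelizePairs(Arrays.asList(
            new Tuple2<>(1L, new double[] { 1.0, 2.0 }),
            new Tuple2<>(2L, new double[] { 3.0, 4.0 })));
        System.out.println(blocks.count()); // 2
        sc.stop();
    }
}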

Example 49 with JavaPairRDD

Use of org.apache.spark.api.java.JavaPairRDD in project incubator-systemml by apache.

The class ResultMergeRemoteSpark, method executeMerge.

@SuppressWarnings("unchecked")
protected RDDObject executeMerge(MatrixObject compare, MatrixObject[] inputs, String varname, long rlen, long clen, int brlen, int bclen) throws DMLRuntimeException {
    String jobname = "ParFor-RMSP";
    long t0 = DMLScript.STATISTICS ? System.nanoTime() : 0;
    SparkExecutionContext sec = (SparkExecutionContext) _ec;
    boolean withCompare = (compare != null);
    RDDObject ret = null;
    //determine degree of parallelism
    int numRed = (int) determineNumReducers(rlen, clen, brlen, bclen, _numReducers);
    //sanity check for empty src files
    if (inputs == null || inputs.length == 0)
        throw new DMLRuntimeException("Execute merge should never be called with no inputs.");
    try {
        //note: initial implementation via union over all result rdds discarded due to 
        //stack overflow errors with many parfor tasks, and thus many rdds
        //Step 1: construct input rdd from all result files of parfor workers
        //a) construct job conf with all files
        InputInfo ii = InputInfo.BinaryBlockInputInfo;
        JobConf job = new JobConf(ResultMergeRemoteMR.class);
        job.setJobName(jobname);
        job.setInputFormat(ii.inputFormatClass);
        Path[] paths = new Path[inputs.length];
        for (int i = 0; i < paths.length; i++) {
            //ensure input exists on hdfs (e.g., if in-memory or RDD)
            inputs[i].exportData();
            paths[i] = new Path(inputs[i].getFileName());
            //update rdd handle to allow lazy evaluation by guarding 
            //against cleanup of temporary result files
            setRDDHandleForMerge(inputs[i], sec);
        }
        FileInputFormat.setInputPaths(job, paths);
        //b) create rdd from input files w/ deep copy of keys and blocks
        JavaPairRDD<MatrixIndexes, MatrixBlock> rdd = sec.getSparkContext()
            .hadoopRDD(job, ii.inputFormatClass, ii.inputKeyClass, ii.inputValueClass)
            .mapPartitionsToPair(new CopyBlockPairFunction(true), true);
        //Step 2a: merge with compare
        JavaPairRDD<MatrixIndexes, MatrixBlock> out = null;
        if (withCompare) {
            JavaPairRDD<MatrixIndexes, MatrixBlock> compareRdd = (JavaPairRDD<MatrixIndexes, MatrixBlock>)
                sec.getRDDHandleForMatrixObject(compare, InputInfo.BinaryBlockInputInfo);
            //merge values which differ from compare values
            ResultMergeRemoteSparkWCompare cfun = new ResultMergeRemoteSparkWCompare();
            //group all result blocks per key, join the compare block, then merge result blocks w/ compare
            out = rdd.groupByKey(numRed)
                     .join(compareRdd)
                     .mapToPair(cfun);
        } else {
            //Step 2b: merge without compare
            //direct merge in any order (disjointness guaranteed)
            out = RDDAggregateUtils.mergeByKey(rdd, false);
        }
        //Step 3: create output rdd handle w/ lineage
        ret = new RDDObject(out, varname);
        for (int i = 0; i < paths.length; i++) ret.addLineageChild(inputs[i].getRDDHandle());
        if (withCompare)
            ret.addLineageChild(compare.getRDDHandle());
    } catch (Exception ex) {
        throw new DMLRuntimeException(ex);
    }
    //maintain statistics
    Statistics.incrementNoOfCompiledSPInst();
    Statistics.incrementNoOfExecutedSPInst();
    if (DMLScript.STATISTICS) {
        Statistics.maintainCPHeavyHitters(jobname, System.nanoTime() - t0);
    }
    return ret;
}
Also used : Path(org.apache.hadoop.fs.Path) MatrixBlock(org.apache.sysml.runtime.matrix.data.MatrixBlock) MatrixIndexes(org.apache.sysml.runtime.matrix.data.MatrixIndexes) CopyBlockPairFunction(org.apache.sysml.runtime.instructions.spark.functions.CopyBlockPairFunction) DMLRuntimeException(org.apache.sysml.runtime.DMLRuntimeException) InputInfo(org.apache.sysml.runtime.matrix.data.InputInfo) JavaPairRDD(org.apache.spark.api.java.JavaPairRDD) RDDObject(org.apache.sysml.runtime.instructions.spark.data.RDDObject) SparkExecutionContext(org.apache.sysml.runtime.controlprogram.context.SparkExecutionContext) JobConf(org.apache.hadoop.mapred.JobConf)
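
A minimal sketch of the with-compare branch's group-join-merge shape on plain types, with Integer keys and String values standing in for MatrixIndexes and MatrixBlock, and a trivial "prefer the first value that differs from compare" merge in place of ResultMergeRemoteSparkWCompare:

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class GroupJoinMergeSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setAppName("group-join-merge-sketch").setMaster("local[*]"));
        // Worker results: possibly several values per key.
        JavaPairRDD<Integer, String> results = sc.parallelizePairs(Arrays.asList(
            new Tuple2<>(1, "a-updated"), new Tuple2<>(2, "b")));
        // Compare (original) values: exactly one per key.
        JavaPairRDD<Integer, String> compare = sc.parallelizePairs(Arrays.asList(
            new Tuple2<>(1, "a"), new Tuple2<>(2, "b")));
        // Group all result values per key, join the compare value, then merge:
        // keep the first result that differs from compare, otherwise keep compare.
        JavaPairRDD<Integer, String> merged = results
            .groupByKey(2)
            .join(compare)
            .mapToPair(t -> {
                String comp = t._2()._2();
                String out = comp;
                for (String v : t._2()._1()) {
                    if (!v.equals(comp)) { out = v; break; }
                }
                return new Tuple2<>(t._1(), out);
            });
        // Prints (1,a-updated) and (2,b)
        merged.collect().forEach(System.out::println);
        sc.stop();
    }
}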

Example 50 with JavaPairRDD

Use of org.apache.spark.api.java.JavaPairRDD in project tdi-studio-se by Talend.

The class TalendDStreamPairRDD, method saveAsHadoopDataset.

@Override
public void saveAsHadoopDataset(JobConf conf) {
    final JobConf config = conf;
    this.rdd.foreachRDD(new Function<JavaPairRDD<K, V>, Void>() {

        private static final long serialVersionUID = 1L;

        public Void call(JavaPairRDD<K, V> v1) throws Exception {
            v1.saveAsHadoopDataset(config);
            return null;
        }
    });
}
Also used : JavaPairRDD(org.apache.spark.api.java.JavaPairRDD) JobConf(org.apache.hadoop.mapred.JobConf)
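
For the non-streaming case, a minimal sketch of what the inner call does, assuming the old mapred TextOutputFormat, Text/IntWritable output types, and a placeholder output path (which must not already exist):

import java.util.Arrays;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SaveAsHadoopDatasetSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setAppName("save-as-hadoop-dataset-sketch").setMaster("local[*]"));
        // Build Writable pairs inside a transformation so the driver-side data stays serializable.
        JavaPairRDD<Text, IntWritable> pairs = sc
            .parallelizePairs(Arrays.asList(new Tuple2<>("a", 1), new Tuple2<>("b", 2)))
            .mapToPair(t -> new Tuple2<>(new Text(t._1()), new IntWritable(t._2())));
        JobConf conf = new JobConf();
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(conf, new Path("/tmp/save-as-hadoop-dataset-sketch")); // placeholder
        pairs.saveAsHadoopDataset(conf);
        sc.stop();
    }
}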

Aggregations

JavaPairRDD (org.apache.spark.api.java.JavaPairRDD) 99
MatrixBlock (org.apache.sysml.runtime.matrix.data.MatrixBlock) 44
JavaSparkContext (org.apache.spark.api.java.JavaSparkContext) 42
MatrixIndexes (org.apache.sysml.runtime.matrix.data.MatrixIndexes) 42
MatrixCharacteristics (org.apache.sysml.runtime.matrix.MatrixCharacteristics) 41
Tuple2 (scala.Tuple2) 35
DMLRuntimeException (org.apache.sysml.runtime.DMLRuntimeException) 33
JavaRDD (org.apache.spark.api.java.JavaRDD) 28
List (java.util.List) 27
SparkExecutionContext (org.apache.sysml.runtime.controlprogram.context.SparkExecutionContext) 24
FrameBlock (org.apache.sysml.runtime.matrix.data.FrameBlock) 23
Collectors (java.util.stream.Collectors) 22
IOException (java.io.IOException) 17
RDDObject (org.apache.sysml.runtime.instructions.spark.data.RDDObject) 16
LongWritable (org.apache.hadoop.io.LongWritable) 15
Broadcast (org.apache.spark.broadcast.Broadcast) 15
Text (org.apache.hadoop.io.Text) 12
UserException (org.broadinstitute.hellbender.exceptions.UserException) 12
Function (org.apache.spark.api.java.function.Function) 11
MatrixObject (org.apache.sysml.runtime.controlprogram.caching.MatrixObject) 11