Example 31 with JavaPairRDD

Use of org.apache.spark.api.java.JavaPairRDD in project systemml by apache.

The class DataPartitionerRemoteSpark, method partitionMatrix.

@Override
@SuppressWarnings("unchecked")
protected void partitionMatrix(MatrixObject in, String fnameNew, InputInfo ii, OutputInfo oi, long rlen, long clen, int brlen, int bclen) {
    String jobname = "ParFor-DPSP";
    long t0 = DMLScript.STATISTICS ? System.nanoTime() : 0;
    SparkExecutionContext sec = (SparkExecutionContext) _ec;
    try {
        // cleanup existing output files
        MapReduceTool.deleteFileIfExistOnHDFS(fnameNew);
        // get input rdd
        JavaPairRDD<MatrixIndexes, MatrixBlock> inRdd = (JavaPairRDD<MatrixIndexes, MatrixBlock>) sec.getRDDHandleForMatrixObject(in, InputInfo.BinaryBlockInputInfo);
        // determine degree of parallelism
        MatrixCharacteristics mc = in.getMatrixCharacteristics();
        int numRed = (int) determineNumReducers(inRdd, mc, _numRed);
        // run spark remote data partition job
        DataPartitionerRemoteSparkMapper dpfun = new DataPartitionerRemoteSparkMapper(mc, ii, oi, _format, _n);
        DataPartitionerRemoteSparkReducer wfun = new DataPartitionerRemoteSparkReducer(fnameNew, oi, _replication);
        // partition the input blocks
        inRdd.flatMapToPair(dpfun)
            // group partition blocks
            .groupByKey(numRed)
            // write partitions to hdfs
            .foreach(wfun);
    } catch (Exception ex) {
        throw new DMLRuntimeException(ex);
    }
    // maintain statistics
    Statistics.incrementNoOfCompiledSPInst();
    Statistics.incrementNoOfExecutedSPInst();
    if (DMLScript.STATISTICS) {
        Statistics.maintainCPHeavyHitters(jobname, System.nanoTime() - t0);
    }
}
Also used : MatrixBlock(org.apache.sysml.runtime.matrix.data.MatrixBlock) MatrixIndexes(org.apache.sysml.runtime.matrix.data.MatrixIndexes) DMLRuntimeException(org.apache.sysml.runtime.DMLRuntimeException) MatrixCharacteristics(org.apache.sysml.runtime.matrix.MatrixCharacteristics) JavaPairRDD(org.apache.spark.api.java.JavaPairRDD) SparkExecutionContext(org.apache.sysml.runtime.controlprogram.context.SparkExecutionContext)
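
For readers new to the pair-RDD API, the following stand-alone sketch isolates the flatMapToPair / groupByKey(n) / foreach pipeline used above. The class name and the toy string/integer data are invented for illustration; they are not part of SystemML.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class PartitionPipelineSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setMaster("local[*]").setAppName("PartitionPipelineSketch"));
        // toy (key, value) pairs standing in for (MatrixIndexes, MatrixBlock)
        JavaPairRDD<String, Integer> in = sc.parallelizePairs(Arrays.asList(
            new Tuple2<String, Integer>("a", 1),
            new Tuple2<String, Integer>("b", 2),
            new Tuple2<String, Integer>("a", 3)));
        // flatMapToPair: emit zero or more re-keyed records per input record
        // groupByKey(2): shuffle into 2 reduce partitions (the numRed of the example)
        // foreach: side-effecting action, analogous to writing partitions to HDFS
        in.flatMapToPair(t -> Arrays.asList(
                new Tuple2<String, Integer>(t._1, t._2),
                new Tuple2<String, Integer>(t._1, t._2 * 10)).iterator())
            .groupByKey(2)
            .foreach(t -> System.out.println(t._1 + " -> " + t._2));
        sc.close();
    }
}

Here groupByKey(2) plays the role of groupByKey(numRed) above: the argument fixes the number of shuffle partitions, which is how the SystemML job controls the parallelism of the partition writes.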

Example 32 with JavaPairRDD

Use of org.apache.spark.api.java.JavaPairRDD in project systemml by apache.

The class SparkExecutionContext, method repartitionAndCacheMatrixObject.

@SuppressWarnings("unchecked")
public void repartitionAndCacheMatrixObject(String var) {
    MatrixObject mo = getMatrixObject(var);
    MatrixCharacteristics mcIn = mo.getMatrixCharacteristics();
    // double check size to avoid unnecessary spark context creation
    if (!OptimizerUtils.exceedsCachingThreshold(mo.getNumColumns(), (double) OptimizerUtils.estimateSizeExactSparsity(mcIn)))
        return;
    // get input rdd and default storage level
    JavaPairRDD<MatrixIndexes, MatrixBlock> in = (JavaPairRDD<MatrixIndexes, MatrixBlock>) getRDDHandleForMatrixObject(mo, InputInfo.BinaryBlockInputInfo);
    // avoid unnecessary caching of input in order to reduce memory pressure
    if (mo.getRDDHandle().allowsShortCircuitRead() && isRDDMarkedForCaching(in.id()) && !isRDDCached(in.id())) {
        in = (JavaPairRDD<MatrixIndexes, MatrixBlock>) ((RDDObject) mo.getRDDHandle().getLineageChilds().get(0)).getRDD();
        // investigate issue of unnecessarily large number of partitions
        int numPartitions = SparkUtils.getNumPreferredPartitions(mcIn, in);
        if (numPartitions < in.getNumPartitions())
            in = in.coalesce(numPartitions);
    }
    // repartition rdd (force creation of a shuffled rdd via merge); no deep copy is needed even though
    // this executes on the original data, because there are no key duplicates and hence no actual merge
    JavaPairRDD<MatrixIndexes, MatrixBlock> out = RDDAggregateUtils.mergeByKey(in, false);
    // convert mcsr into memory-efficient csr if potentially sparse
    if (OptimizerUtils.checkSparseBlockCSRConversion(mcIn)) {
        out = out.mapValues(new CreateSparseBlockFunction(SparseBlock.Type.CSR));
    }
    // persist rdd in default storage level
    out.persist(Checkpoint.DEFAULT_STORAGE_LEVEL).count();
    // create new rdd handle in place of the current matrix object
    // guaranteed to exist (see above)
    RDDObject inro = mo.getRDDHandle();
    // create new rdd object
    RDDObject outro = new RDDObject(out);
    // mark as checkpointed
    outro.setCheckpointRDD(true);
    // keep lineage to prevent cycles on cleanup
    outro.addLineageChild(inro);
    mo.setRDDHandle(outro);
}
Also used : MatrixBlock(org.apache.sysml.runtime.matrix.data.MatrixBlock) CompressedMatrixBlock(org.apache.sysml.runtime.compress.CompressedMatrixBlock) MatrixObject(org.apache.sysml.runtime.controlprogram.caching.MatrixObject) CreateSparseBlockFunction(org.apache.sysml.runtime.instructions.spark.functions.CreateSparseBlockFunction) MatrixIndexes(org.apache.sysml.runtime.matrix.data.MatrixIndexes) JavaPairRDD(org.apache.spark.api.java.JavaPairRDD) RDDObject(org.apache.sysml.runtime.instructions.spark.data.RDDObject) Checkpoint(org.apache.sysml.lops.Checkpoint) MatrixCharacteristics(org.apache.sysml.runtime.matrix.MatrixCharacteristics)
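
The caching idiom at the end, persist at a chosen storage level and then count() to force materialization, can be reproduced in isolation. A minimal sketch follows; the class name, the toy data, reduceByKey standing in for RDDAggregateUtils.mergeByKey, and the MEMORY_AND_DISK storage level are assumptions for illustration (SystemML itself uses Checkpoint.DEFAULT_STORAGE_LEVEL).

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;
import scala.Tuple2;

public class RepartitionCacheSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setMaster("local[*]").setAppName("RepartitionCacheSketch"));
        // toy input with deliberately too many partitions
        JavaPairRDD<Integer, String> in = sc.parallelizePairs(Arrays.asList(
            new Tuple2<Integer, String>(1, "x"),
            new Tuple2<Integer, String>(2, "y"),
            new Tuple2<Integer, String>(1, "z")), 8);
        // shrink an over-partitioned RDD without a shuffle
        int preferred = 2;
        if (preferred < in.getNumPartitions())
            in = in.coalesce(preferred);
        // reduceByKey stands in for RDDAggregateUtils.mergeByKey (forces a shuffled rdd)
        JavaPairRDD<Integer, String> out = in.reduceByKey((a, b) -> a + b);
        // persist and force evaluation so later jobs reuse the cached result,
        // mirroring out.persist(Checkpoint.DEFAULT_STORAGE_LEVEL).count() above
        out.persist(StorageLevel.MEMORY_AND_DISK()).count();
        sc.close();
    }
}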

Example 33 with JavaPairRDD

Use of org.apache.spark.api.java.JavaPairRDD in project systemml by apache.

The class SparkUtils, method getEmptyBlockRDD.

/**
 * Creates an RDD of empty blocks according to the given matrix characteristics. This is
 * done in a scalable way by parallelizing block ranges and generating the empty blocks
 * directly on the executors, taking preferred output partition sizes into account.
 *
 * @param sc spark context
 * @param mc matrix characteristics
 * @return pair rdd of empty matrix blocks
 */
public static JavaPairRDD<MatrixIndexes, MatrixBlock> getEmptyBlockRDD(JavaSparkContext sc, MatrixCharacteristics mc) {
    // compute degree of parallelism and block ranges
    long size = mc.getNumBlocks() * OptimizerUtils.estimateSizeEmptyBlock(Math.min(Math.max(mc.getRows(), 1), mc.getRowsPerBlock()), Math.min(Math.max(mc.getCols(), 1), mc.getColsPerBlock()));
    int par = (int) Math.min(Math.max(SparkExecutionContext.getDefaultParallelism(true), Math.ceil(size / InfrastructureAnalyzer.getHDFSBlockSize())), mc.getNumBlocks());
    long pNumBlocks = (long) Math.ceil((double) mc.getNumBlocks() / par);
    // generate block offsets per partition
    List<Long> offsets = LongStream.iterate(0, n -> n + pNumBlocks).limit(par).boxed().collect(Collectors.toList());
    // parallelize offsets and generate all empty blocks
    return (JavaPairRDD<MatrixIndexes, MatrixBlock>) sc.parallelize(offsets, par).flatMapToPair(new GenerateEmptyBlocks(mc, pNumBlocks));
}
Also used : Function2(org.apache.spark.api.java.function.Function2) PairFlatMapFunction(org.apache.spark.api.java.function.PairFlatMapFunction) CopyBlockPairFunction(org.apache.sysml.runtime.instructions.spark.functions.CopyBlockPairFunction) JavaSparkContext(org.apache.spark.api.java.JavaSparkContext) MatrixIndexes(org.apache.sysml.runtime.matrix.data.MatrixIndexes) SparkExecutionContext(org.apache.sysml.runtime.controlprogram.context.SparkExecutionContext) ArrayList(java.util.ArrayList) MatrixBlock(org.apache.sysml.runtime.matrix.data.MatrixBlock) CopyBlockFunction(org.apache.sysml.runtime.instructions.spark.functions.CopyBlockFunction) FilterNonEmptyBlocksFunction(org.apache.sysml.runtime.instructions.spark.functions.FilterNonEmptyBlocksFunction) StorageLevel(org.apache.spark.storage.StorageLevel) InfrastructureAnalyzer(org.apache.sysml.runtime.controlprogram.parfor.stat.InfrastructureAnalyzer) FrameBlock(org.apache.sysml.runtime.matrix.data.FrameBlock) RecomputeNnzFunction(org.apache.sysml.runtime.instructions.spark.functions.RecomputeNnzFunction) MatrixCell(org.apache.sysml.runtime.matrix.data.MatrixCell) LongStream(java.util.stream.LongStream) Iterator(java.util.Iterator) Pair(org.apache.sysml.runtime.matrix.data.Pair) HashPartitioner(org.apache.spark.HashPartitioner) Checkpoint(org.apache.sysml.lops.Checkpoint) Tuple2(scala.Tuple2) Collectors(java.util.stream.Collectors) JavaPairRDD(org.apache.spark.api.java.JavaPairRDD) OptimizerUtils(org.apache.sysml.hops.OptimizerUtils) List(java.util.List) IndexedMatrixValue(org.apache.sysml.runtime.matrix.mapred.IndexedMatrixValue) CopyBinaryCellFunction(org.apache.sysml.runtime.instructions.spark.functions.CopyBinaryCellFunction) MatrixCharacteristics(org.apache.sysml.runtime.matrix.MatrixCharacteristics) Function(org.apache.spark.api.java.function.Function) UtilFunctions(org.apache.sysml.runtime.util.UtilFunctions)
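
The scalability trick is to parallelize only one offset per partition and let each task expand its block range on the executors. A simplified sketch of that offsets-plus-flatMapToPair pattern, with made-up block counts and string payloads in place of MatrixBlock, might look like this:

import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.LongStream;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.PairFlatMapFunction;
import scala.Tuple2;

public class EmptyBlockSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setMaster("local[*]").setAppName("EmptyBlockSketch"));
        long numBlocks = 10;   // stand-in for mc.getNumBlocks()
        int par = 4;           // stand-in for the computed degree of parallelism
        long perPartition = (long) Math.ceil((double) numBlocks / par);
        // one starting offset per partition, as in getEmptyBlockRDD
        List<Long> offsets = LongStream.iterate(0, n -> n + perPartition)
            .limit(par).boxed().collect(Collectors.toList());
        // each offset is expanded into its range of block indices on the executors
        JavaPairRDD<Long, String> blocks = sc.parallelize(offsets, par)
            .flatMapToPair((PairFlatMapFunction<Long, Long, String>) off -> {
                List<Tuple2<Long, String>> out = new ArrayList<>();
                for (long i = off; i < Math.min(off + perPartition, numBlocks); i++)
                    out.add(new Tuple2<Long, String>(i, "empty-block-" + i));
                return out.iterator();
            });
        System.out.println(blocks.count());   // 10
        sc.close();
    }
}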

Example 34 with JavaPairRDD

Use of org.apache.spark.api.java.JavaPairRDD in project mm-dev by sbl-sdsc.

The class StructureAligner, method getAllVsAllAlignments.

/**
 * Calculates all vs. all structural alignments of protein chains using the
 * specified alignment algorithm. The input structures must contain single
 * protein chains.
 *
 * @param targets structures containing single protein chains
 * @param alignmentAlgorithm name of the algorithm
 * @return dataset with alignment metrics
 */
public static Dataset<Row> getAllVsAllAlignments(JavaPairRDD<String, StructureDataInterface> targets, String alignmentAlgorithm) {
    SparkSession session = SparkSession.builder().getOrCreate();
    JavaSparkContext sc = new JavaSparkContext(session.sparkContext());
    // create a list of (chainName, C-alpha coordinates) pairs
    List<Tuple2<String, Point3d[]>> chains = targets.mapValues(s -> new ColumnarStructureX(s, true).getcAlphaCoordinates()).collect();
    // create an RDD of all pair indices (0,1), (0,2), ..., (1,2), (1,3), ...
    JavaRDD<Tuple2<Integer, Integer>> pairs = getPairs(sc, chains.size());
    // calculate structural alignments for all pairs.
    // broadcast (copy) chains to all worker nodes for efficient processing.
    // for each pair there can be zero or more solutions, therefore we flatmap the pairs.
    JavaRDD<Row> rows = pairs.flatMap(new StructuralAlignmentMapper(sc.broadcast(chains), alignmentAlgorithm));
    // convert rows to a dataset
    return session.createDataFrame(rows, getSchema());
}
Also used : IntStream(java.util.stream.IntStream) DataTypes(org.apache.spark.sql.types.DataTypes) StructField(org.apache.spark.sql.types.StructField) StructType(org.apache.spark.sql.types.StructType) Iterator(java.util.Iterator) Dataset(org.apache.spark.sql.Dataset) JavaSparkContext(org.apache.spark.api.java.JavaSparkContext) ColumnarStructureX(edu.sdsc.mmtf.spark.utils.ColumnarStructureX) Row(org.apache.spark.sql.Row) Tuple2(scala.Tuple2) Collectors(java.util.stream.Collectors) JavaPairRDD(org.apache.spark.api.java.JavaPairRDD) Serializable(java.io.Serializable) ArrayList(java.util.ArrayList) List(java.util.List) StructureDataInterface(org.rcsb.mmtf.api.StructureDataInterface) Point3d(javax.vecmath.Point3d) JavaRDD(org.apache.spark.api.java.JavaRDD) FlatMapFunction(org.apache.spark.api.java.function.FlatMapFunction) SparkSession(org.apache.spark.sql.SparkSession)
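
A reduced sketch of the same structure is shown below: collect a small table on the driver, broadcast it, generate all index pairs, map each pair to a Row, and build a Dataset from an explicit schema. The names, the two-column schema, and the driver-side pair loop are illustrative assumptions, not the mm-dev API.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;
import scala.Tuple2;

public class AllVsAllSketch {
    public static void main(String[] args) {
        SparkSession session = SparkSession.builder()
            .master("local[*]").appName("AllVsAllSketch").getOrCreate();
        JavaSparkContext sc = new JavaSparkContext(session.sparkContext());
        // small lookup table collected on the driver and broadcast to all executors
        List<String> names = Arrays.asList("chainA", "chainB", "chainC");
        Broadcast<List<String>> bc = sc.broadcast(names);
        // all pair indices (i, j) with i < j, analogous to getPairs(sc, chains.size())
        List<Tuple2<Integer, Integer>> pairs = new ArrayList<>();
        for (int i = 0; i < names.size() - 1; i++)
            for (int j = i + 1; j < names.size(); j++)
                pairs.add(new Tuple2<Integer, Integer>(i, j));
        // one output row per pair; a real mapper could emit zero or more rows per pair
        JavaRDD<Row> rows = sc.parallelize(pairs)
            .map(p -> RowFactory.create(bc.value().get(p._1), bc.value().get(p._2)));
        StructType schema = DataTypes.createStructType(Arrays.asList(
            DataTypes.createStructField("name1", DataTypes.StringType, false),
            DataTypes.createStructField("name2", DataTypes.StringType, false)));
        Dataset<Row> ds = session.createDataFrame(rows, schema);
        ds.show();
        session.stop();
    }
}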

Example 35 with JavaPairRDD

Use of org.apache.spark.api.java.JavaPairRDD in project mm-dev by sbl-sdsc.

The class D3RLigandProteinMerger, method main.

public static void main(String[] args) throws IOException {
    long start = System.nanoTime();
    // instantiate Spark
    SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("D3RLigandProteinMerger");
    JavaSparkContext sc = new JavaSparkContext(conf);
    // String path = "/Users/peter/Downloads/Pose_prediction/417-1-hciq4/";
    String path = "/Users/peter/Downloads/Pose_prediction/";
    JavaPairRDD<String, StructureDataInterface> ligands = Molmporter.importMolFiles(path, sc);
    ligands = ligands.mapToPair(t -> new Tuple2<String, StructureDataInterface>(removeExtension(t._1), t._2));
    JavaPairRDD<String, StructureDataInterface> proteins = MmtfImporter.importPdbFiles(path, sc);
    proteins = proteins.mapToPair(t -> new Tuple2<String, StructureDataInterface>(removeExtension(t._1), t._2));
    JavaPairRDD<String, Tuple2<StructureDataInterface, StructureDataInterface>> pairs = proteins.join(ligands);
    JavaPairRDD<String, StructureDataInterface> complexes = pairs.mapToPair(t -> new Tuple2<String, StructureDataInterface>(t._1, MergeMmtf.MergeStructures(t._1, t._2._1, t._2._2)));
    complexes.foreach(t -> TraverseStructureHierarchy.printChainInfo(t._2));
    // System.out.println("Complexes: " + complexes.count());
    // complexes.keys().foreach(k -> System.out.println(k));
    // TraverseStructureHierarchy.printChainInfo(complexes.first()._2);
    sc.close();
    long end = System.nanoTime();
    System.out.println("Time: " + (end - start) / 1E9 + " sec.");
}
Also used : MmtfImporter(edu.sdsc.mmtf.spark.io.MmtfImporter) Arrays(java.util.Arrays) SparkConf(org.apache.spark.SparkConf) JavaSparkContext(org.apache.spark.api.java.JavaSparkContext) Set(java.util.Set) IOException(java.io.IOException) Tuple2(scala.Tuple2) JavaPairRDD(org.apache.spark.api.java.JavaPairRDD) File(java.io.File) ArrayList(java.util.ArrayList) HashSet(java.util.HashSet) TraverseStructureHierarchy(edu.sdsc.mmtf.spark.io.demos.TraverseStructureHierarchy) List(java.util.List) MergeMmtf(edu.sdsc.mm.dev.io.MergeMmtf) Molmporter(edu.sdsc.mm.dev.io.Molmporter) StructureDataInterface(org.rcsb.mmtf.api.StructureDataInterface) SynchronizedSortedBag(org.apache.commons.collections.bag.SynchronizedSortedBag) Path(java.nio.file.Path) Files(org.spark_project.guava.io.Files) JavaRDD(org.apache.spark.api.java.JavaRDD)
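
The essential steps, normalizing the keys of two pair RDDs and then joining them, can be shown in a self-contained sketch. The file names and string payloads below are invented stand-ins for the MMTF structures:

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class KeyJoinSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setMaster("local[*]").setAppName("KeyJoinSketch"));
        // two RDDs keyed by file name, with different extensions
        JavaPairRDD<String, String> ligands = sc.parallelizePairs(Arrays.asList(
            new Tuple2<String, String>("cplx1.mol", "ligand1"),
            new Tuple2<String, String>("cplx2.mol", "ligand2")));
        JavaPairRDD<String, String> proteins = sc.parallelizePairs(Arrays.asList(
            new Tuple2<String, String>("cplx1.pdb", "protein1"),
            new Tuple2<String, String>("cplx2.pdb", "protein2")));
        // normalize keys (strip the extension) so matching records share a key
        ligands = ligands.mapToPair(t -> new Tuple2<String, String>(
            t._1.substring(0, t._1.lastIndexOf('.')), t._2));
        proteins = proteins.mapToPair(t -> new Tuple2<String, String>(
            t._1.substring(0, t._1.lastIndexOf('.')), t._2));
        // inner join: one (protein, ligand) tuple per shared key
        JavaPairRDD<String, Tuple2<String, String>> complexes = proteins.join(ligands);
        complexes.foreach(t -> System.out.println(t._1 + ": " + t._2._1 + " + " + t._2._2));
        sc.close();
    }
}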

Aggregations

JavaPairRDD (org.apache.spark.api.java.JavaPairRDD): 99
MatrixBlock (org.apache.sysml.runtime.matrix.data.MatrixBlock): 44
JavaSparkContext (org.apache.spark.api.java.JavaSparkContext): 42
MatrixIndexes (org.apache.sysml.runtime.matrix.data.MatrixIndexes): 42
MatrixCharacteristics (org.apache.sysml.runtime.matrix.MatrixCharacteristics): 41
Tuple2 (scala.Tuple2): 35
DMLRuntimeException (org.apache.sysml.runtime.DMLRuntimeException): 33
JavaRDD (org.apache.spark.api.java.JavaRDD): 28
List (java.util.List): 27
SparkExecutionContext (org.apache.sysml.runtime.controlprogram.context.SparkExecutionContext): 24
FrameBlock (org.apache.sysml.runtime.matrix.data.FrameBlock): 23
Collectors (java.util.stream.Collectors): 22
IOException (java.io.IOException): 17
RDDObject (org.apache.sysml.runtime.instructions.spark.data.RDDObject): 16
LongWritable (org.apache.hadoop.io.LongWritable): 15
Broadcast (org.apache.spark.broadcast.Broadcast): 15
Text (org.apache.hadoop.io.Text): 12
UserException (org.broadinstitute.hellbender.exceptions.UserException): 12
Function (org.apache.spark.api.java.function.Function): 11
MatrixObject (org.apache.sysml.runtime.controlprogram.caching.MatrixObject): 11