
Example 76 with JavaPairRDD

Use of org.apache.spark.api.java.JavaPairRDD in project incubator-systemml by apache.

From the class WriteSPInstruction, the method processFrameWriteInstruction:

@SuppressWarnings("unchecked")
protected void processFrameWriteInstruction(SparkExecutionContext sec, String fname, OutputInfo oi, ValueType[] schema) throws IOException {
    // get input rdd
    JavaPairRDD<Long, FrameBlock> in1 = (JavaPairRDD<Long, FrameBlock>) sec.getRDDHandleForVariable(input1.getName(), InputInfo.BinaryBlockInputInfo);
    MatrixCharacteristics mc = sec.getMatrixCharacteristics(input1.getName());
    if (oi == OutputInfo.TextCellOutputInfo) {
        JavaRDD<String> out = FrameRDDConverterUtils.binaryBlockToTextCell(in1, mc);
        customSaveTextFile(out, fname, false);
    } else if (oi == OutputInfo.CSVOutputInfo) {
        CSVFileFormatProperties props = (formatProperties != null) ? (CSVFileFormatProperties) formatProperties : null;
        JavaRDD<String> out = FrameRDDConverterUtils.binaryBlockToCsv(in1, mc, props, true);
        customSaveTextFile(out, fname, false);
    } else if (oi == OutputInfo.BinaryBlockOutputInfo) {
        JavaPairRDD<LongWritable, FrameBlock> out = in1.mapToPair(new LongFrameToLongWritableFrameFunction());
        out.saveAsHadoopFile(fname, LongWritable.class, FrameBlock.class, SequenceFileOutputFormat.class);
    } else {
        // unsupported formats: binarycell (not externalized)
        throw new DMLRuntimeException("Unexpected data format: " + OutputInfo.outputInfoToString(oi));
    }
    // write meta data file
    MapReduceTool.writeMetaDataFile(fname + ".mtd", input1.getValueType(), schema, DataType.FRAME, mc, oi, formatProperties);
}
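The method above picks a serializer by OutputInfo and always finishes by writing a companion `.mtd` metadata file. A Spark-free sketch of that write-then-metadata pattern (the `Format` enum, file layout, and JSON body here are illustrative stand-ins, not the SystemML API):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class WriteSketch {
    enum Format { TEXT, CSV, BINARY }

    // Companion metadata file name, analogous to fname + ".mtd" above.
    static String metadataName(String fname) {
        return fname + ".mtd";
    }

    // Write lines in the chosen format, then the metadata file, mirroring
    // the write-then-metadata sequence of processFrameWriteInstruction.
    static void write(Path fname, Format fmt, List<String> lines) throws IOException {
        switch (fmt) {
            case TEXT:
            case CSV:
                Files.write(fname, lines); // both formats are line-oriented text
                break;
            default:
                throw new IllegalArgumentException("Unexpected data format: " + fmt);
        }
        Files.writeString(Paths.get(metadataName(fname.toString())),
                "{\"format\":\"" + fmt + "\"}");
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempDirectory("w").resolve("out.txt");
        write(out, Format.TEXT, List.of("1 1 7.0", "2 2 3.0"));
        System.out.println(Files.exists(Paths.get(metadataName(out.toString())))); // prints "true"
    }
}
```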
Also used:
import org.apache.hadoop.io.LongWritable;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.sysml.runtime.DMLRuntimeException;
import org.apache.sysml.runtime.instructions.spark.utils.FrameRDDConverterUtils.LongFrameToLongWritableFrameFunction;
import org.apache.sysml.runtime.matrix.MatrixCharacteristics;
import org.apache.sysml.runtime.matrix.data.CSVFileFormatProperties;
import org.apache.sysml.runtime.matrix.data.FrameBlock;

Example 77 with JavaPairRDD

Use of org.apache.spark.api.java.JavaPairRDD in project incubator-systemml by apache.

From the class SparkUtils, the method getEmptyBlockRDD:

/**
 * Creates an RDD of empty blocks according to the given matrix characteristics. This is
 * done in a scalable manner by parallelizing block ranges and generating the empty blocks
 * in a distributed fashion, while respecting preferred output partition sizes.
 *
 * @param sc spark context
 * @param mc matrix characteristics
 * @return pair rdd of empty matrix blocks
 */
public static JavaPairRDD<MatrixIndexes, MatrixBlock> getEmptyBlockRDD(JavaSparkContext sc, MatrixCharacteristics mc) {
    // compute degree of parallelism and block ranges
    long size = mc.getNumBlocks() * OptimizerUtils.estimateSizeEmptyBlock(Math.min(Math.max(mc.getRows(), 1), mc.getRowsPerBlock()), Math.min(Math.max(mc.getCols(), 1), mc.getColsPerBlock()));
    int par = (int) Math.min(Math.max(SparkExecutionContext.getDefaultParallelism(true), Math.ceil(size / InfrastructureAnalyzer.getHDFSBlockSize())), mc.getNumBlocks());
    long pNumBlocks = (long) Math.ceil((double) mc.getNumBlocks() / par);
    // generate block offsets per partition
    List<Long> offsets = LongStream.iterate(0, n -> n + pNumBlocks).limit(par).boxed().collect(Collectors.toList());
    // parallelize offsets and generate all empty blocks
    return (JavaPairRDD<MatrixIndexes, MatrixBlock>) sc.parallelize(offsets, par).flatMapToPair(new GenerateEmptyBlocks(mc, pNumBlocks));
}
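The degree-of-parallelism and offset computation above is plain arithmetic and can be sketched without Spark. In this stand-alone version the size-based lower bound from OptimizerUtils and InfrastructureAnalyzer is replaced by a hard-coded default parallelism, so only the block-range math is shown:

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.LongStream;

public class EmptyBlockOffsets {
    // Mirrors getEmptyBlockRDD's partitioning math: cap parallelism by the
    // number of blocks, then assign each partition a contiguous block range
    // starting at its offset.
    static List<Long> offsets(long numBlocks, int defaultPar) {
        int par = (int) Math.min(defaultPar, numBlocks);
        long pNumBlocks = (long) Math.ceil((double) numBlocks / par);
        return LongStream.iterate(0, n -> n + pNumBlocks)
                .limit(par).boxed().collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // 10 blocks over 4 partitions -> 3 blocks per partition, offsets 0,3,6,9
        System.out.println(offsets(10, 4)); // prints "[0, 3, 6, 9]"
    }
}
```

Each offset marks where a partition's block range begins; the last partition simply generates fewer blocks when the division is uneven.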
Also used:
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.LongStream;
import org.apache.spark.HashPartitioner;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFlatMapFunction;
import org.apache.spark.storage.StorageLevel;
import org.apache.sysml.hops.OptimizerUtils;
import org.apache.sysml.lops.Checkpoint;
import org.apache.sysml.runtime.controlprogram.context.SparkExecutionContext;
import org.apache.sysml.runtime.controlprogram.parfor.stat.InfrastructureAnalyzer;
import org.apache.sysml.runtime.instructions.spark.functions.CopyBinaryCellFunction;
import org.apache.sysml.runtime.instructions.spark.functions.CopyBlockFunction;
import org.apache.sysml.runtime.instructions.spark.functions.CopyBlockPairFunction;
import org.apache.sysml.runtime.matrix.MatrixCharacteristics;
import org.apache.sysml.runtime.matrix.data.FrameBlock;
import org.apache.sysml.runtime.matrix.data.MatrixBlock;
import org.apache.sysml.runtime.matrix.data.MatrixCell;
import org.apache.sysml.runtime.matrix.data.MatrixIndexes;
import org.apache.sysml.runtime.matrix.data.Pair;
import org.apache.sysml.runtime.matrix.mapred.IndexedMatrixValue;
import org.apache.sysml.runtime.util.UtilFunctions;
import scala.Tuple2;

Example 78 with JavaPairRDD

Use of org.apache.spark.api.java.JavaPairRDD in project mmtf-spark by sbl-sdsc.

From the class StructureToBioJavaTest, the method test:

@Test
public void test() throws IOException {
    List<String> pdbIds = Arrays.asList("1STP", "4HHB", "1JLP", "5X6H", "5L2G", "2MK1");
    JavaPairRDD<String, StructureDataInterface> pdb = MmtfReader.downloadFullMmtfFiles(pdbIds, sc).cache();
    // 1STP: 1 L-protein chain
    // 4HHB: 4 polymer chains
    // 1JLP: 1 L-protein chain with non-polymer capping group (NH2)
    // 5X6H: 1 L-protein and 1 DNA chain
    // 5L2G: 2 DNA chains
    // 2MK1: 0 polymer chains
    // --------------------
    // tot : 10 polymer chains
    JavaDoubleRDD chainCounts = pdb.mapValues(new StructureToBioJava()).values().mapToDouble(v -> v.getPolyChains().size());
    assertEquals(10, Math.round(chainCounts.sum()));
    // extract polymer chains and count chains again
    chainCounts = pdb.flatMapToPair(new StructureToPolymerChains()).mapValues(new StructureToBioJava()).values().mapToDouble(v -> v.getChains().size());
    assertEquals(10, Math.round(chainCounts.sum()));
}
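The expected total of 10 polymer chains is just the sum of the per-entry counts listed in the comments. A Spark-free tally of those numbers (counts hard-coded from the comments above):

```java
import java.util.Map;

public class ChainCountSketch {
    // Polymer chain counts per PDB id, taken from the test's comments.
    static int total() {
        Map<String, Integer> chains = Map.of(
                "1STP", 1, "4HHB", 4, "1JLP", 1,
                "5X6H", 2, "5L2G", 2, "2MK1", 0);
        return chains.values().stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        System.out.println(total()); // prints "10"
    }
}
```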
Also used:
import edu.sdsc.mmtf.spark.io.MmtfReader;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaDoubleRDD;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;
import org.rcsb.mmtf.api.StructureDataInterface;
import static org.junit.Assert.assertEquals;

Example 79 with JavaPairRDD

Use of org.apache.spark.api.java.JavaPairRDD in project java_study by aloyschen.

From the class RDD, the method java_pair:

/*
 * Spark pair RDD example.
 */
public void java_pair() {
    JavaSparkContext sc = getSc();
    sc.setLogLevel("ERROR");
    JavaRDD<String> lines = sc.parallelize(Arrays.asList("I am boy", "you are cold", "I am learning"));
    JavaPairRDD<String, String> pairRDD = lines.mapToPair((PairFunction<String, String, String>) s -> new Tuple2<>(s.split(" ")[0], s));
    pairRDD.foreach(line -> System.out.println("key is " + line._1));
}
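The mapToPair call keys each line by its first word. The same transformation can be sketched with plain Java streams, no Spark context required:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class PairSketch {
    // (firstWord, line) pairs, like
    // lines.mapToPair(s -> new Tuple2<>(s.split(" ")[0], s))
    static List<Map.Entry<String, String>> toPairs(List<String> lines) {
        return lines.stream()
                .map(s -> Map.entry(s.split(" ")[0], s))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = List.of("I am boy", "you are cold", "I am learning");
        toPairs(lines).forEach(e -> System.out.println("key is " + e.getKey()));
    }
}
```

In Spark the pairs would then feed key-based operations such as reduceByKey or groupByKey; here Map.entry simply stands in for Tuple2.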
Also used:
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import scala.Serializable;
import scala.Tuple2;

Example 80 with JavaPairRDD

Use of org.apache.spark.api.java.JavaPairRDD in project cdap by caskdata.

From the class SparkBatchSourceFactory, the method createInputRDD:

@SuppressWarnings("unchecked")
private <K, V> JavaPairRDD<K, V> createInputRDD(JavaSparkExecutionContext sec, JavaSparkContext jsc, String inputName, Class<K> keyClass, Class<V> valueClass) {
    if (streams.containsKey(inputName)) {
        Input.StreamInput streamInput = streams.get(inputName);
        FormatSpecification formatSpec = streamInput.getBodyFormatSpec();
        if (formatSpec != null) {
            return (JavaPairRDD<K, V>) sec.fromStream(streamInput.getName(), formatSpec, streamInput.getStartTime(), streamInput.getEndTime(), StructuredRecord.class);
        }
        String decoderType = streamInput.getDecoderType();
        if (decoderType == null) {
            return (JavaPairRDD<K, V>) sec.fromStream(streamInput.getName(), streamInput.getStartTime(), streamInput.getEndTime(), valueClass);
        } else {
            try {
                Class<StreamEventDecoder<K, V>> decoderClass = (Class<StreamEventDecoder<K, V>>) Thread.currentThread().getContextClassLoader().loadClass(decoderType);
                return sec.fromStream(streamInput.getName(), streamInput.getStartTime(), streamInput.getEndTime(), decoderClass, keyClass, valueClass);
            } catch (Exception e) {
                throw Throwables.propagate(e);
            }
        }
    }
    if (inputFormatProviders.containsKey(inputName)) {
        InputFormatProvider inputFormatProvider = inputFormatProviders.get(inputName);
        Configuration hConf = new Configuration();
        hConf.clear();
        for (Map.Entry<String, String> entry : inputFormatProvider.getInputFormatConfiguration().entrySet()) {
            hConf.set(entry.getKey(), entry.getValue());
        }
        ClassLoader classLoader = Objects.firstNonNull(currentThread().getContextClassLoader(), getClass().getClassLoader());
        try {
            @SuppressWarnings("unchecked") Class<InputFormat> inputFormatClass = (Class<InputFormat>) classLoader.loadClass(inputFormatProvider.getInputFormatClassName());
            return jsc.newAPIHadoopRDD(hConf, inputFormatClass, keyClass, valueClass);
        } catch (ClassNotFoundException e) {
            throw Throwables.propagate(e);
        }
    }
    if (datasetInfos.containsKey(inputName)) {
        DatasetInfo datasetInfo = datasetInfos.get(inputName);
        return sec.fromDataset(datasetInfo.getDatasetName(), datasetInfo.getDatasetArgs());
    }
    // unreachable in practice: earlier validation makes sure that one and only
    // one of these source types is specified
    throw new IllegalStateException("Unknown source type");
}
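The Objects.firstNonNull call (Guava) prefers the thread's context class loader and falls back to the defining class's loader before loading the input format by name. A minimal stand-alone sketch of that lookup, using only the JDK (loading java.lang.String here purely as a demonstration):

```java
public class LoaderSketch {
    // Prefer the context class loader; fall back to this class's own loader,
    // mirroring Objects.firstNonNull(currentThread().getContextClassLoader(),
    //                                getClass().getClassLoader()).
    static ClassLoader pickLoader() {
        ClassLoader ctx = Thread.currentThread().getContextClassLoader();
        return (ctx != null) ? ctx : LoaderSketch.class.getClassLoader();
    }

    static Class<?> load(String className) {
        try {
            return Class.forName(className, true, pickLoader());
        } catch (ClassNotFoundException e) {
            // analogous to Throwables.propagate(e) in the snippet above
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(load("java.lang.String").getSimpleName()); // prints "String"
    }
}
```

Wrapping the checked ClassNotFoundException in a RuntimeException plays the same role as Guava's Throwables.propagate, which the snippet above uses.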
Also used:
import co.cask.cdap.api.data.batch.Input;
import co.cask.cdap.api.data.batch.InputFormatProvider;
import co.cask.cdap.api.data.format.FormatSpecification;
import co.cask.cdap.api.data.format.StructuredRecord;
import co.cask.cdap.api.stream.StreamEventDecoder;
import com.google.common.collect.ImmutableMap;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.spark.api.java.JavaPairRDD;
