
Example 6 with HashPartitioner

Use of org.apache.spark.HashPartitioner in project gatk by broadinstitute.

From class CoverageModelEMWorkspace, method joinWithWorkersAndMap.

/**
     * A generic function for joining a blockified list of objects with their corresponding compute block(s)
     * and applying a map function.
     *
     * If Spark is enabled:
     *
     *      Joins an instance of {@code List<Tuple2<LinearlySpacedIndexBlock, V>>} with {@link #computeRDD}, calls
     *      the provided map function {@code mapper} on the joined RDD, and replaces the reference to the old RDD
     *      with the new RDD.
     *
     * If Spark is disabled:
     *
     *      Only a single target-space block is assumed, such that {@code data} is a singleton. The map function
     *      {@code mapper} is called on the value contained in {@code data} and {@link #localComputeBlock}, and
     *      the old instance of {@link CoverageModelEMComputeBlock} is replaced with the new instance returned
     *      by {@code mapper}.
     *
     * @param data the list to be joined and mapped together with the compute block(s)
     * @param mapper a binary mapper function that takes a compute block together with an object of type {@code V} and
     *               returns a new compute block
     * @param <V> the type of the object to be broadcast
     */
@UpdatesRDD
private <V> void joinWithWorkersAndMap(@Nonnull final List<Tuple2<LinearlySpacedIndexBlock, V>> data, @Nonnull final Function<Tuple2<CoverageModelEMComputeBlock, V>, CoverageModelEMComputeBlock> mapper) {
    if (sparkContextIsAvailable) {
        final JavaPairRDD<LinearlySpacedIndexBlock, V> newRDD = ctx.parallelizePairs(data, numTargetBlocks).partitionBy(new HashPartitioner(numTargetBlocks));
        computeRDD = computeRDD.join(newRDD).mapValues(mapper);
    } else {
        try {
            Utils.validateArg(data.size() == 1, "Only a single data block is expected in the local mode");
            localComputeBlock = mapper.call(new Tuple2<>(localComputeBlock, data.get(0)._2));
        } catch (Exception e) {
            throw new RuntimeException("Can not apply the map function to the local compute block: " + e.getMessage());
        }
    }
}
Also used : Tuple2(scala.Tuple2) HashPartitioner(org.apache.spark.HashPartitioner) GATKException(org.broadinstitute.hellbender.exceptions.GATKException) UserException(org.broadinstitute.hellbender.exceptions.UserException) TooManyEvaluationsException(org.apache.commons.math3.exception.TooManyEvaluationsException) IOException(java.io.IOException) NoBracketingException(org.apache.commons.math3.exception.NoBracketingException)
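
Worth spelling out why the partitionBy call matters here: computeRDD is already hash-partitioned by block key, so partitioning newRDD with an identical HashPartitioner makes the subsequent join partition-local, and the workers' compute blocks are never reshuffled. The following is a minimal, self-contained sketch of the same partitionBy/join/mapValues sequence; the Integer keys, String/Double values, and local[2] master are illustrative stand-ins, not the GATK types:

import org.apache.spark.HashPartitioner;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Arrays;

public class CoPartitionedJoinSketch {
    public static void main(String[] args) {
        final JavaSparkContext ctx = new JavaSparkContext("local[2]", "co-partitioned-join");
        final int numBlocks = 2;
        final HashPartitioner partitioner = new HashPartitioner(numBlocks);
        // "compute blocks" keyed by block index, partitioned once and reused across many joins
        final JavaPairRDD<Integer, String> computeRDD = ctx.parallelizePairs(
                Arrays.asList(new Tuple2<>(0, "block-0"), new Tuple2<>(1, "block-1")), numBlocks)
                .partitionBy(partitioner)
                .cache();
        // per-block data partitioned with the SAME partitioner, so the join below is partition-local
        final JavaPairRDD<Integer, Double> newRDD = ctx.parallelizePairs(
                Arrays.asList(new Tuple2<>(0, 1.5), new Tuple2<>(1, 2.5)), numBlocks)
                .partitionBy(partitioner);
        // mirrors computeRDD.join(newRDD).mapValues(mapper) from the example above
        final JavaPairRDD<Integer, String> updated = computeRDD.join(newRDD)
                .mapValues(pair -> pair._1 + " updated with " + pair._2);
        updated.collect().forEach(t -> System.out.println(t._1 + " -> " + t._2));
        ctx.stop();
    }
}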

Example 7 with HashPartitioner

Use of org.apache.spark.HashPartitioner in project gatk-protected by broadinstitute.

From class CoverageModelEMWorkspace, method joinWithWorkersAndMap.

/**
     * A generic function for joining a blockified list of objects with their corresponding compute block(s)
     * and applying a map function.
     *
     * If Spark is enabled:
     *
     *      Joins an instance of {@code List<Tuple2<LinearlySpacedIndexBlock, V>>} with {@link #computeRDD}, calls
     *      the provided map function {@code mapper} on the joined RDD, and replaces the reference to the old RDD
     *      with the new RDD.
     *
     * If Spark is disabled:
     *
     *      Only a single target-space block is assumed, such that {@code data} is a singleton. The map function
     *      {@code mapper} is called on the value contained in {@code data} and {@link #localComputeBlock}, and
     *      the old instance of {@link CoverageModelEMComputeBlock} is replaced with the new instance returned
     *      by {@code mapper}.
     *
     * @param data the list to be joined and mapped together with the compute block(s)
     * @param mapper a binary mapper function that takes a compute block together with an object of type {@code V} and
     *               returns a new compute block
     * @param <V> the type of the object to be broadcast
     */
@UpdatesRDD
private <V> void joinWithWorkersAndMap(@Nonnull final List<Tuple2<LinearlySpacedIndexBlock, V>> data, @Nonnull final Function<Tuple2<CoverageModelEMComputeBlock, V>, CoverageModelEMComputeBlock> mapper) {
    if (sparkContextIsAvailable) {
        final JavaPairRDD<LinearlySpacedIndexBlock, V> newRDD = ctx.parallelizePairs(data, numTargetBlocks).partitionBy(new HashPartitioner(numTargetBlocks));
        computeRDD = computeRDD.join(newRDD).mapValues(mapper);
    } else {
        try {
            Utils.validateArg(data.size() == 1, "Only a single data block is expected in the local mode");
            localComputeBlock = mapper.call(new Tuple2<>(localComputeBlock, data.get(0)._2));
        } catch (Exception e) {
            throw new RuntimeException("Can not apply the map function to the local compute block: " + e.getMessage());
        }
    }
}
Also used : Tuple2(scala.Tuple2) HashPartitioner(org.apache.spark.HashPartitioner) GATKException(org.broadinstitute.hellbender.exceptions.GATKException) UserException(org.broadinstitute.hellbender.exceptions.UserException) TooManyEvaluationsException(org.apache.commons.math3.exception.TooManyEvaluationsException) IOException(java.io.IOException) NoBracketingException(org.apache.commons.math3.exception.NoBracketingException)
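
The local (non-Spark) branch above deserves a note: the mapper is typed as org.apache.spark.api.java.function.Function, whose call method is declared to throw Exception, which is why even a direct local invocation needs the try/catch. A small sketch of writing a mapper once and invoking it without any Spark context; the String/Double types are placeholders:

import org.apache.spark.api.java.function.Function;
import scala.Tuple2;

public class DualModeMapperSketch {
    // a mapper declared once as a Spark Function so it can serve both the RDD path and the local path
    static final Function<Tuple2<String, Double>, String> mapper =
            pair -> pair._1 + " updated with " + pair._2;

    public static void main(String[] args) {
        // local-mode path: Function.call throws Exception, hence the checked wrapper
        try {
            final String updatedBlock = mapper.call(new Tuple2<>("block-0", 3.0));
            System.out.println(updatedBlock);
        } catch (final Exception e) {
            throw new RuntimeException("Cannot apply the map function to the local compute block", e);
        }
    }
}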

Example 8 with HashPartitioner

Use of org.apache.spark.HashPartitioner in project gatk-protected by broadinstitute.

From class CoverageModelEMWorkspace, method instantiateWorkers.

/**
     * Instantiate compute block(s). If Spark is disabled, a single {@link CoverageModelEMComputeBlock} is
     * instantiated. Otherwise, a {@link JavaPairRDD} of compute nodes will be created.
     */
private void instantiateWorkers() {
    if (sparkContextIsAvailable) {
        /* initialize the RDD */
        logger.info("Initializing an RDD of compute blocks");
        computeRDD = ctx.parallelizePairs(
                targetBlockStream()
                        .map(tb -> new Tuple2<>(tb, new CoverageModelEMComputeBlock(tb, numSamples, numLatents, ardEnabled)))
                        .collect(Collectors.toList()),
                numTargetBlocks)
                .partitionBy(new HashPartitioner(numTargetBlocks))
                .cache();
    } else {
        logger.info("Initializing a local compute block");
        localComputeBlock = new CoverageModelEMComputeBlock(targetBlocks.get(0), numSamples, numLatents, ardEnabled);
    }
    prevCheckpointedComputeRDD = null;
    cacheCallCounter = 0;
}
Also used : ScalarProducer(org.broadinstitute.hellbender.utils.hmm.interfaces.ScalarProducer) Function2(org.apache.spark.api.java.function.Function2) HMMSegmentProcessor(org.broadinstitute.hellbender.utils.hmm.segmentation.HMMSegmentProcessor) GermlinePloidyAnnotatedTargetCollection(org.broadinstitute.hellbender.tools.exome.sexgenotyper.GermlinePloidyAnnotatedTargetCollection) HiddenStateSegmentRecordWriter(org.broadinstitute.hellbender.utils.hmm.segmentation.HiddenStateSegmentRecordWriter) BiFunction(java.util.function.BiFunction) GATKException(org.broadinstitute.hellbender.exceptions.GATKException) SexGenotypeData(org.broadinstitute.hellbender.tools.exome.sexgenotyper.SexGenotypeData) ParamUtils(org.broadinstitute.hellbender.utils.param.ParamUtils) CallStringProducer(org.broadinstitute.hellbender.utils.hmm.interfaces.CallStringProducer) StorageLevel(org.apache.spark.storage.StorageLevel) SynchronizedUnivariateSolver(org.broadinstitute.hellbender.tools.coveragemodel.math.SynchronizedUnivariateSolver) CopyRatioExpectationsCalculator(org.broadinstitute.hellbender.tools.coveragemodel.interfaces.CopyRatioExpectationsCalculator) UnivariateSolverSpecifications(org.broadinstitute.hellbender.tools.coveragemodel.math.UnivariateSolverSpecifications) IndexRange(org.broadinstitute.hellbender.utils.IndexRange) Broadcast(org.apache.spark.broadcast.Broadcast) ExitStatus(org.broadinstitute.hellbender.tools.coveragemodel.linalg.IterativeLinearSolverNDArray.ExitStatus) SexGenotypeDataCollection(org.broadinstitute.hellbender.tools.exome.sexgenotyper.SexGenotypeDataCollection) HashPartitioner(org.apache.spark.HashPartitioner) Predicate(java.util.function.Predicate) GeneralLinearOperator(org.broadinstitute.hellbender.tools.coveragemodel.linalg.GeneralLinearOperator) Nd4j(org.nd4j.linalg.factory.Nd4j) INDArrayIndex(org.nd4j.linalg.indexing.INDArrayIndex) FastMath(org.apache.commons.math3.util.FastMath) org.broadinstitute.hellbender.tools.exome(org.broadinstitute.hellbender.tools.exome) Tuple2(scala.Tuple2) Collectors(java.util.stream.Collectors) Sets(com.google.common.collect.Sets) AbstractUnivariateSolver(org.apache.commons.math3.analysis.solvers.AbstractUnivariateSolver) FourierLinearOperatorNDArray(org.broadinstitute.hellbender.tools.coveragemodel.linalg.FourierLinearOperatorNDArray) Logger(org.apache.logging.log4j.Logger) Stream(java.util.stream.Stream) UserException(org.broadinstitute.hellbender.exceptions.UserException) UnivariateFunction(org.apache.commons.math3.analysis.UnivariateFunction) TooManyEvaluationsException(org.apache.commons.math3.exception.TooManyEvaluationsException) Utils(org.broadinstitute.hellbender.utils.Utils) Function(org.apache.spark.api.java.function.Function) DataBuffer(org.nd4j.linalg.api.buffer.DataBuffer) IntStream(java.util.stream.IntStream) java.util(java.util) NDArrayIndex(org.nd4j.linalg.indexing.NDArrayIndex) JavaSparkContext(org.apache.spark.api.java.JavaSparkContext) AlleleMetadataProducer(org.broadinstitute.hellbender.utils.hmm.interfaces.AlleleMetadataProducer) EmissionCalculationStrategy(org.broadinstitute.hellbender.tools.coveragemodel.CoverageModelCopyRatioEmissionProbabilityCalculator.EmissionCalculationStrategy) RobustBrentSolver(org.broadinstitute.hellbender.tools.coveragemodel.math.RobustBrentSolver) IntervalUtils(org.broadinstitute.hellbender.utils.IntervalUtils) Nonnull(javax.annotation.Nonnull) Nullable(javax.annotation.Nullable) HiddenStateSegmentRecord(org.broadinstitute.hellbender.utils.hmm.segmentation.HiddenStateSegmentRecord) ImmutableTriple(org.apache.commons.lang3.tuple.ImmutableTriple) IterativeLinearSolverNDArray(org.broadinstitute.hellbender.tools.coveragemodel.linalg.IterativeLinearSolverNDArray) GATKProtectedMathUtils(org.broadinstitute.hellbender.utils.GATKProtectedMathUtils) Nd4jIOUtils(org.broadinstitute.hellbender.tools.coveragemodel.nd4jutils.Nd4jIOUtils) IOException(java.io.IOException) JavaPairRDD(org.apache.spark.api.java.JavaPairRDD) ImmutablePair(org.apache.commons.lang3.tuple.ImmutablePair) File(java.io.File) INDArray(org.nd4j.linalg.api.ndarray.INDArray) VisibleForTesting(com.google.common.annotations.VisibleForTesting) Transforms(org.nd4j.linalg.ops.transforms.Transforms) LogManager(org.apache.logging.log4j.LogManager) NoBracketingException(org.apache.commons.math3.exception.NoBracketingException)
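
For intuition about how the blocks are distributed: HashPartitioner sends a key to partition nonNegativeMod(key.hashCode(), numPartitions). With one partition per target block and stable block hash codes, every compute block lands in its own partition. A tiny sketch with Integer keys standing in for LinearlySpacedIndexBlock:

import org.apache.spark.HashPartitioner;

public class HashPartitionerPlacementSketch {
    public static void main(String[] args) {
        final int numTargetBlocks = 4;
        final HashPartitioner partitioner = new HashPartitioner(numTargetBlocks);
        // partition index = nonNegativeMod(key.hashCode(), numTargetBlocks);
        // Integer.hashCode(i) == i, so key i maps to partition i % 4 here
        for (int blockIndex = 0; blockIndex < numTargetBlocks; blockIndex++) {
            System.out.println("key " + blockIndex + " -> partition " + partitioner.getPartition(blockIndex));
        }
    }
}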

Example 9 with HashPartitioner

Use of org.apache.spark.HashPartitioner in project gatk by broadinstitute.

From class CoverageModelEMWorkspace, method instantiateWorkers.

/**
     * Instantiate compute block(s). If Spark is disabled, a single {@link CoverageModelEMComputeBlock} is
     * instantiated. Otherwise, a {@link JavaPairRDD} of compute nodes will be created.
     */
private void instantiateWorkers() {
    if (sparkContextIsAvailable) {
        /* initialize the RDD */
        logger.info("Initializing an RDD of compute blocks");
        computeRDD = ctx.parallelizePairs(
                targetBlockStream()
                        .map(tb -> new Tuple2<>(tb, new CoverageModelEMComputeBlock(tb, numSamples, numLatents, ardEnabled)))
                        .collect(Collectors.toList()),
                numTargetBlocks)
                .partitionBy(new HashPartitioner(numTargetBlocks))
                .cache();
    } else {
        logger.info("Initializing a local compute block");
        localComputeBlock = new CoverageModelEMComputeBlock(targetBlocks.get(0), numSamples, numLatents, ardEnabled);
    }
    prevCheckpointedComputeRDD = null;
    cacheCallCounter = 0;
}
Also used : ScalarProducer(org.broadinstitute.hellbender.utils.hmm.interfaces.ScalarProducer) Function2(org.apache.spark.api.java.function.Function2) HMMSegmentProcessor(org.broadinstitute.hellbender.utils.hmm.segmentation.HMMSegmentProcessor) GermlinePloidyAnnotatedTargetCollection(org.broadinstitute.hellbender.tools.exome.sexgenotyper.GermlinePloidyAnnotatedTargetCollection) HiddenStateSegmentRecordWriter(org.broadinstitute.hellbender.utils.hmm.segmentation.HiddenStateSegmentRecordWriter) BiFunction(java.util.function.BiFunction) GATKException(org.broadinstitute.hellbender.exceptions.GATKException) SexGenotypeData(org.broadinstitute.hellbender.tools.exome.sexgenotyper.SexGenotypeData) ParamUtils(org.broadinstitute.hellbender.utils.param.ParamUtils) CallStringProducer(org.broadinstitute.hellbender.utils.hmm.interfaces.CallStringProducer) StorageLevel(org.apache.spark.storage.StorageLevel) SynchronizedUnivariateSolver(org.broadinstitute.hellbender.tools.coveragemodel.math.SynchronizedUnivariateSolver) CopyRatioExpectationsCalculator(org.broadinstitute.hellbender.tools.coveragemodel.interfaces.CopyRatioExpectationsCalculator) UnivariateSolverSpecifications(org.broadinstitute.hellbender.tools.coveragemodel.math.UnivariateSolverSpecifications) IndexRange(org.broadinstitute.hellbender.utils.IndexRange) Broadcast(org.apache.spark.broadcast.Broadcast) ExitStatus(org.broadinstitute.hellbender.tools.coveragemodel.linalg.IterativeLinearSolverNDArray.ExitStatus) SexGenotypeDataCollection(org.broadinstitute.hellbender.tools.exome.sexgenotyper.SexGenotypeDataCollection) HashPartitioner(org.apache.spark.HashPartitioner) Predicate(java.util.function.Predicate) GeneralLinearOperator(org.broadinstitute.hellbender.tools.coveragemodel.linalg.GeneralLinearOperator) Nd4j(org.nd4j.linalg.factory.Nd4j) INDArrayIndex(org.nd4j.linalg.indexing.INDArrayIndex) FastMath(org.apache.commons.math3.util.FastMath) org.broadinstitute.hellbender.tools.exome(org.broadinstitute.hellbender.tools.exome) Tuple2(scala.Tuple2) Collectors(java.util.stream.Collectors) Sets(com.google.common.collect.Sets) AbstractUnivariateSolver(org.apache.commons.math3.analysis.solvers.AbstractUnivariateSolver) FourierLinearOperatorNDArray(org.broadinstitute.hellbender.tools.coveragemodel.linalg.FourierLinearOperatorNDArray) Logger(org.apache.logging.log4j.Logger) Stream(java.util.stream.Stream) UserException(org.broadinstitute.hellbender.exceptions.UserException) UnivariateFunction(org.apache.commons.math3.analysis.UnivariateFunction) TooManyEvaluationsException(org.apache.commons.math3.exception.TooManyEvaluationsException) Utils(org.broadinstitute.hellbender.utils.Utils) Function(org.apache.spark.api.java.function.Function) DataBuffer(org.nd4j.linalg.api.buffer.DataBuffer) IntStream(java.util.stream.IntStream) java.util(java.util) NDArrayIndex(org.nd4j.linalg.indexing.NDArrayIndex) JavaSparkContext(org.apache.spark.api.java.JavaSparkContext) AlleleMetadataProducer(org.broadinstitute.hellbender.utils.hmm.interfaces.AlleleMetadataProducer) EmissionCalculationStrategy(org.broadinstitute.hellbender.tools.coveragemodel.CoverageModelCopyRatioEmissionProbabilityCalculator.EmissionCalculationStrategy) RobustBrentSolver(org.broadinstitute.hellbender.tools.coveragemodel.math.RobustBrentSolver) IntervalUtils(org.broadinstitute.hellbender.utils.IntervalUtils) Nonnull(javax.annotation.Nonnull) Nullable(javax.annotation.Nullable) HiddenStateSegmentRecord(org.broadinstitute.hellbender.utils.hmm.segmentation.HiddenStateSegmentRecord) ImmutableTriple(org.apache.commons.lang3.tuple.ImmutableTriple) IterativeLinearSolverNDArray(org.broadinstitute.hellbender.tools.coveragemodel.linalg.IterativeLinearSolverNDArray) GATKProtectedMathUtils(org.broadinstitute.hellbender.utils.GATKProtectedMathUtils) Nd4jIOUtils(org.broadinstitute.hellbender.tools.coveragemodel.nd4jutils.Nd4jIOUtils) IOException(java.io.IOException) JavaPairRDD(org.apache.spark.api.java.JavaPairRDD) ImmutablePair(org.apache.commons.lang3.tuple.ImmutablePair) File(java.io.File) INDArray(org.nd4j.linalg.api.ndarray.INDArray) VisibleForTesting(com.google.common.annotations.VisibleForTesting) Transforms(org.nd4j.linalg.ops.transforms.Transforms) LogManager(org.apache.logging.log4j.LogManager) NoBracketingException(org.apache.commons.math3.exception.NoBracketingException)
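
A related detail in the RDD branch: partitionBy followed by cache retains the partitioner on the cached RDD, and mapValues preserves it as well (a plain map over the pairs would discard it, forcing a reshuffle before the next join). A quick check, again with placeholder Integer/String types:

import org.apache.spark.HashPartitioner;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Arrays;

public class PartitionerRetentionSketch {
    public static void main(String[] args) {
        final JavaSparkContext ctx = new JavaSparkContext("local[2]", "partitioner-retention");
        final int numTargetBlocks = 2;
        final JavaPairRDD<Integer, String> computeRDD = ctx.parallelizePairs(
                Arrays.asList(new Tuple2<>(0, "block-0"), new Tuple2<>(1, "block-1")), numTargetBlocks)
                .partitionBy(new HashPartitioner(numTargetBlocks))
                .cache();
        // the partitioner survives cache() ...
        System.out.println("after cache: " + computeRDD.partitioner());
        // ... and mapValues, so repeated join/mapValues rounds stay shuffle-free
        System.out.println("after mapValues: " + computeRDD.mapValues(String::toUpperCase).partitioner());
        ctx.stop();
    }
}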

Example 10 with HashPartitioner

Use of org.apache.spark.HashPartitioner in project gatk by broadinstitute.

From class FindBreakpointEvidenceSpark, method handleAssemblies.

/**
     * Transform all the reads for a supplied set of template names into FASTQ records grouped by interval,
     * and apply the supplied handler to each interval's list of FASTQ records (e.g., write it to a file).
     */
@VisibleForTesting
static List<AlignedAssemblyOrExcuse> handleAssemblies(final JavaSparkContext ctx, final HopscotchUniqueMultiMap<String, Integer, QNameAndInterval> qNamesMultiMap, final JavaRDD<GATKRead> reads, final int nIntervals, final boolean includeMappingLocation, final boolean dumpFASTQs, final LocalAssemblyHandler localAssemblyHandler) {
    final Broadcast<HopscotchUniqueMultiMap<String, Integer, QNameAndInterval>> broadcastQNamesMultiMap = ctx.broadcast(qNamesMultiMap);
    final List<AlignedAssemblyOrExcuse> intervalDispositions = reads
            .mapPartitionsToPair(readItr ->
                    new ReadsForQNamesFinder(broadcastQNamesMultiMap.value(), nIntervals,
                            includeMappingLocation, dumpFASTQs).call(readItr).iterator(),
                    false)
            .combineByKey(x -> x,
                    FindBreakpointEvidenceSpark::combineLists,
                    FindBreakpointEvidenceSpark::combineLists,
                    new HashPartitioner(nIntervals), false, null)
            .map(localAssemblyHandler::apply)
            .collect();
    broadcastQNamesMultiMap.destroy();
    BwaMemIndexSingleton.closeAllDistributedInstances(ctx);
    return intervalDispositions;
}
Also used : BwaMemIndexSingleton(org.broadinstitute.hellbender.utils.bwa.BwaMemIndexSingleton) IntStream(java.util.stream.IntStream) CommandLineProgramProperties(org.broadinstitute.barclay.argparser.CommandLineProgramProperties) java.util(java.util) Argument(org.broadinstitute.barclay.argparser.Argument) JavaSparkContext(org.apache.spark.api.java.JavaSparkContext) StandardArgumentDefinitions(org.broadinstitute.hellbender.cmdline.StandardArgumentDefinitions) ArgumentCollection(org.broadinstitute.barclay.argparser.ArgumentCollection) GATKRead(org.broadinstitute.hellbender.utils.read.GATKRead) GATKException(org.broadinstitute.hellbender.exceptions.GATKException) BwaMemAlignment(org.broadinstitute.hellbender.utils.bwa.BwaMemAlignment) Function(java.util.function.Function) BwaMemAligner(org.broadinstitute.hellbender.utils.bwa.BwaMemAligner) HopscotchUniqueMultiMap(org.broadinstitute.hellbender.tools.spark.utils.HopscotchUniqueMultiMap) FermiLiteAssembly(org.broadinstitute.hellbender.utils.fermi.FermiLiteAssembly) FermiLiteAssembler(org.broadinstitute.hellbender.utils.fermi.FermiLiteAssembler) BucketUtils(org.broadinstitute.hellbender.utils.gcs.BucketUtils) JavaRDD(org.apache.spark.api.java.JavaRDD) Broadcast(org.apache.spark.broadcast.Broadcast) HashPartitioner(org.apache.spark.HashPartitioner) GATKSparkTool(org.broadinstitute.hellbender.engine.spark.GATKSparkTool) Tuple2(scala.Tuple2) Collectors(java.util.stream.Collectors) StructuralVariationSparkProgramGroup(org.broadinstitute.hellbender.cmdline.programgroups.StructuralVariationSparkProgramGroup) FindBreakpointEvidenceSparkArgumentCollection(org.broadinstitute.hellbender.tools.spark.sv.StructuralVariationDiscoveryArgumentCollection.FindBreakpointEvidenceSparkArgumentCollection) Logger(org.apache.logging.log4j.Logger) UserException(org.broadinstitute.hellbender.exceptions.UserException) java.io(java.io) Utils(org.broadinstitute.hellbender.utils.Utils) VisibleForTesting(com.google.common.annotations.VisibleForTesting) htsjdk.samtools(htsjdk.samtools) MapPartitioner(org.broadinstitute.hellbender.tools.spark.utils.MapPartitioner) LogManager(org.apache.logging.log4j.LogManager)
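
The combineByKey call above uses the overload that takes an explicit Partitioner: the HashPartitioner routes all FASTQ lists for an interval to the same partition, and mapSideCombine=false skips the map-side merge, presumably because the per-partition finder already emits one list per interval. A stripped-down sketch of that same overload; the Integer interval keys, List<String> values, and combineLists helper are stand-ins for the real types:

import org.apache.spark.HashPartitioner;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CombineByKeySketch {
    // merge two lists in place, analogous in spirit to FindBreakpointEvidenceSpark::combineLists
    static List<String> combineLists(final List<String> a, final List<String> b) {
        a.addAll(b);
        return a;
    }

    public static void main(String[] args) {
        final JavaSparkContext ctx = new JavaSparkContext("local[2]", "combine-by-key");
        final int nIntervals = 2;
        final JavaPairRDD<Integer, List<String>> byInterval = ctx.parallelizePairs(Arrays.asList(
                new Tuple2<>(0, new ArrayList<>(Arrays.asList("read-1"))),
                new Tuple2<>(1, new ArrayList<>(Arrays.asList("read-2"))),
                new Tuple2<>(0, new ArrayList<>(Arrays.asList("read-3")))));
        // combine per interval with an explicit HashPartitioner; mapSideCombine=false and a
        // null (default) serializer mirror the arguments used in handleAssemblies
        final JavaPairRDD<Integer, List<String>> combined = byInterval.combineByKey(
                x -> x,                            // createCombiner: the first list is the combiner
                CombineByKeySketch::combineLists,  // mergeValue
                CombineByKeySketch::combineLists,  // mergeCombiners
                new HashPartitioner(nIntervals),
                false,                             // no map-side combine
                null);                             // default serializer
        combined.collect().forEach(t -> System.out.println(t._1 + " -> " + t._2));
        ctx.stop();
    }
}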

Aggregations

HashPartitioner (org.apache.spark.HashPartitioner): 11 usages
IOException (java.io.IOException): 8 usages
GATKException (org.broadinstitute.hellbender.exceptions.GATKException): 8 usages
Tuple2 (scala.Tuple2): 8 usages
JavaSparkContext (org.apache.spark.api.java.JavaSparkContext): 7 usages
UserException (org.broadinstitute.hellbender.exceptions.UserException): 7 usages
VisibleForTesting (com.google.common.annotations.VisibleForTesting): 6 usages
java.util (java.util): 6 usages
NoBracketingException (org.apache.commons.math3.exception.NoBracketingException): 6 usages
TooManyEvaluationsException (org.apache.commons.math3.exception.TooManyEvaluationsException): 6 usages
File (java.io.File): 5 usages
Collectors (java.util.stream.Collectors): 5 usages
IntStream (java.util.stream.IntStream): 5 usages
LogManager (org.apache.logging.log4j.LogManager): 5 usages
Logger (org.apache.logging.log4j.Logger): 5 usages
JavaPairRDD (org.apache.spark.api.java.JavaPairRDD): 5 usages
Broadcast (org.apache.spark.broadcast.Broadcast): 5 usages
StorageLevel (org.apache.spark.storage.StorageLevel): 5 usages
Sets (com.google.common.collect.Sets): 4 usages
BiFunction (java.util.function.BiFunction): 4 usages