
Example 1 with HopscotchMap

Use of org.broadinstitute.hellbender.tools.spark.utils.HopscotchMap in project gatk by broadinstitute.

The class FindBadGenomicKmersSpark, method processRefRDD.

/**
 * Do a map/reduce on an RDD of genomic sequences:
 * Kmerize, mapping to a pair <kmer,1>, reduce by summing values by key, filter out <kmer,N> where
 * N <= MAX_KMER_FREQ, and collect the high frequency kmers back in the driver.
 */
@VisibleForTesting
static List<SVKmer> processRefRDD(final int kSize, final int maxDUSTScore, final int maxKmerFreq, final JavaRDD<byte[]> refRDD) {
    final int nPartitions = refRDD.getNumPartitions();
    final int hashSize = 2 * REF_RECORDS_PER_PARTITION;
    final int arrayCap = REF_RECORDS_PER_PARTITION / 100;
    // Phase 1: count canonical kmers locally within each partition of the reference.
    return refRDD.mapPartitions(seqItr -> {
        final HopscotchMap<SVKmer, Integer, KmerAndCount> kmerCounts = new HopscotchMap<>(hashSize);
        while (seqItr.hasNext()) {
            final byte[] seq = seqItr.next();
            SVDUSTFilteredKmerizer.stream(seq, kSize, maxDUSTScore, new SVKmerLong()).map(kmer -> kmer.canonical(kSize)).forEach(kmer -> {
                final KmerAndCount entry = kmerCounts.find(kmer);
                if (entry == null)
                    kmerCounts.add(new KmerAndCount((SVKmerLong) kmer));
                else
                    entry.bumpCount();
            });
        }
        return kmerCounts.iterator();
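    })
    // Shuffle the partial counts so that every count for a given kmer lands in the same partition.
    .mapToPair(entry -> new Tuple2<>(entry.getKey(), entry.getValue()))
    .partitionBy(new HashPartitioner(nPartitions))
    // Phase 2: sum the partial counts per kmer and keep only kmers seen more than maxKmerFreq times.
    .mapPartitions(pairItr -> {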
        final HopscotchMap<SVKmer, Integer, KmerAndCount> kmerCounts = new HopscotchMap<>(hashSize);
        while (pairItr.hasNext()) {
            final Tuple2<SVKmer, Integer> pair = pairItr.next();
            final SVKmer kmer = pair._1();
            final int count = pair._2();
            KmerAndCount entry = kmerCounts.find(kmer);
            if (entry == null)
                kmerCounts.add(new KmerAndCount((SVKmerLong) kmer, count));
            else
                entry.bumpCount(count);
        }
        final List<SVKmer> highFreqKmers = new ArrayList<>(arrayCap);
        for (KmerAndCount kmerAndCount : kmerCounts) {
            if (kmerAndCount.grabCount() > maxKmerFreq)
                highFreqKmers.add(kmerAndCount.getKey());
        }
        return highFreqKmers.iterator();
    }).collect();
}
Also used: Output(com.esotericsoftware.kryo.io.Output) CommandLineProgramProperties(org.broadinstitute.barclay.argparser.CommandLineProgramProperties) java.util(java.util) ReferenceMultiSource(org.broadinstitute.hellbender.engine.datasources.ReferenceMultiSource) Argument(org.broadinstitute.barclay.argparser.Argument) JavaSparkContext(org.apache.spark.api.java.JavaSparkContext) StandardArgumentDefinitions(org.broadinstitute.hellbender.cmdline.StandardArgumentDefinitions) SAMFileHeader(htsjdk.samtools.SAMFileHeader) GATKException(org.broadinstitute.hellbender.exceptions.GATKException) Kryo(com.esotericsoftware.kryo.Kryo) BucketUtils(org.broadinstitute.hellbender.utils.gcs.BucketUtils) HopscotchMap(org.broadinstitute.hellbender.tools.spark.utils.HopscotchMap) Input(com.esotericsoftware.kryo.io.Input) HopscotchSet(org.broadinstitute.hellbender.tools.spark.utils.HopscotchSet) JavaRDD(org.apache.spark.api.java.JavaRDD) DefaultSerializer(com.esotericsoftware.kryo.DefaultSerializer) HashPartitioner(org.apache.spark.HashPartitioner) SAMSequenceDictionary(htsjdk.samtools.SAMSequenceDictionary) GATKSparkTool(org.broadinstitute.hellbender.engine.spark.GATKSparkTool) IOException(java.io.IOException) Tuple2(scala.Tuple2) InputStreamReader(java.io.InputStreamReader) StructuralVariationSparkProgramGroup(org.broadinstitute.hellbender.cmdline.programgroups.StructuralVariationSparkProgramGroup) PipelineOptions(com.google.cloud.dataflow.sdk.options.PipelineOptions) VisibleForTesting(com.google.common.annotations.VisibleForTesting) BufferedReader(java.io.BufferedReader)
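
The two-phase counting pattern in processRefRDD (count locally per partition, route partial counts by key hash, then merge and filter) can be illustrated without Spark. The following is a minimal sketch under stated assumptions, not the GATK implementation: the class name KmerCountSketch and both method names are hypothetical, java.util.HashMap stands in for HopscotchMap, kmers are plain String values rather than SVKmer, and the DUST filter and canonicalization steps are omitted.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

public class KmerCountSketch {

    // Phase 1 (hypothetical helper): count every k-mer found in one partition's sequences.
    public static Map<String, Integer> countPartition(List<String> sequences, int kSize) {
        Map<String, Integer> counts = new HashMap<>();
        for (String seq : sequences) {
            for (int i = 0; i + kSize <= seq.length(); i++) {
                counts.merge(seq.substring(i, i + kSize), 1, Integer::sum);
            }
        }
        return counts;
    }

    // Phase 2 (hypothetical helper): route partial counts to a bucket chosen by key hash,
    // sum them per k-mer, and keep the k-mers whose total exceeds maxKmerFreq.
    public static List<String> highFrequencyKmers(List<List<String>> partitions,
                                                  int kSize, int maxKmerFreq) {
        int nPartitions = partitions.size();
        List<Map<String, Integer>> buckets = new ArrayList<>();
        for (int i = 0; i < nPartitions; i++) {
            buckets.add(new HashMap<>());
        }

        // Simulated shuffle: the same k-mer always hashes to the same bucket,
        // so all of its partial counts meet in one place.
        for (List<String> partition : partitions) {
            for (Map.Entry<String, Integer> e : countPartition(partition, kSize).entrySet()) {
                int target = Math.floorMod(e.getKey().hashCode(), nPartitions);
                buckets.get(target).merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }

        // Filter each bucket independently; a sorted set makes the result order deterministic.
        TreeSet<String> result = new TreeSet<>();
        for (Map<String, Integer> bucket : buckets) {
            for (Map.Entry<String, Integer> e : bucket.entrySet()) {
                if (e.getValue() > maxKmerFreq) {
                    result.add(e.getKey());
                }
            }
        }
        return new ArrayList<>(result);
    }

    public static void main(String[] args) {
        List<List<String>> partitions = List.of(List.of("ACGTACGT"), List.of("ACGTTT"));
        // ACGT occurs twice in the first partition and once in the second, 3 in total.
        System.out.println(highFrequencyKmers(partitions, 4, 1)); // prints [ACGT]
    }
}
```

Pre-aggregating inside each partition before the shuffle means at most one record per distinct kmer per partition crosses the network, which is presumably why the original method counts locally before calling partitionBy.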
