Example 6 with PipelineOptions

Use of com.google.cloud.dataflow.sdk.options.PipelineOptions in project gatk by broadinstitute.

Class FindBadGenomicKmersSpark, method processFasta.

@VisibleForTesting
static List<SVKmer> processFasta(final int kSize, final int maxDUSTScore, final String fastaFilename, final PipelineOptions options) {
    try (BufferedReader rdr = new BufferedReader(new InputStreamReader(BucketUtils.openFile(fastaFilename)))) {
        final List<SVKmer> kmers = new ArrayList<>((int) BucketUtils.fileSize(fastaFilename));
        String line;
        final StringBuilder sb = new StringBuilder();
        final SVKmer kmerSeed = new SVKmerLong();
        while ((line = rdr.readLine()) != null) {
            if (line.charAt(0) != '>')
                sb.append(line);
            else if (sb.length() > 0) {
                SVDUSTFilteredKmerizer.stream(sb, kSize, maxDUSTScore, kmerSeed).map(kmer -> kmer.canonical(kSize)).forEach(kmers::add);
                sb.setLength(0);
            }
        }
        if (sb.length() > 0) {
            SVDUSTFilteredKmerizer.stream(sb, kSize, maxDUSTScore, kmerSeed).map(kmer -> kmer.canonical(kSize)).forEach(kmers::add);
        }
        return kmers;
    } catch (IOException ioe) {
        throw new GATKException("Can't read high copy kmers fasta file " + fastaFilename, ioe);
    }
}
Also used : Output(com.esotericsoftware.kryo.io.Output) CommandLineProgramProperties(org.broadinstitute.barclay.argparser.CommandLineProgramProperties) java.util(java.util) ReferenceMultiSource(org.broadinstitute.hellbender.engine.datasources.ReferenceMultiSource) Argument(org.broadinstitute.barclay.argparser.Argument) JavaSparkContext(org.apache.spark.api.java.JavaSparkContext) StandardArgumentDefinitions(org.broadinstitute.hellbender.cmdline.StandardArgumentDefinitions) SAMFileHeader(htsjdk.samtools.SAMFileHeader) GATKException(org.broadinstitute.hellbender.exceptions.GATKException) Kryo(com.esotericsoftware.kryo.Kryo) BucketUtils(org.broadinstitute.hellbender.utils.gcs.BucketUtils) HopscotchMap(org.broadinstitute.hellbender.tools.spark.utils.HopscotchMap) Input(com.esotericsoftware.kryo.io.Input) HopscotchSet(org.broadinstitute.hellbender.tools.spark.utils.HopscotchSet) JavaRDD(org.apache.spark.api.java.JavaRDD) DefaultSerializer(com.esotericsoftware.kryo.DefaultSerializer) HashPartitioner(org.apache.spark.HashPartitioner) SAMSequenceDictionary(htsjdk.samtools.SAMSequenceDictionary) GATKSparkTool(org.broadinstitute.hellbender.engine.spark.GATKSparkTool) IOException(java.io.IOException) Tuple2(scala.Tuple2) InputStreamReader(java.io.InputStreamReader) StructuralVariationSparkProgramGroup(org.broadinstitute.hellbender.cmdline.programgroups.StructuralVariationSparkProgramGroup) PipelineOptions(com.google.cloud.dataflow.sdk.options.PipelineOptions) VisibleForTesting(com.google.common.annotations.VisibleForTesting) BufferedReader(java.io.BufferedReader)
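
The record-splitting loop in processFasta above can be tried out without any GATK dependencies. The following is a minimal sketch, not the project's code: the class name FastaSketch and the method readSequences are hypothetical, it collects whole sequences as strings instead of producing kmers, and the blank-line guard is an addition that the original method does not have (calling charAt(0) on an empty line throws StringIndexOutOfBoundsException).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone class; not part of GATK.
public class FastaSketch {

    /** Collects the sequence of each FASTA record, joining wrapped lines. */
    static List<String> readSequences(final BufferedReader reader) throws IOException {
        final List<String> sequences = new ArrayList<>();
        final StringBuilder sb = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.isEmpty()) {
                continue; // assumption: blank lines are ignored rather than treated as an error
            }
            if (line.charAt(0) != '>') {
                sb.append(line); // sequence line: accumulate
            } else if (sb.length() > 0) {
                sequences.add(sb.toString()); // header line: flush the previous record
                sb.setLength(0);
            }
        }
        if (sb.length() > 0) {
            sequences.add(sb.toString()); // flush the final record
        }
        return sequences;
    }

    public static void main(final String[] args) throws IOException {
        final String fasta = ">chr1\nACGT\nTTAA\n\n>chr2\nGGCC\n";
        try (BufferedReader reader = new BufferedReader(new StringReader(fasta))) {
            System.out.println(readSequences(reader)); // prints [ACGTTTAA, GGCC]
        }
    }
}
```

As in the original, the buffer is flushed both when a new header line is seen and once more after the loop, so the last record is not lost.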

Example 7 with PipelineOptions

Use of com.google.cloud.dataflow.sdk.options.PipelineOptions in project gatk by broadinstitute.

Class StructuralVariationDiscoveryPipelineSpark, method runTool.

@Override
protected void runTool(final JavaSparkContext ctx) {
    final SAMFileHeader header = getHeaderForReads();
    final PipelineOptions pipelineOptions = getAuthenticatedGCSOptions();
    // gather evidence, run assembly, and align
    final List<AlignedAssemblyOrExcuse> alignedAssemblyOrExcuseList = FindBreakpointEvidenceSpark.gatherEvidenceAndWriteContigSamFile(ctx, evidenceAndAssemblyArgs, header, getUnfilteredReads(), outputSAM, localLogger);
    if (alignedAssemblyOrExcuseList.isEmpty())
        return;
    // parse the contig alignments and extract necessary information
    @SuppressWarnings("unchecked")
    final JavaRDD<AlignedContig> parsedAlignments = new InMemoryAlignmentParser(ctx, alignedAssemblyOrExcuseList, header, localLogger).getAlignedContigs();
    if (parsedAlignments.isEmpty())
        return;
    // discover variants and write to vcf
    DiscoverVariantsFromContigAlignmentsSAMSpark.discoverVariantsAndWriteVCF(parsedAlignments, discoverStageArgs.fastaReference, ctx.broadcast(getReference()), pipelineOptions, vcfOutputFileName, localLogger);
}
Also used : PipelineOptions(com.google.cloud.dataflow.sdk.options.PipelineOptions) SAMFileHeader(htsjdk.samtools.SAMFileHeader)

Example 8 with PipelineOptions

Use of com.google.cloud.dataflow.sdk.options.PipelineOptions in project gatk by broadinstitute.

Class ReferenceUtilsUnitTest, method testLoadFastaDictionaryFromGCSBucket.

@Test(groups = { "bucket" })
public void testLoadFastaDictionaryFromGCSBucket() throws IOException {
    final String bucketDictionary = getGCPTestInputPath() + "org/broadinstitute/hellbender/utils/ReferenceUtilsTest.dict";
    final PipelineOptions popts = getAuthenticatedPipelineOptions();
    try (final InputStream referenceDictionaryStream = BucketUtils.openFile(bucketDictionary)) {
        final SAMSequenceDictionary dictionary = ReferenceUtils.loadFastaDictionary(referenceDictionaryStream);
        Assert.assertNotNull(dictionary, "Sequence dictionary null after loading");
        Assert.assertEquals(dictionary.size(), 4, "Wrong sequence dictionary size after loading");
    }
}
Also used : PipelineOptions(com.google.cloud.dataflow.sdk.options.PipelineOptions) SAMSequenceDictionary(htsjdk.samtools.SAMSequenceDictionary) BaseTest(org.broadinstitute.hellbender.utils.test.BaseTest) Test(org.testng.annotations.Test)

Example 9 with PipelineOptions

Use of com.google.cloud.dataflow.sdk.options.PipelineOptions in project gatk by broadinstitute.

Class BucketUtilsTest, method testCopyAndDeleteHDFS.

@Test
public void testCopyAndDeleteHDFS() throws Exception {
    final String src = publicTestDir + "empty.vcf";
    File dest = createTempFile("copy-empty", ".vcf");
    MiniClusterUtils.runOnIsolatedMiniCluster(cluster -> {
        final String intermediate = BucketUtils.randomRemotePath(MiniClusterUtils.getWorkingDir(cluster).toString(), "test-copy-empty", ".vcf");
        Assert.assertTrue(BucketUtils.isHadoopUrl(intermediate), "!BucketUtils.isHadoopUrl(intermediate)");
        PipelineOptions popts = null;
        BucketUtils.copyFile(src, intermediate);
        BucketUtils.copyFile(intermediate, dest.getPath());
        IOUtil.assertFilesEqual(new File(src), dest);
        Assert.assertTrue(BucketUtils.fileExists(intermediate));
        BucketUtils.deleteFile(intermediate);
        Assert.assertFalse(BucketUtils.fileExists(intermediate));
    });
}
Also used : PipelineOptions(com.google.cloud.dataflow.sdk.options.PipelineOptions) File(java.io.File) BaseTest(org.broadinstitute.hellbender.utils.test.BaseTest) Test(org.testng.annotations.Test)
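
The copy, compare, delete pattern exercised by testCopyAndDeleteHDFS above can be sketched against the local filesystem. This is only an illustration under stated assumptions: it uses java.nio.file.Files in place of BucketUtils and a Hadoop mini cluster, and the class name CopyRoundTripSketch and the method roundTrip are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;

// Hypothetical stand-alone class; local temp files stand in for HDFS paths.
public class CopyRoundTripSketch {

    /** Copies src -> intermediate -> dest, then checks content equality and deletion. */
    static boolean roundTrip(final byte[] content) throws IOException {
        final Path src = Files.createTempFile("src", ".vcf");
        final Path intermediate = Files.createTempFile("intermediate", ".vcf");
        final Path dest = Files.createTempFile("dest", ".vcf");
        try {
            Files.write(src, content);
            // The temp files already exist, so the copies must be allowed to overwrite them.
            Files.copy(src, intermediate, StandardCopyOption.REPLACE_EXISTING);
            Files.copy(intermediate, dest, StandardCopyOption.REPLACE_EXISTING);
            final boolean sameContent =
                    Arrays.equals(Files.readAllBytes(src), Files.readAllBytes(dest));
            Files.delete(intermediate);
            return sameContent && Files.notExists(intermediate);
        } finally {
            // Clean up whatever is left, even if a step above failed.
            Files.deleteIfExists(src);
            Files.deleteIfExists(intermediate);
            Files.deleteIfExists(dest);
        }
    }

    public static void main(final String[] args) throws IOException {
        final byte[] content = "##fileformat=VCFv4.2\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(roundTrip(content));
    }
}
```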

Example 10 with PipelineOptions

Use of com.google.cloud.dataflow.sdk.options.PipelineOptions in project gatk by broadinstitute.

Class PathSeqFilterSpark, method doKmerFiltering.

@SuppressWarnings("unchecked")
private JavaRDD<GATKRead> doKmerFiltering(final JavaSparkContext ctx, final JavaRDD<GATKRead> reads) {
    final PipelineOptions options = getAuthenticatedGCSOptions();
    Input input = new Input(BucketUtils.openFile(KMER_LIB_PATH));
    Kryo kryo = new Kryo();
    kryo.setReferences(false);
    Set<SVKmer> kmerLibSet = (HopscotchSet<SVKmer>) kryo.readClassAndObject(input);
    return reads.filter(new ContainsKmerReadFilterSpark(ctx.broadcast(kmerLibSet), KMER_SIZE));
}
Also used : ContainsKmerReadFilterSpark(org.broadinstitute.hellbender.tools.spark.sv.ContainsKmerReadFilterSpark) Input(com.esotericsoftware.kryo.io.Input) HopscotchSet(org.broadinstitute.hellbender.tools.spark.utils.HopscotchSet) SVKmer(org.broadinstitute.hellbender.tools.spark.sv.SVKmer) PipelineOptions(com.google.cloud.dataflow.sdk.options.PipelineOptions) Kryo(com.esotericsoftware.kryo.Kryo)

Aggregations

PipelineOptions (com.google.cloud.dataflow.sdk.options.PipelineOptions) 12
SAMSequenceDictionary (htsjdk.samtools.SAMSequenceDictionary) 5
ReferenceMultiSource (org.broadinstitute.hellbender.engine.datasources.ReferenceMultiSource) 5
BaseTest (org.broadinstitute.hellbender.utils.test.BaseTest) 5
Test (org.testng.annotations.Test) 5
SAMFileHeader (htsjdk.samtools.SAMFileHeader) 4
Kryo (com.esotericsoftware.kryo.Kryo) 3
IOException (java.io.IOException) 3
HopscotchSet (org.broadinstitute.hellbender.tools.spark.utils.HopscotchSet) 3
Input (com.esotericsoftware.kryo.io.Input) 2
Output (com.esotericsoftware.kryo.io.Output) 2
VisibleForTesting (com.google.common.annotations.VisibleForTesting) 2
File (java.io.File) 2
java.util (java.util) 2
UserException (org.broadinstitute.hellbender.exceptions.UserException) 2
DefaultSerializer (com.esotericsoftware.kryo.DefaultSerializer) 1
JsonFactory (com.google.api.client.json.JsonFactory) 1
Genomics (com.google.api.services.genomics.Genomics) 1
com.google.api.services.genomics.model (com.google.api.services.genomics.model) 1
PipelineOptionsFactory (com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory) 1