
Example 1 with PipelineOptions

Use of com.google.cloud.dataflow.sdk.options.PipelineOptions in project gatk by broadinstitute.

Class FindBadGenomicKmersSpark, method runTool.

/** Get the list of high copy number kmers in the reference, and write them to a file. */
@Override
protected void runTool(final JavaSparkContext ctx) {
    final SAMFileHeader hdr = getHeaderForReads();
    SAMSequenceDictionary dict = null;
    if (hdr != null)
        dict = hdr.getSequenceDictionary();
    final PipelineOptions options = getAuthenticatedGCSOptions();
    final ReferenceMultiSource referenceMultiSource = getReference();
    Collection<SVKmer> killList = findBadGenomicKmers(ctx, kSize, maxDUSTScore, referenceMultiSource, options, dict);
    if (highCopyFastaFilename != null) {
        killList = uniquify(killList, processFasta(kSize, maxDUSTScore, highCopyFastaFilename, options));
    }
    SVUtils.writeKmersFile(kSize, outputFile, killList);
}
Also used : ReferenceMultiSource(org.broadinstitute.hellbender.engine.datasources.ReferenceMultiSource) PipelineOptions(com.google.cloud.dataflow.sdk.options.PipelineOptions) SAMFileHeader(htsjdk.samtools.SAMFileHeader) SAMSequenceDictionary(htsjdk.samtools.SAMSequenceDictionary)

Example 2 with PipelineOptions

Use of com.google.cloud.dataflow.sdk.options.PipelineOptions in project gatk by broadinstitute.

Class FindBadGenomicKmersSpark, method processFasta.

@VisibleForTesting
static List<SVKmer> processFasta(final int kSize, final int maxDUSTScore, final String fastaFilename, final PipelineOptions options) {
    try (BufferedReader rdr = new BufferedReader(new InputStreamReader(BucketUtils.openFile(fastaFilename)))) {
        final List<SVKmer> kmers = new ArrayList<>((int) BucketUtils.fileSize(fastaFilename));
        String line;
        final StringBuilder sb = new StringBuilder();
        final SVKmer kmerSeed = new SVKmerLong();
        while ((line = rdr.readLine()) != null) {
            // skip blank lines: charAt(0) would throw on an empty string
            if (line.isEmpty())
                continue;
            if (line.charAt(0) != '>')
                sb.append(line);
            else if (sb.length() > 0) {
                SVDUSTFilteredKmerizer.stream(sb, kSize, maxDUSTScore, kmerSeed).map(kmer -> kmer.canonical(kSize)).forEach(kmers::add);
                sb.setLength(0);
            }
        }
        if (sb.length() > 0) {
            SVDUSTFilteredKmerizer.stream(sb, kSize, maxDUSTScore, kmerSeed).map(kmer -> kmer.canonical(kSize)).forEach(kmers::add);
        }
        return kmers;
    } catch (IOException ioe) {
        throw new GATKException("Can't read high copy kmers fasta file " + fastaFilename, ioe);
    }
}
Also used : Output(com.esotericsoftware.kryo.io.Output) CommandLineProgramProperties(org.broadinstitute.barclay.argparser.CommandLineProgramProperties) java.util(java.util) ReferenceMultiSource(org.broadinstitute.hellbender.engine.datasources.ReferenceMultiSource) Argument(org.broadinstitute.barclay.argparser.Argument) JavaSparkContext(org.apache.spark.api.java.JavaSparkContext) StandardArgumentDefinitions(org.broadinstitute.hellbender.cmdline.StandardArgumentDefinitions) SAMFileHeader(htsjdk.samtools.SAMFileHeader) GATKException(org.broadinstitute.hellbender.exceptions.GATKException) Kryo(com.esotericsoftware.kryo.Kryo) BucketUtils(org.broadinstitute.hellbender.utils.gcs.BucketUtils) HopscotchMap(org.broadinstitute.hellbender.tools.spark.utils.HopscotchMap) Input(com.esotericsoftware.kryo.io.Input) HopscotchSet(org.broadinstitute.hellbender.tools.spark.utils.HopscotchSet) JavaRDD(org.apache.spark.api.java.JavaRDD) DefaultSerializer(com.esotericsoftware.kryo.DefaultSerializer) HashPartitioner(org.apache.spark.HashPartitioner) SAMSequenceDictionary(htsjdk.samtools.SAMSequenceDictionary) GATKSparkTool(org.broadinstitute.hellbender.engine.spark.GATKSparkTool) IOException(java.io.IOException) Tuple2(scala.Tuple2) InputStreamReader(java.io.InputStreamReader) StructuralVariationSparkProgramGroup(org.broadinstitute.hellbender.cmdline.programgroups.StructuralVariationSparkProgramGroup) PipelineOptions(com.google.cloud.dataflow.sdk.options.PipelineOptions) VisibleForTesting(com.google.common.annotations.VisibleForTesting) BufferedReader(java.io.BufferedReader)
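
The header/sequence loop in processFasta can be tried without GATK on the classpath. What follows is a minimal sketch under stated assumptions, not the GATK implementation: k-mers are plain strings rather than SVKmer objects, there is no DUST filtering, and "canonical" is assumed to mean the lexicographically smaller of a k-mer and its reverse complement (SVKmer.canonical may define it differently). The class FastaKmerSketch and all of its methods are hypothetical names introduced for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-alone sketch; not part of GATK. */
public class FastaKmerSketch {

    /** Reverse complement of an upper-case ACGT string. */
    public static String reverseComplement(final String seq) {
        final StringBuilder sb = new StringBuilder(seq.length());
        for (int i = seq.length() - 1; i >= 0; i--) {
            switch (seq.charAt(i)) {
                case 'A': sb.append('T'); break;
                case 'C': sb.append('G'); break;
                case 'G': sb.append('C'); break;
                case 'T': sb.append('A'); break;
                default: throw new IllegalArgumentException("not ACGT: " + seq.charAt(i));
            }
        }
        return sb.toString();
    }

    /** Assumed canonical form: the smaller of a k-mer and its reverse complement. */
    public static String canonical(final String kmer) {
        final String rc = reverseComplement(kmer);
        return kmer.compareTo(rc) <= 0 ? kmer : rc;
    }

    /** Append the canonical k-mers of one contig to the output list. */
    public static void kmerize(final CharSequence contig, final int kSize, final List<String> out) {
        for (int i = 0; i + kSize <= contig.length(); i++) {
            out.add(canonical(contig.subSequence(i, i + kSize).toString()));
        }
    }

    /** Read FASTA text, flushing the sequence buffer at each header line and at end of input. */
    public static List<String> processFasta(final String fastaText, final int kSize) {
        final List<String> kmers = new ArrayList<>();
        final StringBuilder sb = new StringBuilder();
        try (BufferedReader rdr = new BufferedReader(new StringReader(fastaText))) {
            String line;
            while ((line = rdr.readLine()) != null) {
                if (line.isEmpty()) {
                    continue; // blank lines carry no sequence
                }
                if (line.charAt(0) != '>') {
                    sb.append(line); // sequence line: extend the current contig
                } else if (sb.length() > 0) {
                    kmerize(sb, kSize, kmers); // header line: finish the previous contig
                    sb.setLength(0);
                }
            }
            if (sb.length() > 0) {
                kmerize(sb, kSize, kmers); // last contig has no trailing header
            }
        } catch (IOException ioe) {
            throw new UncheckedIOException(ioe);
        }
        return kmers;
    }

    public static void main(final String[] args) {
        // two contigs: "ACGTAC" (split over two lines) and "TTT"
        System.out.println(processFasta(">c1\nACGT\nAC\n>c2\nTTT\n", 3));
        // prints [ACG, ACG, GTA, GTA, AAA]
    }
}
```

Because the buffer is flushed only when a header is seen, the final flush after the loop is what keeps the last contig from being dropped; the original method has the same shape.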

Example 3 with PipelineOptions

Use of com.google.cloud.dataflow.sdk.options.PipelineOptions in project gatk by broadinstitute.

Class StructuralVariationDiscoveryPipelineSpark, method runTool.

@Override
protected void runTool(final JavaSparkContext ctx) {
    final SAMFileHeader header = getHeaderForReads();
    final PipelineOptions pipelineOptions = getAuthenticatedGCSOptions();
    // gather evidence, run assembly, and align
    final List<AlignedAssemblyOrExcuse> alignedAssemblyOrExcuseList = FindBreakpointEvidenceSpark.gatherEvidenceAndWriteContigSamFile(ctx, evidenceAndAssemblyArgs, header, getUnfilteredReads(), outputSAM, localLogger);
    if (alignedAssemblyOrExcuseList.isEmpty())
        return;
    // parse the contig alignments and extract necessary information
    @SuppressWarnings("unchecked") final JavaRDD<AlignedContig> parsedAlignments = new InMemoryAlignmentParser(ctx, alignedAssemblyOrExcuseList, header, localLogger).getAlignedContigs();
    if (parsedAlignments.isEmpty())
        return;
    // discover variants and write to vcf
    DiscoverVariantsFromContigAlignmentsSAMSpark.discoverVariantsAndWriteVCF(parsedAlignments, discoverStageArgs.fastaReference, ctx.broadcast(getReference()), pipelineOptions, vcfOutputFileName, localLogger);
}
Also used : PipelineOptions(com.google.cloud.dataflow.sdk.options.PipelineOptions) SAMFileHeader(htsjdk.samtools.SAMFileHeader)

Example 4 with PipelineOptions

Use of com.google.cloud.dataflow.sdk.options.PipelineOptions in project gatk by broadinstitute.

Class ReferenceUtilsUnitTest, method testLoadFastaDictionaryFromGCSBucket.

@Test(groups = { "bucket" })
public void testLoadFastaDictionaryFromGCSBucket() throws IOException {
    final String bucketDictionary = getGCPTestInputPath() + "org/broadinstitute/hellbender/utils/ReferenceUtilsTest.dict";
    final PipelineOptions popts = getAuthenticatedPipelineOptions();
    try (final InputStream referenceDictionaryStream = BucketUtils.openFile(bucketDictionary)) {
        final SAMSequenceDictionary dictionary = ReferenceUtils.loadFastaDictionary(referenceDictionaryStream);
        Assert.assertNotNull(dictionary, "Sequence dictionary null after loading");
        Assert.assertEquals(dictionary.size(), 4, "Wrong sequence dictionary size after loading");
    }
}
Also used : PipelineOptions(com.google.cloud.dataflow.sdk.options.PipelineOptions) SAMSequenceDictionary(htsjdk.samtools.SAMSequenceDictionary) BaseTest(org.broadinstitute.hellbender.utils.test.BaseTest) Test(org.testng.annotations.Test)

Example 5 with PipelineOptions

Use of com.google.cloud.dataflow.sdk.options.PipelineOptions in project gatk by broadinstitute.

Class BaseRecalibratorSparkSharded, method hackilyCopyFromGCSIfNecessary.

// please add support for reading variant files from GCS.
private ArrayList<String> hackilyCopyFromGCSIfNecessary(List<String> localVariants) {
    int i = 0;
    Stopwatch hacking = Stopwatch.createStarted();
    boolean copied = false;
    ArrayList<String> ret = new ArrayList<>();
    for (String v : localVariants) {
        if (BucketUtils.isCloudStorageUrl(v)) {
            if (!copied) {
                logger.info("(HACK): copying the GCS variant file to local just so we can read it back.");
                copied = true;
            }
            // this only works with the API_KEY, but then again it's a hack so there's no point in polishing it. Please don't make me.
            PipelineOptions popts = auth.asPipelineOptionsDeprecated();
            // advance the counter so each copied file gets its own numbered prefix
            String d = IOUtils.createTempFile("knownVariants-" + (i++), ".vcf").getAbsolutePath();
            try {
                BucketUtils.copyFile(v, d);
            } catch (IOException x) {
                throw new UserException.CouldNotReadInputFile(v, x);
            }
            ret.add(d);
        } else {
            ret.add(v);
        }
    }
    hacking.stop();
    if (copied) {
        logger.info("Copying the vcf took " + hacking.elapsed(TimeUnit.MILLISECONDS) + " ms.");
    }
    return ret;
}
Also used : PipelineOptions(com.google.cloud.dataflow.sdk.options.PipelineOptions) Stopwatch(com.google.common.base.Stopwatch) ArrayList(java.util.ArrayList) IOException(java.io.IOException) UserException(org.broadinstitute.hellbender.exceptions.UserException)
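
The copy-to-local pattern in hackilyCopyFromGCSIfNecessary can be isolated from GATK for experimentation. The sketch below is an illustration under assumptions, not the GATK code: the class LocalizeSketch, the Copier interface, and the "gs://" prefix check are hypothetical stand-ins for BucketUtils.copyFile and BucketUtils.isCloudStorageUrl, and temp files are created with java.nio.file.Files rather than IOUtils.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-alone sketch; not part of GATK. */
public class LocalizeSketch {

    /** Hypothetical copier abstraction standing in for a cloud-storage copy call. */
    public interface Copier {
        void copy(String source, Path dest) throws IOException;
    }

    /** Assumed remote test: a simple scheme-prefix check. */
    public static boolean isRemote(final String path) {
        return path.startsWith("gs://");
    }

    /** Return the inputs with every remote path replaced by a freshly copied local temp file. */
    public static List<String> localize(final List<String> inputs, final Copier copier) {
        final List<String> ret = new ArrayList<>(inputs.size());
        int i = 0;
        for (final String v : inputs) {
            if (!isRemote(v)) {
                ret.add(v); // local paths pass through untouched
                continue;
            }
            try {
                final Path d = Files.createTempFile("knownVariants-" + (i++) + "-", ".vcf");
                d.toFile().deleteOnExit();
                copier.copy(v, d);
                ret.add(d.toAbsolutePath().toString());
            } catch (IOException x) {
                throw new UncheckedIOException("Could not read input file " + v, x);
            }
        }
        return ret;
    }
}
```

Passing the copier in as a parameter keeps the sketch testable without network access; a caller would supply whatever storage client it actually uses.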

Aggregations

PipelineOptions (com.google.cloud.dataflow.sdk.options.PipelineOptions): 12
SAMSequenceDictionary (htsjdk.samtools.SAMSequenceDictionary): 5
ReferenceMultiSource (org.broadinstitute.hellbender.engine.datasources.ReferenceMultiSource): 5
BaseTest (org.broadinstitute.hellbender.utils.test.BaseTest): 5
Test (org.testng.annotations.Test): 5
SAMFileHeader (htsjdk.samtools.SAMFileHeader): 4
Kryo (com.esotericsoftware.kryo.Kryo): 3
IOException (java.io.IOException): 3
HopscotchSet (org.broadinstitute.hellbender.tools.spark.utils.HopscotchSet): 3
Input (com.esotericsoftware.kryo.io.Input): 2
Output (com.esotericsoftware.kryo.io.Output): 2
VisibleForTesting (com.google.common.annotations.VisibleForTesting): 2
File (java.io.File): 2
java.util (java.util): 2
UserException (org.broadinstitute.hellbender.exceptions.UserException): 2
DefaultSerializer (com.esotericsoftware.kryo.DefaultSerializer): 1
JsonFactory (com.google.api.client.json.JsonFactory): 1
Genomics (com.google.api.services.genomics.Genomics): 1
com.google.api.services.genomics.model (com.google.api.services.genomics.model): 1
PipelineOptionsFactory (com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory): 1