
Example 1 with AlignmentRecord

Use of org.bdgenomics.formats.avro.AlignmentRecord in project gatk by broadinstitute.

From the class GATKReadAdaptersUnitTest, method basicReadBackedByADAMRecord:

private static GATKRead basicReadBackedByADAMRecord(final SAMRecord sam) {
    final AlignmentRecord record = new AlignmentRecord();
    record.setContigName(sam.getContig());
    record.setRecordGroupSample(sam.getReadGroup().getSample());
    record.setReadName(sam.getReadName());
    record.setSequence(new String(sam.getReadBases()));
    //ADAM records are 0-based
    record.setStart((long) sam.getAlignmentStart() - 1);
    //ADAM records are 0-based
    record.setEnd((long) sam.getAlignmentEnd() - 1);
    record.setReadMapped(!sam.getReadUnmappedFlag());
    record.setCigar(sam.getCigarString());
    return new BDGAlignmentRecordToGATKReadAdapter(record, getSAMHeader());
}
Also used : AlignmentRecord(org.bdgenomics.formats.avro.AlignmentRecord)
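
A minimal sketch, not part of the original test class, of how the adapter could be checked: the backing ADAM record holds 0-based coordinates, but the GATKRead interface exposes 1-based coordinates again, so the values should round-trip. It assumes org.testng.Assert is imported and that sam is a fully aligned SAMRecord with a read group attached.

private static void checkAdamBackedRead(final SAMRecord sam) {
    final GATKRead adapted = basicReadBackedByADAMRecord(sam);
    // Start and end come back in 1-based, closed coordinates, matching the original SAMRecord.
    Assert.assertEquals(adapted.getStart(), sam.getAlignmentStart());
    Assert.assertEquals(adapted.getEnd(), sam.getAlignmentEnd());
    Assert.assertEquals(adapted.getName(), sam.getReadName());
}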

Example 2 with AlignmentRecord

Use of org.bdgenomics.formats.avro.AlignmentRecord in project gatk by broadinstitute.

From the class ReadsSparkSink, method writeReadsADAM:

private static void writeReadsADAM(final JavaSparkContext ctx, final String outputFile, final JavaRDD<SAMRecord> reads, final SAMFileHeader header) throws IOException {
    final SequenceDictionary seqDict = SequenceDictionary.fromSAMSequenceDictionary(header.getSequenceDictionary());
    final RecordGroupDictionary readGroups = RecordGroupDictionary.fromSAMHeader(header);
    final JavaPairRDD<Void, AlignmentRecord> rddAlignmentRecords = reads.map(read -> {
        // Attach the header so the converter can resolve reference and read group information,
        read.setHeaderStrict(header);
        AlignmentRecord alignmentRecord = GATKReadToBDGAlignmentRecordConverter.convert(read, seqDict, readGroups);
        // ...then detach it again so the caller's SAMRecord is left unchanged.
        read.setHeaderStrict(null);
        return alignmentRecord;
    }).mapToPair(alignmentRecord -> new Tuple2<>(null, alignmentRecord));
    // instantiating a Job is necessary here in order to set the Hadoop Configuration...
    final Job job = Job.getInstance(ctx.hadoopConfiguration());
    // ...here, which sets a config property that the AvroParquetOutputFormat needs when writing data. Specifically,
    // we are writing the Avro schema to the Configuration as a JSON string. The AvroParquetOutputFormat class knows
    // how to translate objects in the Avro data model to the Parquet primitives that get written.
    AvroParquetOutputFormat.setSchema(job, AlignmentRecord.getClassSchema());
    deleteHadoopFile(outputFile, ctx.hadoopConfiguration());
    rddAlignmentRecords.saveAsNewAPIHadoopFile(outputFile, Void.class, AlignmentRecord.class, AvroParquetOutputFormat.class, job.getConfiguration());
}
Also used :
NullWritable(org.apache.hadoop.io.NullWritable)
CramIO(htsjdk.samtools.cram.build.CramIO)
JavaSparkContext(org.apache.spark.api.java.JavaSparkContext)
GATKRead(org.broadinstitute.hellbender.utils.read.GATKRead)
ReadsWriteFormat(org.broadinstitute.hellbender.utils.read.ReadsWriteFormat)
SAMFileHeader(htsjdk.samtools.SAMFileHeader)
RecordGroupDictionary(org.bdgenomics.adam.models.RecordGroupDictionary)
BamFileIoUtils(htsjdk.samtools.BamFileIoUtils)
org.apache.hadoop.mapreduce(org.apache.hadoop.mapreduce)
org.seqdoop.hadoop_bam(org.seqdoop.hadoop_bam)
BucketUtils(org.broadinstitute.hellbender.utils.gcs.BucketUtils)
Configuration(org.apache.hadoop.conf.Configuration)
AvroParquetOutputFormat(org.apache.parquet.avro.AvroParquetOutputFormat)
Path(org.apache.hadoop.fs.Path)
AlignmentRecord(org.bdgenomics.formats.avro.AlignmentRecord)
SequenceDictionary(org.bdgenomics.adam.models.SequenceDictionary)
JavaRDD(org.apache.spark.api.java.JavaRDD)
Broadcast(org.apache.spark.broadcast.Broadcast)
IOUtils(org.broadinstitute.hellbender.utils.io.IOUtils)
GATKReadToBDGAlignmentRecordConverter(org.broadinstitute.hellbender.utils.read.GATKReadToBDGAlignmentRecordConverter)
IOException(java.io.IOException)
Tuple2(scala.Tuple2)
FileAlreadyExistsException(org.apache.hadoop.mapred.FileAlreadyExistsException)
JavaPairRDD(org.apache.spark.api.java.JavaPairRDD)
SAMRecord(htsjdk.samtools.SAMRecord)
File(java.io.File)
UserException(org.broadinstitute.hellbender.exceptions.UserException)
HeaderlessSAMRecordCoordinateComparator(org.broadinstitute.hellbender.utils.read.HeaderlessSAMRecordCoordinateComparator)
SAMFileMerger(org.seqdoop.hadoop_bam.util.SAMFileMerger)
Comparator(java.util.Comparator)
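
A hedged sketch of how such a SAMRecord RDD might be produced from GATKReads before calling writeReadsADAM; the variables ctx, outputFile, gatkReads and header are placeholders and not from the original class. GATKRead.convertToSAMRecord is the same call Example 3 uses in the other direction.

// Hypothetical caller: convert GATKReads back to SAMRecords with the given header,
// then hand them to writeReadsADAM. In practice the header would typically be
// broadcast rather than captured directly in the closure.
final JavaRDD<SAMRecord> samRecords = gatkReads.map(read -> read.convertToSAMRecord(header));
writeReadsADAM(ctx, outputFile, samRecords, header);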

Example 3 with AlignmentRecord

Use of org.bdgenomics.formats.avro.AlignmentRecord in project gatk by broadinstitute.

From the class ReadsSparkSource, method getADAMReads:

/**
     * Loads ADAM reads stored as Parquet.
     * @param inputPath path to the Parquet data
     * @param intervals intervals used to filter the reads; reads whose SAMRecord form does not overlap them are dropped
     * @param header SAM header that is broadcast to the executors and attached to the adapted reads
     * @return RDD of (ADAM-backed) GATKReads from the file.
     */
public JavaRDD<GATKRead> getADAMReads(final String inputPath, final List<SimpleInterval> intervals, final SAMFileHeader header) throws IOException {
    Job job = Job.getInstance(ctx.hadoopConfiguration());
    AvroParquetInputFormat.setAvroReadSchema(job, AlignmentRecord.getClassSchema());
    Broadcast<SAMFileHeader> bHeader;
    if (header == null) {
        bHeader = ctx.broadcast(null);
    } else {
        bHeader = ctx.broadcast(header);
    }
    @SuppressWarnings("unchecked") JavaRDD<AlignmentRecord> recordsRdd = ctx.newAPIHadoopFile(inputPath, AvroParquetInputFormat.class, Void.class, AlignmentRecord.class, job.getConfiguration()).values();
    JavaRDD<GATKRead> readsRdd = recordsRdd.map(record -> new BDGAlignmentRecordToGATKReadAdapter(record, bHeader.getValue()));
    JavaRDD<GATKRead> filteredRdd = readsRdd.filter(record -> samRecordOverlaps(record.convertToSAMRecord(header), intervals));
    return putPairsInSamePartition(header, filteredRdd);
}
Also used : GATKRead(org.broadinstitute.hellbender.utils.read.GATKRead) AvroParquetInputFormat(org.apache.parquet.avro.AvroParquetInputFormat) AlignmentRecord(org.bdgenomics.formats.avro.AlignmentRecord) BDGAlignmentRecordToGATKReadAdapter(org.broadinstitute.hellbender.utils.read.BDGAlignmentRecordToGATKReadAdapter) Job(org.apache.hadoop.mapreduce.Job)
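
A hypothetical usage sketch, not from the original class: the path is a placeholder, the ReadsSparkSource constructor may take additional arguments depending on the GATK version, and passing null intervals is assumed to mean no interval filtering.

// Hypothetical caller: load Parquet-backed ADAM reads and count them.
final ReadsSparkSource readsSource = new ReadsSparkSource(ctx);
final JavaRDD<GATKRead> adamReads = readsSource.getADAMReads("reads.adam", null, header);
final long readCount = adamReads.count();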

Aggregations

AlignmentRecord (org.bdgenomics.formats.avro.AlignmentRecord) 3
GATKRead (org.broadinstitute.hellbender.utils.read.GATKRead) 2
BamFileIoUtils (htsjdk.samtools.BamFileIoUtils) 1
SAMFileHeader (htsjdk.samtools.SAMFileHeader) 1
SAMRecord (htsjdk.samtools.SAMRecord) 1
CramIO (htsjdk.samtools.cram.build.CramIO) 1
File (java.io.File) 1
IOException (java.io.IOException) 1
Comparator (java.util.Comparator) 1
Configuration (org.apache.hadoop.conf.Configuration) 1
Path (org.apache.hadoop.fs.Path) 1
NullWritable (org.apache.hadoop.io.NullWritable) 1
FileAlreadyExistsException (org.apache.hadoop.mapred.FileAlreadyExistsException) 1
org.apache.hadoop.mapreduce (org.apache.hadoop.mapreduce) 1
Job (org.apache.hadoop.mapreduce.Job) 1
AvroParquetInputFormat (org.apache.parquet.avro.AvroParquetInputFormat) 1
AvroParquetOutputFormat (org.apache.parquet.avro.AvroParquetOutputFormat) 1
JavaPairRDD (org.apache.spark.api.java.JavaPairRDD) 1
JavaRDD (org.apache.spark.api.java.JavaRDD) 1
JavaSparkContext (org.apache.spark.api.java.JavaSparkContext) 1