
Example 1 with AvroParquetOutputFormat

Use of org.apache.parquet.avro.AvroParquetOutputFormat in project gatk by broadinstitute.

From the class ReadsSparkSink, method writeReadsADAM.

private static void writeReadsADAM(final JavaSparkContext ctx, final String outputFile, final JavaRDD<SAMRecord> reads, final SAMFileHeader header) throws IOException {
    final SequenceDictionary seqDict = SequenceDictionary.fromSAMSequenceDictionary(header.getSequenceDictionary());
    final RecordGroupDictionary readGroups = RecordGroupDictionary.fromSAMHeader(header);
    final JavaPairRDD<Void, AlignmentRecord> rddAlignmentRecords = reads.map(read -> {
        read.setHeaderStrict(header);
        AlignmentRecord alignmentRecord = GATKReadToBDGAlignmentRecordConverter.convert(read, seqDict, readGroups);
        read.setHeaderStrict(null);
        return alignmentRecord;
    }).mapToPair(alignmentRecord -> new Tuple2<>(null, alignmentRecord));
    // instantiating a Job is necessary here in order to set the Hadoop Configuration...
    final Job job = Job.getInstance(ctx.hadoopConfiguration());
    // ...here, which sets a config property that the AvroParquetOutputFormat needs when writing data. Specifically,
    // we are writing the Avro schema to the Configuration as a JSON string. The AvroParquetOutputFormat class knows
    // how to translate objects in the Avro data model to the Parquet primitives that get written.
    AvroParquetOutputFormat.setSchema(job, AlignmentRecord.getClassSchema());
    deleteHadoopFile(outputFile, ctx.hadoopConfiguration());
    rddAlignmentRecords.saveAsNewAPIHadoopFile(outputFile, Void.class, AlignmentRecord.class, AvroParquetOutputFormat.class, job.getConfiguration());
}
Also used: NullWritable(org.apache.hadoop.io.NullWritable) CramIO(htsjdk.samtools.cram.build.CramIO) JavaSparkContext(org.apache.spark.api.java.JavaSparkContext) GATKRead(org.broadinstitute.hellbender.utils.read.GATKRead) ReadsWriteFormat(org.broadinstitute.hellbender.utils.read.ReadsWriteFormat) SAMFileHeader(htsjdk.samtools.SAMFileHeader) RecordGroupDictionary(org.bdgenomics.adam.models.RecordGroupDictionary) BamFileIoUtils(htsjdk.samtools.BamFileIoUtils) org.apache.hadoop.mapreduce(org.apache.hadoop.mapreduce) org.seqdoop.hadoop_bam(org.seqdoop.hadoop_bam) BucketUtils(org.broadinstitute.hellbender.utils.gcs.BucketUtils) Configuration(org.apache.hadoop.conf.Configuration) AvroParquetOutputFormat(org.apache.parquet.avro.AvroParquetOutputFormat) Path(org.apache.hadoop.fs.Path) AlignmentRecord(org.bdgenomics.formats.avro.AlignmentRecord) SequenceDictionary(org.bdgenomics.adam.models.SequenceDictionary) JavaRDD(org.apache.spark.api.java.JavaRDD) Broadcast(org.apache.spark.broadcast.Broadcast) IOUtils(org.broadinstitute.hellbender.utils.io.IOUtils) GATKReadToBDGAlignmentRecordConverter(org.broadinstitute.hellbender.utils.read.GATKReadToBDGAlignmentRecordConverter) IOException(java.io.IOException) Tuple2(scala.Tuple2) FileAlreadyExistsException(org.apache.hadoop.mapred.FileAlreadyExistsException) JavaPairRDD(org.apache.spark.api.java.JavaPairRDD) SAMRecord(htsjdk.samtools.SAMRecord) File(java.io.File) UserException(org.broadinstitute.hellbender.exceptions.UserException) HeaderlessSAMRecordCoordinateComparator(org.broadinstitute.hellbender.utils.read.HeaderlessSAMRecordCoordinateComparator) SAMFileMerger(org.seqdoop.hadoop_bam.util.SAMFileMerger) Comparator(java.util.Comparator)
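The comments in writeReadsADAM say the Avro schema is written to the Configuration as a JSON string so that the output format can find it at write time. The following is a minimal, stdlib-only sketch of that pattern, not the Hadoop or Parquet API: a plain Map stands in for the Hadoop Configuration, and the names SchemaInConfigSketch, SCHEMA_KEY, setSchema, and getSchema are hypothetical and chosen only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: a Map<String, String> stands in for a Hadoop Configuration.
final class SchemaInConfigSketch {
    // Hypothetical key name; the real property name used by Parquet is not assumed here.
    static final String SCHEMA_KEY = "example.avro.schema";

    // Driver side: store the schema as a JSON string under a well-known key.
    static void setSchema(Map<String, String> conf, String schemaJson) {
        conf.put(SCHEMA_KEY, schemaJson);
    }

    // Writer side: recover the schema from the configuration, failing loudly if it was never set.
    static String getSchema(Map<String, String> conf) {
        String schemaJson = conf.get(SCHEMA_KEY);
        if (schemaJson == null) {
            throw new IllegalStateException("schema was not set under " + SCHEMA_KEY);
        }
        return schemaJson;
    }

    public static void main(String[] args) {
        Map<String, String> driverConf = new HashMap<>();
        setSchema(driverConf, "{\"type\":\"record\",\"name\":\"Example\",\"fields\":[]}");

        // Copying the map mimics the configuration being shipped to a worker:
        // only string key/value pairs travel, so the schema must be serialized.
        Map<String, String> workerConf = new HashMap<>(driverConf);
        System.out.println(getSchema(workerConf));
    }
}
```

The design point this illustrates is that the writer sees only the configuration, so anything it needs, such as the schema, has to be placed there before the save call is issued.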
