
Example 1 with HoodieJsonPayload

Use of org.apache.hudi.common.HoodieJsonPayload in project hudi by apache, in the buildHoodieRecordsForImport method of the HDFSParquetImporter class.

protected JavaRDD<HoodieRecord<HoodieRecordPayload>> buildHoodieRecordsForImport(JavaSparkContext jsc, String schemaStr) throws IOException {
    Job job = Job.getInstance(jsc.hadoopConfiguration());
    // Allow recursive directories to be found
    job.getConfiguration().set(FileInputFormat.INPUT_DIR_RECURSIVE, "true");
    // To parallelize reading file status.
    job.getConfiguration().set(FileInputFormat.LIST_STATUS_NUM_THREADS, "1024");
    AvroReadSupport.setAvroReadSchema(jsc.hadoopConfiguration(), new Schema.Parser().parse(schemaStr));
    ParquetInputFormat.setReadSupportClass(job, AvroReadSupport.class);
    HoodieEngineContext context = new HoodieSparkEngineContext(jsc);
    context.setJobStatus(this.getClass().getSimpleName(), "Build records for import");
    return jsc.newAPIHadoopFile(cfg.srcPath, ParquetInputFormat.class, Void.class, GenericRecord.class, job.getConfiguration())
        // Coalesce to avoid an excessive number of map tasks.
        .coalesce(16 * cfg.parallelism).map(entry -> {
        GenericRecord genericRecord = ((Tuple2<Void, GenericRecord>) entry)._2();
        Object partitionField = genericRecord.get(cfg.partitionKey);
        if (partitionField == null) {
            throw new HoodieIOException("Partition key is missing: " + cfg.partitionKey);
        }
        Object rowField = genericRecord.get(cfg.rowKey);
        if (rowField == null) {
            throw new HoodieIOException("Row key is missing: " + cfg.rowKey);
        }
        String partitionPath = partitionField.toString();
        LOG.debug("Row Key : " + rowField + ", Partition Path is (" + partitionPath + ")");
        // Numeric partition values are treated as epoch seconds: convert to
        // milliseconds and format them into a date-based partition path.
        if (partitionField instanceof Number) {
            try {
                long ts = (long) (Double.parseDouble(partitionField.toString()) * 1000L);
                partitionPath = PARTITION_FORMATTER.format(Instant.ofEpochMilli(ts));
            } catch (NumberFormatException nfe) {
                LOG.warn("Unable to parse date from partition field. Assuming partition as (" + partitionField + ")");
            }
        }
        // Key the record by (rowKey, partitionPath) and carry the record's JSON form as a HoodieJsonPayload.
        return new HoodieAvroRecord<>(new HoodieKey(rowField.toString(), partitionPath), new HoodieJsonPayload(genericRecord.toString()));
    });
}
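
The snippet references PARTITION_FORMATTER without showing its definition. A plausible definition is sketched below, assuming the importer lays partitions out as yyyy/MM/dd paths in the system time zone (the pattern and zone choice are assumptions, not shown in the source above):

// Assumed definition of the formatter used above; pattern and zone are illustrative.
private static final DateTimeFormatter PARTITION_FORMATTER =
        DateTimeFormatter.ofPattern("yyyy/MM/dd").withZone(ZoneId.systemDefault());

The withZone call matters here: an Instant carries no date fields of its own, so formatting Instant.ofEpochMilli(ts) without a zone on the formatter would throw at runtime.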
Also used :
HoodieSparkEngineContext (org.apache.hudi.client.common.HoodieSparkEngineContext)
Schema (org.apache.avro.Schema)
HoodieEngineContext (org.apache.hudi.common.engine.HoodieEngineContext)
HoodieIOException (org.apache.hudi.exception.HoodieIOException)
HoodieAvroRecord (org.apache.hudi.common.model.HoodieAvroRecord)
HoodieJsonPayload (org.apache.hudi.common.HoodieJsonPayload)
Tuple2 (scala.Tuple2)
HoodieKey (org.apache.hudi.common.model.HoodieKey)
Job (org.apache.hadoop.mapreduce.Job)
GenericRecord (org.apache.avro.generic.GenericRecord)
AvroReadSupport (org.apache.parquet.avro.AvroReadSupport)
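
A minimal usage sketch for the RDD this method returns, assuming an already-configured org.apache.hudi.client.SparkRDDWriteClient named writeClient (the client variable and its setup are assumptions, not part of the source above):

// Sketch only: writeClient construction and table configuration are assumed, not shown.
JavaRDD<HoodieRecord<HoodieRecordPayload>> records = buildHoodieRecordsForImport(jsc, schemaStr);
// Open a new commit on the table, then bulk-insert the imported records under that instant.
String instantTime = writeClient.startCommit();
writeClient.bulkInsert(records, instantTime);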

Aggregations

Schema (org.apache.avro.Schema): 1
GenericRecord (org.apache.avro.generic.GenericRecord): 1
Job (org.apache.hadoop.mapreduce.Job): 1
HoodieSparkEngineContext (org.apache.hudi.client.common.HoodieSparkEngineContext): 1
HoodieJsonPayload (org.apache.hudi.common.HoodieJsonPayload): 1
HoodieEngineContext (org.apache.hudi.common.engine.HoodieEngineContext): 1
HoodieAvroRecord (org.apache.hudi.common.model.HoodieAvroRecord): 1
HoodieKey (org.apache.hudi.common.model.HoodieKey): 1
HoodieIOException (org.apache.hudi.exception.HoodieIOException): 1
AvroReadSupport (org.apache.parquet.avro.AvroReadSupport): 1
Tuple2 (scala.Tuple2): 1