
Example 1 with SerializableSchema

Use of org.apache.hudi.common.config.SerializableSchema in project hudi by apache.

From the class SingleSparkJobExecutionStrategy, the method performClustering:

@Override
public HoodieWriteMetadata<HoodieData<WriteStatus>> performClustering(final HoodieClusteringPlan clusteringPlan, final Schema schema, final String instantTime) {
    JavaSparkContext engineContext = HoodieSparkEngineContext.getSparkContext(getEngineContext());
    final TaskContextSupplier taskContextSupplier = getEngineContext().getTaskContextSupplier();
    // Wrap the Avro schema so the Spark closure below can serialize it;
    // org.apache.avro.Schema itself is not java.io.Serializable.
    final SerializableSchema serializableSchema = new SerializableSchema(schema);
    final List<ClusteringGroupInfo> clusteringGroupInfos = clusteringPlan.getInputGroups().stream()
        .map(ClusteringGroupInfo::create)
        .collect(Collectors.toList());
    // Broadcast the configured umask so files written on executors get the same permissions as on the driver.
    String umask = engineContext.hadoopConfiguration().get("fs.permissions.umask-mode");
    Broadcast<String> umaskBroadcastValue = engineContext.broadcast(umask);
    // One Spark partition per clustering group, so each group runs as its own task.
    JavaRDD<ClusteringGroupInfo> groupInfoJavaRDD = engineContext.parallelize(clusteringGroupInfos, clusteringGroupInfos.size());
    LOG.info("number of partitions for clustering " + groupInfoJavaRDD.getNumPartitions());
    JavaRDD<WriteStatus> writeStatusRDD = groupInfoJavaRDD.mapPartitions(clusteringOps -> {
        Configuration configuration = new Configuration();
        configuration.set("fs.permissions.umask-mode", umaskBroadcastValue.getValue());
        // Materialize the iterator so every group in this partition can be clustered in turn.
        Iterable<ClusteringGroupInfo> clusteringOpsIterable = () -> clusteringOps;
        List<ClusteringGroupInfo> groupsInPartition = StreamSupport.stream(clusteringOpsIterable.spliterator(), false)
            .collect(Collectors.toList());
        return groupsInPartition.stream()
            .flatMap(clusteringOp -> runClusteringForGroup(clusteringOp,
                clusteringPlan.getStrategy().getStrategyParams(),
                Option.ofNullable(clusteringPlan.getPreserveHoodieMetadata()).orElse(false),
                serializableSchema, taskContextSupplier, instantTime))
            .iterator();
    });
    HoodieWriteMetadata<HoodieData<WriteStatus>> writeMetadata = new HoodieWriteMetadata<>();
    writeMetadata.setWriteStatuses(HoodieJavaRDD.of(writeStatusRDD));
    return writeMetadata;
}
Also used : HoodieTable(org.apache.hudi.table.HoodieTable) KeyGenUtils(org.apache.hudi.keygen.KeyGenUtils) HoodieAvroUtils(org.apache.hudi.avro.HoodieAvroUtils) RewriteAvroPayload(org.apache.hudi.common.model.RewriteAvroPayload) ConcatenatingIterator(org.apache.hudi.client.utils.ConcatenatingIterator) SerializableSchema(org.apache.hudi.common.config.SerializableSchema) JavaSparkContext(org.apache.spark.api.java.JavaSparkContext) Option(org.apache.hudi.common.util.Option) HoodieEngineContext(org.apache.hudi.common.engine.HoodieEngineContext) HoodieJavaRDD(org.apache.hudi.data.HoodieJavaRDD) BaseKeyGenerator(org.apache.hudi.keygen.BaseKeyGenerator) Logger(org.apache.log4j.Logger) HoodieFileReaderFactory(org.apache.hudi.io.storage.HoodieFileReaderFactory) Configuration(org.apache.hadoop.conf.Configuration) Map(java.util.Map) Path(org.apache.hadoop.fs.Path) HoodieSparkEngineContext(org.apache.hudi.client.common.HoodieSparkEngineContext) StreamSupport(java.util.stream.StreamSupport) HoodieWriteMetadata(org.apache.hudi.table.action.HoodieWriteMetadata) HoodieFileGroupId(org.apache.hudi.common.model.HoodieFileGroupId) HoodieSparkKeyGeneratorFactory(org.apache.hudi.keygen.factory.HoodieSparkKeyGeneratorFactory) ClusteringExecutionStrategy(org.apache.hudi.table.action.cluster.strategy.ClusteringExecutionStrategy) IndexedRecord(org.apache.avro.generic.IndexedRecord) JavaRDD(org.apache.spark.api.java.JavaRDD) Broadcast(org.apache.spark.broadcast.Broadcast) HoodieRecord(org.apache.hudi.common.model.HoodieRecord) GenericRecord(org.apache.avro.generic.GenericRecord) HoodieData(org.apache.hudi.common.data.HoodieData) Schema(org.apache.avro.Schema) TypedProperties(org.apache.hudi.common.config.TypedProperties) HoodieWriteConfig(org.apache.hudi.config.HoodieWriteConfig) Iterator(java.util.Iterator) TaskContextSupplier(org.apache.hudi.common.engine.TaskContextSupplier) HoodieClusteringPlan(org.apache.hudi.avro.model.HoodieClusteringPlan) HoodieClusteringException(org.apache.hudi.exception.HoodieClusteringException) ClusteringOperation(org.apache.hudi.common.model.ClusteringOperation) IOException(java.io.IOException) Collectors(java.util.stream.Collectors) HoodieAvroRecord(org.apache.hudi.common.model.HoodieAvroRecord) WriteStatus(org.apache.hudi.client.WriteStatus) ClusteringGroupInfo(org.apache.hudi.common.model.ClusteringGroupInfo) HoodieRecordPayload(org.apache.hudi.common.model.HoodieRecordPayload) List(java.util.List) Stream(java.util.stream.Stream) HoodieKey(org.apache.hudi.common.model.HoodieKey) HoodieIOException(org.apache.hudi.exception.HoodieIOException) LogManager(org.apache.log4j.LogManager)
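
Why the wrapping is needed at all: org.apache.avro.Schema does not implement java.io.Serializable, so referencing the raw schema inside mapPartitions would make the closure fail task serialization. Below is a minimal, self-contained sketch of the wrap-and-unwrap pattern; the class name, record schema, and local Spark context are hypothetical and not taken from the Hudi codebase.

import java.util.Arrays;
import org.apache.avro.Schema;
import org.apache.hudi.common.config.SerializableSchema;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SerializableSchemaClosureSketch {
    public static void main(String[] args) {
        JavaSparkContext jsc = new JavaSparkContext("local[2]", "serializable-schema-sketch");
        // A hypothetical record schema standing in for a real table schema.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"ExampleRecord\","
                + "\"fields\":[{\"name\":\"id\",\"type\":\"string\"}]}");
        // Wrap once on the driver; the wrapper ships the schema as its JSON string
        // and re-parses it on each executor.
        SerializableSchema serializableSchema = new SerializableSchema(schema);
        JavaRDD<String> fieldNames = jsc.parallelize(Arrays.asList(1, 2, 3))
            // Capturing the raw Schema here instead would break task serialization.
            .map(i -> serializableSchema.get().getField("id").name());
        System.out.println(fieldNames.collect());
        jsc.stop();
    }
}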

Example 2 with SerializableSchema

Use of org.apache.hudi.common.config.SerializableSchema in project hudi by apache.

From the class TestSerializableSchema, the method verifySchema:

private void verifySchema(Schema schema) throws IOException {
    SerializableSchema serializableSchema = new SerializableSchema(schema);
    // get() must return an equal but distinct Schema instance, i.e. a defensive copy.
    assertEquals(schema, serializableSchema.get());
    assertTrue(schema != serializableSchema.get());
    // Round-trip the schema through the custom serialization hooks.
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    ObjectOutputStream oos = new ObjectOutputStream(baos);
    serializableSchema.writeObjectTo(oos);
    oos.flush();
    oos.close();
    byte[] bytesWritten = baos.toByteArray();
    SerializableSchema newSchema = new SerializableSchema();
    newSchema.readObjectFrom(new ObjectInputStream(new ByteArrayInputStream(bytesWritten)));
    assertEquals(schema, newSchema.get());
}
Also used : ByteArrayInputStream(java.io.ByteArrayInputStream) ByteArrayOutputStream(java.io.ByteArrayOutputStream) ObjectOutputStream(java.io.ObjectOutputStream) SerializableSchema(org.apache.hudi.common.config.SerializableSchema) ObjectInputStream(java.io.ObjectInputStream)
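
The test drives the writeObjectTo/readObjectFrom hooks directly, but the wrapper can also be round-tripped through plain Java object serialization. Here is a minimal sketch, assuming (as the hook names suggest) that the wrapper's private writeObject/readObject delegate to those methods; the schema JSON is a hypothetical example.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import org.apache.avro.Schema;
import org.apache.hudi.common.config.SerializableSchema;

public class SerializableSchemaRoundTripSketch {
    public static void main(String[] args) throws IOException, ClassNotFoundException {
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"ExampleRecord\","
                + "\"fields\":[{\"name\":\"id\",\"type\":\"string\"}]}");
        SerializableSchema original = new SerializableSchema(schema);
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(baos)) {
            // Standard serialization of the wrapper, not of the raw Schema.
            oos.writeObject(original);
        }
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(baos.toByteArray()))) {
            SerializableSchema copy = (SerializableSchema) ois.readObject();
            // Equal schema content, but a distinct Schema instance.
            System.out.println(schema.equals(copy.get()) && schema != copy.get());
        }
    }
}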

Example 3 with SerializableSchema

Use of org.apache.hudi.common.config.SerializableSchema in project hudi by apache.

From the class RDDSpatialCurveSortPartitioner, the method repartitionRecords:

@Override
public JavaRDD<HoodieRecord<T>> repartitionRecords(JavaRDD<HoodieRecord<T>> records, int outputSparkPartitions) {
    // Wrap the schema so the map closures below can serialize it.
    SerializableSchema serializableSchema = new SerializableSchema(schema);
    // Extract the Avro payload of every record.
    JavaRDD<GenericRecord> genericRecordsRDD = records.map(f -> (GenericRecord) f.getData().getInsertValue(serializableSchema.get()).get());
    // Convert to a DataFrame, reorder along the space-filling curve, then convert back.
    Dataset<Row> sourceDataset = AvroConversionUtils.createDataFrame(genericRecordsRDD.rdd(), schema.toString(), sparkEngineContext.getSqlContext().sparkSession());
    Dataset<Row> sortedDataset = reorder(sourceDataset, outputSparkPartitions);
    return HoodieSparkUtils.createRdd(sortedDataset, schema.getName(), schema.getNamespace(), false, Option.empty()).toJavaRDD().map(record -> {
        // Rebuild each HoodieRecord from the metadata columns carried through the sort.
        String key = record.get(HoodieRecord.RECORD_KEY_METADATA_FIELD).toString();
        String partition = record.get(HoodieRecord.PARTITION_PATH_METADATA_FIELD).toString();
        HoodieKey hoodieKey = new HoodieKey(key, partition);
        return new HoodieAvroRecord(hoodieKey, new RewriteAvroPayload(record));
    });
}
Also used : HoodieAvroRecord(org.apache.hudi.common.model.HoodieAvroRecord) HoodieRecord(org.apache.hudi.common.model.HoodieRecord) HoodieKey(org.apache.hudi.common.model.HoodieKey) Row(org.apache.spark.sql.Row) RewriteAvroPayload(org.apache.hudi.common.model.RewriteAvroPayload) GenericRecord(org.apache.avro.generic.GenericRecord) SerializableSchema(org.apache.hudi.common.config.SerializableSchema)
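
The closing map call above rebuilds each HoodieRecord from the Hudi metadata columns that survive the round trip through Spark SQL. The following standalone sketch isolates that rebuild step; the two-field schema containing only the metadata columns is a hypothetical stand-in for a full table schema.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hudi.common.model.HoodieAvroRecord;
import org.apache.hudi.common.model.HoodieKey;
import org.apache.hudi.common.model.HoodieRecord;
import org.apache.hudi.common.model.RewriteAvroPayload;

public class RebuildHoodieRecordSketch {
    public static void main(String[] args) {
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Row\",\"fields\":["
                + "{\"name\":\"" + HoodieRecord.RECORD_KEY_METADATA_FIELD + "\",\"type\":\"string\"},"
                + "{\"name\":\"" + HoodieRecord.PARTITION_PATH_METADATA_FIELD + "\",\"type\":\"string\"}]}");
        GenericRecord record = new GenericData.Record(schema);
        record.put(HoodieRecord.RECORD_KEY_METADATA_FIELD, "key-001");
        record.put(HoodieRecord.PARTITION_PATH_METADATA_FIELD, "2021/01/01");
        // Recover the key from the metadata columns, exactly as the map above does.
        HoodieKey hoodieKey = new HoodieKey(
            record.get(HoodieRecord.RECORD_KEY_METADATA_FIELD).toString(),
            record.get(HoodieRecord.PARTITION_PATH_METADATA_FIELD).toString());
        // RewriteAvroPayload carries the already-materialized record through the rewrite.
        HoodieRecord rebuilt = new HoodieAvroRecord(hoodieKey, new RewriteAvroPayload(record));
        System.out.println(rebuilt.getKey());
    }
}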

Example 4 with SerializableSchema

Use of org.apache.hudi.common.config.SerializableSchema in project hudi by apache.

From the class RDDCustomColumnsSortPartitioner, the method repartitionRecords:

@Override
public JavaRDD<HoodieRecord<T>> repartitionRecords(JavaRDD<HoodieRecord<T>> records, int outputSparkPartitions) {
    // Copy fields into local variables so the Spark closure captures only these
    // values instead of serializing the whole partitioner instance.
    final String[] sortColumns = this.sortColumnNames;
    final SerializableSchema schema = this.serializableSchema;
    final boolean consistentLogicalTimestampEnabled = this.consistentLogicalTimestampEnabled;
    return records.sortBy(record -> {
        Object recordValue = HoodieAvroUtils.getRecordColumnValues(record, sortColumns, schema, consistentLogicalTimestampEnabled);
        // Null values are replaced with the empty string for null_first ordering.
        if (recordValue == null) {
            return StringUtils.EMPTY_STRING;
        } else {
            // Sort by the extracted column value, not the whole record.
            return StringUtils.objToString(recordValue);
        }
    }, true, outputSparkPartitions);
}
Also used : SerializableSchema(org.apache.hudi.common.config.SerializableSchema)
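
The practical effect of the null branch is a stable "null first" ordering: a null column value maps to the empty string, which sorts before every non-empty key. Below is a self-contained sketch of the key derivation alone, reusing the Hudi StringUtils helpers from the method above; the column values are hypothetical.

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import org.apache.hudi.common.util.StringUtils;

public class NullFirstSortKeySketch {
    public static void main(String[] args) {
        List<Object> columnValues = Arrays.asList("beijing", null, "amsterdam");
        List<String> sortKeys = columnValues.stream()
            // Same rule as the partitioner: null becomes the empty string.
            .map(v -> v == null ? StringUtils.EMPTY_STRING : StringUtils.objToString(v))
            .sorted()
            .collect(Collectors.toList());
        // Prints [, amsterdam, beijing]: the null row's key comes first.
        System.out.println(sortKeys);
    }
}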

Aggregations

SerializableSchema (org.apache.hudi.common.config.SerializableSchema): 4
GenericRecord (org.apache.avro.generic.GenericRecord): 2
HoodieAvroRecord (org.apache.hudi.common.model.HoodieAvroRecord): 2
HoodieKey (org.apache.hudi.common.model.HoodieKey): 2
HoodieRecord (org.apache.hudi.common.model.HoodieRecord): 2
RewriteAvroPayload (org.apache.hudi.common.model.RewriteAvroPayload): 2
ByteArrayInputStream (java.io.ByteArrayInputStream): 1
ByteArrayOutputStream (java.io.ByteArrayOutputStream): 1
IOException (java.io.IOException): 1
ObjectInputStream (java.io.ObjectInputStream): 1
ObjectOutputStream (java.io.ObjectOutputStream): 1
Iterator (java.util.Iterator): 1
List (java.util.List): 1
Map (java.util.Map): 1
Collectors (java.util.stream.Collectors): 1
Stream (java.util.stream.Stream): 1
StreamSupport (java.util.stream.StreamSupport): 1
Schema (org.apache.avro.Schema): 1
IndexedRecord (org.apache.avro.generic.IndexedRecord): 1
Configuration (org.apache.hadoop.conf.Configuration): 1