Example 11 with WorkloadProfile

Use of org.apache.hudi.table.WorkloadProfile in project hudi by apache.

From the class TestUpsertPartitioner, the method testUpsertPartitionerWithSmallFileHandlingPickingMultipleCandidates:

@Test
public void testUpsertPartitionerWithSmallFileHandlingPickingMultipleCandidates() throws Exception {
    final String partitionPath = DEFAULT_PARTITION_PATHS[0];
    HoodieWriteConfig config = makeHoodieClientConfigBuilder()
        .withMergeSmallFileGroupCandidatesLimit(3)
        .withStorageConfig(HoodieStorageConfig.newBuilder().parquetMaxFileSize(2048).build())
        .build();
    // Bootstrap base files ("small-file targets")
    FileCreateUtils.createBaseFile(basePath, partitionPath, "002", "fg-1", 1024);
    FileCreateUtils.createBaseFile(basePath, partitionPath, "002", "fg-2", 1024);
    FileCreateUtils.createBaseFile(basePath, partitionPath, "002", "fg-3", 1024);
    FileCreateUtils.createCommit(basePath, "002");
    HoodieTestDataGenerator dataGenerator = new HoodieTestDataGenerator(new String[] { partitionPath });
    // Default estimated record size will be 1024 based on last file group created.
    // Only 1 record can be added to small file
    WorkloadProfile profile = new WorkloadProfile(buildProfile(jsc.parallelize(dataGenerator.generateInserts("003", 3))));
    HoodieTableMetaClient reloadedMetaClient = HoodieTableMetaClient.reload(this.metaClient);
    HoodieSparkTable<?> table = HoodieSparkTable.create(config, context, reloadedMetaClient);
    SparkUpsertDeltaCommitPartitioner<?> partitioner = new SparkUpsertDeltaCommitPartitioner<>(profile, context, table, config);
    assertEquals(3, partitioner.numPartitions());
    assertEquals(Arrays.asList(
        new BucketInfo(BucketType.UPDATE, "fg-1", partitionPath),
        new BucketInfo(BucketType.UPDATE, "fg-2", partitionPath),
        new BucketInfo(BucketType.UPDATE, "fg-3", partitionPath)),
        partitioner.getBucketInfos());
}
Also used: WorkloadProfile (org.apache.hudi.table.WorkloadProfile), HoodieTableMetaClient (org.apache.hudi.common.table.HoodieTableMetaClient), HoodieWriteConfig (org.apache.hudi.config.HoodieWriteConfig), SparkUpsertDeltaCommitPartitioner (org.apache.hudi.table.action.deltacommit.SparkUpsertDeltaCommitPartitioner), HoodieTestDataGenerator (org.apache.hudi.common.testutils.HoodieTestDataGenerator), Test (org.junit.jupiter.api.Test)
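
The bucket assignment asserted above follows directly from the sizes involved: each bootstrapped base file is 1024 bytes, parquetMaxFileSize is 2048 bytes, and the estimated record size defaults to 1024 bytes, so every small file has room for exactly one more record and the three inserts land in the three existing file groups as UPDATE buckets. The sketch below only illustrates that sizing arithmetic; the class SmallFileMath and the method recordsThatFit are hypothetical helpers, not part of Hudi's partitioner.

// Minimal sketch of the small-file sizing behind the test above.
// SmallFileMath and recordsThatFit are hypothetical; Hudi's real
// SparkUpsertDeltaCommitPartitioner does considerably more than this.
public class SmallFileMath {

    // Extra records that fit into a base file before it reaches the max size.
    static long recordsThatFit(long currentFileSize, long maxFileSize, long estimatedRecordSize) {
        long freeBytes = Math.max(0L, maxFileSize - currentFileSize);
        return freeBytes / estimatedRecordSize;
    }

    public static void main(String[] args) {
        long maxFileSize = 2048;          // parquetMaxFileSize in the test config
        long baseFileSize = 1024;         // size of fg-1, fg-2 and fg-3
        long estimatedRecordSize = 1024;  // default estimate taken from the last file group

        // Each of the three small files can absorb exactly one record, so the
        // 3 generated inserts spread across the 3 file groups, giving 3 UPDATE
        // buckets and numPartitions() == 3 as asserted above.
        System.out.println(recordsThatFit(baseFileSize, maxFileSize, estimatedRecordSize)); // prints 1
    }
}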

Example 12 with WorkloadProfile

Use of org.apache.hudi.table.WorkloadProfile in project hudi by apache.

From the class BaseSparkCommitActionExecutor, the method execute:

@Override
public HoodieWriteMetadata<HoodieData<WriteStatus>> execute(HoodieData<HoodieRecord<T>> inputRecords) {
    // Cache the tagged records, so we don't end up computing them twice downstream
    // TODO: Consistent contract in HoodieWriteClient regarding preppedRecord storage level handling
    JavaRDD<HoodieRecord<T>> inputRDD = HoodieJavaRDD.getJavaRDD(inputRecords);
    if (inputRDD.getStorageLevel() == StorageLevel.NONE()) {
        inputRDD.persist(StorageLevel.MEMORY_AND_DISK_SER());
    } else {
        LOG.info("RDD PreppedRecords was persisted at: " + inputRDD.getStorageLevel());
    }
    WorkloadProfile workloadProfile = null;
    if (isWorkloadProfileNeeded()) {
        context.setJobStatus(this.getClass().getSimpleName(), "Building workload profile");
        workloadProfile = new WorkloadProfile(buildProfile(inputRecords), operationType, table.getIndex().canIndexLogFiles());
        LOG.info("Input workload profile :" + workloadProfile);
    }
    // partition using the insert partitioner
    final Partitioner partitioner = getPartitioner(workloadProfile);
    if (isWorkloadProfileNeeded()) {
        saveWorkloadProfileMetadataToInflight(workloadProfile, instantTime);
    }
    // handle records update with clustering
    HoodieData<HoodieRecord<T>> inputRecordsWithClusteringUpdate = clusteringHandleUpdate(inputRecords);
    context.setJobStatus(this.getClass().getSimpleName(), "Doing partition and writing data");
    HoodieData<WriteStatus> writeStatuses = mapPartitionsAsRDD(inputRecordsWithClusteringUpdate, partitioner);
    HoodieWriteMetadata<HoodieData<WriteStatus>> result = new HoodieWriteMetadata<>();
    updateIndexAndCommitIfNeeded(writeStatuses, result);
    return result;
}
Also used: WorkloadProfile (org.apache.hudi.table.WorkloadProfile), HoodieData (org.apache.hudi.common.data.HoodieData), HoodieRecord (org.apache.hudi.common.model.HoodieRecord), HoodieWriteMetadata (org.apache.hudi.table.action.HoodieWriteMetadata), Partitioner (org.apache.spark.Partitioner), WriteStatus (org.apache.hudi.client.WriteStatus)
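
The guard at the top of execute is a standard Spark idiom: persist the input RDD only when the caller has not already chosen a storage level, so an upstream persistence decision is respected. Below is a minimal, standalone sketch of that pattern; the class name PersistGuardExample and the string records are placeholders, not Hudi types.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

public class PersistGuardExample {
    public static void main(String[] args) {
        JavaSparkContext jsc = new JavaSparkContext(
            new SparkConf().setMaster("local[1]").setAppName("persist-guard"));
        JavaRDD<String> records = jsc.parallelize(Arrays.asList("r1", "r2", "r3"));

        // Persist only if no storage level has been set yet, mirroring the
        // check in BaseSparkCommitActionExecutor.execute above.
        if (records.getStorageLevel() == StorageLevel.NONE()) {
            records.persist(StorageLevel.MEMORY_AND_DISK_SER());
        } else {
            System.out.println("Records already persisted at: " + records.getStorageLevel());
        }

        System.out.println("Count: " + records.count()); // triggers the job; later actions reuse the cache
        jsc.stop();
    }
}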

Aggregations

WorkloadProfile (org.apache.hudi.table.WorkloadProfile): 12 uses
HoodieRecord (org.apache.hudi.common.model.HoodieRecord): 9 uses
HoodieWriteConfig (org.apache.hudi.config.HoodieWriteConfig): 8 uses
HoodieTestDataGenerator (org.apache.hudi.common.testutils.HoodieTestDataGenerator): 6 uses
HoodieWriteMetadata (org.apache.hudi.table.action.HoodieWriteMetadata): 5 uses
Test (org.junit.jupiter.api.Test): 5 uses
List (java.util.List): 4 uses
WriteStatus (org.apache.hudi.client.WriteStatus): 4 uses
HoodieUpsertException (org.apache.hudi.exception.HoodieUpsertException): 4 uses
WorkloadStat (org.apache.hudi.table.WorkloadStat): 4 uses
Duration (java.time.Duration): 3 uses
Instant (java.time.Instant): 3 uses
HashMap (java.util.HashMap): 3 uses
LinkedList (java.util.LinkedList): 3 uses
HoodieData (org.apache.hudi.common.data.HoodieData): 3 uses
HoodieList (org.apache.hudi.common.data.HoodieList): 3 uses
EmptyHoodieRecordPayload (org.apache.hudi.common.model.EmptyHoodieRecordPayload): 3 uses
HoodieAvroRecord (org.apache.hudi.common.model.HoodieAvroRecord): 3 uses
HoodieKey (org.apache.hudi.common.model.HoodieKey): 3 uses
HoodieSparkTable (org.apache.hudi.table.HoodieSparkTable): 3 uses