Example 11 with HoodieDeltaStreamer

Use of org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer in project hudi by apache.

From the class TestHoodieDeltaStreamer, method testDeltaStreamerWithSpecifiedOperation.

void testDeltaStreamerWithSpecifiedOperation(final String tableBasePath, WriteOperationType operationType) throws Exception {
    // Initial insert
    HoodieDeltaStreamer.Config cfg = TestHelpers.makeConfig(tableBasePath, WriteOperationType.BULK_INSERT);
    new HoodieDeltaStreamer(cfg, jsc).sync();
    TestHelpers.assertRecordCount(1000, tableBasePath + "/*/*.parquet", sqlContext);
    TestHelpers.assertDistanceCount(1000, tableBasePath + "/*/*.parquet", sqlContext);
    TestHelpers.assertCommitMetadata("00000", tableBasePath, dfs, 1);
    // Switch to the operation under test
    cfg.operation = operationType;
    // No new data => no new commits
    cfg.sourceLimit = 0;
    new HoodieDeltaStreamer(cfg, jsc).sync();
    TestHelpers.assertRecordCount(1000, tableBasePath + "/*/*.parquet", sqlContext);
    TestHelpers.assertDistanceCount(1000, tableBasePath + "/*/*.parquet", sqlContext);
    TestHelpers.assertCommitMetadata("00000", tableBasePath, dfs, 1);
    // Allow data through again: the next sync ingests new records and adds a second commit
    cfg.sourceLimit = 1000;
    new HoodieDeltaStreamer(cfg, jsc).sync();
    TestHelpers.assertRecordCount(1950, tableBasePath + "/*/*.parquet", sqlContext);
    TestHelpers.assertDistanceCount(1950, tableBasePath + "/*/*.parquet", sqlContext);
    TestHelpers.assertCommitMetadata("00001", tableBasePath, dfs, 2);
}
Also used: HoodieDeltaStreamer (org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer)
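
Below is a minimal sketch of assembling an equivalent HoodieDeltaStreamer.Config by hand, in place of the test-only TestHelpers.makeConfig; the base path, table name, and source class are illustrative assumptions, not values from the test.

// Sketch only: field values are assumptions; jsc is an existing JavaSparkContext.
HoodieDeltaStreamer.Config cfg = new HoodieDeltaStreamer.Config();
cfg.targetBasePath = "/tmp/hudi/trips";       // assumed target path
cfg.targetTableName = "trips";                // assumed table name
cfg.tableType = "COPY_ON_WRITE";
cfg.sourceClassName = "org.apache.hudi.utilities.sources.ParquetDFSSource"; // assumed source
cfg.operation = WriteOperationType.BULK_INSERT;
cfg.sourceLimit = 0;                          // as in the test: a zero limit pulls no data, so sync() adds no commit
new HoodieDeltaStreamer(cfg, jsc).sync();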

Example 12 with HoodieDeltaStreamer

Use of org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer in project hudi by apache.

From the class TestHoodieDeltaStreamer, method testKafkaConnectCheckpointProvider.

@Test
public void testKafkaConnectCheckpointProvider() throws IOException {
    String tableBasePath = dfsBasePath + "/test_table";
    String bootstrapPath = dfsBasePath + "/kafka_topic1";
    String partitionPath = bootstrapPath + "/year=2016/month=05/day=01";
    String filePath = partitionPath + "/kafka_topic1+0+100+200.parquet";
    String checkpointProviderClass = "org.apache.hudi.utilities.checkpointing.KafkaConnectHdfsProvider";
    HoodieDeltaStreamer.Config cfg = TestHelpers.makeDropAllConfig(tableBasePath, WriteOperationType.UPSERT);
    TypedProperties props = new DFSPropertiesConfiguration(dfs.getConf(), new Path(dfsBasePath + "/" + PROPS_FILENAME_TEST_SOURCE)).getProps();
    props.put("hoodie.deltastreamer.checkpoint.provider.path", bootstrapPath);
    cfg.initialCheckpointProvider = checkpointProviderClass;
    // create regular kafka connect hdfs dirs
    dfs.mkdirs(new Path(bootstrapPath));
    dfs.mkdirs(new Path(partitionPath));
    // generate parquet files using kafka connect naming convention
    HoodieTestDataGenerator dataGenerator = new HoodieTestDataGenerator();
    Helpers.saveParquetToDFS(Helpers.toGenericRecords(dataGenerator.generateInserts("000", 100)), new Path(filePath));
    HoodieDeltaStreamer deltaStreamer = new HoodieDeltaStreamer(cfg, jsc, dfs, hdfsTestService.getHadoopConf(), Option.ofNullable(props));
    assertEquals("kafka_topic1,0:200", deltaStreamer.getConfig().checkpoint);
}
Also used: Path (org.apache.hadoop.fs.Path), HoodieDeltaStreamer (org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer), TypedProperties (org.apache.hudi.common.config.TypedProperties), DFSPropertiesConfiguration (org.apache.hudi.common.config.DFSPropertiesConfiguration), HoodieTestDataGenerator (org.apache.hudi.common.testutils.HoodieTestDataGenerator), ParameterizedTest (org.junit.jupiter.params.ParameterizedTest), Test (org.junit.jupiter.api.Test)
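
The file name in the test, kafka_topic1+0+100+200.parquet, follows the Kafka Connect HDFS convention of <topic>+<partition>+<startOffset>+<endOffset>; KafkaConnectHdfsProvider scans that layout and turns the highest committed end offset per partition into the initial checkpoint, here "kafka_topic1,0:200". A hedged sketch of wiring the provider outside the test harness, with the bootstrap path as an assumption:

// Sketch only: cfg, jsc, and dfs as in the test; the path is an assumption.
TypedProperties props = new TypedProperties();
props.put("hoodie.deltastreamer.checkpoint.provider.path", "/data/topics/kafka_topic1");
cfg.initialCheckpointProvider = "org.apache.hudi.utilities.checkpointing.KafkaConnectHdfsProvider";
HoodieDeltaStreamer deltaStreamer = new HoodieDeltaStreamer(cfg, jsc, dfs, dfs.getConf(), Option.ofNullable(props));
// deltaStreamer.getConfig().checkpoint now holds the derived "topic,partition:offset" string.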

Example 13 with HoodieDeltaStreamer

Use of org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer in project hudi by apache.

From the class TestHoodieDeltaStreamer, method testJdbcSourceIncrementalFetchInContinuousMode.

@Test
public void testJdbcSourceIncrementalFetchInContinuousMode() {
    try (Connection connection = DriverManager.getConnection("jdbc:h2:mem:test_mem", "test", "jdbc")) {
        TypedProperties props = new TypedProperties();
        props.setProperty("hoodie.deltastreamer.jdbc.url", "jdbc:h2:mem:test_mem");
        props.setProperty("hoodie.deltastreamer.jdbc.driver.class", "org.h2.Driver");
        props.setProperty("hoodie.deltastreamer.jdbc.user", "test");
        props.setProperty("hoodie.deltastreamer.jdbc.password", "jdbc");
        props.setProperty("hoodie.deltastreamer.jdbc.table.name", "triprec");
        props.setProperty("hoodie.deltastreamer.jdbc.incr.pull", "true");
        props.setProperty("hoodie.deltastreamer.jdbc.table.incr.column.name", "id");
        props.setProperty("hoodie.datasource.write.keygenerator.class", SimpleKeyGenerator.class.getName());
        props.setProperty("hoodie.datasource.write.recordkey.field", "ID");
        props.setProperty("hoodie.datasource.write.partitionpath.field", "not_there");
        UtilitiesTestBase.Helpers.savePropsToDFS(props, dfs, dfsBasePath + "/test-jdbc-source.properties");
        int numRecords = 1000;
        int sourceLimit = 100;
        String tableBasePath = dfsBasePath + "/triprec";
        HoodieDeltaStreamer.Config cfg = TestHelpers.makeConfig(tableBasePath, WriteOperationType.INSERT, JdbcSource.class.getName(), null, "test-jdbc-source.properties", false, false, sourceLimit, false, null, null, "timestamp", null);
        cfg.continuousMode = true;
        // Add 1000 records
        JdbcTestUtils.clearAndInsert("000", numRecords, connection, new HoodieTestDataGenerator(), props);
        HoodieDeltaStreamer deltaStreamer = new HoodieDeltaStreamer(cfg, jsc);
        deltaStreamerTestRunner(deltaStreamer, cfg, (r) -> {
            // Each ingest round pulls at most sourceLimit records, so draining the
            // source takes ceil(numRecords / sourceLimit) rounds/commits.
            TestHelpers.assertAtleastNCompactionCommits(numRecords / sourceLimit + ((numRecords % sourceLimit == 0) ? 0 : 1), tableBasePath, dfs);
            TestHelpers.assertRecordCount(numRecords, tableBasePath + "/*/*.parquet", sqlContext);
            return true;
        });
    } catch (Exception e) {
        fail(e.getMessage());
    }
}
Also used: HoodieDeltaStreamer (org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer), JdbcSource (org.apache.hudi.utilities.sources.JdbcSource), Connection (java.sql.Connection), SimpleKeyGenerator (org.apache.hudi.keygen.SimpleKeyGenerator), TypedProperties (org.apache.hudi.common.config.TypedProperties), HoodieTestDataGenerator (org.apache.hudi.common.testutils.HoodieTestDataGenerator), HoodieException (org.apache.hudi.exception.HoodieException), IOException (java.io.IOException), TableNotFoundException (org.apache.hudi.exception.TableNotFoundException), TopicExistsException (org.apache.kafka.common.errors.TopicExistsException), AnalysisException (org.apache.spark.sql.AnalysisException), ParameterizedTest (org.junit.jupiter.params.ParameterizedTest), Test (org.junit.jupiter.api.Test)
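
With cfg.continuousMode = true, sync() does not return after a single round: it keeps scheduling ingest rounds and async table services until a shutdown is requested, which is why the test drives it through deltaStreamerTestRunner. A minimal sketch of the same pattern outside the harness, assuming shutdownGracefully() is available on your Hudi version and the enclosing method declares throws Exception:

// Sketch only: run continuous ingestion on a worker thread and stop it on demand.
ExecutorService executor = Executors.newSingleThreadExecutor();
Future<?> ingest = executor.submit(() -> {
    try {
        deltaStreamer.sync();            // blocks while continuous mode keeps ingesting
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
});
// ... wait for an application-specific completion condition ...
deltaStreamer.shutdownGracefully();      // assumed API: asks the ingest loop to stop
ingest.get();
executor.shutdown();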

Example 14 with HoodieDeltaStreamer

Use of org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer in project hudi by apache.

From the class TestHoodieDeltaStreamer, method testSchemaEvolution.

@ParameterizedTest
@MethodSource("schemaEvolArgs")
public void testSchemaEvolution(String tableType, boolean useUserProvidedSchema, boolean useSchemaPostProcessor) throws Exception {
    String tableBasePath = dfsBasePath + "/test_table_schema_evolution" + tableType + "_" + useUserProvidedSchema + "_" + useSchemaPostProcessor;
    defaultSchemaProviderClassName = FilebasedSchemaProvider.class.getName();
    // Insert data produced with Schema A, pass Schema A
    HoodieDeltaStreamer.Config cfg = TestHelpers.makeConfig(tableBasePath, WriteOperationType.INSERT, Collections.singletonList(TestIdentityTransformer.class.getName()), PROPS_FILENAME_TEST_SOURCE, false, true, false, null, tableType);
    cfg.configs.add("hoodie.deltastreamer.schemaprovider.source.schema.file=" + dfsBasePath + "/source.avsc");
    cfg.configs.add("hoodie.deltastreamer.schemaprovider.target.schema.file=" + dfsBasePath + "/source.avsc");
    cfg.configs.add(DataSourceWriteOptions.RECONCILE_SCHEMA().key() + "=true");
    if (!useSchemaPostProcessor) {
        cfg.configs.add(SparkAvroPostProcessor.Config.SPARK_AVRO_POST_PROCESSOR_PROP_ENABLE + "=false");
    }
    new HoodieDeltaStreamer(cfg, jsc).sync();
    TestHelpers.assertRecordCount(1000, tableBasePath + "/*/*", sqlContext);
    TestHelpers.assertCommitMetadata("00000", tableBasePath, dfs, 1);
    // Upsert data produced with Schema B, pass Schema B
    cfg = TestHelpers.makeConfig(tableBasePath, WriteOperationType.UPSERT, Collections.singletonList(TripsWithEvolvedOptionalFieldTransformer.class.getName()), PROPS_FILENAME_TEST_SOURCE, false, true, false, null, tableType);
    cfg.configs.add("hoodie.deltastreamer.schemaprovider.source.schema.file=" + dfsBasePath + "/source.avsc");
    cfg.configs.add("hoodie.deltastreamer.schemaprovider.target.schema.file=" + dfsBasePath + "/source_evolved.avsc");
    cfg.configs.add(DataSourceWriteOptions.RECONCILE_SCHEMA().key() + "=true");
    if (!useSchemaPostProcessor) {
        cfg.configs.add(SparkAvroPostProcessor.Config.SPARK_AVRO_POST_PROCESSOR_PROP_ENABLE + "=false");
    }
    new HoodieDeltaStreamer(cfg, jsc).sync();
    // out of 1000 new records, 500 are inserts, 450 are updates and 50 are deletes.
    TestHelpers.assertRecordCount(1450, tableBasePath + "/*/*", sqlContext);
    TestHelpers.assertCommitMetadata("00001", tableBasePath, dfs, 2);
    List<Row> counts = TestHelpers.countsPerCommit(tableBasePath + "/*/*", sqlContext);
    assertEquals(1450, counts.stream().mapToLong(entry -> entry.getLong(1)).sum());
    sqlContext.read().format("org.apache.hudi").load(tableBasePath + "/*/*").createOrReplaceTempView("tmp_trips");
    long recordCount = sqlContext.sparkSession().sql("select * from tmp_trips where evoluted_optional_union_field is not NULL").count();
    assertEquals(950, recordCount);
    // Upsert data produced with Schema A, pass Schema B
    if (!useUserProvidedSchema) {
        defaultSchemaProviderClassName = TestFileBasedSchemaProviderNullTargetSchema.class.getName();
    }
    cfg = TestHelpers.makeConfig(tableBasePath, WriteOperationType.UPSERT, Collections.singletonList(TestIdentityTransformer.class.getName()), PROPS_FILENAME_TEST_SOURCE, false, true, false, null, tableType);
    cfg.configs.add("hoodie.deltastreamer.schemaprovider.source.schema.file=" + dfsBasePath + "/source.avsc");
    if (useUserProvidedSchema) {
        cfg.configs.add("hoodie.deltastreamer.schemaprovider.target.schema.file=" + dfsBasePath + "/source_evolved.avsc");
    }
    if (!useSchemaPostProcessor) {
        cfg.configs.add(SparkAvroPostProcessor.Config.SPARK_AVRO_POST_PROCESSOR_PROP_ENABLE + "=false");
    }
    cfg.configs.add(DataSourceWriteOptions.RECONCILE_SCHEMA().key() + "=true");
    new HoodieDeltaStreamer(cfg, jsc).sync();
    // again, 1000 new records, 500 are inserts, 450 are updates and 50 are deletes.
    TestHelpers.assertRecordCount(1900, tableBasePath + "/*/*", sqlContext);
    TestHelpers.assertCommitMetadata("00002", tableBasePath, dfs, 3);
    counts = TestHelpers.countsPerCommit(tableBasePath + "/*/*", sqlContext);
    assertEquals(1900, counts.stream().mapToLong(entry -> entry.getLong(1)).sum());
    TableSchemaResolver tableSchemaResolver = new TableSchemaResolver(HoodieTableMetaClient.builder().setBasePath(tableBasePath).setConf(dfs.getConf()).build());
    Schema tableSchema = tableSchemaResolver.getTableAvroSchemaWithoutMetadataFields();
    assertNotNull(tableSchema);
    Schema expectedSchema = new Schema.Parser().parse(dfs.open(new Path(dfsBasePath + "/source_evolved.avsc")));
    if (!useUserProvidedSchema || useSchemaPostProcessor) {
        expectedSchema = AvroConversionUtils.convertStructTypeToAvroSchema(AvroConversionUtils.convertAvroSchemaToStructType(expectedSchema), HOODIE_RECORD_STRUCT_NAME, HOODIE_RECORD_NAMESPACE);
    }
    assertEquals(expectedSchema, tableSchema);
    // clean up and reinit
    UtilitiesTestBase.Helpers.deleteFileFromDfs(FSUtils.getFs(cfg.targetBasePath, jsc.hadoopConfiguration()), dfsBasePath + "/" + PROPS_FILENAME_TEST_SOURCE);
    writeCommonPropsToFile(dfs, dfsBasePath);
    defaultSchemaProviderClassName = FilebasedSchemaProvider.class.getName();
}
Also used: Path (org.apache.hadoop.fs.Path), Schema (org.apache.avro.Schema), TableSchemaResolver (org.apache.hudi.common.table.TableSchemaResolver), FilebasedSchemaProvider (org.apache.hudi.utilities.schema.FilebasedSchemaProvider), HoodieDeltaStreamer (org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer), Row (org.apache.spark.sql.Row), ParameterizedTest (org.junit.jupiter.params.ParameterizedTest), MethodSource (org.junit.jupiter.params.provider.MethodSource)
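
The closing assertions illustrate a reusable pattern on their own: TableSchemaResolver reads the table's latest commit and returns its current Avro schema, which is how the test verifies that the evolved field landed. A minimal sketch, with the base path as an assumption:

// Sketch only: inspect a table's current schema after writes.
HoodieTableMetaClient metaClient = HoodieTableMetaClient.builder()
    .setBasePath("/tmp/hudi/trips")      // assumed base path
    .setConf(jsc.hadoopConfiguration())
    .build();
Schema tableSchema = new TableSchemaResolver(metaClient).getTableAvroSchemaWithoutMetadataFields();
System.out.println(tableSchema.toString(true));  // pretty-printed Avro schema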

Example 15 with HoodieDeltaStreamer

Use of org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer in project hudi by apache.

From the class TestHoodieDeltaStreamer, method testHoodieAsyncClusteringJobWithScheduleAndExecute.

@ParameterizedTest
@ValueSource(strings = { "execute", "schedule", "scheduleAndExecute" })
public void testHoodieAsyncClusteringJobWithScheduleAndExecute(String runningMode) throws Exception {
    String tableBasePath = dfsBasePath + "/asyncClustering2";
    HoodieDeltaStreamer ds = initialHoodieDeltaStreamer(tableBasePath, 3000, "false");
    HoodieClusteringJob scheduleClusteringJob = initialHoodieClusteringJob(tableBasePath, null, true, runningMode);
    deltaStreamerTestRunner(ds, (r) -> {
        TestHelpers.assertAtLeastNCommits(2, tableBasePath, dfs);
        try {
            int result = scheduleClusteringJob.cluster(0);
            if (result == 0) {
                LOG.info("Clustering succeeded");
            } else {
                LOG.warn("Clustering failed");
                if (!runningMode.equalsIgnoreCase(HoodieClusteringJob.EXECUTE)) {
                    return false;
                }
            }
        } catch (Exception e) {
            LOG.warn("ScheduleAndExecute clustering failed", e);
            if (!runningMode.equalsIgnoreCase(HoodieClusteringJob.EXECUTE)) {
                return false;
            }
        }
        switch(runningMode.toLowerCase()) {
            case HoodieClusteringJob.SCHEDULE_AND_EXECUTE:
                {
                    TestHelpers.assertAtLeastNReplaceCommits(2, tableBasePath, dfs);
                    return true;
                }
            case HoodieClusteringJob.SCHEDULE:
                {
                    TestHelpers.assertAtLeastNReplaceRequests(2, tableBasePath, dfs);
                    TestHelpers.assertNoReplaceCommits(tableBasePath, dfs);
                    return true;
                }
            case HoodieClusteringJob.EXECUTE:
                {
                    TestHelpers.assertNoReplaceCommits(tableBasePath, dfs);
                    return true;
                }
            default:
                throw new IllegalStateException("Unexpected value: " + runningMode);
        }
    });
}
Also used: HoodieDeltaStreamer (org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer), HoodieClusteringJob (org.apache.hudi.utilities.HoodieClusteringJob), HoodieException (org.apache.hudi.exception.HoodieException), IOException (java.io.IOException), TableNotFoundException (org.apache.hudi.exception.TableNotFoundException), TopicExistsException (org.apache.kafka.common.errors.TopicExistsException), AnalysisException (org.apache.spark.sql.AnalysisException), ValueSource (org.junit.jupiter.params.provider.ValueSource), ParameterizedTest (org.junit.jupiter.params.ParameterizedTest)
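
The test obtains its clustering job from the initialHoodieClusteringJob helper; a hedged sketch of what that construction amounts to, with the Config field names taken as assumptions about HoodieClusteringJob.Config:

// Sketch only: schedule and execute clustering against an existing table.
HoodieClusteringJob.Config clusterCfg = new HoodieClusteringJob.Config();
clusterCfg.basePath = tableBasePath;                 // assumed field name
clusterCfg.runningMode = HoodieClusteringJob.SCHEDULE_AND_EXECUTE;
HoodieClusteringJob clusteringJob = new HoodieClusteringJob(jsc, clusterCfg);
int exitCode = clusteringJob.cluster(0);             // 0 means success, as checked above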

Aggregations

HoodieDeltaStreamer (org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer): 37 usages
ParameterizedTest (org.junit.jupiter.params.ParameterizedTest): 29 usages
Test (org.junit.jupiter.api.Test): 20 usages
TypedProperties (org.apache.hudi.common.config.TypedProperties): 10 usages
HoodieTableMetaClient (org.apache.hudi.common.table.HoodieTableMetaClient): 9 usages
IOException (java.io.IOException): 8 usages
HoodieException (org.apache.hudi.exception.HoodieException): 7 usages
TableNotFoundException (org.apache.hudi.exception.TableNotFoundException): 7 usages
TopicExistsException (org.apache.kafka.common.errors.TopicExistsException): 7 usages
AnalysisException (org.apache.spark.sql.AnalysisException): 7 usages
Path (org.apache.hadoop.fs.Path): 6 usages
HoodieCommitMetadata (org.apache.hudi.common.model.HoodieCommitMetadata): 6 usages
FileSystem (org.apache.hadoop.fs.FileSystem): 5 usages
HoodieTimeline (org.apache.hudi.common.table.timeline.HoodieTimeline): 5 usages
HoodieClusteringJob (org.apache.hudi.utilities.HoodieClusteringJob): 5 usages
Row (org.apache.spark.sql.Row): 5 usages
ValueSource (org.junit.jupiter.params.provider.ValueSource): 5 usages
HoodieInstant (org.apache.hudi.common.table.timeline.HoodieInstant): 4 usages
Properties (java.util.Properties): 3 usages
FSDataInputStream (org.apache.hadoop.fs.FSDataInputStream): 3 usages