Examples with SourceFormatAdapter - org.apache.hudi.utilities.deltastreamer.SourceFormatAdapter

Example 6 with SourceFormatAdapter

use of org.apache.hudi.utilities.deltastreamer.SourceFormatAdapter in project hudi by apache.

the class TestS3EventsSource method testReadingFromSource.

/**
 * Runs the test scenario of reading data from the source.
 *
 * @throws IOException
 */
@Test
public void testReadingFromSource() throws IOException {
    SourceFormatAdapter sourceFormatAdapter = new SourceFormatAdapter(prepareCloudObjectSource());
    // 1. Extract without any checkpoint => (no data available)
    generateMessageInQueue(null);
    assertEquals(Option.empty(), sourceFormatAdapter.fetchNewDataInAvroFormat(Option.empty(), Long.MAX_VALUE).getBatch());
    // 2. Extract without any checkpoint =>  (adding new file)
    generateMessageInQueue("1");
    // Test fetching Avro format
    InputBatch<JavaRDD<GenericRecord>> fetch1 = sourceFormatAdapter.fetchNewDataInAvroFormat(Option.empty(), Long.MAX_VALUE);
    assertEquals(1, fetch1.getBatch().get().count());
    // 3. Produce new data, extract new data
    generateMessageInQueue("2");
    // Test fetching Avro format
    InputBatch<JavaRDD<GenericRecord>> fetch2 = sourceFormatAdapter.fetchNewDataInAvroFormat(Option.of(fetch1.getCheckpointForNextBatch()), Long.MAX_VALUE);
    assertEquals(1, fetch2.getBatch().get().count());
    GenericRecord s3 = (GenericRecord) fetch2.getBatch().get().rdd().first().get("s3");
    GenericRecord s3Object = (GenericRecord) s3.get("object");
    assertEquals("2.parquet", s3Object.get("key").toString());
}

Also used : GenericRecord(org.apache.avro.generic.GenericRecord) SourceFormatAdapter(org.apache.hudi.utilities.deltastreamer.SourceFormatAdapter) JavaRDD(org.apache.spark.api.java.JavaRDD) Test(org.junit.jupiter.api.Test)

Example 7 with SourceFormatAdapter

use of org.apache.hudi.utilities.deltastreamer.SourceFormatAdapter in project hudi by apache.

the class TestSqlSource method testSqlSourceInvalidTable.

/**
 * Runs the test scenario of reading data from the source in row format.
 * Source table doesn't exists.
 *
 * @throws IOException
 */
@Test
public void testSqlSourceInvalidTable() throws IOException {
    props.setProperty(sqlSourceConfig, "select * from not_exist_sql_table");
    sqlSource = new SqlSource(props, jsc, sparkSession, schemaProvider);
    sourceFormatAdapter = new SourceFormatAdapter(sqlSource);
    assertThrows(AnalysisException.class, () -> sourceFormatAdapter.fetchNewDataInRowFormat(Option.empty(), Long.MAX_VALUE));
}

Also used : SourceFormatAdapter(org.apache.hudi.utilities.deltastreamer.SourceFormatAdapter) Test(org.junit.jupiter.api.Test)

Example 8 with SourceFormatAdapter

use of org.apache.hudi.utilities.deltastreamer.SourceFormatAdapter in project hudi by apache.

the class TestSqlSource method testSqlSourceZeroRecord.

/**
 * Runs the test scenario of reading data from the source in row format.
 * Source has no records.
 *
 * @throws IOException
 */
@Test
public void testSqlSourceZeroRecord() throws IOException {
    props.setProperty(sqlSourceConfig, "select * from test_sql_table where 1=0");
    sqlSource = new SqlSource(props, jsc, sparkSession, schemaProvider);
    sourceFormatAdapter = new SourceFormatAdapter(sqlSource);
    InputBatch<Dataset<Row>> fetch1AsRows = sourceFormatAdapter.fetchNewDataInRowFormat(Option.empty(), Long.MAX_VALUE);
    assertEquals(0, fetch1AsRows.getBatch().get().count());
}

Also used : Dataset(org.apache.spark.sql.Dataset) SourceFormatAdapter(org.apache.hudi.utilities.deltastreamer.SourceFormatAdapter) Test(org.junit.jupiter.api.Test)

Example 9 with SourceFormatAdapter

use of org.apache.hudi.utilities.deltastreamer.SourceFormatAdapter in project hudi by apache.

the class TestSqlSource method testSqlSourceAvroFormat.

/**
 * Runs the test scenario of reading data from the source in avro format.
 *
 * @throws IOException
 */
@Test
public void testSqlSourceAvroFormat() throws IOException {
    props.setProperty(sqlSourceConfig, "select * from test_sql_table");
    sqlSource = new SqlSource(props, jsc, sparkSession, schemaProvider);
    sourceFormatAdapter = new SourceFormatAdapter(sqlSource);
    // Test fetching Avro format
    InputBatch<JavaRDD<GenericRecord>> fetch1 = sourceFormatAdapter.fetchNewDataInAvroFormat(Option.empty(), Long.MAX_VALUE);
    // Test Avro to Row format
    Dataset<Row> fetch1Rows = AvroConversionUtils.createDataFrame(JavaRDD.toRDD(fetch1.getBatch().get()), schemaProvider.getSourceSchema().toString(), sparkSession);
    assertEquals(10000, fetch1Rows.count());
}

Also used : Row(org.apache.spark.sql.Row) SourceFormatAdapter(org.apache.hudi.utilities.deltastreamer.SourceFormatAdapter) JavaRDD(org.apache.spark.api.java.JavaRDD) Test(org.junit.jupiter.api.Test)

Example 10 with SourceFormatAdapter

use of org.apache.hudi.utilities.deltastreamer.SourceFormatAdapter in project hudi by apache.

the class TestSqlSource method testSqlSourceMoreRecordsThanSourceLimit.

/**
 * Runs the test scenario of reading data from the source in row format.
 * Source has more records than source limit.
 *
 * @throws IOException
 */
@Test
public void testSqlSourceMoreRecordsThanSourceLimit() throws IOException {
    props.setProperty(sqlSourceConfig, "select * from test_sql_table");
    sqlSource = new SqlSource(props, jsc, sparkSession, schemaProvider);
    sourceFormatAdapter = new SourceFormatAdapter(sqlSource);
    InputBatch<Dataset<Row>> fetch1AsRows = sourceFormatAdapter.fetchNewDataInRowFormat(Option.empty(), 1000);
    assertEquals(10000, fetch1AsRows.getBatch().get().count());
}

Also used : Dataset(org.apache.spark.sql.Dataset) SourceFormatAdapter(org.apache.hudi.utilities.deltastreamer.SourceFormatAdapter) Test(org.junit.jupiter.api.Test)

Aggregations

SourceFormatAdapter (org.apache.hudi.utilities.deltastreamer.SourceFormatAdapter)20 Test (org.junit.jupiter.api.Test)19 JavaRDD (org.apache.spark.api.java.JavaRDD)13 TypedProperties (org.apache.hudi.common.config.TypedProperties)12 HoodieTestDataGenerator (org.apache.hudi.common.testutils.HoodieTestDataGenerator)11 Dataset (org.apache.spark.sql.Dataset)11 Row (org.apache.spark.sql.Row)4 GenericRecord (org.apache.avro.generic.GenericRecord)2 IOException (java.io.IOException)1 ArrayList (java.util.ArrayList)1 Map (java.util.Map)1 UUID (java.util.UUID)1 Stream (java.util.stream.Stream)1 Schema (org.apache.avro.Schema)1 GenericData (org.apache.avro.generic.GenericData)1 FileStatus (org.apache.hadoop.fs.FileStatus)1 LocatedFileStatus (org.apache.hadoop.fs.LocatedFileStatus)1 Path (org.apache.hadoop.fs.Path)1 DebeziumConstants (org.apache.hudi.common.model.debezium.DebeziumConstants)1 Option (org.apache.hudi.common.util.Option)1