
Example 11 with DataSourceOptions

use of org.apache.spark.sql.sources.v2.DataSourceOptions in project spark-bigquery-connector by GoogleCloudDataproc.

the class BigQueryDataSourceV2 method createWriter.

/**
 * Returns a DataSourceWriter for the specified parameters. If the table already exists and
 * the SaveMode is "Ignore", Optional.empty() is returned.
 */
@Override
public Optional<DataSourceWriter> createWriter(String writeUUID, StructType schema, SaveMode mode, DataSourceOptions options) {
    Injector injector = createInjector(schema, options.asMap(), new BigQueryDataSourceWriterModule(writeUUID, schema, mode));
    // First, verify whether we need to do anything at all, based on the table's existence and
    // the save mode.
    BigQueryClient bigQueryClient = injector.getInstance(BigQueryClient.class);
    SparkBigQueryConfig config = injector.getInstance(SparkBigQueryConfig.class);
    TableInfo table = bigQueryClient.getTable(config.getTableId());
    if (table != null) {
        // table already exists
        if (mode == SaveMode.Ignore) {
            return Optional.empty();
        }
        if (mode == SaveMode.ErrorIfExists) {
            throw new IllegalArgumentException(String.format("SaveMode is set to ErrorIfExists and table '%s' already exists. Did you want " + "to add data to the table by setting the SaveMode to Append? Example: " + "df.write.format.options.mode(\"append\").save()", BigQueryUtil.friendlyTableName(table.getTableId())));
        }
    } else {
        // table does not exist
        // If the CreateDisposition is CREATE_NEVER, and the table does not exist,
        // there's no point in writing the data to GCS in the first place, as it is going
        // to fail on the BigQuery side.
        boolean createNever = config.getCreateDisposition().map(createDisposition -> createDisposition == JobInfo.CreateDisposition.CREATE_NEVER).orElse(false);
        if (createNever) {
            throw new IllegalArgumentException(String.format("For table %s Create Disposition is CREATE_NEVER and the table does not exists." + " Aborting the insert", BigQueryUtil.friendlyTableName(config.getTableId())));
        }
    }
    DataSourceWriterContext dataSourceWriterContext = null;
    switch(config.getWriteMethod()) {
        case DIRECT:
            dataSourceWriterContext = injector.getInstance(BigQueryDirectDataSourceWriterContext.class);
            break;
        case INDIRECT:
            dataSourceWriterContext = injector.getInstance(BigQueryIndirectDataSourceWriterContext.class);
            break;
    }
    return Optional.of(new BigQueryDataSourceWriter(dataSourceWriterContext));
}
Also used : BigQueryDataSourceWriterModule(com.google.cloud.spark.bigquery.v2.context.BigQueryDataSourceWriterModule) WriteSupport(org.apache.spark.sql.sources.v2.WriteSupport) StructType(org.apache.spark.sql.types.StructType) SaveMode(org.apache.spark.sql.SaveMode) ReadSupport(org.apache.spark.sql.sources.v2.ReadSupport) JobInfo(com.google.cloud.bigquery.JobInfo) BigQueryClient(com.google.cloud.bigquery.connector.common.BigQueryClient) BigQueryIndirectDataSourceWriterContext(com.google.cloud.spark.bigquery.v2.context.BigQueryIndirectDataSourceWriterContext) SparkBigQueryConfig(com.google.cloud.spark.bigquery.SparkBigQueryConfig) BigQueryDataSourceReaderModule(com.google.cloud.spark.bigquery.v2.context.BigQueryDataSourceReaderModule) Injector(com.google.inject.Injector) BigQueryDataSourceReaderContext(com.google.cloud.spark.bigquery.v2.context.BigQueryDataSourceReaderContext) DataSourceWriter(org.apache.spark.sql.sources.v2.writer.DataSourceWriter) DataSourceWriterContext(com.google.cloud.spark.bigquery.v2.context.DataSourceWriterContext) Optional(java.util.Optional) TableInfo(com.google.cloud.bigquery.TableInfo) BigQueryUtil(com.google.cloud.bigquery.connector.common.BigQueryUtil) BigQueryDirectDataSourceWriterContext(com.google.cloud.spark.bigquery.v2.context.BigQueryDirectDataSourceWriterContext) DataSourceOptions(org.apache.spark.sql.sources.v2.DataSourceOptions) DataSourceV2(org.apache.spark.sql.sources.v2.DataSourceV2) DataSourceReader(org.apache.spark.sql.sources.v2.reader.DataSourceReader)
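
For orientation, here is a minimal, hypothetical sketch of how a Spark job might hit the SaveMode branches handled by createWriter above. The format alias ("bigquery"), the "table" option name, and the bucket, table, and dataset names are illustrative assumptions, not taken from this example.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class SaveModeSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("save-mode-sketch").getOrCreate();
        // Hypothetical input; any Dataset<Row> matching the target table's schema would do.
        Dataset<Row> df = spark.read().json("gs://example-bucket/input.json");
        // SaveMode.Ignore: if the table already exists, createWriter(...) returns Optional.empty()
        // and the write silently becomes a no-op.
        df.write().format("bigquery")
                .option("table", "my_project.my_dataset.my_table") // assumed option name and table
                .mode(SaveMode.Ignore)
                .save();
        // SaveMode.ErrorIfExists on an existing table triggers the IllegalArgumentException above;
        // SaveMode.Append is the alternative suggested by that error message.
        df.write().format("bigquery")
                .option("table", "my_project.my_dataset.my_table")
                .mode(SaveMode.Append)
                .save();
    }
}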

Example 12 with DataSourceOptions

use of org.apache.spark.sql.sources.v2.DataSourceOptions in project hudi by apache.

the class TestHoodieDataSourceInternalWriter method testDataSourceWriterInternal.

private void testDataSourceWriterInternal(Map<String, String> extraMetadata, Map<String, String> expectedExtraMetadata, boolean populateMetaFields) throws Exception {
    // init config and table
    HoodieWriteConfig cfg = getWriteConfig(populateMetaFields);
    String instantTime = "001";
    // init writer
    HoodieDataSourceInternalWriter dataSourceInternalWriter = new HoodieDataSourceInternalWriter(instantTime, cfg, STRUCT_TYPE, sqlContext.sparkSession(), hadoopConf, new DataSourceOptions(extraMetadata), populateMetaFields, false);
    DataWriter<InternalRow> writer = dataSourceInternalWriter.createWriterFactory().createDataWriter(0, RANDOM.nextLong(), RANDOM.nextLong());
    String[] partitionPaths = HoodieTestDataGenerator.DEFAULT_PARTITION_PATHS;
    List<String> partitionPathsAbs = new ArrayList<>();
    for (String partitionPath : partitionPaths) {
        partitionPathsAbs.add(basePath + "/" + partitionPath + "/*");
    }
    int size = 10 + RANDOM.nextInt(1000);
    int batches = 2;
    Dataset<Row> totalInputRows = null;
    for (int j = 0; j < batches; j++) {
        String partitionPath = HoodieTestDataGenerator.DEFAULT_PARTITION_PATHS[j % 3];
        Dataset<Row> inputRows = getRandomRows(sqlContext, size, partitionPath, false);
        writeRows(inputRows, writer);
        if (totalInputRows == null) {
            totalInputRows = inputRows;
        } else {
            totalInputRows = totalInputRows.union(inputRows);
        }
    }
    HoodieWriterCommitMessage commitMetadata = (HoodieWriterCommitMessage) writer.commit();
    List<HoodieWriterCommitMessage> commitMessages = new ArrayList<>();
    commitMessages.add(commitMetadata);
    dataSourceInternalWriter.commit(commitMessages.toArray(new HoodieWriterCommitMessage[0]));
    metaClient.reloadActiveTimeline();
    Dataset<Row> result = HoodieClientTestUtils.read(jsc, basePath, sqlContext, metaClient.getFs(), partitionPathsAbs.toArray(new String[0]));
    // verify output
    assertOutput(totalInputRows, result, instantTime, Option.empty(), populateMetaFields);
    assertWriteStatuses(commitMessages.get(0).getWriteStatuses(), batches, size, Option.empty(), Option.empty());
    // verify extra metadata
    Option<HoodieCommitMetadata> commitMetadataOption = HoodieClientTestUtils.getCommitMetadataForLatestInstant(metaClient);
    assertTrue(commitMetadataOption.isPresent());
    Map<String, String> actualExtraMetadata = new HashMap<>();
    commitMetadataOption.get().getExtraMetadata().entrySet().stream().filter(entry -> !entry.getKey().equals(HoodieCommitMetadata.SCHEMA_KEY)).forEach(entry -> actualExtraMetadata.put(entry.getKey(), entry.getValue()));
    assertEquals(actualExtraMetadata, expectedExtraMetadata);
}
Also used : InternalRow(org.apache.spark.sql.catalyst.InternalRow) Arrays(java.util.Arrays) Dataset(org.apache.spark.sql.Dataset) HoodieTestDataGenerator(org.apache.hudi.common.testutils.HoodieTestDataGenerator) Option(org.apache.hudi.common.util.Option) HashMap(java.util.HashMap) Disabled(org.junit.jupiter.api.Disabled) DataSourceWriteOptions(org.apache.hudi.DataSourceWriteOptions) DataWriter(org.apache.spark.sql.sources.v2.writer.DataWriter) ArrayList(java.util.ArrayList) Map(java.util.Map) DataSourceOptions(org.apache.spark.sql.sources.v2.DataSourceOptions) Assertions.assertEquals(org.junit.jupiter.api.Assertions.assertEquals) MethodSource(org.junit.jupiter.params.provider.MethodSource) ENCODER(org.apache.hudi.testutils.SparkDatasetTestUtils.ENCODER) HoodieWriteConfig(org.apache.hudi.config.HoodieWriteConfig) HoodieCommitMetadata(org.apache.hudi.common.model.HoodieCommitMetadata) Row(org.apache.spark.sql.Row) STRUCT_TYPE(org.apache.hudi.testutils.SparkDatasetTestUtils.STRUCT_TYPE) Arguments(org.junit.jupiter.params.provider.Arguments) Test(org.junit.jupiter.api.Test) ParameterizedTest(org.junit.jupiter.params.ParameterizedTest) List(java.util.List) Stream(java.util.stream.Stream) SparkDatasetTestUtils.toInternalRows(org.apache.hudi.testutils.SparkDatasetTestUtils.toInternalRows) Assertions.assertTrue(org.junit.jupiter.api.Assertions.assertTrue) SparkDatasetTestUtils.getRandomRows(org.apache.hudi.testutils.SparkDatasetTestUtils.getRandomRows) HoodieClientTestUtils(org.apache.hudi.testutils.HoodieClientTestUtils) Collections(java.util.Collections)
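
As a side note, a small sketch of how the DataSourceOptions object constructed above behaves: it wraps a string-to-string map with case-insensitive keys and exposes it back via asMap() and Optional-returning getters. The metadata key and value below are made up for illustration.

import java.util.HashMap;
import java.util.Map;
import org.apache.spark.sql.sources.v2.DataSourceOptions;

public class DataSourceOptionsSketch {
    public static void main(String[] args) {
        // Hypothetical extra metadata, analogous to the extraMetadata map passed into
        // HoodieDataSourceInternalWriter in the test above.
        Map<String, String> extraMetadata = new HashMap<>();
        extraMetadata.put("checkpoint", "stream-offset-42"); // assumed key and value

        DataSourceOptions options = new DataSourceOptions(extraMetadata);

        // Keys are treated case-insensitively and lookups return java.util.Optional.
        System.out.println(options.get("CHECKPOINT").orElse("<missing>"));

        // asMap() hands the options back as a plain map (keys lower-cased), which is how a
        // writer implementation can read the metadata out again.
        Map<String, String> roundTripped = options.asMap();
        System.out.println(roundTripped);
    }
}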

Example 13 with DataSourceOptions

use of org.apache.spark.sql.sources.v2.DataSourceOptions in project hudi by apache.

the class TestHoodieDataSourceInternalWriter method testAbort.

/**
 * Tests that DataSourceWriter.abort() discards the writes of the aborted batch:
 * write and commit batch1, then write and abort batch2; a read of the entire dataset
 * should return only the records from batch1.
 */
@ParameterizedTest
@MethodSource("bulkInsertTypeParams")
public void testAbort(boolean populateMetaFields) throws Exception {
    // init config and table
    HoodieWriteConfig cfg = getWriteConfig(populateMetaFields);
    String instantTime0 = "00" + 0;
    // init writer
    HoodieDataSourceInternalWriter dataSourceInternalWriter = new HoodieDataSourceInternalWriter(instantTime0, cfg, STRUCT_TYPE, sqlContext.sparkSession(), hadoopConf, new DataSourceOptions(Collections.EMPTY_MAP), populateMetaFields, false);
    DataWriter<InternalRow> writer = dataSourceInternalWriter.createWriterFactory().createDataWriter(0, RANDOM.nextLong(), RANDOM.nextLong());
    List<String> partitionPaths = Arrays.asList(HoodieTestDataGenerator.DEFAULT_PARTITION_PATHS);
    List<String> partitionPathsAbs = new ArrayList<>();
    for (String partitionPath : partitionPaths) {
        partitionPathsAbs.add(basePath + "/" + partitionPath + "/*");
    }
    int size = 10 + RANDOM.nextInt(100);
    int batches = 1;
    Dataset<Row> totalInputRows = null;
    for (int j = 0; j < batches; j++) {
        String partitionPath = HoodieTestDataGenerator.DEFAULT_PARTITION_PATHS[j % 3];
        Dataset<Row> inputRows = getRandomRows(sqlContext, size, partitionPath, false);
        writeRows(inputRows, writer);
        if (totalInputRows == null) {
            totalInputRows = inputRows;
        } else {
            totalInputRows = totalInputRows.union(inputRows);
        }
    }
    HoodieWriterCommitMessage commitMetadata = (HoodieWriterCommitMessage) writer.commit();
    List<HoodieWriterCommitMessage> commitMessages = new ArrayList<>();
    commitMessages.add(commitMetadata);
    // commit 1st batch
    dataSourceInternalWriter.commit(commitMessages.toArray(new HoodieWriterCommitMessage[0]));
    metaClient.reloadActiveTimeline();
    Dataset<Row> result = HoodieClientTestUtils.read(jsc, basePath, sqlContext, metaClient.getFs(), partitionPathsAbs.toArray(new String[0]));
    // verify rows
    assertOutput(totalInputRows, result, instantTime0, Option.empty(), populateMetaFields);
    assertWriteStatuses(commitMessages.get(0).getWriteStatuses(), batches, size, Option.empty(), Option.empty());
    // 2nd batch; abort at the end
    String instantTime1 = "00" + 1;
    dataSourceInternalWriter = new HoodieDataSourceInternalWriter(instantTime1, cfg, STRUCT_TYPE, sqlContext.sparkSession(), hadoopConf, new DataSourceOptions(Collections.EMPTY_MAP), populateMetaFields, false);
    writer = dataSourceInternalWriter.createWriterFactory().createDataWriter(1, RANDOM.nextLong(), RANDOM.nextLong());
    for (int j = 0; j < batches; j++) {
        String partitionPath = HoodieTestDataGenerator.DEFAULT_PARTITION_PATHS[j % 3];
        Dataset<Row> inputRows = getRandomRows(sqlContext, size, partitionPath, false);
        writeRows(inputRows, writer);
    }
    commitMetadata = (HoodieWriterCommitMessage) writer.commit();
    commitMessages = new ArrayList<>();
    commitMessages.add(commitMetadata);
    // abort 2nd batch
    dataSourceInternalWriter.abort(commitMessages.toArray(new HoodieWriterCommitMessage[0]));
    metaClient.reloadActiveTimeline();
    result = HoodieClientTestUtils.read(jsc, basePath, sqlContext, metaClient.getFs(), partitionPathsAbs.toArray(new String[0]));
    // verify rows
    // only rows from first batch should be present
    assertOutput(totalInputRows, result, instantTime0, Option.empty(), populateMetaFields);
}
Also used : ArrayList(java.util.ArrayList) HoodieWriteConfig(org.apache.hudi.config.HoodieWriteConfig) DataSourceOptions(org.apache.spark.sql.sources.v2.DataSourceOptions) InternalRow(org.apache.spark.sql.catalyst.InternalRow) Row(org.apache.spark.sql.Row) ParameterizedTest(org.junit.jupiter.params.ParameterizedTest) MethodSource(org.junit.jupiter.params.provider.MethodSource)
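
To clarify the commit-versus-abort flow exercised by testAbort, the following is a minimal sketch of the generic Spark DataSourceWriter contract that the Hudi writer implements. The writeBatch and finishJob helpers are hypothetical stand-ins for the test's writeRows and writer setup, not Hudi APIs.

import java.util.List;
import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.sources.v2.writer.DataSourceWriter;
import org.apache.spark.sql.sources.v2.writer.DataWriter;
import org.apache.spark.sql.sources.v2.writer.WriterCommitMessage;

public class CommitOrAbortSketch {

    // Stand-in for the test's writeRows(...): writes one batch through a task-level DataWriter
    // and returns its task commit message.
    static WriterCommitMessage writeBatch(DataSourceWriter dsWriter, List<InternalRow> rows) throws Exception {
        DataWriter<InternalRow> taskWriter = dsWriter.createWriterFactory().createDataWriter(0, 0L, 0L);
        for (InternalRow row : rows) {
            taskWriter.write(row);
        }
        return taskWriter.commit();
    }

    // Driver-side decision: commit makes the batch visible to readers (batch1 in the test),
    // while abort rolls it back so readers never see it (batch2 in the test).
    static void finishJob(DataSourceWriter dsWriter, List<WriterCommitMessage> messages, boolean succeed) {
        WriterCommitMessage[] msgArray = messages.toArray(new WriterCommitMessage[0]);
        if (succeed) {
            dsWriter.commit(msgArray);
        } else {
            dsWriter.abort(msgArray);
        }
    }
}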

Example 14 with DataSourceOptions

use of org.apache.spark.sql.sources.v2.DataSourceOptions in project hudi by apache.

the class TestHoodieDataSourceInternalWriter method testMultipleDataSourceWrites.

@ParameterizedTest
@MethodSource("bulkInsertTypeParams")
public void testMultipleDataSourceWrites(boolean populateMetaFields) throws Exception {
    // init config and table
    HoodieWriteConfig cfg = getWriteConfig(populateMetaFields);
    int partitionCounter = 0;
    // execute N rounds
    for (int i = 0; i < 2; i++) {
        String instantTime = "00" + i;
        // init writer
        HoodieDataSourceInternalWriter dataSourceInternalWriter = new HoodieDataSourceInternalWriter(instantTime, cfg, STRUCT_TYPE, sqlContext.sparkSession(), hadoopConf, new DataSourceOptions(Collections.EMPTY_MAP), populateMetaFields, false);
        List<HoodieWriterCommitMessage> commitMessages = new ArrayList<>();
        Dataset<Row> totalInputRows = null;
        DataWriter<InternalRow> writer = dataSourceInternalWriter.createWriterFactory().createDataWriter(partitionCounter++, RANDOM.nextLong(), RANDOM.nextLong());
        int size = 10 + RANDOM.nextInt(1000);
        // one batch per partition
        int batches = 2;
        for (int j = 0; j < batches; j++) {
            String partitionPath = HoodieTestDataGenerator.DEFAULT_PARTITION_PATHS[j % 3];
            Dataset<Row> inputRows = getRandomRows(sqlContext, size, partitionPath, false);
            writeRows(inputRows, writer);
            if (totalInputRows == null) {
                totalInputRows = inputRows;
            } else {
                totalInputRows = totalInputRows.union(inputRows);
            }
        }
        HoodieWriterCommitMessage commitMetadata = (HoodieWriterCommitMessage) writer.commit();
        commitMessages.add(commitMetadata);
        dataSourceInternalWriter.commit(commitMessages.toArray(new HoodieWriterCommitMessage[0]));
        metaClient.reloadActiveTimeline();
        Dataset<Row> result = HoodieClientTestUtils.readCommit(basePath, sqlContext, metaClient.getCommitTimeline(), instantTime, populateMetaFields);
        // verify output
        assertOutput(totalInputRows, result, instantTime, Option.empty(), populateMetaFields);
        assertWriteStatuses(commitMessages.get(0).getWriteStatuses(), batches, size, Option.empty(), Option.empty());
    }
}
Also used : ArrayList(java.util.ArrayList) HoodieWriteConfig(org.apache.hudi.config.HoodieWriteConfig) DataSourceOptions(org.apache.spark.sql.sources.v2.DataSourceOptions) InternalRow(org.apache.spark.sql.catalyst.InternalRow) Row(org.apache.spark.sql.Row) ParameterizedTest(org.junit.jupiter.params.ParameterizedTest) MethodSource(org.junit.jupiter.params.provider.MethodSource)

Example 15 with DataSourceOptions

use of org.apache.spark.sql.sources.v2.DataSourceOptions in project hudi by apache.

the class TestHoodieDataSourceInternalWriter method testLargeWrites.

// Takes up a lot of running time in CI.
@Disabled
@ParameterizedTest
@MethodSource("bulkInsertTypeParams")
public void testLargeWrites(boolean populateMetaFields) throws Exception {
    // init config and table
    HoodieWriteConfig cfg = getWriteConfig(populateMetaFields);
    int partitionCounter = 0;
    // execute N rounds
    for (int i = 0; i < 3; i++) {
        String instantTime = "00" + i;
        // init writer
        HoodieDataSourceInternalWriter dataSourceInternalWriter = new HoodieDataSourceInternalWriter(instantTime, cfg, STRUCT_TYPE, sqlContext.sparkSession(), hadoopConf, new DataSourceOptions(Collections.EMPTY_MAP), populateMetaFields, false);
        List<HoodieWriterCommitMessage> commitMessages = new ArrayList<>();
        Dataset<Row> totalInputRows = null;
        DataWriter<InternalRow> writer = dataSourceInternalWriter.createWriterFactory().createDataWriter(partitionCounter++, RANDOM.nextLong(), RANDOM.nextLong());
        int size = 10000 + RANDOM.nextInt(10000);
        // one batch per partition
        int batches = 3;
        for (int j = 0; j < batches; j++) {
            String partitionPath = HoodieTestDataGenerator.DEFAULT_PARTITION_PATHS[j % 3];
            Dataset<Row> inputRows = getRandomRows(sqlContext, size, partitionPath, false);
            writeRows(inputRows, writer);
            if (totalInputRows == null) {
                totalInputRows = inputRows;
            } else {
                totalInputRows = totalInputRows.union(inputRows);
            }
        }
        HoodieWriterCommitMessage commitMetadata = (HoodieWriterCommitMessage) writer.commit();
        commitMessages.add(commitMetadata);
        dataSourceInternalWriter.commit(commitMessages.toArray(new HoodieWriterCommitMessage[0]));
        metaClient.reloadActiveTimeline();
        Dataset<Row> result = HoodieClientTestUtils.readCommit(basePath, sqlContext, metaClient.getCommitTimeline(), instantTime, populateMetaFields);
        // verify output
        assertOutput(totalInputRows, result, instantTime, Option.empty(), populateMetaFields);
        assertWriteStatuses(commitMessages.get(0).getWriteStatuses(), batches, size, Option.empty(), Option.empty());
    }
}
Also used : ArrayList(java.util.ArrayList) HoodieWriteConfig(org.apache.hudi.config.HoodieWriteConfig) DataSourceOptions(org.apache.spark.sql.sources.v2.DataSourceOptions) InternalRow(org.apache.spark.sql.catalyst.InternalRow) Row(org.apache.spark.sql.Row) ParameterizedTest(org.junit.jupiter.params.ParameterizedTest) MethodSource(org.junit.jupiter.params.provider.MethodSource) Disabled(org.junit.jupiter.api.Disabled)

Aggregations

DataSourceOptions (org.apache.spark.sql.sources.v2.DataSourceOptions) 38
Test (org.junit.Test) 33
HashMap (java.util.HashMap) 13
Configuration (org.apache.hadoop.conf.Configuration) 13
SQLConf (org.apache.spark.sql.internal.SQLConf) 10
ArrayList (java.util.ArrayList) 4
HoodieWriteConfig (org.apache.hudi.config.HoodieWriteConfig) 4
Row (org.apache.spark.sql.Row) 4
InternalRow (org.apache.spark.sql.catalyst.InternalRow) 4
ParameterizedTest (org.junit.jupiter.params.ParameterizedTest) 4
MethodSource (org.junit.jupiter.params.provider.MethodSource) 4
List (java.util.List) 3
DataSourceReader (org.apache.spark.sql.sources.v2.reader.DataSourceReader) 3
Layout (io.tiledb.java.api.Layout) 2
ByteArrayOutputStream (java.io.ByteArrayOutputStream) 2
File (java.io.File) 2
ObjectOutputStream (java.io.ObjectOutputStream) 2
URI (java.net.URI) 2
DataFile (org.apache.iceberg.DataFile) 2
Table (org.apache.iceberg.Table) 2