
Example 1 with DataSourceWriter

Use of org.apache.spark.sql.sources.v2.writer.DataSourceWriter in project spark-bigquery-connector by GoogleCloudDataproc.

From the class BigQueryDataSourceV2, method createWriter:

/**
 * Returns a DataSourceWriter for the specified parameters. If the table already exists and
 * the SaveMode is Ignore, an Optional.empty() is returned.
 */
@Override
public Optional<DataSourceWriter> createWriter(String writeUUID, StructType schema, SaveMode mode, DataSourceOptions options) {
    Injector injector = createInjector(schema, options.asMap(), new BigQueryDataSourceWriterModule(writeUUID, schema, mode));
    // First, verify whether we need to do anything at all, based on the table's existence
    // and the save mode.
    BigQueryClient bigQueryClient = injector.getInstance(BigQueryClient.class);
    SparkBigQueryConfig config = injector.getInstance(SparkBigQueryConfig.class);
    TableInfo table = bigQueryClient.getTable(config.getTableId());
    if (table != null) {
        // table already exists
        if (mode == SaveMode.Ignore) {
            return Optional.empty();
        }
        if (mode == SaveMode.ErrorIfExists) {
            throw new IllegalArgumentException(String.format("SaveMode is set to ErrorIfExists and table '%s' already exists. Did you want " + "to add data to the table by setting the SaveMode to Append? Example: " + "df.write.format.options.mode(\"append\").save()", BigQueryUtil.friendlyTableName(table.getTableId())));
        }
    } else {
        // table does not exist
        // If the CreateDisposition is CREATE_NEVER and the table does not exist,
        // there is no point in writing the data to GCS in the first place, as the
        // write is going to fail on the BigQuery side.
        boolean createNever = config.getCreateDisposition().map(createDisposition -> createDisposition == JobInfo.CreateDisposition.CREATE_NEVER).orElse(false);
        if (createNever) {
            throw new IllegalArgumentException(String.format("For table %s Create Disposition is CREATE_NEVER and the table does not exists." + " Aborting the insert", BigQueryUtil.friendlyTableName(config.getTableId())));
        }
    }
    DataSourceWriterContext dataSourceWriterContext = null;
    switch (config.getWriteMethod()) {
        case DIRECT:
            dataSourceWriterContext = injector.getInstance(BigQueryDirectDataSourceWriterContext.class);
            break;
        case INDIRECT:
            dataSourceWriterContext = injector.getInstance(BigQueryIndirectDataSourceWriterContext.class);
            break;
    }
    return Optional.of(new BigQueryDataSourceWriter(dataSourceWriterContext));
}
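
For context, a minimal Spark job that exercises this createWriter path might look like the sketch below. The bucket, dataset, and table names are placeholders; the "writeMethod" option selects the DIRECT or INDIRECT branch of the switch above.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class BigQueryWriteExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("bigquery-write-example").getOrCreate();
        // Placeholder input path.
        Dataset<Row> df = spark.read().json("gs://my-bucket/input/");
        // SaveMode.Append avoids the ErrorIfExists failure shown above when the table
        // already exists; "writeMethod" picks the DIRECT or INDIRECT writer context.
        df.write()
            .format("bigquery")
            .option("writeMethod", "direct")
            .mode(SaveMode.Append)
            .save("my_dataset.my_table");
        spark.stop();
    }
}

With the indirect write method, a temporaryGcsBucket option is also required, since the data is first staged on GCS before being loaded into BigQuery.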
Also used : BigQueryDataSourceWriterModule(com.google.cloud.spark.bigquery.v2.context.BigQueryDataSourceWriterModule) WriteSupport(org.apache.spark.sql.sources.v2.WriteSupport) StructType(org.apache.spark.sql.types.StructType) SaveMode(org.apache.spark.sql.SaveMode) ReadSupport(org.apache.spark.sql.sources.v2.ReadSupport) JobInfo(com.google.cloud.bigquery.JobInfo) BigQueryClient(com.google.cloud.bigquery.connector.common.BigQueryClient) BigQueryIndirectDataSourceWriterContext(com.google.cloud.spark.bigquery.v2.context.BigQueryIndirectDataSourceWriterContext) SparkBigQueryConfig(com.google.cloud.spark.bigquery.SparkBigQueryConfig) BigQueryDataSourceReaderModule(com.google.cloud.spark.bigquery.v2.context.BigQueryDataSourceReaderModule) Injector(com.google.inject.Injector) BigQueryDataSourceReaderContext(com.google.cloud.spark.bigquery.v2.context.BigQueryDataSourceReaderContext) DataSourceWriter(org.apache.spark.sql.sources.v2.writer.DataSourceWriter) DataSourceWriterContext(com.google.cloud.spark.bigquery.v2.context.DataSourceWriterContext) Optional(java.util.Optional) TableInfo(com.google.cloud.bigquery.TableInfo) BigQueryUtil(com.google.cloud.bigquery.connector.common.BigQueryUtil) BigQueryDirectDataSourceWriterContext(com.google.cloud.spark.bigquery.v2.context.BigQueryDirectDataSourceWriterContext) DataSourceOptions(org.apache.spark.sql.sources.v2.DataSourceOptions) DataSourceV2(org.apache.spark.sql.sources.v2.DataSourceV2) DataSourceReader(org.apache.spark.sql.sources.v2.reader.DataSourceReader)

Example 2 with DataSourceWriter

Use of org.apache.spark.sql.sources.v2.writer.DataSourceWriter in project iceberg by apache.

From the class IcebergSource, method createWriter:

@Override
public Optional<DataSourceWriter> createWriter(String jobId, StructType dsStruct, SaveMode mode, DataSourceOptions options) {
    // Only Append and Overwrite are supported through this path.
    Preconditions.checkArgument(mode == SaveMode.Append || mode == SaveMode.Overwrite, "Save mode %s is not supported", mode);
    Configuration conf = new Configuration(lazyBaseConf());
    Table table = getTableAndResolveHadoopConfiguration(options, conf);
    SparkWriteConf writeConf = new SparkWriteConf(lazySparkSession(), table, options.asMap());
    // Reject timestamp-without-timezone columns unless the user has explicitly opted in.
    Preconditions.checkArgument(writeConf.handleTimestampWithoutZone() || !SparkUtil.hasTimestampWithoutZone(table.schema()), SparkUtil.TIMESTAMP_WITHOUT_TIMEZONE_ERROR);
    // Convert the Spark schema to an Iceberg schema and validate it against the table.
    Schema writeSchema = SparkSchemaUtil.convert(table.schema(), dsStruct);
    TypeUtil.validateWriteSchema(table.schema(), writeSchema, writeConf.checkNullability(), writeConf.checkOrdering());
    SparkUtil.validatePartitionTransforms(table.spec());
    // Track the write with the Spark application id and an optional write-audit-publish (WAP) id.
    String appId = lazySparkSession().sparkContext().applicationId();
    String wapId = writeConf.wapId();
    // SaveMode.Overwrite replaces the partitions touched by this write.
    boolean replacePartitions = mode == SaveMode.Overwrite;
    return Optional.of(new Writer(lazySparkSession(), table, writeConf, replacePartitions, appId, wapId, writeSchema, dsStruct));
}
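
Similarly, a minimal sketch of a Spark job that reaches this Iceberg createWriter; the input path and table location are placeholders. Only Append and Overwrite pass the Preconditions check, and Overwrite additionally sets replacePartitions to true.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class IcebergWriteExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("iceberg-write-example").getOrCreate();
        // Placeholder input path.
        Dataset<Row> df = spark.read().parquet("hdfs://path/to/input");
        // Append passes the save-mode check above; Overwrite would instead replace
        // the partitions touched by this write.
        df.write()
            .format("iceberg")
            .mode(SaveMode.Append)
            .save("hdfs://path/to/warehouse/db/table");
        spark.stop();
    }
}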
Also used : Table(org.apache.iceberg.Table) Configuration(org.apache.hadoop.conf.Configuration) Schema(org.apache.iceberg.Schema) SparkWriteConf(org.apache.iceberg.spark.SparkWriteConf) DataSourceWriter(org.apache.spark.sql.sources.v2.writer.DataSourceWriter) StreamWriter(org.apache.spark.sql.sources.v2.writer.streaming.StreamWriter)

Aggregations

DataSourceWriter (org.apache.spark.sql.sources.v2.writer.DataSourceWriter) 2
JobInfo (com.google.cloud.bigquery.JobInfo) 1
TableInfo (com.google.cloud.bigquery.TableInfo) 1
BigQueryClient (com.google.cloud.bigquery.connector.common.BigQueryClient) 1
BigQueryUtil (com.google.cloud.bigquery.connector.common.BigQueryUtil) 1
SparkBigQueryConfig (com.google.cloud.spark.bigquery.SparkBigQueryConfig) 1
BigQueryDataSourceReaderContext (com.google.cloud.spark.bigquery.v2.context.BigQueryDataSourceReaderContext) 1
BigQueryDataSourceReaderModule (com.google.cloud.spark.bigquery.v2.context.BigQueryDataSourceReaderModule) 1
BigQueryDataSourceWriterModule (com.google.cloud.spark.bigquery.v2.context.BigQueryDataSourceWriterModule) 1
BigQueryDirectDataSourceWriterContext (com.google.cloud.spark.bigquery.v2.context.BigQueryDirectDataSourceWriterContext) 1
BigQueryIndirectDataSourceWriterContext (com.google.cloud.spark.bigquery.v2.context.BigQueryIndirectDataSourceWriterContext) 1
DataSourceWriterContext (com.google.cloud.spark.bigquery.v2.context.DataSourceWriterContext) 1
Injector (com.google.inject.Injector) 1
Optional (java.util.Optional) 1
Configuration (org.apache.hadoop.conf.Configuration) 1
Schema (org.apache.iceberg.Schema) 1
Table (org.apache.iceberg.Table) 1
SparkWriteConf (org.apache.iceberg.spark.SparkWriteConf) 1
SaveMode (org.apache.spark.sql.SaveMode) 1
DataSourceOptions (org.apache.spark.sql.sources.v2.DataSourceOptions) 1