
Example 1 with ReadSessionResponse

Use of com.google.cloud.bigquery.connector.common.ReadSessionResponse in project spark-bigquery-connector by GoogleCloudDataproc.

From the class BigQueryDataSourceReaderContext, method createConverter.

private ReadRowsResponseToInternalRowIteratorConverter createConverter(ImmutableList<String> selectedFields, ReadSessionResponse readSessionResponse, Optional<StructType> userProvidedSchema) {
    DataFormat format = readSessionCreatorConfig.getReadDataFormat();
    if (format == DataFormat.AVRO) {
        Schema schema = SchemaConverters.getSchemaWithPseudoColumns(readSessionResponse.getReadTableInfo());
        if (selectedFields.isEmpty()) {
            // means select *
            selectedFields = schema.getFields().stream().map(Field::getName).collect(ImmutableList.toImmutableList());
        } else {
            Set<String> requiredColumnSet = ImmutableSet.copyOf(selectedFields);
            schema = Schema.of(schema.getFields().stream().filter(field -> requiredColumnSet.contains(field.getName())).collect(Collectors.toList()));
        }
        return ReadRowsResponseToInternalRowIteratorConverter.avro(schema, selectedFields, readSessionResponse.getReadSession().getAvroSchema().getSchema(), userProvidedSchema);
    }
    // Only Avro is handled here; any other read format is rejected.
    throw new IllegalArgumentException("No known converter for " + format);
}
Also used : ReadRowsResponseToInternalRowIteratorConverter(com.google.cloud.spark.bigquery.ReadRowsResponseToInternalRowIteratorConverter) IntStream(java.util.stream.IntStream) Iterables(com.google.common.collect.Iterables) InternalRow(org.apache.spark.sql.catalyst.InternalRow) TableId(com.google.cloud.bigquery.TableId) LoggerFactory(org.slf4j.LoggerFactory) ArrayList(java.util.ArrayList) LinkedHashMap(java.util.LinkedHashMap) OptionalLong(java.util.OptionalLong) ImmutableList(com.google.common.collect.ImmutableList) Schema(com.google.cloud.bigquery.Schema) Map(java.util.Map) ReadSessionResponse(com.google.cloud.bigquery.connector.common.ReadSessionResponse) StructField(org.apache.spark.sql.types.StructField) StructType(org.apache.spark.sql.types.StructType) Field(com.google.cloud.bigquery.Field) TableDefinition(com.google.cloud.bigquery.TableDefinition) ReadSessionCreator(com.google.cloud.bigquery.connector.common.ReadSessionCreator) JavaConversions(scala.collection.JavaConversions) ReadStream(com.google.cloud.bigquery.storage.v1.ReadStream) ImmutableSet(com.google.common.collect.ImmutableSet) Logger(org.slf4j.Logger) ReadSessionCreatorConfig(com.google.cloud.bigquery.connector.common.ReadSessionCreatorConfig) ReadSession(com.google.cloud.bigquery.storage.v1.ReadSession) BigQueryClient(com.google.cloud.bigquery.connector.common.BigQueryClient) Set(java.util.Set) SchemaConverters(com.google.cloud.spark.bigquery.SchemaConverters) Streams(com.google.common.collect.Streams) Collectors(java.util.stream.Collectors) DataFormat(com.google.cloud.bigquery.storage.v1.DataFormat) List(java.util.List) Stream(java.util.stream.Stream) ColumnarBatch(org.apache.spark.sql.vectorized.ColumnarBatch) BigQueryClientFactory(com.google.cloud.bigquery.connector.common.BigQueryClientFactory) SparkFilterUtils(com.google.cloud.spark.bigquery.SparkFilterUtils) Optional(java.util.Optional) Filter(org.apache.spark.sql.sources.Filter) TableInfo(com.google.cloud.bigquery.TableInfo) BigQueryUtil(com.google.cloud.bigquery.connector.common.BigQueryUtil) BigQueryTracerFactory(com.google.cloud.bigquery.connector.common.BigQueryTracerFactory)
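
The column-pruning step in createConverter can be tried without any BigQuery dependency. The following is a minimal sketch, assuming a schema can be represented as a plain list of field names; the class FieldSelectionSketch and the method pruneSchema are hypothetical names used only for illustration and are not part of the connector.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical illustration class; not part of spark-bigquery-connector.
final class FieldSelectionSketch {

    // Returns the field names of the schema that will actually be read.
    // An empty selection is treated as "select *" and keeps every field;
    // otherwise only the selected fields are kept, in schema order.
    static List<String> pruneSchema(List<String> schemaFields, List<String> selectedFields) {
        if (selectedFields.isEmpty()) {
            return new ArrayList<>(schemaFields);
        }
        Set<String> required = new HashSet<>(selectedFields);
        return schemaFields.stream()
                .filter(required::contains)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> schema = Arrays.asList("id", "name", "ts");
        // Empty selection: all three fields are kept.
        System.out.println(pruneSchema(schema, Collections.emptyList()));
        // Explicit selection: the result follows schema order, not request order.
        System.out.println(pruneSchema(schema, Arrays.asList("ts", "id")));
    }
}
```

Note that the pruned result follows the order of the schema rather than the order of the request, which mirrors how the original filters schema.getFields() against a set.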

Example 2 with ReadSessionResponse

Use of com.google.cloud.bigquery.connector.common.ReadSessionResponse in project spark-bigquery-connector by GoogleCloudDataproc.

From the class BigQueryDataSourceReaderContext, method planBatchInputPartitionContexts.

public Stream<InputPartitionContext<ColumnarBatch>> planBatchInputPartitionContexts() {
    if (!enableBatchRead()) {
        throw new IllegalStateException("Batch reads are not enabled");
    }
    ImmutableList<String> selectedFields = schema.map(requiredSchema -> ImmutableList.copyOf(requiredSchema.fieldNames())).orElse(ImmutableList.copyOf(fields.keySet()));
    Optional<String> filter = getCombinedFilter();
    ReadSessionResponse readSessionResponse = readSessionCreator.create(tableId, selectedFields, filter);
    ReadSession readSession = readSessionResponse.getReadSession();
    logger.info("Created read session for {}: {} for application id: {}", tableId.toString(), readSession.getName(), applicationId);
    if (selectedFields.isEmpty()) {
        // means select *
        Schema tableSchema = SchemaConverters.getSchemaWithPseudoColumns(readSessionResponse.getReadTableInfo());
        selectedFields = tableSchema.getFields().stream().map(Field::getName).collect(ImmutableList.toImmutableList());
    }
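    // selectedFields may have been reassigned above, so copy it into an
    // effectively final variable that the lambda below can capture.
    ImmutableList<String> partitionSelectedFields = selectedFields;
    // Group the session's read streams into chunks of streamsPerPartition and
    // create one Arrow input partition per chunk.
    return Streams.stream(Iterables.partition(readSession.getStreamsList(), readSessionCreatorConfig.streamsPerPartition()))
            .map(streams -> new ArrowInputPartitionContext(
                    bigQueryReadClientFactory,
                    bigQueryTracerFactory,
                    streams.stream().map(ReadStream::getName).collect(Collectors.toCollection(ArrayList::new)),
                    readSessionCreatorConfig.toReadRowsHelperOptions(),
                    partitionSelectedFields,
                    readSessionResponse,
                    userProvidedSchema));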
}
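
The batch planner groups read streams with Guava's Iterables.partition, which yields consecutive sublists of a fixed size, the last of which may be smaller. The grouping itself can be sketched with the standard library only. This is an illustrative stand-in, not the connector's code: the class StreamGroupingSketch, the method groupStreams, and the stream names are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; not part of spark-bigquery-connector.
final class StreamGroupingSketch {

    // Splits the stream names into consecutive groups of at most
    // streamsPerPartition elements; the final group may be smaller.
    static List<List<String>> groupStreams(List<String> streamNames, int streamsPerPartition) {
        if (streamsPerPartition <= 0) {
            throw new IllegalArgumentException("streamsPerPartition must be positive");
        }
        List<List<String>> groups = new ArrayList<>();
        for (int start = 0; start < streamNames.size(); start += streamsPerPartition) {
            int end = Math.min(start + streamsPerPartition, streamNames.size());
            // Copy the view so each group is independent of the input list.
            groups.add(new ArrayList<>(streamNames.subList(start, end)));
        }
        return groups;
    }

    public static void main(String[] args) {
        // Five made-up stream names grouped two per partition.
        List<String> names = Arrays.asList("s0", "s1", "s2", "s3", "s4");
        System.out.println(groupStreams(names, 2));
    }
}
```

With five streams and two streams per partition, this produces three groups, so a planner built this way would create three input partitions.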

Example 3 with ReadSessionResponse

Use of com.google.cloud.bigquery.connector.common.ReadSessionResponse in project spark-bigquery-connector by GoogleCloudDataproc.

From the class BigQueryDataSourceReaderContext, method planInputPartitionContexts.

public Stream<InputPartitionContext<InternalRow>> planInputPartitionContexts() {
    if (isEmptySchema()) {
        // create empty projection
        return createEmptyProjectionPartitions();
    }
    ImmutableList<String> selectedFields = schema.map(requiredSchema -> ImmutableList.copyOf(requiredSchema.fieldNames())).orElse(ImmutableList.copyOf(fields.keySet()));
    Optional<String> filter = getCombinedFilter();
    ReadSessionResponse readSessionResponse = readSessionCreator.create(tableId, selectedFields, filter);
    ReadSession readSession = readSessionResponse.getReadSession();
    logger.info("Created read session for {}: {} for application id: {}", tableId.toString(), readSession.getName(), applicationId);
    return readSession.getStreamsList().stream()
            .map(stream -> new BigQueryInputPartitionContext(
                    bigQueryReadClientFactory,
                    stream.getName(),
                    readSessionCreatorConfig.toReadRowsHelperOptions(),
                    createConverter(selectedFields, readSessionResponse, userProvidedSchema)));
}
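
Both planning methods derive selectedFields with the same Optional idiom: use the field names of the pruned schema when one is present, otherwise fall back to the keys of the fields map. A minimal sketch of that fallback follows, assuming fields is a map keyed by field name; the class SelectedFieldsSketch and its method are hypothetical. The sketch swaps orElse for orElseGet, which builds the fallback list only when the Optional is empty, whereas the argument to orElse is always evaluated.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical illustration class; not part of spark-bigquery-connector.
final class SelectedFieldsSketch {

    // Prefers the explicitly required field names; when none are given,
    // falls back to every key of the fields map in its iteration order.
    static List<String> selectedFields(Optional<List<String>> requiredFieldNames, Map<String, ?> fields) {
        return requiredFieldNames
                .<List<String>>map(ArrayList::new)
                .orElseGet(() -> new ArrayList<>(fields.keySet()));
    }
}
```

Passing a LinkedHashMap keeps the fallback in insertion order, which matters if downstream code relies on column positions.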

Aggregations

Field (com.google.cloud.bigquery.Field): 3
Schema (com.google.cloud.bigquery.Schema): 3
TableDefinition (com.google.cloud.bigquery.TableDefinition): 3
TableId (com.google.cloud.bigquery.TableId): 3
TableInfo (com.google.cloud.bigquery.TableInfo): 3
BigQueryClient (com.google.cloud.bigquery.connector.common.BigQueryClient): 3
BigQueryClientFactory (com.google.cloud.bigquery.connector.common.BigQueryClientFactory): 3
BigQueryTracerFactory (com.google.cloud.bigquery.connector.common.BigQueryTracerFactory): 3
BigQueryUtil (com.google.cloud.bigquery.connector.common.BigQueryUtil): 3
ReadSessionCreator (com.google.cloud.bigquery.connector.common.ReadSessionCreator): 3
ReadSessionCreatorConfig (com.google.cloud.bigquery.connector.common.ReadSessionCreatorConfig): 3
ReadSessionResponse (com.google.cloud.bigquery.connector.common.ReadSessionResponse): 3
DataFormat (com.google.cloud.bigquery.storage.v1.DataFormat): 3
ReadSession (com.google.cloud.bigquery.storage.v1.ReadSession): 3
ReadStream (com.google.cloud.bigquery.storage.v1.ReadStream): 3
ReadRowsResponseToInternalRowIteratorConverter (com.google.cloud.spark.bigquery.ReadRowsResponseToInternalRowIteratorConverter): 3
SchemaConverters (com.google.cloud.spark.bigquery.SchemaConverters): 3
SparkFilterUtils (com.google.cloud.spark.bigquery.SparkFilterUtils): 3
ImmutableList (com.google.common.collect.ImmutableList): 3
ImmutableSet (com.google.common.collect.ImmutableSet): 3