Example 26 with ReadSession

use of com.google.cloud.bigquery.storage.v1.ReadSession in project spark-bigquery-connector by GoogleCloudDataproc.

the class BigQueryDataSourceReaderContext method planInputPartitionContexts.

public Stream<InputPartitionContext<InternalRow>> planInputPartitionContexts() {
    if (isEmptySchema()) {
        // No columns were requested (e.g. a count(*) query): return empty-projection
        // partitions instead of creating a read session.
        return createEmptyProjectionPartitions();
    }
    ImmutableList<String> selectedFields = schema
            .map(requiredSchema -> ImmutableList.copyOf(requiredSchema.fieldNames()))
            .orElse(ImmutableList.copyOf(fields.keySet()));
    Optional<String> filter = getCombinedFilter();
    ReadSessionResponse readSessionResponse = readSessionCreator.create(tableId, selectedFields, filter);
    ReadSession readSession = readSessionResponse.getReadSession();
    logger.info("Created read session for {}: {} for application id: {}",
            tableId.toString(), readSession.getName(), applicationId);
    // One input partition per server-side read stream.
    return readSession.getStreamsList().stream()
            .map(stream -> new BigQueryInputPartitionContext(
                    bigQueryReadClientFactory,
                    stream.getName(),
                    readSessionCreatorConfig.toReadRowsHelperOptions(),
                    createConverter(selectedFields, readSessionResponse, userProvidedSchema)));
}
Also used: IntStream(java.util.stream.IntStream) Iterables(com.google.common.collect.Iterables) InternalRow(org.apache.spark.sql.catalyst.InternalRow) TableId(com.google.cloud.bigquery.TableId) LoggerFactory(org.slf4j.LoggerFactory) ArrayList(java.util.ArrayList) LinkedHashMap(java.util.LinkedHashMap) OptionalLong(java.util.OptionalLong) ImmutableList(com.google.common.collect.ImmutableList) Schema(com.google.cloud.bigquery.Schema) Map(java.util.Map) ReadSessionResponse(com.google.cloud.bigquery.connector.common.ReadSessionResponse) StructField(org.apache.spark.sql.types.StructField) StructType(org.apache.spark.sql.types.StructType) Field(com.google.cloud.bigquery.Field) TableDefinition(com.google.cloud.bigquery.TableDefinition) ReadSessionCreator(com.google.cloud.bigquery.connector.common.ReadSessionCreator) JavaConversions(scala.collection.JavaConversions) ReadStream(com.google.cloud.bigquery.storage.v1.ReadStream) ImmutableSet(com.google.common.collect.ImmutableSet) Logger(org.slf4j.Logger) ReadSessionCreatorConfig(com.google.cloud.bigquery.connector.common.ReadSessionCreatorConfig) ReadSession(com.google.cloud.bigquery.storage.v1.ReadSession) BigQueryClient(com.google.cloud.bigquery.connector.common.BigQueryClient) Set(java.util.Set) SchemaConverters(com.google.cloud.spark.bigquery.SchemaConverters) Streams(com.google.common.collect.Streams) Collectors(java.util.stream.Collectors) DataFormat(com.google.cloud.bigquery.storage.v1.DataFormat) List(java.util.List) Stream(java.util.stream.Stream) ColumnarBatch(org.apache.spark.sql.vectorized.ColumnarBatch) ReadRowsResponseToInternalRowIteratorConverter(com.google.cloud.spark.bigquery.ReadRowsResponseToInternalRowIteratorConverter) BigQueryClientFactory(com.google.cloud.bigquery.connector.common.BigQueryClientFactory) SparkFilterUtils(com.google.cloud.spark.bigquery.SparkFilterUtils) Optional(java.util.Optional) Filter(org.apache.spark.sql.sources.Filter) TableInfo(com.google.cloud.bigquery.TableInfo) BigQueryUtil(com.google.cloud.bigquery.connector.common.BigQueryUtil) BigQueryTracerFactory(com.google.cloud.bigquery.connector.common.BigQueryTracerFactory)
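
Stripped of the Spark wiring, the pattern above is: create one read session, then turn each server-side stream into its own input partition. A minimal standalone sketch of that fan-out, assuming a v1 BigQueryReadClient with default credentials; planStreamNames is a hypothetical helper, not part of the connector:

import com.google.cloud.bigquery.storage.v1.BigQueryReadClient;
import com.google.cloud.bigquery.storage.v1.CreateReadSessionRequest;
import com.google.cloud.bigquery.storage.v1.DataFormat;
import com.google.cloud.bigquery.storage.v1.ReadSession;
import java.io.IOException;
import java.util.List;
import java.util.stream.Collectors;

public class StreamFanOut {

    // Returns one entry per server-side read stream; each entry is the unit of
    // work a reader (Spark partition, Trino split, ...) would later consume.
    static List<String> planStreamNames(String parent, String table) throws IOException {
        try (BigQueryReadClient client = BigQueryReadClient.create()) {
            ReadSession session = client.createReadSession(
                    CreateReadSessionRequest.newBuilder()
                            .setParent(parent) // e.g. "projects/my-project"
                            .setReadSession(ReadSession.newBuilder()
                                    .setTable(table) // full table resource name
                                    .setDataFormat(DataFormat.AVRO))
                            .setMaxStreamCount(4)
                            .build());
            return session.getStreamsList().stream()
                    .map(stream -> stream.getName())
                    .collect(Collectors.toList());
        }
    }
}

Note that maxStreamCount is only an upper bound; the service may return fewer streams, which is why callers map over whatever getStreamsList() actually contains.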

Example 27 with ReadSession

use of com.google.cloud.bigquery.storage.v1.ReadSession in project trino by trinodb.

the class BigQuerySplitManager method readFromBigQuery.

private List<BigQuerySplit> readFromBigQuery(ConnectorSession session, TableId remoteTableId,
        Optional<List<ColumnHandle>> projectedColumns, int actualParallelism, Optional<String> filter) {
    log.debug("readFromBigQuery(tableId=%s, projectedColumns=%s, actualParallelism=%s, filter=[%s])",
            remoteTableId, projectedColumns, actualParallelism, filter);
    List<ColumnHandle> columns = projectedColumns.orElse(ImmutableList.of());
    List<String> projectedColumnsNames = columns.stream()
            .map(column -> ((BigQueryColumnHandle) column).getName())
            .collect(toImmutableList());
    ReadSession readSession = new ReadSessionCreator(bigQueryClientFactory, bigQueryReadClientFactory, viewEnabled, viewExpiration)
            .create(session, remoteTableId, projectedColumnsNames, filter, actualParallelism);
    return readSession.getStreamsList().stream()
            .map(stream -> BigQuerySplit.forStream(stream.getName(), readSession.getAvroSchema().getSchema(), columns))
            .collect(toImmutableList());
}
Also used: BIGQUERY_FAILED_TO_EXECUTE_QUERY(io.trino.plugin.bigquery.BigQueryErrorCode.BIGQUERY_FAILED_TO_EXECUTE_QUERY) ConnectorSplitManager(io.trino.spi.connector.ConnectorSplitManager) Logger(io.airlift.log.Logger) IntStream.range(java.util.stream.IntStream.range) NodeManager(io.trino.spi.NodeManager) TableId(com.google.cloud.bigquery.TableId) BigQueryException(com.google.cloud.bigquery.BigQueryException) Duration(io.airlift.units.Duration) FixedSplitSource(io.trino.spi.connector.FixedSplitSource) Inject(javax.inject.Inject) NOT_SUPPORTED(io.trino.spi.StandardErrorCode.NOT_SUPPORTED) TableNotFoundException(io.trino.spi.connector.TableNotFoundException) ImmutableList(com.google.common.collect.ImmutableList) VIEW(com.google.cloud.bigquery.TableDefinition.Type.VIEW) ConnectorTableHandle(io.trino.spi.connector.ConnectorTableHandle) Objects.requireNonNull(java.util.Objects.requireNonNull) ColumnHandle(io.trino.spi.connector.ColumnHandle) TableResult(com.google.cloud.bigquery.TableResult) TABLE(com.google.cloud.bigquery.TableDefinition.Type.TABLE) ReadSession(com.google.cloud.bigquery.storage.v1.ReadSession) ImmutableList.toImmutableList(com.google.common.collect.ImmutableList.toImmutableList) TrinoException(io.trino.spi.TrinoException) ConnectorSplitSource(io.trino.spi.connector.ConnectorSplitSource) ConnectorSession(io.trino.spi.connector.ConnectorSession) TupleDomain(io.trino.spi.predicate.TupleDomain) SchemaTableName(io.trino.spi.connector.SchemaTableName) List(java.util.List) Collectors.toList(java.util.stream.Collectors.toList) DynamicFilter(io.trino.spi.connector.DynamicFilter) Optional(java.util.Optional) TableInfo(com.google.cloud.bigquery.TableInfo) ConnectorTransactionHandle(io.trino.spi.connector.ConnectorTransactionHandle)
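
Each split produced above pairs a stream name with the session's Avro schema JSON (readSession.getAvroSchema().getSchema()). Downstream, the page source has to decode the Avro-serialized row blocks against that schema. A minimal sketch of that decoding step, assuming Apache Avro is on the classpath; decodeRowBlock is a hypothetical helper, not Trino's actual page source:

import com.google.cloud.bigquery.storage.v1.ReadRowsResponse;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;

// Decodes one ReadRowsResponse worth of Avro rows using the schema string
// that the split carried along from the read session.
static List<GenericRecord> decodeRowBlock(String avroSchemaJson, ReadRowsResponse response)
        throws IOException {
    Schema schema = new Schema.Parser().parse(avroSchemaJson);
    GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
    BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(
            response.getAvroRows().getSerializedBinaryRows().toByteArray(), /* reuse = */ null);
    List<GenericRecord> rows = new ArrayList<>();
    while (!decoder.isEnd()) {
        rows.add(reader.read(/* reuse = */ null, decoder));
    }
    return rows;
}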

Example 28 with ReadSession

use of com.google.cloud.bigquery.storage.v1.ReadSession in project trino by trinodb.

the class ReadSessionCreator method create.

public ReadSession create(ConnectorSession session, TableId remoteTable, List<String> selectedFields,
        Optional<String> filter, int parallelism) {
    BigQueryClient client = bigQueryClientFactory.create(session);
    TableInfo tableDetails = client.getTable(remoteTable)
            .orElseThrow(() -> new TableNotFoundException(new SchemaTableName(remoteTable.getDataset(), remoteTable.getTable())));
    TableInfo actualTable = getActualTable(client, tableDetails, selectedFields);
    List<String> filteredSelectedFields = selectedFields.stream()
            .filter(BigQueryUtil::validColumnName)
            .collect(toList());
    try (BigQueryReadClient bigQueryReadClient = bigQueryReadClientFactory.create(session)) {
        // Push the column projection and the row filter down to the Storage API.
        ReadSession.TableReadOptions.Builder readOptions = ReadSession.TableReadOptions.newBuilder()
                .addAllSelectedFields(filteredSelectedFields);
        filter.ifPresent(readOptions::setRowRestriction);
        return bigQueryReadClient.createReadSession(CreateReadSessionRequest.newBuilder()
                .setParent("projects/" + client.getProjectId())
                .setReadSession(ReadSession.newBuilder()
                        .setDataFormat(DataFormat.AVRO)
                        .setTable(toTableResourceName(actualTable.getTableId()))
                        .setReadOptions(readOptions))
                .setMaxStreamCount(parallelism)
                .build());
    }
}
Also used: TableNotFoundException(io.trino.spi.connector.TableNotFoundException) ReadSession(com.google.cloud.bigquery.storage.v1.ReadSession) TableInfo(com.google.cloud.bigquery.TableInfo) SchemaTableName(io.trino.spi.connector.SchemaTableName) BigQueryReadClient(com.google.cloud.bigquery.storage.v1.BigQueryReadClient)
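
The two pushdowns above (addAllSelectedFields for the projection, setRowRestriction for the filter) take plain column names and an ordinary SQL predicate. A minimal sketch of creating such a session with both pushdowns set, using made-up table and filter values:

import com.google.cloud.bigquery.storage.v1.BigQueryReadClient;
import com.google.cloud.bigquery.storage.v1.CreateReadSessionRequest;
import com.google.cloud.bigquery.storage.v1.DataFormat;
import com.google.cloud.bigquery.storage.v1.ReadSession;
import com.google.cloud.bigquery.storage.v1.ReadSession.TableReadOptions;

static ReadSession createFilteredSession(BigQueryReadClient client, String projectId) {
    TableReadOptions readOptions = TableReadOptions.newBuilder()
            .addSelectedFields("word")              // projection pushdown
            .addSelectedFields("word_count")
            .setRowRestriction("word_count > 100")  // filter pushdown (SQL predicate)
            .build();
    return client.createReadSession(CreateReadSessionRequest.newBuilder()
            .setParent("projects/" + projectId)
            .setReadSession(ReadSession.newBuilder()
                    .setTable("projects/bigquery-public-data/datasets/samples/tables/shakespeare")
                    .setDataFormat(DataFormat.AVRO)
                    .setReadOptions(readOptions))
            .setMaxStreamCount(1)
            .build());
}

The predicate is evaluated server side, so the streams only carry rows that match the restriction.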

Example 29 with ReadSession

use of com.google.cloud.bigquery.storage.v1beta2.ReadSession in project java-bigquerystorage by googleapis.

the class ITBigQueryStorageLongRunningTest method testLongRunningReadSession.

@Test
public void testLongRunningReadSession() throws InterruptedException, ExecutionException {
    // This test reads a larger table with the goal of doing a simple validation of timeout
    // settings for a longer running session.
    String table = BigQueryResource.FormatTableResource(
            /* projectId = */ "bigquery-public-data",
            /* datasetId = */ "samples",
            /* tableId = */ "wikipedia");
    ReadSession session = client.createReadSession(
            /* parent = */ parentProjectId,
            /* readSession = */ ReadSession.newBuilder().setTable(table).setDataFormat(DataFormat.AVRO).build(),
            /* maxStreamCount = */ 5);
    assertEquals(
            String.format(
                    "Did not receive expected number of streams for table '%s' CreateReadSession response:%n%s",
                    table, session.toString()),
            5,
            session.getStreamsCount());
    List<Callable<Long>> tasks = new ArrayList<>(session.getStreamsCount());
    for (final ReadStream stream : session.getStreamsList()) {
        tasks.add(new Callable<Long>() {

            @Override
            public Long call() throws Exception {
                return readAllRowsFromStream(stream);
            }
        });
    }
    ExecutorService executor = Executors.newFixedThreadPool(tasks.size());
    List<Future<Long>> results = executor.invokeAll(tasks);
    long rowCount = 0;
    for (Future<Long> result : results) {
        rowCount += result.get();
    }
    assertEquals(313_797_035, rowCount);
}
Also used: ReadSession(com.google.cloud.bigquery.storage.v1beta2.ReadSession) ArrayList(java.util.ArrayList) Callable(java.util.concurrent.Callable) IOException(java.io.IOException) ExecutionException(java.util.concurrent.ExecutionException) ReadStream(com.google.cloud.bigquery.storage.v1beta2.ReadStream) ExecutorService(java.util.concurrent.ExecutorService) Future(java.util.concurrent.Future) Test(org.junit.Test)
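
readAllRowsFromStream is a test helper whose body is not shown in this excerpt. A plausible minimal reconstruction, assuming the same v1beta2 client field the test uses: it simply drains one stream and sums the reported row counts.

import com.google.api.gax.rpc.ServerStream;
import com.google.cloud.bigquery.storage.v1beta2.ReadRowsRequest;
import com.google.cloud.bigquery.storage.v1beta2.ReadRowsResponse;
import com.google.cloud.bigquery.storage.v1beta2.ReadStream;

// Hypothetical reconstruction of the helper used by the test above.
private long readAllRowsFromStream(ReadStream readStream) {
    ReadRowsRequest request = ReadRowsRequest.newBuilder()
            .setReadStream(readStream.getName())
            .build();
    long rowCount = 0;
    ServerStream<ReadRowsResponse> responses = client.readRowsCallable().call(request);
    for (ReadRowsResponse response : responses) {
        rowCount += response.getRowCount();
    }
    return rowCount;
}

One hygiene point the excerpt leaves out: the fixed thread pool is never shut down, so a real test should call executor.shutdown() after invokeAll returns.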

Example 30 with ReadSession

use of com.google.cloud.bigquery.storage.v1beta2.ReadSession in project java-bigquerystorage by googleapis.

the class ITBigQueryStorageTest method testSimpleReadAndResume.

@Test
public void testSimpleReadAndResume() {
    String table = BigQueryResource.FormatTableResource(
            /* projectId = */ "bigquery-public-data",
            /* datasetId = */ "samples",
            /* tableId = */ "shakespeare");
    ReadSession session = client.createReadSession(
            /* parent = */ parentProjectId,
            /* readSession = */ ReadSession.newBuilder().setTable(table).setDataFormat(DataFormat.AVRO).build(),
            /* maxStreamCount = */ 1);
    assertEquals(
            String.format(
                    "Did not receive expected number of streams for table '%s' CreateReadSession response:%n%s",
                    table, session.toString()),
            1,
            session.getStreamsCount());
    // We have to read some number of rows in order to be able to resume. More details:
    long rowCount = ReadStreamToOffset(session.getStreams(0), /* rowOffset = */ 34_846);
    ReadRowsRequest readRowsRequest = ReadRowsRequest.newBuilder()
            .setReadStream(session.getStreams(0).getName())
            .setOffset(rowCount)
            .build();
    ServerStream<ReadRowsResponse> stream = client.readRowsCallable().call(readRowsRequest);
    for (ReadRowsResponse response : stream) {
        rowCount += response.getRowCount();
    }
    // Verifies that the number of rows skipped and read equals the total number of rows in
    // the table.
    assertEquals(164_656, rowCount);
}
Also used: ReadRowsResponse(com.google.cloud.bigquery.storage.v1beta2.ReadRowsResponse) ReadSession(com.google.cloud.bigquery.storage.v1beta2.ReadSession) ReadRowsRequest(com.google.cloud.bigquery.storage.v1beta2.ReadRowsRequest) Test(org.junit.Test)
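
ReadStreamToOffset is likewise a helper not shown in this excerpt. A plausible minimal reconstruction under the same assumptions (a v1beta2 client field named client): read until at least rowOffset rows have been consumed, cancel the server stream, and return the exact count, which the test then passes to setOffset to resume where it left off.

import com.google.api.gax.rpc.ServerStream;
import com.google.cloud.bigquery.storage.v1beta2.ReadRowsRequest;
import com.google.cloud.bigquery.storage.v1beta2.ReadRowsResponse;
import com.google.cloud.bigquery.storage.v1beta2.ReadStream;

// Hypothetical reconstruction of the helper used by the test above.
private long ReadStreamToOffset(ReadStream readStream, long rowOffset) {
    ReadRowsRequest request = ReadRowsRequest.newBuilder()
            .setReadStream(readStream.getName())
            .build();
    long rowCount = 0;
    ServerStream<ReadRowsResponse> responses = client.readRowsCallable().call(request);
    for (ReadRowsResponse response : responses) {
        rowCount += response.getRowCount();
        if (rowCount >= rowOffset) {
            responses.cancel(); // stop pulling; the test resumes from this offset
            break;
        }
    }
    return rowCount;
}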

Aggregations

ReadSession (com.google.cloud.bigquery.storage.v1.ReadSession): 29
Test (org.junit.Test): 23
ReadRowsRequest (com.google.cloud.bigquery.storage.v1.ReadRowsRequest): 17
ReadRowsResponse (com.google.cloud.bigquery.storage.v1.ReadRowsResponse): 17
CreateReadSessionRequest (com.google.cloud.bigquery.storage.v1.CreateReadSessionRequest): 15
StorageClient (org.apache.beam.sdk.io.gcp.bigquery.BigQueryServices.StorageClient): 14
FakeBigQueryServices (org.apache.beam.sdk.io.gcp.testing.FakeBigQueryServices): 13
TableRow (com.google.api.services.bigquery.model.TableRow): 10
TableRowParser (org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.TableRowParser): 9
Table (com.google.api.services.bigquery.model.Table): 8
TableReference (com.google.api.services.bigquery.model.TableReference): 7
ByteString (com.google.protobuf.ByteString): 7
TableReadOptions (com.google.cloud.bigquery.storage.v1.ReadSession.TableReadOptions): 6
ReadSession (com.google.cloud.bigquery.storage.v1beta2.ReadSession): 6
GenericRecord (org.apache.avro.generic.GenericRecord): 6
TableInfo (com.google.cloud.bigquery.TableInfo): 5
ReadRowsRequest (com.google.cloud.bigquery.storage.v1beta2.ReadRowsRequest): 5
ReadRowsResponse (com.google.cloud.bigquery.storage.v1beta2.ReadRowsResponse): 5
ArrayList (java.util.ArrayList): 5
TableId (com.google.cloud.bigquery.TableId): 4