Example 1 with ColumnarBatch

Use of org.apache.spark.sql.vectorized.ColumnarBatch in project iceberg by apache.

The read method of the ColumnarBatchReader class.

@Override
public final ColumnarBatch read(ColumnarBatch reuse, int numRowsToRead) {
    Preconditions.checkArgument(numRowsToRead > 0, "Invalid number of rows to read: %s", numRowsToRead);
    ColumnVector[] arrowColumnVectors = new ColumnVector[readers.length];
    if (reuse == null) {
        // Not reusing the previous batch, so release the vectors it was holding.
        closeVectors();
    }
    for (int i = 0; i < readers.length; i += 1) {
        // Each column reader fills (or reuses) its vector holder with numRowsToRead values.
        vectorHolders[i] = readers[i].read(vectorHolders[i], numRowsToRead);
        int numRowsInVector = vectorHolders[i].numValues();
        Preconditions.checkState(numRowsInVector == numRowsToRead, "Number of rows in the vector %s didn't match expected %s", numRowsInVector, numRowsToRead);
        arrowColumnVectors[i] = IcebergArrowColumnVector.forHolder(vectorHolders[i], numRowsInVector);
    }
    ColumnarBatch batch = new ColumnarBatch(arrowColumnVectors);
    batch.setNumRows(numRowsToRead);
    return batch;
}
Also used : ColumnarBatch(org.apache.spark.sql.vectorized.ColumnarBatch) ColumnVector(org.apache.spark.sql.vectorized.ColumnVector)
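
A batch produced by read can be consumed either column by column (batch.column(i)) or row by row through rowIterator(). The helper below is a minimal, hypothetical consumer sketch; ColumnarBatch and InternalRow are the real Spark classes, but the method itself is not part of the Iceberg source.

import java.util.Iterator;
import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.vectorized.ColumnarBatch;

// Hypothetical consumer: walk the rows of one batch returned by ColumnarBatchReader.read.
static void processBatch(ColumnarBatch batch) {
    Iterator<InternalRow> rows = batch.rowIterator();
    while (rows.hasNext()) {
        InternalRow row = rows.next();
        // Column accessors (row.getInt(0), row.getUTF8String(1), ...) depend on the projected schema.
    }
}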

Example 2 with ColumnarBatch

Use of org.apache.spark.sql.vectorized.ColumnarBatch in project iceberg by apache.

The writeAndValidateRecords method of the TestSparkOrcReader class.

private void writeAndValidateRecords(Schema schema, Iterable<InternalRow> expected) throws IOException {
    final File testFile = temp.newFile();
    Assert.assertTrue("Delete should succeed", testFile.delete());
    try (FileAppender<InternalRow> writer = ORC.write(Files.localOutput(testFile))
            .createWriterFunc(SparkOrcWriter::new)
            .schema(schema)
            .build()) {
        writer.addAll(expected);
    }
    try (CloseableIterable<InternalRow> reader = ORC.read(Files.localInput(testFile))
            .project(schema)
            .createReaderFunc(readOrcSchema -> new SparkOrcReader(schema, readOrcSchema))
            .build()) {
        final Iterator<InternalRow> actualRows = reader.iterator();
        final Iterator<InternalRow> expectedRows = expected.iterator();
        while (expectedRows.hasNext()) {
            Assert.assertTrue("Should have expected number of rows", actualRows.hasNext());
            assertEquals(schema, expectedRows.next(), actualRows.next());
        }
        Assert.assertFalse("Should not have extra rows", actualRows.hasNext());
    }
    try (CloseableIterable<ColumnarBatch> reader = ORC.read(Files.localInput(testFile))
            .project(schema)
            .createBatchedReaderFunc(readOrcSchema -> VectorizedSparkOrcReaders.buildReader(schema, readOrcSchema, ImmutableMap.of()))
            .build()) {
        final Iterator<InternalRow> actualRows = batchesToRows(reader.iterator());
        final Iterator<InternalRow> expectedRows = expected.iterator();
        while (expectedRows.hasNext()) {
            Assert.assertTrue("Should have expected number of rows", actualRows.hasNext());
            assertEquals(schema, expectedRows.next(), actualRows.next());
        }
        Assert.assertFalse("Should not have extra rows", actualRows.hasNext());
    }
}
Also used : InternalRow(org.apache.spark.sql.catalyst.InternalRow) Types(org.apache.iceberg.types.Types) Iterator(java.util.Iterator) CloseableIterable(org.apache.iceberg.io.CloseableIterable) ImmutableMap(org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap) IOException(java.io.IOException) Test(org.junit.Test) Schema(org.apache.iceberg.Schema) ORC(org.apache.iceberg.orc.ORC) File(java.io.File) VectorizedSparkOrcReaders(org.apache.iceberg.spark.data.vectorized.VectorizedSparkOrcReaders) List(java.util.List) ColumnarBatch(org.apache.spark.sql.vectorized.ColumnarBatch) NestedField.required(org.apache.iceberg.types.Types.NestedField.required) TestHelpers.assertEquals(org.apache.iceberg.spark.data.TestHelpers.assertEquals) Iterators(org.apache.iceberg.relocated.com.google.common.collect.Iterators) Assert(org.junit.Assert) Collections(java.util.Collections) Files(org.apache.iceberg.Files) FileAppender(org.apache.iceberg.io.FileAppender)
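
The batchesToRows helper used in the vectorized read path above is not shown in the snippet. A plausible reconstruction, using the relocated Guava Iterators utility already listed under "Also used", flattens each batch's row iterator into one stream of rows; this is a sketch, not necessarily the exact Iceberg test code.

import java.util.Iterator;
import org.apache.iceberg.relocated.com.google.common.collect.Iterators;
import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.vectorized.ColumnarBatch;

// Plausible shape of batchesToRows: concatenate the per-batch row iterators into one.
private Iterator<InternalRow> batchesToRows(Iterator<ColumnarBatch> batches) {
    return Iterators.concat(Iterators.transform(batches, ColumnarBatch::rowIterator));
}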

Example 3 with ColumnarBatch

Use of org.apache.spark.sql.vectorized.ColumnarBatch in project tispark by pingcap.

The createColumnarBatch method of the TiColumnarBatchHelper class.

public static ColumnarBatch createColumnarBatch(TiChunk chunk) {
    int colLen = chunk.numOfCols();
    TiColumnVectorAdapter[] columns = new TiColumnVectorAdapter[colLen];
    for (int i = 0; i < colLen; i++) {
        columns[i] = new TiColumnVectorAdapter(chunk.column(i));
    }
    ColumnarBatch batch = new ColumnarBatch(columns);
    batch.setNumRows(chunk.numOfRows());
    return batch;
}
Also used : ColumnarBatch(org.apache.spark.sql.vectorized.ColumnarBatch)
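
A hedged usage sketch: given a decoded TiChunk (for example from a coprocessor response), the helper wraps its columns and the resulting batch can be read as Spark rows. The surrounding method is hypothetical; TiChunk and TiColumnarBatchHelper are the TiSpark classes from the example, and their imports are omitted here.

import java.util.Iterator;
import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.vectorized.ColumnarBatch;

// Hypothetical consumer: expose a TiChunk to Spark as an iterator of InternalRow.
// (TiChunk/TiColumnarBatchHelper imports omitted; they live in the tispark project above.)
static Iterator<InternalRow> chunkToRows(TiChunk chunk) {
    ColumnarBatch batch = TiColumnarBatchHelper.createColumnarBatch(chunk);
    return batch.rowIterator();
}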

Example 4 with ColumnarBatch

Use of org.apache.spark.sql.vectorized.ColumnarBatch in project spark-bigquery-connector by GoogleCloudDataproc.

The toArrowRows method of the ArrowReaderIterator class.

private Iterator<InternalRow> toArrowRows(VectorSchemaRoot root, List<String> namesInOrder) {
    ColumnVector[] columns =
        namesInOrder.stream()
            .map(name -> root.getVector(name))
            .map(vector -> new ArrowSchemaConverter(vector, userProvidedFieldMap.get(vector.getName())))
            .collect(Collectors.toList())
            .toArray(new ColumnVector[0]);
    ColumnarBatch batch = new ColumnarBatch(columns);
    batch.setNumRows(root.getRowCount());
    return batch.rowIterator();
}
Also used : Arrays(java.util.Arrays) InternalRow(org.apache.spark.sql.catalyst.InternalRow) LoggerFactory(org.slf4j.LoggerFactory) Function(java.util.function.Function) ImmutableList(com.google.common.collect.ImmutableList) ByteArrayInputStream(java.io.ByteArrayInputStream) Map(java.util.Map) ArrowStreamReader(org.apache.arrow.vector.ipc.ArrowStreamReader) BufferAllocator(org.apache.arrow.memory.BufferAllocator) StructField(org.apache.spark.sql.types.StructField) StructType(org.apache.spark.sql.types.StructType) ArrowReader(org.apache.arrow.vector.ipc.ArrowReader) ColumnVector(org.apache.spark.sql.vectorized.ColumnVector) Logger(org.slf4j.Logger) Iterator(java.util.Iterator) SequenceInputStream(java.io.SequenceInputStream) CommonsCompressionFactory(org.apache.arrow.compression.CommonsCompressionFactory) VectorSchemaRoot(org.apache.arrow.vector.VectorSchemaRoot) IOException(java.io.IOException) ArrowUtil(com.google.cloud.bigquery.connector.common.ArrowUtil) Collectors(java.util.stream.Collectors) ByteString(com.google.protobuf.ByteString) UncheckedIOException(java.io.UncheckedIOException) List(java.util.List) ColumnarBatch(org.apache.spark.sql.vectorized.ColumnarBatch) Optional(java.util.Optional)
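
For context, toArrowRows is typically called once per loaded Arrow record batch. The loop below is a hedged sketch of that surrounding read path; ArrowStreamReader, loadNextBatch and getVectorSchemaRoot are the standard Arrow APIs, while the method, its parameters and the call into toArrowRows are assumptions for illustration.

import java.io.InputStream;
import java.util.Iterator;
import java.util.List;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowStreamReader;
import org.apache.spark.sql.catalyst.InternalRow;

// Hypothetical driver loop: decode Arrow record batches from a stream and convert each
// loaded VectorSchemaRoot into Spark rows via toArrowRows(...).
void readAllRows(InputStream stream, BufferAllocator allocator, List<String> namesInOrder) throws Exception {
    try (ArrowStreamReader arrowReader = new ArrowStreamReader(stream, allocator)) {
        VectorSchemaRoot root = arrowReader.getVectorSchemaRoot();
        while (arrowReader.loadNextBatch()) {
            Iterator<InternalRow> rows = toArrowRows(root, namesInOrder);
            while (rows.hasNext()) {
                InternalRow row = rows.next();
                // hand each row to the Spark scan
            }
        }
    }
}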

Example 5 with ColumnarBatch

Use of org.apache.spark.sql.vectorized.ColumnarBatch in project spark-bigquery-connector by GoogleCloudDataproc.

The next method of the ArrowColumnBatchPartitionReaderContext class.

public boolean next() throws IOException {
    tracer.nextBatchNeeded();
    if (closed) {
        return false;
    }
    tracer.rowsParseStarted();
    closed = !reader.loadNextBatch();
    if (closed) {
        return false;
    }
    VectorSchemaRoot root = reader.root();
    if (currentBatch == null) {
        // The ColumnarBatch wrapper should only need to be created once: the underlying
        // Arrow vectors stay the same across batches (still being confirmed with dev@spark).
        ColumnVector[] columns =
            namesInOrder.stream()
                .map(root::getVector)
                .map(vector -> new ArrowSchemaConverter(vector, userProvidedFieldMap.get(vector.getName())))
                .toArray(ColumnVector[]::new);
        currentBatch = new ColumnarBatch(columns);
    }
    currentBatch.setNumRows(root.getRowCount());
    tracer.rowsParseFinished(currentBatch.numRows());
    return true;
}
Also used : VectorLoader(org.apache.arrow.vector.VectorLoader) MoreExecutors(com.google.common.util.concurrent.MoreExecutors) Arrays(java.util.Arrays) Schema(org.apache.arrow.vector.types.pojo.Schema) ThreadPoolExecutor(java.util.concurrent.ThreadPoolExecutor) ReadRowsResponse(com.google.cloud.bigquery.storage.v1.ReadRowsResponse) ArrowSchemaConverter(com.google.cloud.spark.bigquery.ArrowSchemaConverter) ArrayList(java.util.ArrayList) IteratorMultiplexer(com.google.cloud.bigquery.connector.common.IteratorMultiplexer) ParallelArrowReader(com.google.cloud.bigquery.connector.common.ParallelArrowReader) ImmutableList(com.google.common.collect.ImmutableList) Map(java.util.Map) AutoCloseables(org.apache.arrow.util.AutoCloseables) ArrowStreamReader(org.apache.arrow.vector.ipc.ArrowStreamReader) ExecutorService(java.util.concurrent.ExecutorService) BufferAllocator(org.apache.arrow.memory.BufferAllocator) StructField(org.apache.spark.sql.types.StructField) StructType(org.apache.spark.sql.types.StructType) NonInterruptibleBlockingBytesChannel(com.google.cloud.bigquery.connector.common.NonInterruptibleBlockingBytesChannel) ArrowReader(org.apache.arrow.vector.ipc.ArrowReader) ColumnVector(org.apache.spark.sql.vectorized.ColumnVector) Iterator(java.util.Iterator) ReadRowsResponseInputStreamEnumeration(com.google.cloud.bigquery.connector.common.ReadRowsResponseInputStreamEnumeration) SynchronousQueue(java.util.concurrent.SynchronousQueue) SequenceInputStream(java.io.SequenceInputStream) CommonsCompressionFactory(org.apache.arrow.compression.CommonsCompressionFactory) VectorSchemaRoot(org.apache.arrow.vector.VectorSchemaRoot) IOException(java.io.IOException) ArrowUtil(com.google.cloud.bigquery.connector.common.ArrowUtil) Collectors(java.util.stream.Collectors) ByteString(com.google.protobuf.ByteString) TimeUnit(java.util.concurrent.TimeUnit) BigQueryStorageReadRowsTracer(com.google.cloud.bigquery.connector.common.BigQueryStorageReadRowsTracer) List(java.util.List) ColumnarBatch(org.apache.spark.sql.vectorized.ColumnarBatch) Optional(java.util.Optional) ReadRowsHelper(com.google.cloud.bigquery.connector.common.ReadRowsHelper) InputStream(java.io.InputStream)
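
In Spark's DataSource V2 read path, a context object like this usually sits behind a PartitionReader<ColumnarBatch>. The wrapper below is a minimal, hypothetical sketch of that wiring; PartitionReader is the real Spark interface, but the context accessors (currentBatch, close) are assumed names, not confirmed from the connector source.

import java.io.IOException;
import org.apache.spark.sql.connector.read.PartitionReader;
import org.apache.spark.sql.vectorized.ColumnarBatch;

// Hypothetical DSv2 reader that delegates batch production to the context object above.
class ArrowColumnBatchPartitionReader implements PartitionReader<ColumnarBatch> {
    private final ArrowColumnBatchPartitionReaderContext context;

    ArrowColumnBatchPartitionReader(ArrowColumnBatchPartitionReaderContext context) {
        this.context = context;
    }

    @Override
    public boolean next() throws IOException {
        return context.next();  // builds or refreshes the current batch as shown above
    }

    @Override
    public ColumnarBatch get() {
        return context.currentBatch();  // assumed accessor for the batch built in next()
    }

    @Override
    public void close() throws IOException {
        context.close();  // assumed to release the Arrow reader and allocator
    }
}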

Aggregations

ColumnarBatch (org.apache.spark.sql.vectorized.ColumnarBatch): 15
ColumnVector (org.apache.spark.sql.vectorized.ColumnVector): 6
List (java.util.List): 5
ImmutableList (com.google.common.collect.ImmutableList): 4
Iterator (java.util.Iterator): 4
Optional (java.util.Optional): 4
Collectors (java.util.stream.Collectors): 4
StructType (org.apache.spark.sql.types.StructType): 4
ByteString (com.google.protobuf.ByteString): 3
IOException (java.io.IOException): 3
ArrayList (java.util.ArrayList): 3
StructField (org.apache.spark.sql.types.StructField): 3
ArrowUtil (com.google.cloud.bigquery.connector.common.ArrowUtil): 2
BigQueryClientFactory (com.google.cloud.bigquery.connector.common.BigQueryClientFactory): 2
BigQueryStorageReadRowsTracer (com.google.cloud.bigquery.connector.common.BigQueryStorageReadRowsTracer): 2
BigQueryTracerFactory (com.google.cloud.bigquery.connector.common.BigQueryTracerFactory): 2
ReadRowsHelper (com.google.cloud.bigquery.connector.common.ReadRowsHelper): 2
ReadSessionResponse (com.google.cloud.bigquery.connector.common.ReadSessionResponse): 2
ReadRowsResponse (com.google.cloud.bigquery.storage.v1.ReadRowsResponse): 2
Datatype (io.tiledb.java.api.Datatype): 2