Search in sources:

Example 1 with InternalRow

Use of org.apache.spark.sql.catalyst.InternalRow in project RemoteShuffleService by alibaba.

From class RssShuffleWriterSuiteJ, method getUnsafeRowIterator:

private Iterator<Product2<Integer, UnsafeRow>> getUnsafeRowIterator(final int size, final AtomicInteger total, final boolean mix) {
    int current = 0;
    ListBuffer<Product2<Integer, UnsafeRow>> list = new ListBuffer<>();
    // Keep generating records until the accumulated payload length reaches `size`.
    while (current < size) {
        int key = total.getAndIncrement();
        String value = key + ": " + (mix && rand.nextBoolean() ? GIANT_RECORD : NORMAL_RECORD);
        current += value.length();
        // Build an InternalRow of (int key, UTF8String value) via Scala collection interop
        // ($plus$eq is ListBuffer's += as seen from Java).
        ListBuffer<Object> values = new ListBuffer<>();
        values.$plus$eq(key);
        values.$plus$eq(UTF8String.fromString(value));
        InternalRow row = InternalRow.apply(values.toSeq());
        // Project the generic row into Spark's binary UnsafeRow format.
        DataType[] types = new DataType[2];
        types[0] = IntegerType$.MODULE$;
        types[1] = StringType$.MODULE$;
        UnsafeRow unsafeRow = UnsafeProjection.create(types).apply(row);
        // Pair each row with its target shuffle partition.
        list.$plus$eq(new Tuple2<>(key % numPartitions, unsafeRow));
    }
    return list.toIterator();
}
Also used: Product2(scala.Product2), ListBuffer(scala.collection.mutable.ListBuffer), DataType(org.apache.spark.sql.types.DataType), UTF8String(org.apache.spark.unsafe.types.UTF8String), UnsafeRow(org.apache.spark.sql.catalyst.expressions.UnsafeRow), InternalRow(org.apache.spark.sql.catalyst.InternalRow)
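
The Scala collection interop above can obscure the core pattern, so here is a standalone sketch of the same round trip using plain Java objects. The class name, literal values, and printouts are illustrative and not part of the test; it only assumes Spark's GenericInternalRow, UnsafeProjection, and UTF8String APIs.

import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.catalyst.expressions.GenericInternalRow;
import org.apache.spark.sql.catalyst.expressions.UnsafeProjection;
import org.apache.spark.sql.catalyst.expressions.UnsafeRow;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.IntegerType$;
import org.apache.spark.sql.types.StringType$;
import org.apache.spark.unsafe.types.UTF8String;

// Illustrative sketch, not from RssShuffleWriterSuiteJ: build one (int, string)
// InternalRow and project it into Spark's binary UnsafeRow format.
public class UnsafeRowRoundTripSketch {
    public static void main(String[] args) {
        InternalRow row = new GenericInternalRow(
                new Object[] { 42, UTF8String.fromString("42: hello") });
        DataType[] types = { IntegerType$.MODULE$, StringType$.MODULE$ };
        UnsafeRow unsafeRow = UnsafeProjection.create(types).apply(row);
        // Fields are read back positionally from the binary layout.
        System.out.println(unsafeRow.getInt(0));                   // 42
        System.out.println(unsafeRow.getUTF8String(1).toString()); // 42: hello
    }
}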

Example 2 with InternalRow

Use of org.apache.spark.sql.catalyst.InternalRow in project iceberg by apache.

From class TestSparkParquetReadMetadataColumns, method readAndValidate:

private void readAndValidate(Expression filter, Long splitStart, Long splitLength, List<InternalRow> expected) throws IOException {
    Parquet.ReadBuilder builder = Parquet.read(Files.localInput(testFile)).project(PROJECTION_SCHEMA);
    if (vectorized) {
        builder.createBatchedReaderFunc(fileSchema -> VectorizedSparkParquetReaders.buildReader(PROJECTION_SCHEMA, fileSchema, NullCheckingForGet.NULL_CHECKING_ENABLED));
        builder.recordsPerBatch(RECORDS_PER_BATCH);
    } else {
        builder = builder.createReaderFunc(msgType -> SparkParquetReaders.buildReader(PROJECTION_SCHEMA, msgType));
    }
    if (filter != null) {
        builder = builder.filter(filter);
    }
    if (splitStart != null && splitLength != null) {
        builder = builder.split(splitStart, splitLength);
    }
    try (CloseableIterable<InternalRow> reader = vectorized ? batchesToRows(builder.build()) : builder.build()) {
        final Iterator<InternalRow> actualRows = reader.iterator();
        for (InternalRow internalRow : expected) {
            Assert.assertTrue("Should have expected number of rows", actualRows.hasNext());
            TestHelpers.assertEquals(PROJECTION_SCHEMA, internalRow, actualRows.next());
        }
        Assert.assertFalse("Should not have extra rows", actualRows.hasNext());
    }
}
Also used: Parquet(org.apache.iceberg.parquet.Parquet), InternalRow(org.apache.spark.sql.catalyst.InternalRow), Types(org.apache.iceberg.types.Types), RunWith(org.junit.runner.RunWith), VectorizedSparkParquetReaders(org.apache.iceberg.spark.data.vectorized.VectorizedSparkParquetReaders), GenericInternalRow(org.apache.spark.sql.catalyst.expressions.GenericInternalRow), Lists(org.apache.iceberg.relocated.com.google.common.collect.Lists), Expression(org.apache.iceberg.expressions.Expression), Configuration(org.apache.hadoop.conf.Configuration), UTF8String(org.apache.spark.unsafe.types.UTF8String), Path(org.apache.hadoop.fs.Path), Parameterized(org.junit.runners.Parameterized), Files(org.apache.iceberg.Files), FileAppender(org.apache.iceberg.io.FileAppender), Before(org.junit.Before), StructType(org.apache.spark.sql.types.StructType), Iterator(java.util.Iterator), NullCheckingForGet(org.apache.arrow.vector.NullCheckingForGet), CloseableIterable(org.apache.iceberg.io.CloseableIterable), ParquetFileWriter(org.apache.parquet.hadoop.ParquetFileWriter), IOException(java.io.IOException), Iterables(org.apache.iceberg.relocated.com.google.common.collect.Iterables), Test(org.junit.Test), Schema(org.apache.iceberg.Schema), SparkSchemaUtil(org.apache.iceberg.spark.SparkSchemaUtil), File(java.io.File), MetadataColumns(org.apache.iceberg.MetadataColumns), ParquetFileReader(org.apache.parquet.hadoop.ParquetFileReader), List(java.util.List), ParquetReadOptions(org.apache.parquet.ParquetReadOptions), ColumnarBatch(org.apache.spark.sql.vectorized.ColumnarBatch), Rule(org.junit.Rule), NestedField.required(org.apache.iceberg.types.Types.NestedField.required), BlockMetaData(org.apache.parquet.hadoop.metadata.BlockMetaData), ParquetSchemaUtil(org.apache.iceberg.parquet.ParquetSchemaUtil), Expressions(org.apache.iceberg.expressions.Expressions), Assert(org.junit.Assert), HadoopInputFile(org.apache.parquet.hadoop.util.HadoopInputFile), TemporaryFolder(org.junit.rules.TemporaryFolder)
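
The batchesToRows helper used in the vectorized branch is referenced but not shown. A minimal sketch of what such a helper could look like follows; it is an assumption rather than the project's actual implementation, and it relies only on types already listed above (ColumnarBatch, CloseableIterable, Iterables, InternalRow) plus Iceberg's CloseableIterable.combine.

// Hypothetical shape for batchesToRows: flatten each ColumnarBatch into its
// InternalRow iterator, and keep the batch iterable as the resource to close.
private CloseableIterable<InternalRow> batchesToRows(CloseableIterable<ColumnarBatch> batches) {
    Iterable<InternalRow> rows = Iterables.concat(
            Iterables.transform(batches, batch -> (Iterable<InternalRow>) batch::rowIterator));
    return CloseableIterable.combine(rows, batches);
}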

Example 3 with InternalRow

Use of org.apache.spark.sql.catalyst.InternalRow in project iceberg by apache.

From class IcebergSourceDeleteBenchmark, method writePosDeletes:

protected void writePosDeletes(CharSequence path, List<Long> deletedPos, int numNoise) throws IOException {
    OutputFileFactory fileFactory = newFileFactory();
    SparkFileWriterFactory writerFactory = SparkFileWriterFactory.builderFor(table()).dataFileFormat(fileFormat()).build();
    ClusteredPositionDeleteWriter<InternalRow> writer = new ClusteredPositionDeleteWriter<>(writerFactory, fileFactory, table().io(), fileFormat(), TARGET_FILE_SIZE_IN_BYTES);
    PartitionSpec unpartitionedSpec = table().specs().get(0);
    PositionDelete<InternalRow> positionDelete = PositionDelete.create();
    try (ClusteredPositionDeleteWriter<InternalRow> closeableWriter = writer) {
        for (Long pos : deletedPos) {
            positionDelete.set(path, pos, null);
            closeableWriter.write(positionDelete, unpartitionedSpec, null);
            for (int i = 0; i < numNoise; i++) {
                positionDelete.set(noisePath(path), pos, null);
                closeableWriter.write(positionDelete, unpartitionedSpec, null);
            }
        }
    }
    RowDelta rowDelta = table().newRowDelta();
    writer.result().deleteFiles().forEach(rowDelta::addDeletes);
    rowDelta.validateDeletedFiles().commit();
}
Also used: OutputFileFactory(org.apache.iceberg.io.OutputFileFactory), ClusteredPositionDeleteWriter(org.apache.iceberg.io.ClusteredPositionDeleteWriter), RowDelta(org.apache.iceberg.RowDelta), PartitionSpec(org.apache.iceberg.PartitionSpec), InternalRow(org.apache.spark.sql.catalyst.InternalRow), GenericInternalRow(org.apache.spark.sql.catalyst.expressions.GenericInternalRow)
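
For context, a hypothetical call to this helper could look as follows; the dataFile reference, positions, and noise count are illustrative and not taken from the benchmark.

// Illustrative only: mark rows 0, 10 and 20 of one existing data file as deleted,
// writing five noise deletes (against other paths) per real delete.
List<Long> deletedPositions = Arrays.asList(0L, 10L, 20L);   // java.util.Arrays, java.util.List
writePosDeletes(dataFile.path(), deletedPositions, 5);       // dataFile is a hypothetical DataFile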

Example 4 with InternalRow

Use of org.apache.spark.sql.catalyst.InternalRow in project iceberg by apache.

From class WritersBenchmark, method writePartitionedClusteredDataWriter:

@Benchmark
@Threads(1)
public void writePartitionedClusteredDataWriter(Blackhole blackhole) throws IOException {
    FileIO io = table().io();
    OutputFileFactory fileFactory = newFileFactory();
    SparkFileWriterFactory writerFactory = SparkFileWriterFactory.builderFor(table()).dataFileFormat(fileFormat()).dataSchema(table().schema()).build();
    ClusteredDataWriter<InternalRow> writer = new ClusteredDataWriter<>(writerFactory, fileFactory, io, fileFormat(), TARGET_FILE_SIZE_IN_BYTES);
    PartitionKey partitionKey = new PartitionKey(partitionedSpec, table().schema());
    StructType dataSparkType = SparkSchemaUtil.convert(table().schema());
    InternalRowWrapper internalRowWrapper = new InternalRowWrapper(dataSparkType);
    try (ClusteredDataWriter<InternalRow> closeableWriter = writer) {
        for (InternalRow row : rows) {
            partitionKey.partition(internalRowWrapper.wrap(row));
            closeableWriter.write(row, partitionedSpec, partitionKey);
        }
    }
    blackhole.consume(writer);
}
Also used: OutputFileFactory(org.apache.iceberg.io.OutputFileFactory), StructType(org.apache.spark.sql.types.StructType), ClusteredDataWriter(org.apache.iceberg.io.ClusteredDataWriter), PartitionKey(org.apache.iceberg.PartitionKey), InternalRow(org.apache.spark.sql.catalyst.InternalRow), FileIO(org.apache.iceberg.io.FileIO), Threads(org.openjdk.jmh.annotations.Threads), Benchmark(org.openjdk.jmh.annotations.Benchmark)
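
The benchmark deliberately stops at handing the writer to the Blackhole. Outside a benchmark, the produced data files would typically be committed; a minimal sketch of that follow-up step, assuming Iceberg's DataWriteResult and AppendFiles APIs, might look like this (it is not part of WritersBenchmark).

// Sketch of a follow-up commit, not part of the benchmark. Assumes the writer
// has already been closed, so result() is available.
DataWriteResult result = writer.result();            // org.apache.iceberg.io.DataWriteResult
AppendFiles append = table().newAppend();            // org.apache.iceberg.AppendFiles
result.dataFiles().forEach(append::appendFile);
append.commit();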

Example 5 with InternalRow

Use of org.apache.spark.sql.catalyst.InternalRow in project iceberg by apache.

From class WritersBenchmark, method writeUnpartitionedLegacyDataWriter:

@Benchmark
@Threads(1)
public void writeUnpartitionedLegacyDataWriter(Blackhole blackhole) throws IOException {
    FileIO io = table().io();
    OutputFileFactory fileFactory = newFileFactory();
    Schema writeSchema = table().schema();
    StructType sparkWriteType = SparkSchemaUtil.convert(writeSchema);
    SparkAppenderFactory appenders = SparkAppenderFactory.builderFor(table(), writeSchema, sparkWriteType).spec(unpartitionedSpec).build();
    TaskWriter<InternalRow> writer = new UnpartitionedWriter<>(unpartitionedSpec, fileFormat(), appenders, fileFactory, io, TARGET_FILE_SIZE_IN_BYTES);
    try (TaskWriter<InternalRow> closableWriter = writer) {
        for (InternalRow row : rows) {
            closableWriter.write(row);
        }
    }
    blackhole.consume(writer.complete());
}
Also used: OutputFileFactory(org.apache.iceberg.io.OutputFileFactory), StructType(org.apache.spark.sql.types.StructType), Schema(org.apache.iceberg.Schema), UnpartitionedWriter(org.apache.iceberg.io.UnpartitionedWriter), InternalRow(org.apache.spark.sql.catalyst.InternalRow), FileIO(org.apache.iceberg.io.FileIO), Threads(org.openjdk.jmh.annotations.Threads), Benchmark(org.openjdk.jmh.annotations.Benchmark)
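
The rows field iterated here is prepared elsewhere in the benchmark and is not shown. A minimal sketch of how such InternalRow test data can be built with GenericInternalRow follows; the (long id, string data) layout and values are assumptions for illustration, not the benchmark's actual schema.

// Illustrative only: build InternalRow instances for an assumed
// (id BIGINT, data STRING) layout using GenericInternalRow and UTF8String.
List<InternalRow> rows = new ArrayList<>();           // java.util.ArrayList, java.util.List
for (long id = 0; id < 1000; id++) {
    rows.add(new GenericInternalRow(
            new Object[] { id, UTF8String.fromString("record-" + id) }));
}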

Aggregations

InternalRow (org.apache.spark.sql.catalyst.InternalRow): 110 usages
GenericInternalRow (org.apache.spark.sql.catalyst.expressions.GenericInternalRow): 33 usages
Row (org.apache.spark.sql.Row): 30 usages
StructType (org.apache.spark.sql.types.StructType): 29 usages
Test (org.junit.Test): 28 usages
Schema (org.apache.iceberg.Schema): 17 usages
ArrayList (java.util.ArrayList): 16 usages
List (java.util.List): 16 usages
Test (org.junit.jupiter.api.Test): 14 usages
File (java.io.File): 13 usages
ParameterizedTest (org.junit.jupiter.params.ParameterizedTest): 13 usages
IOException (java.io.IOException): 12 usages
HoodieWriteConfig (org.apache.hudi.config.HoodieWriteConfig): 12 usages
Types (org.apache.iceberg.types.Types): 12 usages
OutputFileFactory (org.apache.iceberg.io.OutputFileFactory): 11 usages
GenericRecord (org.apache.avro.generic.GenericRecord): 10 usages
HoodieKey (org.apache.hudi.common.model.HoodieKey): 10 usages
FileAppender (org.apache.iceberg.io.FileAppender): 10 usages
Map (java.util.Map): 9 usages
Assert (org.junit.Assert): 9 usages