
Example 6 with Row

Use of org.apache.spark.sql.Row in the apache/iceberg project.

From class SchemaEvolutionTest, method before().

@Before
public void before() throws IOException {
    tableLocation = Files.createTempDirectory("temp").toFile();
    Schema schema = new Schema(
        optional(1, "title", Types.StringType.get()),
        optional(2, "price", Types.IntegerType.get()),
        optional(3, "author", Types.StringType.get()),
        optional(4, "published", Types.TimestampType.withZone()),
        optional(5, "genre", Types.StringType.get()));
    PartitionSpec spec = PartitionSpec.builderFor(schema).year("published").build();
    HadoopTables tables = new HadoopTables(spark.sessionState().newHadoopConf());
    table = tables.create(schema, spec, tableLocation.toString());
    Dataset<Row> df = spark.read().json(dataLocation + "/books.json");
    df.select(
            df.col("title"),
            df.col("price").cast(DataTypes.IntegerType),
            df.col("author"),
            df.col("published").cast(DataTypes.TimestampType),
            df.col("genre"))
        .write()
        .format("iceberg")
        .mode("append")
        .save(tableLocation.toString());
    table.refresh();
}
Also used: Schema (org.apache.iceberg.Schema), HadoopTables (org.apache.iceberg.hadoop.HadoopTables), Row (org.apache.spark.sql.Row), PartitionSpec (org.apache.iceberg.PartitionSpec), Before (org.junit.Before)
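The `year("published")` partition spec above buckets rows by the year of the `published` timestamp. Iceberg's year transform stores the number of years between the value and the Unix epoch (1970), not the calendar year itself, so a 2020 timestamp partitions as 50. A minimal stdlib-only sketch of that derivation (the method name `yearsFrom1970` is mine, not Iceberg's, and the simplification ignores pre-1970 edge cases):

```java
import java.time.OffsetDateTime;
import java.time.ZoneOffset;

public class YearTransformSketch {

    // Simplified view of Iceberg's year transform: years elapsed since 1970.
    // (Pre-1970 timestamps would need floor semantics; omitted here.)
    static int yearsFrom1970(OffsetDateTime ts) {
        return ts.getYear() - 1970;
    }

    public static void main(String[] args) {
        OffsetDateTime published = OffsetDateTime.of(2020, 6, 15, 0, 0, 0, 0, ZoneOffset.UTC);
        System.out.println(yearsFrom1970(published)); // prints 50
    }
}
```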

Example 7 with Row

Use of org.apache.spark.sql.Row in the apache/iceberg project.

From class TestSparkSchema, method testFailIfSparkReadSchemaIsOff().

@Test
public void testFailIfSparkReadSchemaIsOff() throws IOException {
    String tableLocation = temp.newFolder("iceberg-table").toString();
    HadoopTables tables = new HadoopTables(CONF);
    PartitionSpec spec = PartitionSpec.unpartitioned();
    tables.create(SCHEMA, spec, null, tableLocation);
    List<SimpleRecord> expectedRecords = Lists.newArrayList(new SimpleRecord(1, "a"));
    Dataset<Row> originalDf = spark.createDataFrame(expectedRecords, SimpleRecord.class);
    originalDf.select("id", "data").write().format("iceberg").mode("append").save(tableLocation);
    StructType sparkReadSchema = new StructType(new StructField[] {
        // wrong field name
        new StructField("idd", DataTypes.IntegerType, true, Metadata.empty())
    });
    AssertHelpers.assertThrows(
        "Iceberg should not allow a projection that contain unknown fields",
        java.lang.IllegalArgumentException.class,
        "Field idd not found in source schema",
        () -> spark.read().schema(sparkReadSchema).format("iceberg").load(tableLocation));
}
Also used: StructField (org.apache.spark.sql.types.StructField), StructType (org.apache.spark.sql.types.StructType), HadoopTables (org.apache.iceberg.hadoop.HadoopTables), Row (org.apache.spark.sql.Row), PartitionSpec (org.apache.iceberg.PartitionSpec), Test (org.junit.Test)
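The test above relies on Iceberg failing fast when a user-supplied read schema names a field the table does not have. The check itself can be pictured without any Spark or Iceberg dependency; this is a toy stand-in (the method `validateProjection` is hypothetical, not an Iceberg API), reproducing only the shape of the error:

```java
import java.util.List;
import java.util.Set;

public class ReadSchemaValidationSketch {

    // Hypothetical stand-in for the validation: every field requested by the
    // read schema must exist in the source schema, otherwise fail fast with
    // an IllegalArgumentException naming the offending field.
    static void validateProjection(Set<String> sourceFields, List<String> readFields) {
        for (String field : readFields) {
            if (!sourceFields.contains(field)) {
                throw new IllegalArgumentException("Field " + field + " not found in source schema");
            }
        }
    }

    public static void main(String[] args) {
        Set<String> sourceFields = Set.of("id", "data");
        try {
            // "idd" is the wrong field name, mirroring the test above
            validateProjection(sourceFields, List.of("idd"));
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // prints: Field idd not found in source schema
        }
    }
}
```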

Example 8 with Row

Use of org.apache.spark.sql.Row in the apache/iceberg project.

From class TestSparkSchema, method testSparkReadSchemaIsHonored().

@Test
public void testSparkReadSchemaIsHonored() throws IOException {
    String tableLocation = temp.newFolder("iceberg-table").toString();
    HadoopTables tables = new HadoopTables(CONF);
    PartitionSpec spec = PartitionSpec.unpartitioned();
    tables.create(SCHEMA, spec, null, tableLocation);
    List<SimpleRecord> expectedRecords = Lists.newArrayList(new SimpleRecord(1, "a"));
    Dataset<Row> originalDf = spark.createDataFrame(expectedRecords, SimpleRecord.class);
    originalDf.select("id", "data").write().format("iceberg").mode("append").save(tableLocation);
    StructType sparkReadSchema = new StructType(new StructField[] {
        new StructField("id", DataTypes.IntegerType, true, Metadata.empty())
    });
    Dataset<Row> resultDf = spark.read().schema(sparkReadSchema).format("iceberg").load(tableLocation);
    Row[] results = (Row[]) resultDf.collect();
    Assert.assertEquals("Result size matches", 1, results.length);
    Assert.assertEquals("Row length matches with sparkReadSchema", 1, results[0].length());
    Assert.assertEquals("Row content matches data", 1, results[0].getInt(0));
}
Also used: StructField (org.apache.spark.sql.types.StructField), StructType (org.apache.spark.sql.types.StructType), HadoopTables (org.apache.iceberg.hadoop.HadoopTables), Row (org.apache.spark.sql.Row), PartitionSpec (org.apache.iceberg.PartitionSpec), Test (org.junit.Test)
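What "the read schema is honored" means here is plain column pruning: the table holds `id` and `data`, but each returned Row carries only the single `id` column the schema asked for. A toy, Spark-free sketch of that pruning, with rows modeled as maps (the `project` helper is mine, purely for illustration):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ColumnPruningSketch {

    // Toy stand-in for what a user-supplied read schema does on read:
    // only the requested columns survive, in the requested order.
    static Map<String, Object> project(Map<String, Object> row, List<String> readSchema) {
        Map<String, Object> pruned = new LinkedHashMap<>();
        for (String col : readSchema) {
            pruned.put(col, row.get(col));
        }
        return pruned;
    }

    public static void main(String[] args) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("id", 1);
        row.put("data", "a");
        // Mirrors the test: the stored row has two columns, the result has one.
        System.out.println(project(row, List.of("id"))); // prints {id=1}
    }
}
```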

Example 9 with Row

Use of org.apache.spark.sql.Row in the apache/iceberg project.

From class TestSparkSchema, method testSparkReadSchemaCombinedWithProjection().

@Test
public void testSparkReadSchemaCombinedWithProjection() throws IOException {
    String tableLocation = temp.newFolder("iceberg-table").toString();
    HadoopTables tables = new HadoopTables(CONF);
    PartitionSpec spec = PartitionSpec.unpartitioned();
    tables.create(SCHEMA, spec, null, tableLocation);
    List<SimpleRecord> expectedRecords = Lists.newArrayList(new SimpleRecord(1, "a"));
    Dataset<Row> originalDf = spark.createDataFrame(expectedRecords, SimpleRecord.class);
    originalDf.select("id", "data").write().format("iceberg").mode("append").save(tableLocation);
    StructType sparkReadSchema = new StructType(new StructField[] {
        new StructField("id", DataTypes.IntegerType, true, Metadata.empty()),
        new StructField("data", DataTypes.StringType, true, Metadata.empty())
    });
    Dataset<Row> resultDf = spark.read().schema(sparkReadSchema).format("iceberg").load(tableLocation).select("id");
    Row[] results = (Row[]) resultDf.collect();
    Assert.assertEquals("Result size matches", 1, results.length);
    Assert.assertEquals("Row length matches with sparkReadSchema", 1, results[0].length());
    Assert.assertEquals("Row content matches data", 1, results[0].getInt(0));
}
Also used: StructField (org.apache.spark.sql.types.StructField), StructType (org.apache.spark.sql.types.StructType), HadoopTables (org.apache.iceberg.hadoop.HadoopTables), Row (org.apache.spark.sql.Row), PartitionSpec (org.apache.iceberg.PartitionSpec), Test (org.junit.Test)
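The point of this test is that a `select()` on top of a read schema is just a second projection: the read schema narrows the table to `id` and `data`, and the select then keeps `id` alone. Conceptually the result keeps the columns named by the select that also survive the read schema; a toy sketch of that composition (the `compose` helper is mine, not a Spark API):

```java
import java.util.List;
import java.util.stream.Collectors;

public class ChainedProjectionSketch {

    // Composing two projections: the select order wins, and only columns
    // that the read schema already exposes can be selected.
    static List<String> compose(List<String> readSchema, List<String> select) {
        return select.stream()
            .filter(readSchema::contains)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> readSchema = List.of("id", "data");
        // Mirrors the test: read schema {id, data}, then select("id").
        System.out.println(compose(readSchema, List.of("id"))); // prints [id]
    }
}
```

In the real Spark path the failing variant of this composition is what the next example exercises: selecting a column the read schema does not expose.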

Example 10 with Row

Use of org.apache.spark.sql.Row in the apache/iceberg project.

From class TestSparkSchema, method testFailSparkReadSchemaCombinedWithProjectionWhenSchemaDoesNotContainProjection().

@Test
public void testFailSparkReadSchemaCombinedWithProjectionWhenSchemaDoesNotContainProjection() throws IOException {
    String tableLocation = temp.newFolder("iceberg-table").toString();
    HadoopTables tables = new HadoopTables(CONF);
    PartitionSpec spec = PartitionSpec.unpartitioned();
    tables.create(SCHEMA, spec, null, tableLocation);
    List<SimpleRecord> expectedRecords = Lists.newArrayList(new SimpleRecord(1, "a"));
    Dataset<Row> originalDf = spark.createDataFrame(expectedRecords, SimpleRecord.class);
    originalDf.select("id", "data").write().format("iceberg").mode("append").save(tableLocation);
    StructType sparkReadSchema = new StructType(new StructField[] {
        new StructField("data", DataTypes.StringType, true, Metadata.empty())
    });
    AssertHelpers.assertThrows(
        "Spark should not allow a projection that is not included in the read schema",
        org.apache.spark.sql.AnalysisException.class,
        "cannot resolve '`id`' given input columns: [data]",
        () -> spark.read().schema(sparkReadSchema).format("iceberg").load(tableLocation).select("id"));
}
Also used: StructField (org.apache.spark.sql.types.StructField), StructType (org.apache.spark.sql.types.StructType), HadoopTables (org.apache.iceberg.hadoop.HadoopTables), Row (org.apache.spark.sql.Row), PartitionSpec (org.apache.iceberg.PartitionSpec), Test (org.junit.Test)

Aggregations

Row (org.apache.spark.sql.Row): 1045
Test (org.junit.Test): 344
ArrayList (java.util.ArrayList): 244
SparkSession (org.apache.spark.sql.SparkSession): 243
StructType (org.apache.spark.sql.types.StructType): 215
Test (org.junit.jupiter.api.Test): 157
StructField (org.apache.spark.sql.types.StructField): 138
Table (org.apache.iceberg.Table): 127
Dataset (org.apache.spark.sql.Dataset): 123
List (java.util.List): 115
Script (org.apache.sysml.api.mlcontext.Script): 104
JavaSparkContext (org.apache.spark.api.java.JavaSparkContext): 101
IOException (java.io.IOException): 78
Column (org.apache.spark.sql.Column): 78
File (java.io.File): 76
Collectors (java.util.stream.Collectors): 73
PartitionSpec (org.apache.iceberg.PartitionSpec): 70
DatasetBuilder (au.csiro.pathling.test.builders.DatasetBuilder): 66
Map (java.util.Map): 66
HadoopTables (org.apache.iceberg.hadoop.HadoopTables): 61