Example 1 with Transform

Use of org.apache.spark.sql.connector.expressions.Transform in the apache/iceberg project.

From the class Spark3Util, method toPartitionSpec.

/**
 * Converts Spark transforms into a {@link PartitionSpec}.
 *
 * @param schema the table schema
 * @param partitioning Spark Transforms
 * @return a PartitionSpec
 */
public static PartitionSpec toPartitionSpec(Schema schema, Transform[] partitioning) {
    if (partitioning == null || partitioning.length == 0) {
        return PartitionSpec.unpartitioned();
    }
    PartitionSpec.Builder builder = PartitionSpec.builderFor(schema);
    for (Transform transform : partitioning) {
        Preconditions.checkArgument(transform.references().length == 1, "Cannot convert transform with more than one column reference: %s", transform);
        String colName = DOT.join(transform.references()[0].fieldNames());
        switch (transform.name()) {
            case "identity":
                builder.identity(colName);
                break;
            case "bucket":
                builder.bucket(colName, findWidth(transform));
                break;
            case "years":
                builder.year(colName);
                break;
            case "months":
                builder.month(colName);
                break;
            case "date":
            case "days":
                builder.day(colName);
                break;
            case "date_hour":
            case "hours":
                builder.hour(colName);
                break;
            case "truncate":
                builder.truncate(colName, findWidth(transform));
                break;
            default:
                throw new UnsupportedOperationException("Transform is not supported: " + transform);
        }
    }
    return builder.build();
}
Also used: Transform(org.apache.spark.sql.connector.expressions.Transform) PartitionSpec(org.apache.iceberg.PartitionSpec)
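
A minimal caller sketch, assuming Spark's org.apache.spark.sql.connector.expressions.Expressions factory and a hypothetical two-column schema, showing how the Spark-side transforms map onto the builder calls above:

import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.types.Types;
import org.apache.spark.sql.connector.expressions.Expressions;
import org.apache.spark.sql.connector.expressions.Transform;

// Hypothetical schema: a long id and a timestamp column.
Schema schema = new Schema(
    Types.NestedField.required(1, "id", Types.LongType.get()),
    Types.NestedField.required(2, "ts", Types.TimestampType.withZone()));

Transform[] partitioning = new Transform[] {
    // name() is "bucket"; findWidth(transform) recovers the literal 16
    Expressions.bucket(16, "id"),
    // name() is "days"; handled by the "date"/"days" case
    Expressions.days("ts")
};

// Equivalent to PartitionSpec.builderFor(schema).bucket("id", 16).day("ts").build()
PartitionSpec spec = Spark3Util.toPartitionSpec(schema, partitioning);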

Example 2 with Transform

Use of org.apache.spark.sql.connector.expressions.Transform in the apache/iceberg project.

From the class BaseTableCreationSparkAction, method stageDestTable.

protected StagedSparkTable stageDestTable() {
    try {
        Map<String, String> props = destTableProps();
        StructType schema = sourceTable.schema();
        Transform[] partitioning = sourceTable.partitioning();
        return (StagedSparkTable) destCatalog().stageCreate(destTableIdent(), schema, partitioning, props);
    } catch (org.apache.spark.sql.catalyst.analysis.NoSuchNamespaceException e) {
        throw new NoSuchNamespaceException("Cannot create table %s as the namespace does not exist", destTableIdent());
    } catch (org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException e) {
        throw new AlreadyExistsException("Cannot create table %s as it already exists", destTableIdent());
    }
}
Also used: StructType(org.apache.spark.sql.types.StructType) AlreadyExistsException(org.apache.iceberg.exceptions.AlreadyExistsException) NoSuchNamespaceException(org.apache.iceberg.exceptions.NoSuchNamespaceException) StagedSparkTable(org.apache.iceberg.spark.source.StagedSparkTable) Transform(org.apache.spark.sql.connector.expressions.Transform)
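
The staged table returned here is not yet visible in the catalog. A hedged sketch of the surrounding lifecycle, using the commitStagedChanges()/abortStagedChanges() methods of Spark's StagedTable contract (the control flow is illustrative, not this action's exact code):

StagedSparkTable staged = stageDestTable();
try {
    // ... import the source table's metadata and data files here ...
    staged.commitStagedChanges();  // publishes the new table in the destination catalog
} catch (RuntimeException e) {
    staged.abortStagedChanges();   // discards the staged table on failure
    throw e;
}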

Example 3 with Transform

Use of org.apache.spark.sql.connector.expressions.Transform in the apache/iceberg project.

From the class TestRemoveOrphanFilesAction3, method testSparkCatalogNamedHiveTable.

@Test
public void testSparkCatalogNamedHiveTable() throws Exception {
    spark.conf().set("spark.sql.catalog.hive", "org.apache.iceberg.spark.SparkCatalog");
    spark.conf().set("spark.sql.catalog.hive.type", "hadoop");
    spark.conf().set("spark.sql.catalog.hive.warehouse", tableLocation);
    SparkCatalog cat = (SparkCatalog) spark.sessionState().catalogManager().catalog("hive");
    String[] database = { "default" };
    Identifier id = Identifier.of(database, "table");
    Map<String, String> options = Maps.newHashMap();
    Transform[] transforms = {};
    cat.createTable(id, SparkSchemaUtil.convert(SCHEMA), transforms, options);
    SparkTable table = cat.loadTable(id);
    spark.sql("INSERT INTO hive.default.table VALUES (1,1,1)");
    String location = table.table().location().replaceFirst("file:", "");
    // Plant an orphan file inside the table's data directory.
    new File(location + "/data/trashfile").createNewFile();
    // A cutoff one second in the future makes the freshly created file old enough to be removed.
    DeleteOrphanFiles.Result results = SparkActions.get()
        .deleteOrphanFiles(table.table())
        .olderThan(System.currentTimeMillis() + 1000)
        .execute();
    Assert.assertTrue("trash file should be removed", StreamSupport.stream(results.orphanFileLocations().spliterator(), false).anyMatch(file -> file.contains("file:" + location + "/data/trashfile")));
}
Also used: SparkCatalog(org.apache.iceberg.spark.SparkCatalog) Maps(org.apache.iceberg.relocated.com.google.common.collect.Maps) Test(org.junit.Test) DeleteOrphanFiles(org.apache.iceberg.actions.DeleteOrphanFiles) SparkSchemaUtil(org.apache.iceberg.spark.SparkSchemaUtil) File(java.io.File) SparkSessionCatalog(org.apache.iceberg.spark.SparkSessionCatalog) Map(java.util.Map) After(org.junit.After) Transform(org.apache.spark.sql.connector.expressions.Transform) StreamSupport(java.util.stream.StreamSupport) Identifier(org.apache.spark.sql.connector.catalog.Identifier) Assert(org.junit.Assert) SparkTable(org.apache.iceberg.spark.source.SparkTable)
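
The test passes a cutoff in the future so the freshly planted file qualifies immediately; outside tests the cutoff normally points into the past. A minimal sketch, where the three-day interval and the table variable are illustrative assumptions:

import java.util.concurrent.TimeUnit;
import org.apache.iceberg.actions.DeleteOrphanFiles;
import org.apache.iceberg.spark.actions.SparkActions;

// Delete files under the table location that no snapshot references
// and that are older than three days.
long cutoff = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(3);
DeleteOrphanFiles.Result result = SparkActions.get()
    .deleteOrphanFiles(table.table())
    .olderThan(cutoff)
    .execute();
result.orphanFileLocations().forEach(path -> System.out.println("Removed: " + path));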

Example 4 with Transform

Use of org.apache.spark.sql.connector.expressions.Transform in the apache/iceberg project.

From the class TestRemoveOrphanFilesAction3, method testSparkSessionCatalogHiveTable.

@Test
public void testSparkSessionCatalogHiveTable() throws Exception {
    spark.conf().set("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog");
    spark.conf().set("spark.sql.catalog.spark_catalog.type", "hive");
    SparkSessionCatalog cat = (SparkSessionCatalog) spark.sessionState().catalogManager().v2SessionCatalog();
    String[] database = { "default" };
    Identifier id = Identifier.of(database, "sessioncattest");
    Map<String, String> options = Maps.newHashMap();
    Transform[] transforms = {};
    cat.dropTable(id);
    cat.createTable(id, SparkSchemaUtil.convert(SCHEMA), transforms, options);
    SparkTable table = (SparkTable) cat.loadTable(id);
    spark.sql("INSERT INTO default.sessioncattest VALUES (1,1,1)");
    String location = table.table().location().replaceFirst("file:", "");
    // Plant an orphan file, as in the previous test.
    new File(location + "/data/trashfile").createNewFile();
    // Future cutoff so the new file qualifies immediately.
    DeleteOrphanFiles.Result results = SparkActions.get()
        .deleteOrphanFiles(table.table())
        .olderThan(System.currentTimeMillis() + 1000)
        .execute();
    Assert.assertTrue("trash file should be removed", StreamSupport.stream(results.orphanFileLocations().spliterator(), false).anyMatch(file -> file.contains("file:" + location + "/data/trashfile")));
}
Also used: SparkSessionCatalog(org.apache.iceberg.spark.SparkSessionCatalog) SparkCatalog(org.apache.iceberg.spark.SparkCatalog) Maps(org.apache.iceberg.relocated.com.google.common.collect.Maps) Test(org.junit.Test) DeleteOrphanFiles(org.apache.iceberg.actions.DeleteOrphanFiles) SparkSchemaUtil(org.apache.iceberg.spark.SparkSchemaUtil) File(java.io.File) Map(java.util.Map) After(org.junit.After) Transform(org.apache.spark.sql.connector.expressions.Transform) StreamSupport(java.util.stream.StreamSupport) Identifier(org.apache.spark.sql.connector.catalog.Identifier) Assert(org.junit.Assert) SparkTable(org.apache.iceberg.spark.source.SparkTable)
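
The two conf().set calls above can equivalently be baked in when the session is constructed; a minimal sketch (the master setting is an illustrative assumption):

import org.apache.spark.sql.SparkSession;

// Replace the built-in spark_catalog with Iceberg's SparkSessionCatalog,
// which handles Iceberg tables itself and delegates everything else to Hive.
SparkSession spark = SparkSession.builder()
    .master("local[2]")
    .config("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog")
    .config("spark.sql.catalog.spark_catalog.type", "hive")
    .getOrCreate();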

Example 5 with Transform

Use of org.apache.spark.sql.connector.expressions.Transform in the apache/iceberg project.

From the class Spark3Util, method toIcebergTerm.

public static Term toIcebergTerm(Expression expr) {
    if (expr instanceof Transform) {
        Transform transform = (Transform) expr;
        Preconditions.checkArgument(transform.references().length == 1, "Cannot convert transform with more than one column reference: %s", transform);
        String colName = DOT.join(transform.references()[0].fieldNames());
        switch (transform.name()) {
            case "identity":
                return org.apache.iceberg.expressions.Expressions.ref(colName);
            case "bucket":
                return org.apache.iceberg.expressions.Expressions.bucket(colName, findWidth(transform));
            case "years":
                return org.apache.iceberg.expressions.Expressions.year(colName);
            case "months":
                return org.apache.iceberg.expressions.Expressions.month(colName);
            case "date":
            case "days":
                return org.apache.iceberg.expressions.Expressions.day(colName);
            case "date_hour":
            case "hours":
                return org.apache.iceberg.expressions.Expressions.hour(colName);
            case "truncate":
                return org.apache.iceberg.expressions.Expressions.truncate(colName, findWidth(transform));
            default:
                throw new UnsupportedOperationException("Transform is not supported: " + transform);
        }
    } else if (expr instanceof NamedReference) {
        NamedReference ref = (NamedReference) expr;
        return org.apache.iceberg.expressions.Expressions.ref(DOT.join(ref.fieldNames()));
    } else {
        throw new UnsupportedOperationException("Cannot convert unknown expression: " + expr);
    }
}
Also used: NamedReference(org.apache.spark.sql.connector.expressions.NamedReference) Transform(org.apache.spark.sql.connector.expressions.Transform)
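
A minimal round-trip sketch with hypothetical column names, again assuming Spark's Expressions factory; both branches of the method are exercised:

import org.apache.iceberg.expressions.Term;
import org.apache.spark.sql.connector.expressions.Expressions;

// A Transform converts to the corresponding Iceberg term ...
Term bucketTerm = Spark3Util.toIcebergTerm(Expressions.bucket(8, "user_id")); // bucket("user_id", 8)
Term dayTerm = Spark3Util.toIcebergTerm(Expressions.days("event_ts"));        // day("event_ts")
// ... and a bare NamedReference becomes a simple column reference.
Term refTerm = Spark3Util.toIcebergTerm(Expressions.column("user_id"));       // ref("user_id")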

Aggregations

Transform (org.apache.spark.sql.connector.expressions.Transform): 8 usages
File (java.io.File): 5 usages
Map (java.util.Map): 5 usages
StreamSupport (java.util.stream.StreamSupport): 5 usages
DeleteOrphanFiles (org.apache.iceberg.actions.DeleteOrphanFiles): 5 usages
Maps (org.apache.iceberg.relocated.com.google.common.collect.Maps): 5 usages
SparkCatalog (org.apache.iceberg.spark.SparkCatalog): 5 usages
SparkSchemaUtil (org.apache.iceberg.spark.SparkSchemaUtil): 5 usages
SparkSessionCatalog (org.apache.iceberg.spark.SparkSessionCatalog): 5 usages
SparkTable (org.apache.iceberg.spark.source.SparkTable): 5 usages
Identifier (org.apache.spark.sql.connector.catalog.Identifier): 5 usages
After (org.junit.After): 5 usages
Assert (org.junit.Assert): 5 usages
Test (org.junit.Test): 5 usages
PartitionSpec (org.apache.iceberg.PartitionSpec): 1 usage
AlreadyExistsException (org.apache.iceberg.exceptions.AlreadyExistsException): 1 usage
NoSuchNamespaceException (org.apache.iceberg.exceptions.NoSuchNamespaceException): 1 usage
StagedSparkTable (org.apache.iceberg.spark.source.StagedSparkTable): 1 usage
NamedReference (org.apache.spark.sql.connector.expressions.NamedReference): 1 usage
StructType (org.apache.spark.sql.types.StructType): 1 usage