
Example 1 with SparkCatalog

Use of org.apache.iceberg.spark.SparkCatalog in project iceberg by apache.

From class TestRemoveOrphanFilesAction3, method testSparkCatalogNamedHiveTable.

@Test
public void testSparkCatalogNamedHiveTable() throws Exception {
    spark.conf().set("spark.sql.catalog.hive", "org.apache.iceberg.spark.SparkCatalog");
    spark.conf().set("spark.sql.catalog.hive.type", "hadoop");
    spark.conf().set("spark.sql.catalog.hive.warehouse", tableLocation);
    SparkCatalog cat = (SparkCatalog) spark.sessionState().catalogManager().catalog("hive");
    String[] database = { "default" };
    Identifier id = Identifier.of(database, "table");
    Map<String, String> options = Maps.newHashMap();
    Transform[] transforms = {};
    cat.createTable(id, SparkSchemaUtil.convert(SCHEMA), transforms, options);
    SparkTable table = cat.loadTable(id);
    spark.sql("INSERT INTO hive.default.table VALUES (1,1,1)");
    String location = table.table().location().replaceFirst("file:", "");
    new File(location + "/data/trashfile").createNewFile();
    // cutoff one second in the future so the trash file created above counts as orphaned
    DeleteOrphanFiles.Result results = SparkActions.get()
        .deleteOrphanFiles(table.table())
        .olderThan(System.currentTimeMillis() + 1000)
        .execute();
    Assert.assertTrue(
        "trash file should be removed",
        StreamSupport.stream(results.orphanFileLocations().spliterator(), false)
            .anyMatch(file -> file.contains("file:" + location + "/data/trashfile")));
}
Also used : SparkCatalog(org.apache.iceberg.spark.SparkCatalog) Maps(org.apache.iceberg.relocated.com.google.common.collect.Maps) Test(org.junit.Test) DeleteOrphanFiles(org.apache.iceberg.actions.DeleteOrphanFiles) SparkSchemaUtil(org.apache.iceberg.spark.SparkSchemaUtil) File(java.io.File) SparkSessionCatalog(org.apache.iceberg.spark.SparkSessionCatalog) Map(java.util.Map) After(org.junit.After) Transform(org.apache.spark.sql.connector.expressions.Transform) StreamSupport(java.util.stream.StreamSupport) Identifier(org.apache.spark.sql.connector.catalog.Identifier) Assert(org.junit.Assert) SparkTable(org.apache.iceberg.spark.source.SparkTable)
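
The test registers the catalog at runtime through spark.conf().set; the same Hadoop-backed Iceberg catalog can also be declared when the SparkSession is built. A minimal sketch, assuming a local master and an illustrative warehouse path (the catalog keeps the name "hive" from the test even though its type is "hadoop"):

import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
    .master("local[2]")
    .config("spark.sql.catalog.hive", "org.apache.iceberg.spark.SparkCatalog")
    // "hadoop" type: tables are tracked under the warehouse directory, no Hive metastore needed
    .config("spark.sql.catalog.hive.type", "hadoop")
    // illustrative location; the test uses its temporary tableLocation instead
    .config("spark.sql.catalog.hive.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate();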

Example 2 with SparkCatalog

Use of org.apache.iceberg.spark.SparkCatalog in project OpenLineage by OpenLineage.

From class IcebergHandler, method getDatasetVersion.

@SneakyThrows
public Optional<String> getDatasetVersion(TableCatalog tableCatalog, Identifier identifier, Map<String, String> properties) {
    SparkCatalog sparkCatalog = (SparkCatalog) tableCatalog;
    SparkTable table;
    try {
        table = sparkCatalog.loadTable(identifier);
    } catch (NoSuchTableException ex) {
        return Optional.empty();
    }
    if (table.table() != null && table.table().currentSnapshot() != null) {
        return Optional.of(Long.toString(table.table().currentSnapshot().snapshotId()));
    }
    return Optional.empty();
}
Also used : SparkCatalog(org.apache.iceberg.spark.SparkCatalog) NoSuchTableException(org.apache.spark.sql.catalyst.analysis.NoSuchTableException) SparkTable(org.apache.iceberg.spark.source.SparkTable) SneakyThrows(lombok.SneakyThrows)
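
A caller-side sketch of the method above, assuming an IcebergHandler instance plus an already-resolved TableCatalog and Identifier are in scope (all variable names are illustrative):

// Empty both when the table cannot be loaded and when it has no snapshot yet.
Optional<String> version =
    handler.getDatasetVersion(tableCatalog, identifier, java.util.Collections.emptyMap());
String snapshotId = version.orElse("no snapshot");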

Example 3 with SparkCatalog

Use of org.apache.iceberg.spark.SparkCatalog in project OpenLineage by OpenLineage.

From class IcebergHandler, method getDatasetIdentifier.

@Override
public DatasetIdentifier getDatasetIdentifier(SparkSession session, TableCatalog tableCatalog, Identifier identifier, Map<String, String> properties) {
    SparkCatalog sparkCatalog = (SparkCatalog) tableCatalog;
    String catalogName = sparkCatalog.name();
    String prefix = String.format("spark.sql.catalog.%s", catalogName);
    Map<String, String> conf = ScalaConversionUtils.<String, String>fromMap(session.conf().getAll());
    log.info(conf.toString());
    Map<String, String> catalogConf = conf.entrySet().stream()
        .filter(x -> x.getKey().startsWith(prefix))
        .filter(x -> x.getKey().length() > prefix.length())
        .collect(Collectors.toMap(
            // handle dot after prefix
            x -> x.getKey().substring(prefix.length() + 1),
            Map.Entry::getValue));
    log.info(catalogConf.toString());
    if (catalogConf.isEmpty() || !catalogConf.containsKey("type")) {
        throw new UnsupportedCatalogException(catalogName);
    }
    log.info(catalogConf.get("type"));
    switch(catalogConf.get("type")) {
        case "hadoop":
            return getHadoopIdentifier(catalogConf, identifier.toString());
        case "hive":
            return getHiveIdentifier(session, catalogConf.get(CatalogProperties.URI), identifier.toString());
        default:
            throw new UnsupportedCatalogException(catalogConf.get("type"));
    }
}
Also used : SparkCatalog(org.apache.iceberg.spark.SparkCatalog) SneakyThrows(lombok.SneakyThrows) DatasetIdentifier(io.openlineage.spark.agent.util.DatasetIdentifier) PathUtils(io.openlineage.spark.agent.util.PathUtils) ScalaConversionUtils(io.openlineage.spark.agent.util.ScalaConversionUtils) Collectors(java.util.stream.Collectors) CatalogProperties(org.apache.iceberg.CatalogProperties) Slf4j(lombok.extern.slf4j.Slf4j) TableCatalog(org.apache.spark.sql.connector.catalog.TableCatalog) NoSuchTableException(org.apache.spark.sql.catalyst.analysis.NoSuchTableException) TableProviderFacet(io.openlineage.spark.agent.facets.TableProviderFacet) Map(java.util.Map) Optional(java.util.Optional) Path(org.apache.hadoop.fs.Path) URI(java.net.URI) Identifier(org.apache.spark.sql.connector.catalog.Identifier) SparkTable(org.apache.iceberg.spark.source.SparkTable) Nullable(javax.annotation.Nullable) SparkConfUtils(io.openlineage.spark.agent.util.SparkConfUtils) SparkSession(org.apache.spark.sql.SparkSession)
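
The core of the method is the prefix filter: every spark.sql.catalog.<name>.* entry of the session config is reduced to its suffix (type, warehouse, uri, and so on). A standalone sketch of just that step, assuming a plain java.util.Map of config values (the class name and sample keys are illustrative):

import java.util.Map;
import java.util.stream.Collectors;

public class CatalogConfFilterSketch {

    static Map<String, String> catalogConf(Map<String, String> conf, String catalogName) {
        String prefix = String.format("spark.sql.catalog.%s", catalogName);
        return conf.entrySet().stream()
            .filter(e -> e.getKey().startsWith(prefix))
            // drop the bare "spark.sql.catalog.<name>" key that holds the catalog class itself
            .filter(e -> e.getKey().length() > prefix.length())
            .collect(Collectors.toMap(
                // +1 skips the dot that follows the prefix
                e -> e.getKey().substring(prefix.length() + 1),
                Map.Entry::getValue));
    }

    public static void main(String[] args) {
        Map<String, String> conf = Map.of(
            "spark.sql.catalog.test", "org.apache.iceberg.spark.SparkCatalog",
            "spark.sql.catalog.test.type", "hive",
            "spark.sql.catalog.test.uri", "thrift://metastore-host:10001");
        // prints the two suffixed entries: type=hive and uri=thrift://metastore-host:10001
        System.out.println(catalogConf(conf, "test"));
    }
}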

Example 4 with SparkCatalog

Use of org.apache.iceberg.spark.SparkCatalog in project OpenLineage by OpenLineage.

From class IcebergHandlerTest, method testGetDatasetIdentifierForHive.

@Test
public void testGetDatasetIdentifierForHive() {
    when(sparkSession.conf()).thenReturn(runtimeConfig);
    when(runtimeConfig.getAll()).thenReturn(new Map.Map2<>("spark.sql.catalog.test.type", "hive", "spark.sql.catalog.test.uri", "thrift://metastore-host:10001"));
    SparkCatalog sparkCatalog = mock(SparkCatalog.class);
    when(sparkCatalog.name()).thenReturn("test");
    DatasetIdentifier datasetIdentifier = icebergHandler.getDatasetIdentifier(
        sparkSession,
        sparkCatalog,
        Identifier.of(new String[] { "database", "schema" }, "table"),
        new HashMap<>());
    assertEquals("database.schema.table", datasetIdentifier.getName());
    assertEquals("hive://metastore-host:10001", datasetIdentifier.getNamespace());
}
Also used : SparkCatalog(org.apache.iceberg.spark.SparkCatalog) DatasetIdentifier(io.openlineage.spark.agent.util.DatasetIdentifier) HashMap(java.util.HashMap) Map(scala.collection.immutable.Map) Test(org.junit.jupiter.api.Test) ParameterizedTest(org.junit.jupiter.params.ParameterizedTest)
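
The assertion implies that the thrift metastore URI from the catalog config is rewritten into a hive:// namespace. A minimal sketch of that kind of rewrite using only java.net.URI; this is illustrative and not OpenLineage's actual getHiveIdentifier implementation:

import java.net.URI;

class HiveNamespaceSketch {
    public static void main(String[] args) {
        URI metastore = URI.create("thrift://metastore-host:10001");
        String namespace = "hive://" + metastore.getHost() + ":" + metastore.getPort();
        System.out.println(namespace); // hive://metastore-host:10001, as asserted above
    }
}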

Example 5 with SparkCatalog

Use of org.apache.iceberg.spark.SparkCatalog in project iceberg by apache.

From class TestRemoveOrphanFilesAction3, method testSparkCatalogNamedHadoopTable.

@Test
public void testSparkCatalogNamedHadoopTable() throws Exception {
    spark.conf().set("spark.sql.catalog.hadoop", "org.apache.iceberg.spark.SparkCatalog");
    spark.conf().set("spark.sql.catalog.hadoop.type", "hadoop");
    spark.conf().set("spark.sql.catalog.hadoop.warehouse", tableLocation);
    SparkCatalog cat = (SparkCatalog) spark.sessionState().catalogManager().catalog("hadoop");
    String[] database = { "default" };
    Identifier id = Identifier.of(database, "table");
    Map<String, String> options = Maps.newHashMap();
    Transform[] transforms = {};
    cat.createTable(id, SparkSchemaUtil.convert(SCHEMA), transforms, options);
    SparkTable table = cat.loadTable(id);
    spark.sql("INSERT INTO hadoop.default.table VALUES (1,1,1)");
    String location = table.table().location().replaceFirst("file:", "");
    new File(location + "/data/trashfile").createNewFile();
    // cutoff one second in the future so the trash file created above counts as orphaned
    DeleteOrphanFiles.Result results = SparkActions.get()
        .deleteOrphanFiles(table.table())
        .olderThan(System.currentTimeMillis() + 1000)
        .execute();
    Assert.assertTrue(
        "trash file should be removed",
        StreamSupport.stream(results.orphanFileLocations().spliterator(), false)
            .anyMatch(file -> file.contains("file:" + location + "/data/trashfile")));
}
Also used : SparkCatalog(org.apache.iceberg.spark.SparkCatalog) Maps(org.apache.iceberg.relocated.com.google.common.collect.Maps) Test(org.junit.Test) DeleteOrphanFiles(org.apache.iceberg.actions.DeleteOrphanFiles) SparkSchemaUtil(org.apache.iceberg.spark.SparkSchemaUtil) File(java.io.File) SparkSessionCatalog(org.apache.iceberg.spark.SparkSessionCatalog) Map(java.util.Map) After(org.junit.After) Transform(org.apache.spark.sql.connector.expressions.Transform) StreamSupport(java.util.stream.StreamSupport) Identifier(org.apache.spark.sql.connector.catalog.Identifier) Assert(org.junit.Assert) SparkTable(org.apache.iceberg.spark.source.SparkTable)
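
Both orphan-file tests pass a cutoff one second in the future so that the trash file they just created still counts as orphaned; in normal operation the cutoff lies in the past. A minimal sketch with a more typical retention window, assuming an already-loaded org.apache.iceberg.Table and the SparkActions entry point used in the tests (the variable names, the import package for SparkActions, and the three-day window are assumptions for recent Iceberg versions):

import java.util.concurrent.TimeUnit;
import org.apache.iceberg.actions.DeleteOrphanFiles;
import org.apache.iceberg.spark.actions.SparkActions;

// only files older than three days are candidates for removal
long cutoff = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(3);
DeleteOrphanFiles.Result result = SparkActions.get()
    .deleteOrphanFiles(table)   // `table` is an org.apache.iceberg.Table loaded elsewhere
    .olderThan(cutoff)
    .execute();
result.orphanFileLocations().forEach(path -> System.out.println("removed: " + path));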

Aggregations

SparkCatalog (org.apache.iceberg.spark.SparkCatalog): 11
SparkTable (org.apache.iceberg.spark.source.SparkTable): 6
Identifier (org.apache.spark.sql.connector.catalog.Identifier): 5
Test (org.junit.Test): 5
DatasetIdentifier (io.openlineage.spark.agent.util.DatasetIdentifier): 4
Map (java.util.Map): 4
File (java.io.File): 3
StreamSupport (java.util.stream.StreamSupport): 3
DeleteOrphanFiles (org.apache.iceberg.actions.DeleteOrphanFiles): 3
Maps (org.apache.iceberg.relocated.com.google.common.collect.Maps): 3
SparkSchemaUtil (org.apache.iceberg.spark.SparkSchemaUtil): 3
SparkSessionCatalog (org.apache.iceberg.spark.SparkSessionCatalog): 3
Transform (org.apache.spark.sql.connector.expressions.Transform): 3
After (org.junit.After): 3
Assert (org.junit.Assert): 3
ParameterizedTest (org.junit.jupiter.params.ParameterizedTest): 3
HashMap (java.util.HashMap): 2
SneakyThrows (lombok.SneakyThrows): 2
NoSuchTableException (org.apache.spark.sql.catalyst.analysis.NoSuchTableException): 2
Test (org.junit.jupiter.api.Test): 2