
Example 1 with DeleteOrphanFiles

Use of org.apache.iceberg.actions.DeleteOrphanFiles in the Apache Iceberg project.

From the class RemoveOrphanFilesProcedure, the call method.

@Override
public InternalRow[] call(InternalRow args) {
    Identifier tableIdent = toIdentifier(args.getString(0), PARAMETERS[0].name());
    Long olderThanMillis = args.isNullAt(1) ? null : DateTimeUtil.microsToMillis(args.getLong(1));
    String location = args.isNullAt(2) ? null : args.getString(2);
    boolean dryRun = args.isNullAt(3) ? false : args.getBoolean(3);
    Integer maxConcurrentDeletes = args.isNullAt(4) ? null : args.getInt(4);
    Preconditions.checkArgument(maxConcurrentDeletes == null || maxConcurrentDeletes > 0,
        "max_concurrent_deletes should have value > 0, value: " + maxConcurrentDeletes);
    return withIcebergTable(tableIdent, table -> {
        DeleteOrphanFiles action = actions().deleteOrphanFiles(table);
        if (olderThanMillis != null) {
            boolean isTesting = Boolean.parseBoolean(spark().conf().get("spark.testing", "false"));
            if (!isTesting) {
                validateInterval(olderThanMillis);
            }
            action.olderThan(olderThanMillis);
        }
        if (location != null) {
            action.location(location);
        }
        if (dryRun) {
            // Dry run: swap in a no-op delete so orphan files are reported but never removed.
            action.deleteWith(file -> {
            });
        }
        if (maxConcurrentDeletes != null && maxConcurrentDeletes > 0) {
            action.executeDeleteWith(executorService(maxConcurrentDeletes, "remove-orphans"));
        }
        DeleteOrphanFiles.Result result = action.execute();
        return toOutputRows(result);
    });
}
Also used : Identifier(org.apache.spark.sql.connector.catalog.Identifier) DeleteOrphanFiles(org.apache.iceberg.actions.DeleteOrphanFiles) UTF8String(org.apache.spark.unsafe.types.UTF8String)
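The class above backs Iceberg's remove_orphan_files stored procedure, so it is usually invoked through Spark SQL rather than instantiated directly. The following is a minimal sketch of such a call; the catalog name my_catalog, the table db.sample, and the Spark session setup are illustrative assumptions, not values taken from the source.

import org.apache.spark.sql.SparkSession;

public class RemoveOrphanFilesCallExample {
    public static void main(String[] args) {
        // Assumes the session (or spark-submit conf) already registers an Iceberg catalog named "my_catalog".
        SparkSession spark = SparkSession.builder()
            .appName("remove-orphan-files-call")
            .getOrCreate();

        // dry_run => true maps to the no-op deleteWith(...) branch of the procedure,
        // so orphan files are listed but not deleted.
        spark.sql(
            "CALL my_catalog.system.remove_orphan_files("
                + "table => 'db.sample', "
                + "dry_run => true, "
                + "max_concurrent_deletes => 4)")
            .show(false);
    }
}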

Example 2 with DeleteOrphanFiles

Use of org.apache.iceberg.actions.DeleteOrphanFiles in the Apache Iceberg project.

From the class TestDeleteReachableFilesAction, the testIgnoreMetadataFilesNotFound method.

@Test
public void testIgnoreMetadataFilesNotFound() {
    table.updateProperties().set(TableProperties.METADATA_PREVIOUS_VERSIONS_MAX, "1").commit();
    table.newAppend().appendFile(FILE_A).commit();
    // There are three metadata json files at this point
    DeleteOrphanFiles.Result result = sparkActions().deleteOrphanFiles(table).olderThan(System.currentTimeMillis()).execute();
    Assert.assertEquals("Should delete 1 file", 1, Iterables.size(result.orphanFileLocations()));
    Assert.assertTrue("Should remove v1 file", StreamSupport.stream(result.orphanFileLocations().spliterator(), false).anyMatch(file -> file.contains("v1.metadata.json")));
    DeleteReachableFiles baseRemoveFilesSparkAction = sparkActions().deleteReachableFiles(metadataLocation(table)).io(table.io());
    DeleteReachableFiles.Result res = baseRemoveFilesSparkAction.execute();
    checkRemoveFilesResults(1, 1, 1, 4, res);
}
Also used : ImmutableSet(org.apache.iceberg.relocated.com.google.common.collect.ImmutableSet) Types(org.apache.iceberg.types.Types) ImmutableMap(org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap) NestedField.optional(org.apache.iceberg.types.Types.NestedField.optional) DeleteOrphanFiles(org.apache.iceberg.actions.DeleteOrphanFiles) DeleteReachableFiles(org.apache.iceberg.actions.DeleteReachableFiles) Lists(org.apache.iceberg.relocated.com.google.common.collect.Lists) AtomicInteger(java.util.concurrent.atomic.AtomicInteger) DataFiles(org.apache.iceberg.DataFiles) Configuration(org.apache.hadoop.conf.Configuration) StreamSupport(java.util.stream.StreamSupport) DataFile(org.apache.iceberg.DataFile) Before(org.junit.Before) AssertHelpers(org.apache.iceberg.AssertHelpers) Table(org.apache.iceberg.Table) HadoopTables(org.apache.iceberg.hadoop.HadoopTables) ConcurrentHashMap(java.util.concurrent.ConcurrentHashMap) Maps(org.apache.iceberg.relocated.com.google.common.collect.Maps) Set(java.util.Set) HasTableOperations(org.apache.iceberg.HasTableOperations) Iterables(org.apache.iceberg.relocated.com.google.common.collect.Iterables) Test(org.junit.Test) Schema(org.apache.iceberg.Schema) File(java.io.File) Executors(java.util.concurrent.Executors) ActionsProvider(org.apache.iceberg.actions.ActionsProvider) ValidationException(org.apache.iceberg.exceptions.ValidationException) Sets(org.apache.iceberg.relocated.com.google.common.collect.Sets) Rule(org.junit.Rule) PartitionSpec(org.apache.iceberg.PartitionSpec) TableProperties(org.apache.iceberg.TableProperties) TestHelpers(org.apache.iceberg.TestHelpers) Assert(org.junit.Assert) SparkTestBase(org.apache.iceberg.spark.SparkTestBase) TemporaryFolder(org.junit.rules.TemporaryFolder) DeleteOrphanFiles(org.apache.iceberg.actions.DeleteOrphanFiles) DeleteReachableFiles(org.apache.iceberg.actions.DeleteReachableFiles) Test(org.junit.Test)
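DeleteReachableFiles, used at the end of this test, is the complementary action: it deletes every file referenced from a given metadata.json, which is what purging a dropped table requires. Below is a minimal sketch under stated assumptions: the table location is a placeholder, the Spark session setup is illustrative, and resolving the metadata path through HasTableOperations is an assumption rather than something the test does (the test uses its own metadataLocation(table) helper).

import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.HasTableOperations;
import org.apache.iceberg.Table;
import org.apache.iceberg.actions.DeleteReachableFiles;
import org.apache.iceberg.hadoop.HadoopTables;
import org.apache.iceberg.spark.actions.SparkActions;
import org.apache.spark.sql.SparkSession;

public class PurgeReachableFilesExample {
    public static void main(String[] args) {
        // SparkActions needs an active session; assume it is configured via spark-submit.
        SparkSession spark = SparkSession.builder()
            .appName("purge-reachable-files")
            .getOrCreate();

        // Placeholder path; point this at a real Hadoop-backed table.
        Table table = new HadoopTables(new Configuration()).load("file:///tmp/warehouse/db/sample");

        // Assumption: the table exposes HasTableOperations (Hadoop and Hive tables do),
        // which gives access to the current metadata.json location.
        String metadataLocation =
            ((HasTableOperations) table).operations().current().metadataFileLocation();

        // Destructive: removes data files, manifests, manifest lists, and metadata files
        // reachable from that metadata file. Run only for tables already dropped from the catalog.
        DeleteReachableFiles.Result result = SparkActions.get()
            .deleteReachableFiles(metadataLocation)
            .io(table.io())
            .execute();
        System.out.println("Reachable-file purge result: " + result);
    }
}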

Example 3 with DeleteOrphanFiles

Use of org.apache.iceberg.actions.DeleteOrphanFiles in the Apache Iceberg project.

From the class TestRemoveOrphanFilesAction, the orphanedFileRemovedWithParallelTasks method.

@Test
public void orphanedFileRemovedWithParallelTasks() throws InterruptedException, IOException {
    Table table = TABLES.create(SCHEMA, SPEC, Maps.newHashMap(), tableLocation);
    List<ThreeColumnRecord> records1 = Lists.newArrayList(new ThreeColumnRecord(1, "AAAAAAAAAA", "AAAA"));
    Dataset<Row> df1 = spark.createDataFrame(records1, ThreeColumnRecord.class).coalesce(1);
    // original append
    df1.select("c1", "c2", "c3").write().format("iceberg").mode("append").save(tableLocation);
    List<ThreeColumnRecord> records2 = Lists.newArrayList(new ThreeColumnRecord(2, "AAAAAAAAAA", "AAAA"));
    Dataset<Row> df2 = spark.createDataFrame(records2, ThreeColumnRecord.class).coalesce(1);
    // dynamic partition overwrite
    df2.select("c1", "c2", "c3").write().format("iceberg").mode("overwrite").save(tableLocation);
    // second append
    df2.select("c1", "c2", "c3").write().format("iceberg").mode("append").save(tableLocation);
    df2.coalesce(1).write().mode("append").parquet(tableLocation + "/data");
    df2.coalesce(1).write().mode("append").parquet(tableLocation + "/data/c2_trunc=AA");
    df2.coalesce(1).write().mode("append").parquet(tableLocation + "/data/c2_trunc=AA/c3=AAAA");
    df2.coalesce(1).write().mode("append").parquet(tableLocation + "/data/invalid/invalid");
    // sleep for 1 second to ensure files will be old enough
    Thread.sleep(1000);
    Set<String> deletedFiles = Sets.newHashSet();
    Set<String> deleteThreads = ConcurrentHashMap.newKeySet();
    AtomicInteger deleteThreadsIndex = new AtomicInteger(0);
    ExecutorService executorService = Executors.newFixedThreadPool(4, runnable -> {
        Thread thread = new Thread(runnable);
        thread.setName("remove-orphan-" + deleteThreadsIndex.getAndIncrement());
        thread.setDaemon(true);
        return thread;
    });
    DeleteOrphanFiles.Result result = SparkActions.get().deleteOrphanFiles(table).executeDeleteWith(executorService).olderThan(System.currentTimeMillis()).deleteWith(file -> {
        deleteThreads.add(Thread.currentThread().getName());
        deletedFiles.add(file);
    }).execute();
    // Verifies that the delete methods ran in the threads created by the provided ExecutorService ThreadFactory
    Assert.assertEquals(deleteThreads, Sets.newHashSet("remove-orphan-0", "remove-orphan-1", "remove-orphan-2", "remove-orphan-3"));
    Assert.assertEquals("Should delete 4 files", 4, deletedFiles.size());
}
Also used : Arrays(java.util.Arrays) Types(org.apache.iceberg.types.Types) Dataset(org.apache.spark.sql.Dataset) FileSystem(org.apache.hadoop.fs.FileSystem) NestedField.optional(org.apache.iceberg.types.Types.NestedField.optional) DeleteOrphanFiles(org.apache.iceberg.actions.DeleteOrphanFiles) FileStatus(org.apache.hadoop.fs.FileStatus) Lists(org.apache.iceberg.relocated.com.google.common.collect.Lists) AtomicInteger(java.util.concurrent.atomic.AtomicInteger) Map(java.util.Map) Configuration(org.apache.hadoop.conf.Configuration) Path(org.apache.hadoop.fs.Path) StreamSupport(java.util.stream.StreamSupport) Namespace(org.apache.iceberg.catalog.Namespace) ExecutorService(java.util.concurrent.ExecutorService) ThreeColumnRecord(org.apache.iceberg.spark.source.ThreeColumnRecord) Before(org.junit.Before) AssertHelpers(org.apache.iceberg.AssertHelpers) TableIdentifier(org.apache.iceberg.catalog.TableIdentifier) HadoopCatalog(org.apache.iceberg.hadoop.HadoopCatalog) Table(org.apache.iceberg.Table) HiddenPathFilter(org.apache.iceberg.hadoop.HiddenPathFilter) HadoopTables(org.apache.iceberg.hadoop.HadoopTables) ConcurrentHashMap(java.util.concurrent.ConcurrentHashMap) Maps(org.apache.iceberg.relocated.com.google.common.collect.Maps) Set(java.util.Set) IOException(java.io.IOException) Iterables(org.apache.iceberg.relocated.com.google.common.collect.Iterables) Test(org.junit.Test) Row(org.apache.spark.sql.Row) Schema(org.apache.iceberg.Schema) Collectors(java.util.stream.Collectors) File(java.io.File) Executors(java.util.concurrent.Executors) Encoders(org.apache.spark.sql.Encoders) ValidationException(org.apache.iceberg.exceptions.ValidationException) Sets(org.apache.iceberg.relocated.com.google.common.collect.Sets) List(java.util.List) Rule(org.junit.Rule) PartitionSpec(org.apache.iceberg.PartitionSpec) TableProperties(org.apache.iceberg.TableProperties) Assert(org.junit.Assert) SparkTestBase(org.apache.iceberg.spark.SparkTestBase) TemporaryFolder(org.junit.rules.TemporaryFolder) Snapshot(org.apache.iceberg.Snapshot) Table(org.apache.iceberg.Table) DeleteOrphanFiles(org.apache.iceberg.actions.DeleteOrphanFiles) ThreeColumnRecord(org.apache.iceberg.spark.source.ThreeColumnRecord) AtomicInteger(java.util.concurrent.atomic.AtomicInteger) ExecutorService(java.util.concurrent.ExecutorService) Row(org.apache.spark.sql.Row) Test(org.junit.Test)
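Outside of a test, the same executeDeleteWith hook is how delete parallelism is bounded, and the pool should be shut down once the action finishes. A minimal sketch, assuming a Hadoop-backed table at a placeholder location and a Spark session configured elsewhere:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Table;
import org.apache.iceberg.actions.DeleteOrphanFiles;
import org.apache.iceberg.hadoop.HadoopTables;
import org.apache.iceberg.relocated.com.google.common.collect.Iterables;
import org.apache.iceberg.spark.actions.SparkActions;
import org.apache.spark.sql.SparkSession;

public class ParallelOrphanCleanupExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("parallel-orphan-cleanup")
            .getOrCreate();

        // Placeholder path; replace with a real table location.
        Table table = new HadoopTables(new Configuration()).load("file:///tmp/warehouse/db/sample");

        // Bound file deletion to four threads, mirroring the thread pool in the test above.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            DeleteOrphanFiles.Result result = SparkActions.get()
                .deleteOrphanFiles(table)
                // Only touch files older than three days to avoid racing with in-flight writes.
                .olderThan(System.currentTimeMillis() - 3 * 24 * 60 * 60 * 1000L)
                .executeDeleteWith(pool)
                .execute();
            System.out.println("Deleted " + Iterables.size(result.orphanFileLocations()) + " orphan files");
        } finally {
            pool.shutdown();
        }
    }
}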

Example 4 with DeleteOrphanFiles

Use of org.apache.iceberg.actions.DeleteOrphanFiles in the Apache Iceberg project.

From the class TestRemoveOrphanFilesAction3, the testSparkCatalogNamedHiveTable method.

@Test
public void testSparkCatalogNamedHiveTable() throws Exception {
    spark.conf().set("spark.sql.catalog.hive", "org.apache.iceberg.spark.SparkCatalog");
    spark.conf().set("spark.sql.catalog.hive.type", "hadoop");
    spark.conf().set("spark.sql.catalog.hive.warehouse", tableLocation);
    SparkCatalog cat = (SparkCatalog) spark.sessionState().catalogManager().catalog("hive");
    String[] database = { "default" };
    Identifier id = Identifier.of(database, "table");
    Map<String, String> options = Maps.newHashMap();
    Transform[] transforms = {};
    cat.createTable(id, SparkSchemaUtil.convert(SCHEMA), transforms, options);
    SparkTable table = cat.loadTable(id);
    spark.sql("INSERT INTO hive.default.table VALUES (1,1,1)");
    String location = table.table().location().replaceFirst("file:", "");
    new File(location + "/data/trashfile").createNewFile();
    // olderThan is set 1 second in the future so even freshly written files qualify as "old enough" (safe only in tests)
    DeleteOrphanFiles.Result results = SparkActions.get().deleteOrphanFiles(table.table()).olderThan(System.currentTimeMillis() + 1000).execute();
    Assert.assertTrue("trash file should be removed", StreamSupport.stream(results.orphanFileLocations().spliterator(), false).anyMatch(file -> file.contains("file:" + location + "/data/trashfile")));
}
Also used : SparkCatalog(org.apache.iceberg.spark.SparkCatalog) Maps(org.apache.iceberg.relocated.com.google.common.collect.Maps) Test(org.junit.Test) DeleteOrphanFiles(org.apache.iceberg.actions.DeleteOrphanFiles) SparkSchemaUtil(org.apache.iceberg.spark.SparkSchemaUtil) File(java.io.File) SparkSessionCatalog(org.apache.iceberg.spark.SparkSessionCatalog) Map(java.util.Map) After(org.junit.After) Transform(org.apache.spark.sql.connector.expressions.Transform) StreamSupport(java.util.stream.StreamSupport) Identifier(org.apache.spark.sql.connector.catalog.Identifier) Assert(org.junit.Assert) SparkTable(org.apache.iceberg.spark.source.SparkTable) Identifier(org.apache.spark.sql.connector.catalog.Identifier) SparkCatalog(org.apache.iceberg.spark.SparkCatalog) DeleteOrphanFiles(org.apache.iceberg.actions.DeleteOrphanFiles) Transform(org.apache.spark.sql.connector.expressions.Transform) SparkTable(org.apache.iceberg.spark.source.SparkTable) File(java.io.File) Test(org.junit.Test)
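The trash file in this test sits under the table's data directory, which is the case where narrowing the scan with location(...) helps (the same option Example 1 exposes through the procedure's location argument). A minimal sketch under the same assumptions as the earlier ones (placeholder paths, Spark session configured elsewhere):

import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Table;
import org.apache.iceberg.actions.DeleteOrphanFiles;
import org.apache.iceberg.hadoop.HadoopTables;
import org.apache.iceberg.spark.actions.SparkActions;
import org.apache.spark.sql.SparkSession;

public class ScopedOrphanCleanupExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("scoped-orphan-cleanup")
            .getOrCreate();

        // Placeholder path; replace with a real table location.
        String tableLocation = "file:///tmp/warehouse/db/sample";
        Table table = new HadoopTables(new Configuration()).load(tableLocation);

        // Scan only the data directory rather than the whole table location.
        DeleteOrphanFiles.Result result = SparkActions.get()
            .deleteOrphanFiles(table)
            .location(tableLocation + "/data")
            .olderThan(System.currentTimeMillis() - 24 * 60 * 60 * 1000L)
            .execute();
        result.orphanFileLocations().forEach(System.out::println);
    }
}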

Example 5 with DeleteOrphanFiles

Use of org.apache.iceberg.actions.DeleteOrphanFiles in the Apache Iceberg project.

From the class TestRemoveOrphanFilesAction3, the testSparkSessionCatalogHiveTable method.

@Test
public void testSparkSessionCatalogHiveTable() throws Exception {
    spark.conf().set("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog");
    spark.conf().set("spark.sql.catalog.spark_catalog.type", "hive");
    SparkSessionCatalog cat = (SparkSessionCatalog) spark.sessionState().catalogManager().v2SessionCatalog();
    String[] database = { "default" };
    Identifier id = Identifier.of(database, "sessioncattest");
    Map<String, String> options = Maps.newHashMap();
    Transform[] transforms = {};
    cat.dropTable(id);
    cat.createTable(id, SparkSchemaUtil.convert(SCHEMA), transforms, options);
    SparkTable table = (SparkTable) cat.loadTable(id);
    spark.sql("INSERT INTO default.sessioncattest VALUES (1,1,1)");
    String location = table.table().location().replaceFirst("file:", "");
    new File(location + "/data/trashfile").createNewFile();
    DeleteOrphanFiles.Result results = SparkActions.get().deleteOrphanFiles(table.table()).olderThan(System.currentTimeMillis() + 1000).execute();
    Assert.assertTrue("trash file should be removed", StreamSupport.stream(results.orphanFileLocations().spliterator(), false).anyMatch(file -> file.contains("file:" + location + "/data/trashfile")));
}
Also used : SparkSessionCatalog(org.apache.iceberg.spark.SparkSessionCatalog) SparkCatalog(org.apache.iceberg.spark.SparkCatalog) Maps(org.apache.iceberg.relocated.com.google.common.collect.Maps) Test(org.junit.Test) DeleteOrphanFiles(org.apache.iceberg.actions.DeleteOrphanFiles) SparkSchemaUtil(org.apache.iceberg.spark.SparkSchemaUtil) File(java.io.File) SparkSessionCatalog(org.apache.iceberg.spark.SparkSessionCatalog) Map(java.util.Map) After(org.junit.After) Transform(org.apache.spark.sql.connector.expressions.Transform) StreamSupport(java.util.stream.StreamSupport) Identifier(org.apache.spark.sql.connector.catalog.Identifier) Assert(org.junit.Assert) SparkTable(org.apache.iceberg.spark.source.SparkTable) Identifier(org.apache.spark.sql.connector.catalog.Identifier) DeleteOrphanFiles(org.apache.iceberg.actions.DeleteOrphanFiles) Transform(org.apache.spark.sql.connector.expressions.Transform) SparkTable(org.apache.iceberg.spark.source.SparkTable) File(java.io.File) Test(org.junit.Test)

Aggregations

DeleteOrphanFiles (org.apache.iceberg.actions.DeleteOrphanFiles): 10
Test (org.junit.Test): 9
File (java.io.File): 8
StreamSupport (java.util.stream.StreamSupport): 8
Maps (org.apache.iceberg.relocated.com.google.common.collect.Maps): 8
Assert (org.junit.Assert): 8
Map (java.util.Map): 7
Identifier (org.apache.spark.sql.connector.catalog.Identifier): 6
SparkCatalog (org.apache.iceberg.spark.SparkCatalog): 5
SparkSchemaUtil (org.apache.iceberg.spark.SparkSchemaUtil): 5
SparkSessionCatalog (org.apache.iceberg.spark.SparkSessionCatalog): 5
SparkTable (org.apache.iceberg.spark.source.SparkTable): 5
Transform (org.apache.spark.sql.connector.expressions.Transform): 5
After (org.junit.After): 5
Configuration (org.apache.hadoop.conf.Configuration): 4
Table (org.apache.iceberg.Table): 4
Set (java.util.Set): 3
ConcurrentHashMap (java.util.concurrent.ConcurrentHashMap): 3
Executors (java.util.concurrent.Executors): 3
AtomicInteger (java.util.concurrent.atomic.AtomicInteger): 3