
Example 1 with CheckpointEntryIterator

Use of io.trino.plugin.deltalake.transactionlog.checkpoint.CheckpointEntryIterator in project trino by trinodb.

The class TestDeltaLakeFileStatistics, method testParseParquetStatistics.

@Test
public void testParseParquetStatistics() throws Exception {
    File statsFile = new File(getClass().getResource("/databricks/pruning/parquet_struct_statistics/_delta_log/00000000000000000010.checkpoint.parquet").getFile());
    Path checkpointPath = new Path(statsFile.toURI());
    TypeManager typeManager = TESTING_TYPE_MANAGER;
    CheckpointSchemaManager checkpointSchemaManager = new CheckpointSchemaManager(typeManager);
    HdfsConfig hdfsConfig = new HdfsConfig();
    HdfsConfiguration hdfsConfiguration = new HiveHdfsConfiguration(new HdfsConfigurationInitializer(hdfsConfig), ImmutableSet.of());
    HdfsEnvironment hdfsEnvironment = new HdfsEnvironment(hdfsConfiguration, hdfsConfig, new NoHdfsAuthentication());
    FileSystem fs = hdfsEnvironment.getFileSystem(new HdfsEnvironment.HdfsContext(SESSION), checkpointPath);
    CheckpointEntryIterator metadataEntryIterator = new CheckpointEntryIterator(checkpointPath, SESSION, fs.getFileStatus(checkpointPath).getLen(), checkpointSchemaManager, typeManager, ImmutableSet.of(METADATA), Optional.empty(), hdfsEnvironment, new FileFormatDataSourceStats(), new ParquetReaderConfig().toParquetReaderOptions(), true);
    MetadataEntry metadataEntry = getOnlyElement(metadataEntryIterator).getMetaData();
    CheckpointEntryIterator checkpointEntryIterator = new CheckpointEntryIterator(checkpointPath, SESSION, fs.getFileStatus(checkpointPath).getLen(), checkpointSchemaManager, typeManager, ImmutableSet.of(CheckpointEntryIterator.EntryType.ADD), Optional.of(metadataEntry), hdfsEnvironment, new FileFormatDataSourceStats(), new ParquetReaderConfig().toParquetReaderOptions(), true);
    DeltaLakeTransactionLogEntry matchingAddFileEntry = null;
    while (checkpointEntryIterator.hasNext()) {
        DeltaLakeTransactionLogEntry entry = checkpointEntryIterator.next();
        if (entry.getAdd() != null && entry.getAdd().getPath().contains("part-00000-17951bea-0d04-43c1-979c-ea1fac19b382-c000.snappy.parquet")) {
            assertNull(matchingAddFileEntry);
            matchingAddFileEntry = entry;
        }
    }
    assertNotNull(matchingAddFileEntry);
    assertThat(matchingAddFileEntry.getAdd().getStats()).isPresent();
    testStatisticsValues(matchingAddFileEntry.getAdd().getStats().get());
}
Also used : Path(org.apache.hadoop.fs.Path) HdfsConfigurationInitializer(io.trino.plugin.hive.HdfsConfigurationInitializer) HiveHdfsConfiguration(io.trino.plugin.hive.HiveHdfsConfiguration) DeltaLakeTransactionLogEntry(io.trino.plugin.deltalake.transactionlog.DeltaLakeTransactionLogEntry) HdfsConfig(io.trino.plugin.hive.HdfsConfig) FileFormatDataSourceStats(io.trino.plugin.hive.FileFormatDataSourceStats) CheckpointEntryIterator(io.trino.plugin.deltalake.transactionlog.checkpoint.CheckpointEntryIterator) HdfsConfiguration(io.trino.plugin.hive.HdfsConfiguration) NoHdfsAuthentication(io.trino.plugin.hive.authentication.NoHdfsAuthentication) HdfsEnvironment(io.trino.plugin.hive.HdfsEnvironment) CheckpointSchemaManager(io.trino.plugin.deltalake.transactionlog.checkpoint.CheckpointSchemaManager) FileSystem(org.apache.hadoop.fs.FileSystem) TypeManager(io.trino.spi.type.TypeManager) MetadataEntry(io.trino.plugin.deltalake.transactionlog.MetadataEntry) File(java.io.File) ParquetReaderConfig(io.trino.plugin.hive.parquet.ParquetReaderConfig) Test(org.testng.annotations.Test)

Example 2 with CheckpointEntryIterator

Use of io.trino.plugin.deltalake.transactionlog.checkpoint.CheckpointEntryIterator in project trino by trinodb.

The class TableSnapshot, method getCheckpointTransactionLogEntries.

private Stream<DeltaLakeTransactionLogEntry> getCheckpointTransactionLogEntries(ConnectorSession session, Set<CheckpointEntryIterator.EntryType> entryTypes, Optional<MetadataEntry> metadataEntry, CheckpointSchemaManager checkpointSchemaManager, TypeManager typeManager, FileSystem fileSystem, HdfsEnvironment hdfsEnvironment, FileFormatDataSourceStats stats, LastCheckpoint checkpoint, Path checkpointPath) throws IOException {
    FileStatus fileStatus;
    try {
        fileStatus = fileSystem.getFileStatus(checkpointPath);
    } catch (FileNotFoundException e) {
        throw new TrinoException(DELTA_LAKE_INVALID_SCHEMA, format("%s mentions a non-existent checkpoint file for table: %s", checkpoint, table));
    }
    Iterator<DeltaLakeTransactionLogEntry> checkpointEntryIterator = new CheckpointEntryIterator(checkpointPath, session, fileStatus.getLen(), checkpointSchemaManager, typeManager, entryTypes, metadataEntry, hdfsEnvironment, stats, parquetReaderOptions, checkpointRowStatisticsWritingEnabled);
    return stream(checkpointEntryIterator);
}
Also used : FileStatus(org.apache.hadoop.fs.FileStatus) FileNotFoundException(java.io.FileNotFoundException) TrinoException(io.trino.spi.TrinoException) CheckpointEntryIterator(io.trino.plugin.deltalake.transactionlog.checkpoint.CheckpointEntryIterator)
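
Example 2 ends by returning stream(checkpointEntryIterator). The excerpt does not show where stream comes from; it is presumably a statically imported helper such as Guava's Streams.stream. As a minimal sketch using only the JDK, with the hypothetical class name IteratorToStreamSketch and plain strings standing in for transaction log entries, an iterator can be adapted to a lazy, ordered stream roughly like this:

```java
import java.util.Iterator;
import java.util.List;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// Hypothetical illustration; not part of the Trino code base.
public class IteratorToStreamSketch {
    // Wraps an iterator in a sequential stream. Elements are pulled from the
    // iterator only when a terminal operation consumes the stream.
    static <T> Stream<T> stream(Iterator<T> iterator) {
        return StreamSupport.stream(
                Spliterators.spliteratorUnknownSize(iterator, Spliterator.ORDERED),
                false); // false = not parallel
    }

    public static void main(String[] args) {
        // Strings stand in for checkpoint entries (assumption for illustration).
        Iterator<String> entries = List.of("metadata", "add", "add").iterator();

        // Keep only the "add" entries, preserving iteration order.
        List<String> adds = stream(entries)
                .filter("add"::equals)
                .collect(Collectors.toList());

        System.out.println(adds);
    }
}
```

A stream built this way can be consumed only once, because it drains the underlying iterator as it goes.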

Aggregations

CheckpointEntryIterator (io.trino.plugin.deltalake.transactionlog.checkpoint.CheckpointEntryIterator): 2, DeltaLakeTransactionLogEntry (io.trino.plugin.deltalake.transactionlog.DeltaLakeTransactionLogEntry): 1, MetadataEntry (io.trino.plugin.deltalake.transactionlog.MetadataEntry): 1, CheckpointSchemaManager (io.trino.plugin.deltalake.transactionlog.checkpoint.CheckpointSchemaManager): 1, FileFormatDataSourceStats (io.trino.plugin.hive.FileFormatDataSourceStats): 1, HdfsConfig (io.trino.plugin.hive.HdfsConfig): 1, HdfsConfiguration (io.trino.plugin.hive.HdfsConfiguration): 1, HdfsConfigurationInitializer (io.trino.plugin.hive.HdfsConfigurationInitializer): 1, HdfsEnvironment (io.trino.plugin.hive.HdfsEnvironment): 1, HiveHdfsConfiguration (io.trino.plugin.hive.HiveHdfsConfiguration): 1, NoHdfsAuthentication (io.trino.plugin.hive.authentication.NoHdfsAuthentication): 1, ParquetReaderConfig (io.trino.plugin.hive.parquet.ParquetReaderConfig): 1, TrinoException (io.trino.spi.TrinoException): 1, TypeManager (io.trino.spi.type.TypeManager): 1, File (java.io.File): 1, FileNotFoundException (java.io.FileNotFoundException): 1, FileStatus (org.apache.hadoop.fs.FileStatus): 1, FileSystem (org.apache.hadoop.fs.FileSystem): 1, Path (org.apache.hadoop.fs.Path): 1, Test (org.testng.annotations.Test): 1