
Example 1 with ParquetReaderConfig

Use of io.trino.plugin.hive.parquet.ParquetReaderConfig in project trino by trinodb.

Source: class TestHiveFileFormats, method testParquetPageSource.

@Test(dataProvider = "validRowAndFileSizePadding")
public void testParquetPageSource(int rowCount, long fileSizePadding) throws Exception {
    List<TestColumn> testColumns = getTestColumnsSupportedByParquet();
    assertThatFileFormat(PARQUET)
            .withColumns(testColumns)
            .withSession(PARQUET_SESSION)
            .withRowsCount(rowCount)
            .withFileSizePadding(fileSizePadding)
            .isReadableByPageSource(new ParquetPageSourceFactory(HDFS_ENVIRONMENT, STATS, new ParquetReaderConfig(), new HiveConfig()));
}
Also used: ParquetPageSourceFactory (io.trino.plugin.hive.parquet.ParquetPageSourceFactory), ParquetReaderConfig (io.trino.plugin.hive.parquet.ParquetReaderConfig), Test (org.testng.annotations.Test)
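
For reference, PARQUET_SESSION is a test fixture defined elsewhere in TestHiveFileFormats and is not shown on this page. A minimal sketch of how such a session could be assembled, assuming it follows the same HiveSessionProperties/TestingConnectorSession pattern used in Examples 3 and 5 below (the variable names here are hypothetical):

// Hypothetical sketch: a ConnectorSession carrying default Parquet reader properties,
// built the same way Examples 3 and 5 build their sessions.
HiveSessionProperties hiveSessionProperties = new HiveSessionProperties(
        new HiveConfig(),
        new OrcReaderConfig(),
        new OrcWriterConfig(),
        new ParquetReaderConfig(),
        new ParquetWriterConfig());
ConnectorSession parquetSession = TestingConnectorSession.builder()
        .setPropertyMetadata(hiveSessionProperties.getSessionProperties())
        .build();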

Example 2 with ParquetReaderConfig

Use of io.trino.plugin.hive.parquet.ParquetReaderConfig in project trino by trinodb.

Source: class TestHiveFileFormats, method testParquetProjectedColumns.

@Test(dataProvider = "rowCount")
public void testParquetProjectedColumns(int rowCount) throws Exception {
    List<TestColumn> supportedColumns = getTestColumnsSupportedByParquet();
    List<TestColumn> regularColumns = getRegularColumns(supportedColumns);
    List<TestColumn> partitionColumns = getPartitionColumns(supportedColumns);
    // Create projected columns for all regular supported columns
    ImmutableList.Builder<TestColumn> writeColumnsBuilder = ImmutableList.builder();
    ImmutableList.Builder<TestColumn> readeColumnsBuilder = ImmutableList.builder();
    generateProjectedColumns(regularColumns, writeColumnsBuilder, readeColumnsBuilder);
    List<TestColumn> writeColumns = writeColumnsBuilder.addAll(partitionColumns).build();
    List<TestColumn> readColumns = readeColumnsBuilder.addAll(partitionColumns).build();
    assertThatFileFormat(PARQUET)
            .withWriteColumns(writeColumns)
            .withReadColumns(readColumns)
            .withRowsCount(rowCount)
            .withSession(PARQUET_SESSION)
            .isReadableByPageSource(new ParquetPageSourceFactory(HDFS_ENVIRONMENT, STATS, new ParquetReaderConfig(), new HiveConfig()));
    assertThatFileFormat(PARQUET)
            .withWriteColumns(writeColumns)
            .withReadColumns(readColumns)
            .withRowsCount(rowCount)
            .withSession(PARQUET_SESSION_USE_NAME)
            .isReadableByPageSource(new ParquetPageSourceFactory(HDFS_ENVIRONMENT, STATS, new ParquetReaderConfig(), new HiveConfig()));
}
Also used: ImmutableList.toImmutableList (com.google.common.collect.ImmutableList.toImmutableList), ImmutableList (com.google.common.collect.ImmutableList), ParquetPageSourceFactory (io.trino.plugin.hive.parquet.ParquetPageSourceFactory), ParquetReaderConfig (io.trino.plugin.hive.parquet.ParquetReaderConfig), Test (org.testng.annotations.Test)
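
Both assertions above construct an identical ParquetPageSourceFactory; purely as an illustrative variant (not taken from the Trino source), the factory could be hoisted into a local so the by-index and by-name sessions exercise the same instance:

// Illustrative refactoring sketch: reuse one factory for both session variants.
ParquetPageSourceFactory pageSourceFactory = new ParquetPageSourceFactory(HDFS_ENVIRONMENT, STATS, new ParquetReaderConfig(), new HiveConfig());
assertThatFileFormat(PARQUET)
        .withWriteColumns(writeColumns)
        .withReadColumns(readColumns)
        .withRowsCount(rowCount)
        .withSession(PARQUET_SESSION)
        .isReadableByPageSource(pageSourceFactory);
// ...and likewise with PARQUET_SESSION_USE_NAME for the name-based variant.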

Example 3 with ParquetReaderConfig

Use of io.trino.plugin.hive.parquet.ParquetReaderConfig in project trino by trinodb.

Source: class TestHiveFileFormats, method testOrcOptimizedWriter.

@Test(dataProvider = "validRowAndFileSizePadding")
public void testOrcOptimizedWriter(int rowCount, long fileSizePadding) throws Exception {
    HiveSessionProperties hiveSessionProperties = new HiveSessionProperties(
            new HiveConfig(),
            new OrcReaderConfig(),
            new OrcWriterConfig().setValidationPercentage(100.0),
            new ParquetReaderConfig(),
            new ParquetWriterConfig());
    ConnectorSession session = TestingConnectorSession.builder()
            .setPropertyMetadata(hiveSessionProperties.getSessionProperties())
            .build();
    // A Trino page cannot contain a map with null keys, so a page based writer cannot write null keys
    List<TestColumn> testColumns = TEST_COLUMNS.stream()
            .filter(TestHiveFileFormats::withoutNullMapKeyTests)
            .collect(toList());
    assertThatFileFormat(ORC)
            .withColumns(testColumns)
            .withRowsCount(rowCount)
            .withSession(session)
            .withFileSizePadding(fileSizePadding)
            .withFileWriterFactory(new OrcFileWriterFactory(HDFS_ENVIRONMENT, TESTING_TYPE_MANAGER, new NodeVersion("test"), STATS, new OrcWriterOptions()))
            .isReadableByRecordCursor(createGenericHiveRecordCursorProvider(HDFS_ENVIRONMENT))
            .isReadableByPageSource(new OrcPageSourceFactory(new OrcReaderOptions(), HDFS_ENVIRONMENT, STATS, UTC));
}
Also used: ParquetWriterConfig (io.trino.plugin.hive.parquet.ParquetWriterConfig), OrcWriterConfig (io.trino.plugin.hive.orc.OrcWriterConfig), OrcPageSourceFactory (io.trino.plugin.hive.orc.OrcPageSourceFactory), OrcFileWriterFactory (io.trino.plugin.hive.orc.OrcFileWriterFactory), OrcWriterOptions (io.trino.orc.OrcWriterOptions), OrcReaderConfig (io.trino.plugin.hive.orc.OrcReaderConfig), OrcReaderOptions (io.trino.orc.OrcReaderOptions), ConnectorSession (io.trino.spi.connector.ConnectorSession), TestingConnectorSession (io.trino.testing.TestingConnectorSession), ParquetReaderConfig (io.trino.plugin.hive.parquet.ParquetReaderConfig), Test (org.testng.annotations.Test)
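
The ParquetReaderConfig in this example is left at its defaults and only feeds the session-property machinery. If a test needed non-default Parquet read behavior, a tuned config could be dropped into the same constructor slot. A minimal sketch, assuming ParquetReaderConfig exposes a setMaxReadBlockSize setter analogous to the OrcReaderConfig.setMaxBlockSize call in Example 5 (that setter name is an assumption, not something shown on this page):

// Assumption: ParquetReaderConfig#setMaxReadBlockSize exists, mirroring
// OrcReaderConfig#setMaxBlockSize from Example 5.
HiveSessionProperties tunedSessionProperties = new HiveSessionProperties(
        new HiveConfig(),
        new OrcReaderConfig(),
        new OrcWriterConfig().setValidationPercentage(100.0),
        new ParquetReaderConfig().setMaxReadBlockSize(DataSize.ofBytes(8 * 1024 * 1024)),
        new ParquetWriterConfig());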

Example 4 with ParquetReaderConfig

Use of io.trino.plugin.hive.parquet.ParquetReaderConfig in project trino by trinodb.

Source: class TestHiveFileFormats, method testOptimizedParquetWriter.

@Test(dataProvider = "rowCount")
public void testOptimizedParquetWriter(int rowCount) throws Exception {
    ConnectorSession session = getHiveSession(new HiveConfig(), new ParquetWriterConfig().setParquetOptimizedWriterEnabled(true));
    assertTrue(HiveSessionProperties.isParquetOptimizedWriterEnabled(session));
    List<TestColumn> testColumns = getTestColumnsSupportedByParquet();
    assertThatFileFormat(PARQUET)
            .withSession(session)
            .withColumns(testColumns)
            .withRowsCount(rowCount)
            .withFileWriterFactory(new ParquetFileWriterFactory(HDFS_ENVIRONMENT, new NodeVersion("test-version"), TESTING_TYPE_MANAGER))
            .isReadableByPageSource(new ParquetPageSourceFactory(HDFS_ENVIRONMENT, STATS, new ParquetReaderConfig(), new HiveConfig()));
}
Also used: ParquetWriterConfig (io.trino.plugin.hive.parquet.ParquetWriterConfig), ConnectorSession (io.trino.spi.connector.ConnectorSession), TestingConnectorSession (io.trino.testing.TestingConnectorSession), ParquetFileWriterFactory (io.trino.plugin.hive.parquet.ParquetFileWriterFactory), ParquetPageSourceFactory (io.trino.plugin.hive.parquet.ParquetPageSourceFactory), ParquetReaderConfig (io.trino.plugin.hive.parquet.ParquetReaderConfig), Test (org.testng.annotations.Test)
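
The optimized writer is opt-in here: the flag is set on ParquetWriterConfig before the session is built, and the assertTrue on HiveSessionProperties.isParquetOptimizedWriterEnabled confirms the property reached the ConnectorSession. For contrast, a sketch of a default session (assuming the flag defaults to disabled, which this page does not confirm):

// Assumption: the optimized-writer flag is off by default, so a session built from a
// default ParquetWriterConfig would not satisfy the assertion in the test above.
ConnectorSession defaultSession = getHiveSession(new HiveConfig(), new ParquetWriterConfig());
// HiveSessionProperties.isParquetOptimizedWriterEnabled(defaultSession) would then return false.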

Example 5 with ParquetReaderConfig

Use of io.trino.plugin.hive.parquet.ParquetReaderConfig in project trino by trinodb.

Source: class TestOrcPageSourceMemoryTracking, method testMaxReadBytes.

@Test(dataProvider = "rowCount")
public void testMaxReadBytes(int rowCount) throws Exception {
    int maxReadBytes = 1_000;
    HiveSessionProperties hiveSessionProperties = new HiveSessionProperties(
            new HiveConfig(),
            new OrcReaderConfig().setMaxBlockSize(DataSize.ofBytes(maxReadBytes)),
            new OrcWriterConfig(),
            new ParquetReaderConfig(),
            new ParquetWriterConfig());
    ConnectorSession session = TestingConnectorSession.builder()
            .setPropertyMetadata(hiveSessionProperties.getSessionProperties())
            .build();
    FileFormatDataSourceStats stats = new FileFormatDataSourceStats();
    // Build a table where every row gets larger, so we can test that the "batchSize" reduces
    int numColumns = 5;
    int step = 250;
    ImmutableList.Builder<TestColumn> columnBuilder = ImmutableList.<TestColumn>builder()
            .add(new TestColumn("p_empty_string", javaStringObjectInspector, () -> "", true));
    GrowingTestColumn[] dataColumns = new GrowingTestColumn[numColumns];
    for (int i = 0; i < numColumns; i++) {
        dataColumns[i] = new GrowingTestColumn(
                "p_string" + "_" + i,
                javaStringObjectInspector,
                () -> Long.toHexString(random.nextLong()),
                false,
                step * (i + 1));
        columnBuilder.add(dataColumns[i]);
    }
    List<TestColumn> testColumns = columnBuilder.build();
    File tempFile = File.createTempFile("trino_test_orc_page_source_max_read_bytes", "orc");
    tempFile.delete();
    TestPreparer testPreparer = new TestPreparer(tempFile.getAbsolutePath(), testColumns, rowCount, rowCount);
    ConnectorPageSource pageSource = testPreparer.newPageSource(stats, session);
    try {
        int positionCount = 0;
        while (true) {
            Page page = pageSource.getNextPage();
            if (pageSource.isFinished()) {
                break;
            }
            assertNotNull(page);
            page = page.getLoadedPage();
            positionCount += page.getPositionCount();
            // ignore the first MAX_BATCH_SIZE rows given the sizes are set when loading the blocks
            if (positionCount > MAX_BATCH_SIZE) {
                // either the block is bounded by maxReadBytes or we just load one single large block
                // an error margin MAX_BATCH_SIZE / step is needed given the block sizes are increasing
                assertTrue(page.getSizeInBytes() < maxReadBytes * (MAX_BATCH_SIZE / step) || 1 == page.getPositionCount());
            }
        }
        // verify the stats are correctly recorded
        Distribution distribution = stats.getMaxCombinedBytesPerRow().getAllTime();
        assertEquals((int) distribution.getCount(), 1);
        // the block is VariableWidthBlock that contains valueIsNull and offsets arrays as overhead
        assertEquals((int) distribution.getMax(), Arrays.stream(dataColumns).mapToInt(GrowingTestColumn::getMaxSize).sum() + (Integer.BYTES + Byte.BYTES) * numColumns);
        pageSource.close();
    } finally {
        tempFile.delete();
    }
}
Also used: ImmutableList.toImmutableList (com.google.common.collect.ImmutableList.toImmutableList), ImmutableList (com.google.common.collect.ImmutableList), ParquetWriterConfig (io.trino.plugin.hive.parquet.ParquetWriterConfig), OrcWriterConfig (io.trino.plugin.hive.orc.OrcWriterConfig), Page (io.trino.spi.Page), ConnectorPageSource (io.trino.spi.connector.ConnectorPageSource), OrcReaderConfig (io.trino.plugin.hive.orc.OrcReaderConfig), Distribution (io.airlift.stats.Distribution), ConnectorSession (io.trino.spi.connector.ConnectorSession), TestingConnectorSession (io.trino.testing.TestingConnectorSession), SequenceFile (org.apache.hadoop.io.SequenceFile), File (java.io.File), OrcFile (org.apache.hadoop.hive.ql.io.orc.OrcFile), ParquetReaderConfig (io.trino.plugin.hive.parquet.ParquetReaderConfig), Test (org.testng.annotations.Test)
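
As a quick check on the final assertion in this example: each growing string column is materialized as a VariableWidthBlock, whose per-row overhead is one int offset plus one byte null flag, so the expected maximum combined bytes per row is the sum of the data columns' maximum string sizes plus (Integer.BYTES + Byte.BYTES) * numColumns = (4 + 1) * 5 = 25 bytes.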

Aggregations

ParquetReaderConfig (io.trino.plugin.hive.parquet.ParquetReaderConfig): 15
Test (org.testng.annotations.Test): 10
ParquetPageSourceFactory (io.trino.plugin.hive.parquet.ParquetPageSourceFactory): 7
FileFormatDataSourceStats (io.trino.plugin.hive.FileFormatDataSourceStats): 6
HdfsEnvironment (io.trino.plugin.hive.HdfsEnvironment): 5
OrcReaderOptions (io.trino.orc.OrcReaderOptions): 4
CheckpointSchemaManager (io.trino.plugin.deltalake.transactionlog.checkpoint.CheckpointSchemaManager): 4
HdfsConfig (io.trino.plugin.hive.HdfsConfig): 4
HdfsConfiguration (io.trino.plugin.hive.HdfsConfiguration): 4
HdfsConfigurationInitializer (io.trino.plugin.hive.HdfsConfigurationInitializer): 4
HiveHdfsConfiguration (io.trino.plugin.hive.HiveHdfsConfiguration): 4
NoHdfsAuthentication (io.trino.plugin.hive.authentication.NoHdfsAuthentication): 4
OrcPageSourceFactory (io.trino.plugin.hive.orc.OrcPageSourceFactory): 4
File (java.io.File): 4
ImmutableList (com.google.common.collect.ImmutableList): 3
ImmutableList.toImmutableList (com.google.common.collect.ImmutableList.toImmutableList): 3
ParquetWriterConfig (io.trino.plugin.hive.parquet.ParquetWriterConfig): 3
RcFilePageSourceFactory (io.trino.plugin.hive.rcfile.RcFilePageSourceFactory): 3
ConnectorSession (io.trino.spi.connector.ConnectorSession): 3
TestingConnectorSession (io.trino.testing.TestingConnectorSession): 3