
Example 1 with OrcReaderConfig

Use of io.trino.plugin.hive.orc.OrcReaderConfig in project trino by trinodb.

From the class TestHiveFileFormats, method testOrcOptimizedWriter.

@Test(dataProvider = "validRowAndFileSizePadding")
public void testOrcOptimizedWriter(int rowCount, long fileSizePadding) throws Exception {
    HiveSessionProperties hiveSessionProperties = new HiveSessionProperties(
            new HiveConfig(),
            new OrcReaderConfig(),
            new OrcWriterConfig().setValidationPercentage(100.0),
            new ParquetReaderConfig(),
            new ParquetWriterConfig());
    ConnectorSession session = TestingConnectorSession.builder().setPropertyMetadata(hiveSessionProperties.getSessionProperties()).build();
    // A Trino page cannot contain a map with null keys, so a page based writer cannot write null keys
    List<TestColumn> testColumns = TEST_COLUMNS.stream().filter(TestHiveFileFormats::withoutNullMapKeyTests).collect(toList());
    assertThatFileFormat(ORC)
            .withColumns(testColumns)
            .withRowsCount(rowCount)
            .withSession(session)
            .withFileSizePadding(fileSizePadding)
            .withFileWriterFactory(new OrcFileWriterFactory(HDFS_ENVIRONMENT, TESTING_TYPE_MANAGER, new NodeVersion("test"), STATS, new OrcWriterOptions()))
            .isReadableByRecordCursor(createGenericHiveRecordCursorProvider(HDFS_ENVIRONMENT))
            .isReadableByPageSource(new OrcPageSourceFactory(new OrcReaderOptions(), HDFS_ENVIRONMENT, STATS, UTC));
}
Also used : ParquetWriterConfig(io.trino.plugin.hive.parquet.ParquetWriterConfig) OrcWriterConfig(io.trino.plugin.hive.orc.OrcWriterConfig) OrcPageSourceFactory(io.trino.plugin.hive.orc.OrcPageSourceFactory) OrcFileWriterFactory(io.trino.plugin.hive.orc.OrcFileWriterFactory) OrcWriterOptions(io.trino.orc.OrcWriterOptions) OrcReaderConfig(io.trino.plugin.hive.orc.OrcReaderConfig) OrcReaderOptions(io.trino.orc.OrcReaderOptions) ConnectorSession(io.trino.spi.connector.ConnectorSession) TestingConnectorSession(io.trino.testing.TestingConnectorSession) ParquetReaderConfig(io.trino.plugin.hive.parquet.ParquetReaderConfig) Test(org.testng.annotations.Test)
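
The comment in Example 1 notes that a page-based writer cannot write maps with null keys, so the test drops such columns before writing. A minimal sketch of that kind of filter, in plain Java. The Column class, the column names, and the way null keys are detected are assumptions for illustration; the actual TestHiveFileFormats::withoutNullMapKeyTests predicate is not shown in the example and may work differently.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class NullMapKeyFilterSketch {
    // Hypothetical stand-in for a test column: a name plus the value to be written.
    static final class Column {
        final String name;
        final Object writeValue;

        Column(String name, Object writeValue) {
            this.name = name;
            this.writeValue = writeValue;
        }
    }

    // True when the column's value is not a map holding a null key.
    static boolean withoutNullMapKey(Column column) {
        if (column.writeValue instanceof Map) {
            // Iterate rather than call containsKey(null), which some Map implementations reject.
            for (Object key : ((Map<?, ?>) column.writeValue).keySet()) {
                if (key == null) {
                    return false;
                }
            }
        }
        return true;
    }

    // Names of the columns that survive the filter, in their original order.
    static List<String> writableColumnNames(List<Column> columns) {
        return columns.stream()
                .filter(NullMapKeyFilterSketch::withoutNullMapKey)
                .map(column -> column.name)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Integer> withNullKey = new HashMap<>();
        withNullKey.put(null, 1);
        Map<String, Integer> regular = new HashMap<>();
        regular.put("a", 1);

        List<Column> columns = new ArrayList<>();
        columns.add(new Column("t_int", 42));
        columns.add(new Column("t_map", regular));
        columns.add(new Column("t_map_null_key", withNullKey));

        System.out.println(writableColumnNames(columns)); // prints [t_int, t_map]
    }
}
```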

Example 2 with OrcReaderConfig

Use of io.trino.plugin.hive.orc.OrcReaderConfig in project trino by trinodb.

From the class TestOrcPageSourceMemoryTracking, method testMaxReadBytes.

@Test(dataProvider = "rowCount")
public void testMaxReadBytes(int rowCount) throws Exception {
    int maxReadBytes = 1_000;
    HiveSessionProperties hiveSessionProperties = new HiveSessionProperties(
            new HiveConfig(),
            new OrcReaderConfig().setMaxBlockSize(DataSize.ofBytes(maxReadBytes)),
            new OrcWriterConfig(),
            new ParquetReaderConfig(),
            new ParquetWriterConfig());
    ConnectorSession session = TestingConnectorSession.builder().setPropertyMetadata(hiveSessionProperties.getSessionProperties()).build();
    FileFormatDataSourceStats stats = new FileFormatDataSourceStats();
    // Build a table where every row gets larger, so we can test that the "batchSize" reduces
    int numColumns = 5;
    int step = 250;
    ImmutableList.Builder<TestColumn> columnBuilder = ImmutableList.<TestColumn>builder().add(new TestColumn("p_empty_string", javaStringObjectInspector, () -> "", true));
    GrowingTestColumn[] dataColumns = new GrowingTestColumn[numColumns];
    for (int i = 0; i < numColumns; i++) {
        dataColumns[i] = new GrowingTestColumn("p_string" + "_" + i, javaStringObjectInspector, () -> Long.toHexString(random.nextLong()), false, step * (i + 1));
        columnBuilder.add(dataColumns[i]);
    }
    List<TestColumn> testColumns = columnBuilder.build();
    File tempFile = File.createTempFile("trino_test_orc_page_source_max_read_bytes", "orc");
    tempFile.delete();
    TestPreparer testPreparer = new TestPreparer(tempFile.getAbsolutePath(), testColumns, rowCount, rowCount);
    ConnectorPageSource pageSource = testPreparer.newPageSource(stats, session);
    try {
        int positionCount = 0;
        while (true) {
            Page page = pageSource.getNextPage();
            if (pageSource.isFinished()) {
                break;
            }
            assertNotNull(page);
            page = page.getLoadedPage();
            positionCount += page.getPositionCount();
            // ignore the first MAX_BATCH_SIZE rows given the sizes are set when loading the blocks
            if (positionCount > MAX_BATCH_SIZE) {
                // either the block is bounded by maxReadBytes or we just load one single large block
                // an error margin MAX_BATCH_SIZE / step is needed given the block sizes are increasing
                assertTrue(page.getSizeInBytes() < maxReadBytes * (MAX_BATCH_SIZE / step) || 1 == page.getPositionCount());
            }
        }
        // verify the stats are correctly recorded
        Distribution distribution = stats.getMaxCombinedBytesPerRow().getAllTime();
        assertEquals((int) distribution.getCount(), 1);
        // the block is VariableWidthBlock that contains valueIsNull and offsets arrays as overhead
        assertEquals((int) distribution.getMax(), Arrays.stream(dataColumns).mapToInt(GrowingTestColumn::getMaxSize).sum() + (Integer.BYTES + Byte.BYTES) * numColumns);
        pageSource.close();
    } finally {
        tempFile.delete();
    }
}
Also used : ImmutableList.toImmutableList(com.google.common.collect.ImmutableList.toImmutableList) ImmutableList(com.google.common.collect.ImmutableList) ParquetWriterConfig(io.trino.plugin.hive.parquet.ParquetWriterConfig) OrcWriterConfig(io.trino.plugin.hive.orc.OrcWriterConfig) Page(io.trino.spi.Page) ConnectorPageSource(io.trino.spi.connector.ConnectorPageSource) OrcReaderConfig(io.trino.plugin.hive.orc.OrcReaderConfig) Distribution(io.airlift.stats.Distribution) ConnectorSession(io.trino.spi.connector.ConnectorSession) TestingConnectorSession(io.trino.testing.TestingConnectorSession) SequenceFile(org.apache.hadoop.io.SequenceFile) File(java.io.File) OrcFile(org.apache.hadoop.hive.ql.io.orc.OrcFile) ParquetReaderConfig(io.trino.plugin.hive.parquet.ParquetReaderConfig) Test(org.testng.annotations.Test)
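
The comments in Example 2 say that the batch size should shrink as rows get larger, and the assertion accepts either a page bounded by the byte limit or a page holding a single large row. A minimal sketch of a sizing rule with that behavior follows. This is not Trino's reader logic: the class name, the nextBatchSize method, and the MAX_BATCH_SIZE value of 1024 are all assumptions made for illustration.

```java
final class AdaptiveBatchSizeSketch {
    // Assumed cap on rows per batch; the real MAX_BATCH_SIZE constant is not shown in the example.
    static final int MAX_BATCH_SIZE = 1024;

    // Pick how many rows to read next so the estimated batch stays under maxReadBytes.
    static int nextBatchSize(long maxReadBytes, long maxBytesPerRow) {
        if (maxBytesPerRow <= 0) {
            return MAX_BATCH_SIZE; // no size information yet, so read a full batch
        }
        long rows = maxReadBytes / maxBytesPerRow;
        // Always read at least one row, even if that single row exceeds the limit.
        return (int) Math.max(1, Math.min(MAX_BATCH_SIZE, rows));
    }

    public static void main(String[] args) {
        long maxReadBytes = 1_000;
        for (long bytesPerRow : new long[] {0, 10, 250, 5_000}) {
            System.out.println(bytesPerRow + " bytes/row -> " + nextBatchSize(maxReadBytes, bytesPerRow) + " rows");
        }
    }
}
```

With a 1,000-byte limit, rows of 250 bytes give a batch of 4 rows, while a 5,000-byte row still yields a batch of 1. That is the single-large-block case the test allows for with `1 == page.getPositionCount()`.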

Example 3 with OrcReaderConfig

Use of io.trino.plugin.hive.orc.OrcReaderConfig in project trino by trinodb.

From the class TestRubixCaching, method setup.

@BeforeClass
public void setup() throws IOException {
    cacheStoragePath = getStoragePath("/");
    config = new HdfsConfig();
    List<PropertyMetadata<?>> hiveSessionProperties = getHiveSessionProperties(new HiveConfig(), new RubixEnabledConfig().setCacheEnabled(true), new OrcReaderConfig()).getSessionProperties();
    context = new HdfsContext(TestingConnectorSession.builder().setPropertyMetadata(hiveSessionProperties).build());
    nonCachingFileSystem = getNonCachingFileSystem();
}
Also used : OrcReaderConfig(io.trino.plugin.hive.orc.OrcReaderConfig) HdfsConfig(io.trino.plugin.hive.HdfsConfig) PropertyMetadata(io.trino.spi.session.PropertyMetadata) HdfsContext(io.trino.plugin.hive.HdfsEnvironment.HdfsContext) HiveConfig(io.trino.plugin.hive.HiveConfig) BeforeClass(org.testng.annotations.BeforeClass)

Example 4 with OrcReaderConfig

Use of io.trino.plugin.hive.orc.OrcReaderConfig in project trino by trinodb.

From the class ParquetTester, method assertMaxReadBytes.

void assertMaxReadBytes(List<ObjectInspector> objectInspectors, Iterable<?>[] writeValues, Iterable<?>[] readValues, List<String> columnNames, List<Type> columnTypes, Optional<MessageType> parquetSchema, DataSize maxReadBlockSize) throws Exception {
    CompressionCodecName compressionCodecName = UNCOMPRESSED;
    HiveSessionProperties hiveSessionProperties = new HiveSessionProperties(
            new HiveConfig()
                    .setHiveStorageFormat(HiveStorageFormat.PARQUET)
                    .setUseParquetColumnNames(false),
            new OrcReaderConfig(),
            new OrcWriterConfig(),
            new ParquetReaderConfig().setMaxReadBlockSize(maxReadBlockSize),
            new ParquetWriterConfig());
    ConnectorSession session = TestingConnectorSession.builder().setPropertyMetadata(hiveSessionProperties.getSessionProperties()).build();
    try (TempFile tempFile = new TempFile("test", "parquet")) {
        JobConf jobConf = new JobConf();
        jobConf.setEnum(COMPRESSION, compressionCodecName);
        jobConf.setBoolean(ENABLE_DICTIONARY, true);
        jobConf.setEnum(WRITER_VERSION, PARQUET_1_0);
        writeParquetColumn(jobConf, tempFile.getFile(), compressionCodecName, createTableProperties(columnNames, objectInspectors), getStandardStructObjectInspector(columnNames, objectInspectors), getIterators(writeValues), parquetSchema, false);
        Iterator<?>[] expectedValues = getIterators(readValues);
        try (ConnectorPageSource pageSource = fileFormat.createFileFormatReader(session, HDFS_ENVIRONMENT, tempFile.getFile(), columnNames, columnTypes)) {
            assertPageSource(columnTypes, expectedValues, pageSource, Optional.of(getParquetMaxReadBlockSize(session).toBytes()));
            assertFalse(stream(expectedValues).allMatch(Iterator::hasNext));
        }
    }
}
Also used : OrcWriterConfig(io.trino.plugin.hive.orc.OrcWriterConfig) ConnectorPageSource(io.trino.spi.connector.ConnectorPageSource) HiveSessionProperties(io.trino.plugin.hive.HiveSessionProperties) HiveConfig(io.trino.plugin.hive.HiveConfig) OrcReaderConfig(io.trino.plugin.hive.orc.OrcReaderConfig) CompressionCodecName(org.apache.parquet.hadoop.metadata.CompressionCodecName) AbstractIterator(com.google.common.collect.AbstractIterator) Iterator(java.util.Iterator) ConnectorSession(io.trino.spi.connector.ConnectorSession) TestingConnectorSession(io.trino.testing.TestingConnectorSession) JobConf(org.apache.hadoop.mapred.JobConf)
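
Example 4 relies on try-with-resources to clean up its temporary file, whereas Example 2 deletes the file by hand in a finally block. A minimal sketch of an AutoCloseable temp-file wrapper that makes the first style possible. TempFileSketch is a hypothetical helper written for this illustration; it is not the TempFile class used by ParquetTester, whose implementation is not shown.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; not the TempFile class used in the example above.
final class TempFileSketch implements AutoCloseable {
    private final Path path;

    TempFileSketch(String prefix, String suffix) throws IOException {
        // The suffix is used verbatim, so include the dot (".parquet") if an extension is wanted.
        this.path = Files.createTempFile(prefix, suffix);
    }

    Path getPath() {
        return path;
    }

    @Override
    public void close() throws IOException {
        // Runs when the try block exits, whether normally or through an exception.
        Files.deleteIfExists(path);
    }

    public static void main(String[] args) throws IOException {
        Path seen = null;
        try (TempFileSketch tempFile = new TempFileSketch("test", ".parquet")) {
            seen = tempFile.getPath();
            Files.write(seen, new byte[] {1, 2, 3});
            System.out.println(Files.size(seen)); // prints 3
        }
        System.out.println(Files.exists(seen)); // prints false
    }
}
```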

Example 5 with OrcReaderConfig

Use of io.trino.plugin.hive.orc.OrcReaderConfig in project trino by trinodb.

From the class TestHiveFileFormats, method testOrcUseColumnNameLowerCaseConversion.

@Test(dataProvider = "rowCount")
public void testOrcUseColumnNameLowerCaseConversion(int rowCount) throws Exception {
    List<TestColumn> testColumnsUpperCase = TEST_COLUMNS.stream()
            .map(testColumn -> new TestColumn(
                    testColumn.getName().toUpperCase(Locale.ENGLISH),
                    testColumn.getObjectInspector(),
                    testColumn.getWriteValue(),
                    testColumn.getExpectedValue(),
                    testColumn.isPartitionKey()))
            .collect(toList());
    ConnectorSession session = getHiveSession(new HiveConfig(), new OrcReaderConfig().setUseColumnNames(true));
    assertThatFileFormat(ORC)
            .withWriteColumns(testColumnsUpperCase)
            .withRowsCount(rowCount)
            .withReadColumns(TEST_COLUMNS)
            .withSession(session)
            .isReadableByPageSource(new OrcPageSourceFactory(new OrcReaderOptions(), HDFS_ENVIRONMENT, STATS, UTC));
}
Also used : OrcFileWriterFactory(io.trino.plugin.hive.orc.OrcFileWriterFactory) ParquetFileWriterFactory(io.trino.plugin.hive.parquet.ParquetFileWriterFactory) Test(org.testng.annotations.Test) NO_ACID_TRANSACTION(io.trino.plugin.hive.acid.AcidTransaction.NO_ACID_TRANSACTION) HiveTestUtils.createGenericHiveRecordCursorProvider(io.trino.plugin.hive.HiveTestUtils.createGenericHiveRecordCursorProvider) TrinoExceptionAssert.assertTrinoExceptionThrownBy(io.trino.testing.assertions.TrinoExceptionAssert.assertTrinoExceptionThrownBy) PARQUET(io.trino.plugin.hive.HiveStorageFormat.PARQUET) FileSplit(org.apache.hadoop.mapred.FileSplit) Locale(java.util.Locale) Configuration(org.apache.hadoop.conf.Configuration) StructuralTestUtil.rowBlockOf(io.trino.testing.StructuralTestUtil.rowBlockOf) Slices.utf8Slice(io.airlift.slice.Slices.utf8Slice) ConnectorPageSource(io.trino.spi.connector.ConnectorPageSource) ObjectInspector(org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector) AVRO(io.trino.plugin.hive.HiveStorageFormat.AVRO) SERIALIZATION_LIB(org.apache.hadoop.hive.serde.serdeConstants.SERIALIZATION_LIB) LzoCodec(io.airlift.compress.lzo.LzoCodec) ImmutableSet(com.google.common.collect.ImmutableSet) TimeZone(java.util.TimeZone) MapObjectInspector(org.apache.hadoop.hive.serde2.objectinspector.MapObjectInspector) BeforeClass(org.testng.annotations.BeforeClass) ImmutableList.toImmutableList(com.google.common.collect.ImmutableList.toImmutableList) Set(java.util.Set) Assert.assertNotNull(org.testng.Assert.assertNotNull) StructObjectInspector(org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector) Instant(java.time.Instant) Collectors(java.util.stream.Collectors) HDFS_ENVIRONMENT(io.trino.plugin.hive.HiveTestUtils.HDFS_ENVIRONMENT) String.format(java.lang.String.format) Preconditions.checkState(com.google.common.base.Preconditions.checkState) List(java.util.List) 
ColumnMapping.buildColumnMappings(io.trino.plugin.hive.HivePageSourceProvider.ColumnMapping.buildColumnMappings) OrcReaderConfig(io.trino.plugin.hive.orc.OrcReaderConfig) VarcharTypeInfo(org.apache.hadoop.hive.serde2.typeinfo.VarcharTypeInfo) Optional(java.util.Optional) ParquetReaderConfig(io.trino.plugin.hive.parquet.ParquetReaderConfig) StructField(org.apache.hadoop.hive.serde2.objectinspector.StructField) RcFilePageSourceFactory(io.trino.plugin.hive.rcfile.RcFilePageSourceFactory) ListObjectInspector(org.apache.hadoop.hive.serde2.objectinspector.ListObjectInspector) DataProvider(org.testng.annotations.DataProvider) PrimitiveCategory(org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector.PrimitiveCategory) Type(io.trino.spi.type.Type) Assert.assertEquals(org.testng.Assert.assertEquals) CSV(io.trino.plugin.hive.HiveStorageFormat.CSV) OptionalInt(java.util.OptionalInt) PrimitiveObjectInspectorFactory.getPrimitiveJavaObjectInspector(org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory.getPrimitiveJavaObjectInspector) LzopCodec(io.airlift.compress.lzo.LzopCodec) SymlinkTextInputFormat(org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat) ArrayList(java.util.ArrayList) HashSet(java.util.HashSet) ParquetPageSourceFactory(io.trino.plugin.hive.parquet.ParquetPageSourceFactory) HiveVarchar(org.apache.hadoop.hive.common.type.HiveVarchar) ParquetWriterConfig(io.trino.plugin.hive.parquet.ParquetWriterConfig) Lists(com.google.common.collect.Lists) ImmutableList(com.google.common.collect.ImmutableList) PrimitiveObjectInspector(org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector) SEQUENCEFILE(io.trino.plugin.hive.HiveStorageFormat.SEQUENCEFILE) OrcReaderOptions(io.trino.orc.OrcReaderOptions) OrcPageSourceFactory(io.trino.plugin.hive.orc.OrcPageSourceFactory) RecordPageSource(io.trino.spi.connector.RecordPageSource) Objects.requireNonNull(java.util.Objects.requireNonNull) 
TEXTFILE(io.trino.plugin.hive.HiveStorageFormat.TEXTFILE) JSON(io.trino.plugin.hive.HiveStorageFormat.JSON) OrcWriterConfig(io.trino.plugin.hive.orc.OrcWriterConfig) RCBINARY(io.trino.plugin.hive.HiveStorageFormat.RCBINARY) RecordCursor(io.trino.spi.connector.RecordCursor) Properties(java.util.Properties) ORC(io.trino.plugin.hive.HiveStorageFormat.ORC) HiveTestUtils.getTypes(io.trino.plugin.hive.HiveTestUtils.getTypes) TESTING_TYPE_MANAGER(io.trino.type.InternalTypeManager.TESTING_TYPE_MANAGER) IOException(java.io.IOException) ConnectorSession(io.trino.spi.connector.ConnectorSession) ObjectInspectorFactory.getStandardStructObjectInspector(org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory.getStandardStructObjectInspector) TupleDomain(io.trino.spi.predicate.TupleDomain) UTC(org.joda.time.DateTimeZone.UTC) File(java.io.File) TestingConnectorSession(io.trino.testing.TestingConnectorSession) SESSION(io.trino.plugin.hive.HiveTestUtils.SESSION) HiveTestUtils.getHiveSession(io.trino.plugin.hive.HiveTestUtils.getHiveSession) Collectors.toList(java.util.stream.Collectors.toList) OrcWriterOptions(io.trino.orc.OrcWriterOptions) RCTEXT(io.trino.plugin.hive.HiveStorageFormat.RCTEXT) FILE_INPUT_FORMAT(org.apache.hadoop.hive.metastore.api.hive_metastoreConstants.FILE_INPUT_FORMAT) Assert.assertTrue(org.testng.Assert.assertTrue) PrimitiveObjectInspectorFactory.javaStringObjectInspector(org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory.javaStringObjectInspector)
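
Example 5 writes columns with upper-case names and reads them back through lower-case names with setUseColumnNames(true), so the reader has to match columns by name regardless of case. A minimal sketch of such a lookup, in plain Java. The class, the resolveByName method, and the use of -1 for a missing column are assumptions for illustration and do not reflect how OrcPageSourceFactory resolves columns internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class ColumnNameMatchingSketch {
    // For each requested column, return the index of the file column with the same
    // lower-cased name, or -1 when the file has no such column.
    static List<Integer> resolveByName(List<String> fileColumns, List<String> requestedColumns) {
        Map<String, Integer> indexByLowerName = new HashMap<>();
        for (int i = 0; i < fileColumns.size(); i++) {
            // Locale.ENGLISH avoids locale-specific case rules (for example the Turkish dotless i).
            indexByLowerName.putIfAbsent(fileColumns.get(i).toLowerCase(Locale.ENGLISH), i);
        }
        List<Integer> indexes = new ArrayList<>();
        for (String requested : requestedColumns) {
            indexes.add(indexByLowerName.getOrDefault(requested.toLowerCase(Locale.ENGLISH), -1));
        }
        return indexes;
    }

    public static void main(String[] args) {
        List<String> fileColumns = Arrays.asList("T_STRING", "T_INT", "T_DOUBLE");
        List<String> requested = Arrays.asList("t_int", "t_string", "t_missing");
        System.out.println(resolveByName(fileColumns, requested)); // prints [1, 0, -1]
    }
}
```

Matching by name rather than by position is what lets the read schema list columns in a different order, or with different casing, than the file.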
