Example 1 with ParquetColumnChunkPageWriteStore

Use of org.apache.parquet.hadoop.ParquetColumnChunkPageWriteStore in project drill by axbaretto.

From the class ParquetRecordWriter, method newSchema:

private void newSchema() throws IOException {
    List<Type> types = Lists.newArrayList();
    for (MaterializedField field : batchSchema) {
        if (field.getName().equalsIgnoreCase(WriterPrel.PARTITION_COMPARATOR_FIELD)) {
            continue;
        }
        types.add(getType(field));
    }
    schema = new MessageType("root", types);
    // We don't want this number to be too small; ideally the block is divided equally across
    // the columns, though it is unlikely that all columns will be the same size.
    // Its value is likely below Integer.MAX_VALUE (2GB), although rowGroupSize is a long.
    // It is therefore cast to int, since the byte array allocated in the underlying layer
    // must have a size that fits in an int.
    int initialBlockBufferSize = max(MINIMUM_BUFFER_SIZE, blockSize / this.schema.getColumns().size() / 5);
    // We don't want this number to be too small either: ideally slightly bigger than the
    // page size, but no bigger than the block buffer.
    int initialPageBufferSize = max(MINIMUM_BUFFER_SIZE, min(pageSize + pageSize / 10, initialBlockBufferSize));
    // TODO: Use initialSlabSize from ParquetProperties once Drill is updated to the latest
    // version of the Parquet library
    int initialSlabSize = CapacityByteArrayOutputStream.initialSlabSizeHeuristic(64, pageSize, 10);
    // TODO: Replace ParquetColumnChunkPageWriteStore with ColumnChunkPageWriteStore from the
    // Parquet library once PARQUET-1006 is resolved
    pageStore = new ParquetColumnChunkPageWriteStore(codecFactory.getCompressor(codec), schema,
        initialSlabSize, pageSize, new ParquetDirectByteBufferAllocator(oContext));
    store = new ColumnWriteStoreV1(pageStore, pageSize, initialPageBufferSize, enableDictionary,
        writerVersion, new ParquetDirectByteBufferAllocator(oContext));
    MessageColumnIO columnIO = new ColumnIOFactory(false).getColumnIO(this.schema);
    consumer = columnIO.getRecordWriter(store);
    setUp(schema, consumer);
}
Also used: PrimitiveType (org.apache.parquet.schema.PrimitiveType), GroupType (org.apache.parquet.schema.GroupType), MessageType (org.apache.parquet.schema.MessageType), MinorType (org.apache.drill.common.types.TypeProtos.MinorType), Type (org.apache.parquet.schema.Type), OriginalType (org.apache.parquet.schema.OriginalType), ParquetColumnChunkPageWriteStore (org.apache.parquet.hadoop.ParquetColumnChunkPageWriteStore), ColumnWriteStoreV1 (org.apache.parquet.column.impl.ColumnWriteStoreV1), MaterializedField (org.apache.drill.exec.record.MaterializedField), MessageColumnIO (org.apache.parquet.io.MessageColumnIO), ColumnIOFactory (org.apache.parquet.io.ColumnIOFactory)
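
The sizing heuristic above is easy to exercise in isolation. The following standalone sketch reproduces the arithmetic with illustrative values; MINIMUM_BUFFER_SIZE, blockSize, and pageSize here are assumptions for demonstration, not Drill's actual defaults. Note that this older version divides by the column count without a guard, so an empty schema would throw an ArithmeticException; Example 2 below adds that guard.

import static java.lang.Math.max;
import static java.lang.Math.min;

// Standalone sketch of the buffer-sizing heuristic; all constants are illustrative assumptions.
public class BufferSizingSketch {

    // Assumed floor value; Drill defines its own MINIMUM_BUFFER_SIZE constant.
    private static final int MINIMUM_BUFFER_SIZE = 64 * 1024;

    public static void main(String[] args) {
        int blockSize = 128 * 1024 * 1024; // assumed 128 MiB row group
        int pageSize = 1024 * 1024;        // assumed 1 MiB page
        int columnCount = 10;              // stands in for schema.getColumns().size()

        // Split the block evenly across columns, then take a fifth of each share,
        // never dropping below the minimum buffer size.
        int initialBlockBufferSize = max(MINIMUM_BUFFER_SIZE, blockSize / columnCount / 5);

        // Slightly bigger than a page (by 10%), but capped by the block buffer.
        int initialPageBufferSize =
                max(MINIMUM_BUFFER_SIZE, min(pageSize + pageSize / 10, initialBlockBufferSize));

        System.out.println("initialBlockBufferSize = " + initialBlockBufferSize); // 2684354
        System.out.println("initialPageBufferSize  = " + initialPageBufferSize);  // 1153433
    }
}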

Example 2 with ParquetColumnChunkPageWriteStore

Use of org.apache.parquet.hadoop.ParquetColumnChunkPageWriteStore in project drill by apache.

From the class ParquetRecordWriter, method newSchema:

private void newSchema() throws IOException {
    List<Type> types = new ArrayList<>();
    for (MaterializedField field : batchSchema) {
        if (field.getName().equalsIgnoreCase(WriterPrel.PARTITION_COMPARATOR_FIELD)) {
            continue;
        }
        types.add(getType(field));
    }
    schema = new MessageType("root", types);
    // We don't want this number to be too small; ideally the block is divided equally across
    // the columns, though it is unlikely that all columns will be the same size.
    // Its value is likely below Integer.MAX_VALUE (2GB), although rowGroupSize is a long.
    // It is therefore cast to int, since the byte array allocated in the underlying layer
    // must have a size that fits in an int.
    int initialBlockBufferSize = this.schema.getColumns().size() > 0
        ? max(MINIMUM_BUFFER_SIZE, blockSize / this.schema.getColumns().size() / 5)
        : MINIMUM_BUFFER_SIZE;
    // We don't want this number to be too small either: ideally slightly bigger than the
    // page size, but no bigger than the block buffer.
    int initialPageBufferSize = max(MINIMUM_BUFFER_SIZE, min(pageSize + pageSize / 10, initialBlockBufferSize));
    ValuesWriterFactory valWriterFactory = writerVersion == WriterVersion.PARQUET_1_0
        ? new DefaultV1ValuesWriterFactory()
        : new DefaultV2ValuesWriterFactory();
    ParquetProperties parquetProperties = ParquetProperties.builder()
        .withPageSize(pageSize)
        .withDictionaryEncoding(enableDictionary)
        .withDictionaryPageSize(initialPageBufferSize)
        .withAllocator(new ParquetDirectByteBufferAllocator(oContext))
        .withValuesWriterFactory(valWriterFactory)
        .withWriterVersion(writerVersion)
        .build();
    // TODO: Replace ParquetColumnChunkPageWriteStore with ColumnChunkPageWriteStore from the
    // Parquet library once DRILL-7906 (PARQUET-1006) is resolved
    pageStore = new ParquetColumnChunkPageWriteStore(codecFactory.getCompressor(codec), schema,
        parquetProperties.getInitialSlabSize(), pageSize, parquetProperties.getAllocator(),
        parquetProperties.getColumnIndexTruncateLength(), parquetProperties.getPageWriteChecksumEnabled());
    store = writerVersion == WriterVersion.PARQUET_1_0
        ? new ColumnWriteStoreV1(schema, pageStore, parquetProperties)
        : new ColumnWriteStoreV2(schema, pageStore, parquetProperties);
    MessageColumnIO columnIO = new ColumnIOFactory(false).getColumnIO(this.schema);
    consumer = columnIO.getRecordWriter(store);
    setUp(schema, consumer);
}
Also used: ArrayList (java.util.ArrayList), ParquetColumnChunkPageWriteStore (org.apache.parquet.hadoop.ParquetColumnChunkPageWriteStore), ParquetProperties (org.apache.parquet.column.ParquetProperties), ColumnWriteStoreV1 (org.apache.parquet.column.impl.ColumnWriteStoreV1), ColumnWriteStoreV2 (org.apache.parquet.column.impl.ColumnWriteStoreV2), MaterializedField (org.apache.drill.exec.record.MaterializedField), MessageColumnIO (org.apache.parquet.io.MessageColumnIO), ColumnIOFactory (org.apache.parquet.io.ColumnIOFactory), DefaultV1ValuesWriterFactory (org.apache.parquet.column.values.factory.DefaultV1ValuesWriterFactory), DefaultV2ValuesWriterFactory (org.apache.parquet.column.values.factory.DefaultV2ValuesWriterFactory), ValuesWriterFactory (org.apache.parquet.column.values.factory.ValuesWriterFactory), PrimitiveType (org.apache.parquet.schema.PrimitiveType), GroupType (org.apache.parquet.schema.GroupType), MessageType (org.apache.parquet.schema.MessageType), MinorType (org.apache.drill.common.types.TypeProtos.MinorType), Type (org.apache.parquet.schema.Type), OriginalType (org.apache.parquet.schema.OriginalType)
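
Because this version delegates slab sizing, allocation, and checksum settings to ParquetProperties, the builder can be tried on its own. Here is a minimal sketch, assuming parquet-column is on the classpath and substituting the library's HeapByteBufferAllocator for Drill's ParquetDirectByteBufferAllocator (which needs an operator context); the page and dictionary sizes are illustrative.

import org.apache.parquet.bytes.HeapByteBufferAllocator;
import org.apache.parquet.column.ParquetProperties;
import org.apache.parquet.column.ParquetProperties.WriterVersion;
import org.apache.parquet.column.values.factory.DefaultV1ValuesWriterFactory;
import org.apache.parquet.column.values.factory.DefaultV2ValuesWriterFactory;
import org.apache.parquet.column.values.factory.ValuesWriterFactory;

public class ParquetPropertiesSketch {
    public static void main(String[] args) {
        WriterVersion writerVersion = WriterVersion.PARQUET_1_0;

        // Same version-based choice of values-writer factory as in newSchema() above.
        ValuesWriterFactory valWriterFactory = writerVersion == WriterVersion.PARQUET_1_0
                ? new DefaultV1ValuesWriterFactory()
                : new DefaultV2ValuesWriterFactory();

        // A heap allocator stands in for Drill's direct allocator in this sketch.
        ParquetProperties props = ParquetProperties.builder()
                .withPageSize(1024 * 1024)         // illustrative 1 MiB page
                .withDictionaryEncoding(true)
                .withDictionaryPageSize(64 * 1024) // illustrative 64 KiB dictionary page
                .withAllocator(new HeapByteBufferAllocator())
                .withValuesWriterFactory(valWriterFactory)
                .withWriterVersion(writerVersion)
                .build();

        // The same values newSchema() reads back when constructing the page store.
        System.out.println("initial slab size:  " + props.getInitialSlabSize());
        System.out.println("checksums enabled:  " + props.getPageWriteChecksumEnabled());
        System.out.println("index truncate len: " + props.getColumnIndexTruncateLength());
    }
}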

Aggregations

MinorType (org.apache.drill.common.types.TypeProtos.MinorType): 2
MaterializedField (org.apache.drill.exec.record.MaterializedField): 2
ColumnWriteStoreV1 (org.apache.parquet.column.impl.ColumnWriteStoreV1): 2
ParquetColumnChunkPageWriteStore (org.apache.parquet.hadoop.ParquetColumnChunkPageWriteStore): 2
ColumnIOFactory (org.apache.parquet.io.ColumnIOFactory): 2
MessageColumnIO (org.apache.parquet.io.MessageColumnIO): 2
GroupType (org.apache.parquet.schema.GroupType): 2
MessageType (org.apache.parquet.schema.MessageType): 2
OriginalType (org.apache.parquet.schema.OriginalType): 2
PrimitiveType (org.apache.parquet.schema.PrimitiveType): 2
Type (org.apache.parquet.schema.Type): 2
ArrayList (java.util.ArrayList): 1
ParquetProperties (org.apache.parquet.column.ParquetProperties): 1
ColumnWriteStoreV2 (org.apache.parquet.column.impl.ColumnWriteStoreV2): 1
DefaultV1ValuesWriterFactory (org.apache.parquet.column.values.factory.DefaultV1ValuesWriterFactory): 1
DefaultV2ValuesWriterFactory (org.apache.parquet.column.values.factory.DefaultV2ValuesWriterFactory): 1
ValuesWriterFactory (org.apache.parquet.column.values.factory.ValuesWriterFactory): 1