
Example 1 with MessageType

Use of org.apache.hive.iceberg.org.apache.parquet.schema.MessageType in project hive by apache.

In the class HiveVectorizedReader, method parquetRecordReader.

private static RecordReader<NullWritable, VectorizedRowBatch> parquetRecordReader(JobConf job, Reporter reporter, FileScanTask task, Path path, long start, long length) throws IOException {
    InputSplit split = new FileSplit(path, start, length, job);
    VectorizedParquetInputFormat inputFormat = new VectorizedParquetInputFormat();
    // Read the Parquet footer to obtain the schema stored in the file.
    MessageType fileSchema = ParquetFileReader.readFooter(job, path).getFileMetaData().getSchema();
    MessageType typeWithIds = null;
    Schema expectedSchema = task.spec().schema();
    // Prune the file schema to the expected columns: by field ID when the file
    // carries IDs, otherwise after assigning fallback IDs.
    if (ParquetSchemaUtil.hasIds(fileSchema)) {
        typeWithIds = ParquetSchemaUtil.pruneColumns(fileSchema, expectedSchema);
    } else {
        typeWithIds = ParquetSchemaUtil.pruneColumnsFallback(ParquetSchemaUtil.addFallbackIds(fileSchema), expectedSchema);
    }
    // Visit the expected schema together with the pruned file schema to collect
    // the column names the Hive Parquet reader should request.
    ParquetSchemaFieldNameVisitor psv = new ParquetSchemaFieldNameVisitor(fileSchema);
    TypeWithSchemaVisitor.visit(expectedSchema.asStruct(), typeWithIds, psv);
    // Hand the resolved column list to the reader through the job configuration.
    job.set(IOConstants.COLUMNS, psv.retrieveColumnNameList());
    return inputFormat.getRecordReader(split, job, reporter);
}
Also used : VectorizedParquetInputFormat(org.apache.hadoop.hive.ql.io.parquet.VectorizedParquetInputFormat) Schema(org.apache.iceberg.Schema) FileSplit(org.apache.hadoop.mapred.FileSplit) InputSplit(org.apache.hadoop.mapred.InputSplit) MessageType(org.apache.hive.iceberg.org.apache.parquet.schema.MessageType)
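
The branch on hasIds in the method above can be illustrated without any Hive or Iceberg dependencies. The following is a minimal sketch in plain Java, not the library's implementation: the class FallbackIdSketch, its Column type, and its methods are hypothetical, and it assumes that fallback IDs are simply positional (1, 2, 3, ...) and that a file "has IDs" when at least one column carries one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical model of a file column: a name plus an optional field ID.
final class FallbackIdSketch {
    static final class Column {
        final String name;
        final Integer id; // null when the file stores no field ID for this column

        Column(String name, Integer id) {
            this.name = name;
            this.id = id;
        }
    }

    // Assumption: a file "has IDs" if at least one column carries an ID.
    static boolean hasIds(List<Column> fileColumns) {
        for (Column c : fileColumns) {
            if (c.id != null) {
                return true;
            }
        }
        return false;
    }

    // Assumption: fallback IDs are assigned by position, starting at 1.
    static List<Column> addPositionalIds(List<Column> fileColumns) {
        List<Column> result = new ArrayList<>();
        int ordinal = 1;
        for (Column c : fileColumns) {
            result.add(new Column(c.name, ordinal));
            ordinal += 1;
        }
        return result;
    }

    // Keep only the file columns whose (real or fallback) ID is expected.
    static List<String> project(List<Column> fileColumns, Set<Integer> expectedIds) {
        List<Column> withIds = hasIds(fileColumns) ? fileColumns : addPositionalIds(fileColumns);
        List<String> names = new ArrayList<>();
        for (Column c : withIds) {
            if (c.id != null && expectedIds.contains(c.id)) {
                names.add(c.name);
            }
        }
        return names;
    }
}
```

With three ID-less columns a, b, c and expected IDs {1, 3}, project returns the names a and c, because the columns receive positional IDs 1 to 3 before pruning.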

Example 2 with MessageType

Use of org.apache.hive.iceberg.org.apache.parquet.schema.MessageType in project hive by apache.

In the class ParquetSchemaFieldNameVisitor, method struct.

@Override
public Type struct(Types.StructType expected, GroupType struct, List<Type> fields) {
    boolean isMessageType = struct instanceof MessageType;
    List<Types.NestedField> expectedFields = expected != null ? expected.fields() : ImmutableList.of();
    List<Type> types = Lists.newArrayListWithExpectedSize(expectedFields.size());
    for (Types.NestedField field : expectedFields) {
        int id = field.fieldId();
        if (MetadataColumns.metadataFieldIds().contains(id)) {
            continue;
        }
        Type fieldInPrunedFileSchema = typesById.get(id);
        if (fieldInPrunedFileSchema == null) {
            if (!originalFileSchema.containsField(field.name())) {
                // Must be a new field - it isn't in this parquet file yet, so add the new field name instead of null
                appendToColNamesList(isMessageType, field.name());
            } else {
                // This field is found in the parquet file with a different ID, so it must have been recreated since.
                // Inserting a dummy col name to force Hive Parquet reader returning null for this column.
                appendToColNamesList(isMessageType, DUMMY_COL_NAME);
            }
        } else {
            // Already present column in this parquet file, add the original name
            types.add(fieldInPrunedFileSchema);
            appendToColNamesList(isMessageType, fieldInPrunedFileSchema.getName());
        }
    }
    if (!isMessageType) {
        GroupType groupType = new GroupType(Type.Repetition.REPEATED, fieldNames.peek(), types);
        typesById.put(struct.getId().intValue(), groupType);
        return groupType;
    } else {
        return new MessageType("table", types);
    }
}
Also used : Types(org.apache.iceberg.types.Types) MessageType(org.apache.hive.iceberg.org.apache.parquet.schema.MessageType) Type(org.apache.hive.iceberg.org.apache.parquet.schema.Type) PrimitiveType(org.apache.hive.iceberg.org.apache.parquet.schema.PrimitiveType) GroupType(org.apache.hive.iceberg.org.apache.parquet.schema.GroupType)
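
The three-way decision inside the loop of the struct method can be isolated from the Parquet and Iceberg types. The following is a minimal sketch in plain Java under stated assumptions: the class ColumnNameResolutionSketch, its ExpectedField type, the resolve method, and the placeholder value of DUMMY_COL_NAME are hypothetical, the maps stand in for typesById and originalFileSchema, and the metadata-column skip and nested-struct handling are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class ColumnNameResolutionSketch {
    // Hypothetical placeholder; the real constant's value is not shown in the source.
    static final String DUMMY_COL_NAME = "<dummy>";

    // Hypothetical stand-in for an expected table field (ID plus current name).
    static final class ExpectedField {
        final int id;
        final String name;

        ExpectedField(int id, String name) {
            this.id = id;
            this.name = name;
        }
    }

    // fileNamesById: names of pruned file columns keyed by field ID.
    // fileFieldNames: all column names present in the original file schema.
    static List<String> resolve(List<ExpectedField> expected,
                                Map<Integer, String> fileNamesById,
                                Set<String> fileFieldNames) {
        List<String> columnNames = new ArrayList<>();
        for (ExpectedField field : expected) {
            String nameInFile = fileNamesById.get(field.id);
            if (nameInFile != null) {
                // Column exists in the file under this ID: use the file's own name.
                columnNames.add(nameInFile);
            } else if (!fileFieldNames.contains(field.name)) {
                // New field that the file does not contain yet: keep the new name.
                columnNames.add(field.name);
            } else {
                // Same name but a different ID in the file: the column was recreated,
                // so a dummy name is used to make the reader return null for it.
                columnNames.add(DUMMY_COL_NAME);
            }
        }
        return columnNames;
    }
}
```

For example, a field renamed from name to customer_name (same ID) resolves to name, a newly added email field resolves to email, and an age field that was dropped and recreated with a new ID resolves to the dummy name.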

Aggregations

MessageType (org.apache.hive.iceberg.org.apache.parquet.schema.MessageType): 2
VectorizedParquetInputFormat (org.apache.hadoop.hive.ql.io.parquet.VectorizedParquetInputFormat): 1
FileSplit (org.apache.hadoop.mapred.FileSplit): 1
InputSplit (org.apache.hadoop.mapred.InputSplit): 1
GroupType (org.apache.hive.iceberg.org.apache.parquet.schema.GroupType): 1
PrimitiveType (org.apache.hive.iceberg.org.apache.parquet.schema.PrimitiveType): 1
Type (org.apache.hive.iceberg.org.apache.parquet.schema.Type): 1
Schema (org.apache.iceberg.Schema): 1
Types (org.apache.iceberg.types.Types): 1