
Example 6 with GroupType

use of org.apache.parquet.schema.GroupType in project hive by apache.

the class DataWritableReadSupport method getProjectedType.

private static Type getProjectedType(TypeInfo colType, Type fieldType) {
    switch(colType.getCategory()) {
        case STRUCT:
            List<Type> groupFields = getProjectedGroupFields(fieldType.asGroupType(), ((StructTypeInfo) colType).getAllStructFieldNames(), ((StructTypeInfo) colType).getAllStructFieldTypeInfos());
            Type[] typesArray = groupFields.toArray(new Type[0]);
            return Types.buildGroup(fieldType.getRepetition()).addFields(typesArray).named(fieldType.getName());
        case LIST:
            TypeInfo elemType = ((ListTypeInfo) colType).getListElementTypeInfo();
            if (elemType.getCategory() == ObjectInspector.Category.STRUCT) {
                Type subFieldType = fieldType.asGroupType().getType(0);
                if (!subFieldType.isPrimitive()) {
                    String subFieldName = subFieldType.getName();
                    Text name = new Text(subFieldName);
                    if (name.equals(ParquetHiveSerDe.ARRAY) || name.equals(ParquetHiveSerDe.LIST)) {
                        subFieldType = new GroupType(Repetition.REPEATED, subFieldName, getProjectedType(elemType, subFieldType.asGroupType().getType(0)));
                    } else {
                        subFieldType = getProjectedType(elemType, subFieldType);
                    }
                    return Types.buildGroup(Repetition.OPTIONAL).as(OriginalType.LIST).addFields(subFieldType).named(fieldType.getName());
                }
            }
            break;
        default:
    }
    return fieldType;
}
Also used: OriginalType (org.apache.parquet.schema.OriginalType), GroupType (org.apache.parquet.schema.GroupType), MessageType (org.apache.parquet.schema.MessageType), Type (org.apache.parquet.schema.Type), ListTypeInfo (org.apache.hadoop.hive.serde2.typeinfo.ListTypeInfo), StructTypeInfo (org.apache.hadoop.hive.serde2.typeinfo.StructTypeInfo), TypeInfo (org.apache.hadoop.hive.serde2.typeinfo.TypeInfo), Text (org.apache.hadoop.io.Text)

Example 7 with GroupType

use of org.apache.parquet.schema.GroupType in project hive by apache.

the class DataWritableReadSupport method projectLeafTypes.

private static List<Type> projectLeafTypes(List<Type> types, List<FieldNode> nodes) {
    List<Type> res = new ArrayList<>();
    if (nodes.isEmpty()) {
        return res;
    }
    Map<String, FieldNode> fieldMap = new HashMap<>();
    for (FieldNode n : nodes) {
        fieldMap.put(n.getFieldName().toLowerCase(), n);
    }
    for (Type type : types) {
        String tn = type.getName().toLowerCase();
        if (fieldMap.containsKey(tn)) {
            FieldNode f = fieldMap.get(tn);
            if (f.getNodes().isEmpty()) {
                // no child, no need for pruning
                res.add(type);
            } else {
                if (type instanceof GroupType) {
                    GroupType groupType = type.asGroupType();
                    List<Type> ts = projectLeafTypes(groupType.getFields(), f.getNodes());
                    GroupType g = buildProjectedGroupType(groupType, ts);
                    if (g != null) {
                        res.add(g);
                    }
                } else {
                    throw new RuntimeException("Primitive type " + f.getFieldName() + " should not have child fields, but found " + f);
                }
            }
        }
    }
    return res;
}
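The recursive pruning in projectLeafTypes can be sketched independently of the Parquet and Hive classes. The `Node` class below is a hypothetical, simplified stand-in for both Parquet's `Type` tree and Hive's `FieldNode` (the real method operates on those types, matches names case-insensitively via a map, and wraps group results with `buildProjectedGroupType`); this is only an illustration of the recursion, not the actual implementation.

```java
import java.util.*;

public class PruneSketch {
    // Hypothetical stand-in for both a schema type tree and a requested-field tree.
    static class Node {
        final String name;
        final List<Node> children; // empty => leaf (primitive / no child filter)
        Node(String name, Node... children) {
            this.name = name;
            this.children = new ArrayList<>(Arrays.asList(children));
        }
        boolean isLeaf() { return children.isEmpty(); }
    }

    // Keep only the schema branches named by the requested nodes, recursing into groups.
    static List<Node> project(List<Node> schema, List<Node> requested) {
        List<Node> res = new ArrayList<>();
        Map<String, Node> wanted = new HashMap<>();
        for (Node r : requested) {
            wanted.put(r.name.toLowerCase(), r);
        }
        for (Node s : schema) {
            Node r = wanted.get(s.name.toLowerCase());
            if (r == null) {
                continue; // not requested: prune this branch
            }
            if (r.isLeaf()) {
                res.add(s); // no child filter: keep the whole subtree
            } else {
                Node pruned = new Node(s.name);
                pruned.children.addAll(project(s.children, r.children));
                res.add(pruned);
            }
        }
        return res;
    }
}
```

Requesting `a.x` from a schema `{a{x,y}, b}` keeps only `a` with the single child `x`, mirroring how the Hive method drops unreferenced struct fields.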
Also used: OriginalType (org.apache.parquet.schema.OriginalType), GroupType (org.apache.parquet.schema.GroupType), MessageType (org.apache.parquet.schema.MessageType), Type (org.apache.parquet.schema.Type), FieldNode (org.apache.hadoop.hive.ql.optimizer.FieldNode), HashMap (java.util.HashMap), ArrayList (java.util.ArrayList)

Example 8 with GroupType

use of org.apache.parquet.schema.GroupType in project drill by apache.

the class ParquetRecordWriter method getType.

private Type getType(MaterializedField field) {
    MinorType minorType = field.getType().getMinorType();
    DataMode dataMode = field.getType().getMode();
    switch(minorType) {
        case MAP:
            List<Type> types = Lists.newArrayList();
            for (MaterializedField childField : field.getChildren()) {
                types.add(getType(childField));
            }
            return new GroupType(dataMode == DataMode.REPEATED ? Repetition.REPEATED : Repetition.OPTIONAL, field.getLastName(), types);
        case LIST:
            throw new UnsupportedOperationException("Unsupported type " + minorType);
        default:
            return getPrimitiveType(field);
    }
}
Also used: PrimitiveType (org.apache.parquet.schema.PrimitiveType), GroupType (org.apache.parquet.schema.GroupType), MessageType (org.apache.parquet.schema.MessageType), Type (org.apache.parquet.schema.Type), OriginalType (org.apache.parquet.schema.OriginalType), MinorType (org.apache.drill.common.types.TypeProtos.MinorType), DataMode (org.apache.drill.common.types.TypeProtos.DataMode), MaterializedField (org.apache.drill.exec.record.MaterializedField)

Example 9 with GroupType

use of org.apache.parquet.schema.GroupType in project hive by apache.

the class VectorizedParquetRecordReader method buildVectorizedParquetReader.

// Build VectorizedParquetColumnReader via Hive typeInfo and Parquet schema
private VectorizedColumnReader buildVectorizedParquetReader(TypeInfo typeInfo, Type type, PageReadStore pages, List<ColumnDescriptor> columnDescriptors, boolean skipTimestampConversion, int depth) throws IOException {
    List<ColumnDescriptor> descriptors = getAllColumnDescriptorByType(depth, type, columnDescriptors);
    switch(typeInfo.getCategory()) {
        case PRIMITIVE:
            if (columnDescriptors == null || columnDescriptors.isEmpty()) {
                throw new RuntimeException("Failed to find related Parquet column descriptor with type " + type);
            }
            if (fileSchema.getColumns().contains(descriptors.get(0))) {
                return new VectorizedPrimitiveColumnReader(descriptors.get(0), pages.getPageReader(descriptors.get(0)), skipTimestampConversion, type, typeInfo);
            } else {
                // Support for schema evolution
                return new VectorizedDummyColumnReader();
            }
        case STRUCT:
            StructTypeInfo structTypeInfo = (StructTypeInfo) typeInfo;
            List<VectorizedColumnReader> fieldReaders = new ArrayList<>();
            List<TypeInfo> fieldTypes = structTypeInfo.getAllStructFieldTypeInfos();
            List<Type> types = type.asGroupType().getFields();
            for (int i = 0; i < fieldTypes.size(); i++) {
                VectorizedColumnReader r = buildVectorizedParquetReader(fieldTypes.get(i), types.get(i), pages, descriptors, skipTimestampConversion, depth + 1);
                if (r != null) {
                    fieldReaders.add(r);
                } else {
                    throw new RuntimeException("Failed to build Parquet vectorized reader based on Hive type " + fieldTypes.get(i).getTypeName() + " and Parquet type " + types.get(i).toString());
                }
            }
            return new VectorizedStructColumnReader(fieldReaders);
        case LIST:
            checkListColumnSupport(((ListTypeInfo) typeInfo).getListElementTypeInfo());
            if (columnDescriptors == null || columnDescriptors.isEmpty()) {
                throw new RuntimeException("Failed to find related Parquet column descriptor with type " + type);
            }
            return new VectorizedListColumnReader(descriptors.get(0), pages.getPageReader(descriptors.get(0)), skipTimestampConversion, getElementType(type), typeInfo);
        case MAP:
            if (columnDescriptors == null || columnDescriptors.isEmpty()) {
                throw new RuntimeException("Failed to find related Parquet column descriptor with type " + type);
            }
            // Handle the two different Map definitions in Parquet, e.g.:
            // definition with 1 group:
            //   repeated group map (MAP_KEY_VALUE) {
            //     required binary key (UTF8);
            //     optional binary value (UTF8);
            //   }
            // definition with 2 groups:
            //   optional group m1 (MAP) {
            //     repeated group map (MAP_KEY_VALUE) {
            //       required binary key (UTF8);
            //       optional binary value (UTF8);
            //     }
            //   }
            int nestGroup = 0;
            GroupType groupType = type.asGroupType();
            // Unwrap single-field wrapper groups until the key/value pair is reached,
            // giving up after MAP_DEFINITION_LEVEL_MAX levels.
            while (groupType.getFieldCount() < 2) {
                if (nestGroup > MAP_DEFINITION_LEVEL_MAX) {
                    throw new RuntimeException("More than " + MAP_DEFINITION_LEVEL_MAX + " levels of nesting found in the Map definition; failed to get the field types for Map with type " + type);
                }
                groupType = groupType.getFields().get(0).asGroupType();
                nestGroup++;
            }
            List<Type> kvTypes = groupType.getFields();
            VectorizedListColumnReader keyListColumnReader = new VectorizedListColumnReader(descriptors.get(0), pages.getPageReader(descriptors.get(0)), skipTimestampConversion, kvTypes.get(0), typeInfo);
            VectorizedListColumnReader valueListColumnReader = new VectorizedListColumnReader(descriptors.get(1), pages.getPageReader(descriptors.get(1)), skipTimestampConversion, kvTypes.get(1), typeInfo);
            return new VectorizedMapColumnReader(keyListColumnReader, valueListColumnReader);
        case UNION:
        default:
            throw new RuntimeException("Unsupported category " + typeInfo.getCategory().name());
    }
}
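The map-handling loop above descends through single-field wrapper groups until it reaches the group holding the key/value pair. That unwrapping can be sketched with a plain-Java stand-in; `Group` and `MAX_DEPTH` below are hypothetical simplifications of Parquet's `GroupType` and Hive's `MAP_DEFINITION_LEVEL_MAX`, intended only to show the loop's structure.

```java
import java.util.*;

public class MapUnwrapSketch {
    // Hypothetical stand-in for Parquet's GroupType: a named node with a field list.
    static class Group {
        final String name;
        final List<Group> fields;
        Group(String name, Group... fields) {
            this.name = name;
            this.fields = Arrays.asList(fields);
        }
    }

    static final int MAX_DEPTH = 5; // stand-in for MAP_DEFINITION_LEVEL_MAX

    // Descend through single-field wrapper groups until a group with at least
    // two fields (the key/value pair) appears, or the depth limit is exceeded.
    static Group unwrapToKeyValue(Group g) {
        int depth = 0;
        while (g.fields.size() < 2) {
            if (depth > MAX_DEPTH) {
                throw new RuntimeException("No key/value group within " + MAX_DEPTH + " levels");
            }
            g = g.fields.get(0);
            depth++;
        }
        return g;
    }
}
```

With a two-group definition (`m1` wrapping the repeated `map` group), one iteration strips the outer wrapper; with a one-group definition, the loop body never runs.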
Also used: ColumnDescriptor (org.apache.parquet.column.ColumnDescriptor), ArrayList (java.util.ArrayList), StructTypeInfo (org.apache.hadoop.hive.serde2.typeinfo.StructTypeInfo), PrimitiveTypeInfo (org.apache.hadoop.hive.serde2.typeinfo.PrimitiveTypeInfo), ListTypeInfo (org.apache.hadoop.hive.serde2.typeinfo.ListTypeInfo), TypeInfo (org.apache.hadoop.hive.serde2.typeinfo.TypeInfo), PrimitiveType (org.apache.parquet.schema.PrimitiveType), GroupType (org.apache.parquet.schema.GroupType), MessageType (org.apache.parquet.schema.MessageType), Type (org.apache.parquet.schema.Type), ParquetRuntimeException (org.apache.parquet.ParquetRuntimeException)

Example 10 with GroupType

use of org.apache.parquet.schema.GroupType in project hive by apache.

the class DataWritableWriter method createWriter.

/**
 * Creates a writer for the specific object inspector. The returned writer will be used
 * to call Parquet API for the specific data type.
 * @param inspector The object inspector used to get the correct value type.
 * @param type Type that contains information about the type schema.
 * @return A DataWriter object used to call the Parquet API for the specific data type.
 */
private DataWriter createWriter(ObjectInspector inspector, Type type) {
    if (type.isPrimitive()) {
        checkInspectorCategory(inspector, ObjectInspector.Category.PRIMITIVE);
        PrimitiveObjectInspector primitiveInspector = (PrimitiveObjectInspector) inspector;
        switch(primitiveInspector.getPrimitiveCategory()) {
            case BOOLEAN:
                return new BooleanDataWriter((BooleanObjectInspector) inspector);
            case BYTE:
                return new ByteDataWriter((ByteObjectInspector) inspector);
            case SHORT:
                return new ShortDataWriter((ShortObjectInspector) inspector);
            case INT:
                return new IntDataWriter((IntObjectInspector) inspector);
            case LONG:
                return new LongDataWriter((LongObjectInspector) inspector);
            case FLOAT:
                return new FloatDataWriter((FloatObjectInspector) inspector);
            case DOUBLE:
                return new DoubleDataWriter((DoubleObjectInspector) inspector);
            case STRING:
                return new StringDataWriter((StringObjectInspector) inspector);
            case CHAR:
                return new CharDataWriter((HiveCharObjectInspector) inspector);
            case VARCHAR:
                return new VarcharDataWriter((HiveVarcharObjectInspector) inspector);
            case BINARY:
                return new BinaryDataWriter((BinaryObjectInspector) inspector);
            case TIMESTAMP:
                return new TimestampDataWriter((TimestampObjectInspector) inspector);
            case DECIMAL:
                return new DecimalDataWriter((HiveDecimalObjectInspector) inspector);
            case DATE:
                return new DateDataWriter((DateObjectInspector) inspector);
            default:
                throw new IllegalArgumentException("Unsupported primitive data type: " + primitiveInspector.getPrimitiveCategory());
        }
    } else {
        GroupType groupType = type.asGroupType();
        OriginalType originalType = type.getOriginalType();
        if (originalType != null && originalType.equals(OriginalType.LIST)) {
            checkInspectorCategory(inspector, ObjectInspector.Category.LIST);
            return new ListDataWriter((ListObjectInspector) inspector, groupType);
        } else if (originalType != null && originalType.equals(OriginalType.MAP)) {
            checkInspectorCategory(inspector, ObjectInspector.Category.MAP);
            return new MapDataWriter((MapObjectInspector) inspector, groupType);
        } else {
            checkInspectorCategory(inspector, ObjectInspector.Category.STRUCT);
            return new StructDataWriter((StructObjectInspector) inspector, groupType);
        }
    }
}
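The branching in createWriter reduces to a two-level dispatch: primitive types select a writer by primitive category, and group types select by the `OriginalType` annotation (`LIST`, `MAP`, or none for struct). A minimal sketch of that decision follows; writer names are returned as strings purely for illustration, whereas the real method returns `DataWriter` instances tied to object inspectors.

```java
public class WriterDispatchSketch {
    // Mirrors createWriter's branching: primitives get a primitive writer,
    // groups dispatch on the logical (original) type annotation,
    // and an unannotated group is treated as a struct.
    static String chooseWriter(boolean isPrimitive, String originalType) {
        if (isPrimitive) {
            return "PrimitiveDataWriter"; // real code switches further on primitive category
        }
        if ("LIST".equals(originalType)) {
            return "ListDataWriter";
        }
        if ("MAP".equals(originalType)) {
            return "MapDataWriter";
        }
        return "StructDataWriter"; // no annotation => struct
    }
}
```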
Also used: OriginalType (org.apache.parquet.schema.OriginalType), GroupType (org.apache.parquet.schema.GroupType), MapObjectInspector (org.apache.hadoop.hive.serde2.objectinspector.MapObjectInspector), PrimitiveObjectInspector (org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector), StructObjectInspector (org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector)

Aggregations

GroupType (org.apache.parquet.schema.GroupType): 10 usages
MessageType (org.apache.parquet.schema.MessageType): 8 usages
Type (org.apache.parquet.schema.Type): 8 usages
OriginalType (org.apache.parquet.schema.OriginalType): 7 usages
PrimitiveType (org.apache.parquet.schema.PrimitiveType): 4 usages
ArrayList (java.util.ArrayList): 3 usages
ListTypeInfo (org.apache.hadoop.hive.serde2.typeinfo.ListTypeInfo): 3 usages
StructTypeInfo (org.apache.hadoop.hive.serde2.typeinfo.StructTypeInfo): 3 usages
TypeInfo (org.apache.hadoop.hive.serde2.typeinfo.TypeInfo): 3 usages
HashMap (java.util.HashMap): 1 usage
DataMode (org.apache.drill.common.types.TypeProtos.DataMode): 1 usage
MinorType (org.apache.drill.common.types.TypeProtos.MinorType): 1 usage
MaterializedField (org.apache.drill.exec.record.MaterializedField): 1 usage
FieldNode (org.apache.hadoop.hive.ql.optimizer.FieldNode): 1 usage
MapObjectInspector (org.apache.hadoop.hive.serde2.objectinspector.MapObjectInspector): 1 usage
PrimitiveObjectInspector (org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector): 1 usage
StructObjectInspector (org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector): 1 usage
PrimitiveTypeInfo (org.apache.hadoop.hive.serde2.typeinfo.PrimitiveTypeInfo): 1 usage
Text (org.apache.hadoop.io.Text): 1 usage
ParquetRuntimeException (org.apache.parquet.ParquetRuntimeException): 1 usage