
Example 1 with TupleRecordMaterializer

Use of org.apache.parquet.pig.convert.TupleRecordMaterializer in project parquet-mr by apache, in the class TupleReadSupport, method prepareForRead.

@Override
public RecordMaterializer<Tuple> prepareForRead(Configuration configuration, Map<String, String> keyValueMetaData, MessageType fileSchema, ReadContext readContext) {
    MessageType requestedSchema = readContext.getRequestedSchema();
    Schema requestedPigSchema = getPigSchema(configuration);
    if (requestedPigSchema == null) {
        throw new ParquetDecodingException("Missing Pig schema: ParquetLoader sets the schema in the job conf");
    }
    boolean elephantBirdCompatible = configuration.getBoolean(PARQUET_PIG_ELEPHANT_BIRD_COMPATIBLE, false);
    boolean columnIndexAccess = configuration.getBoolean(PARQUET_COLUMN_INDEX_ACCESS, false);
    if (elephantBirdCompatible) {
        LOG.info("Numbers will default to 0 instead of NULL; Boolean will be converted to Int");
    }
    return new TupleRecordMaterializer(requestedSchema, requestedPigSchema, elephantBirdCompatible, columnIndexAccess);
}
Also used: ParquetDecodingException (org.apache.parquet.io.ParquetDecodingException), PigSchemaConverter.parsePigSchema (org.apache.parquet.pig.PigSchemaConverter.parsePigSchema), Schema (org.apache.pig.impl.logicalLayer.schema.Schema), FieldSchema (org.apache.pig.impl.logicalLayer.schema.Schema.FieldSchema), TupleRecordMaterializer (org.apache.parquet.pig.convert.TupleRecordMaterializer), MessageType (org.apache.parquet.schema.MessageType)
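The method above reads two boolean flags with a default of false and, per its log message, changes how numbers and booleans are surfaced when the Elephant Bird flag is on. The sketch below illustrates that idea in isolation. It is not parquet-mr code: the class CompatFlagSketch, the key name, and the helper methods are hypothetical, a plain Map stands in for the Hadoop Configuration, and the conversion rules are only a reading of the log message.

```java
import java.util.HashMap;
import java.util.Map;

public class CompatFlagSketch {
    // Hypothetical key, for illustration only; the real constant is defined in parquet-mr.
    static final String COMPAT_KEY = "example.elephantbird.compatible";

    // Rough stand-in for a getBoolean(name, defaultValue) lookup:
    // "true"/"false" (case-insensitive) are honored, anything else yields the default.
    static boolean getBoolean(Map<String, String> conf, String key, boolean defaultValue) {
        String raw = conf.get(key);
        if (raw == null) {
            return defaultValue;
        }
        String value = raw.trim();
        if ("true".equalsIgnoreCase(value)) {
            return true;
        }
        if ("false".equalsIgnoreCase(value)) {
            return false;
        }
        return defaultValue;
    }

    // Assumed rule from the log message: a missing number becomes 0 instead of null.
    static Integer readInt(Integer stored, boolean compatible) {
        if (stored == null && compatible) {
            return 0;
        }
        return stored;
    }

    // Assumed rule from the log message: booleans are surfaced as ints (1 or 0).
    static Object readBoolean(boolean stored, boolean compatible) {
        if (compatible) {
            return Integer.valueOf(stored ? 1 : 0);
        }
        return Boolean.valueOf(stored);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        boolean off = getBoolean(conf, COMPAT_KEY, false); // key absent, default applies
        conf.put(COMPAT_KEY, "true");
        boolean on = getBoolean(conf, COMPAT_KEY, false);
        System.out.println(readInt(null, off) + " " + readInt(null, on));
        System.out.println(readBoolean(true, off) + " " + readBoolean(true, on));
    }
}
```

Keeping the flag lookup separate from the conversion rules means the flag is read once, as in prepareForRead, and then passed down as a plain boolean.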

Example 2 with TupleRecordMaterializer

Use of org.apache.parquet.pig.convert.TupleRecordMaterializer in project parquet-mr by apache, in the class TestThriftToPigCompatibility, method validateSameTupleAsEB.

/**
 * Steps:
 * <ul>
 * <li>Writes using the Thrift mapping</li>
 * <li>Reads using the Pig mapping</li>
 * <li>Uses Elephant Bird to convert from Thrift to Pig</li>
 * <li>Checks that both transformations give the same result</li>
 * </ul>
 *
 * @param o the object to convert
 * @throws TException if writing the Thrift object fails
 */
public static <T extends TBase<?, ?>> void validateSameTupleAsEB(T o) throws TException {
    final ThriftSchemaConverter thriftSchemaConverter = new ThriftSchemaConverter();
    @SuppressWarnings("unchecked")
    final Class<T> class1 = (Class<T>) o.getClass();
    final MessageType schema = thriftSchemaConverter.convert(class1);
    final StructType structType = ThriftSchemaConverter.toStructType(class1);
    final ThriftToPig<T> thriftToPig = new ThriftToPig<T>(class1);
    final Schema pigSchema = thriftToPig.toSchema();
    final TupleRecordMaterializer tupleRecordConverter = new TupleRecordMaterializer(schema, pigSchema, true);
    RecordConsumer recordConsumer = new ConverterConsumer(tupleRecordConverter.getRootConverter(), schema);
    final MessageColumnIO columnIO = new ColumnIOFactory().getColumnIO(schema);
    ParquetWriteProtocol p = new ParquetWriteProtocol(new RecordConsumerLoggingWrapper(recordConsumer), columnIO, structType);
    o.write(p);
    final Tuple t = tupleRecordConverter.getCurrentRecord();
    final Tuple expected = thriftToPig.getPigTuple(o);
    assertEquals(expected.toString(), t.toString());
    final MessageType filtered = new PigSchemaConverter().filter(schema, pigSchema);
    assertEquals(schema.toString(), filtered.toString());
}
Also used: StructType (org.apache.parquet.thrift.struct.ThriftType.StructType), RecordConsumerLoggingWrapper (org.apache.parquet.io.RecordConsumerLoggingWrapper), Schema (org.apache.pig.impl.logicalLayer.schema.Schema), PigSchemaConverter (org.apache.parquet.pig.PigSchemaConverter), ThriftToPig (com.twitter.elephantbird.pig.util.ThriftToPig), RecordConsumer (org.apache.parquet.io.api.RecordConsumer), ConverterConsumer (org.apache.parquet.io.ConverterConsumer), MessageColumnIO (org.apache.parquet.io.MessageColumnIO), ColumnIOFactory (org.apache.parquet.io.ColumnIOFactory), TupleRecordMaterializer (org.apache.parquet.pig.convert.TupleRecordMaterializer), MessageType (org.apache.parquet.schema.MessageType), Tuple (org.apache.pig.data.Tuple)
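The test above follows a general pattern: push one record's values through a materializer's root converter, fetch the assembled record, and compare its string form with the output of a reference conversion. The sketch below reproduces only that shape with simplified stand-ins. ListConverter, ListMaterializer, and both helper methods are hypothetical and are not Parquet, Pig, or Elephant Bird classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MaterializerSketch {
    // Simplified stand-in for a root converter: receives the events of one record.
    static class ListConverter {
        private final int fieldCount;
        private Object[] current;

        ListConverter(int fieldCount) {
            this.fieldCount = fieldCount;
        }

        void start() {                       // a new record begins
            current = new Object[fieldCount];
        }

        void add(int index, Object value) {  // one field value arrives
            current[index] = value;
        }

        void end() {                         // the record is complete
        }

        List<Object> snapshot() {
            return new ArrayList<>(Arrays.asList(current));
        }
    }

    // Simplified stand-in for a record materializer: exposes the converter
    // that is fed with events and the record assembled from them.
    static class ListMaterializer {
        private final ListConverter root;

        ListMaterializer(int fieldCount) {
            this.root = new ListConverter(fieldCount);
        }

        ListConverter getRootConverter() {
            return root;
        }

        List<Object> getCurrentRecord() {
            return root.snapshot();
        }
    }

    // Path A: drive the converter event by event, then fetch the record.
    static List<Object> viaMaterializer(Object... values) {
        ListMaterializer materializer = new ListMaterializer(values.length);
        ListConverter converter = materializer.getRootConverter();
        converter.start();
        for (int i = 0; i < values.length; i++) {
            converter.add(i, values[i]);
        }
        converter.end();
        return materializer.getCurrentRecord();
    }

    // Path B: direct conversion, playing the role of the reference implementation.
    static List<Object> direct(Object... values) {
        return new ArrayList<>(Arrays.asList(values));
    }

    public static void main(String[] args) {
        List<Object> actual = viaMaterializer("a", 1, null);
        List<Object> expected = direct("a", 1, null);
        // Compare string forms, as the test above does with tuples.
        if (!expected.toString().equals(actual.toString())) {
            throw new AssertionError("conversion paths disagree");
        }
        System.out.println(actual);
    }
}
```

Comparing string forms keeps the check independent of how each path implements equals, at the cost of missing differences that print identically.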
