
Example 1 with InitContext

Use of org.apache.parquet.hadoop.api.InitContext in project parquet-mr by apache.

From the class InternalParquetRecordReader, method initialize:

public void initialize(ParquetFileReader reader, Configuration configuration) throws IOException {
    // initialize a ReadContext for this file
    this.reader = reader;
    FileMetaData parquetFileMetadata = reader.getFooter().getFileMetaData();
    this.fileSchema = parquetFileMetadata.getSchema();
    Map<String, String> fileMetadata = parquetFileMetadata.getKeyValueMetaData();
    ReadSupport.ReadContext readContext = readSupport.init(new InitContext(configuration, toSetMultiMap(fileMetadata), fileSchema));
    this.columnIOFactory = new ColumnIOFactory(parquetFileMetadata.getCreatedBy());
    this.requestedSchema = readContext.getRequestedSchema();
    this.columnCount = requestedSchema.getPaths().size();
    this.recordConverter = readSupport.prepareForRead(configuration, fileMetadata, fileSchema, readContext);
    this.strictTypeChecking = configuration.getBoolean(STRICT_TYPE_CHECKING, true);
    this.total = reader.getRecordCount();
    this.unmaterializableRecordCounter = new UnmaterializableRecordCounter(configuration, total);
    this.filterRecords = configuration.getBoolean(RECORD_FILTERING_ENABLED, true);
    reader.setRequestedSchema(requestedSchema);
    LOG.info("RecordReader initialized will read a total of {} records.", total);
}
Also used: ReadSupport (org.apache.parquet.hadoop.api.ReadSupport), InitContext (org.apache.parquet.hadoop.api.InitContext), FileMetaData (org.apache.parquet.hadoop.metadata.FileMetaData), ColumnIOFactory (org.apache.parquet.io.ColumnIOFactory)
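
The footer's key/value metadata is a Map<String, String>, but the InitContext constructor takes a Map<String, Set<String>> so that metadata from several files can be merged. The toSetMultiMap helper called above is not part of the snippet; the sketch below shows what it plausibly does, assuming each value is simply wrapped in a singleton set (imports: java.util.Collections, HashMap, HashSet, Map, Set).

private static <K, V> Map<K, Set<V>> toSetMultiMap(Map<K, V> map) {
    // wrap each value in an unmodifiable one-element set, keyed by the original key
    Map<K, Set<V>> setMultiMap = new HashMap<>();
    for (Map.Entry<K, V> entry : map.entrySet()) {
        Set<V> set = new HashSet<>();
        set.add(entry.getValue());
        setMultiMap.put(entry.getKey(), Collections.unmodifiableSet(set));
    }
    return Collections.unmodifiableMap(setMultiMap);
}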

Example 2 with InitContext

Use of org.apache.parquet.hadoop.api.InitContext in project parquet-mr by apache.

From the class TestTupleRecordConsumer, method newPigRecordConsumer:

private RecordMaterializer<Tuple> newPigRecordConsumer(String pigSchemaString) throws ParserException {
    TupleReadSupport tupleReadSupport = new TupleReadSupport();
    final Configuration configuration = new Configuration(false);
    MessageType parquetSchema = getMessageType(pigSchemaString);
    final Map<String, String> pigMetaData = pigMetaData(pigSchemaString);
    Map<String, Set<String>> globalMetaData = new HashMap<String, Set<String>>();
    for (Entry<String, String> entry : pigMetaData.entrySet()) {
        globalMetaData.put(entry.getKey(), new HashSet<String>(Arrays.asList(entry.getValue())));
    }
    configuration.set(PARQUET_PIG_SCHEMA, pigSchemaString);
    final ReadContext init = tupleReadSupport.init(new InitContext(configuration, globalMetaData, parquetSchema));
    return tupleReadSupport.prepareForRead(configuration, pigMetaData, parquetSchema, init);
}
Also used: InitContext (org.apache.parquet.hadoop.api.InitContext), HashSet (java.util.HashSet), Set (java.util.Set), Configuration (org.apache.hadoop.conf.Configuration), HashMap (java.util.HashMap), ReadContext (org.apache.parquet.hadoop.api.ReadSupport.ReadContext), MessageType (org.apache.parquet.schema.MessageType)
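
For orientation, tupleReadSupport.init(...) receives the hand-built InitContext and returns the ReadContext that later drives prepareForRead. The hypothetical init override below only illustrates which InitContext accessors a ReadSupport implementation typically consults; it is not TupleReadSupport's actual logic, and requesting the whole file schema with no extra metadata is an assumption made for brevity.

@Override
public ReadContext init(InitContext context) {
    // the three pieces of information an InitContext carries
    Configuration conf = context.getConfiguration();                    // job or test configuration
    Map<String, Set<String>> keyValues = context.getKeyValueMetadata(); // merged footer key/value metadata
    MessageType fileSchema = context.getFileSchema();                   // schema read from the footer
    // simplest possible decision: request every column and attach no read-support metadata
    return new ReadContext(fileSchema, new HashMap<String, String>());
}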

Example 3 with InitContext

Use of org.apache.parquet.hadoop.api.InitContext in project parquet-mr by apache.

From the class InternalParquetRecordReader, method initialize (ParquetReadOptions overload):

public void initialize(ParquetFileReader reader, ParquetReadOptions options) {
    // copy custom configuration to the Configuration passed to the ReadSupport
    Configuration conf = new Configuration();
    if (options instanceof HadoopReadOptions) {
        conf = ((HadoopReadOptions) options).getConf();
    }
    for (String property : options.getPropertyNames()) {
        conf.set(property, options.getProperty(property));
    }
    // initialize a ReadContext for this file
    this.reader = reader;
    FileMetaData parquetFileMetadata = reader.getFooter().getFileMetaData();
    this.fileSchema = parquetFileMetadata.getSchema();
    Map<String, String> fileMetadata = parquetFileMetadata.getKeyValueMetaData();
    ReadSupport.ReadContext readContext = readSupport.init(new InitContext(conf, toSetMultiMap(fileMetadata), fileSchema));
    this.columnIOFactory = new ColumnIOFactory(parquetFileMetadata.getCreatedBy());
    this.requestedSchema = readContext.getRequestedSchema();
    this.columnCount = requestedSchema.getPaths().size();
    this.recordConverter = readSupport.prepareForRead(conf, fileMetadata, fileSchema, readContext);
    this.strictTypeChecking = options.isEnabled(STRICT_TYPE_CHECKING, true);
    this.total = reader.getRecordCount();
    this.unmaterializableRecordCounter = new UnmaterializableRecordCounter(options, total);
    this.filterRecords = options.useRecordFilter();
    reader.setRequestedSchema(requestedSchema);
    LOG.info("RecordReader initialized will read a total of {} records.", total);
}
Also used: ReadSupport (org.apache.parquet.hadoop.api.ReadSupport), InitContext (org.apache.parquet.hadoop.api.InitContext), HadoopReadOptions (org.apache.parquet.HadoopReadOptions), Configuration (org.apache.hadoop.conf.Configuration), FileMetaData (org.apache.parquet.hadoop.metadata.FileMetaData), ColumnIOFactory (org.apache.parquet.io.ColumnIOFactory)
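
A brief sketch of how the options argument for this overload might be constructed from a Hadoop Configuration. HadoopReadOptions.builder(conf), the builder calls shown, and HadoopInputFile.fromPath are taken from current parquet-mr and may differ between versions; the file path and property key are purely illustrative.

Configuration conf = new Configuration();
Path file = new Path("/tmp/data.parquet");                        // illustrative path
ParquetReadOptions options = HadoopReadOptions.builder(conf)
    .useRecordFilter(true)                                        // read back above via options.useRecordFilter()
    .set("parquet.example.custom.key", "example-value")           // surfaces above via options.getPropertyNames()
    .build();
ParquetFileReader fileReader = ParquetFileReader.open(HadoopInputFile.fromPath(file, conf), options);
// fileReader and options are then handed to initialize(reader, options) above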

Example 4 with InitContext

Use of org.apache.parquet.hadoop.api.InitContext in project parquet-mr by apache.

From the class ClientSideMetadataSplitStrategy, method getSplits:

/**
 * @param configuration the configuration to connect to the file system
 * @param footers the footers of the files to read
 * @return the splits for the footers
 * @throws IOException
 * @deprecated split planning using file footers will be removed
 */
@Deprecated
public List<ParquetInputSplit> getSplits(Configuration configuration, List<Footer> footers) throws IOException {
    boolean strictTypeChecking = configuration.getBoolean(STRICT_TYPE_CHECKING, true);
    final long maxSplitSize = configuration.getLong("mapred.max.split.size", Long.MAX_VALUE);
    final long minSplitSize = Math.max(getFormatMinSplitSize(), configuration.getLong("mapred.min.split.size", 0L));
    if (maxSplitSize < 0 || minSplitSize < 0) {
        throw new ParquetDecodingException("maxSplitSize or minSplitSize should not be negative: maxSplitSize = " + maxSplitSize + "; minSplitSize = " + minSplitSize);
    }
    GlobalMetaData globalMetaData = ParquetFileWriter.getGlobalMetaData(footers, strictTypeChecking);
    ReadContext readContext = getReadSupport(configuration).init(new InitContext(configuration, globalMetaData.getKeyValueMetaData(), globalMetaData.getSchema()));
    return new ClientSideMetadataSplitStrategy().getSplits(configuration, footers, maxSplitSize, minSplitSize, readContext);
}
Also used: InitContext (org.apache.parquet.hadoop.api.InitContext), ParquetDecodingException (org.apache.parquet.io.ParquetDecodingException), ReadContext (org.apache.parquet.hadoop.api.ReadSupport.ReadContext), GlobalMetaData (org.apache.parquet.hadoop.metadata.GlobalMetaData)
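
The split-size bounds validated above come straight from the job Configuration. The snippet below only illustrates setting them before calling the deprecated getSplits; readFootersSomehow is a hypothetical placeholder for however the footers are obtained, and "parquet.strict.typing" is an assumed value for the STRICT_TYPE_CHECKING key.

Configuration configuration = new Configuration();
configuration.setLong("mapred.min.split.size", 128L << 20);       // 128 MB lower bound
configuration.setLong("mapred.max.split.size", 1L << 30);         // 1 GB upper bound
configuration.setBoolean("parquet.strict.typing", true);          // assumed key behind STRICT_TYPE_CHECKING
List<Footer> footers = readFootersSomehow(configuration);         // hypothetical helper; footers normally come from reading the input files
List<ParquetInputSplit> splits = getSplits(configuration, footers);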

Example 5 with InitContext

Use of org.apache.parquet.hadoop.api.InitContext in project hive by apache.

From the class ParquetRecordReaderBase, method getSplit:

/**
 * gets a ParquetInputSplit corresponding to a split given by Hive
 *
 * @param oldSplit The split given by Hive
 * @param conf The JobConf of the Hive job
 * @return a ParquetInputSplit corresponding to the oldSplit
 * @throws IOException if the config cannot be enhanced or if the footer cannot be read from the file
 */
@SuppressWarnings("deprecation")
protected ParquetInputSplit getSplit(final org.apache.hadoop.mapred.InputSplit oldSplit, final JobConf conf) throws IOException {
    if (oldSplit.getLength() == 0) {
        return null;
    }
    ParquetInputSplit split;
    if (oldSplit instanceof FileSplit) {
        final Path finalPath = ((FileSplit) oldSplit).getPath();
        jobConf = projectionPusher.pushProjectionsAndFilters(conf, finalPath.getParent());
        // TODO enable MetadataFilter by using readFooter(Configuration configuration, Path file,
        // MetadataFilter filter) API
        final ParquetMetadata parquetMetadata = ParquetFileReader.readFooter(jobConf, finalPath);
        final List<BlockMetaData> blocks = parquetMetadata.getBlocks();
        final FileMetaData fileMetaData = parquetMetadata.getFileMetaData();
        final ReadSupport.ReadContext readContext = new DataWritableReadSupport().init(new InitContext(jobConf, null, fileMetaData.getSchema()));
        // Compute stats
        for (BlockMetaData bmd : blocks) {
            serDeStats.setRowCount(serDeStats.getRowCount() + bmd.getRowCount());
            serDeStats.setRawDataSize(serDeStats.getRawDataSize() + bmd.getTotalByteSize());
        }
        schemaSize = MessageTypeParser.parseMessageType(readContext.getReadSupportMetadata().get(DataWritableReadSupport.HIVE_TABLE_AS_PARQUET_SCHEMA)).getFieldCount();
        final List<BlockMetaData> splitGroup = new ArrayList<BlockMetaData>();
        final long splitStart = ((FileSplit) oldSplit).getStart();
        final long splitLength = ((FileSplit) oldSplit).getLength();
        for (final BlockMetaData block : blocks) {
            final long firstDataPage = block.getColumns().get(0).getFirstDataPageOffset();
            if (firstDataPage >= splitStart && firstDataPage < splitStart + splitLength) {
                splitGroup.add(block);
            }
        }
        if (splitGroup.isEmpty()) {
            LOG.warn("Skipping split, could not find row group in: " + oldSplit);
            return null;
        }
        FilterCompat.Filter filter = setFilter(jobConf, fileMetaData.getSchema());
        if (filter != null) {
            filtedBlocks = RowGroupFilter.filterRowGroups(filter, splitGroup, fileMetaData.getSchema());
            if (filtedBlocks.isEmpty()) {
                LOG.debug("All row groups are dropped due to filter predicates");
                return null;
            }
            long droppedBlocks = splitGroup.size() - filtedBlocks.size();
            if (droppedBlocks > 0) {
                LOG.debug("Dropping " + droppedBlocks + " row groups that do not pass filter predicate");
            }
        } else {
            filtedBlocks = splitGroup;
        }
        if (HiveConf.getBoolVar(conf, HiveConf.ConfVars.HIVE_PARQUET_TIMESTAMP_SKIP_CONVERSION)) {
            skipTimestampConversion = !Strings.nullToEmpty(fileMetaData.getCreatedBy()).startsWith("parquet-mr");
        }
        skipProlepticConversion = DataWritableReadSupport.getWriterDateProleptic(fileMetaData.getKeyValueMetaData());
        if (skipProlepticConversion == null) {
            skipProlepticConversion = HiveConf.getBoolVar(conf, HiveConf.ConfVars.HIVE_PARQUET_DATE_PROLEPTIC_GREGORIAN_DEFAULT);
        }
        legacyConversionEnabled = HiveConf.getBoolVar(conf, ConfVars.HIVE_PARQUET_TIMESTAMP_LEGACY_CONVERSION_ENABLED);
        if (fileMetaData.getKeyValueMetaData().containsKey(DataWritableWriteSupport.WRITER_ZONE_CONVERSION_LEGACY)) {
            legacyConversionEnabled = Boolean.parseBoolean(fileMetaData.getKeyValueMetaData().get(DataWritableWriteSupport.WRITER_ZONE_CONVERSION_LEGACY));
        }
        split = new ParquetInputSplit(finalPath, splitStart, splitLength, oldSplit.getLocations(), filtedBlocks, readContext.getRequestedSchema().toString(), fileMetaData.getSchema().toString(), fileMetaData.getKeyValueMetaData(), readContext.getReadSupportMetadata());
        return split;
    } else {
        throw new IllegalArgumentException("Unknown split type: " + oldSplit);
    }
}
Also used: Path (org.apache.hadoop.fs.Path), BlockMetaData (org.apache.parquet.hadoop.metadata.BlockMetaData), DataWritableReadSupport (org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport), ParquetMetadata (org.apache.parquet.hadoop.metadata.ParquetMetadata), FilterCompat (org.apache.parquet.filter2.compat.FilterCompat), ArrayList (java.util.ArrayList), FileSplit (org.apache.hadoop.mapred.FileSplit), ReadSupport (org.apache.parquet.hadoop.api.ReadSupport), InitContext (org.apache.parquet.hadoop.api.InitContext), ParquetInputSplit (org.apache.parquet.hadoop.ParquetInputSplit), FileMetaData (org.apache.parquet.hadoop.metadata.FileMetaData)
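
For illustration, a hypothetical caller of getSplit that wraps an entire Parquet file in an org.apache.hadoop.mapred.FileSplit and lets the method above narrow it to the row groups whose first data page starts inside the byte range; the path is made up, and real Hive callers pass in the splits produced by the input format.

JobConf conf = new JobConf();
Path file = new Path("/warehouse/db.db/tbl/part-00000.parquet");      // illustrative path
long length = file.getFileSystem(conf).getFileStatus(file).getLen();  // whole file length
FileSplit oldSplit = new FileSplit(file, 0L, length, new String[0]);  // full-file split, no preferred hosts
ParquetInputSplit parquetSplit = getSplit(oldSplit, conf);            // returns null when no row group starts in the range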

Aggregations

InitContext (org.apache.parquet.hadoop.api.InitContext): 5
ReadSupport (org.apache.parquet.hadoop.api.ReadSupport): 3
FileMetaData (org.apache.parquet.hadoop.metadata.FileMetaData): 3
Configuration (org.apache.hadoop.conf.Configuration): 2
ReadContext (org.apache.parquet.hadoop.api.ReadSupport.ReadContext): 2
ColumnIOFactory (org.apache.parquet.io.ColumnIOFactory): 2
ArrayList (java.util.ArrayList): 1
HashMap (java.util.HashMap): 1
HashSet (java.util.HashSet): 1
Set (java.util.Set): 1
Path (org.apache.hadoop.fs.Path): 1
DataWritableReadSupport (org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport): 1
FileSplit (org.apache.hadoop.mapred.FileSplit): 1
HadoopReadOptions (org.apache.parquet.HadoopReadOptions): 1
FilterCompat (org.apache.parquet.filter2.compat.FilterCompat): 1
ParquetInputSplit (org.apache.parquet.hadoop.ParquetInputSplit): 1
BlockMetaData (org.apache.parquet.hadoop.metadata.BlockMetaData): 1
GlobalMetaData (org.apache.parquet.hadoop.metadata.GlobalMetaData): 1
ParquetMetadata (org.apache.parquet.hadoop.metadata.ParquetMetadata): 1
ParquetDecodingException (org.apache.parquet.io.ParquetDecodingException): 1