Example 1 with VecParquetReader

Use of org.apache.parquet.hadoop.VecParquetReader in the h2o-3 project by h2oai.

From the class ParquetParser, method parseChunk:

@Override
protected final ParseWriter parseChunk(int cidx, ParseReader din, ParseWriter dout) {
    if (!(din instanceof FVecParseReader)) {
        // TODO: Should we modify the interface to expose the underlying chunk for non-streaming parsers?
        throw new IllegalStateException("We only accept parser readers backed by a Vec (no streaming support!).");
    }
    Chunk chunk = ((FVecParseReader) din).getChunk();
    Vec vec = chunk.vec();
    // extract metadata, we want to read only the row groups that have centers in this chunk
    ParquetMetadataConverter.MetadataFilter chunkFilter = ParquetMetadataConverter.range(chunk.start(), chunk.start() + chunk.len());
    ParquetMetadata metadata = VecParquetReader.readFooter(_metadata, chunkFilter);
    if (metadata.getBlocks().isEmpty()) {
        Log.trace("Chunk #", cidx, " doesn't contain any Parquet block center.");
        return dout;
    }
    Log.info("Processing ", metadata.getBlocks().size(), " blocks of chunk #", cidx);
    VecParquetReader reader = new VecParquetReader(vec, metadata, dout, _setup.getColumnTypes());
    try {
        Integer recordNumber;
        do {
            recordNumber = reader.read();
        } while (recordNumber != null);
    } catch (IOException e) {
        throw new RuntimeException("Failed to parse records", e);
    }
    return dout;
}
Also used: ParquetMetadataConverter (org.apache.parquet.format.converter.ParquetMetadataConverter), ByteVec (water.fvec.ByteVec), Vec (water.fvec.Vec), ParquetMetadata (org.apache.parquet.hadoop.metadata.ParquetMetadata), VecParquetReader (org.apache.parquet.hadoop.VecParquetReader), IOException (java.io.IOException), Chunk (water.fvec.Chunk)
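The metadata filter in parseChunk selects only the row groups whose on-disk midpoint falls inside the chunk's byte range, so every row group is parsed by exactly one chunk even when its bytes straddle a chunk boundary. A minimal sketch of that midpoint assignment rule, assuming the described behavior of ParquetMetadataConverter.range (the BlockSpan record and selectBlocks helper are hypothetical, for illustration only):

```java
import java.util.ArrayList;
import java.util.List;

public class MidpointFilter {
    // Hypothetical stand-in for one Parquet row group's on-disk extent.
    record BlockSpan(long startingPos, long compressedSize) {
        long midpoint() { return startingPos + compressedSize / 2; }
    }

    // A row group belongs to the byte range [chunkStart, chunkStart + chunkLen)
    // iff its midpoint lies in that range -- the rule the code above relies on
    // via ParquetMetadataConverter.range(chunk.start(), chunk.start() + chunk.len()).
    static List<BlockSpan> selectBlocks(List<BlockSpan> blocks, long chunkStart, long chunkLen) {
        List<BlockSpan> selected = new ArrayList<>();
        for (BlockSpan b : blocks) {
            long mid = b.midpoint();
            if (mid >= chunkStart && mid < chunkStart + chunkLen) {
                selected.add(b);
            }
        }
        return selected;
    }
}
```

Because each midpoint lands in exactly one chunk's range, no row group is parsed twice and none is skipped, which is why the early return for an empty block list is safe.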

Example 2 with VecParquetReader

Use of org.apache.parquet.hadoop.VecParquetReader in the h2o-3 project by h2oai.

From the class ParquetParser, method readFirstRecords:

private static ParquetPreviewParseWriter readFirstRecords(ParquetMetadata metadata, ByteVec vec, int cnt) {
    ParquetMetadata startMetadata = new ParquetMetadata(metadata.getFileMetaData(), Collections.singletonList(findFirstBlock(metadata)));
    ParquetPreviewParseWriter ppWriter = new ParquetPreviewParseWriter(metadata.getFileMetaData().getSchema());
    VecParquetReader reader = new VecParquetReader(vec, startMetadata, ppWriter, ppWriter._roughTypes);
    try {
        int recordCnt = 0;
        Integer recordNum;
        do {
            recordNum = reader.read();
        } while ((recordNum != null) && (++recordCnt < cnt));
        return ppWriter;
    } catch (IOException e) {
        throw new RuntimeException("Failed to read the first few records", e);
    }
}
Also used: ParquetMetadata (org.apache.parquet.hadoop.metadata.ParquetMetadata), VecParquetReader (org.apache.parquet.hadoop.VecParquetReader), IOException (java.io.IOException)
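The do/while loop in readFirstRecords stops at whichever comes first: the reader returning null (end of input) or cnt parsed records. The counting is easy to get wrong because the increment happens inside the loop condition, so here is a self-contained sketch of just that loop (the PreviewReader class is a hypothetical stand-in for VecParquetReader, whose read() returns the number of the record just parsed, or null when exhausted):

```java
import java.util.Iterator;
import java.util.List;

public class PreviewReader {
    // Hypothetical mock: yields one record number per read(), then null.
    private final Iterator<Integer> records;

    public PreviewReader(List<Integer> recordNumbers) {
        this.records = recordNumbers.iterator();
    }

    Integer read() {
        return records.hasNext() ? records.next() : null;
    }

    // Mirrors the loop in readFirstRecords: consume at most cnt records,
    // stopping early on end-of-input; returns how many records were read.
    static int readFirst(PreviewReader reader, int cnt) {
        int recordCnt = 0;
        Integer recordNum;
        do {
            recordNum = reader.read();
        } while (recordNum != null && ++recordCnt < cnt);
        return recordCnt;
    }
}
```

Note that recordCnt is only incremented when read() returned a record, so an input shorter than cnt yields its true length, and an empty input yields zero without a spurious first count.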

Aggregations

IOException (java.io.IOException): 2
VecParquetReader (org.apache.parquet.hadoop.VecParquetReader): 2
ParquetMetadata (org.apache.parquet.hadoop.metadata.ParquetMetadata): 2
ParquetMetadataConverter (org.apache.parquet.format.converter.ParquetMetadataConverter): 1
ByteVec (water.fvec.ByteVec): 1
Chunk (water.fvec.Chunk): 1
Vec (water.fvec.Vec): 1