
Example 1 with IcebergSplit

Use of org.apache.iceberg.mr.mapreduce.IcebergSplit in the Apache Hive project.

From the class HiveIcebergInputFormat, method getSplits:

@Override
public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
    // Convert Hive filter to Iceberg filter
    String hiveFilter = job.get(TableScanDesc.FILTER_EXPR_CONF_STR);
    if (hiveFilter != null) {
        ExprNodeGenericFuncDesc exprNodeDesc = SerializationUtilities.deserializeObject(hiveFilter, ExprNodeGenericFuncDesc.class);
        SearchArgument sarg = ConvertAstToSearchArg.create(job, exprNodeDesc);
        try {
            Expression filter = HiveIcebergFilterFactory.generateFilterExpression(sarg);
            job.set(InputFormatConfig.FILTER_EXPRESSION, SerializationUtil.serializeToBase64(filter));
        } catch (UnsupportedOperationException e) {
            LOG.warn("Unable to create Iceberg filter, continuing without filter (will be applied by Hive later): ", e);
        }
    }
    job.set(InputFormatConfig.SELECTED_COLUMNS, job.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, ""));
    job.set(InputFormatConfig.AS_OF_TIMESTAMP, job.get(TableScanDesc.AS_OF_TIMESTAMP, "-1"));
    job.set(InputFormatConfig.SNAPSHOT_ID, job.get(TableScanDesc.AS_OF_VERSION, "-1"));
    String location = job.get(InputFormatConfig.TABLE_LOCATION);
    return Arrays.stream(super.getSplits(job, numSplits)).map(split -> new HiveIcebergSplit((IcebergSplit) split, location)).toArray(InputSplit[]::new);
}
Also used:
CombineHiveInputFormat (org.apache.hadoop.hive.ql.io.CombineHiveInputFormat)
ExprNodeGenericFuncDesc (org.apache.hadoop.hive.ql.plan.ExprNodeGenericFuncDesc)
Arrays (java.util.Arrays)
ConvertAstToSearchArg (org.apache.hadoop.hive.ql.io.sarg.ConvertAstToSearchArg)
ColumnProjectionUtils (org.apache.hadoop.hive.serde2.ColumnProjectionUtils)
AbstractMapredIcebergRecordReader (org.apache.iceberg.mr.mapred.AbstractMapredIcebergRecordReader)
IcebergSplit (org.apache.iceberg.mr.mapreduce.IcebergSplit)
LoggerFactory (org.slf4j.LoggerFactory)
SerializationUtilities (org.apache.hadoop.hive.ql.exec.SerializationUtilities)
TableScanDesc (org.apache.hadoop.hive.ql.plan.TableScanDesc)
SearchArgument (org.apache.hadoop.hive.ql.io.sarg.SearchArgument)
DynConstructors (org.apache.iceberg.common.DynConstructors)
Utilities (org.apache.hadoop.hive.ql.exec.Utilities)
VectorizedSupport (org.apache.hadoop.hive.ql.exec.vector.VectorizedSupport)
Expression (org.apache.iceberg.expressions.Expression)
Configuration (org.apache.hadoop.conf.Configuration)
Path (org.apache.hadoop.fs.Path)
FileMetadataCache (org.apache.hadoop.hive.common.io.FileMetadataCache)
Container (org.apache.iceberg.mr.mapred.Container)
Logger (org.slf4j.Logger)
IcebergSplitContainer (org.apache.iceberg.mr.mapreduce.IcebergSplitContainer)
Reporter (org.apache.hadoop.mapred.Reporter)
HiveConf (org.apache.hadoop.hive.conf.HiveConf)
InputFormatConfig (org.apache.iceberg.mr.InputFormatConfig)
IOException (java.io.IOException)
SerializationUtil (org.apache.iceberg.util.SerializationUtil)
VectorizedInputFormatInterface (org.apache.hadoop.hive.ql.exec.vector.VectorizedInputFormatInterface)
MapredIcebergInputFormat (org.apache.iceberg.mr.mapred.MapredIcebergInputFormat)
DataCache (org.apache.hadoop.hive.common.io.DataCache)
JobConf (org.apache.hadoop.mapred.JobConf)
Record (org.apache.iceberg.data.Record)
MetastoreUtil (org.apache.iceberg.hive.MetastoreUtil)
LlapCacheOnlyInputFormatInterface (org.apache.hadoop.hive.ql.io.LlapCacheOnlyInputFormatInterface)
InputSplit (org.apache.hadoop.mapred.InputSplit)
IcebergInputFormat (org.apache.iceberg.mr.mapreduce.IcebergInputFormat)
Preconditions (org.apache.iceberg.relocated.com.google.common.base.Preconditions)
RecordReader (org.apache.hadoop.mapred.RecordReader)
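Example 1 pushes the Iceberg filter down to the scan by serializing the Expression into the job configuration as a Base64 string. A minimal, stdlib-only sketch of that serialize/deserialize round trip, with hypothetical toBase64/fromBase64 helpers standing in for Iceberg's SerializationUtil and a plain String standing in for the Expression:

```java
import java.io.*;
import java.util.Base64;

public class ConfSerde {
    // Serialize any Serializable object to a Base64 string, the same
    // shape of value that getSplits stores under FILTER_EXPRESSION.
    static String toBase64(Serializable obj) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        }
        return Base64.getEncoder().encodeToString(bytes.toByteArray());
    }

    // Inverse: decode the Base64 string back into the original object,
    // as the reading side would do to recover the filter.
    static Object fromBase64(String encoded) throws IOException, ClassNotFoundException {
        byte[] raw = Base64.getDecoder().decode(encoded);
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(raw))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        String filter = "id > 100"; // stand-in for a filter Expression
        String encoded = toBase64(filter); // value set into the JobConf
        System.out.println(fromBase64(encoded)); // prints "id > 100"
    }
}
```

The point of the Base64 step is that a JobConf can only carry strings, so any structured object pushed through it has to survive a text round trip.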

Example 2 with IcebergSplit

Use of org.apache.iceberg.mr.mapreduce.IcebergSplit in the Apache Hive project.

From the class HiveIcebergInputFormat, method getRecordReader:

@Override
public RecordReader<Void, Container<Record>> getRecordReader(InputSplit split, JobConf job, Reporter reporter) throws IOException {
    job.set(InputFormatConfig.SELECTED_COLUMNS, job.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, ""));
    if (HiveConf.getBoolVar(job, HiveConf.ConfVars.HIVE_VECTORIZATION_ENABLED) && Utilities.getIsVectorized(job)) {
        Preconditions.checkArgument(MetastoreUtil.hive3PresentOnClasspath(), "Vectorization only supported for Hive 3+");
        job.setEnum(InputFormatConfig.IN_MEMORY_DATA_MODEL, InputFormatConfig.InMemoryDataModel.HIVE);
        job.setBoolean(InputFormatConfig.SKIP_RESIDUAL_FILTERING, true);
        IcebergSplit icebergSplit = ((IcebergSplitContainer) split).icebergSplit();
        // bogus cast for favouring code reuse over syntax
        return (RecordReader) HIVE_VECTORIZED_RECORDREADER_CTOR.newInstance(new IcebergInputFormat<>(), icebergSplit, job, reporter);
    } else {
        return super.getRecordReader(split, job, reporter);
    }
}
Also used:
IcebergSplitContainer (org.apache.iceberg.mr.mapreduce.IcebergSplitContainer)
MapredIcebergInputFormat (org.apache.iceberg.mr.mapred.MapredIcebergInputFormat)
IcebergInputFormat (org.apache.iceberg.mr.mapreduce.IcebergInputFormat)
AbstractMapredIcebergRecordReader (org.apache.iceberg.mr.mapred.AbstractMapredIcebergRecordReader)
RecordReader (org.apache.hadoop.mapred.RecordReader)
IcebergSplit (org.apache.iceberg.mr.mapreduce.IcebergSplit)
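The HIVE_VECTORIZED_RECORDREADER_CTOR in Example 2 is a DynConstructors handle: the vectorized reader class is looked up and constructed reflectively so it is only linked when Hive 3 is actually on the classpath. A stdlib-only sketch of that reflective-construction pattern (the newInstance helper is hypothetical, and a JDK class stands in for the reader):

```java
import java.lang.reflect.Constructor;

public class DynCtor {
    // Load a class by name and invoke a constructor with a matching
    // arity reflectively; trying each candidate lets us skip ctors
    // whose parameter types don't accept the supplied arguments.
    static Object newInstance(String className, Object... args) throws Exception {
        Class<?> clazz = Class.forName(className);
        for (Constructor<?> ctor : clazz.getConstructors()) {
            if (ctor.getParameterCount() == args.length) {
                try {
                    return ctor.newInstance(args);
                } catch (IllegalArgumentException e) {
                    // argument types didn't match this overload; try the next one
                }
            }
        }
        throw new NoSuchMethodException(className);
    }

    public static void main(String[] args) throws Exception {
        // Demonstrate with a JDK class: resolves StringBuilder(CharSequence)
        Object sb = newInstance("java.lang.StringBuilder", "vectorized");
        System.out.println(sb); // prints "vectorized"
    }
}
```

Deferring the Class.forName lookup to the first use is what lets getRecordReader guard it behind the hive3PresentOnClasspath() precondition instead of failing at class-load time.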

Example 3 with IcebergSplit

Use of org.apache.iceberg.mr.mapreduce.IcebergSplit in the Apache Hive project.

From the class HiveIcebergSplit, method readFields:

@Override
public void readFields(DataInput in) throws IOException {
    byte[] bytes = new byte[in.readInt()];
    in.readFully(bytes);
    tableLocation = SerializationUtil.deserializeFromBytes(bytes);
    innerSplit = new IcebergSplit();
    innerSplit.readFields(in);
}
Also used:
IcebergSplit (org.apache.iceberg.mr.mapreduce.IcebergSplit)
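readFields in Example 3 expects a length-prefixed byte array (an int length, then exactly that many bytes) followed by the inner split's own serialized fields, so the matching write side must emit the same framing. A stdlib-only sketch of that length-prefix pattern, with hypothetical writeBytes/readBytes helpers and a plain string standing in for the serialized table location:

```java
import java.io.*;

public class SplitSerde {
    // Write side: length prefix, then the raw bytes. This is the
    // counterpart of the readInt()/readFully() pair in readFields.
    static void writeBytes(DataOutput out, byte[] bytes) throws IOException {
        out.writeInt(bytes.length);
        out.write(bytes);
    }

    // Read side: mirrors HiveIcebergSplit.readFields exactly.
    static byte[] readBytes(DataInput in) throws IOException {
        byte[] bytes = new byte[in.readInt()];
        in.readFully(bytes);
        return bytes;
    }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        writeBytes(new DataOutputStream(buf), "s3://bucket/table".getBytes("UTF-8"));
        DataInput in = new DataInputStream(new ByteArrayInputStream(buf.toByteArray()));
        System.out.println(new String(readBytes(in), "UTF-8")); // prints "s3://bucket/table"
    }
}
```

The explicit length prefix matters because the inner IcebergSplit's fields follow immediately in the same stream; without it, readFields could not know where the table-location bytes end.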

Aggregations

IcebergSplit (org.apache.iceberg.mr.mapreduce.IcebergSplit): 3
RecordReader (org.apache.hadoop.mapred.RecordReader): 2
AbstractMapredIcebergRecordReader (org.apache.iceberg.mr.mapred.AbstractMapredIcebergRecordReader): 2
MapredIcebergInputFormat (org.apache.iceberg.mr.mapred.MapredIcebergInputFormat): 2
IcebergInputFormat (org.apache.iceberg.mr.mapreduce.IcebergInputFormat): 2
IcebergSplitContainer (org.apache.iceberg.mr.mapreduce.IcebergSplitContainer): 2
IOException (java.io.IOException): 1
Arrays (java.util.Arrays): 1
Configuration (org.apache.hadoop.conf.Configuration): 1
Path (org.apache.hadoop.fs.Path): 1
DataCache (org.apache.hadoop.hive.common.io.DataCache): 1
FileMetadataCache (org.apache.hadoop.hive.common.io.FileMetadataCache): 1
HiveConf (org.apache.hadoop.hive.conf.HiveConf): 1
SerializationUtilities (org.apache.hadoop.hive.ql.exec.SerializationUtilities): 1
Utilities (org.apache.hadoop.hive.ql.exec.Utilities): 1
VectorizedInputFormatInterface (org.apache.hadoop.hive.ql.exec.vector.VectorizedInputFormatInterface): 1
VectorizedSupport (org.apache.hadoop.hive.ql.exec.vector.VectorizedSupport): 1
CombineHiveInputFormat (org.apache.hadoop.hive.ql.io.CombineHiveInputFormat): 1
LlapCacheOnlyInputFormatInterface (org.apache.hadoop.hive.ql.io.LlapCacheOnlyInputFormatInterface): 1
ConvertAstToSearchArg (org.apache.hadoop.hive.ql.io.sarg.ConvertAstToSearchArg): 1