Example 1 with SerializableTable

Use of org.apache.iceberg.SerializableTable in project hive by apache.

From class IcebergInputFormat, method getSplits.

@Override
public List<InputSplit> getSplits(JobContext context) {
    Configuration conf = context.getConfiguration();
    Table table = Optional.ofNullable(HiveIcebergStorageHandler.table(conf, conf.get(InputFormatConfig.TABLE_IDENTIFIER))).orElseGet(() -> Catalogs.loadTable(conf));
    TableScan scan = createTableScan(table, conf);
    List<InputSplit> splits = Lists.newArrayList();
    boolean applyResidual = !conf.getBoolean(InputFormatConfig.SKIP_RESIDUAL_FILTERING, false);
    InputFormatConfig.InMemoryDataModel model = conf.getEnum(InputFormatConfig.IN_MEMORY_DATA_MODEL, InputFormatConfig.InMemoryDataModel.GENERIC);
    try (CloseableIterable<CombinedScanTask> tasksIterable = scan.planTasks()) {
        Table serializableTable = SerializableTable.copyOf(table);
        tasksIterable.forEach(task -> {
            if (applyResidual && (model == InputFormatConfig.InMemoryDataModel.HIVE || model == InputFormatConfig.InMemoryDataModel.PIG)) {
                // TODO: We do not support residual evaluation for HIVE and PIG in memory data model yet
                checkResiduals(task);
            }
            splits.add(new IcebergSplit(serializableTable, conf, task));
        });
    } catch (IOException e) {
        throw new UncheckedIOException(String.format("Failed to close table scan: %s", scan), e);
    }
    // If enabled, do not serialize the FileIO's Hadoop config into the table to decrease split size. Metadata
    // table scans are excluded, because some metadata tasks cache the IO object and we
    // wouldn't be able to inject the config into these tasks on the deserializer-side, unlike for standard queries
    if (scan instanceof DataTableScan) {
        HiveIcebergStorageHandler.checkAndSkipIoConfigSerialization(conf, table);
    }
    return splits;
}
Also used : TableScan(org.apache.iceberg.TableScan) DataTableScan(org.apache.iceberg.DataTableScan) Table(org.apache.iceberg.Table) SerializableTable(org.apache.iceberg.SerializableTable) CombinedScanTask(org.apache.iceberg.CombinedScanTask) Configuration(org.apache.hadoop.conf.Configuration) UncheckedIOException(java.io.UncheckedIOException) IOException(java.io.IOException) InputFormatConfig(org.apache.iceberg.mr.InputFormatConfig) InputSplit(org.apache.hadoop.mapreduce.InputSplit)
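
For context, here is a minimal sketch (not part of the Hive sources above; the class and method names are illustrative only) of the round trip this pattern relies on: SerializableTable.copyOf produces a Table whose state survives Java serialization, and Iceberg's SerializationUtil can carry it across processes, which is how each IcebergSplit ships the table to the task side.

import org.apache.iceberg.SerializableTable;
import org.apache.iceberg.Table;
import org.apache.iceberg.util.SerializationUtil;

// Illustrative helper, not part of Hive: shows how a SerializableTable copy can be
// serialized for the splits and restored on the task side.
public class SerializableTableRoundTrip {

    static String toBase64(Table table) {
        // copyOf() snapshots the table metadata (schema, spec, location, FileIO, ...) into a serializable Table
        Table serializableTable = SerializableTable.copyOf(table);
        return SerializationUtil.serializeToBase64(serializableTable);
    }

    static Table fromBase64(String encoded) {
        // The deserialized object is again a Table, usable for planning/reading without a live catalog
        return SerializationUtil.deserializeFromBase64(encoded);
    }
}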

Example 2 with SerializableTable

Use of org.apache.iceberg.SerializableTable in project hive by apache.

From class HiveIcebergStorageHandler, method overlayTableProperties.

/**
 * Stores the serializable table data in the configuration.
 * Currently the following is handled:
 * <ul>
 *   <li>Table - in case the table is serializable</li>
 *   <li>- Location</li>
 *   <li>- Schema</li>
 *   <li>- Partition specification</li>
 *   <li>- FileIO for handling table files</li>
 *   <li>- Location provider used for file generation</li>
 *   <li>- Encryption manager for encryption handling</li>
 * </ul>
 * @param configuration The configuration storing the catalog information
 * @param tableDesc The table descriptor which we want to store in the configuration
 * @param map The map of configuration properties to which we append the serialized data
 */
@VisibleForTesting
static void overlayTableProperties(Configuration configuration, TableDesc tableDesc, Map<String, String> map) {
    Properties props = tableDesc.getProperties();
    Table table = IcebergTableUtil.getTable(configuration, props);
    String schemaJson = SchemaParser.toJson(table.schema());
    Maps.fromProperties(props).entrySet().stream()
        // map overrides tableDesc properties
        .filter(entry -> !map.containsKey(entry.getKey()))
        .forEach(entry -> map.put(entry.getKey(), entry.getValue()));
    map.put(InputFormatConfig.TABLE_IDENTIFIER, props.getProperty(Catalogs.NAME));
    map.put(InputFormatConfig.TABLE_LOCATION, table.location());
    map.put(InputFormatConfig.TABLE_SCHEMA, schemaJson);
    props.put(InputFormatConfig.PARTITION_SPEC, PartitionSpecParser.toJson(table.spec()));
    // serialize table object into config
    Table serializableTable = SerializableTable.copyOf(table);
    checkAndSkipIoConfigSerialization(configuration, serializableTable);
    map.put(InputFormatConfig.SERIALIZED_TABLE_PREFIX + tableDesc.getTableName(), SerializationUtil.serializeToBase64(serializableTable));
    // We need to remove this otherwise the job.xml will be invalid as column comments are separated with '\0' and
    // the serialization utils fail to serialize this character
    map.remove("columns.comments");
    // save schema into table props as well to avoid repeatedly hitting the HMS during serde initializations
    // this is an exception to the interface documentation, but it's a safe operation to add this property
    props.put(InputFormatConfig.TABLE_SCHEMA, schemaJson);
}
Also used : ExprNodeGenericFuncDesc(org.apache.hadoop.hive.ql.plan.ExprNodeGenericFuncDesc) TableDesc(org.apache.hadoop.hive.ql.plan.TableDesc) HadoopConfigurable(org.apache.iceberg.hadoop.HadoopConfigurable) ListIterator(java.util.ListIterator) URISyntaxException(java.net.URISyntaxException) Catalogs(org.apache.iceberg.mr.Catalogs) LoggerFactory(org.slf4j.LoggerFactory) Date(org.apache.hadoop.hive.common.type.Date) SemanticException(org.apache.hadoop.hive.ql.parse.SemanticException) JobID(org.apache.hadoop.mapred.JobID) AbstractSerDe(org.apache.hadoop.hive.serde2.AbstractSerDe) StatsSetupConst(org.apache.hadoop.hive.common.StatsSetupConst) OutputCommitter(org.apache.hadoop.mapred.OutputCommitter) AlterTableType(org.apache.hadoop.hive.ql.ddl.table.AlterTableType) Throwables(org.apache.iceberg.relocated.com.google.common.base.Throwables) Map(java.util.Map) Configuration(org.apache.hadoop.conf.Configuration) InputFormat(org.apache.hadoop.mapred.InputFormat) URI(java.net.URI) PrimitiveTypeInfo(org.apache.hadoop.hive.serde2.typeinfo.PrimitiveTypeInfo) HiveStorageHandler(org.apache.hadoop.hive.ql.metadata.HiveStorageHandler) HiveStoragePredicateHandler(org.apache.hadoop.hive.ql.metadata.HiveStoragePredicateHandler) Splitter(org.apache.iceberg.relocated.com.google.common.base.Splitter) OutputFormat(org.apache.hadoop.mapred.OutputFormat) ExprNodeDesc(org.apache.hadoop.hive.ql.plan.ExprNodeDesc) WriteEntity(org.apache.hadoop.hive.ql.hooks.WriteEntity) Collection(java.util.Collection) Partish(org.apache.hadoop.hive.ql.stats.Partish) HiveMetaHook(org.apache.hadoop.hive.metastore.HiveMetaHook) FileSinkDesc(org.apache.hadoop.hive.ql.plan.FileSinkDesc) InputFormatConfig(org.apache.iceberg.mr.InputFormatConfig) Schema(org.apache.iceberg.Schema) Collectors(java.util.stream.Collectors) SessionState(org.apache.hadoop.hive.ql.session.SessionState) PartitionSpecParser(org.apache.iceberg.PartitionSpecParser) Serializable(java.io.Serializable) SchemaParser(org.apache.iceberg.SchemaParser) List(java.util.List) Optional(java.util.Optional) TableProperties(org.apache.iceberg.TableProperties) SessionStateUtil(org.apache.hadoop.hive.ql.session.SessionStateUtil) HiveException(org.apache.hadoop.hive.ql.metadata.HiveException) LockType(org.apache.hadoop.hive.metastore.api.LockType) ConvertAstToSearchArg(org.apache.hadoop.hive.ql.io.sarg.ConvertAstToSearchArg) HashMap(java.util.HashMap) ExprNodeDynamicListDesc(org.apache.hadoop.hive.ql.plan.ExprNodeDynamicListDesc) ArrayList(java.util.ArrayList) SearchArgument(org.apache.hadoop.hive.ql.io.sarg.SearchArgument) Utilities(org.apache.hadoop.hive.ql.exec.Utilities) JobStatus(org.apache.hadoop.mapred.JobStatus) PartitionTransformSpec(org.apache.hadoop.hive.ql.parse.PartitionTransformSpec) ExprNodeColumnDesc(org.apache.hadoop.hive.ql.plan.ExprNodeColumnDesc) Properties(java.util.Properties) Logger(org.slf4j.Logger) Timestamp(org.apache.hadoop.hive.common.type.Timestamp) ExprNodeConstantDesc(org.apache.hadoop.hive.ql.plan.ExprNodeConstantDesc) Table(org.apache.iceberg.Table) HiveConf(org.apache.hadoop.hive.conf.HiveConf) Maps(org.apache.iceberg.relocated.com.google.common.collect.Maps) IOException(java.io.IOException) SerializationUtil(org.apache.iceberg.util.SerializationUtil) JobConf(org.apache.hadoop.mapred.JobConf) SnapshotSummary(org.apache.iceberg.SnapshotSummary) JobContext(org.apache.hadoop.mapred.JobContext) Deserializer(org.apache.hadoop.hive.serde2.Deserializer) Preconditions(org.apache.iceberg.relocated.com.google.common.base.Preconditions) 
JobContextImpl(org.apache.hadoop.mapred.JobContextImpl) HiveAuthorizationProvider(org.apache.hadoop.hive.ql.security.authorization.HiveAuthorizationProvider) SerializableTable(org.apache.iceberg.SerializableTable) VisibleForTesting(org.apache.iceberg.relocated.com.google.common.annotations.VisibleForTesting)
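
As a complement, the sketch below (illustrative only, not the actual Hive implementation) shows how a consumer could read the serialized table back out of the job configuration; the lookup key mirrors the SERIALIZED_TABLE_PREFIX entry written by overlayTableProperties above.

import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Table;
import org.apache.iceberg.mr.InputFormatConfig;
import org.apache.iceberg.util.SerializationUtil;

// Illustrative reader, not part of Hive: restores the table object that
// overlayTableProperties serialized into the configuration.
public class SerializedTableReader {

    static Table readTable(Configuration conf, String tableName) {
        String encoded = conf.get(InputFormatConfig.SERIALIZED_TABLE_PREFIX + tableName);
        // null when no table was serialized for this name (e.g. the table was not serializable)
        return encoded == null ? null : SerializationUtil.deserializeFromBase64(encoded);
    }
}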

Aggregations

IOException (java.io.IOException)2 Configuration (org.apache.hadoop.conf.Configuration)2 Serializable (java.io.Serializable)1 UncheckedIOException (java.io.UncheckedIOException)1 URI (java.net.URI)1 URISyntaxException (java.net.URISyntaxException)1 ArrayList (java.util.ArrayList)1 Collection (java.util.Collection)1 HashMap (java.util.HashMap)1 List (java.util.List)1 ListIterator (java.util.ListIterator)1 Map (java.util.Map)1 Optional (java.util.Optional)1 Properties (java.util.Properties)1 Collectors (java.util.stream.Collectors)1 StatsSetupConst (org.apache.hadoop.hive.common.StatsSetupConst)1 Date (org.apache.hadoop.hive.common.type.Date)1 Timestamp (org.apache.hadoop.hive.common.type.Timestamp)1 HiveConf (org.apache.hadoop.hive.conf.HiveConf)1 HiveMetaHook (org.apache.hadoop.hive.metastore.HiveMetaHook)1