
Example 1 with ProcessingMode

Use of org.apache.hadoop.hive.ql.plan.VectorGroupByDesc.ProcessingMode in project hive by apache.

From the class Vectorizer, method validateGroupByOperator.

private boolean validateGroupByOperator(GroupByOperator op, boolean isReduce, boolean isTezOrSpark) {
    GroupByDesc desc = op.getConf();
    if (desc.isGroupingSetsPresent()) {
        setOperatorIssue("Grouping sets not supported");
        return false;
    }
    if (desc.pruneGroupingSetId()) {
        setOperatorIssue("Pruning grouping set id not supported");
        return false;
    }
    if (desc.getMode() != GroupByDesc.Mode.HASH && desc.isDistinct()) {
        setOperatorIssue("DISTINCT not supported");
        return false;
    }
    boolean ret = validateExprNodeDesc(desc.getKeys(), "Key");
    if (!ret) {
        return false;
    }
    /**
     *
     * GROUP BY DEFINITIONS:
     *
     * GroupByDesc.Mode enumeration:
     *
     *    The different modes of a GROUP BY operator.
     *
     *    These descriptions are hopefully less cryptic than the comments for GroupByDesc.Mode.
     *
     *        COMPLETE       Aggregates original rows into full aggregation row(s).
     *
     *                       If the key length is 0, this is also called Global aggregation and
     *                       1 output row is produced.
     *
     *                       When the key length is > 0, the original rows come in ALREADY GROUPED.
     *
     *                       An example for key length > 0 is a GROUP BY being applied to the
     *                       ALREADY GROUPED rows coming from an upstream JOIN operator.  Or,
     *                       ALREADY GROUPED rows coming from upstream MERGEPARTIAL GROUP BY
     *                       operator.
     *
     *        PARTIAL1       The first of 2 (or more) phases that aggregates ALREADY GROUPED
     *                       original rows into partial aggregations.
     *
     *                       Subsequent phases PARTIAL2 (optional) and MERGEPARTIAL will merge
     *                       the partial aggregations and output full aggregations.
     *
     *        PARTIAL2       Accept ALREADY GROUPED partial aggregations and merge them into another
     *                       partial aggregation.  Output the merged partial aggregations.
     *
     *                       (Haven't seen this one used)
     *
     *        PARTIALS       (Behaves for non-distinct the same as PARTIAL2; and behaves for
     *                       distinct the same as PARTIAL1.)
     *
     *        FINAL          Accept ALREADY GROUPED original rows and aggregate them into
     *                       full aggregations.
     *
     *                       Example is a GROUP BY being applied to rows from a sorted table, where
     *                       the group key is the table sort key (or a prefix).
     *
     *        HASH           Accept UNORDERED original rows and aggregate them into a memory table.
     *                       Output the partial aggregations on closeOp (or low memory).
     *
     *                       Similar to PARTIAL1 except original rows are UNORDERED.
     *
     *                       Commonly used in both Mapper and Reducer nodes.  Always followed by
     *                       a Reducer with MERGEPARTIAL GROUP BY.
     *
     *        MERGEPARTIAL   Always first operator of a Reducer.  Data is grouped by reduce-shuffle.
     *
     *                       (Behaves for non-distinct aggregations the same as FINAL; and behaves
     *                       for distinct aggregations the same as COMPLETE.)
     *
     *                       The output is full aggregation(s).
     *
     *                       Used in Reducers after a stage with a HASH GROUP BY operator.
     *
     *
     *  VectorGroupByDesc.ProcessingMode for VectorGroupByOperator:
     *
     *     GLOBAL         No key.  All rows --> 1 full aggregation on end of input
     *
     *     HASH           Rows aggregated into hash table on group key -->
     *                        1 partial aggregation per key (normally, unless there is spilling)
     *
     *     MERGE_PARTIAL  As first operator in a REDUCER, partial aggregations come grouped from
     *                    reduce-shuffle -->
     *                        aggregate the partial aggregations and emit full aggregation on
     *                        endGroup / closeOp
     *
     *     STREAMING      Rows come from PARENT operator ALREADY GROUPED -->
     *                        aggregate the rows and emit full aggregation on key change / closeOp
     *
     *     NOTE: Hash can spill partial result rows prematurely if it runs low on memory.
     *     NOTE: Streaming has to compare keys where MergePartial gets an endGroup call.
     *
     *
     *  DECIDER: Which VectorGroupByDesc.ProcessingMode for VectorGroupByOperator?
     *
     *     Decides using GroupByDesc.Mode and whether there are keys with the
     *     VectorGroupByDesc.groupByDescModeToVectorProcessingMode method.
     *
     *         Mode.COMPLETE      --> (numKeys == 0 ? ProcessingMode.GLOBAL : ProcessingMode.STREAMING)
     *
     *         Mode.HASH          --> ProcessingMode.HASH
     *
     *         Mode.MERGEPARTIAL  --> (numKeys == 0 ? ProcessingMode.GLOBAL : ProcessingMode.MERGE_PARTIAL)
     *
     *         Mode.PARTIAL1,
     *         Mode.PARTIAL2,
     *         Mode.PARTIALS,
     *         Mode.FINAL        --> ProcessingMode.STREAMING
     *
     */
    boolean hasKeys = (desc.getKeys().size() > 0);
    ProcessingMode processingMode = VectorGroupByDesc.groupByDescModeToVectorProcessingMode(desc.getMode(), hasKeys);
    Pair<Boolean, Boolean> retPair = validateAggregationDescs(desc.getAggregators(), processingMode, hasKeys);
    if (!retPair.left) {
        return false;
    }
    // If all the aggregation outputs are primitive, we can output VectorizedRowBatch.
    // Otherwise, the rest of the operator tree will be row mode.
    VectorGroupByDesc vectorDesc = new VectorGroupByDesc();
    desc.setVectorDesc(vectorDesc);
    vectorDesc.setVectorOutput(retPair.right);
    vectorDesc.setProcessingMode(processingMode);
    LOG.info("Vector GROUP BY operator will use processing mode " + processingMode.name() + ", isVectorOutput " + vectorDesc.isVectorOutput());
    return true;
}
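The DECIDER table in the comment above can be restated as a small standalone method. This is a minimal sketch built only from that comment, not a copy of Hive's `VectorGroupByDesc.groupByDescModeToVectorProcessingMode`; the class name `ProcessingModeSketch`, the method name `toProcessingMode`, and the two local enums are assumptions that stand in for the real Hive types.

```java
// Hypothetical stand-in for the mapping described in the DECIDER comment.
// The enums mirror the names listed there; they are not the Hive classes.
final class ProcessingModeSketch {

    enum Mode { COMPLETE, PARTIAL1, PARTIAL2, PARTIALS, FINAL, HASH, MERGEPARTIAL }

    enum ProcessingMode { GLOBAL, HASH, MERGE_PARTIAL, STREAMING }

    // Maps a GROUP BY mode plus "are there keys?" to a vector processing mode,
    // following the table in the comment line by line.
    static ProcessingMode toProcessingMode(Mode mode, boolean hasKeys) {
        switch (mode) {
            case COMPLETE:
                // No keys: global aggregation. With keys: rows arrive already grouped.
                return hasKeys ? ProcessingMode.STREAMING : ProcessingMode.GLOBAL;
            case HASH:
                return ProcessingMode.HASH;
            case MERGEPARTIAL:
                return hasKeys ? ProcessingMode.MERGE_PARTIAL : ProcessingMode.GLOBAL;
            case PARTIAL1:
            case PARTIAL2:
            case PARTIALS:
            case FINAL:
                return ProcessingMode.STREAMING;
            default:
                throw new IllegalArgumentException("Unexpected mode " + mode);
        }
    }

    public static void main(String[] args) {
        // Print the full mapping for both the keyed and the keyless case.
        for (Mode mode : Mode.values()) {
            System.out.println(mode + " hasKeys=true  --> " + toProcessingMode(mode, true));
            System.out.println(mode + " hasKeys=false --> " + toProcessingMode(mode, false));
        }
    }
}
```

Note that only COMPLETE and MERGEPARTIAL depend on whether keys are present; every other mode maps to a single processing mode.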
Also used: ProcessingMode (org.apache.hadoop.hive.ql.plan.VectorGroupByDesc.ProcessingMode), VectorGroupByDesc (org.apache.hadoop.hive.ql.plan.VectorGroupByDesc), UDFToBoolean (org.apache.hadoop.hive.ql.udf.UDFToBoolean), GroupByDesc (org.apache.hadoop.hive.ql.plan.GroupByDesc)
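
The STREAMING description in the method's comment (rows arrive ALREADY GROUPED, and a full aggregation is emitted on key change or on close) can be illustrated with a toy aggregator. This is a hedged sketch of the general technique only: the class `StreamingSumSketch`, its `aggregate` method, the string keys, and the choice of SUM as the aggregation are all assumptions, and none of it reflects how Hive's VectorGroupByOperator is actually implemented.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of streaming aggregation over already-grouped rows.
final class StreamingSumSketch {

    // keys[i] and values[i] form one input row; rows with equal keys must be adjacent.
    // Returns one "key=sum" entry per group, in input order.
    static List<String> aggregate(String[] keys, long[] values) {
        List<String> out = new ArrayList<>();
        String currentKey = null;
        long sum = 0;
        boolean haveGroup = false;
        for (int i = 0; i < keys.length; i++) {
            // Streaming has to compare keys itself to detect a group boundary.
            if (haveGroup && !keys[i].equals(currentKey)) {
                out.add(currentKey + "=" + sum); // emit on key change
                sum = 0;
            }
            currentKey = keys[i];
            haveGroup = true;
            sum += values[i];
        }
        if (haveGroup) {
            out.add(currentKey + "=" + sum); // emit the last group on close
        }
        return out;
    }

    public static void main(String[] args) {
        String[] keys = {"a", "a", "b", "c", "c"};
        long[] values = {1, 2, 5, 3, 4};
        System.out.println(aggregate(keys, values)); // prints [a=3, b=5, c=7]
    }
}
```

Because only the current group's state is held, memory use does not grow with the number of distinct keys, which is the practical difference from the HASH mode described in the same comment.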
