
Example 1 with SourcePathTargetImpl

Use of org.apache.crunch.io.impl.SourcePathTargetImpl in the project crunch by cloudera.

From the class MapsideJoin, method join.

/**
 * Join two tables using a map side join. The right-side table will be loaded
 * fully in memory, so this method should only be used if the right side
 * table's contents can fit in the memory allocated to mappers. The join
 * performed by this method is an inner join.
 *
 * @param left
 *          The left-side table of the join
 * @param right
 *          The right-side table of the join, whose contents will be fully
 *          read into memory
 * @return A table keyed on the join key, containing pairs of joined values
 */
public static <K, U, V> PTable<K, Pair<U, V>> join(PTable<K, U> left, PTable<K, V> right) {
    if (!(right.getPipeline() instanceof MRPipeline)) {
        throw new CrunchRuntimeException("Map-side join is only supported within a MapReduce context");
    }
    MRPipeline pipeline = (MRPipeline) right.getPipeline();
    pipeline.materialize(right);
    // TODO Move necessary logic to MRPipeline so that we can theoretically
    // optimize this by running the setup of multiple map-side joins concurrently
    pipeline.run();
    ReadableSourceTarget<Pair<K, V>> readableSourceTarget = pipeline.getMaterializeSourceTarget(right);
    if (!(readableSourceTarget instanceof SourcePathTargetImpl)) {
        throw new CrunchRuntimeException("Right-side contents can't be read from a path");
    }
    // Suppress warnings because we've just checked this cast via instanceof
    @SuppressWarnings("unchecked")
    SourcePathTargetImpl<Pair<K, V>> sourcePathTarget = (SourcePathTargetImpl<Pair<K, V>>) readableSourceTarget;
    Path path = sourcePathTarget.getPath();
    DistributedCache.addCacheFile(path.toUri(), pipeline.getConfiguration());
    MapsideJoinDoFn<K, U, V> mapJoinDoFn = new MapsideJoinDoFn<K, U, V>(path.toString(), right.getPType());
    PTypeFamily typeFamily = left.getTypeFamily();
    return left.parallelDo("mapjoin", mapJoinDoFn, typeFamily.tableOf(left.getKeyType(), typeFamily.pairs(left.getValueType(), right.getValueType())));
}
Also used: Path (org.apache.hadoop.fs.Path), MRPipeline (org.apache.crunch.impl.mr.MRPipeline), PTypeFamily (org.apache.crunch.types.PTypeFamily), SourcePathTargetImpl (org.apache.crunch.io.impl.SourcePathTargetImpl), CrunchRuntimeException (org.apache.crunch.impl.mr.run.CrunchRuntimeException), Pair (org.apache.crunch.Pair)
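
The Javadoc above describes an inner join in which the right-side table is held fully in memory while the left side is streamed past it. The following is a minimal plain-Java sketch of that idea, with no Hadoop or Crunch dependency. It is not the actual MapsideJoinDoFn implementation; the class name InMemoryHashJoinSketch and the method innerJoin are hypothetical, and tables are modeled as lists of Map.Entry purely for illustration.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of Crunch.
final class InMemoryHashJoinSketch {

    private InMemoryHashJoinSketch() {
    }

    /**
     * Inner-joins two key/value lists. The right side is indexed fully in
     * memory, so it must be small enough to fit, mirroring the constraint
     * stated in the Javadoc of MapsideJoin.join.
     */
    static <K, U, V> List<Map.Entry<K, Map.Entry<U, V>>> innerJoin(
            List<Map.Entry<K, U>> left, List<Map.Entry<K, V>> right) {
        // Build phase: group all right-side values by join key.
        Map<K, List<V>> rightIndex = new HashMap<>();
        for (Map.Entry<K, V> r : right) {
            rightIndex.computeIfAbsent(r.getKey(), k -> new ArrayList<>()).add(r.getValue());
        }

        // Probe phase: stream the left side and emit one pair per match.
        // Left keys with no right-side match emit nothing (inner join).
        List<Map.Entry<K, Map.Entry<U, V>>> joined = new ArrayList<>();
        for (Map.Entry<K, U> l : left) {
            List<V> matches = rightIndex.getOrDefault(l.getKey(), Collections.emptyList());
            for (V v : matches) {
                Map.Entry<U, V> pair = new SimpleImmutableEntry<>(l.getValue(), v);
                joined.add(new SimpleImmutableEntry<>(l.getKey(), pair));
            }
        }
        return joined;
    }
}
```

In the Crunch method above, the build phase corresponds to reading the materialized right-side table from the path placed in the distributed cache, and the probe phase corresponds to the parallelDo over the left table.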
