use of org.apache.crunch.impl.mr.MRPipeline in project crunch by cloudera.
the class MapsideJoin method join.
/**
 * Join two tables using a map side join. The right-side table will be loaded
 * fully in memory, so this method should only be used if the right side
 * table's contents can fit in the memory allocated to mappers. The join
 * performed by this method is an inner join.
 *
 * @param left
 *          The left-side table of the join
 * @param right
 *          The right-side table of the join, whose contents will be fully
 *          read into memory
 * @return A table keyed on the join key, containing pairs of joined values
 */
public static <K, U, V> PTable<K, Pair<U, V>> join(PTable<K, U> left, PTable<K, V> right) {
  if (!(right.getPipeline() instanceof MRPipeline)) {
    throw new CrunchRuntimeException("Map-side join is only supported within a MapReduce context");
  }
  MRPipeline pipeline = (MRPipeline) right.getPipeline();
  pipeline.materialize(right);
  // TODO Move necessary logic to MRPipeline so that we can theoretically
  // optimize this by running the setup of multiple map-side joins concurrently
  pipeline.run();
  ReadableSourceTarget<Pair<K, V>> readableSourceTarget = pipeline.getMaterializeSourceTarget(right);
  if (!(readableSourceTarget instanceof SourcePathTargetImpl)) {
    throw new CrunchRuntimeException("Right-side contents can't be read from a path");
  }
  // Suppress warnings because we've just checked this cast via instanceof
  @SuppressWarnings("unchecked")
  SourcePathTargetImpl<Pair<K, V>> sourcePathTarget = (SourcePathTargetImpl<Pair<K, V>>) readableSourceTarget;
  Path path = sourcePathTarget.getPath();
  DistributedCache.addCacheFile(path.toUri(), pipeline.getConfiguration());
  MapsideJoinDoFn<K, U, V> mapJoinDoFn = new MapsideJoinDoFn<K, U, V>(path.toString(), right.getPType());
  PTypeFamily typeFamily = left.getTypeFamily();
  return left.parallelDo("mapjoin", mapJoinDoFn,
      typeFamily.tableOf(left.getKeyType(), typeFamily.pairs(left.getValueType(), right.getValueType())));
}
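The Javadoc above describes the core idea of a map-side join: hold the right-side table in memory and stream the left side past it, emitting only keys present on both sides. The sketch below illustrates that idea in plain Java, without Crunch or Hadoop. The class name InMemoryJoinSketch and its use of java.util.Map.Entry in place of Crunch's Pair are assumptions made for illustration; this is not how MapsideJoinDoFn is implemented.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of Crunch.
public class InMemoryJoinSketch {

  /**
   * Inner-joins two lists of key/value entries. The right side is indexed
   * fully in memory, so it must be small enough to fit in the heap.
   */
  public static <K, U, V> List<Map.Entry<K, Map.Entry<U, V>>> join(
      List<Map.Entry<K, U>> left, List<Map.Entry<K, V>> right) {
    // Build phase: index every right-side value by its join key.
    Map<K, List<V>> rightIndex = new HashMap<>();
    for (Map.Entry<K, V> r : right) {
      rightIndex.computeIfAbsent(r.getKey(), k -> new ArrayList<>()).add(r.getValue());
    }

    // Probe phase: stream the left side and look up matches by key.
    List<Map.Entry<K, Map.Entry<U, V>>> joined = new ArrayList<>();
    for (Map.Entry<K, U> l : left) {
      List<V> matches = rightIndex.get(l.getKey());
      if (matches == null) {
        continue; // inner join: left rows without a right-side match are dropped
      }
      for (V v : matches) {
        Map.Entry<U, V> values = new AbstractMap.SimpleImmutableEntry<>(l.getValue(), v);
        joined.add(new AbstractMap.SimpleImmutableEntry<>(l.getKey(), values));
      }
    }
    return joined;
  }

  public static void main(String[] args) {
    List<Map.Entry<String, Integer>> left = new ArrayList<>();
    left.add(new AbstractMap.SimpleImmutableEntry<>("a", 1));
    left.add(new AbstractMap.SimpleImmutableEntry<>("b", 2));

    List<Map.Entry<String, String>> right = new ArrayList<>();
    right.add(new AbstractMap.SimpleImmutableEntry<>("a", "x"));

    // Only key "a" appears on both sides, so a single joined row is produced.
    System.out.println(join(left, right).size());
  }
}
```

In the Crunch method, the build phase corresponds to reading the materialized right-side file that was shipped to each mapper through the DistributedCache, and the probe phase corresponds to the parallelDo over the left table.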
use of org.apache.crunch.impl.mr.MRPipeline in project crunch by cloudera.
the class PageRankTest method testWritablesBSON.
@Test
public void testWritablesBSON() throws Exception {
  PTypeFamily tf = WritableTypeFamily.getInstance();
  PType<PageRankData> prType = PTypes.smile(PageRankData.class, tf);
  run(new MRPipeline(PageRankTest.class), prType, tf);
}
use of org.apache.crunch.impl.mr.MRPipeline in project crunch by cloudera.
the class PageRankTest method testWritablesJSON.
@Test
public void testWritablesJSON() throws Exception {
  PTypeFamily tf = WritableTypeFamily.getInstance();
  PType<PageRankData> prType = PTypes.jsonString(PageRankData.class, tf);
  run(new MRPipeline(PageRankTest.class), prType, tf);
}
use of org.apache.crunch.impl.mr.MRPipeline in project crunch by cloudera.
the class AggregateTest method testCollectValues_Avro.
@Test
public void testCollectValues_Avro() throws IOException {
  MapStringToEmployeePair mapFn = new MapStringToEmployeePair();
  Pipeline pipeline = new MRPipeline(AggregateTest.class);
  Map<Integer, Collection<Employee>> collectionMap = pipeline
      .readTextFile(FileHelper.createTempCopyOf("set2.txt"))
      .parallelDo(mapFn, Avros.tableOf(Avros.ints(), Avros.records(Employee.class)))
      .collectValues()
      .materializeToMap();
  assertEquals(1, collectionMap.size());
  Employee empC = mapFn.map("c").second();
  Employee empD = mapFn.map("d").second();
  Employee empA = mapFn.map("a").second();
  assertEquals(Lists.newArrayList(empC, empD, empA), collectionMap.get(1));
}
use of org.apache.crunch.impl.mr.MRPipeline in project crunch by cloudera.
the class AggregateTest method testAvro.
@Test
public void testAvro() throws Exception {
  Pipeline pipeline = new MRPipeline(AggregateTest.class);
  String shakesInputPath = FileHelper.createTempCopyOf("shakes.txt");
  PCollection<String> shakes = pipeline.readTextFile(shakesInputPath);
  runMinMax(shakes, AvroTypeFamily.getInstance());
  pipeline.done();
}
Aggregations