Example 16 with MRPipeline

use of org.apache.crunch.impl.mr.MRPipeline in project crunch by cloudera.

the class MapsideJoin method join.

/**
 * Join two tables using a map side join. The right-side table will be loaded
 * fully in memory, so this method should only be used if the right side
 * table's contents can fit in the memory allocated to mappers. The join
 * performed by this method is an inner join.
 *
 * @param left
 *          The left-side table of the join
 * @param right
 *          The right-side table of the join, whose contents will be fully
 *          read into memory
 * @return A table keyed on the join key, containing pairs of joined values
 */
public static <K, U, V> PTable<K, Pair<U, V>> join(PTable<K, U> left, PTable<K, V> right) {
    if (!(right.getPipeline() instanceof MRPipeline)) {
        throw new CrunchRuntimeException("Map-side join is only supported within a MapReduce context");
    }
    MRPipeline pipeline = (MRPipeline) right.getPipeline();
    pipeline.materialize(right);
    // TODO Move necessary logic to MRPipeline so that we can theoretically
    // optimize this by running the setup of multiple map-side joins concurrently
    pipeline.run();
    ReadableSourceTarget<Pair<K, V>> readableSourceTarget = pipeline.getMaterializeSourceTarget(right);
    if (!(readableSourceTarget instanceof SourcePathTargetImpl)) {
        throw new CrunchRuntimeException("Right-side contents can't be read from a path");
    }
    // Suppress warnings because we've just checked this cast via instanceof
    @SuppressWarnings("unchecked")
    SourcePathTargetImpl<Pair<K, V>> sourcePathTarget = (SourcePathTargetImpl<Pair<K, V>>) readableSourceTarget;
    Path path = sourcePathTarget.getPath();
    DistributedCache.addCacheFile(path.toUri(), pipeline.getConfiguration());
    MapsideJoinDoFn<K, U, V> mapJoinDoFn = new MapsideJoinDoFn<K, U, V>(path.toString(), right.getPType());
    PTypeFamily typeFamily = left.getTypeFamily();
    return left.parallelDo("mapjoin", mapJoinDoFn, typeFamily.tableOf(left.getKeyType(), typeFamily.pairs(left.getValueType(), right.getValueType())));
}
Also used : Path(org.apache.hadoop.fs.Path) MRPipeline(org.apache.crunch.impl.mr.MRPipeline) PTypeFamily(org.apache.crunch.types.PTypeFamily) SourcePathTargetImpl(org.apache.crunch.io.impl.SourcePathTargetImpl) CrunchRuntimeException(org.apache.crunch.impl.mr.run.CrunchRuntimeException) Pair(org.apache.crunch.Pair)
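
The Javadoc above describes the join as loading the right-side table fully into memory and performing an inner join. The following is a minimal plain-Java sketch of that idea only, not Crunch's actual MapsideJoinDoFn: the class name MapsideJoinSketch is hypothetical, java.util.Map.Entry stands in for Crunch's Pair, and in-memory lists stand in for PTable. It indexes the right side in a hash map, then streams the left side and emits one joined pair per match.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of Crunch.
public class MapsideJoinSketch {

    /** Inner-joins two key/value lists by holding the whole right side in memory. */
    public static <K, U, V> List<Map.Entry<K, Map.Entry<U, V>>> join(
            List<Map.Entry<K, U>> left, List<Map.Entry<K, V>> right) {
        // Build phase: index every right-side value by its join key.
        Map<K, List<V>> rightIndex = new HashMap<>();
        for (Map.Entry<K, V> r : right) {
            rightIndex.computeIfAbsent(r.getKey(), k -> new ArrayList<>()).add(r.getValue());
        }

        // Probe phase: stream the left side and emit one pair per matching right value.
        List<Map.Entry<K, Map.Entry<U, V>>> joined = new ArrayList<>();
        for (Map.Entry<K, U> l : left) {
            List<V> matches = rightIndex.get(l.getKey());
            if (matches == null) {
                continue; // inner join: left rows without a match are dropped
            }
            for (V v : matches) {
                Map.Entry<U, V> pair = new SimpleEntry<>(l.getValue(), v);
                joined.add(new SimpleEntry<>(l.getKey(), pair));
            }
        }
        return joined;
    }
}
```

Because only the right side is indexed, memory use in this sketch grows with the right table alone, which is why the Javadoc restricts the method to right-side tables that fit in mapper memory.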

Example 17 with MRPipeline

use of org.apache.crunch.impl.mr.MRPipeline in project crunch by cloudera.

the class PageRankTest method testWritablesBSON.

@Test
public void testWritablesBSON() throws Exception {
    PTypeFamily tf = WritableTypeFamily.getInstance();
    PType<PageRankData> prType = PTypes.smile(PageRankData.class, tf);
    run(new MRPipeline(PageRankTest.class), prType, tf);
}
Also used : PTypeFamily(org.apache.crunch.types.PTypeFamily) MRPipeline(org.apache.crunch.impl.mr.MRPipeline) Test(org.junit.Test)

Example 18 with MRPipeline

use of org.apache.crunch.impl.mr.MRPipeline in project crunch by cloudera.

the class PageRankTest method testWritablesJSON.

@Test
public void testWritablesJSON() throws Exception {
    PTypeFamily tf = WritableTypeFamily.getInstance();
    PType<PageRankData> prType = PTypes.jsonString(PageRankData.class, tf);
    run(new MRPipeline(PageRankTest.class), prType, tf);
}
Also used : PTypeFamily(org.apache.crunch.types.PTypeFamily) MRPipeline(org.apache.crunch.impl.mr.MRPipeline) Test(org.junit.Test)

Example 19 with MRPipeline

use of org.apache.crunch.impl.mr.MRPipeline in project crunch by cloudera.

the class AggregateTest method testCollectValues_Avro.

@Test
public void testCollectValues_Avro() throws IOException {
    MapStringToEmployeePair mapFn = new MapStringToEmployeePair();
    Pipeline pipeline = new MRPipeline(AggregateTest.class);
    Map<Integer, Collection<Employee>> collectionMap = pipeline
        .readTextFile(FileHelper.createTempCopyOf("set2.txt"))
        .parallelDo(mapFn, Avros.tableOf(Avros.ints(), Avros.records(Employee.class)))
        .collectValues()
        .materializeToMap();
    assertEquals(1, collectionMap.size());
    Employee empC = mapFn.map("c").second();
    Employee empD = mapFn.map("d").second();
    Employee empA = mapFn.map("a").second();
    assertEquals(Lists.newArrayList(empC, empD, empA), collectionMap.get(1));
}
Also used : Employee(org.apache.crunch.test.Employee) MRPipeline(org.apache.crunch.impl.mr.MRPipeline) PCollection(org.apache.crunch.PCollection) Collection(java.util.Collection) MemPipeline(org.apache.crunch.impl.mem.MemPipeline) Pipeline(org.apache.crunch.Pipeline) Test(org.junit.Test)

Example 20 with MRPipeline

use of org.apache.crunch.impl.mr.MRPipeline in project crunch by cloudera.

the class AggregateTest method testAvro.

@Test
public void testAvro() throws Exception {
    Pipeline pipeline = new MRPipeline(AggregateTest.class);
    String shakesInputPath = FileHelper.createTempCopyOf("shakes.txt");
    PCollection<String> shakes = pipeline.readTextFile(shakesInputPath);
    runMinMax(shakes, AvroTypeFamily.getInstance());
    pipeline.done();
}
Also used : MRPipeline(org.apache.crunch.impl.mr.MRPipeline) MemPipeline(org.apache.crunch.impl.mem.MemPipeline) Pipeline(org.apache.crunch.Pipeline) Test(org.junit.Test)

Aggregations

MRPipeline (org.apache.crunch.impl.mr.MRPipeline) 34
Test (org.junit.Test) 26
Pipeline (org.apache.crunch.Pipeline) 13
PTypeFamily (org.apache.crunch.types.PTypeFamily) 7
MemPipeline (org.apache.crunch.impl.mem.MemPipeline) 6
Pair (org.apache.crunch.Pair) 4
Collection (java.util.Collection) 3
Record (org.apache.avro.generic.GenericData.Record) 3
GenericRecord (org.apache.avro.generic.GenericRecord) 3
PCollection (org.apache.crunch.PCollection) 3
Person (org.apache.crunch.test.Person) 3
Schema (org.apache.avro.Schema) 2
PojoPerson (org.apache.crunch.io.avro.AvroFileReaderFactoryTest.PojoPerson) 2
Employee (org.apache.crunch.test.Employee) 2
Before (org.junit.Before) 2
ImmutableMap (com.google.common.collect.ImmutableMap) 1
Map (java.util.Map) 1
MapFn (org.apache.crunch.MapFn) 1
CrunchRuntimeException (org.apache.crunch.impl.mr.run.CrunchRuntimeException) 1
SourcePathTargetImpl (org.apache.crunch.io.impl.SourcePathTargetImpl) 1