Example 1 with MRPipeline

Use of MRPipeline in project crunch by cloudera.

In the class AggregateTest, method testWritables:

public void testWritables() throws Exception {
    Pipeline pipeline = new MRPipeline(AggregateTest.class);
    String shakesInputPath = FileHelper.createTempCopyOf("shakes.txt");
    PCollection<String> shakes = pipeline.readTextFile(shakesInputPath);
    runMinMax(shakes, WritableTypeFamily.getInstance());
}
Also used: MRPipeline (org.apache.crunch.impl.mr.MRPipeline), MemPipeline (org.apache.crunch.impl.mem.MemPipeline), Pipeline (org.apache.crunch.Pipeline), Test (org.junit.Test)

Example 2 with MRPipeline

Use of MRPipeline in project crunch by cloudera.

In the class WordCount, method run:

public int run(String[] args) throws Exception {
    if (args.length != 3) {
        System.err.println("Usage: " + this.getClass().getName() + " [generic options] input output");
        return 1;
    }
    // Create an object to coordinate pipeline creation and execution.
    Pipeline pipeline = new MRPipeline(WordCount.class, getConf());
    // Reference a given text file as a collection of Strings.
    PCollection<String> lines = pipeline.readTextFile(args[1]);
    // Define a function that splits each line in a PCollection of Strings into a
    // PCollection made up of the individual words in the file.
    PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {

        public void process(String line, Emitter<String> emitter) {
            for (String word : line.split("\\s+")) {
                emitter.emit(word);
            }
        }
    }, // Indicates the serialization format
    Writables.strings());
    // The count method applies a series of Crunch primitives and returns
    // a map of the unique words in the input PCollection to their counts.
    // Best of all, the count() function doesn't need to know anything about
    // the kind of data stored in the input PCollection.
    PTable<String, Long> counts = words.count();
    // Instruct the pipeline to write the resulting counts to a text file.
    pipeline.writeTextFile(counts, args[2]);
    // Execute the pipeline as a MapReduce.
    pipeline.done();
    return 0;
}
Also used: MRPipeline (org.apache.crunch.impl.mr.MRPipeline), Pipeline (org.apache.crunch.Pipeline)
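
To make the data flow of Example 2 easier to follow, here is a minimal sketch of the same split-then-count logic on plain in-memory Java collections. It does not use the Crunch API at all: the class name WordCountSketch and the method countWords are hypothetical, and unlike the DoFn in the example it drops empty tokens (which split() produces for lines with leading whitespace).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical stand-in for the WordCount pipeline; plain Java, no Crunch or Hadoop.
final class WordCountSketch {

    // Splits every line on whitespace and counts how often each word occurs.
    static Map<String, Long> countWords(List<String> lines) {
        return lines.stream()
                // Corresponds to the DoFn: one line in, many words out.
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                // Assumption of this sketch: empty tokens are not counted.
                .filter(word -> !word.isEmpty())
                // Corresponds to count(): group identical words and tally them.
                .collect(Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));
    }
}
```

Calling WordCountSketch.countWords with the lines "to be or" and "not to be" gives a map with four keys, where "to" and "be" each map to 2. A TreeMap is used only so that the result iterates in sorted key order.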

Example 3 with MRPipeline

Use of MRPipeline in project crunch by cloudera.

In the class SpecificAvroGroupByTest, method testGrouByWithSpecificAvroType:

public void testGrouByWithSpecificAvroType() throws Exception {
    MRPipeline pipeline = new MRPipeline(SpecificAvroGroupByTest.class);
    // ...
}
Also used: MRPipeline (org.apache.crunch.impl.mr.MRPipeline), Test (org.junit.Test)

Example 4 with MRPipeline

Use of MRPipeline in project crunch by cloudera.

In the class MapsideJoinTest, method testMapsideJoin_RightSideIsEmpty:

public void testMapsideJoin_RightSideIsEmpty() throws IOException {
    MRPipeline pipeline = new MRPipeline(MapsideJoinTest.class);
    PTable<Integer, String> customerTable = readTable(pipeline, "customers.txt");
    PTable<Integer, String> orderTable = readTable(pipeline, "orders.txt");
    PTable<Integer, String> filteredOrderTable = orderTable.parallelDo(new NegativeFilter(), orderTable.getPTableType());
    PTable<Integer, Pair<String, String>> joined = MapsideJoin.join(customerTable, filteredOrderTable);
    List<Pair<Integer, Pair<String, String>>> materializedJoin = Lists.newArrayList(joined.materialize());
}
Also used: MRPipeline (org.apache.crunch.impl.mr.MRPipeline), Pair (org.apache.crunch.Pair), Test (org.junit.Test)

Example 5 with MRPipeline

Use of MRPipeline in project crunch by cloudera.

In the class MapsTest, method run:

public static void run(PTypeFamily typeFamily) throws Exception {
    Pipeline pipeline = new MRPipeline(MapsTest.class);
    String shakesInputPath = FileHelper.createTempCopyOf("shakes.txt");
    PCollection<String> shakespeare = pipeline.readTextFile(shakesInputPath);
    Iterable<Pair<String, Map<String, Long>>> output = shakespeare.parallelDo(new DoFn<String, Pair<String, Map<String, Long>>>() {

        public void process(String input, Emitter<Pair<String, Map<String, Long>>> emitter) {
            String last = null;
            for (String word : input.toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) {
                    String firstChar = word.substring(0, 1);
                    if (last != null) {
                        Map<String, Long> cc = ImmutableMap.of(firstChar, 1L);
                        emitter.emit(Pair.of(last, cc));
                    }
                    last = firstChar;
                }
            }
        }
    }, typeFamily.tableOf(typeFamily.strings(), typeFamily.maps(typeFamily.longs()))).groupByKey().combineValues(new CombineFn<String, Map<String, Long>>() {

        public void process(Pair<String, Iterable<Map<String, Long>>> input, Emitter<Pair<String, Map<String, Long>>> emitter) {
            Map<String, Long> agg = Maps.newHashMap();
            for (Map<String, Long> in : input.second()) {
                for (Map.Entry<String, Long> e : in.entrySet()) {
                    if (!agg.containsKey(e.getKey())) {
                        agg.put(e.getKey(), e.getValue());
                    } else {
                        agg.put(e.getKey(), e.getValue() + agg.get(e.getKey()));
                    }
                }
            }
            emitter.emit(Pair.of(input.first(), agg));
        }
    }).materialize();
    boolean passed = false;
    for (Pair<String, Map<String, Long>> v : output) {
        // Use equals() for both checks: == on a String compares references,
        // and unboxing a missing map entry would throw a NullPointerException.
        if ("k".equals(v.first()) && Long.valueOf(8L).equals(v.second().get("n"))) {
            passed = true;
        }
    }
}
Also used: MRPipeline (org.apache.crunch.impl.mr.MRPipeline), ImmutableMap (com.google.common.collect.ImmutableMap), Map (java.util.Map)
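
The CombineFn in Example 5 sums per-key counts across all maps that share a group key. As a rough illustration of just that aggregation step, the sketch below does the same on plain Java collections, using Map.merge in place of the containsKey/put branches. The class name MapSumSketch and the method sumMaps are hypothetical and are not part of Crunch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper illustrating the aggregation inside the CombineFn; no Crunch types involved.
final class MapSumSketch {

    // Adds up the values of identical keys across all input maps.
    static Map<String, Long> sumMaps(Iterable<Map<String, Long>> maps) {
        Map<String, Long> agg = new HashMap<>();
        for (Map<String, Long> in : maps) {
            for (Map.Entry<String, Long> e : in.entrySet()) {
                // merge() stores the value if the key is new, otherwise adds it to the running total.
                agg.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        return agg;
    }
}
```

For the inputs {n=1} and {n=2, a=5}, sumMaps returns a map in which "n" maps to 3 and "a" maps to 5. Because addition is associative and commutative, the result does not depend on how the inputs are batched, which is the property a combiner relies on.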
