Example 21 with Pair

Use of org.apache.crunch.Pair in project crunch by cloudera.

The class Sort, method sort.

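/**
   * Sorts the {@link PCollection} using the natural ordering of its elements
   * in the order specified.
   *
   * @param collection the {@link PCollection} to sort.
   * @param order the direction in which to sort the elements.
   * @return a {@link PCollection} representing the sorted collection.
   */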
public static <T> PCollection<T> sort(PCollection<T> collection, Order order) {
    PTypeFamily tf = collection.getTypeFamily();
    PTableType<T, Void> type = tf.tableOf(collection.getPType(), tf.nulls());
    Configuration conf = collection.getPipeline().getConfiguration();
    GroupingOptions options = buildGroupingOptions(conf, tf, collection.getPType(), order);
    PTable<T, Void> pt = collection.parallelDo("sort-pre", new DoFn<T, Pair<T, Void>>() {

        @Override
        public void process(T input, Emitter<Pair<T, Void>> emitter) {
            emitter.emit(Pair.of(input, (Void) null));
        }
    }, type);
    PTable<T, Void> sortedPt = pt.groupByKey(options).ungroup();
    return sortedPt.parallelDo("sort-post", new DoFn<Pair<T, Void>, T>() {

        @Override
        public void process(Pair<T, Void> input, Emitter<T> emitter) {
            emitter.emit(input.first());
        }
    }, collection.getPType());
}
Also used : Configuration(org.apache.hadoop.conf.Configuration) PTypeFamily(org.apache.crunch.types.PTypeFamily) GroupingOptions(org.apache.crunch.GroupingOptions) Pair(org.apache.crunch.Pair)
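
The sort method above turns every element into the key of a table whose values are all null, lets the grouping step order the keys, and then drops the null values again. The following is a minimal plain-JDK sketch of that idea, not Crunch code: the class name KeyOnlySortSketch is hypothetical, an in-memory List stands in for the PCollection, a boolean stands in for Order, and a comparator sort stands in for groupByKey(options).ungroup().

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; none of these names come from Crunch.
public class KeyOnlySortSketch {

    public static <T extends Comparable<T>> List<T> sort(List<T> input, boolean ascending) {
        // "sort-pre": wrap each element as the key of a (key, null) pair.
        List<Map.Entry<T, Void>> pairs = new ArrayList<>();
        for (T element : input) {
            pairs.add(new SimpleImmutableEntry<T, Void>(element, null));
        }

        // Assumed stand-in for the shuffle: order the pairs by key only.
        Comparator<Map.Entry<T, Void>> byKey = Map.Entry.comparingByKey();
        pairs.sort(ascending ? byKey : byKey.reversed());

        // "sort-post": discard the null values and keep the keys.
        List<T> sorted = new ArrayList<>();
        for (Map.Entry<T, Void> pair : pairs) {
            sorted.add(pair.getKey());
        }
        return sorted;
    }
}
```

Duplicate elements survive in this sketch, just as ungrouping a grouped table yields one pair per original value.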

Example 22 with Pair

Use of org.apache.crunch.Pair in project cdk-examples by cloudera.

The class CreateSessions, method run.

@Override
public int run(String[] args) throws Exception {
    // Open an HDFS-backed dataset repository rooted at /tmp/data
    DatasetRepository fsRepo = DatasetRepositories.open("repo:hdfs:/tmp/data");
    // Construct an HCatalog dataset repository using external Hive tables
    DatasetRepository hcatRepo = DatasetRepositories.open("repo:hive:/tmp/data");
    // Turn debug on while in development.
    getPipeline().enableDebug();
    getPipeline().getConfiguration().set("crunch.log.job.progress", "true");
    // Load the events dataset and get the correct partition to sessionize
    Dataset<StandardEvent> eventsDataset = fsRepo.load("events");
    Dataset<StandardEvent> partition;
    if (args.length == 0 || (args.length == 1 && args[0].equals("LATEST"))) {
        partition = getLatestPartition(eventsDataset);
    } else {
        partition = getPartitionForURI(eventsDataset, args[0]);
    }
    // Create a parallel collection from the working partition
    PCollection<StandardEvent> events = read(CrunchDatasets.asSource(partition, StandardEvent.class));
    // Process the events into sessions, using a combiner
    PCollection<Session> sessions = events.parallelDo(new DoFn<StandardEvent, Session>() {

        @Override
        public void process(StandardEvent event, Emitter<Session> emitter) {
            emitter.emit(Session.newBuilder()
                .setUserId(event.getUserId())
                .setSessionId(event.getSessionId())
                .setIp(event.getIp())
                .setStartTimestamp(event.getTimestamp())
                .setDuration(0)
                .setSessionEventCount(1)
                .build());
        }
    }, Avros.specifics(Session.class)).by(new MapFn<Session, Pair<Long, String>>() {

        @Override
        public Pair<Long, String> map(Session session) {
            return Pair.of(session.getUserId(), session.getSessionId());
        }
    }, Avros.pairs(Avros.longs(), Avros.strings())).groupByKey().combineValues(new CombineFn<Pair<Long, String>, Session>() {

        @Override
        public void process(Pair<Pair<Long, String>, Iterable<Session>> pairIterable, Emitter<Pair<Pair<Long, String>, Session>> emitter) {
            String ip = null;
            long startTimestamp = Long.MAX_VALUE;
            long endTimestamp = Long.MIN_VALUE;
            int sessionEventCount = 0;
            for (Session s : pairIterable.second()) {
                ip = s.getIp();
                startTimestamp = Math.min(startTimestamp, s.getStartTimestamp());
                endTimestamp = Math.max(endTimestamp, s.getStartTimestamp() + s.getDuration());
                sessionEventCount += s.getSessionEventCount();
            }
            emitter.emit(Pair.of(pairIterable.first(), Session.newBuilder()
                .setUserId(pairIterable.first().first())
                .setSessionId(pairIterable.first().second())
                .setIp(ip)
                .setStartTimestamp(startTimestamp)
                .setDuration(endTimestamp - startTimestamp)
                .setSessionEventCount(sessionEventCount)
                .build()));
        }
    }).parallelDo(new DoFn<Pair<Pair<Long, String>, Session>, Session>() {

        @Override
        public void process(Pair<Pair<Long, String>, Session> pairSession, Emitter<Session> emitter) {
            emitter.emit(pairSession.second());
        }
    }, Avros.specifics(Session.class));
    // Write the sessions to the "sessions" Dataset
    getPipeline().write(sessions, CrunchDatasets.asTarget(hcatRepo.load("sessions")), Target.WriteMode.APPEND);
    return run().succeeded() ? 0 : 1;
}
Also used : Emitter(org.apache.crunch.Emitter) DatasetRepository(com.cloudera.cdk.data.DatasetRepository) FileSystemDatasetRepository(com.cloudera.cdk.data.filesystem.FileSystemDatasetRepository) MapFn(org.apache.crunch.MapFn) CombineFn(org.apache.crunch.CombineFn) StandardEvent(com.cloudera.cdk.data.event.StandardEvent) Session(com.cloudera.cdk.examples.demo.event.Session) Pair(org.apache.crunch.Pair)
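
The combiner in the run method folds all sessions that share a (userId, sessionId) key into one: it keeps the earliest start, the latest end, and the sum of the event counts. The following is a minimal plain-Java sketch of just that arithmetic. The class SessionMergeSketch and its nested Span type are hypothetical stand-ins for the Avro-generated Session class, modelling only the numeric fields, and the sketch assumes at least one input value per key.

```java
// Hypothetical stand-in for the combiner logic; not part of cdk-examples.
public class SessionMergeSketch {

    // Models only the numeric Session fields that the combiner reads.
    public static final class Span {
        public final long startTimestamp;
        public final long duration;
        public final int eventCount;

        public Span(long startTimestamp, long duration, int eventCount) {
            this.startTimestamp = startTimestamp;
            this.duration = duration;
            this.eventCount = eventCount;
        }
    }

    // Merges one or more spans; assumes the iterable is not empty.
    public static Span merge(Iterable<Span> spans) {
        long start = Long.MAX_VALUE;
        long end = Long.MIN_VALUE;
        int count = 0;
        for (Span s : spans) {
            start = Math.min(start, s.startTimestamp);
            end = Math.max(end, s.startTimestamp + s.duration);
            count += s.eventCount;
        }
        return new Span(start, end - start, count);
    }
}
```

Because the merged result has the same shape as its inputs, a partial result can be merged again with further spans and gives the same answer as merging everything at once, which is the property a combiner relies on.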

Aggregations

Pair (org.apache.crunch.Pair): 22
PTypeFamily (org.apache.crunch.types.PTypeFamily): 15
GroupingOptions (org.apache.crunch.GroupingOptions): 6
Configuration (org.apache.hadoop.conf.Configuration): 5
MRPipeline (org.apache.crunch.impl.mr.MRPipeline): 4
Test (org.junit.Test): 4
Collection (java.util.Collection): 3
CombineFn (org.apache.crunch.CombineFn): 3
Emitter (org.apache.crunch.Emitter): 3
PCollection (org.apache.crunch.PCollection): 3
Utf8 (org.apache.avro.util.Utf8): 2
DoFn (org.apache.crunch.DoFn): 2
Pipeline (org.apache.crunch.Pipeline): 2
Tuple3 (org.apache.crunch.Tuple3): 2
Path (org.apache.hadoop.fs.Path): 2
DatasetRepository (com.cloudera.cdk.data.DatasetRepository): 1
StandardEvent (com.cloudera.cdk.data.event.StandardEvent): 1
FileSystemDatasetRepository (com.cloudera.cdk.data.filesystem.FileSystemDatasetRepository): 1
Session (com.cloudera.cdk.examples.demo.event.Session): 1
IOException (java.io.IOException): 1