Example 1 with PartitionKey

Use of com.cloudera.cdk.data.PartitionKey in project cdk-examples by Cloudera.

Class StagingToPersistentSerial, method run.

@Override
public int run(String[] args) throws Exception {
    // open the repository
    final DatasetRepository repo = DatasetRepositories.open("repo:file:/tmp/data");
    final Calendar now = Calendar.getInstance();
    final long yesterdayTimestamp = now.getTimeInMillis() - DAY_IN_MILLIS;
    // the destination dataset
    final Dataset<GenericRecord> persistent = repo.load("logs");
    final DatasetWriter<GenericRecord> writer = persistent.newWriter();
    writer.open();
    // the source dataset: yesterday's partition in the staging area
    final Dataset<GenericRecord> staging = repo.load("logs-staging");
    final PartitionKey yesterday = getPartitionKey(staging, yesterdayTimestamp);
    final DatasetReader<GenericRecord> reader = staging.getPartition(yesterday, false).newReader();
    try {
        reader.open();
        // yep, it's that easy.
        for (GenericRecord record : reader) {
            writer.write(record);
        }
    } finally {
        reader.close();
        writer.flush();
    }
    // remove the source data partition from staging
    staging.dropPartition(yesterday);
    // if the above didn't throw an exception, commit the data
    writer.close();
    return 0;
}
Also used: DatasetRepository (com.cloudera.cdk.data.DatasetRepository), Calendar (java.util.Calendar), PartitionKey (com.cloudera.cdk.data.PartitionKey), GenericRecord (org.apache.avro.generic.GenericRecord)
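
The run method above calls a getPartitionKey helper that this excerpt does not show. As a rough sketch only, and assuming the staging dataset is partitioned by year, month and day (the excerpt does not confirm the partition strategy), the values such a key would carry for yesterday's timestamp could be derived as follows. The class PartitionValuesSketch and its dateValues method are hypothetical names and use no CDK API.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;

// Hypothetical helper, not part of cdk-examples: derives the year, month and
// day values that a date-based partition key for a timestamp might carry.
class PartitionValuesSketch {
    static final long DAY_IN_MILLIS = 24L * 60 * 60 * 1000;

    // Returns {year, month (1-12), dayOfMonth} for the timestamp, in UTC.
    static int[] dateValues(long timestampMillis) {
        Calendar cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        cal.setTimeInMillis(timestampMillis);
        return new int[] {
            cal.get(Calendar.YEAR),
            cal.get(Calendar.MONTH) + 1, // Calendar months are zero-based
            cal.get(Calendar.DAY_OF_MONTH)
        };
    }

    public static void main(String[] args) {
        // Same arithmetic as yesterdayTimestamp in the example above
        long yesterday = System.currentTimeMillis() - DAY_IN_MILLIS;
        int[] v = dateValues(yesterday);
        System.out.println("year=" + v[0] + " month=" + v[1] + " day=" + v[2]);
    }
}
```

Subtracting a fixed number of milliseconds ignores daylight-saving shifts, which is why the sketch evaluates the date in UTC.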

Example 2 with PartitionKey

Use of com.cloudera.cdk.data.PartitionKey in project cdk-examples by Cloudera.

Class ReadUserDatasetGenericOnePartition, method run.

@Override
public int run(String[] args) throws Exception {
    // Construct a filesystem dataset repository rooted at /tmp/data on HDFS
    DatasetRepository repo = DatasetRepositories.open("repo:hdfs:/tmp/data");
    // Load the users dataset
    Dataset<GenericRecord> users = repo.load("users");
    // Get the partition strategy and use it to construct a partition key for
    // hash(username)=0
    PartitionStrategy partitionStrategy = users.getDescriptor().getPartitionStrategy();
    PartitionKey partitionKey = partitionStrategy.partitionKey(0);
    // Get the dataset partition for the partition key
    Dataset<GenericRecord> partition = users.getPartition(partitionKey, false);
    // Get a reader for the partition and read all the users
    DatasetReader<GenericRecord> reader = partition.newReader();
    try {
        reader.open();
        for (GenericRecord user : reader) {
            System.out.println(user);
        }
    } finally {
        reader.close();
    }
    return 0;
}
Also used: DatasetRepository (com.cloudera.cdk.data.DatasetRepository), PartitionKey (com.cloudera.cdk.data.PartitionKey), GenericRecord (org.apache.avro.generic.GenericRecord), PartitionStrategy (com.cloudera.cdk.data.PartitionStrategy)
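
The comment in the example refers to the partition hash(username)=0, and partitionKey(0) selects that bucket. To illustrate how a username might end up in a numbered bucket, here is a minimal sketch of hash bucketing in plain Java. The formula (masking the sign bit of hashCode() and taking the remainder) is an assumption about how such partitioning typically works, not a confirmed copy of the CDK implementation, and HashBucketSketch and bucketFor are hypothetical names.

```java
// Hypothetical sketch, not the CDK implementation: maps a value to one of
// `buckets` hash partitions.
class HashBucketSketch {
    static int bucketFor(Object value, int buckets) {
        if (buckets <= 0) {
            throw new IllegalArgumentException("buckets must be positive");
        }
        // Clear the sign bit so the remainder is never negative
        return (value.hashCode() & Integer.MAX_VALUE) % buckets;
    }

    public static void main(String[] args) {
        int buckets = 10; // assumed bucket count, for illustration only
        for (String username : new String[] {"alice", "bob", "carol"}) {
            System.out.println(username + " -> bucket " + bucketFor(username, buckets));
        }
    }
}
```

Under this scheme, reading the partition for bucket 0 returns only the users whose hashed username falls in that bucket, not the whole dataset.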

Example 3 with PartitionKey

Use of com.cloudera.cdk.data.PartitionKey in project cdk-examples by Cloudera.

Class CreateSessions, method getPartitionForURI.

private <E> Dataset<E> getPartitionForURI(Dataset<E> eventsDataset, String uri) {
    PartitionKey partitionKey = FileSystemDatasetRepository.partitionKeyForPath(eventsDataset, URI.create(uri));
    Dataset<E> partition = eventsDataset.getPartition(partitionKey, false);
    if (partition == null) {
        throw new IllegalArgumentException("Partition not found: " + uri);
    }
    return partition;
}
Also used: PartitionKey (com.cloudera.cdk.data.PartitionKey)
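
The method above turns a partition URI into a PartitionKey. As an illustration of the idea only, the sketch below extracts partition values from a URI, assuming partitions are stored as name=value directories beneath the dataset root (for example year=2013/month=10). That layout is an assumption, and PartitionPathSketch and partitionValues are hypothetical names that do not reproduce partitionKeyForPath.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: reads the values of name=value path segments that
// lie below a dataset root directory.
class PartitionPathSketch {
    static List<String> partitionValues(URI datasetRoot, URI partitionUri) {
        // relativize returns the given URI unchanged when it is not under the root
        URI relative = datasetRoot.relativize(partitionUri);
        if (relative.isAbsolute()) {
            throw new IllegalArgumentException("Not under dataset root: " + partitionUri);
        }
        List<String> values = new ArrayList<>();
        for (String segment : relative.getPath().split("/")) {
            if (segment.isEmpty()) {
                continue; // skip empty segments, e.g. from a trailing slash
            }
            int eq = segment.indexOf('=');
            if (eq < 0) {
                throw new IllegalArgumentException("Not a name=value segment: " + segment);
            }
            values.add(segment.substring(eq + 1));
        }
        return values;
    }

    public static void main(String[] args) {
        URI root = URI.create("file:/tmp/data/events");
        URI partition = URI.create("file:/tmp/data/events/year=2013/month=10");
        System.out.println(partitionValues(root, partition));
    }
}
```

A real key would also convert each string to the type of its partition field. The sketch keeps the values as strings.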

Aggregations

PartitionKey (com.cloudera.cdk.data.PartitionKey): 3
DatasetRepository (com.cloudera.cdk.data.DatasetRepository): 2
GenericRecord (org.apache.avro.generic.GenericRecord): 2
PartitionStrategy (com.cloudera.cdk.data.PartitionStrategy): 1
Calendar (java.util.Calendar): 1