Example 1 with PartitionStrategy

Use of com.cloudera.cdk.data.PartitionStrategy in project cdk-examples by cloudera.

From the class ReadUserDatasetGenericOnePartition, method run.

@Override
public int run(String[] args) throws Exception {
    // Construct a filesystem dataset repository rooted at /tmp/data
    DatasetRepository repo = DatasetRepositories.open("repo:hdfs:/tmp/data");
    // Load the users dataset
    Dataset<GenericRecord> users = repo.load("users");
    // Get the partition strategy and use it to construct a partition key for
    // hash(username)=0
    PartitionStrategy partitionStrategy = users.getDescriptor().getPartitionStrategy();
    PartitionKey partitionKey = partitionStrategy.partitionKey(0);
    // Get the dataset partition for the partition key
    Dataset<GenericRecord> partition = users.getPartition(partitionKey, false);
    // Get a reader for the partition and read all the users
    DatasetReader<GenericRecord> reader = partition.newReader();
    try {
        reader.open();
        for (GenericRecord user : reader) {
            System.out.println(user);
        }
    } finally {
        reader.close();
    }
    return 0;
}
Also used : DatasetRepository(com.cloudera.cdk.data.DatasetRepository) PartitionKey(com.cloudera.cdk.data.PartitionKey) GenericRecord(org.apache.avro.generic.GenericRecord) PartitionStrategy(com.cloudera.cdk.data.PartitionStrategy)

Example 2 with PartitionStrategy

Use of com.cloudera.cdk.data.PartitionStrategy in project cdk-examples by cloudera.

From the class CreateStagedDataset, method run.

@Override
public int run(String[] args) throws Exception {
    DatasetRepository repo = DatasetRepositories.open("repo:file:/tmp/data");
    // where the schema is stored
    URI schemaURI = URI.create("resource:simple-log.avsc");
    // create a Parquet dataset for long-term storage
    repo.create("logs", new DatasetDescriptor.Builder()
            .format(Formats.PARQUET)
            .schemaUri(schemaURI)
            .partitionStrategy(new PartitionStrategy.Builder()
                    .year("timestamp", "year").month("timestamp", "month").day("timestamp", "day").build())
            .build());
    // create an Avro dataset to temporarily hold data
    repo.create("logs-staging", new DatasetDescriptor.Builder()
            .format(Formats.AVRO)
            .schemaUri(schemaURI)
            .partitionStrategy(new PartitionStrategy.Builder().day("timestamp", "day").build())
            .build());
    return 0;
}
Also used : DatasetDescriptor(com.cloudera.cdk.data.DatasetDescriptor) DatasetRepository(com.cloudera.cdk.data.DatasetRepository) URI(java.net.URI) PartitionStrategy(com.cloudera.cdk.data.PartitionStrategy)
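
The year/month/day strategy in this example derives partition values from the record's timestamp field. As a rough illustration of that idea only, the following standalone sketch maps epoch milliseconds to a year/month/day path. The class DatePartitionSketch, the method partitionPath, the use of UTC, and the name=value path layout are all assumptions made for this sketch; none of them is taken from the CDK API, and the library may compute or lay out partitions differently.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

// Hypothetical helper for illustration; not part of com.cloudera.cdk.data.
class DatePartitionSketch {

    // Derives year, month, and day values from epoch milliseconds.
    // Assumption: the timestamp is interpreted in UTC.
    static String partitionPath(long timestampMillis) {
        ZonedDateTime t = Instant.ofEpochMilli(timestampMillis).atZone(ZoneOffset.UTC);
        // Assumption: one directory level per partition field, written as name=value.
        return "year=" + t.getYear()
                + "/month=" + t.getMonthValue()
                + "/day=" + t.getDayOfMonth();
    }

    public static void main(String[] args) {
        // Show where a record written right now would land under these assumptions.
        System.out.println(partitionPath(System.currentTimeMillis()));
    }
}
```

Under these assumptions, the "logs" dataset would spread records over three nested levels, while "logs-staging", which partitions by day only, would use a single level.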

Example 3 with PartitionStrategy

Use of com.cloudera.cdk.data.PartitionStrategy in project cdk-examples by cloudera.

From the class StagingToPersistentSerial, method getPartitionKey.

@SuppressWarnings("deprecation")
private static PartitionKey getPartitionKey(Dataset data, long timestamp) {
    // need to build a fake record to get a partition key
    final GenericRecordBuilder builder = new GenericRecordBuilder(data.getDescriptor().getSchema());
    builder.set("timestamp", timestamp);
    builder.set("level", "INFO");
    builder.set("component", "StagingToPersistentSerial");
    builder.set("message", "Fake log message");
    // access the partition strategy, which produces keys from records
    final PartitionStrategy partitioner = data.getDescriptor().getPartitionStrategy();
    return partitioner.partitionKeyForEntity(builder.build());
}
Also used : GenericRecordBuilder(org.apache.avro.generic.GenericRecordBuilder) PartitionStrategy(com.cloudera.cdk.data.PartitionStrategy)

Example 4 with PartitionStrategy

Use of com.cloudera.cdk.data.PartitionStrategy in project cdk-examples by cloudera.

From the class CreateUserDatasetGenericPartitioned, method run.

@Override
public int run(String[] args) throws Exception {
    // Construct a filesystem dataset repository rooted at /tmp/data
    DatasetRepository repo = DatasetRepositories.open("repo:hdfs:/tmp/data");
    // Create a partition strategy that hash partitions on username with 10 buckets
    PartitionStrategy partitionStrategy = new PartitionStrategy.Builder()
            .hash("username", 10)
            .build();
    // Create a dataset of users with the Avro schema in the repository
    DatasetDescriptor descriptor = new DatasetDescriptor.Builder()
            .schemaUri("resource:user.avsc")
            .partitionStrategy(partitionStrategy)
            .build();
    Dataset<GenericRecord> users = repo.create("users", descriptor);
    // Get a writer for the dataset and write some users to it
    DatasetWriter<GenericRecord> writer = users.newWriter();
    try {
        writer.open();
        String[] colors = { "green", "blue", "pink", "brown", "yellow" };
        Random rand = new Random();
        GenericRecordBuilder builder = new GenericRecordBuilder(descriptor.getSchema());
        for (int i = 0; i < 100; i++) {
            GenericRecord record = builder
                    .set("username", "user-" + i)
                    .set("creationDate", System.currentTimeMillis())
                    .set("favoriteColor", colors[rand.nextInt(colors.length)])
                    .build();
            writer.write(record);
        }
    } finally {
        writer.close();
    }
    return 0;
}
Also used : DatasetDescriptor(com.cloudera.cdk.data.DatasetDescriptor) DatasetRepository(com.cloudera.cdk.data.DatasetRepository) GenericRecordBuilder(org.apache.avro.generic.GenericRecordBuilder) Random(java.util.Random) GenericRecord(org.apache.avro.generic.GenericRecord) PartitionStrategy(com.cloudera.cdk.data.PartitionStrategy)
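
Hash partitioning on username with 10 buckets means every record is assigned to one of 10 partitions based on a hash of its username. The standalone sketch below shows one common way to do that, taking the value's hashCode modulo the bucket count. The class HashBucketSketch and the method bucketFor are hypothetical names, and the hash function itself is an assumption: the CDK may use a different one, so the bucket numbers here need not match what the library produces.

```java
// Hypothetical helper for illustration; not part of com.cloudera.cdk.data.
class HashBucketSketch {

    // Maps a value to a bucket index in the range [0, buckets).
    // Assumption: Object.hashCode() stands in for whatever hash the library uses.
    static int bucketFor(Object value, int buckets) {
        // Clear the sign bit first so the remainder is never negative.
        return (value.hashCode() & Integer.MAX_VALUE) % buckets;
    }

    public static void main(String[] args) {
        // Print the bucket each of a few sample usernames would fall into.
        for (int i = 0; i < 5; i++) {
            String username = "user-" + i;
            System.out.println(username + " -> bucket " + bucketFor(username, 10));
        }
    }
}
```

The same value always maps to the same bucket, which is what lets a reader open a single partition (such as the one for hash(username)=0) and find every record whose username hashes there.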

Aggregations

PartitionStrategy (com.cloudera.cdk.data.PartitionStrategy): 4
DatasetRepository (com.cloudera.cdk.data.DatasetRepository): 3
DatasetDescriptor (com.cloudera.cdk.data.DatasetDescriptor): 2
GenericRecord (org.apache.avro.generic.GenericRecord): 2
GenericRecordBuilder (org.apache.avro.generic.GenericRecordBuilder): 2
PartitionKey (com.cloudera.cdk.data.PartitionKey): 1
URI (java.net.URI): 1
Random (java.util.Random): 1