Example 6 with DatasetDescriptor

Use of com.cloudera.cdk.data.DatasetDescriptor in project cdk-examples by cloudera.

From class CreateUserDatasetGenericPartitioned, method run.

@Override
public int run(String[] args) throws Exception {
    // Construct a dataset repository on HDFS rooted at /tmp/data
    DatasetRepository repo = DatasetRepositories.open("repo:hdfs:/tmp/data");
    // Create a partition strategy that hash partitions on username with 10 buckets
    PartitionStrategy partitionStrategy = new PartitionStrategy.Builder()
        .hash("username", 10)
        .build();
    // Create a dataset of users with the Avro schema in the repository
    DatasetDescriptor descriptor = new DatasetDescriptor.Builder()
        .schemaUri("resource:user.avsc")
        .partitionStrategy(partitionStrategy)
        .build();
    Dataset<GenericRecord> users = repo.create("users", descriptor);
    // Get a writer for the dataset and write some users to it
    DatasetWriter<GenericRecord> writer = users.newWriter();
    try {
        writer.open();
        String[] colors = { "green", "blue", "pink", "brown", "yellow" };
        Random rand = new Random();
        GenericRecordBuilder builder = new GenericRecordBuilder(descriptor.getSchema());
        for (int i = 0; i < 100; i++) {
            // Build one record per user with a randomly chosen favorite color
            GenericRecord record = builder
                .set("username", "user-" + i)
                .set("creationDate", System.currentTimeMillis())
                .set("favoriteColor", colors[rand.nextInt(colors.length)])
                .build();
            writer.write(record);
        }
    } finally {
        writer.close();
    }
    return 0;
}
Also used : DatasetDescriptor(com.cloudera.cdk.data.DatasetDescriptor) DatasetRepository(com.cloudera.cdk.data.DatasetRepository) GenericRecordBuilder(org.apache.avro.generic.GenericRecordBuilder) Random(java.util.Random) GenericRecord(org.apache.avro.generic.GenericRecord) PartitionStrategy(com.cloudera.cdk.data.PartitionStrategy)
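
The example above hash partitions records on username into 10 buckets. The sketch below illustrates the general idea of hash bucketing using only the JDK. It assumes a simple scheme (mask the sign bit of hashCode(), then take the remainder modulo the bucket count); the class HashBucketSketch and its methods bucketFor and distribution are hypothetical names for illustration, not part of the CDK API, and the partitioner CDK actually uses may compute buckets differently.

```java
public class HashBucketSketch {

    // Hypothetical helper (not a CDK method): map a key to a bucket index in
    // the range [0, buckets). Masking with Integer.MAX_VALUE clears the sign
    // bit so that a negative hashCode() cannot yield a negative index.
    static int bucketFor(String key, int buckets) {
        if (buckets <= 0) {
            throw new IllegalArgumentException("buckets must be positive");
        }
        return (key.hashCode() & Integer.MAX_VALUE) % buckets;
    }

    // Count how many of the usernames "user-0" .. "user-(users-1)" land in
    // each bucket, mirroring the usernames written in the example above.
    static int[] distribution(int users, int buckets) {
        int[] counts = new int[buckets];
        for (int i = 0; i < users; i++) {
            counts[bucketFor("user-" + i, buckets)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] counts = distribution(100, 10);
        for (int b = 0; b < counts.length; b++) {
            System.out.println("bucket " + b + ": " + counts[b] + " users");
        }
    }
}
```

Because the bucket depends only on the key, every record with the same username goes to the same partition, which is what lets a reader skip partitions that cannot contain a given user.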

Example 7 with DatasetDescriptor

Use of com.cloudera.cdk.data.DatasetDescriptor in project cdk-examples by cloudera.

From class CreateDataset, method run.

@Override
public int run(String[] args) throws Exception {
    // Construct a dataset repository on HDFS rooted at /tmp/data
    DatasetRepository repo = DatasetRepositories.open("repo:hdfs:/tmp/data");
    // Create a dataset of events with the Avro schema in the repository
    DatasetDescriptor descriptor = new DatasetDescriptor.Builder().schemaUri("resource:event.avsc").build();
    repo.create("events", descriptor);
    return 0;
}
Also used : DatasetDescriptor(com.cloudera.cdk.data.DatasetDescriptor) DatasetRepository(com.cloudera.cdk.data.DatasetRepository)

Aggregations

DatasetDescriptor (com.cloudera.cdk.data.DatasetDescriptor): 7
DatasetRepository (com.cloudera.cdk.data.DatasetRepository): 7
Random (java.util.Random): 4
GenericRecord (org.apache.avro.generic.GenericRecord): 4
GenericRecordBuilder (org.apache.avro.generic.GenericRecordBuilder): 4
PartitionStrategy (com.cloudera.cdk.data.PartitionStrategy): 1