Example 1 with DatasetDescriptor

Use of com.cloudera.cdk.data.DatasetDescriptor in project cdk-examples by cloudera.

From the class CreateHCatalogUserDatasetGeneric, method run.

@Override
public int run(String[] args) throws Exception {
    // Construct an HCatalog dataset repository using managed Hive tables
    DatasetRepository repo = DatasetRepositories.open("repo:hive");
    // Create a dataset of users with the Avro schema in the repository
    DatasetDescriptor descriptor = new DatasetDescriptor.Builder().schemaUri("resource:user.avsc").build();
    Dataset<GenericRecord> users = repo.create("users", descriptor);
    // Get a writer for the dataset and write some users to it
    DatasetWriter<GenericRecord> writer = users.newWriter();
    try {
        writer.open();
        String[] colors = { "green", "blue", "pink", "brown", "yellow" };
        Random rand = new Random();
        GenericRecordBuilder builder = new GenericRecordBuilder(descriptor.getSchema());
        for (int i = 0; i < 100; i++) {
            GenericRecord record = builder
                .set("username", "user-" + i)
                .set("creationDate", System.currentTimeMillis())
                .set("favoriteColor", colors[rand.nextInt(colors.length)])
                .build();
            writer.write(record);
        }
    } finally {
        writer.close();
    }
    return 0;
}
Also used: DatasetDescriptor (com.cloudera.cdk.data.DatasetDescriptor), Random (java.util.Random), DatasetRepository (com.cloudera.cdk.data.DatasetRepository), GenericRecordBuilder (org.apache.avro.generic.GenericRecordBuilder), GenericRecord (org.apache.avro.generic.GenericRecord)

Example 2 with DatasetDescriptor

Use of com.cloudera.cdk.data.DatasetDescriptor in project cdk-examples by cloudera.

From the class CreateProductDatasetPojo, method run.

@Override
public int run(String[] args) throws Exception {
    // Construct a filesystem dataset repository rooted at /tmp/data
    DatasetRepository repo = DatasetRepositories.open("repo:hdfs:/tmp/data");
    // Create a dataset of products with the Avro schema in the repository
    DatasetDescriptor descriptor = new DatasetDescriptor.Builder().schema(Product.class).build();
    Dataset<Product> products = repo.create("products", descriptor);
    // Get a writer for the dataset and write some products to it
    DatasetWriter<Product> writer = products.newWriter();
    try {
        writer.open();
        String[] names = { "toaster", "teapot", "butter dish" };
        int i = 0;
        for (String name : names) {
            Product product = new Product();
            product.setName(name);
            product.setId(i++);
            writer.write(product);
        }
    } finally {
        writer.close();
    }
    return 0;
}
Also used: DatasetDescriptor (com.cloudera.cdk.data.DatasetDescriptor), DatasetRepository (com.cloudera.cdk.data.DatasetRepository)
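
The Product class passed to schema(Product.class) is not shown in this snippet. A minimal sketch of what such a bean might look like follows, assuming only what the example calls: a no-argument constructor, setName(String) and setId(...). The field types, the getters and toString are assumptions for illustration, not the project's actual source.

```java
// Hypothetical sketch of the Product bean used in Example 2.
// Only new Product(), setName and setId are confirmed by the example;
// the field types, getters and toString are assumptions.
class Product {
    private String name;
    private long id;

    // No-argument constructor, as used by `new Product()` in the example.
    public Product() {
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public long getId() {
        return id;
    }

    // Accepts the int counter from the example via widening to long.
    public void setId(long id) {
        this.id = id;
    }

    @Override
    public String toString() {
        return "Product{name='" + name + "', id=" + id + "}";
    }
}
```

With a class like this, the descriptor builder can presumably derive a record schema with one field per bean property, which is why the example needs no separate .avsc file.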

Example 3 with DatasetDescriptor

Use of com.cloudera.cdk.data.DatasetDescriptor in project cdk-examples by cloudera.

From the class Hello, method run.

@Override
public int run(String[] args) throws Exception {
    // Construct a local filesystem dataset repository rooted at /tmp/hello-cdk
    DatasetRepository repo = DatasetRepositories.open("repo:file:/tmp/hello-cdk");
    // Create a dataset of Hellos
    DatasetDescriptor descriptor = new DatasetDescriptor.Builder().schema(Hello.class).build();
    Dataset<Hello> hellos = repo.create("hellos", descriptor);
    // Write some Hellos into the dataset
    DatasetWriter<Hello> writer = hellos.newWriter();
    try {
        writer.open();
        Hello cdk = new Hello("CDK");
        writer.write(cdk);
    } finally {
        writer.close();
    }
    // Read the Hellos from the dataset
    DatasetReader<Hello> reader = hellos.newReader();
    try {
        reader.open();
        for (Hello hello : reader) {
            hello.sayHello();
        }
    } finally {
        reader.close();
    }
    // Delete the dataset now that we are done with it
    repo.delete("hellos");
    return 0;
}
Also used: DatasetDescriptor (com.cloudera.cdk.data.DatasetDescriptor), DatasetRepository (com.cloudera.cdk.data.DatasetRepository)
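
The data-carrying part of the Hello class is likewise not shown. The example only confirms a Hello(String) constructor and a sayHello() method. The sketch below fills in the rest under stated assumptions: the name field, the no-argument constructor (which reflection-based deserialization typically needs) and the greeting() helper are hypothetical, as is the exact message text.

```java
// Hypothetical sketch of the record part of the Hello class from Example 3.
// Confirmed by the example: new Hello("CDK") and hello.sayHello().
// Assumed: the name field, the no-arg constructor, greeting(), and the
// wording of the printed message.
class Hello {
    private String name;

    // Assumed: lets a reader re-create instances before filling in fields.
    public Hello() {
    }

    public Hello(String name) {
        this.name = name;
    }

    public String getName() {
        return name;
    }

    // Hypothetical helper so the message can be inspected without printing.
    public String greeting() {
        return "Hello, " + name + "!";
    }

    public void sayHello() {
        System.out.println(greeting());
    }
}
```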

Example 4 with DatasetDescriptor

Use of com.cloudera.cdk.data.DatasetDescriptor in project cdk-examples by cloudera.

From the class CreateUserDatasetGeneric, method run.

@Override
public int run(String[] args) throws Exception {
    // Construct a filesystem dataset repository rooted at /tmp/data
    DatasetRepository repo = DatasetRepositories.open("repo:hdfs:/tmp/data");
    // Create a dataset of users with the Avro schema in the repository
    DatasetDescriptor descriptor = new DatasetDescriptor.Builder().schemaUri("resource:user.avsc").build();
    Dataset<GenericRecord> users = repo.create("users", descriptor);
    // Get a writer for the dataset and write some users to it
    DatasetWriter<GenericRecord> writer = users.newWriter();
    try {
        writer.open();
        String[] colors = { "green", "blue", "pink", "brown", "yellow" };
        Random rand = new Random();
        GenericRecordBuilder builder = new GenericRecordBuilder(descriptor.getSchema());
        for (int i = 0; i < 100; i++) {
            GenericRecord record = builder
                .set("username", "user-" + i)
                .set("creationDate", System.currentTimeMillis())
                .set("favoriteColor", colors[rand.nextInt(colors.length)])
                .build();
            writer.write(record);
        }
    } finally {
        writer.close();
    }
    return 0;
}
Also used: DatasetDescriptor (com.cloudera.cdk.data.DatasetDescriptor), Random (java.util.Random), DatasetRepository (com.cloudera.cdk.data.DatasetRepository), GenericRecordBuilder (org.apache.avro.generic.GenericRecordBuilder), GenericRecord (org.apache.avro.generic.GenericRecord)

Example 5 with DatasetDescriptor

Use of com.cloudera.cdk.data.DatasetDescriptor in project cdk-examples by cloudera.

From the class CreateUserDatasetGenericParquet, method run.

@Override
public int run(String[] args) throws Exception {
    // Construct a filesystem dataset repository rooted at /tmp/data
    DatasetRepository repo = DatasetRepositories.open("repo:hdfs:/tmp/data");
    // Create a dataset of users in the repository, using the Avro schema
    // and the Parquet storage format
    DatasetDescriptor descriptor = new DatasetDescriptor.Builder()
        .schemaUri("resource:user.avsc")
        .format(Formats.PARQUET)
        .build();
    Dataset<GenericRecord> users = repo.create("users", descriptor);
    // Get a writer for the dataset and write some users to it
    DatasetWriter<GenericRecord> writer = users.newWriter();
    try {
        writer.open();
        String[] colors = { "green", "blue", "pink", "brown", "yellow" };
        Random rand = new Random();
        GenericRecordBuilder builder = new GenericRecordBuilder(descriptor.getSchema());
        for (int i = 0; i < 100; i++) {
            GenericRecord record = builder
                .set("username", "user-" + i)
                .set("creationDate", System.currentTimeMillis())
                .set("favoriteColor", colors[rand.nextInt(colors.length)])
                .build();
            writer.write(record);
        }
    } finally {
        writer.close();
    }
    return 0;
}
Also used: DatasetDescriptor (com.cloudera.cdk.data.DatasetDescriptor), Random (java.util.Random), DatasetRepository (com.cloudera.cdk.data.DatasetRepository), GenericRecordBuilder (org.apache.avro.generic.GenericRecordBuilder), GenericRecord (org.apache.avro.generic.GenericRecord)

Aggregations

DatasetDescriptor (com.cloudera.cdk.data.DatasetDescriptor): 7
DatasetRepository (com.cloudera.cdk.data.DatasetRepository): 7
Random (java.util.Random): 4
GenericRecord (org.apache.avro.generic.GenericRecord): 4
GenericRecordBuilder (org.apache.avro.generic.GenericRecordBuilder): 4
PartitionStrategy (com.cloudera.cdk.data.PartitionStrategy): 1