Example 1 with KeyValueTable

Use of io.cdap.cdap.api.dataset.lib.KeyValueTable in project cdap by caskdata.

From class ObjectStoreDefinition, method getDataset:

@Override
public ObjectStoreDataset<?> getDataset(DatasetContext datasetContext, DatasetSpecification spec,
                                        Map<String, String> arguments, ClassLoader classLoader) throws IOException {
    // look up the spec of the embedded KeyValueTable that backs the object store
    DatasetSpecification kvTableSpec = spec.getSpecification("objects");
    KeyValueTable table = tableDef.getDataset(datasetContext, kvTableSpec, arguments, classLoader);
    // reconstruct the stored object's type and schema from the dataset properties
    TypeRepresentation typeRep = GSON.fromJson(spec.getProperty("type"), TypeRepresentation.class);
    Schema schema = GSON.fromJson(spec.getProperty("schema"), Schema.class);
    return new ObjectStoreDataset<>(spec.getName(), table, typeRep, schema, classLoader);
}
Also used: KeyValueTable(io.cdap.cdap.api.dataset.lib.KeyValueTable) TypeRepresentation(io.cdap.cdap.internal.io.TypeRepresentation) Schema(io.cdap.cdap.api.data.schema.Schema) DatasetSpecification(io.cdap.cdap.api.dataset.DatasetSpecification)
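
For context, getDataset() depends on an embedded dataset named "objects" having been declared when the dataset instance was configured. Below is a minimal sketch of what the matching configure(...) side could look like, assuming the standard DatasetSpecification.Builder API; the actual body in cdap may differ:

public DatasetSpecification configure(String instanceName, DatasetProperties properties) {
    return DatasetSpecification.builder(instanceName, getName())
        .properties(properties.getProperties())
        // declare the embedded KeyValueTable under the name that getDataset() later looks up
        .datasets(tableDef.configure("objects", properties))
        .build();
}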

Example 2 with KeyValueTable

Use of io.cdap.cdap.api.dataset.lib.KeyValueTable in project cdap by caskdata.

From class CoreDatasetsModule, method register:

@Override
public void register(DatasetDefinitionRegistry registry) {
    // all specialized datasets build on the basic "table" definition
    DatasetDefinition<Table, DatasetAdmin> tableDef = registry.get("table");
    // each definition is registered under both its short type name and its fully qualified class name
    DatasetDefinition<KeyValueTable, DatasetAdmin> kvTableDef = new KeyValueTableDefinition(KeyValueTable.TYPE, tableDef);
    registry.add(kvTableDef);
    registry.add(new KeyValueTableDefinition(KeyValueTable.class.getName(), tableDef));
    DatasetDefinition<ObjectStore, DatasetAdmin> objectStoreDef = new ObjectStoreDefinition(ObjectStore.TYPE, kvTableDef);
    registry.add(objectStoreDef);
    registry.add(new ObjectStoreDefinition(ObjectStore.class.getName(), kvTableDef));
    registry.add(new IndexedObjectStoreDefinition(IndexedObjectStore.TYPE, tableDef, objectStoreDef));
    registry.add(new IndexedObjectStoreDefinition(IndexedObjectStore.class.getName(), tableDef, objectStoreDef));
    registry.add(new IndexedTableDefinition(IndexedTable.TYPE, tableDef));
    registry.add(new IndexedTableDefinition(IndexedTable.class.getName(), tableDef));
    registry.add(new TimeseriesTableDefinition(TimeseriesTable.TYPE, tableDef));
    registry.add(new TimeseriesTableDefinition(TimeseriesTable.class.getName(), tableDef));
    registry.add(new CounterTimeseriesTableDefinition(CounterTimeseriesTable.TYPE, tableDef));
    registry.add(new CounterTimeseriesTableDefinition(CounterTimeseriesTable.class.getName(), tableDef));
    // in-memory table
    registry.add(new InMemoryTableDefinition(InMemoryTable.TYPE));
}
Also used: ObjectStore(io.cdap.cdap.api.dataset.lib.ObjectStore) IndexedObjectStore(io.cdap.cdap.api.dataset.lib.IndexedObjectStore) Table(io.cdap.cdap.api.dataset.table.Table) InMemoryTable(io.cdap.cdap.data2.dataset2.lib.table.inmemory.InMemoryTable) CounterTimeseriesTable(io.cdap.cdap.api.dataset.lib.CounterTimeseriesTable) TimeseriesTable(io.cdap.cdap.api.dataset.lib.TimeseriesTable) IndexedTable(io.cdap.cdap.api.dataset.lib.IndexedTable) KeyValueTable(io.cdap.cdap.api.dataset.lib.KeyValueTable) DatasetAdmin(io.cdap.cdap.api.dataset.DatasetAdmin) InMemoryTableDefinition(io.cdap.cdap.data2.dataset2.lib.table.inmemory.InMemoryTableDefinition) CounterTimeseriesTableDefinition(io.cdap.cdap.api.dataset.lib.CounterTimeseriesTableDefinition) IndexedTableDefinition(io.cdap.cdap.api.dataset.lib.IndexedTableDefinition) IndexedObjectStoreDefinition(io.cdap.cdap.api.dataset.lib.IndexedObjectStoreDefinition) KeyValueTableDefinition(io.cdap.cdap.api.dataset.lib.KeyValueTableDefinition) TimeseriesTableDefinition(io.cdap.cdap.api.dataset.lib.TimeseriesTableDefinition)
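
Once register() has run, consumers can resolve a definition by either of its registered names, mirroring the registry.get("table") lookup at the top of the method. A short sketch:

// resolve the same KeyValueTable definition by short type name or by class name
DatasetDefinition<KeyValueTable, DatasetAdmin> byType = registry.get(KeyValueTable.TYPE);
DatasetDefinition<KeyValueTable, DatasetAdmin> byClassName = registry.get(KeyValueTable.class.getName());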

Example 3 with KeyValueTable

Use of io.cdap.cdap.api.dataset.lib.KeyValueTable in project cdap by caskdata.

From class DataPipelineTest, method testSinglePhaseWithSparkSink:

private void testSinglePhaseWithSparkSink() throws Exception {
    /*
     * source1 ---|
     *            |--> sparksink
     * source2 ---|
     */
    ETLBatchConfig etlConfig = ETLBatchConfig.builder()
        .addStage(new ETLStage("source1", MockSource.getPlugin("messages1", SpamMessage.SCHEMA)))
        .addStage(new ETLStage("source2", MockSource.getPlugin("messages2", SpamMessage.SCHEMA)))
        .addStage(new ETLStage("customsink",
            new ETLPlugin(NaiveBayesTrainer.PLUGIN_NAME, SparkSink.PLUGIN_TYPE,
                ImmutableMap.of(
                    "fileSetName", "modelFileSet",
                    "path", "output",
                    "fieldToClassify", SpamMessage.TEXT_FIELD,
                    "predictionField", SpamMessage.SPAM_PREDICTION_FIELD),
                null)))
        .addConnection("source1", "customsink")
        .addConnection("source2", "customsink")
        .build();
    AppRequest<ETLBatchConfig> appRequest = new AppRequest<>(APP_ARTIFACT, etlConfig);
    ApplicationId appId = NamespaceId.DEFAULT.app("SparkSinkApp");
    ApplicationManager appManager = deployApplication(appId, appRequest);
    // set up five spam messages and five non-spam messages to train the classifier
    List<StructuredRecord> messagesToWrite = new ArrayList<>();
    messagesToWrite.add(new SpamMessage("buy our clothes", 1.0).toStructuredRecord());
    messagesToWrite.add(new SpamMessage("sell your used books to us", 1.0).toStructuredRecord());
    messagesToWrite.add(new SpamMessage("earn money for free", 1.0).toStructuredRecord());
    messagesToWrite.add(new SpamMessage("this is definitely not spam", 1.0).toStructuredRecord());
    messagesToWrite.add(new SpamMessage("you won the lottery", 1.0).toStructuredRecord());
    // write records to source1
    DataSetManager<Table> inputManager = getDataset(NamespaceId.DEFAULT.dataset("messages1"));
    MockSource.writeInput(inputManager, messagesToWrite);
    messagesToWrite.clear();
    messagesToWrite.add(new SpamMessage("how was your day", 0.0).toStructuredRecord());
    messagesToWrite.add(new SpamMessage("what are you up to", 0.0).toStructuredRecord());
    messagesToWrite.add(new SpamMessage("this is a genuine message", 0.0).toStructuredRecord());
    messagesToWrite.add(new SpamMessage("this is an even more genuine message", 0.0).toStructuredRecord());
    messagesToWrite.add(new SpamMessage("could you send me the report", 0.0).toStructuredRecord());
    // write records to source2
    inputManager = getDataset(NamespaceId.DEFAULT.dataset("messages2"));
    MockSource.writeInput(inputManager, messagesToWrite);
    // ingest some messages to be classified
    DataSetManager<FileSet> fileSetManager = getDataset(NaiveBayesTrainer.TEXTS_TO_CLASSIFY);
    FileSet fileSet = fileSetManager.get();
    try (PrintStream out = new PrintStream(fileSet.getLocation("inputTexts").getOutputStream(), true, "UTF-8")) {
        out.println("how are you doing today");
        out.println("free money money");
        out.println("what are you doing today");
        out.println("genuine report");
    }
    // manually trigger the pipeline
    Map<String, String> runtimeArgs = new HashMap<>();
    FileSetArguments.setInputPath(runtimeArgs, "inputTexts");
    WorkflowManager workflowManager = appManager.getWorkflowManager(SmartWorkflow.NAME);
    workflowManager.start(runtimeArgs);
    workflowManager.waitForRun(ProgramRunStatus.COMPLETED, 5, TimeUnit.MINUTES);
    DataSetManager<KeyValueTable> classifiedTexts = getDataset(NaiveBayesTrainer.CLASSIFIED_TEXTS);
    Assert.assertEquals(0.0d, Bytes.toDouble(classifiedTexts.get().read("how are you doing today")), 0.01d);
    // only 'free money money' should be predicted as spam
    Assert.assertEquals(1.0d, Bytes.toDouble(classifiedTexts.get().read("free money money")), 0.01d);
    Assert.assertEquals(0.0d, Bytes.toDouble(classifiedTexts.get().read("what are you doing today")), 0.01d);
    Assert.assertEquals(0.0d, Bytes.toDouble(classifiedTexts.get().read("genuine report")), 0.01d);
    validateMetric(5, appId, "source1.records.out");
    validateMetric(5, appId, "source2.records.out");
    validateMetric(10, appId, "customsink.records.in");
}
Also used: PrintStream(java.io.PrintStream) ApplicationManager(io.cdap.cdap.test.ApplicationManager) Table(io.cdap.cdap.api.dataset.table.Table) KeyValueTable(io.cdap.cdap.api.dataset.lib.KeyValueTable) FileSet(io.cdap.cdap.api.dataset.lib.FileSet) HashMap(java.util.HashMap) SpamMessage(io.cdap.cdap.datapipeline.mock.SpamMessage) WorkflowManager(io.cdap.cdap.test.WorkflowManager) ArrayList(java.util.ArrayList) ETLPlugin(io.cdap.cdap.etl.proto.v2.ETLPlugin) StructuredRecord(io.cdap.cdap.api.data.format.StructuredRecord) AppRequest(io.cdap.cdap.proto.artifact.AppRequest) ETLBatchConfig(io.cdap.cdap.etl.proto.v2.ETLBatchConfig) ETLStage(io.cdap.cdap.etl.proto.v2.ETLStage) ApplicationId(io.cdap.cdap.proto.id.ApplicationId)
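
Beyond the point reads above, the full set of classified results could be inspected with a range scan. A minimal sketch, assuming KeyValueTable.scan(startRow, stopRow) accepts null bounds to iterate over all entries (CloseableIterator and KeyValue are both in io.cdap.cdap.api.dataset.lib):

// print every classified text together with its predicted label
try (CloseableIterator<KeyValue<byte[], byte[]>> entries = classifiedTexts.get().scan(null, null)) {
    while (entries.hasNext()) {
        KeyValue<byte[], byte[]> entry = entries.next();
        System.out.println(Bytes.toString(entry.getKey()) + " -> " + Bytes.toDouble(entry.getValue()));
    }
}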

Example 4 with KeyValueTable

Use of io.cdap.cdap.api.dataset.lib.KeyValueTable in project cdap by caskdata.

From class DataPipelineTest, method testKVTableLookup:

@Test
public void testKVTableLookup() throws Exception {
    addDatasetInstance(KeyValueTable.class.getName(), "ageTable");
    DataSetManager<KeyValueTable> lookupTable = getDataset("ageTable");
    lookupTable.get().write("samuel".getBytes(Charsets.UTF_8), "12".getBytes(Charsets.UTF_8));
    lookupTable.get().write("bob".getBytes(Charsets.UTF_8), "36".getBytes(Charsets.UTF_8));
    lookupTable.get().write("jane".getBytes(Charsets.UTF_8), "25".getBytes(Charsets.UTF_8));
    lookupTable.flush();
    ETLBatchConfig etlConfig = ETLBatchConfig.builder()
        .addStage(new ETLStage("source", MockSource.getPlugin("inputTable")))
        .addStage(new ETLStage("transform", LookupTransform.getPlugin("person", "age", "ageTable")))
        .addStage(new ETLStage("sink", MockSink.getPlugin("outputTable")))
        .addConnection("source", "transform")
        .addConnection("transform", "sink")
        .build();
    ApplicationId appId = NamespaceId.DEFAULT.app("testKVTableLookup");
    AppRequest<ETLBatchConfig> appRequest = new AppRequest<>(APP_ARTIFACT, etlConfig);
    ApplicationManager appManager = deployApplication(appId, appRequest);
    // set up input data
    Schema inputSchema = Schema.recordOf("person", Schema.Field.of("person", Schema.of(Schema.Type.STRING)));
    StructuredRecord recordSamuel = StructuredRecord.builder(inputSchema).set("person", "samuel").build();
    StructuredRecord recordBob = StructuredRecord.builder(inputSchema).set("person", "bob").build();
    StructuredRecord recordJane = StructuredRecord.builder(inputSchema).set("person", "jane").build();
    DataSetManager<Table> inputTable = getDataset("inputTable");
    MockSource.writeInput(inputTable, ImmutableList.of(recordSamuel, recordBob, recordJane));
    WorkflowManager workflowManager = appManager.getWorkflowManager(SmartWorkflow.NAME).start();
    workflowManager.waitForRun(ProgramRunStatus.COMPLETED, 5, TimeUnit.MINUTES);
    Schema schema = Schema.recordOf("person", Schema.Field.of("person", Schema.of(Schema.Type.STRING)), Schema.Field.of("age", Schema.of(Schema.Type.STRING)));
    Set<StructuredRecord> expected = new HashSet<>();
    expected.add(StructuredRecord.builder(schema).set("person", "samuel").set("age", "12").build());
    expected.add(StructuredRecord.builder(schema).set("person", "bob").set("age", "36").build());
    expected.add(StructuredRecord.builder(schema).set("person", "jane").set("age", "25").build());
    DataSetManager<Table> outputTable = getDataset("outputTable");
    Set<StructuredRecord> actual = new HashSet<>(MockSink.readOutput(outputTable));
    Assert.assertEquals(expected, actual);
    validateMetric(3, appId, "source.records.out");
    validateMetric(3, appId, "sink.records.in");
    deleteDatasetInstance(NamespaceId.DEFAULT.dataset("inputTable"));
    deleteDatasetInstance(NamespaceId.DEFAULT.dataset("outputTable"));
}
Also used: ApplicationManager(io.cdap.cdap.test.ApplicationManager) Table(io.cdap.cdap.api.dataset.table.Table) KeyValueTable(io.cdap.cdap.api.dataset.lib.KeyValueTable) Schema(io.cdap.cdap.api.data.schema.Schema) WorkflowManager(io.cdap.cdap.test.WorkflowManager) StructuredRecord(io.cdap.cdap.api.data.format.StructuredRecord) AppRequest(io.cdap.cdap.proto.artifact.AppRequest) ETLBatchConfig(io.cdap.cdap.etl.proto.v2.ETLBatchConfig) ETLStage(io.cdap.cdap.etl.proto.v2.ETLStage) ApplicationId(io.cdap.cdap.proto.id.ApplicationId) HashSet(java.util.HashSet) Test(org.junit.Test)
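
The byte[]-based writes at the start of this test can also be written with KeyValueTable's String convenience overloads (the same overload Example 5 below uses), which encode keys and values as UTF-8. A brief sketch:

// equivalent setup using the String-based write overload
lookupTable.get().write("samuel", "12");
lookupTable.get().write("bob", "36");
lookupTable.get().write("jane", "25");
lookupTable.flush();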

Example 5 with KeyValueTable

Use of io.cdap.cdap.api.dataset.lib.KeyValueTable in project cdap by caskdata.

From class MapReduceProgramRunnerTest, method testFailureInOutputCommitter:

@Test
public void testFailureInOutputCommitter() throws Exception {
    final ApplicationWithPrograms app = deployApp(AppWithMapReduce.class);
    // We want to verify that when a MapReduce job fails while committing its dataset outputs,
    // the destroy() method is still called and its writes are committed.
    // (1) set up the datasets we use
    datasetCache.newTransactionContext();
    final KeyValueTable kvTable = datasetCache.getDataset("recorder");
    Transactions.createTransactionExecutor(txExecutorFactory, datasetCache.getTransactionAwares()).execute(new TransactionExecutor.Subroutine() {

        @Override
        public void apply() {
            // seed the table: it must not start out with initialized=true
            kvTable.write("initialized", "false");
        }
    });
    // (2) run the job
    runProgram(app, AppWithMapReduce.MapReduceWithFailingOutputCommitter.class, new HashMap<String, String>(), false);
    // (3) verify the results
    Transactions.createTransactionExecutor(txExecutorFactory, datasetCache.getTransactionAwares()).execute(new TransactionExecutor.Subroutine() {

        @Override
        public void apply() {
            // the destroy() method should have recorded FAILED status in the kv table
            Assert.assertEquals(ProgramStatus.FAILED.name(), Bytes.toString(kvTable.read("status")));
        }
    });
    datasetCache.dismissTransactionContext();
}
Also used: ApplicationWithPrograms(io.cdap.cdap.internal.app.deploy.pipeline.ApplicationWithPrograms) KeyValueTable(io.cdap.cdap.api.dataset.lib.KeyValueTable) TransactionExecutor(org.apache.tephra.TransactionExecutor) Test(org.junit.Test)
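
Since TransactionExecutor.Subroutine declares a single apply() method, the anonymous inner classes above can be collapsed into lambdas. A minimal sketch of the same two transactional steps:

// the same transactional write and assertion, using lambda syntax
Transactions.createTransactionExecutor(txExecutorFactory, datasetCache.getTransactionAwares())
    .execute(() -> kvTable.write("initialized", "false"));
Transactions.createTransactionExecutor(txExecutorFactory, datasetCache.getTransactionAwares())
    .execute(() -> Assert.assertEquals(ProgramStatus.FAILED.name(), Bytes.toString(kvTable.read("status"))));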

Aggregations

KeyValueTable (io.cdap.cdap.api.dataset.lib.KeyValueTable): 55 usages
Test (org.junit.Test): 29 usages
ApplicationManager (io.cdap.cdap.test.ApplicationManager): 26 usages
SparkManager (io.cdap.cdap.test.SparkManager): 13 usages
HashMap (java.util.HashMap): 11 usages
TransactionExecutor (org.apache.tephra.TransactionExecutor): 10 usages
FileSet (io.cdap.cdap.api.dataset.lib.FileSet): 9 usages
ApplicationWithPrograms (io.cdap.cdap.internal.app.deploy.pipeline.ApplicationWithPrograms): 9 usages
WorkflowManager (io.cdap.cdap.test.WorkflowManager): 8 usages
Table (io.cdap.cdap.api.dataset.table.Table): 7 usages
ServiceManager (io.cdap.cdap.test.ServiceManager): 7 usages
IOException (java.io.IOException): 7 usages
Location (org.apache.twill.filesystem.Location): 7 usages
KeyValue (io.cdap.cdap.api.dataset.lib.KeyValue): 6 usages
ArrayList (java.util.ArrayList): 6 usages
URL (java.net.URL): 5 usages
Category (org.junit.experimental.categories.Category): 5 usages
Schema (io.cdap.cdap.api.data.schema.Schema): 4 usages
DatasetSpecification (io.cdap.cdap.api.dataset.DatasetSpecification): 4 usages
MapReduceManager (io.cdap.cdap.test.MapReduceManager): 4 usages