Example 1 with JoinDefinition

use of io.cdap.cdap.etl.api.join.JoinDefinition in project cdap by caskdata.

the class BatchSparkPipelineDriverTest method testShouldJoinOnSQLEngineWithBroadcastAndAlreadyPushedCollection.

@Test
public void testShouldJoinOnSQLEngineWithBroadcastAndAlreadyPushedCollection() {
    // Stage "c" is broadcast, so the list is named for what it actually contains.
    List<JoinStage> stagesWithBroadcast = Arrays.asList(JoinStage.builder("a", null).setBroadcast(false).build(), JoinStage.builder("b", null).setBroadcast(false).build(), JoinStage.builder("c", null).setBroadcast(true).build());
    JoinDefinition joinDefinition = mock(JoinDefinition.class);
    doReturn(stagesWithBroadcast).when(joinDefinition).getStages();
    Map<String, SparkCollection<Object>> collections = new HashMap<>();
    collections.put("a", mock(SQLEngineCollection.class));
    collections.put("b", mock(RDDCollection.class));
    collections.put("c", mock(RDDCollection.class));
    Assert.assertTrue(driver.canJoinOnSQLEngine(STAGE_NAME, joinDefinition, collections));
}
Also used : SparkCollection(io.cdap.cdap.etl.spark.SparkCollection) JoinStage(io.cdap.cdap.etl.api.join.JoinStage) JoinDefinition(io.cdap.cdap.etl.api.join.JoinDefinition) HashMap(java.util.HashMap) Matchers.anyString(org.mockito.Matchers.anyString) Test(org.junit.Test)

Example 2 with JoinDefinition

use of io.cdap.cdap.etl.api.join.JoinDefinition in project cdap by caskdata.

the class BatchSparkPipelineDriverTest method testSQLEngineDoesNotSupportJoin.

@Test
public void testSQLEngineDoesNotSupportJoin() {
    when(adapter.canJoin(anyString(), any(JoinDefinition.class))).thenReturn(false);
    List<JoinStage> noneBroadcast = Arrays.asList(JoinStage.builder("a", null).setBroadcast(false).build(), JoinStage.builder("b", null).setBroadcast(false).build(), JoinStage.builder("c", null).setBroadcast(false).build());
    JoinDefinition joinDefinition = mock(JoinDefinition.class);
    doReturn(noneBroadcast).when(joinDefinition).getStages();
    Map<String, SparkCollection<Object>> collections = new HashMap<>();
    collections.put("a", mock(RDDCollection.class));
    collections.put("b", mock(RDDCollection.class));
    collections.put("c", mock(RDDCollection.class));
    Assert.assertFalse(driver.canJoinOnSQLEngine(STAGE_NAME, joinDefinition, collections));
}
Also used : SparkCollection(io.cdap.cdap.etl.spark.SparkCollection) JoinStage(io.cdap.cdap.etl.api.join.JoinStage) JoinDefinition(io.cdap.cdap.etl.api.join.JoinDefinition) HashMap(java.util.HashMap) Matchers.anyString(org.mockito.Matchers.anyString) Test(org.junit.Test)

Example 3 with JoinDefinition

use of io.cdap.cdap.etl.api.join.JoinDefinition in project cdap by caskdata.

the class PipelineSpecGenerator method configureAutoJoiner.

private void configureAutoJoiner(String stageName, AutoJoiner autoJoiner, DefaultStageConfigurer stageConfigurer, FailureCollector collector) {
    AutoJoinerContext autoContext = DefaultAutoJoinerContext.from(stageConfigurer.getInputSchemas(), collector);
    JoinDefinition joinDefinition = autoJoiner.define(autoContext);
    if (joinDefinition == null) {
        return;
    }
    validateJoinCondition(stageName, joinDefinition.getCondition(), collector);
    stageConfigurer.setOutputSchema(joinDefinition.getOutputSchema());
    Set<String> inputStages = stageConfigurer.getInputSchemas().keySet();
    Set<String> joinStages = joinDefinition.getStages().stream().map(JoinStage::getStageName).collect(Collectors.toSet());
    Set<String> missingInputs = Sets.difference(inputStages, joinStages);
    if (!missingInputs.isEmpty()) {
        collector.addFailure(String.format("Joiner stage '%s' did not include input stage %s in the join.", stageName, String.join(", ", missingInputs)), "Check with the plugin developer to make sure it is implemented correctly.");
    }
    Set<String> extraInputs = Sets.difference(joinStages, inputStages);
    if (!extraInputs.isEmpty()) {
        collector.addFailure(String.format("Joiner stage '%s' is trying to join stage %s, which is not an input.", stageName, String.join(", ", extraInputs)), "Check with the plugin developer to make sure it is implemented correctly.");
    }
}
Also used : DefaultAutoJoinerContext(io.cdap.cdap.etl.common.DefaultAutoJoinerContext) AutoJoinerContext(io.cdap.cdap.etl.api.join.AutoJoinerContext) JoinDefinition(io.cdap.cdap.etl.api.join.JoinDefinition)
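
The input check in configureAutoJoiner comes down to two set differences: inputs that the join never mentions, and joined stages that are not inputs. Below is a minimal sketch of that idea using only java.util, with no Guava or CDAP types. The class JoinStageCheck and its methods difference and findFailures are hypothetical names introduced for illustration; they are not part of the CDAP API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper, not part of CDAP: compares a joiner's inputs with the stages named in its join. */
final class JoinStageCheck {

    /** Returns the elements of left that are absent from right, keeping left's iteration order. */
    static Set<String> difference(Set<String> left, Set<String> right) {
        Set<String> result = new LinkedHashSet<>(left);
        result.removeAll(right);
        return result;
    }

    /** Collects one message per kind of mismatch; the list is empty when both sets agree. */
    static List<String> findFailures(String stageName, Set<String> inputStages, Set<String> joinStages) {
        List<String> failures = new ArrayList<>();
        // Inputs wired into the joiner that the join definition never mentions.
        Set<String> missingInputs = difference(inputStages, joinStages);
        if (!missingInputs.isEmpty()) {
            failures.add(String.format("Joiner stage '%s' did not include input stage %s in the join.",
                                       stageName, String.join(", ", missingInputs)));
        }
        // Stages named in the join definition that are not inputs of the joiner.
        Set<String> extraInputs = difference(joinStages, inputStages);
        if (!extraInputs.isEmpty()) {
            failures.add(String.format("Joiner stage '%s' is trying to join stage %s, which is not an input.",
                                       stageName, String.join(", ", extraInputs)));
        }
        return failures;
    }

    public static void main(String[] args) {
        Set<String> inputs = new LinkedHashSet<>(Arrays.asList("a", "b"));
        Set<String> joined = new LinkedHashSet<>(Arrays.asList("b", "c"));
        for (String failure : findFailures("join1", inputs, joined)) {
            System.out.println(failure);
        }
    }
}
```

Each message is built from the set that triggered it, so the "not an input" message lists the extra stages rather than the missing ones.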

Example 4 with JoinDefinition

use of io.cdap.cdap.etl.api.join.JoinDefinition in project cdap by caskdata.

the class PipelinePhasePreparer method validateAutoJoiner.

private void validateAutoJoiner(AutoJoiner autoJoiner, StageSpec stageSpec) {
    // validate that the join definition is not null
    // it could be null at configure time due to macros not being evaluated, but at this
    // point all macros should be evaluated and the definition should be non-null.
    String stageName = stageSpec.getName();
    String pluginName = stageSpec.getPlugin().getName();
    FailureCollector failureCollector = new LoggingFailureCollector(stageSpec.getName(), stageSpec.getInputSchemas());
    AutoJoinerContext autoJoinerContext = DefaultAutoJoinerContext.from(stageSpec.getInputSchemas(), failureCollector);
    JoinDefinition joinDefinition = autoJoiner.define(autoJoinerContext);
    failureCollector.getOrThrowException();
    if (joinDefinition == null) {
        throw new IllegalArgumentException(String.format("Joiner stage '%s' using plugin '%s' did not provide a join definition. " + "Check with the plugin developer to make sure it is implemented correctly.", stageName, pluginName));
    }
    // validate that the stages mentioned in the join definition are actually inputs into the joiner.
    Set<String> inputStages = stageSpec.getInputSchemas().keySet();
    Set<String> joinStages = joinDefinition.getStages().stream().map(JoinStage::getStageName).collect(Collectors.toSet());
    Set<String> missingInputs = Sets.difference(inputStages, joinStages);
    if (!missingInputs.isEmpty()) {
        throw new IllegalArgumentException(String.format("Joiner stage '%s' using plugin '%s' did not include input stage %s in the join. " + "Check with the plugin developer to make sure it is implemented correctly.", stageName, pluginName, String.join(", ", missingInputs)));
    }
    Set<String> extraInputs = Sets.difference(joinStages, inputStages);
    if (!extraInputs.isEmpty()) {
        throw new IllegalArgumentException(String.format("Joiner stage '%s' using plugin '%s' is trying to join stage %s, which is not an input. " + "Check with the plugin developer to make sure it is implemented correctly.", stageName, pluginName, String.join(", ", extraInputs)));
    }
}
Also used : LoggingFailureCollector(io.cdap.cdap.etl.validation.LoggingFailureCollector) DefaultAutoJoinerContext(io.cdap.cdap.etl.common.DefaultAutoJoinerContext) AutoJoinerContext(io.cdap.cdap.etl.api.join.AutoJoinerContext) JoinDefinition(io.cdap.cdap.etl.api.join.JoinDefinition) FailureCollector(io.cdap.cdap.etl.api.FailureCollector)

Example 5 with JoinDefinition

use of io.cdap.cdap.etl.api.join.JoinDefinition in project cdap by caskdata.

the class SparkPipelineRunner method handleJoin.

protected SparkCollection<Object> handleJoin(Map<String, SparkCollection<Object>> inputDataCollections, PipelinePhase pipelinePhase, PluginFunctionContext pluginFunctionContext, StageSpec stageSpec, FunctionCache.Factory functionCacheFactory, Object plugin, Integer numPartitions, StageStatisticsCollector collector, Set<String> shufflers) throws Exception {
    String stageName = stageSpec.getName();
    if (plugin instanceof BatchJoiner) {
        BatchJoiner<Object, Object, Object> joiner = (BatchJoiner<Object, Object, Object>) plugin;
        BatchJoinerRuntimeContext joinerRuntimeContext = pluginFunctionContext.createBatchRuntimeContext();
        joiner.initialize(joinerRuntimeContext);
        shufflers.add(stageName);
        return handleJoin(joiner, inputDataCollections, stageSpec, functionCacheFactory, numPartitions, collector);
    } else if (plugin instanceof AutoJoiner) {
        AutoJoiner autoJoiner = (AutoJoiner) plugin;
        Map<String, Schema> inputSchemas = new HashMap<>();
        for (String inputStageName : pipelinePhase.getStageInputs(stageName)) {
            StageSpec inputStageSpec = pipelinePhase.getStage(inputStageName);
            inputSchemas.put(inputStageName, inputStageSpec.getOutputSchema());
        }
        FailureCollector failureCollector = new LoggingFailureCollector(stageName, inputSchemas);
        AutoJoinerContext autoJoinerContext = DefaultAutoJoinerContext.from(inputSchemas, failureCollector);
        // joinDefinition will always be non-null because
        // it is checked by PipelinePhasePreparer at the start of the run.
        JoinDefinition joinDefinition = autoJoiner.define(autoJoinerContext);
        failureCollector.getOrThrowException();
        if (joinDefinition.getStages().stream().noneMatch(JoinStage::isBroadcast)) {
            shufflers.add(stageName);
        }
        return handleAutoJoin(stageName, joinDefinition, inputDataCollections, numPartitions);
    } else {
        // should never happen unless there is a bug in the code. should have failed during deployment
        throw new IllegalStateException(String.format("Stage '%s' is an unknown joiner type %s", stageName, plugin.getClass().getName()));
    }
}
Also used : BatchJoinerRuntimeContext(io.cdap.cdap.etl.api.batch.BatchJoinerRuntimeContext) LoggingFailureCollector(io.cdap.cdap.etl.validation.LoggingFailureCollector) BatchJoiner(io.cdap.cdap.etl.api.batch.BatchJoiner) DefaultAutoJoinerContext(io.cdap.cdap.etl.common.DefaultAutoJoinerContext) AutoJoinerContext(io.cdap.cdap.etl.api.join.AutoJoinerContext) JoinDefinition(io.cdap.cdap.etl.api.join.JoinDefinition) StageSpec(io.cdap.cdap.etl.proto.v2.spec.StageSpec) AutoJoiner(io.cdap.cdap.etl.api.join.AutoJoiner) Map(java.util.Map) HashMap(java.util.HashMap) FailureCollector(io.cdap.cdap.etl.api.FailureCollector)
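
In the AutoJoiner branch of handleJoin, the stage is added to shufflers only when none of the join stages is marked as broadcast. The sketch below isolates that decision under stated assumptions: ShuffleTracking, its nested Stage class, and recordIfShuffling are hypothetical stand-ins written for this example, and Stage models only the broadcast flag of JoinStage.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch, not CDAP code: decides whether a joiner is tracked as a shuffling stage. */
final class ShuffleTracking {

    /** Minimal stand-in for JoinStage: a name plus a broadcast flag. */
    static final class Stage {
        final String name;
        final boolean broadcast;

        Stage(String name, boolean broadcast) {
            this.name = name;
            this.broadcast = broadcast;
        }

        boolean isBroadcast() {
            return broadcast;
        }
    }

    /**
     * Adds joinerName to shufflers only when no stage in the join is broadcast.
     * Returns true when the joiner was recorded.
     */
    static boolean recordIfShuffling(String joinerName, List<Stage> stages, Set<String> shufflers) {
        if (stages.stream().noneMatch(Stage::isBroadcast)) {
            shufflers.add(joinerName);
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> shufflers = new HashSet<>();
        List<Stage> noBroadcast = Arrays.asList(new Stage("a", false), new Stage("b", false));
        List<Stage> oneBroadcast = Arrays.asList(new Stage("a", false), new Stage("c", true));
        recordIfShuffling("joinAB", noBroadcast, shufflers);
        recordIfShuffling("joinAC", oneBroadcast, shufflers);
        System.out.println(shufflers);
    }
}
```

Note that a single broadcast stage is enough to keep the joiner out of shufflers, which matches the noneMatch check in the original method.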

Aggregations

JoinDefinition (io.cdap.cdap.etl.api.join.JoinDefinition) 12
AutoJoinerContext (io.cdap.cdap.etl.api.join.AutoJoinerContext) 6
JoinStage (io.cdap.cdap.etl.api.join.JoinStage) 6
HashMap (java.util.HashMap) 6
BatchJoinerRuntimeContext (io.cdap.cdap.etl.api.batch.BatchJoinerRuntimeContext) 5
DefaultAutoJoinerContext (io.cdap.cdap.etl.common.DefaultAutoJoinerContext) 5
FailureCollector (io.cdap.cdap.etl.api.FailureCollector) 4
BatchAutoJoiner (io.cdap.cdap.etl.api.batch.BatchAutoJoiner) 4
BatchJoiner (io.cdap.cdap.etl.api.batch.BatchJoiner) 4
JoinerBridge (io.cdap.cdap.etl.common.plugin.JoinerBridge) 4
SparkCollection (io.cdap.cdap.etl.spark.SparkCollection) 4
LoggingFailureCollector (io.cdap.cdap.etl.validation.LoggingFailureCollector) 4
Test (org.junit.Test) 4
Matchers.anyString (org.mockito.Matchers.anyString) 4
Schema (io.cdap.cdap.api.data.schema.Schema) 3
JoinCondition (io.cdap.cdap.etl.api.join.JoinCondition) 3
StageSpec (io.cdap.cdap.etl.proto.v2.spec.StageSpec) 2
HashSet (java.util.HashSet) 2
StructuredRecord (io.cdap.cdap.api.data.format.StructuredRecord) 1
Emitter (io.cdap.cdap.etl.api.Emitter) 1