
Example 1 with JDBCRelation

Use of org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation in project OpenLineage by OpenLineage.

In the class LogicalRelationDatasetBuilder, the method handleJdbcRelation:

private List<D> handleJdbcRelation(LogicalRelation x) {
    JDBCRelation relation = (JDBCRelation) x.relation();
    // TODO- if a relation is composed of a complex sql query, we should attempt to
    // extract the table names so that we can construct a true lineage
    String tableName = relation.jdbcOptions().parameters().get(JDBCOptions.JDBC_TABLE_NAME()).getOrElse(ScalaConversionUtils.toScalaFn(() -> "COMPLEX"));
    // strip the jdbc: prefix from the url. this leaves us with a url like
    // postgresql://<hostname>:<port>/<database_name>?params
    // we don't parse the URI here because different drivers use different connection
    // formats that aren't always amenable to how Java parses URIs. E.g., the oracle
    // driver format looks like oracle:<drivertype>:<user>/<password>@<database>
    // whereas postgres, mysql, and sqlserver use the scheme://hostname:port/db format.
    String url = JdbcUtils.sanitizeJdbcUrl(relation.jdbcOptions().url());
    return Collections.singletonList(datasetFactory.getDataset(tableName, url, relation.schema()));
}
Also used : JDBCRelation(org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation)
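
The sanitization step above deliberately strips the jdbc: prefix instead of parsing the URL with java.net.URI. A minimal sketch of that idea, assuming a plain prefix strip plus removal of inline credentials; the actual JdbcUtils.sanitizeJdbcUrl in OpenLineage may handle more cases:

// Hypothetical sketch of a JDBC URL sanitizer; illustrative only,
// not OpenLineage's actual JdbcUtils implementation.
public final class JdbcUrlSanitizerSketch {

    public static String sanitize(String jdbcUrl) {
        // Strip the "jdbc:" prefix, leaving e.g.
        // postgresql://<hostname>:<port>/<database_name>?params
        String url = jdbcUrl.startsWith("jdbc:") ? jdbcUrl.substring(5) : jdbcUrl;
        // Drop inline credentials so the namespace identifies only the
        // server and database, never the user or password.
        return url.replaceAll("(?<=://)[^@/]+@", "");
    }

    public static void main(String[] args) {
        System.out.println(sanitize("jdbc:postgresql://user:secret@postgreshost:5432/sparkdata"));
        // prints: postgresql://postgreshost:5432/sparkdata
    }
}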

Example 2 with JDBCRelation

Use of org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation in project OpenLineage by OpenLineage.

In the class OptimizedCreateHiveTableAsSelectCommandVisitorTest, the method testOptimizedCreateHiveTableAsSelectCommand:

@Test
void testOptimizedCreateHiveTableAsSelectCommand() {
    OptimizedCreateHiveTableAsSelectCommandVisitor visitor =
        new OptimizedCreateHiveTableAsSelectCommandVisitor(SparkAgentTestExtension.newContext(session));
    OptimizedCreateHiveTableAsSelectCommand command =
        new OptimizedCreateHiveTableAsSelectCommand(
            SparkUtils.catalogTable(
                TableIdentifier$.MODULE$.apply("tablename", Option.apply("db")),
                CatalogTableType.EXTERNAL(),
                CatalogStorageFormat$.MODULE$.apply(
                    Option.apply(URI.create("s3://bucket/directory")), null, null, null, false, Map$.MODULE$.empty()),
                new StructType(
                    new StructField[] {
                        new StructField("key", IntegerType$.MODULE$, false, new Metadata(new HashMap<>())),
                        new StructField("value", StringType$.MODULE$, false, new Metadata(new HashMap<>()))
                    })),
            new LogicalRelation(
                new JDBCRelation(
                    new StructType(
                        new StructField[] {
                            new StructField("key", IntegerType$.MODULE$, false, null),
                            new StructField("value", StringType$.MODULE$, false, null)
                        }),
                    new Partition[] {},
                    new JDBCOptions(
                        "",
                        "temp",
                        scala.collection.immutable.Map$.MODULE$
                            .newBuilder()
                            .$plus$eq(Tuple2.apply("driver", Driver.class.getName()))
                            .result()),
                    session),
                Seq$.MODULE$
                    .<AttributeReference>newBuilder()
                    .$plus$eq(new AttributeReference("key", IntegerType$.MODULE$, false, null, ExprId.apply(1L), Seq$.MODULE$.<String>empty()))
                    .$plus$eq(new AttributeReference("value", StringType$.MODULE$, false, null, ExprId.apply(2L), Seq$.MODULE$.<String>empty()))
                    .result(),
                Option.empty(),
                false),
            ScalaConversionUtils.fromList(Arrays.asList("key", "value")),
            SaveMode.Overwrite);
    assertThat(visitor.isDefinedAt(command)).isTrue();
    List<OpenLineage.OutputDataset> datasets = visitor.apply(command);
    assertEquals(1, datasets.size());
    OpenLineage.OutputDataset outputDataset = datasets.get(0);
    assertEquals(OpenLineage.LifecycleStateChangeDatasetFacet.LifecycleStateChange.OVERWRITE, outputDataset.getFacets().getLifecycleStateChange().getLifecycleStateChange());
    assertEquals("directory", outputDataset.getName());
    assertEquals("s3://bucket", outputDataset.getNamespace());
}
Also used : StructType(org.apache.spark.sql.types.StructType) OptimizedCreateHiveTableAsSelectCommandVisitor(io.openlineage.spark.agent.lifecycle.plan.OptimizedCreateHiveTableAsSelectCommandVisitor) AttributeReference(org.apache.spark.sql.catalyst.expressions.AttributeReference) Metadata(org.apache.spark.sql.types.Metadata) JDBCRelation(org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation) Driver(org.postgresql.Driver) LogicalRelation(org.apache.spark.sql.execution.datasources.LogicalRelation) StructField(org.apache.spark.sql.types.StructField) JDBCOptions(org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions) OptimizedCreateHiveTableAsSelectCommand(org.apache.spark.sql.hive.execution.OptimizedCreateHiveTableAsSelectCommand) OpenLineage(io.openlineage.client.OpenLineage) Test(org.junit.jupiter.api.Test)
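
The last two assertions show how the output dataset identity is derived from the table's storage location: s3://bucket/directory splits into the namespace s3://bucket and the name directory. A minimal sketch of that split using a hypothetical helper (OpenLineage's actual path handling lives elsewhere and may differ):

import java.net.URI;

// Hypothetical helper illustrating the namespace/name split asserted above.
public final class DatasetIdentifierSketch {

    // Returns {namespace, name} for a storage location URI.
    public static String[] split(URI location) {
        // scheme://authority becomes the namespace, e.g. "s3://bucket"
        String namespace = location.getScheme() + "://" + location.getAuthority();
        // the path minus its leading slash becomes the name, e.g. "directory"
        String name = location.getPath().replaceFirst("^/", "");
        return new String[] { namespace, name };
    }

    public static void main(String[] args) {
        String[] id = split(URI.create("s3://bucket/directory"));
        System.out.println(id[0] + " / " + id[1]); // s3://bucket / directory
    }
}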

Example 3 with JDBCRelation

Use of org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation in project OpenLineage by OpenLineage.

In the class LogicalPlanSerializerTest, the method testSerializeLogicalPlan:

@Test
public void testSerializeLogicalPlan() throws IOException {
    String jdbcUrl = "jdbc:postgresql://postgreshost:5432/sparkdata";
    String sparkTableName = "my_spark_table";
    scala.collection.immutable.Map<String, String> map =
        (scala.collection.immutable.Map<String, String>)
            Map$.MODULE$.<String, String>newBuilder()
                .$plus$eq(Tuple2.apply("driver", Driver.class.getName()))
                .result();
    JDBCRelation relation =
        new JDBCRelation(
            new StructType(
                new StructField[] { new StructField("name", StringType$.MODULE$, false, Metadata.empty()) }),
            new Partition[] {},
            new JDBCOptions(jdbcUrl, sparkTableName, map),
            mock(SparkSession.class));
    LogicalRelation logicalRelation =
        new LogicalRelation(
            relation,
            Seq$.MODULE$
                .<AttributeReference>newBuilder()
                .$plus$eq(new AttributeReference("name", StringType$.MODULE$, false, Metadata.empty(), ExprId.apply(1L), Seq$.MODULE$.<String>empty()))
                .result(),
            Option.empty(),
            false);
    Aggregate aggregate = new Aggregate(Seq$.MODULE$.<Expression>empty(), Seq$.MODULE$.<NamedExpression>empty(), logicalRelation);
    Map<String, Object> aggregateActualNode = objectMapper.readValue(logicalPlanSerializer.serialize(aggregate), mapTypeReference);
    Map<String, Object> logicalRelationActualNode = objectMapper.readValue(logicalPlanSerializer.serialize(logicalRelation), mapTypeReference);
    Path expectedAggregateNodePath = Paths.get("src", "test", "resources", "test_data", "serde", "aggregate-node.json");
    Path logicalRelationNodePath = Paths.get("src", "test", "resources", "test_data", "serde", "logicalrelation-node.json");
    Map<String, Object> expectedAggregateNode = objectMapper.readValue(expectedAggregateNodePath.toFile(), mapTypeReference);
    Map<String, Object> expectedLogicalRelationNode = objectMapper.readValue(logicalRelationNodePath.toFile(), mapTypeReference);
    assertThat(aggregateActualNode).satisfies(new MatchesMapRecursively(expectedAggregateNode));
    assertThat(logicalRelationActualNode).satisfies(new MatchesMapRecursively(expectedLogicalRelationNode));
}
Also used : Path(java.nio.file.Path) SparkSession(org.apache.spark.sql.SparkSession) StructType(org.apache.spark.sql.types.StructType) AttributeReference(org.apache.spark.sql.catalyst.expressions.AttributeReference) JDBCRelation(org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation) LogicalRelation(org.apache.spark.sql.execution.datasources.LogicalRelation) StructField(org.apache.spark.sql.types.StructField) JDBCOptions(org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions) Aggregate(org.apache.spark.sql.catalyst.plans.logical.Aggregate) Map(java.util.Map) ImmutableMap(com.google.cloud.spark.bigquery.repackaged.com.google.common.collect.ImmutableMap) HashMap(scala.collection.immutable.HashMap) Test(org.junit.jupiter.api.Test)
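
MatchesMapRecursively compares the serialized plan against the expected JSON as nested maps rather than as raw strings, so field ordering does not matter. A minimal sketch of that style of recursive comparison, assuming every expected entry must appear in the actual map (the real test helper may be stricter):

import java.util.Map;
import java.util.Objects;

// Hypothetical recursive map matcher in the spirit of MatchesMapRecursively.
public final class RecursiveMapMatcherSketch {

    @SuppressWarnings("unchecked")
    public static boolean matches(Map<String, Object> expected, Map<String, Object> actual) {
        for (Map.Entry<String, Object> entry : expected.entrySet()) {
            Object expectedValue = entry.getValue();
            Object actualValue = actual.get(entry.getKey());
            if (expectedValue instanceof Map && actualValue instanceof Map) {
                // Descend into nested JSON objects instead of comparing references.
                if (!matches((Map<String, Object>) expectedValue, (Map<String, Object>) actualValue)) {
                    return false;
                }
            } else if (!Objects.equals(expectedValue, actualValue)) {
                return false;
            }
        }
        return true;
    }
}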

Example 4 with JDBCRelation

Use of org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation in project OpenLineage by OpenLineage.

In the class CreateHiveTableAsSelectCommandVisitorTest, the method testCreateHiveTableAsSelectCommand:

@Test
void testCreateHiveTableAsSelectCommand() {
    CreateHiveTableAsSelectCommandVisitor visitor =
        new CreateHiveTableAsSelectCommandVisitor(
            OpenLineageContext.builder()
                .sparkSession(Optional.of(session))
                .sparkContext(session.sparkContext())
                .openLineage(new OpenLineage(OpenLineageClient.OPEN_LINEAGE_CLIENT_URI))
                .build());
    CreateHiveTableAsSelectCommand command =
        new CreateHiveTableAsSelectCommand(
            SparkUtils.catalogTable(
                TableIdentifier$.MODULE$.apply("tablename", Option.apply("db")),
                CatalogTableType.EXTERNAL(),
                CatalogStorageFormat$.MODULE$.apply(
                    Option.apply(URI.create("s3://bucket/directory")), null, null, null, false, Map$.MODULE$.empty()),
                new StructType(
                    new StructField[] {
                        new StructField("key", IntegerType$.MODULE$, false, new Metadata(new HashMap<>())),
                        new StructField("value", StringType$.MODULE$, false, new Metadata(new HashMap<>()))
                    })),
            new LogicalRelation(
                new JDBCRelation(
                    new StructType(
                        new StructField[] {
                            new StructField("key", IntegerType$.MODULE$, false, null),
                            new StructField("value", StringType$.MODULE$, false, null)
                        }),
                    new Partition[] {},
                    new JDBCOptions(
                        "",
                        "temp",
                        scala.collection.immutable.Map$.MODULE$
                            .newBuilder()
                            .$plus$eq(Tuple2.apply("driver", Driver.class.getName()))
                            .result()),
                    session),
                Seq$.MODULE$
                    .<AttributeReference>newBuilder()
                    .$plus$eq(new AttributeReference("key", IntegerType$.MODULE$, false, null, ExprId.apply(1L), Seq$.MODULE$.<String>empty()))
                    .$plus$eq(new AttributeReference("value", StringType$.MODULE$, false, null, ExprId.apply(2L), Seq$.MODULE$.<String>empty()))
                    .result(),
                Option.empty(),
                false),
            ScalaConversionUtils.fromList(Arrays.asList("key", "value")),
            SaveMode.Overwrite);
    assertThat(visitor.isDefinedAt(command)).isTrue();
    List<OpenLineage.OutputDataset> datasets = visitor.apply(command);
    assertEquals(1, datasets.size());
    OpenLineage.OutputDataset outputDataset = datasets.get(0);
    assertEquals(OpenLineage.LifecycleStateChangeDatasetFacet.LifecycleStateChange.CREATE, outputDataset.getFacets().getLifecycleStateChange().getLifecycleStateChange());
    assertEquals("directory", outputDataset.getName());
    assertEquals("s3://bucket", outputDataset.getNamespace());
}
Also used : StructType(org.apache.spark.sql.types.StructType) AttributeReference(org.apache.spark.sql.catalyst.expressions.AttributeReference) Metadata(org.apache.spark.sql.types.Metadata) JDBCRelation(org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation) Driver(org.postgresql.Driver) CreateHiveTableAsSelectCommand(org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand) LogicalRelation(org.apache.spark.sql.execution.datasources.LogicalRelation) StructField(org.apache.spark.sql.types.StructField) JDBCOptions(org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions) CreateHiveTableAsSelectCommandVisitor(io.openlineage.spark.agent.lifecycle.plan.CreateHiveTableAsSelectCommandVisitor) OpenLineage(io.openlineage.client.OpenLineage) Test(org.junit.jupiter.api.Test)
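
All of these visitor tests follow the same contract: isDefinedAt decides whether the visitor applies to a given plan node, and apply is only called once that check passes, mirroring Scala's PartialFunction. A minimal sketch of that contract with illustrative types (not OpenLineage's actual class hierarchy):

import java.util.Collections;
import java.util.List;

// Illustrative partial-function-style visitor; names are hypothetical.
abstract class PlanVisitorSketch<P, D> {

    // True if this visitor knows how to extract datasets from the node.
    abstract boolean isDefinedAt(P plan);

    // Extracts datasets; callers should check isDefinedAt first.
    abstract List<D> apply(P plan);
}

class CtasLikeVisitor extends PlanVisitorSketch<Object, String> {

    @Override
    boolean isDefinedAt(Object plan) {
        // Stands in for "is this node a CREATE TABLE AS SELECT command?"
        return plan instanceof String;
    }

    @Override
    List<String> apply(Object plan) {
        return Collections.singletonList("output-dataset-for-" + plan);
    }
}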

Example 5 with JDBCRelation

Use of org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation in project OpenLineage by OpenLineage.

In the class LogicalRelationDatasetBuilderTest, the method testApply:

@ParameterizedTest
@ValueSource(strings = { "postgresql://postgreshost:5432/sparkdata", "jdbc:oracle:oci8:@sparkdata", "jdbc:oracle:thin@sparkdata:1521:orcl", "mysql://localhost/sparkdata" })
void testApply(String connectionUri) {
    OpenLineage openLineage = new OpenLineage(OpenLineageClient.OPEN_LINEAGE_CLIENT_URI);
    String jdbcUrl = "jdbc:" + connectionUri;
    String sparkTableName = "my_spark_table";
    JDBCRelation relation =
        new JDBCRelation(
            new StructType(
                new StructField[] { new StructField("name", StringType$.MODULE$, false, null) }),
            new Partition[] {},
            new JDBCOptions(
                jdbcUrl,
                sparkTableName,
                Map$.MODULE$.<String, String>newBuilder()
                    .$plus$eq(Tuple2.apply("driver", Driver.class.getName()))
                    .result()),
            session);
    QueryExecution qe = mock(QueryExecution.class);
    when(qe.optimizedPlan())
        .thenReturn(
            new LogicalRelation(
                relation,
                Seq$.MODULE$
                    .<AttributeReference>newBuilder()
                    .$plus$eq(new AttributeReference("name", StringType$.MODULE$, false, null, ExprId.apply(1L), Seq$.MODULE$.<String>empty()))
                    .result(),
                Option.empty(),
                false));
    OpenLineageContext context = OpenLineageContext.builder().sparkContext(mock(SparkContext.class)).openLineage(openLineage).queryExecution(qe).build();
    LogicalRelationDatasetBuilder visitor = new LogicalRelationDatasetBuilder<>(context, DatasetFactory.output(openLineage), false);
    List<OutputDataset> datasets = visitor.apply(new SparkListenerJobStart(1, 1, Seq$.MODULE$.empty(), null));
    assertEquals(1, datasets.size());
    OutputDataset ds = datasets.get(0);
    assertEquals(connectionUri, ds.getNamespace());
    assertEquals(sparkTableName, ds.getName());
    assertEquals(URI.create(connectionUri), ds.getFacets().getDataSource().getUri());
    assertEquals(connectionUri, ds.getFacets().getDataSource().getName());
}
Also used : StructType(org.apache.spark.sql.types.StructType) SparkListenerJobStart(org.apache.spark.scheduler.SparkListenerJobStart) AttributeReference(org.apache.spark.sql.catalyst.expressions.AttributeReference) JDBCRelation(org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation) Driver(org.postgresql.Driver) QueryExecution(org.apache.spark.sql.execution.QueryExecution) LogicalRelation(org.apache.spark.sql.execution.datasources.LogicalRelation) StructField(org.apache.spark.sql.types.StructField) JDBCOptions(org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions) OpenLineage(io.openlineage.client.OpenLineage) OutputDataset(io.openlineage.client.OpenLineage.OutputDataset) OpenLineageContext(io.openlineage.spark.api.OpenLineageContext) ValueSource(org.junit.jupiter.params.provider.ValueSource) ParameterizedTest(org.junit.jupiter.params.ParameterizedTest)
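
The value source mixes hierarchical scheme://host/db URLs with Oracle's opaque oracle:<drivertype>... style, which is exactly why the comment in Example 1 warns against parsing JDBC URLs with java.net.URI. A quick standalone demonstration of the difference (illustrative only, not project code):

import java.net.URI;

// Shows why generic URI parsing is unreliable across JDBC driver formats.
public final class JdbcUriShapesDemo {

    public static void main(String[] args) {
        URI postgres = URI.create("postgresql://postgreshost:5432/sparkdata");
        // Hierarchical form: host and port parse cleanly.
        System.out.println(postgres.getHost() + ":" + postgres.getPort()); // postgreshost:5432

        URI oracle = URI.create("oracle:thin@sparkdata:1521:orcl");
        // Opaque form: no authority, so host and port are unavailable.
        System.out.println(oracle.isOpaque() + " host=" + oracle.getHost()); // true host=null
    }
}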

Aggregations

JDBCRelation (org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation) 5
AttributeReference (org.apache.spark.sql.catalyst.expressions.AttributeReference) 4
LogicalRelation (org.apache.spark.sql.execution.datasources.LogicalRelation) 4
JDBCOptions (org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions) 4
StructField (org.apache.spark.sql.types.StructField) 4
StructType (org.apache.spark.sql.types.StructType) 4
OpenLineage (io.openlineage.client.OpenLineage) 3
Test (org.junit.jupiter.api.Test) 3
Driver (org.postgresql.Driver) 3
Metadata (org.apache.spark.sql.types.Metadata) 2
ImmutableMap (com.google.cloud.spark.bigquery.repackaged.com.google.common.collect.ImmutableMap) 1
OutputDataset (io.openlineage.client.OpenLineage.OutputDataset) 1
CreateHiveTableAsSelectCommandVisitor (io.openlineage.spark.agent.lifecycle.plan.CreateHiveTableAsSelectCommandVisitor) 1
OptimizedCreateHiveTableAsSelectCommandVisitor (io.openlineage.spark.agent.lifecycle.plan.OptimizedCreateHiveTableAsSelectCommandVisitor) 1
OpenLineageContext (io.openlineage.spark.api.OpenLineageContext) 1
Path (java.nio.file.Path) 1
Map (java.util.Map) 1
SparkListenerJobStart (org.apache.spark.scheduler.SparkListenerJobStart) 1
SparkSession (org.apache.spark.sql.SparkSession) 1
Aggregate (org.apache.spark.sql.catalyst.plans.logical.Aggregate) 1