
Example 1 with LogicalRDD

Use of org.apache.spark.sql.execution.LogicalRDD in the OpenLineage project (OpenLineage/OpenLineage): class LogicalRDDVisitor, method apply.

@Override
public List<D> apply(LogicalPlan x) {
    LogicalRDD logicalRdd = (LogicalRDD) x;
    List<HadoopRDD> hadoopRdds = findHadoopRdds(logicalRdd);
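    // Each HadoopRDD contributes the input paths from its JobConf; each path is resolved to its
    // directory via PlanUtils.getDirectoryPath, and one dataset is emitted per distinct directory,
    // all sharing the LogicalRDD's schema.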
    return hadoopRdds.stream().flatMap(rdd -> {
        Path[] inputPaths = FileInputFormat.getInputPaths(rdd.getJobConf());
        Configuration hadoopConf = rdd.getConf();
        return Arrays.stream(inputPaths).map(p -> PlanUtils.getDirectoryPath(p, hadoopConf));
    }).distinct().map(p -> {
        // static partitions in the relation
        return datasetFactory.getDataset(p.toUri(), logicalRdd.schema());
    }).collect(Collectors.toList());
}
Also used: java.util.Arrays, org.apache.spark.sql.catalyst.plans.logical.LogicalPlan, io.openlineage.spark.api.QueryPlanVisitor, io.openlineage.spark.api.DatasetFactory, io.openlineage.spark.api.OpenLineageContext, org.apache.hadoop.mapred.FileInputFormat, scala.collection.Seq, io.openlineage.spark.agent.util.ScalaConversionUtils, java.util.stream.Collectors, java.util.Stack, java.util.ArrayList, io.openlineage.spark.agent.util.PlanUtils, java.util.List, org.apache.spark.Dependency, org.apache.spark.rdd.HadoopRDD, org.apache.hadoop.conf.Configuration, org.apache.spark.sql.execution.LogicalRDD, org.apache.hadoop.fs.Path, io.openlineage.client.OpenLineage, org.apache.spark.rdd.RDD
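
A QueryPlanVisitor is a partial function over logical plan nodes, so callers test isDefinedAt before invoking apply, as the test in Example 2 does. The following is a minimal usage sketch rather than OpenLineage code: the helper name extractDatasets, the DataFrame argument, and the pre-built visitor are assumptions (a visitor can be constructed as in Example 2).

// Hypothetical helper illustrating the isDefinedAt/apply pattern; Dataset and Row are
// org.apache.spark.sql types, Collections is java.util.Collections.
List<OpenLineage.Dataset> extractDatasets(Dataset<Row> df, LogicalRDDVisitor visitor) {
    // The OpenLineage agent inspects the query's optimized logical plan;
    // this visitor matches only LogicalRDD nodes.
    LogicalPlan plan = df.queryExecution().optimizedPlan();
    if (visitor.isDefinedAt(plan)) {
        // One dataset per distinct input directory found beneath the LogicalRDD.
        return visitor.apply(plan);
    }
    return Collections.emptyList();
}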

Example 2 with LogicalRDD

Use of org.apache.spark.sql.execution.LogicalRDD in the OpenLineage project (OpenLineage/OpenLineage): class LogicalRDDVisitorTest, method testApply, which exercises the visitor from Example 1.

@Test
public void testApply(@TempDir Path tmpDir) {
    SparkSession session = SparkSession.builder().master("local").getOrCreate();
    LogicalRDDVisitor visitor =
        new LogicalRDDVisitor(
            SparkAgentTestExtension.newContext(session),
            DatasetFactory.output(new OpenLineage(OpenLineageClient.OPEN_LINEAGE_CLIENT_URI)));
    StructType schema =
        new StructType(
            new StructField[] {
                new StructField("anInt", IntegerType$.MODULE$, false, new Metadata(new HashMap<>())),
                new StructField("aString", StringType$.MODULE$, false, new Metadata(new HashMap<>()))
            });
    jobConf = new JobConf();
    FileInputFormat.addInputPath(jobConf, new org.apache.hadoop.fs.Path("file://" + tmpDir));
    RDD<InternalRow> hadoopRdd =
        new HadoopRDD<>(session.sparkContext(), jobConf, TextInputFormat.class, LongWritable.class, Text.class, 1)
            .toJavaRDD()
            .map(t -> (InternalRow) new GenericInternalRow(new Object[] { t._2.toString() }))
            .rdd();
    LogicalRDD logicalRDD =
        new LogicalRDD(
            ScalaConversionUtils.fromSeq(schema.toAttributes()).stream()
                .map(AttributeReference::toAttribute)
                .collect(ScalaConversionUtils.toSeq()),
            hadoopRdd,
            SinglePartition$.MODULE$,
            Seq$.MODULE$.<SortOrder>empty(),
            false,
            session);
    assertThat(visitor.isDefinedAt(logicalRDD)).isTrue();
    List<OpenLineage.Dataset> datasets = visitor.apply(logicalRDD);
    assertThat(datasets).singleElement().hasFieldOrPropertyWithValue("name", tmpDir.toString()).hasFieldOrPropertyWithValue("namespace", "file");
}
Also used: io.openlineage.spark.agent.client.OpenLineageClient, scala.collection.Seq$, org.apache.hadoop.mapred.TextInputFormat, org.apache.spark.sql.catalyst.InternalRow, org.apache.spark.sql.catalyst.plans.physical.SinglePartition$, org.assertj.core.api.Assertions.assertThat, org.apache.hadoop.io.Text, org.apache.hadoop.io.LongWritable, org.apache.spark.sql.catalyst.expressions.GenericInternalRow, org.apache.spark.sql.catalyst.expressions.AttributeReference, io.openlineage.spark.agent.SparkAgentTestExtension, org.junit.jupiter.api.extension.ExtendWith, org.apache.spark.rdd.HadoopRDD, java.nio.file.Path, org.apache.spark.sql.SparkSession, org.apache.spark.sql.types.Metadata, org.apache.spark.sql.types.StringType$, org.apache.spark.sql.types.StructField, org.apache.spark.sql.types.StructType, org.apache.spark.sql.types.IntegerType$, org.apache.spark.sql.SparkSession$, io.openlineage.spark.api.DatasetFactory, org.apache.hadoop.mapred.FileInputFormat, io.openlineage.spark.agent.util.ScalaConversionUtils, org.apache.hadoop.mapred.JobConf, org.junit.jupiter.api.Test, java.util.List, org.junit.jupiter.api.AfterEach, org.apache.spark.sql.catalyst.expressions.SortOrder, org.junit.jupiter.api.io.TempDir, org.apache.spark.sql.execution.LogicalRDD, scala.collection.immutable.HashMap, io.openlineage.client.OpenLineage, org.apache.spark.rdd.RDD
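
For context, the plan shape this test reproduces is what Spark builds when user code turns a file-backed RDD into a DataFrame. A minimal sketch of such user code, assuming a local SparkSession; the path /tmp/input and the single-column schema are illustrative placeholders, not taken from the test.

// Hypothetical user code; the input path and schema are placeholders.
SparkSession session = SparkSession.builder().master("local").getOrCreate();
JavaSparkContext jsc = new JavaSparkContext(session.sparkContext());
StructType schema = new StructType().add("value", DataTypes.StringType, false);

// textFile is backed by a HadoopRDD, so the HadoopRDD stays in the RDD lineage.
JavaRDD<Row> rows = jsc.textFile("/tmp/input").map(line -> RowFactory.create(line));

// createDataFrame over an existing RDD is planned as a LogicalRDD node,
// which is what LogicalRDDVisitor.isDefinedAt matches.
Dataset<Row> df = session.createDataFrame(rows, schema);

On such a plan, the visitor should surface the input directory as a dataset in the file namespace, mirroring the assertions above for tmpDir.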

Example 3 with LogicalRDD

Use of org.apache.spark.sql.execution.LogicalRDD in the OpenLineage project (OpenLineage/OpenLineage): class LogicalRDDVisitor, method findHadoopRdds, the helper invoked from apply in Example 1.

private List<HadoopRDD> findHadoopRdds(LogicalRDD rdd) {
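    // Depth-first walk of the RDD dependency graph with an explicit stack,
    // collecting every HadoopRDD reachable from the LogicalRDD's root RDD.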
    RDD root = rdd.rdd();
    List<HadoopRDD> ret = new ArrayList<>();
    Stack<RDD> deps = new Stack<>();
    deps.add(root);
    while (!deps.isEmpty()) {
        RDD cur = deps.pop();
        Seq<Dependency> dependencies = cur.getDependencies();
        deps.addAll(ScalaConversionUtils.fromSeq(dependencies).stream().map(Dependency::rdd).collect(Collectors.toList()));
        if (cur instanceof HadoopRDD) {
            ret.add((HadoopRDD) cur);
        }
    }
    return ret;
}
Also used: org.apache.spark.rdd.HadoopRDD, org.apache.spark.sql.execution.LogicalRDD, org.apache.spark.rdd.RDD, java.util.ArrayList, org.apache.spark.Dependency, java.util.Stack
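
The traversal works because every RDD exposes its parents through dependencies(). A small sketch of the chain it walks, mirroring the HadoopRDD setup from Example 2; here session is an existing SparkSession and file:///tmp/input is a placeholder path (both assumptions).

// Sketch only: setup mirrors Example 2, the input path is a placeholder.
JobConf jobConf = new JobConf();
FileInputFormat.addInputPath(jobConf, new org.apache.hadoop.fs.Path("file:///tmp/input"));
RDD<String> mapped =
    new HadoopRDD<>(session.sparkContext(), jobConf, TextInputFormat.class, LongWritable.class, Text.class, 1)
        .toJavaRDD()
        .map(t -> t._2.toString())   // MapPartitionsRDD whose parent is the HadoopRDD
        .rdd();

// With this one-step pipeline the direct parent is the HadoopRDD itself.
RDD<?> parent = mapped.dependencies().head().rdd();

In deeper pipelines the HadoopRDD can sit several dependencies away from the LogicalRDD's root RDD, which is why findHadoopRdds keeps pushing parents onto the stack instead of inspecting only the root.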

Aggregations

HadoopRDD (org.apache.spark.rdd.HadoopRDD): 3
RDD (org.apache.spark.rdd.RDD): 3
LogicalRDD (org.apache.spark.sql.execution.LogicalRDD): 3
OpenLineage (io.openlineage.client.OpenLineage): 2
ScalaConversionUtils (io.openlineage.spark.agent.util.ScalaConversionUtils): 2
DatasetFactory (io.openlineage.spark.api.DatasetFactory): 2
ArrayList (java.util.ArrayList): 2
List (java.util.List): 2
Stack (java.util.Stack): 2
FileInputFormat (org.apache.hadoop.mapred.FileInputFormat): 2
Dependency (org.apache.spark.Dependency): 2
SparkAgentTestExtension (io.openlineage.spark.agent.SparkAgentTestExtension): 1
OpenLineageClient (io.openlineage.spark.agent.client.OpenLineageClient): 1
PlanUtils (io.openlineage.spark.agent.util.PlanUtils): 1
OpenLineageContext (io.openlineage.spark.api.OpenLineageContext): 1
QueryPlanVisitor (io.openlineage.spark.api.QueryPlanVisitor): 1
Path (java.nio.file.Path): 1
Arrays (java.util.Arrays): 1
Collectors (java.util.stream.Collectors): 1
Configuration (org.apache.hadoop.conf.Configuration): 1