Search in sources:

Example 1 with HadoopFsRelation

Use of org.apache.spark.sql.execution.datasources.HadoopFsRelation in the OpenLineage/OpenLineage project.

From the class LogicalRelationDatasetBuilderTest, method testApplyForHadoopFsRelation.

@Test
void testApplyForHadoopFsRelation() {
    // openLineageContext, session, and builder are fixtures of the test class,
    // set up outside this excerpt.
    HadoopFsRelation hadoopFsRelation = mock(HadoopFsRelation.class);
    LogicalRelation logicalRelation = mock(LogicalRelation.class);
    Configuration hadoopConfig = mock(Configuration.class);
    SparkContext sparkContext = mock(SparkContext.class);
    FileIndex fileIndex = mock(FileIndex.class);
    OpenLineage openLineage = mock(OpenLineage.class);
    SessionState sessionState = mock(SessionState.class);
    Path p1 = new Path("/tmp/path1");
    Path p2 = new Path("/tmp/path2");
    when(logicalRelation.relation()).thenReturn(hadoopFsRelation);
    when(openLineageContext.getSparkContext()).thenReturn(sparkContext);
    when(openLineageContext.getSparkSession()).thenReturn(Optional.of(session));
    when(openLineageContext.getOpenLineage()).thenReturn(openLineage);
    when(openLineage.newDatasetFacetsBuilder()).thenReturn(new OpenLineage.DatasetFacetsBuilder());
    when(session.sessionState()).thenReturn(sessionState);
    when(sessionState.newHadoopConfWithOptions(any())).thenReturn(hadoopConfig);
    when(hadoopFsRelation.location()).thenReturn(fileIndex);
    when(fileIndex.rootPaths())
        .thenReturn(
            scala.collection.JavaConverters.collectionAsScalaIterableConverter(Arrays.asList(p1, p2))
                .asScala()
                .toSeq());
    try (MockedStatic<PlanUtils> mocked = mockStatic(PlanUtils.class)) {
        // Both root paths resolve to the same parent directory, so the builder
        // should emit a single dataset named after it.
        when(PlanUtils.getDirectoryPath(p1, hadoopConfig)).thenReturn(new Path("/tmp"));
        when(PlanUtils.getDirectoryPath(p2, hadoopConfig)).thenReturn(new Path("/tmp"));
        List<OpenLineage.Dataset> datasets = builder.apply(logicalRelation);
        assertEquals(1, datasets.size());
        OpenLineage.Dataset ds = datasets.get(0);
        assertEquals("/tmp", ds.getName());
    }
}
Also used: Path (org.apache.hadoop.fs.Path), SessionState (org.apache.spark.sql.internal.SessionState), Configuration (org.apache.hadoop.conf.Configuration), MockedStatic (org.mockito.MockedStatic), OutputDataset (io.openlineage.client.OpenLineage.OutputDataset), LogicalRelation (org.apache.spark.sql.execution.datasources.LogicalRelation), FileIndex (org.apache.spark.sql.execution.datasources.FileIndex), SparkContext (org.apache.spark.SparkContext), OpenLineage (io.openlineage.client.OpenLineage), HadoopFsRelation (org.apache.spark.sql.execution.datasources.HadoopFsRelation), Test (org.junit.jupiter.api.Test), ParameterizedTest (org.junit.jupiter.params.ParameterizedTest)
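
What this test pins down is how the builder collapses multiple root paths into a single dataset. Below is a minimal sketch of that idea, not the OpenLineage implementation; the directory-resolver role is played by PlanUtils.getDirectoryPath (whose Path/Configuration signature is taken from the mocked calls above), passed in here as a function so the snippet stays self-contained:

import java.util.LinkedHashSet;
import java.util.Set;
import java.util.function.BiFunction;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

class RootPathCollapseSketch {
    // Resolve each root path to its directory and deduplicate: /tmp/path1 and
    // /tmp/path2 both map to the single dataset name "/tmp".
    static Set<String> datasetNames(
            Iterable<Path> rootPaths, Configuration conf, BiFunction<Path, Configuration, Path> dirOf) {
        Set<String> names = new LinkedHashSet<>();
        for (Path p : rootPaths) {
            names.add(dirOf.apply(p, conf).toUri().getPath());
        }
        return names;
    }
}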

Example 2 with HadoopFsRelation

Use of org.apache.spark.sql.execution.datasources.HadoopFsRelation in the OpenLineage/OpenLineage project.

From the class DatasetVersionDatasetFacetUtils, method extractVersionFromLogicalRelation.

/**
 * Delta exposes its data through a LogicalRelation whose HadoopFsRelation sits at a leaf of the
 * logical plan. The relation's FileIndex is a TahoeLogFileIndex, which holds the DeltaLog and can
 * be used to read the dataset's snapshot (and thus its version).
 */
public static Optional<String> extractVersionFromLogicalRelation(LogicalRelation logicalRelation) {
    if (logicalRelation.relation() instanceof HadoopFsRelation) {
        HadoopFsRelation fsRelation = (HadoopFsRelation) logicalRelation.relation();
        // Only Delta tables carry a dataset version; check the catalog table's provider.
        if (logicalRelation.catalogTable().isDefined()
            && logicalRelation.catalogTable().get().provider().isDefined()
            && logicalRelation.catalogTable().get().provider().get().equalsIgnoreCase("delta")) {
            // hasDeltaClasses() (defined elsewhere in this class) guards against
            // Delta not being on the classpath.
            if (hasDeltaClasses() && fsRelation.location() instanceof TahoeLogFileIndex) {
                TahoeLogFileIndex fileIndex = (TahoeLogFileIndex) fsRelation.location();
                return Optional.of(Long.toString(fileIndex.getSnapshot().version()));
            }
        }
    }
    return Optional.empty();
}
Also used: TahoeLogFileIndex (org.apache.spark.sql.delta.files.TahoeLogFileIndex), HadoopFsRelation (org.apache.spark.sql.execution.datasources.HadoopFsRelation)
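
A sketch of how a caller might combine this helper with the OpenLineage builders seen in these examples. newDatasetVersionDatasetFacet(String) appears at a call site in Example 4 below; the version(...) setter on DatasetFacetsBuilder is an assumption inferred from the getFacets().getVersion() accessor used there, and the import for DatasetVersionDatasetFacetUtils (the class above) is omitted:

import io.openlineage.client.OpenLineage;
import org.apache.spark.sql.execution.datasources.LogicalRelation;

class VersionFacetSketch {
    // Hypothetical helper: attach the Delta snapshot version, when present,
    // as a dataset version facet (the version(...) builder method is an assumption).
    static OpenLineage.DatasetFacetsBuilder withVersion(
            OpenLineage openLineage, LogicalRelation logicalRelation) {
        OpenLineage.DatasetFacetsBuilder facets = openLineage.newDatasetFacetsBuilder();
        DatasetVersionDatasetFacetUtils.extractVersionFromLogicalRelation(logicalRelation)
            .ifPresent(v -> facets.version(openLineage.newDatasetVersionDatasetFacet(v)));
        return facets;
    }
}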

Example 3 with HadoopFsRelation

Use of org.apache.spark.sql.execution.datasources.HadoopFsRelation in the OpenLineage/OpenLineage project.

From the class LogicalPlanSerializerTest, method testSerializeInsertIntoHadoopPlan.

@Test
public void testSerializeInsertIntoHadoopPlan()
        throws IOException, InvocationTargetException, IllegalAccessException {
    SparkSession session = SparkSession.builder().master("local").getOrCreate();
    // The data schema and partition schema are identical in this test.
    StructType schema =
        new StructType(
            new StructField[] {
                new StructField("name", StringType$.MODULE$, false, Metadata.empty())
            });
    HadoopFsRelation hadoopFsRelation =
        new HadoopFsRelation(
            new CatalogFileIndex(
                session,
                CatalogTableTestUtils.getCatalogTable(new TableIdentifier("test", Option.apply("db"))),
                100L),
            schema,
            schema,
            Option.empty(),
            new TextFileFormat(),
            new HashMap<>(),
            session);
    LogicalRelation logicalRelation =
        new LogicalRelation(
            hadoopFsRelation,
            Seq$.MODULE$
                .<AttributeReference>newBuilder()
                .$plus$eq(
                    new AttributeReference(
                        "name",
                        StringType$.MODULE$,
                        false,
                        Metadata.empty(),
                        ExprId.apply(1L),
                        Seq$.MODULE$.<String>empty()))
                .result(),
            Option.empty(),
            false);
    InsertIntoHadoopFsRelationCommand command =
        new InsertIntoHadoopFsRelationCommand(
            new org.apache.hadoop.fs.Path("/tmp"),
            new HashMap<>(),
            false,
            Seq$.MODULE$
                .<Attribute>newBuilder()
                .$plus$eq(
                    new AttributeReference(
                        "name",
                        StringType$.MODULE$,
                        false,
                        Metadata.empty(),
                        ExprId.apply(1L),
                        Seq$.MODULE$.<String>empty()))
                .result(),
            Option.empty(),
            new TextFileFormat(),
            new HashMap<>(),
            logicalRelation,
            SaveMode.Overwrite,
            Option.empty(),
            Option.empty(),
            Seq$.MODULE$.<String>newBuilder().$plus$eq("name").result());
    Map<String, Object> commandActualNode =
        objectMapper.readValue(logicalPlanSerializer.serialize(command), mapTypeReference);
    Map<String, Object> hadoopFSActualNode =
        objectMapper.readValue(logicalPlanSerializer.serialize(logicalRelation), mapTypeReference);
    Path expectedCommandNodePath =
        Paths.get("src", "test", "resources", "test_data", "serde", "insertintofs-node.json");
    Path expectedHadoopFSNodePath =
        Paths.get("src", "test", "resources", "test_data", "serde", "hadoopfsrelation-node.json");
    Map<String, Object> expectedCommandNode =
        objectMapper.readValue(expectedCommandNodePath.toFile(), mapTypeReference);
    Map<String, Object> expectedHadoopFSNode =
        objectMapper.readValue(expectedHadoopFSNodePath.toFile(), mapTypeReference);
    // exprId values differ between runs, so they are excluded from the comparison.
    assertThat(commandActualNode)
        .satisfies(new MatchesMapRecursively(expectedCommandNode, Collections.singleton("exprId")));
    assertThat(hadoopFSActualNode)
        .satisfies(new MatchesMapRecursively(expectedHadoopFSNode, Collections.singleton("exprId")));
}
Also used: TableIdentifier (org.apache.spark.sql.catalyst.TableIdentifier), Path (java.nio.file.Path), SparkSession (org.apache.spark.sql.SparkSession), StructType (org.apache.spark.sql.types.StructType), AttributeReference (org.apache.spark.sql.catalyst.expressions.AttributeReference), InsertIntoHadoopFsRelationCommand (org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand), LogicalRelation (org.apache.spark.sql.execution.datasources.LogicalRelation), StructField (org.apache.spark.sql.types.StructField), HadoopFsRelation (org.apache.spark.sql.execution.datasources.HadoopFsRelation), CatalogFileIndex (org.apache.spark.sql.execution.datasources.CatalogFileIndex), TextFileFormat (org.apache.spark.sql.execution.datasources.text.TextFileFormat), Test (org.junit.jupiter.api.Test)
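
MatchesMapRecursively is an OpenLineage test helper. A minimal sketch of the idea it names, comparing nested maps while skipping run-dependent keys such as "exprId"; the real helper may differ in details:

import java.util.Map;
import java.util.Objects;
import java.util.Set;

class RecursiveMapMatchSketch {
    // True when `actual` contains every entry of `expected`, descending into
    // nested maps and skipping ignored keys whose values vary between runs.
    @SuppressWarnings("unchecked")
    static boolean matches(Map<String, Object> expected, Map<String, Object> actual, Set<String> ignored) {
        for (Map.Entry<String, Object> e : expected.entrySet()) {
            if (ignored.contains(e.getKey())) {
                continue;
            }
            Object actualValue = actual.get(e.getKey());
            if (e.getValue() instanceof Map && actualValue instanceof Map) {
                if (!matches((Map<String, Object>) e.getValue(), (Map<String, Object>) actualValue, ignored)) {
                    return false;
                }
            } else if (!Objects.equals(e.getValue(), actualValue)) {
                return false;
            }
        }
        return true;
    }
}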

Example 4 with HadoopFsRelation

Use of org.apache.spark.sql.execution.datasources.HadoopFsRelation in the OpenLineage/OpenLineage project.

From the class LogicalRelationDatasetBuilderTest, method testApplyForHadoopFsRelationDatasetVersionFacet.

@Test
void testApplyForHadoopFsRelationDatasetVersionFacet() {
    // openLineageContext, session, openLineage, facet, SOME_VERSION, and visitor
    // are fixtures of the test class, set up outside this excerpt.
    HadoopFsRelation hadoopFsRelation = mock(HadoopFsRelation.class);
    LogicalRelation logicalRelation = mock(LogicalRelation.class);
    Configuration hadoopConfig = mock(Configuration.class);
    SparkContext sparkContext = mock(SparkContext.class);
    SessionState sessionState = mock(SessionState.class);
    FileIndex fileIndex = mock(FileIndex.class);
    Path path = new Path("/tmp/path1");
    when(logicalRelation.relation()).thenReturn(hadoopFsRelation);
    when(openLineageContext.getSparkContext()).thenReturn(sparkContext);
    when(openLineageContext.getSparkSession()).thenReturn(Optional.of(session));
    when(facet.getDatasetVersion()).thenReturn(SOME_VERSION);
    when(session.sessionState()).thenReturn(sessionState);
    when(sessionState.newHadoopConfWithOptions(any())).thenReturn(hadoopConfig);
    when(hadoopFsRelation.location()).thenReturn(fileIndex);
    when(fileIndex.rootPaths())
        .thenReturn(
            scala.collection.JavaConverters.collectionAsScalaIterableConverter(Arrays.asList(path))
                .asScala()
                .toSeq());
    when(openLineageContext.getOpenLineage()).thenReturn(openLineage);
    when(openLineage.newDatasetFacetsBuilder()).thenReturn(new OpenLineage.DatasetFacetsBuilder());
    when(openLineage.newDatasetVersionDatasetFacet(SOME_VERSION)).thenReturn(facet);
    try (MockedStatic<PlanUtils> mocked = mockStatic(PlanUtils.class)) {
        try (MockedStatic<DatasetVersionDatasetFacetUtils> mockedFacetUtils =
                mockStatic(DatasetVersionDatasetFacetUtils.class)) {
            when(PlanUtils.getDirectoryPath(path, hadoopConfig)).thenReturn(new Path("/tmp"));
            // Stub the version extraction so the builder attaches a dataset version facet.
            when(DatasetVersionDatasetFacetUtils.extractVersionFromLogicalRelation(logicalRelation))
                .thenReturn(Optional.of(SOME_VERSION));
            List<OpenLineage.Dataset> datasets = visitor.apply(logicalRelation);
            assertEquals(1, datasets.size());
            OpenLineage.Dataset ds = datasets.get(0);
            assertEquals("/tmp", ds.getName());
            assertEquals(SOME_VERSION, ds.getFacets().getVersion().getDatasetVersion());
        }
    }
}
Also used: Path (org.apache.hadoop.fs.Path), SessionState (org.apache.spark.sql.internal.SessionState), Configuration (org.apache.hadoop.conf.Configuration), MockedStatic (org.mockito.MockedStatic), LogicalRelation (org.apache.spark.sql.execution.datasources.LogicalRelation), FileIndex (org.apache.spark.sql.execution.datasources.FileIndex), SparkContext (org.apache.spark.SparkContext), OpenLineage (io.openlineage.client.OpenLineage), HadoopFsRelation (org.apache.spark.sql.execution.datasources.HadoopFsRelation), Test (org.junit.jupiter.api.Test)
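
Both builder tests stub static methods through Mockito's mockStatic. The plain when(...) calls above work while the static mock is open, but the lambda form on MockedStatic is the one the Mockito documentation recommends. A small self-contained illustration of the pattern, with FileUtils.resolveParent as a hypothetical static helper invented for this snippet:

import static org.mockito.Mockito.mockStatic;

import org.mockito.MockedStatic;

class StaticStubSketch {
    // Hypothetical static helper, used only to illustrate static stubbing.
    static class FileUtils {
        static String resolveParent(String path) {
            return path.substring(0, path.lastIndexOf('/'));
        }
    }

    static void demo() {
        // The stub is active only inside the try-with-resources scope and is
        // rolled back automatically when the MockedStatic is closed.
        try (MockedStatic<FileUtils> mocked = mockStatic(FileUtils.class)) {
            mocked.when(() -> FileUtils.resolveParent("/tmp/path1")).thenReturn("/stubbed");
            assert FileUtils.resolveParent("/tmp/path1").equals("/stubbed");
        }
        // Real behavior is restored once the mock scope closes.
        assert FileUtils.resolveParent("/tmp/path1").equals("/tmp");
    }
}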

Aggregations

HadoopFsRelation (org.apache.spark.sql.execution.datasources.HadoopFsRelation): 4 usages
LogicalRelation (org.apache.spark.sql.execution.datasources.LogicalRelation): 3 usages
Test (org.junit.jupiter.api.Test): 3 usages
OpenLineage (io.openlineage.client.OpenLineage): 2 usages
Configuration (org.apache.hadoop.conf.Configuration): 2 usages
Path (org.apache.hadoop.fs.Path): 2 usages
SparkContext (org.apache.spark.SparkContext): 2 usages
FileIndex (org.apache.spark.sql.execution.datasources.FileIndex): 2 usages
SessionState (org.apache.spark.sql.internal.SessionState): 2 usages
MockedStatic (org.mockito.MockedStatic): 2 usages
OutputDataset (io.openlineage.client.OpenLineage.OutputDataset): 1 usage
Path (java.nio.file.Path): 1 usage
SparkSession (org.apache.spark.sql.SparkSession): 1 usage
TableIdentifier (org.apache.spark.sql.catalyst.TableIdentifier): 1 usage
AttributeReference (org.apache.spark.sql.catalyst.expressions.AttributeReference): 1 usage
TahoeLogFileIndex (org.apache.spark.sql.delta.files.TahoeLogFileIndex): 1 usage
CatalogFileIndex (org.apache.spark.sql.execution.datasources.CatalogFileIndex): 1 usage
InsertIntoHadoopFsRelationCommand (org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand): 1 usage
TextFileFormat (org.apache.spark.sql.execution.datasources.text.TextFileFormat): 1 usage
StructField (org.apache.spark.sql.types.StructField): 1 usage