Search in sources: usage examples for io.openlineage.client.OpenLineage.InputDataset in the OpenLineage project.

Example 1 with InputDataset

Use of io.openlineage.client.OpenLineage.InputDataset in the OpenLineage project.

From class InternalEventHandlerFactory, method createInputDatasetBuilder:

@Override
public Collection<PartialFunction<Object, List<InputDataset>>> createInputDatasetBuilder(OpenLineageContext context) {
    // Combine the input dataset builders contributed by every registered event
    // handler factory with the builders supplied by the DatasetBuilderFactoryProvider.
    ImmutableList<PartialFunction<Object, List<InputDataset>>> builders =
        ImmutableList.<PartialFunction<Object, List<InputDataset>>>builder()
            .addAll(generate(eventHandlerFactories, factory -> factory.createInputDatasetBuilder(context)))
            .addAll(DatasetBuilderFactoryProvider.getInstance().getInputBuilders(context))
            .build();
    context.getInputDatasetBuilders().addAll(builders);
    return builders;
}
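For context, a minimal sketch of a factory that could be aggregated by the method above. The createInputDatasetBuilder signature and context.getOpenLineage() are taken from the snippets on this page; MyCustomRelation and its namespace()/name() accessors are hypothetical stand-ins for whatever logical-plan node a real factory knows how to read.

import java.util.Collection;
import java.util.Collections;
import java.util.List;
import io.openlineage.client.OpenLineage.InputDataset;
import io.openlineage.spark.api.OpenLineageContext;
import io.openlineage.spark.api.OpenLineageEventHandlerFactory;
import scala.PartialFunction;
import scala.runtime.AbstractPartialFunction;

public class MyEventHandlerFactory implements OpenLineageEventHandlerFactory {

    @Override
    public Collection<PartialFunction<Object, List<InputDataset>>> createInputDatasetBuilder(OpenLineageContext context) {
        PartialFunction<Object, List<InputDataset>> builder =
            new AbstractPartialFunction<Object, List<InputDataset>>() {
                @Override
                public boolean isDefinedAt(Object node) {
                    // Claim only the plan nodes this builder understands.
                    return node instanceof MyCustomRelation; // hypothetical node type
                }

                @Override
                public List<InputDataset> apply(Object node) {
                    MyCustomRelation relation = (MyCustomRelation) node;
                    // newInputDataset(namespace, name, facets, inputFacets)
                    return Collections.singletonList(
                        context.getOpenLineage()
                            .newInputDataset(relation.namespace(), relation.name(), null, null));
                }
            };
        return Collections.singletonList(builder);
    }
}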

Example 2 with InputDataset

Use of io.openlineage.client.OpenLineage.InputDataset in the OpenLineage project.

From class OpenLineageRunEventBuilder, method buildInputDatasets:

private List<OpenLineage.InputDataset> buildInputDatasets(List<Object> nodes) {
    openLineageContext.getQueryExecution().ifPresent(qe -> {
        if (log.isDebugEnabled()) {
            log.debug("Traversing optimized plan {}", qe.optimizedPlan().toJSON());
            log.debug("Physical plan executed {}", qe.executedPlan().toJSON());
        }
    });
    log.info("Visiting query plan {} with input dataset builders {}", openLineageContext.getQueryExecution(), inputDatasetBuilders);
    // Merge the registered query-plan visitors into one partial function that
    // can be applied to every node of the optimized logical plan.
    Function1<LogicalPlan, Collection<InputDataset>> inputVisitor = visitLogicalPlan(PlanUtils.merge(inputDatasetQueryPlanVisitors));
    // Inputs come from two sources: builders applied to the listener-event
    // nodes, and visitors applied to the optimized logical plan, if present.
    List<OpenLineage.InputDataset> datasets = Stream.concat(
            buildDatasets(nodes, inputDatasetBuilders),
            openLineageContext.getQueryExecution()
                .map(qe -> fromSeq(qe.optimizedPlan().map(inputVisitor)).stream()
                    .flatMap(Collection::stream)
                    .map(((Class<InputDataset>) InputDataset.class)::cast))
                .orElse(Stream.empty()))
        .collect(Collectors.toList());
    OpenLineage openLineage = openLineageContext.getOpenLineage();
    if (!datasets.isEmpty()) {
        // Collect input-specific facets, keyed by facet name.
        Map<String, InputDatasetFacet> inputFacetsMap = new HashMap<>();
        nodes.forEach(event -> inputDatasetFacetBuilders.forEach(fn -> fn.accept(event, inputFacetsMap::put)));
        // Collect general dataset facets, keyed by facet name.
        Map<String, DatasetFacet> datasetFacetsMap = new HashMap<>();
        nodes.forEach(event -> datasetFacetBuilders.forEach(fn -> fn.accept(event, datasetFacetsMap::put)));
        // Rebuild each dataset with the collected facets merged into its facet containers.
        return datasets.stream()
            .map(ds -> openLineage.newInputDatasetBuilder()
                .name(ds.getName())
                .namespace(ds.getNamespace())
                .inputFacets(mergeFacets(inputFacetsMap, ds.getInputFacets(), InputDatasetInputFacets.class))
                .facets(mergeFacets(datasetFacetsMap, ds.getFacets(), DatasetFacets.class))
                .build())
            .collect(Collectors.toList());
    }
    return datasets;
}
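The facet builders iterated above receive each listener event together with a (facetName, facet) callback. In the project these builders are based on io.openlineage.spark.api.CustomFacetBuilder (imported in the original source); as a rough illustration of the contract only, a hypothetical builder implementing the same (event, consumer) shape directly might look like this. The facet name and values are placeholders.

import java.util.function.BiConsumer;
import io.openlineage.client.OpenLineage;
import io.openlineage.client.OpenLineage.InputDatasetFacet;
import org.apache.spark.scheduler.SparkListenerJobStart;

public class JobStartInputFacetBuilder
        implements BiConsumer<Object, BiConsumer<String, ? super InputDatasetFacet>> {

    private final OpenLineage openLineage;

    public JobStartInputFacetBuilder(OpenLineage openLineage) {
        this.openLineage = openLineage;
    }

    @Override
    public void accept(Object event, BiConsumer<String, ? super InputDatasetFacet> consumer) {
        // Ignore events this builder does not understand.
        if (event instanceof SparkListenerJobStart) {
            // Register a facet under its name; buildInputDatasets later merges
            // the collected map into each dataset's InputDatasetInputFacets.
            consumer.accept(
                "dataQualityMetrics",
                openLineage.newDataQualityMetricsInputDatasetFacetBuilder()
                    .rowCount(0L)
                    .bytes(0L)
                    .build());
        }
    }
}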

Example 3 with InputDataset

Use of io.openlineage.client.OpenLineage.InputDataset in the OpenLineage project.

From class OpenLineageTest, method jsonSerialization:

@Test
public void jsonSerialization() throws JsonProcessingException {
    ZonedDateTime now = ZonedDateTime.now(ZoneId.of("UTC"));
    URI producer = URI.create("producer");
    OpenLineage ol = new OpenLineage(producer);
    UUID runId = UUID.randomUUID();
    RunFacets runFacets = ol.newRunFacetsBuilder().nominalTime(ol.newNominalTimeRunFacet(now, now)).build();
    Run run = ol.newRun(runId, runFacets);
    String name = "jobName";
    String namespace = "namespace";
    JobFacets jobFacets = ol.newJobFacetsBuilder().build();
    Job job = ol.newJob(namespace, name, jobFacets);
    List<InputDataset> inputs = Arrays.asList(ol.newInputDataset("ins", "input", null, null));
    List<OutputDataset> outputs = Arrays.asList(ol.newOutputDataset("ons", "output", null, null));
    RunEvent runStateUpdate = ol.newRunEvent(OpenLineage.RunEvent.EventType.START, now, run, job, inputs, outputs);
    String json = mapper.writeValueAsString(runStateUpdate);
    RunEvent read = mapper.readValue(json, RunEvent.class);
    assertEquals(producer, read.getProducer());
    assertEquals(runId, read.getRun().getRunId());
    assertEquals(name, read.getJob().getName());
    assertEquals(namespace, read.getJob().getNamespace());
    assertEquals(runStateUpdate.getEventType(), read.getEventType());
    assertEquals(runStateUpdate.getEventTime(), read.getEventTime());
    assertEquals(1, runStateUpdate.getInputs().size());
    NominalTimeRunFacet nominalTime = runStateUpdate.getRun().getFacets().getNominalTime();
    assertEquals(now, nominalTime.getNominalStartTime());
    assertEquals(now, nominalTime.getNominalEndTime());
    InputDataset inputDataset = runStateUpdate.getInputs().get(0);
    assertEquals("ins", inputDataset.getNamespace());
    assertEquals("input", inputDataset.getName());
    assertEquals(1, runStateUpdate.getOutputs().size());
    OutputDataset outputDataset = runStateUpdate.getOutputs().get(0);
    assertEquals("ons", outputDataset.getNamespace());
    assertEquals("output", outputDataset.getName());
    assertEquals(roundTrip(json), roundTrip(mapper.writeValueAsString(read)));
}
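The roundTrip helper used in the final assertion is not part of this excerpt. A plausible stand-in (an assumption about its intent, not necessarily the project's actual helper) parses the JSON into a tree and re-serializes it, so whitespace and formatting differences do not affect the comparison:

import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

// Normalize a JSON string by parsing it into a tree and writing it back out.
private String roundTrip(String json) throws JsonProcessingException {
    ObjectMapper normalizer = new ObjectMapper();
    JsonNode tree = normalizer.readTree(json);
    return normalizer.writeValueAsString(tree);
}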

Example 4 with InputDataset

Use of io.openlineage.client.OpenLineage.InputDataset in the OpenLineage project.

From class OpenLineageTest, method factory:

@Test
public void factory() throws JsonProcessingException {
    ZonedDateTime now = ZonedDateTime.now(ZoneId.of("UTC"));
    URI producer = URI.create("producer");
    OpenLineage ol = new OpenLineage(producer);
    UUID runId = UUID.randomUUID();
    RunFacets runFacets = ol.newRunFacetsBuilder()
        .nominalTime(ol.newNominalTimeRunFacetBuilder().nominalStartTime(now).nominalEndTime(now).build())
        .build();
    Run run = ol.newRunBuilder().runId(runId).facets(runFacets).build();
    String name = "jobName";
    String namespace = "namespace";
    JobFacets jobFacets = ol.newJobFacetsBuilder().build();
    Job job = ol.newJobBuilder().namespace(namespace).name(name).facets(jobFacets).build();
    List<InputDataset> inputs = Arrays.asList(
        ol.newInputDatasetBuilder().namespace("ins").name("input")
            .facets(ol.newDatasetFacetsBuilder()
                .version(ol.newDatasetVersionDatasetFacet("input-version")).build())
            .inputFacets(ol.newInputDatasetInputFacetsBuilder()
                .dataQualityMetrics(ol.newDataQualityMetricsInputDatasetFacetBuilder()
                    .rowCount(10L).bytes(20L)
                    .columnMetrics(ol.newDataQualityMetricsInputDatasetFacetColumnMetricsBuilder()
                        .put("mycol", ol.newDataQualityMetricsInputDatasetFacetColumnMetricsAdditionalBuilder()
                            .count(10D).distinctCount(10L).max(30D).min(5D).nullCount(1L).sum(3000D)
                            .quantiles(ol.newDataQualityMetricsInputDatasetFacetColumnMetricsAdditionalQuantilesBuilder()
                                .put("25", 52D).build())
                            .build())
                        .build())
                    .build())
                .build())
            .build());
    List<OutputDataset> outputs = Arrays.asList(
        ol.newOutputDatasetBuilder().namespace("ons").name("output")
            .facets(ol.newDatasetFacetsBuilder()
                .version(ol.newDatasetVersionDatasetFacet("output-version")).build())
            .outputFacets(ol.newOutputDatasetOutputFacetsBuilder()
                .outputStatistics(ol.newOutputStatisticsOutputDatasetFacet(10L, 20L)).build())
            .build());
    RunEvent runStateUpdate = ol.newRunEventBuilder()
        .eventType(OpenLineage.RunEvent.EventType.START)
        .eventTime(now)
        .run(run)
        .job(job)
        .inputs(inputs)
        .outputs(outputs)
        .build();
    ObjectMapper mapper = new ObjectMapper();
    mapper.registerModule(new JavaTimeModule());
    mapper.setSerializationInclusion(Include.NON_NULL);
    mapper.disable(SerializationFeature.WRITE_DATES_AS_TIMESTAMPS);
    mapper.configure(SerializationFeature.INDENT_OUTPUT, true);
    String json = mapper.writeValueAsString(runStateUpdate);
    {
        RunEvent read = mapper.readValue(json, RunEvent.class);
        assertEquals(producer, read.getProducer());
        assertEquals(runId, read.getRun().getRunId());
        assertEquals(name, read.getJob().getName());
        assertEquals(namespace, read.getJob().getNamespace());
        assertEquals(runStateUpdate.getEventType(), read.getEventType());
        assertEquals(runStateUpdate.getEventTime(), read.getEventTime());
        assertEquals(1, runStateUpdate.getInputs().size());
        InputDataset inputDataset = runStateUpdate.getInputs().get(0);
        assertEquals("ins", inputDataset.getNamespace());
        assertEquals("input", inputDataset.getName());
        assertEquals("input-version", inputDataset.getFacets().getVersion().getDatasetVersion());
        DataQualityMetricsInputDatasetFacet dq = inputDataset.getInputFacets().getDataQualityMetrics();
        assertEquals((Long) 10L, dq.getRowCount());
        assertEquals((Long) 20L, dq.getBytes());
        DataQualityMetricsInputDatasetFacetColumnMetricsAdditional colMetrics = dq.getColumnMetrics().getAdditionalProperties().get("mycol");
        assertEquals((Double) 10D, colMetrics.getCount());
        assertEquals((Long) 10L, colMetrics.getDistinctCount());
        assertEquals((Double) 30D, colMetrics.getMax());
        assertEquals((Double) 5D, colMetrics.getMin());
        assertEquals((Long) 1L, colMetrics.getNullCount());
        assertEquals((Double) 3000D, colMetrics.getSum());
        assertEquals((Double) 52D, colMetrics.getQuantiles().getAdditionalProperties().get("25"));
        assertEquals(1, runStateUpdate.getOutputs().size());
        OutputDataset outputDataset = runStateUpdate.getOutputs().get(0);
        assertEquals("ons", outputDataset.getNamespace());
        assertEquals("output", outputDataset.getName());
        assertEquals("output-version", outputDataset.getFacets().getVersion().getDatasetVersion());
        assertEquals(roundTrip(json), roundTrip(mapper.writeValueAsString(read)));
        assertEquals((Long) 10L, outputDataset.getOutputFacets().getOutputStatistics().getRowCount());
        assertEquals((Long) 20L, outputDataset.getOutputFacets().getOutputStatistics().getSize());
        assertEquals(json, mapper.writeValueAsString(read));
    }
    {
        io.openlineage.server.OpenLineage.RunEvent readServer = mapper.readValue(json, io.openlineage.server.OpenLineage.RunEvent.class);
        assertEquals(producer, readServer.getProducer());
        assertEquals(runId, readServer.getRun().getRunId());
        assertEquals(name, readServer.getJob().getName());
        assertEquals(namespace, readServer.getJob().getNamespace());
        assertEquals(runStateUpdate.getEventType().name(), readServer.getEventType().name());
        assertEquals(runStateUpdate.getEventTime(), readServer.getEventTime());
        assertEquals(json, mapper.writeValueAsString(readServer));
    }
}

Example 5 with InputDataset

Use of io.openlineage.client.OpenLineage.InputDataset in the OpenLineage project.

From class AbstractQueryPlanDatasetBuilderTest, method testApplyOnBuilderWithGenericArg:

@Test
public void testApplyOnBuilderWithGenericArg() {
    SparkSession session = SparkSession.builder().config("spark.sql.warehouse.dir", "/tmp/warehouse").master("local").getOrCreate();
    OpenLineage openLineage = new OpenLineage(OpenLineageClient.OPEN_LINEAGE_CLIENT_URI);
    InputDataset expected = openLineage.newInputDataset("namespace", "the_name", null, null);
    OpenLineageContext context = createContext(session, openLineage);
    MyGenericArgInputDatasetBuilder<SparkListenerJobEnd> builder = new MyGenericArgInputDatasetBuilder<>(context, true, expected);
    SparkListenerJobEnd jobEnd = new SparkListenerJobEnd(1, 2, null);
    // Even though our instance of builder is parameterized with SparkListenerJobEnd, it's not
    // *compiled* with that argument, so the isDefinedAt method fails to resolve the type arg
    Assertions.assertFalse(((PartialFunction) builder).isDefinedAt(jobEnd));
}
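The assertion above hinges on Java type erasure: a type argument supplied only at the call site is not recorded anywhere at runtime, whereas a subclass that fixes the argument at compile time keeps it in its class metadata. A self-contained sketch of the distinction, unrelated to the OpenLineage types:

import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;

public class ErasureDemo {
    static class Box<T> {}

    // Fixes the type argument at compile time, so it survives in class metadata.
    static class StringBox extends Box<String> {}

    public static void main(String[] args) {
        // The type argument here exists only at the call site; the runtime
        // object is just a Box, so nothing can recover String from it.
        Box<String> erased = new Box<>();
        System.out.println(erased.getClass().getGenericSuperclass()); // class java.lang.Object

        // An anonymous subclass records the argument in its superclass signature.
        Box<String> anonymous = new Box<String>() {};
        Type anonSuper = anonymous.getClass().getGenericSuperclass();
        System.out.println(((ParameterizedType) anonSuper).getActualTypeArguments()[0]); // class java.lang.String

        // A named subclass keeps it too; this is why isDefinedAt can only
        // resolve the type argument when it was fixed at compile time.
        Type namedSuper = new StringBox().getClass().getGenericSuperclass();
        System.out.println(((ParameterizedType) namedSuper).getActualTypeArguments()[0]); // class java.lang.String
    }
}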
