Search in sources:

Example 1 with ScalaConversionUtils

Use of io.openlineage.spark.agent.util.ScalaConversionUtils in project OpenLineage by OpenLineage.

The class IcebergHandler, method getDatasetIdentifier:

@Override
public DatasetIdentifier getDatasetIdentifier(SparkSession session, TableCatalog tableCatalog, Identifier identifier, Map<String, String> properties) {
    SparkCatalog sparkCatalog = (SparkCatalog) tableCatalog;
    String catalogName = sparkCatalog.name();
    String prefix = String.format("spark.sql.catalog.%s", catalogName);
    Map<String, String> conf = ScalaConversionUtils.<String, String>fromMap(session.conf().getAll());
    log.info(conf.toString());
    Map<String, String> catalogConf =
        conf.entrySet().stream()
            .filter(x -> x.getKey().startsWith(prefix))
            .filter(x -> x.getKey().length() > prefix.length())
            .collect(
                Collectors.toMap(
                    // handle dot after prefix
                    x -> x.getKey().substring(prefix.length() + 1),
                    Map.Entry::getValue));
    log.info(catalogConf.toString());
    if (catalogConf.isEmpty() || !catalogConf.containsKey("type")) {
        throw new UnsupportedCatalogException(catalogName);
    }
    log.info(catalogConf.get("type"));
    switch(catalogConf.get("type")) {
        case "hadoop":
            return getHadoopIdentifier(catalogConf, identifier.toString());
        case "hive":
            return getHiveIdentifier(session, catalogConf.get(CatalogProperties.URI), identifier.toString());
        default:
            throw new UnsupportedCatalogException(catalogConf.get("type"));
    }
}
Also used : SparkCatalog(org.apache.iceberg.spark.SparkCatalog) SneakyThrows(lombok.SneakyThrows) DatasetIdentifier(io.openlineage.spark.agent.util.DatasetIdentifier) PathUtils(io.openlineage.spark.agent.util.PathUtils) ScalaConversionUtils(io.openlineage.spark.agent.util.ScalaConversionUtils) Collectors(java.util.stream.Collectors) CatalogProperties(org.apache.iceberg.CatalogProperties) Slf4j(lombok.extern.slf4j.Slf4j) TableCatalog(org.apache.spark.sql.connector.catalog.TableCatalog) NoSuchTableException(org.apache.spark.sql.catalyst.analysis.NoSuchTableException) TableProviderFacet(io.openlineage.spark.agent.facets.TableProviderFacet) Map(java.util.Map) Optional(java.util.Optional) Path(org.apache.hadoop.fs.Path) URI(java.net.URI) Identifier(org.apache.spark.sql.connector.catalog.Identifier) SparkTable(org.apache.iceberg.spark.source.SparkTable) Nullable(javax.annotation.Nullable) SparkConfUtils(io.openlineage.spark.agent.util.SparkConfUtils) SparkSession(org.apache.spark.sql.SparkSession) SparkCatalog(org.apache.iceberg.spark.SparkCatalog) Map(java.util.Map)
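
For context, a minimal self-contained sketch of the prefix-filtering step above, applied to an ordinary java.util.Map, i.e. after ScalaConversionUtils.fromMap has already bridged the Scala configuration map. The catalog name and property values here are illustrative assumptions, not taken from the project.

import java.util.HashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class CatalogConfFilterSketch {
    public static void main(String[] args) {
        // Hypothetical catalog name and Spark properties, for illustration only.
        String catalogName = "iceberg";
        String prefix = String.format("spark.sql.catalog.%s", catalogName);

        Map<String, String> conf = new HashMap<>();
        conf.put("spark.sql.catalog.iceberg", "org.apache.iceberg.spark.SparkCatalog"); // no dot suffix, filtered out
        conf.put("spark.sql.catalog.iceberg.type", "hive");
        conf.put("spark.sql.catalog.iceberg.uri", "thrift://metastore:9083");
        conf.put("spark.sql.warehouse.dir", "/warehouse"); // different prefix, filtered out

        Map<String, String> catalogConf =
            conf.entrySet().stream()
                .filter(x -> x.getKey().startsWith(prefix))
                .filter(x -> x.getKey().length() > prefix.length())
                .collect(
                    Collectors.toMap(
                        x -> x.getKey().substring(prefix.length() + 1), // drop "prefix."
                        Map.Entry::getValue));

        System.out.println(catalogConf); // {type=hive, uri=thrift://metastore:9083} (order may vary)
    }
}

The entry without a dot suffix and the entry with a different prefix are both dropped, which is why the handler can rely on catalogConf holding only keys such as type and uri.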

Example 2 with ScalaConversionUtils

Use of io.openlineage.spark.agent.util.ScalaConversionUtils in project OpenLineage by OpenLineage.

The class OpenLineageSparkListener, method onJobStart:

/**
 * called by the SparkListener when a job starts
 */
@Override
public void onJobStart(SparkListenerJobStart jobStart) {
    Optional<ActiveJob> activeJob =
        asJavaOptional(
                SparkSession.getDefaultSession()
                    .map(sparkContextFromSession)
                    .orElse(activeSparkContext))
            .flatMap(ctx ->
                Optional.ofNullable(ctx.dagScheduler())
                    .map(ds -> ds.jobIdToActiveJob().get(jobStart.jobId())))
            .flatMap(ScalaConversionUtils::asJavaOptional);
    Set<Integer> stages = ScalaConversionUtils.fromSeq(jobStart.stageIds()).stream().map(Integer.class::cast).collect(Collectors.toSet());
    jobMetrics.addJobStages(jobStart.jobId(), stages);
    ExecutionContext context =
        Optional.ofNullable(getSqlExecutionId(jobStart.properties()))
            .map(Optional::of)
            .orElseGet(() ->
                asJavaOptional(
                        SparkSession.getDefaultSession()
                            .map(sparkContextFromSession)
                            .orElse(activeSparkContext))
                    .flatMap(ctx ->
                        Optional.ofNullable(ctx.dagScheduler())
                            .map(ds -> ds.jobIdToActiveJob().get(jobStart.jobId()))
                            .flatMap(ScalaConversionUtils::asJavaOptional))
                    .map(job -> getSqlExecutionId(job.properties())))
            .map(id -> {
                long executionId = Long.parseLong(id);
                return getExecutionContext(jobStart.jobId(), executionId);
            })
            .orElseGet(() -> getExecutionContext(jobStart.jobId()));
    // set it in the rddExecutionRegistry so jobEnd is called
    rddExecutionRegistry.put(jobStart.jobId(), context);
    activeJob.ifPresent(context::setActiveJob);
    context.start(jobStart);
}
Also used : OpenLineageClient(io.openlineage.spark.agent.client.OpenLineageClient) SparkListenerApplicationStart(org.apache.spark.scheduler.SparkListenerApplicationStart) DEFAULTS(io.openlineage.spark.agent.ArgumentParser.DEFAULTS) URISyntaxException(java.net.URISyntaxException) ZonedDateTime(java.time.ZonedDateTime) Function0(scala.Function0) Function1(scala.Function1) SparkConfUtils.findSparkConfigKey(io.openlineage.spark.agent.util.SparkConfUtils.findSparkConfigKey) HashMap(java.util.HashMap) ScalaConversionUtils.asJavaOptional(io.openlineage.spark.agent.util.ScalaConversionUtils.asJavaOptional) Map(java.util.Map) Configuration(org.apache.hadoop.conf.Configuration) SparkListenerTaskEnd(org.apache.spark.scheduler.SparkListenerTaskEnd) SparkListenerSQLExecutionStart(org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart) SparkContext$(org.apache.spark.SparkContext$) ContextFactory(io.openlineage.spark.agent.lifecycle.ContextFactory) SparkListenerApplicationEnd(org.apache.spark.scheduler.SparkListenerApplicationEnd) SparkEnv(org.apache.spark.SparkEnv) WeakHashMap(java.util.WeakHashMap) SparkListenerJobEnd(org.apache.spark.scheduler.SparkListenerJobEnd) SparkSession(org.apache.spark.sql.SparkSession) SparkListenerSQLExecutionEnd(org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionEnd) PrintWriter(java.io.PrintWriter) Properties(java.util.Properties) ActiveJob(org.apache.spark.scheduler.ActiveJob) SparkListenerJobStart(org.apache.spark.scheduler.SparkListenerJobStart) ByteArrayOutputStream(org.apache.commons.io.output.ByteArrayOutputStream) SparkConf(org.apache.spark.SparkConf) SparkContext(org.apache.spark.SparkContext) Set(java.util.Set) ScalaConversionUtils(io.openlineage.spark.agent.util.ScalaConversionUtils) Field(java.lang.reflect.Field) Option(scala.Option) Collectors(java.util.stream.Collectors) SparkListenerEvent(org.apache.spark.scheduler.SparkListenerEvent) Slf4j(lombok.extern.slf4j.Slf4j) SparkConfUtils.findSparkUrlParams(io.openlineage.spark.agent.util.SparkConfUtils.findSparkUrlParams) Optional(java.util.Optional) PairRDDFunctions(org.apache.spark.rdd.PairRDDFunctions) ExecutionContext(io.openlineage.spark.agent.lifecycle.ExecutionContext) PairRDDFunctionsTransformer(io.openlineage.spark.agent.transformers.PairRDDFunctionsTransformer) OpenLineage(io.openlineage.client.OpenLineage) Collections(java.util.Collections) RDD(org.apache.spark.rdd.RDD) SparkEnv$(org.apache.spark.SparkEnv$) ExecutionContext(io.openlineage.spark.agent.lifecycle.ExecutionContext) ActiveJob(org.apache.spark.scheduler.ActiveJob) ScalaConversionUtils(io.openlineage.spark.agent.util.ScalaConversionUtils)
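
As background on the Scala-to-Java bridging the listener relies on, here is a minimal sketch of what a helper like ScalaConversionUtils.asJavaOptional is expected to do. This is an assumption about its behavior, not the project's actual implementation, and it requires the Scala library on the classpath.

import java.util.Optional;
import scala.Option;

public class OptionBridgeSketch {

    // Assumed behavior: a defined scala.Option maps to Optional.of, an empty one to Optional.empty.
    static <T> Optional<T> asJavaOptional(Option<T> opt) {
        if (opt.isDefined()) {
            return Optional.of(opt.get());
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Option<String> some = Option.apply("42");          // e.g. a SQL execution id found in job properties (hypothetical)
        Option<String> none = Option.<String>apply(null);  // Option.apply(null) evaluates to None

        System.out.println(asJavaOptional(some)); // Optional[42]
        System.out.println(asJavaOptional(none)); // Optional.empty
    }
}

In onJobStart, this bridging is what lets the listener fold Spark's Option-returning lookups (the default session, the active job from the DAG scheduler) into ordinary java.util.Optional chains.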

Aggregations

ScalaConversionUtils (io.openlineage.spark.agent.util.ScalaConversionUtils) 2
Map (java.util.Map) 2
Optional (java.util.Optional) 2
Collectors (java.util.stream.Collectors) 2
Slf4j (lombok.extern.slf4j.Slf4j) 2
SparkSession (org.apache.spark.sql.SparkSession) 2
OpenLineage (io.openlineage.client.OpenLineage) 1
DEFAULTS (io.openlineage.spark.agent.ArgumentParser.DEFAULTS) 1
OpenLineageClient (io.openlineage.spark.agent.client.OpenLineageClient) 1
TableProviderFacet (io.openlineage.spark.agent.facets.TableProviderFacet) 1
ContextFactory (io.openlineage.spark.agent.lifecycle.ContextFactory) 1
ExecutionContext (io.openlineage.spark.agent.lifecycle.ExecutionContext) 1
PairRDDFunctionsTransformer (io.openlineage.spark.agent.transformers.PairRDDFunctionsTransformer) 1
DatasetIdentifier (io.openlineage.spark.agent.util.DatasetIdentifier) 1
PathUtils (io.openlineage.spark.agent.util.PathUtils) 1
ScalaConversionUtils.asJavaOptional (io.openlineage.spark.agent.util.ScalaConversionUtils.asJavaOptional) 1
SparkConfUtils (io.openlineage.spark.agent.util.SparkConfUtils) 1
SparkConfUtils.findSparkConfigKey (io.openlineage.spark.agent.util.SparkConfUtils.findSparkConfigKey) 1
SparkConfUtils.findSparkUrlParams (io.openlineage.spark.agent.util.SparkConfUtils.findSparkUrlParams) 1
PrintWriter (java.io.PrintWriter) 1