Example 1 with FlowEdgeContext

use of org.apache.gobblin.service.modules.flow.FlowEdgeContext in project incubator-gobblin by apache.

the class AbstractPathFinder method constructPath.

/**
 * @param flowEdgeContext of the last {@link FlowEdge} in the path.
 * @return the path as a list of {@link FlowEdgeContext}s, ordered from the first edge to the last.
 */
List<FlowEdgeContext> constructPath(FlowEdgeContext flowEdgeContext) {
    // Backtrace from the last edge using the path map, prepending each predecessor so the list ends up in first-to-last order.
    List<FlowEdgeContext> path = new LinkedList<>();
    path.add(flowEdgeContext);
    FlowEdgeContext currentFlowEdgeContext = flowEdgeContext;
    // The first edge in the path maps to itself in pathMap; until it is reached, prepend the predecessor of the current edge.
    while (!this.pathMap.get(currentFlowEdgeContext).equals(currentFlowEdgeContext)) {
        path.add(0, this.pathMap.get(currentFlowEdgeContext));
        currentFlowEdgeContext = this.pathMap.get(currentFlowEdgeContext);
    }
    return path;
}
Also used : FlowEdgeContext(org.apache.gobblin.service.modules.flow.FlowEdgeContext) LinkedList(java.util.LinkedList)
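
The backtrace above relies on a predecessor map in which the first element of the path maps to itself. A minimal, self-contained sketch of that idea follows. Plain strings stand in for FlowEdgeContext objects, and the class name PathBacktrace is hypothetical, not part of Gobblin.

```java
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

class PathBacktrace {
    /**
     * Rebuilds a path from a predecessor map in which the first element maps to itself.
     * Strings are an assumption made for illustration; the original uses FlowEdgeContext.
     */
    public static List<String> constructPath(Map<String, String> pathMap, String last) {
        LinkedList<String> path = new LinkedList<>();
        path.add(last);
        String current = last;
        // An element that is its own predecessor marks the start of the path.
        while (!pathMap.get(current).equals(current)) {
            String parent = pathMap.get(current);
            path.addFirst(parent); // prepend, so the result reads first-to-last
            current = parent;
        }
        return path;
    }

    public static void main(String[] args) {
        Map<String, String> pathMap = new HashMap<>();
        pathMap.put("e1", "e1"); // first edge: its own predecessor
        pathMap.put("e2", "e1");
        pathMap.put("e3", "e2");
        System.out.println(constructPath(pathMap, "e3")); // prints [e1, e2, e3]
    }
}
```

Using a self-mapping entry as the sentinel avoids a separate "no parent" value, at the cost of requiring that every element reached during the walk is present in the map.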

Example 2 with FlowEdgeContext

use of org.apache.gobblin.service.modules.flow.FlowEdgeContext in project incubator-gobblin by apache.

the class AbstractPathFinder method findPath.

@Override
public FlowGraphPath findPath() throws PathFinderException {
    FlowGraphPath flowGraphPath = new FlowGraphPath(flowSpec, flowExecutionId);
    // Find a path to each destination node in the flow graph; every destination must be reachable.
    for (DataNode destNode : this.destNodes) {
        List<FlowEdgeContext> path = findPathUnicast(destNode);
        if (path != null) {
            log.info("Path to destination node {} found for flow {}. Path - {}", destNode.getId(), flowSpec.getUri(), path);
            flowGraphPath.addPath(path);
        } else {
            log.error("Path to destination node {} could not be found for flow {}.", destNode.getId(), flowSpec.getUri());
            // No path to at least one of the destination nodes.
            return null;
        }
    }
    return flowGraphPath;
}
Also used : FlowEdgeContext(org.apache.gobblin.service.modules.flow.FlowEdgeContext) DataNode(org.apache.gobblin.service.modules.flowgraph.DataNode) FlowGraphPath(org.apache.gobblin.service.modules.flow.FlowGraphPath)
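
The method above is all-or-nothing: a single unreachable destination makes the whole lookup return null. The sketch below illustrates that contract in isolation. The class MulticastSketch and the Function-based unicast finder are hypothetical stand-ins, not Gobblin APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

class MulticastSketch {
    /**
     * Collects one path per destination. Returns null as soon as any destination
     * has no path, mirroring the behavior of the findPath example.
     */
    public static List<List<String>> findPaths(List<String> destinations,
                                               Function<String, List<String>> unicast) {
        List<List<String>> paths = new ArrayList<>();
        for (String dest : destinations) {
            List<String> path = unicast.apply(dest);
            if (path == null) {
                return null; // one unreachable destination fails the whole request
            }
            paths.add(path);
        }
        return paths;
    }

    public static void main(String[] args) {
        // Hypothetical finder: every destination except "z" is reachable in one hop.
        Function<String, List<String>> finder =
            dest -> dest.equals("z") ? null : Arrays.asList("src->" + dest);
        System.out.println(findPaths(Arrays.asList("a", "b"), finder)); // prints [[src->a], [src->b]]
        System.out.println(findPaths(Arrays.asList("a", "z"), finder)); // prints null
    }
}
```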

Example 3 with FlowEdgeContext

use of org.apache.gobblin.service.modules.flow.FlowEdgeContext in project incubator-gobblin by apache.

the class BFSPathFinder method findPathUnicast.

/**
 * A simple path finding algorithm based on Breadth-First Search. At every step the algorithm adds the adjacent {@link FlowEdge}s
 * to a queue. The {@link FlowEdge}s whose output {@link DatasetDescriptor} matches the destDatasetDescriptor are
 * added first to the queue. This ensures that dataset transformations are always performed closest to the source.
 * @return a path of {@link FlowEdgeContext}s starting at the srcNode and ending at the destNode.
 */
public List<FlowEdgeContext> findPathUnicast(DataNode destNode) {
    // Initialization of auxiliary data structures used for path computation
    this.pathMap = new HashMap<>();
    // Base condition 1: Source Node or Dest Node is inactive; return null
    if (!srcNode.isActive() || !destNode.isActive()) {
        log.warn("Either source node {} or destination node {} is inactive; skipping path computation.", this.srcNode.getId(), destNode.getId());
        return null;
    }
    // Base condition 2: Check if we are already at the target. If so, return an empty path.
    if ((srcNode.equals(destNode)) && destDatasetDescriptor.contains(srcDatasetDescriptor)) {
        return new ArrayList<>(0);
    }
    LinkedList<FlowEdgeContext> edgeQueue = new LinkedList<>(getNextEdges(srcNode, srcDatasetDescriptor, destDatasetDescriptor));
    for (FlowEdgeContext flowEdgeContext : edgeQueue) {
        this.pathMap.put(flowEdgeContext, flowEdgeContext);
    }
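    // At every step, pop an edge from the front of the queue. If it completes the path, backtrace and return;
    // otherwise append each unvisited adjacent edge returned by getNextEdges to the queue.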
    while (!edgeQueue.isEmpty()) {
        FlowEdgeContext flowEdgeContext = edgeQueue.pop();
        DataNode currentNode = this.flowGraph.getNode(flowEdgeContext.getEdge().getDest());
        DatasetDescriptor currentOutputDatasetDescriptor = flowEdgeContext.getOutputDatasetDescriptor();
        // Are we done?
        if (isPathFound(currentNode, destNode, currentOutputDatasetDescriptor, destDatasetDescriptor)) {
            return constructPath(flowEdgeContext);
        }
        // Expand the currentNode to its adjacent edges and add them to the queue.
        List<FlowEdgeContext> nextEdges = getNextEdges(currentNode, currentOutputDatasetDescriptor, destDatasetDescriptor);
        for (FlowEdgeContext childFlowEdgeContext : nextEdges) {
            // Enqueue the child edge only if it has not been seen before (it is not in pathMap), and record its parent.
            if (!this.pathMap.containsKey(childFlowEdgeContext)) {
                edgeQueue.add(childFlowEdgeContext);
                this.pathMap.put(childFlowEdgeContext, flowEdgeContext);
            }
        }
    }
    // No path found. Return null.
    return null;
}
Also used : FlowEdgeContext(org.apache.gobblin.service.modules.flow.FlowEdgeContext) DatasetDescriptor(org.apache.gobblin.service.modules.dataset.DatasetDescriptor) DataNode(org.apache.gobblin.service.modules.flowgraph.DataNode) ArrayList(java.util.ArrayList) LinkedList(java.util.LinkedList)
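
The search above is a breadth-first traversal over edges rather than nodes, with the predecessor map doubling as the visited set. Below is a minimal, self-contained sketch of that pattern. The Edge class, the adjacency map, and the class name EdgeBfsSketch are assumptions made for illustration; dataset descriptor matching and node activity checks are left out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

class EdgeBfsSketch {
    /** A directed edge; a hypothetical stand-in for FlowEdgeContext. */
    static final class Edge {
        final String src;
        final String dest;
        Edge(String src, String dest) { this.src = src; this.dest = dest; }
        @Override public String toString() { return src + "->" + dest; }
    }

    /** Returns the edges of a path from src to dest, an empty list if src equals dest, or null if none exists. */
    public static List<Edge> findPath(Map<String, List<Edge>> adjacency, String src, String dest) {
        if (src.equals(dest)) {
            return new ArrayList<>();
        }
        // Maps each discovered edge to its predecessor; also serves as the visited set.
        // Edge does not override equals, so lookups are identity-based.
        Map<Edge, Edge> pathMap = new HashMap<>();
        Deque<Edge> queue = new ArrayDeque<>();
        for (Edge e : adjacency.getOrDefault(src, Collections.emptyList())) {
            pathMap.put(e, e); // seed edges are their own predecessor
            queue.add(e);
        }
        while (!queue.isEmpty()) {
            Edge current = queue.poll(); // FIFO order gives breadth-first expansion
            if (current.dest.equals(dest)) {
                LinkedList<Edge> path = new LinkedList<>();
                path.add(current);
                while (pathMap.get(current) != current) {
                    current = pathMap.get(current);
                    path.addFirst(current);
                }
                return path;
            }
            for (Edge child : adjacency.getOrDefault(current.dest, Collections.emptyList())) {
                if (!pathMap.containsKey(child)) {
                    pathMap.put(child, current);
                    queue.add(child);
                }
            }
        }
        return null; // queue exhausted without reaching dest
    }

    public static void main(String[] args) {
        Map<String, List<Edge>> graph = new HashMap<>();
        graph.put("A", Arrays.asList(new Edge("A", "B"), new Edge("A", "C")));
        graph.put("B", Arrays.asList(new Edge("B", "D")));
        graph.put("C", Arrays.asList(new Edge("C", "D")));
        graph.put("D", Arrays.asList(new Edge("D", "E")));
        System.out.println(findPath(graph, "A", "E")); // prints [A->B, B->D, D->E]
    }
}
```

Because an edge is recorded in the map when it is first discovered, cycles in the graph cannot cause the loop to revisit it.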

Example 4 with FlowEdgeContext

use of org.apache.gobblin.service.modules.flow.FlowEdgeContext in project incubator-gobblin by apache.

the class AbstractPathFinder method getNextEdges.

/**
 * A helper method that builds a prioritized list of the {@link FlowEdge}s leaving dataNode, based on whether
 * each FlowEdge has an output {@link DatasetDescriptor} that is compatible with the destDatasetDescriptor.
 * @param dataNode the {@link DataNode} to be expanded for determining candidate edges.
 * @param currentDatasetDescriptor Output {@link DatasetDescriptor} of the current edge.
 * @param destDatasetDescriptor Target {@link DatasetDescriptor}.
 * @return prioritized list of {@link FlowEdgeContext}s to be added to the edge queue for expansion.
 */
List<FlowEdgeContext> getNextEdges(DataNode dataNode, DatasetDescriptor currentDatasetDescriptor, DatasetDescriptor destDatasetDescriptor) {
    List<FlowEdgeContext> prioritizedEdgeList = new LinkedList<>();
    List<String> edgeIds = ConfigUtils.getStringList(this.flowConfig, ConfigurationKeys.WHITELISTED_EDGE_IDS);
    for (FlowEdge flowEdge : this.flowGraph.getEdges(dataNode)) {
        if (!edgeIds.isEmpty() && !edgeIds.contains(flowEdge.getId())) {
            continue;
        }
        try {
            DataNode edgeDestination = this.flowGraph.getNode(flowEdge.getDest());
            // Base condition: Skip this FlowEdge if it is inactive or if the destination of this edge is inactive.
            if (!edgeDestination.isActive() || !flowEdge.isActive()) {
                continue;
            }
            boolean foundExecutor = false;
            // Iterate over all executors for this edge. Find the first one that resolves the underlying flow template.
            for (SpecExecutor specExecutor : flowEdge.getExecutors()) {
                Config mergedConfig = getMergedConfig(flowEdge);
                List<Pair<DatasetDescriptor, DatasetDescriptor>> datasetDescriptorPairs = flowEdge.getFlowTemplate().getDatasetDescriptors(mergedConfig, false);
                for (Pair<DatasetDescriptor, DatasetDescriptor> datasetDescriptorPair : datasetDescriptorPairs) {
                    DatasetDescriptor inputDatasetDescriptor = datasetDescriptorPair.getLeft();
                    DatasetDescriptor outputDatasetDescriptor = datasetDescriptorPair.getRight();
                    try {
                        flowEdge.getFlowTemplate().tryResolving(mergedConfig, inputDatasetDescriptor, outputDatasetDescriptor);
                    } catch (JobTemplate.TemplateException | ConfigException | SpecNotFoundException e) {
                        flowSpec.addCompilationError(flowEdge.getSrc(), flowEdge.getDest(), "Error compiling edge " + flowEdge.toString() + ": " + e.toString());
                        continue;
                    }
                    if (inputDatasetDescriptor.contains(currentDatasetDescriptor)) {
                        DatasetDescriptor edgeOutputDescriptor = makeOutputDescriptorSpecific(currentDatasetDescriptor, outputDatasetDescriptor);
                        FlowEdgeContext flowEdgeContext = new FlowEdgeContext(flowEdge, currentDatasetDescriptor, edgeOutputDescriptor, mergedConfig, specExecutor);
                        if (destDatasetDescriptor.getFormatConfig().contains(outputDatasetDescriptor.getFormatConfig())) {
                            /*
                             * Add to the front of the edge list if the platform-independent properties of the output
                             * descriptor are compatible with those of the destination dataset descriptor.
                             * In other words, we prioritize edges that perform data transformations as close to the source as possible.
                             */
                            prioritizedEdgeList.add(0, flowEdgeContext);
                        } else {
                            prioritizedEdgeList.add(flowEdgeContext);
                        }
                        foundExecutor = true;
                    }
                }
                // TODO: Choose the min-cost executor for the FlowEdge as opposed to the first one that resolves.
                if (foundExecutor) {
                    break;
                }
            }
        } catch (IOException | ReflectiveOperationException | SpecNotFoundException | JobTemplate.TemplateException e) {
            // Skip the edge; and continue
            log.warn("Skipping edge {} with config {} due to exception: {}", flowEdge.getId(), flowConfig.toString(), e);
        }
    }
    return prioritizedEdgeList;
}
Also used : FlowEdge(org.apache.gobblin.service.modules.flowgraph.FlowEdge) DatasetDescriptor(org.apache.gobblin.service.modules.dataset.DatasetDescriptor) Config(com.typesafe.config.Config) ConfigException(com.typesafe.config.ConfigException) IOException(java.io.IOException) LinkedList(java.util.LinkedList) FlowEdgeContext(org.apache.gobblin.service.modules.flow.FlowEdgeContext) SpecNotFoundException(org.apache.gobblin.runtime.api.SpecNotFoundException) DataNode(org.apache.gobblin.service.modules.flowgraph.DataNode) SpecExecutor(org.apache.gobblin.runtime.api.SpecExecutor) Pair(org.apache.commons.lang3.tuple.Pair)
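
The prioritization step above puts edges whose output format is compatible with the destination at the front of the list and appends the rest. The sketch below isolates that step. The Candidate class and the class name EdgePrioritySketch are hypothetical, and plain string equality is an assumption standing in for the format config `contains` check.

```java
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;

class EdgePrioritySketch {
    /** A candidate edge and the data format it outputs; a hypothetical stand-in for FlowEdgeContext. */
    static final class Candidate {
        final String id;
        final String outputFormat;
        Candidate(String id, String outputFormat) { this.id = id; this.outputFormat = outputFormat; }
    }

    /** Returns candidate ids, with those already producing destFormat moved to the front. */
    public static List<String> prioritize(List<Candidate> candidates, String destFormat) {
        LinkedList<String> prioritized = new LinkedList<>();
        for (Candidate c : candidates) {
            if (c.outputFormat.equals(destFormat)) {
                prioritized.addFirst(c.id); // matches the destination format: expand first
            } else {
                prioritized.addLast(c.id);
            }
        }
        return prioritized;
    }

    public static void main(String[] args) {
        List<Candidate> candidates = Arrays.asList(
            new Candidate("e1", "json"),
            new Candidate("e2", "avro"),
            new Candidate("e3", "json"),
            new Candidate("e4", "avro"));
        System.out.println(prioritize(candidates, "avro")); // prints [e4, e2, e1, e3]
    }
}
```

Note that inserting at the front reverses the encounter order of the matching edges (e4 ends up before e2), while non-matching edges keep their original order.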

Aggregations

FlowEdgeContext (org.apache.gobblin.service.modules.flow.FlowEdgeContext): 4
LinkedList (java.util.LinkedList): 3
DataNode (org.apache.gobblin.service.modules.flowgraph.DataNode): 3
DatasetDescriptor (org.apache.gobblin.service.modules.dataset.DatasetDescriptor): 2
Config (com.typesafe.config.Config): 1
ConfigException (com.typesafe.config.ConfigException): 1
IOException (java.io.IOException): 1
ArrayList (java.util.ArrayList): 1
Pair (org.apache.commons.lang3.tuple.Pair): 1
SpecExecutor (org.apache.gobblin.runtime.api.SpecExecutor): 1
SpecNotFoundException (org.apache.gobblin.runtime.api.SpecNotFoundException): 1
FlowGraphPath (org.apache.gobblin.service.modules.flow.FlowGraphPath): 1
FlowEdge (org.apache.gobblin.service.modules.flowgraph.FlowEdge): 1