Example 1 with PartitionException

Use of org.apache.flink.runtime.io.network.partition.PartitionException in project flink by apache, from the class RestartPipelinedRegionFailoverStrategy, method getTasksNeedingRestart.

// ------------------------------------------------------------------------
// task failure handling
// ------------------------------------------------------------------------
/**
 * Returns a set of IDs corresponding to the set of vertices that should be restarted. In this
 * strategy, all task vertices in 'involved' regions are proposed to be restarted. The
 * 'involved' regions are calculated with the rules below:
 *
 * <ol>
 *   <li>The region containing the failed task is always involved.
 *   <li>If an input result partition of an involved region is not available, i.e. Missing or
 *       Corrupted, the region containing the partition producer task is involved.
 *   <li>If a region is involved, all of its consumer regions are involved.
 * </ol>
 *
 * @param executionVertexId ID of the failed task
 * @param cause cause of the failure
 * @return set of IDs of vertices to restart
 */
@Override
public Set<ExecutionVertexID> getTasksNeedingRestart(ExecutionVertexID executionVertexId, Throwable cause) {
    LOG.info("Calculating tasks to restart to recover the failed task {}.", executionVertexId);
    final SchedulingPipelinedRegion failedRegion = topology.getPipelinedRegionOfVertex(executionVertexId);
    if (failedRegion == null) {
        // TODO: show the task name in the log
        throw new IllegalStateException("Can not find the failover region for task " + executionVertexId, cause);
    }
    // if the failure cause is a data consumption error, mark the corresponding data partition
    // as failed so that the failover process will try to recover it
    Optional<PartitionException> dataConsumptionException = ExceptionUtils.findThrowable(cause, PartitionException.class);
    if (dataConsumptionException.isPresent()) {
        resultPartitionAvailabilityChecker.markResultPartitionFailed(dataConsumptionException.get().getPartitionId().getPartitionId());
    }
    // calculate the tasks to restart based on the result of regions to restart
    Set<ExecutionVertexID> tasksToRestart = new HashSet<>();
    for (SchedulingPipelinedRegion region : getRegionsToRestart(failedRegion)) {
        for (SchedulingExecutionVertex vertex : region.getVertices()) {
            // we do not need to restart tasks which are already in the initial state
            if (vertex.getState() != ExecutionState.CREATED) {
                tasksToRestart.add(vertex.getId());
            }
        }
    }
    // the previous failed partition will be recovered. remove its failed state from the checker
    if (dataConsumptionException.isPresent()) {
        resultPartitionAvailabilityChecker.removeResultPartitionFromFailedState(dataConsumptionException.get().getPartitionId().getPartitionId());
    }
    LOG.info("{} tasks should be restarted to recover the failed task {}. ", tasksToRestart.size(), executionVertexId);
    return tasksToRestart;
}
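The three involvement rules from the Javadoc can be illustrated with a minimal, self-contained sketch. Note this is an assumption-laden toy model, not Flink's implementation: regions are plain String IDs, the topology is two hypothetical maps (consumers and producers), and unavailable partitions are tracked per producer region, whereas the real logic lives in getRegionsToRestart over a SchedulingTopology.

```java
import java.util.*;

// Toy model of the region-expansion rules; all names here are hypothetical.
public class RegionExpansionSketch {

    // region -> regions that consume its output partitions (rule 3 direction)
    static Map<String, List<String>> consumers = new HashMap<>();
    // region -> producer regions whose output partitions feed it (rule 2 direction)
    static Map<String, List<String>> producers = new HashMap<>();
    // producer regions whose output partitions are Missing or Corrupted
    static Set<String> unavailableProducers = new HashSet<>();

    static Set<String> regionsToRestart(String failedRegion) {
        Set<String> involved = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        involved.add(failedRegion); // rule 1: the failed region is always involved
        queue.add(failedRegion);
        while (!queue.isEmpty()) {
            String region = queue.poll();
            // rule 2: involve producers of unavailable input partitions
            for (String producer : producers.getOrDefault(region, List.of())) {
                if (unavailableProducers.contains(producer) && involved.add(producer)) {
                    queue.add(producer);
                }
            }
            // rule 3: involve all downstream consumer regions
            for (String consumer : consumers.getOrDefault(region, List.of())) {
                if (involved.add(consumer)) {
                    queue.add(consumer);
                }
            }
        }
        return involved;
    }
}
```

For a chain R1 -> R2 -> R3 where R2 fails and R1's output partition is unavailable, the sketch expands the restart set to all three regions: R2 by rule 1, R1 by rule 2, and R3 by rule 3.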
Imports used:

import java.util.HashSet;
import org.apache.flink.runtime.io.network.partition.PartitionException;
import org.apache.flink.runtime.scheduler.strategy.ExecutionVertexID;
import org.apache.flink.runtime.scheduler.strategy.SchedulingExecutionVertex;
import org.apache.flink.runtime.scheduler.strategy.SchedulingPipelinedRegion;

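The method above relies on ExceptionUtils.findThrowable to detect whether a PartitionException is anywhere in the failure's cause chain. As a rough, hand-rolled illustration of that pattern (not Flink's actual utility, which may handle additional cases), a cause-chain search can be sketched as:

```java
import java.util.Optional;

// Illustrative cause-chain walk: returns the first throwable of the given
// type found by following getCause() links from the top-level throwable.
public class FindThrowableSketch {
    static <T extends Throwable> Optional<T> findThrowable(Throwable top, Class<T> type) {
        Throwable t = top;
        while (t != null) {
            if (type.isInstance(t)) {
                return Optional.of(type.cast(t));
            }
            t = t.getCause();
        }
        return Optional.empty();
    }
}
```

With this helper, a failure wrapped as new RuntimeException("task failed", partitionError) still surfaces the nested partition error, which is why the strategy can mark the right partition as failed regardless of how deeply the cause is wrapped.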