Example 16 with SlotNotFoundException

Use of org.apache.flink.runtime.taskexecutor.slot.SlotNotFoundException in project flink by apache.

Class TaskExecutor, method handleAcceptedSlotOffers.

@Nonnull
private BiConsumer<Iterable<SlotOffer>, Throwable> handleAcceptedSlotOffers(JobID jobId, JobMasterGateway jobMasterGateway, JobMasterId jobMasterId, Collection<SlotOffer> offeredSlots, UUID offerId) {
    return (Iterable<SlotOffer> acceptedSlots, Throwable throwable) -> {
        // check if this is the latest offer
        if (!offerId.equals(currentSlotOfferPerJob.get(jobId))) {
            // If this offer is outdated then it can be safely ignored.
            // If the response for a given slot is identical in both offers (accepted/rejected),
            // then this is naturally the case since the end-result is the same.
            // If the responses differ, then there are 2 cases to consider:
            // 1) initially rejected, later accepted
            // This can happen when the resource requirements of a job increase between
            // offers.
            // In this case the first response MUST be ignored, so that
            // the slot can be properly activated when the second response arrives.
            // 2) initially accepted, later rejected
            // This can happen when the resource requirements of a job decrease between
            // offers.
            // In this case the first response MAY be ignored, because the job no longer
            // requires the slot (and already has initiated steps to free it) and we can thus
            // assume that any in-flight task submissions are no longer relevant for the job
            // execution.
            log.debug("Discard slot offer response since there is a newer offer for the job {}.", jobId);
            return;
        }
        if (throwable != null) {
            if (throwable instanceof TimeoutException) {
                log.info("Slot offering to JobManager did not finish in time. Retrying the slot offering.");
                // We ran into a timeout. Try again.
                offerSlotsToJobManager(jobId);
            } else {
                log.warn("Slot offering to JobManager failed. Freeing the slots and returning them to the ResourceManager.", throwable);
                // We encountered an exception. Free the slots and return them to the RM.
                for (SlotOffer reservedSlot : offeredSlots) {
                    freeSlotInternal(reservedSlot.getAllocationId(), throwable);
                }
            }
        } else {
            // check if the response is still valid
            if (isJobManagerConnectionValid(jobId, jobMasterId)) {
                // mark accepted slots active
                for (SlotOffer acceptedSlot : acceptedSlots) {
                    final AllocationID allocationId = acceptedSlot.getAllocationId();
                    try {
                        if (!taskSlotTable.markSlotActive(allocationId)) {
                            // the slot is either free or releasing at the moment
                            final String message = "Could not mark slot " + allocationId + " active.";
                            log.debug(message);
                            jobMasterGateway.failSlot(getResourceID(), allocationId, new FlinkException(message));
                        }
                    } catch (SlotNotFoundException e) {
                        final String message = "Could not mark slot " + allocationId + " active.";
                        jobMasterGateway.failSlot(getResourceID(), allocationId, new FlinkException(message));
                    }
                    offeredSlots.remove(acceptedSlot);
                }
                final Exception e = new Exception("The slot was rejected by the JobManager.");
                for (SlotOffer rejectedSlot : offeredSlots) {
                    freeSlotInternal(rejectedSlot.getAllocationId(), e);
                }
            } else {
                // discard the response since there is a new leader for the job
                log.debug("Discard slot offer response since there is a new leader for the job {}.", jobId);
            }
        }
    };
}
Also used: SlotNotFoundException (org.apache.flink.runtime.taskexecutor.slot.SlotNotFoundException), SlotOffer (org.apache.flink.runtime.taskexecutor.slot.SlotOffer), AllocationID (org.apache.flink.runtime.clusterframework.types.AllocationID), FlinkException (org.apache.flink.util.FlinkException), TaskNotRunningException (org.apache.flink.runtime.operators.coordination.TaskNotRunningException), CheckpointException (org.apache.flink.runtime.checkpoint.CheckpointException), SlotOccupiedException (org.apache.flink.runtime.taskexecutor.exceptions.SlotOccupiedException), SlotAllocationException (org.apache.flink.runtime.taskexecutor.exceptions.SlotAllocationException), TaskSubmissionException (org.apache.flink.runtime.taskexecutor.exceptions.TaskSubmissionException), TaskException (org.apache.flink.runtime.taskexecutor.exceptions.TaskException), SlotNotActiveException (org.apache.flink.runtime.taskexecutor.slot.SlotNotActiveException), IOException (java.io.IOException), TimeoutException (java.util.concurrent.TimeoutException), RegistrationTimeoutException (org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException), CompletionException (java.util.concurrent.CompletionException), TaskManagerException (org.apache.flink.runtime.taskexecutor.exceptions.TaskManagerException), Nonnull (javax.annotation.Nonnull)
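
The offer-id check at the top of handleAcceptedSlotOffers is an instance of a general pattern: tag each asynchronous request with a fresh token and drop any response whose token is no longer the latest. The following is a minimal, self-contained sketch of that pattern only. The class StaleOfferDemo, the method newOffer, and the use of plain String job keys are hypothetical names chosen for illustration and are not part of Flink.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.function.BiConsumer;

// Hypothetical demo class; none of these names come from Flink.
class StaleOfferDemo {
    // Latest offer id per job key (plays the role of currentSlotOfferPerJob).
    private final Map<String, UUID> currentOfferPerJob = new HashMap<>();
    // Records the responses that were actually processed.
    final List<String> processed = new ArrayList<>();

    // Registers a new offer for the job and returns the callback for its response.
    BiConsumer<String, Throwable> newOffer(String jobKey) {
        final UUID offerId = UUID.randomUUID();
        currentOfferPerJob.put(jobKey, offerId);
        return (response, error) -> {
            // A newer offer has replaced this one, so the response is dropped.
            if (!offerId.equals(currentOfferPerJob.get(jobKey))) {
                return;
            }
            processed.add(error == null ? response : "error: " + error.getMessage());
        };
    }

    public static void main(String[] args) {
        StaleOfferDemo demo = new StaleOfferDemo();
        BiConsumer<String, Throwable> first = demo.newOffer("job-1");
        BiConsumer<String, Throwable> second = demo.newOffer("job-1");
        first.accept("response to first offer", null);  // stale, ignored
        second.accept("response to second offer", null); // latest, processed
        System.out.println(demo.processed); // prints [response to second offer]
    }
}
```

Because the token is captured by the lambda when the offer is created, the callback needs no extra bookkeeping to decide whether it is still current.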

Example 17 with SlotNotFoundException

Use of org.apache.flink.runtime.taskexecutor.slot.SlotNotFoundException in project flink by apache.

Class TaskExecutor, method disconnectJobManagerConnection.

private void disconnectJobManagerConnection(JobTable.Connection jobManagerConnection, Exception cause) {
    final JobID jobId = jobManagerConnection.getJobId();
    if (log.isDebugEnabled()) {
        log.debug("Close JobManager connection for job {}.", jobId, cause);
    } else {
        log.info("Close JobManager connection for job {}.", jobId);
    }
    // 1. fail tasks running under this JobID
    Iterator<Task> tasks = taskSlotTable.getTasks(jobId);
    final FlinkException failureCause = new FlinkException(String.format("Disconnect from JobManager responsible for %s.", jobId), cause);
    while (tasks.hasNext()) {
        tasks.next().failExternally(failureCause);
    }
    // 2. Move the active slots to state allocated (possible to time out again)
    Set<AllocationID> activeSlotAllocationIDs = taskSlotTable.getActiveTaskSlotAllocationIdsPerJob(jobId);
    final FlinkException freeingCause = new FlinkException("Slot could not be marked inactive.");
    for (AllocationID activeSlotAllocationID : activeSlotAllocationIDs) {
        try {
            if (!taskSlotTable.markSlotInactive(activeSlotAllocationID, taskManagerConfiguration.getSlotTimeout())) {
                freeSlotInternal(activeSlotAllocationID, freeingCause);
            }
        } catch (SlotNotFoundException e) {
            log.debug("Could not mark the slot {} inactive.", activeSlotAllocationID, e);
        }
    }
    // 3. Disassociate from the JobManager
    try {
        jobManagerHeartbeatManager.unmonitorTarget(jobManagerConnection.getResourceId());
        disassociateFromJobManager(jobManagerConnection, cause);
    } catch (IOException e) {
        log.warn("Could not properly disassociate from JobManager {}.", jobManagerConnection.getJobManagerGateway().getAddress(), e);
    }
    jobManagerConnection.disconnect();
}
Also used: SlotNotFoundException (org.apache.flink.runtime.taskexecutor.slot.SlotNotFoundException), Task (org.apache.flink.runtime.taskmanager.Task), AllocationID (org.apache.flink.runtime.clusterframework.types.AllocationID), IOException (java.io.IOException), JobID (org.apache.flink.api.common.JobID), FlinkException (org.apache.flink.util.FlinkException)
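
In the examples above, markSlotActive and markSlotInactive report failure in two different ways: a false return value when the slot exists but the state change is not possible, and a SlotNotFoundException when no slot is known for the allocation id. The sketch below shows how a caller can distinguish the three outcomes. MiniSlotTable, UnknownSlotException, and describe are hypothetical, simplified stand-ins written for this illustration; they do not reproduce Flink's TaskSlotTable.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for a slot table; not Flink's TaskSlotTable.
class MiniSlotTable {
    enum State { ALLOCATED, ACTIVE, RELEASING }

    // Hypothetical checked exception playing the role of SlotNotFoundException.
    static class UnknownSlotException extends Exception {
        UnknownSlotException(String id) {
            super("No slot for allocation " + id);
        }
    }

    private final Map<String, State> slots = new HashMap<>();

    void allocate(String id) {
        slots.put(id, State.ALLOCATED);
    }

    void startReleasing(String id) {
        slots.put(id, State.RELEASING);
    }

    // Returns true if the slot is now active, false if it exists but is
    // releasing, and throws if the allocation id is unknown.
    boolean markActive(String id) throws UnknownSlotException {
        State state = slots.get(id);
        if (state == null) {
            throw new UnknownSlotException(id);
        }
        if (state == State.RELEASING) {
            return false;
        }
        slots.put(id, State.ACTIVE);
        return true;
    }

    // Collapses the three possible outcomes into a label.
    static String describe(MiniSlotTable table, String id) {
        try {
            return table.markActive(id) ? "active" : "rejected";
        } catch (UnknownSlotException e) {
            return "missing";
        }
    }

    public static void main(String[] args) {
        MiniSlotTable table = new MiniSlotTable();
        table.allocate("a");
        table.allocate("b");
        table.startReleasing("b");
        System.out.println(describe(table, "a")); // prints active
        System.out.println(describe(table, "b")); // prints rejected
        System.out.println(describe(table, "c")); // prints missing
    }
}
```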

Example 18 with SlotNotFoundException

Use of org.apache.flink.runtime.taskexecutor.slot.SlotNotFoundException in project flink-mirror by flink-ci.

Class TaskExecutor, method allocateSlotForJob.

private boolean allocateSlotForJob(JobID jobId, SlotID slotId, AllocationID allocationId, ResourceProfile resourceProfile, String targetAddress) throws SlotAllocationException {
    allocateSlot(slotId, jobId, allocationId, resourceProfile);
    final JobTable.Job job;
    try {
        job = jobTable.getOrCreateJob(jobId, () -> registerNewJobAndCreateServices(jobId, targetAddress));
    } catch (Exception e) {
        // free the allocated slot
        try {
            taskSlotTable.freeSlot(allocationId);
        } catch (SlotNotFoundException slotNotFoundException) {
            // slot no longer existent, this should actually never happen, because we've
            // just allocated the slot. So let's fail hard in this case!
            onFatalError(slotNotFoundException);
        }
        // release local state under the allocation id.
        localStateStoresManager.releaseLocalStateForAllocationId(allocationId);
        // sanity check
        if (!taskSlotTable.isSlotFree(slotId.getSlotNumber())) {
            onFatalError(new Exception("Could not free slot " + slotId));
        }
        throw new SlotAllocationException("Could not create new job.", e);
    }
    return job.isConnected();
}
Also used: SlotNotFoundException (org.apache.flink.runtime.taskexecutor.slot.SlotNotFoundException), SlotAllocationException (org.apache.flink.runtime.taskexecutor.exceptions.SlotAllocationException), TaskNotRunningException (org.apache.flink.runtime.operators.coordination.TaskNotRunningException), CheckpointException (org.apache.flink.runtime.checkpoint.CheckpointException), SlotOccupiedException (org.apache.flink.runtime.taskexecutor.exceptions.SlotOccupiedException), FlinkException (org.apache.flink.util.FlinkException), TaskSubmissionException (org.apache.flink.runtime.taskexecutor.exceptions.TaskSubmissionException), TaskException (org.apache.flink.runtime.taskexecutor.exceptions.TaskException), SlotNotActiveException (org.apache.flink.runtime.taskexecutor.slot.SlotNotActiveException), IOException (java.io.IOException), TimeoutException (java.util.concurrent.TimeoutException), RegistrationTimeoutException (org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException), CompletionException (java.util.concurrent.CompletionException), TaskManagerException (org.apache.flink.runtime.taskexecutor.exceptions.TaskManagerException)
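
allocateSlotForJob follows an allocate-then-roll-back shape: the slot is reserved first, and if the follow-up step (creating the job) fails, the reservation is undone before the failure is rethrown wrapped in a domain exception. A minimal sketch of that shape, under the assumption that a plain set of slot numbers is enough to stand in for the slot table; RollbackDemo, AllocationFailedException, and allocateAndRegister are hypothetical names, not Flink APIs.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical demo class illustrating rollback on failure; not Flink code.
class RollbackDemo {
    // Hypothetical checked exception playing the role of SlotAllocationException.
    static class AllocationFailedException extends Exception {
        AllocationFailedException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // Slot numbers that are currently reserved.
    final Set<Integer> allocated = new HashSet<>();

    // Reserves the slot, then runs the registration step. If registration
    // throws, the reservation is undone and the failure is rethrown wrapped.
    boolean allocateAndRegister(int slot, Supplier<Boolean> registerJob)
            throws AllocationFailedException {
        allocated.add(slot);
        try {
            return registerJob.get();
        } catch (RuntimeException e) {
            allocated.remove(slot); // roll back the reservation
            throw new AllocationFailedException("Could not create new job.", e);
        }
    }

    public static void main(String[] args) {
        RollbackDemo demo = new RollbackDemo();
        try {
            demo.allocateAndRegister(1, () -> {
                throw new IllegalStateException("registration failed");
            });
        } catch (AllocationFailedException e) {
            System.out.println(e.getMessage() + " Slot still reserved: "
                    + demo.allocated.contains(1));
        }
    }
}
```

The caller sees a single exception type, while the original failure remains available through getCause().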

Example 19 with SlotNotFoundException

Use of org.apache.flink.runtime.taskexecutor.slot.SlotNotFoundException in project flink-mirror by flink-ci.

Class TaskExecutor, method freeSlotInternal.

private void freeSlotInternal(AllocationID allocationId, Throwable cause) {
    checkNotNull(allocationId);
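    // only free slots while the TaskExecutor is running; during shutdown the
    // request is ignored (see the else branch below)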
    if (isRunning()) {
        log.debug("Free slot with allocation id {} because: {}", allocationId, cause.getMessage());
        try {
            final JobID jobId = taskSlotTable.getOwningJob(allocationId);
            final int slotIndex = taskSlotTable.freeSlot(allocationId, cause);
            slotAllocationSnapshotPersistenceService.deleteAllocationSnapshot(slotIndex);
            if (slotIndex != -1) {
                if (isConnectedToResourceManager()) {
                    // the slot was freed. Tell the RM about it
                    ResourceManagerGateway resourceManagerGateway = establishedResourceManagerConnection.getResourceManagerGateway();
                    resourceManagerGateway.notifySlotAvailable(establishedResourceManagerConnection.getTaskExecutorRegistrationId(), new SlotID(getResourceID(), slotIndex), allocationId);
                }
                if (jobId != null) {
                    closeJobManagerConnectionIfNoAllocatedResources(jobId);
                }
            }
        } catch (SlotNotFoundException e) {
            log.debug("Could not free slot for allocation id {}.", allocationId, e);
        }
        localStateStoresManager.releaseLocalStateForAllocationId(allocationId);
    } else {
        log.debug("Ignoring the freeing of slot {} because the TaskExecutor is shutting down.", allocationId);
    }
}
Also used: SlotNotFoundException (org.apache.flink.runtime.taskexecutor.slot.SlotNotFoundException), SlotID (org.apache.flink.runtime.clusterframework.types.SlotID), JobID (org.apache.flink.api.common.JobID), RpcEndpoint (org.apache.flink.runtime.rpc.RpcEndpoint), ResourceManagerGateway (org.apache.flink.runtime.resourcemanager.ResourceManagerGateway)

Example 20 with SlotNotFoundException

Use of org.apache.flink.runtime.taskexecutor.slot.SlotNotFoundException in project flink-mirror by flink-ci.

Class TaskExecutor, method handleAcceptedSlotOffers.

The method body and the list of additionally used classes are identical to Example 16 (handleAcceptedSlotOffers in project flink by apache).

Aggregations

SlotNotFoundException (org.apache.flink.runtime.taskexecutor.slot.SlotNotFoundException): 23
IOException (java.io.IOException): 16
TaskSubmissionException (org.apache.flink.runtime.taskexecutor.exceptions.TaskSubmissionException): 13
SlotNotActiveException (org.apache.flink.runtime.taskexecutor.slot.SlotNotActiveException): 13
JobID (org.apache.flink.api.common.JobID): 10
AllocationID (org.apache.flink.runtime.clusterframework.types.AllocationID): 10
TimeoutException (java.util.concurrent.TimeoutException): 9
SlotAllocationException (org.apache.flink.runtime.taskexecutor.exceptions.SlotAllocationException): 9
TaskException (org.apache.flink.runtime.taskexecutor.exceptions.TaskException): 9
FlinkException (org.apache.flink.util.FlinkException): 9
Task (org.apache.flink.runtime.taskmanager.Task): 8
MemoryManager (org.apache.flink.runtime.memory.MemoryManager): 6
CompletionException (java.util.concurrent.CompletionException): 5
CheckpointException (org.apache.flink.runtime.checkpoint.CheckpointException): 5
TaskNotRunningException (org.apache.flink.runtime.operators.coordination.TaskNotRunningException): 5
RegistrationTimeoutException (org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException): 5
SlotOccupiedException (org.apache.flink.runtime.taskexecutor.exceptions.SlotOccupiedException): 5
TaskManagerException (org.apache.flink.runtime.taskexecutor.exceptions.TaskManagerException): 5
SlotID (org.apache.flink.runtime.clusterframework.types.SlotID): 4
LibraryCacheManager (org.apache.flink.runtime.execution.librarycache.LibraryCacheManager): 4