
Example 26 with Task

Use of com.netflix.titus.api.jobmanager.model.job.Task in project titus-control-plane by Netflix.

From the class JobReconciliationFrameworkFactory, method newRestoredEngine.

private InternalReconciliationEngine<JobManagerReconcilerEvent> newRestoredEngine(Job job, List<Task> tasks) {
    EntityHolder jobHolder = EntityHolder.newRoot(job.getId(), job);
    for (Task task : tasks) {
        EntityHolder taskHolder = EntityHolder.newRoot(task.getId(), task);
        EntityHolder decorated = TaskTimeoutChangeActions.setTimeoutOnRestoreFromStore(jobManagerConfiguration, taskHolder, clock);
        jobHolder = jobHolder.addChild(decorated);
    }
    return newEngine(jobHolder, false);
}
Also used : Task(com.netflix.titus.api.jobmanager.model.job.Task) EntityHolder(com.netflix.titus.common.framework.reconciler.EntityHolder)
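
The reassignment `jobHolder = jobHolder.addChild(decorated)` in the loop above suggests that the holder is immutable and that `addChild` returns a new instance. A minimal sketch of that pattern follows; `ImmutableHolderSketch` is a hypothetical stand-in written for illustration, not the real EntityHolder, whose API may differ.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical, simplified holder; not part of titus-control-plane.
final class ImmutableHolderSketch {
    final String id;
    final Object entity;
    final List<ImmutableHolderSketch> children;

    private ImmutableHolderSketch(String id, Object entity, List<ImmutableHolderSketch> children) {
        this.id = id;
        this.entity = entity;
        // Expose a read-only view so callers cannot mutate the tree in place.
        this.children = Collections.unmodifiableList(children);
    }

    static ImmutableHolderSketch newRoot(String id, Object entity) {
        return new ImmutableHolderSketch(id, entity, new ArrayList<>());
    }

    /** Returns a new holder with the child appended; this instance is left unchanged. */
    ImmutableHolderSketch addChild(ImmutableHolderSketch child) {
        List<ImmutableHolderSketch> copy = new ArrayList<>(children);
        copy.add(child);
        return new ImmutableHolderSketch(id, entity, copy);
    }
}
```

With a holder like this, forgetting to reassign the result of `addChild` silently drops the child, which is why the loop above reassigns `jobHolder` on every iteration.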

Example 27 with Task

Use of com.netflix.titus.api.jobmanager.model.job.Task in project titus-control-plane by Netflix.

From the class JobReconciliationFrameworkFactory, method validateTask.

private Optional<Task> validateTask(Task task) {
    // Perform strict validation for reporting purposes
    Set<ValidationError> strictViolations = strictEntitySanitizer.validate(task);
    if (!strictViolations.isEmpty()) {
        logger.error("No strictly consistent task record found: taskId={}, violations={}", task.getId(), EntitySanitizerUtil.toStringMap(strictViolations));
        errorCollector.strictlyInvalidTask(task.getId());
    }
    // Required checks
    Set<ValidationError> violations = permissiveEntitySanitizer.validate(task);
    if (!violations.isEmpty()) {
        logger.error("Bad task record found: taskId={}, violations={}", task.getId(), EntitySanitizerUtil.toStringMap(violations));
        if (jobManagerConfiguration.isFailOnDataValidation()) {
            return Optional.empty();
        }
    }
    // If version is missing (old task objects) create one based on the current task state.
    Task taskWithVersion = task;
    if (task.getVersion() == null || task.getVersion().getTimestamp() < 0) {
        Version newVersion = Version.newBuilder().withTimestamp(task.getStatus().getTimestamp()).build();
        taskWithVersion = task.toBuilder().withVersion(newVersion).build();
    }
    return Optional.of(taskWithVersion);
}
Also used : Task(com.netflix.titus.api.jobmanager.model.job.Task) Version(com.netflix.titus.api.jobmanager.model.job.Version) ValidationError(com.netflix.titus.common.model.sanitizer.ValidationError)
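
The method above combines three steps: a strict check used only for reporting, a permissive check that can reject the record, and a backfill of a missing version from the status timestamp. The sketch below reproduces that flow with plain Java types. `Rec`, the validator functions, and the `report` list are hypothetical stand-ins for the Titus task, the entity sanitizers, and the error collector.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical stand-ins; none of these types come from titus-control-plane.
final class TwoTierValidationSketch {

    /** Minimal record; a null versionTimestamp models data written before versions existed. */
    static final class Rec {
        final String id;
        final long statusTimestamp;
        final Long versionTimestamp;

        Rec(String id, long statusTimestamp, Long versionTimestamp) {
            this.id = id;
            this.statusTimestamp = statusTimestamp;
            this.versionTimestamp = versionTimestamp;
        }
    }

    static Optional<Rec> validate(Rec rec,
                                  Function<Rec, List<String>> strict,
                                  Function<Rec, List<String>> permissive,
                                  boolean failOnInvalid,
                                  List<String> report) {
        // Strict violations are only reported; they never reject the record.
        if (!strict.apply(rec).isEmpty()) {
            report.add("strict:" + rec.id);
        }
        // Permissive violations reject the record only when the flag is set.
        if (!permissive.apply(rec).isEmpty()) {
            report.add("permissive:" + rec.id);
            if (failOnInvalid) {
                return Optional.empty();
            }
        }
        // Backfill a missing or negative version from the status timestamp.
        if (rec.versionTimestamp == null || rec.versionTimestamp < 0) {
            return Optional.of(new Rec(rec.id, rec.statusTimestamp, rec.statusTimestamp));
        }
        return Optional.of(rec);
    }
}
```

Note that with `failOnInvalid` set to false, a record that fails the permissive check is still returned, which mirrors the behavior of the original method when `isFailOnDataValidation()` is false.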

Example 28 with Task

Use of com.netflix.titus.api.jobmanager.model.job.Task in project titus-control-plane by Netflix.

From the class JobReconciliationFrameworkFactory, method newInstance.

ReconciliationFramework<JobManagerReconcilerEvent> newInstance() {
    List<Pair<Job, List<Task>>> jobsAndTasks = loadJobsAndTasksFromStore(errorCollector);
    // Create one reconciliation engine per restored job and validate each of its tasks
    List<InternalReconciliationEngine<JobManagerReconcilerEvent>> engines = new ArrayList<>();
    for (Pair<Job, List<Task>> pair : jobsAndTasks) {
        Job job = pair.getLeft();
        List<Task> tasks = pair.getRight();
        InternalReconciliationEngine<JobManagerReconcilerEvent> engine = newRestoredEngine(job, tasks);
        engines.add(engine);
        for (Task task : tasks) {
            Optional<Task> validatedTask = validateTask(task);
            if (!validatedTask.isPresent()) {
                errorCollector.invalidTaskRecord(task.getId());
            }
        }
    }
    errorCollector.failIfTooManyBadRecords();
    return new DefaultReconciliationFramework<>(
            engines,
            bootstrapModel -> newEngine(bootstrapModel, true),
            jobManagerConfiguration.getReconcilerIdleTimeoutMs(),
            jobManagerConfiguration.getReconcilerActiveTimeoutMs(),
            jobManagerConfiguration.getCheckpointIntervalMs(),
            INDEX_COMPARATORS,
            JOB_EVENT_FACTORY,
            registry,
            optionalScheduler);
}
Also used : Task(com.netflix.titus.api.jobmanager.model.job.Task) ArrayList(java.util.ArrayList) JobManagerReconcilerEvent(com.netflix.titus.master.jobmanager.service.event.JobManagerReconcilerEvent) DefaultReconciliationFramework(com.netflix.titus.common.framework.reconciler.internal.DefaultReconciliationFramework) List(java.util.List) InternalReconciliationEngine(com.netflix.titus.common.framework.reconciler.internal.InternalReconciliationEngine) Job(com.netflix.titus.api.jobmanager.model.job.Job) Pair(com.netflix.titus.common.util.tuple.Pair)

Example 29 with Task

Use of com.netflix.titus.api.jobmanager.model.job.Task in project titus-control-plane by Netflix.

From the class JobReconciliationFrameworkFactory, method loadJobsAndTasksFromStore.

private List<Pair<Job, List<Task>>> loadJobsAndTasksFromStore(InitializationErrorCollector errorCollector) {
    long startTime = clock.wallTime();
    // load all job/task pairs
    List<Pair<Job, Pair<List<Task>, Integer>>> jobTasksPairs;
    try {
        jobTasksPairs = store.init().andThen(store.retrieveJobs().flatMap(retrievedJobsAndErrors -> {
            errorCollector.corruptedJobRecords(retrievedJobsAndErrors.getRight());
            List<Job<?>> retrievedJobs = retrievedJobsAndErrors.getLeft();
            List<Observable<Pair<Job, Pair<List<Task>, Integer>>>> retrieveTasksObservables = new ArrayList<>();
            for (Job job : retrievedJobs) {
                // TODO Finished jobs that were not archived immediately should be archived by background archive process
                if (job.getStatus().getState() == JobState.Finished) {
                    logger.info("Not loading finished job: {}", job.getId());
                    continue;
                }
                Optional<Job> validatedJob = validateJob(job);
                if (validatedJob.isPresent()) {
                    Observable<Pair<Job, Pair<List<Task>, Integer>>> retrieveTasksObservable = store.retrieveTasksForJob(job.getId()).map(taskList -> new Pair<>(validatedJob.get(), taskList));
                    retrieveTasksObservables.add(retrieveTasksObservable);
                } else {
                    errorCollector.invalidJob(job.getId());
                }
            }
            return Observable.merge(retrieveTasksObservables, MAX_RETRIEVE_TASK_CONCURRENCY);
        })).toList().toBlocking().singleOrDefault(Collections.emptyList());
        int corruptedTaskRecords = jobTasksPairs.stream().mapToInt(p -> p.getRight().getRight()).sum();
        errorCollector.corruptedTaskRecords(corruptedTaskRecords);
        int taskCount = jobTasksPairs.stream().map(p -> p.getRight().getLeft().size()).reduce(0, (a, v) -> a + v);
        loadedJobs.set(jobTasksPairs.size());
        loadedTasks.set(taskCount);
        for (Pair<Job, Pair<List<Task>, Integer>> jobTaskPair : jobTasksPairs) {
            Job job = jobTaskPair.getLeft();
            List<Task> tasks = jobTaskPair.getRight().getLeft();
            List<String> taskStrings = tasks.stream().map(t -> String.format("<%s,ks:%s>", t.getId(), t.getStatus().getState())).collect(Collectors.toList());
            logger.info("Loaded job: {} with tasks: {}", job.getId(), taskStrings);
        }
        logger.info("{} jobs and {} tasks loaded from store in {}ms", jobTasksPairs.size(), taskCount, clock.wallTime() - startTime);
    } catch (Exception e) {
        logger.error("Failed to load jobs from the store during initialization:", e);
        throw new IllegalStateException("Failed to load jobs from the store during initialization", e);
    } finally {
        storeLoadTimeMs.set(clock.wallTime() - startTime);
    }
    return jobTasksPairs.stream().map(p -> Pair.of(p.getLeft(), p.getRight().getLeft())).collect(Collectors.toList());
}
Also used : IndexKind(com.netflix.titus.master.jobmanager.service.DefaultV3JobOperations.IndexKind) TitusChangeAction(com.netflix.titus.master.jobmanager.service.common.action.TitusChangeAction) Task(com.netflix.titus.api.jobmanager.model.job.Task) InternalReconciliationEngine(com.netflix.titus.common.framework.reconciler.internal.InternalReconciliationEngine) LoggerFactory(org.slf4j.LoggerFactory) DefaultReconciliationFramework(com.netflix.titus.common.framework.reconciler.internal.DefaultReconciliationFramework) ValidationError(com.netflix.titus.common.model.sanitizer.ValidationError) FeatureActivationConfiguration(com.netflix.titus.api.FeatureActivationConfiguration) JobEventFactory(com.netflix.titus.master.jobmanager.service.event.JobEventFactory) Map(java.util.Map) JobState(com.netflix.titus.api.jobmanager.model.job.JobState) BasicTag(com.netflix.spectator.api.BasicTag) JobStore(com.netflix.titus.api.jobmanager.store.JobStore) DifferenceResolver(com.netflix.titus.common.framework.reconciler.ReconciliationEngine.DifferenceResolver) Job(com.netflix.titus.api.jobmanager.model.job.Job) Set(java.util.Set) JobFunctions(com.netflix.titus.api.jobmanager.model.job.JobFunctions) Scheduler(rx.Scheduler) Collectors(java.util.stream.Collectors) TaskState(com.netflix.titus.api.jobmanager.model.job.TaskState) List(java.util.List) Optional(java.util.Optional) JobManagerReconcilerEvent(com.netflix.titus.master.jobmanager.service.event.JobManagerReconcilerEvent) Clock(com.netflix.titus.common.util.time.Clock) Gauge(com.netflix.spectator.api.Gauge) EntitySanitizer(com.netflix.titus.common.model.sanitizer.EntitySanitizer) ApplicationSlaManagementService(com.netflix.titus.master.service.management.ApplicationSlaManagementService) DefaultReconciliationEngine(com.netflix.titus.common.framework.reconciler.internal.DefaultReconciliationEngine) MetricConstants(com.netflix.titus.master.MetricConstants) Singleton(javax.inject.Singleton) 
TaskTimeoutChangeActions(com.netflix.titus.master.jobmanager.service.common.action.task.TaskTimeoutChangeActions) ArrayList(java.util.ArrayList) Observable(rx.Observable) Inject(javax.inject.Inject) Pair(com.netflix.titus.common.util.tuple.Pair) BatchJobExt(com.netflix.titus.api.jobmanager.model.job.ext.BatchJobExt) ChangeAction(com.netflix.titus.common.framework.reconciler.ChangeAction) Named(javax.inject.Named) JobDescriptor(com.netflix.titus.api.jobmanager.model.job.JobDescriptor) JOB_PERMISSIVE_SANITIZER(com.netflix.titus.api.jobmanager.model.job.sanitizer.JobSanitizerBuilder.JOB_PERMISSIVE_SANITIZER) Logger(org.slf4j.Logger) Tag(com.netflix.spectator.api.Tag) ServiceJobExt(com.netflix.titus.api.jobmanager.model.job.ext.ServiceJobExt) JOB_STRICT_SANITIZER(com.netflix.titus.api.jobmanager.model.job.sanitizer.JobSanitizerBuilder.JOB_STRICT_SANITIZER) EntityHolder(com.netflix.titus.common.framework.reconciler.EntityHolder) V3JobOperations(com.netflix.titus.api.jobmanager.service.V3JobOperations) TaskAttributes(com.netflix.titus.api.jobmanager.TaskAttributes) Version(com.netflix.titus.api.jobmanager.model.job.Version) Registry(com.netflix.spectator.api.Registry) ReconciliationFramework(com.netflix.titus.common.framework.reconciler.ReconciliationFramework) TitusRuntime(com.netflix.titus.common.runtime.TitusRuntime) Comparator(java.util.Comparator) EntitySanitizerUtil(com.netflix.titus.common.model.sanitizer.EntitySanitizerUtil) Collections(java.util.Collections) DifferenceResolvers(com.netflix.titus.common.framework.reconciler.DifferenceResolvers)
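
After loading, the method above folds the per-job results into two totals: the number of tasks loaded and the number of corrupted task records. The sketch below shows that aggregation on its own, using `Map.Entry` as a hypothetical stand-in for the Titus `Pair` type and task ids in place of `Task` objects. It uses `mapToInt(...).sum()` for both totals, which gives the same result as the `map(...).reduce(0, (a, v) -> a + v)` form used for `taskCount`.

```java
import java.util.List;
import java.util.Map;

// Hypothetical stand-in: each entry is (task ids loaded for one job, corrupted record count).
final class LoadSummarySketch {

    /** Total number of tasks loaded across all jobs. */
    static int totalTasks(List<Map.Entry<List<String>, Integer>> perJob) {
        return perJob.stream().mapToInt(e -> e.getKey().size()).sum();
    }

    /** Total number of corrupted task records reported across all jobs. */
    static int totalCorrupted(List<Map.Entry<List<String>, Integer>> perJob) {
        return perJob.stream().mapToInt(Map.Entry::getValue).sum();
    }
}
```

Both methods return 0 for an empty list, matching the `singleOrDefault(Collections.emptyList())` fallback in the original when the store yields nothing.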

Example 30 with Task

Use of com.netflix.titus.api.jobmanager.model.job.Task in project titus-control-plane by Netflix.

From the class KubeNotificationProcessor, method handlePodUpdatedEvent.

private Mono<Void> handlePodUpdatedEvent(PodEvent event, Job job, Task task) {
    // This is a basic sanity check. If it fails, we have a major problem with pod state.
    if (event.getPod() == null || event.getPod().getStatus() == null || event.getPod().getStatus().getPhase() == null) {
        logger.warn("Pod notification with pod without status or phase set: taskId={}, pod={}", task.getId(), event.getPod());
        metricsNoChangesApplied.increment();
        return Mono.empty();
    }
    PodWrapper podWrapper = new PodWrapper(event.getPod());
    Optional<V1Node> node;
    if (event instanceof PodUpdatedEvent) {
        node = ((PodUpdatedEvent) event).getNode();
    } else if (event instanceof PodDeletedEvent) {
        node = ((PodDeletedEvent) event).getNode();
    } else {
        node = Optional.empty();
    }
    Either<TaskStatus, String> newTaskStatusOrError = new PodToTaskMapper(podWrapper, node, task, event instanceof PodDeletedEvent, containerResultCodeResolver, titusRuntime).getNewTaskStatus();
    if (newTaskStatusOrError.hasError()) {
        logger.info(newTaskStatusOrError.getError());
        metricsNoChangesApplied.increment();
        return Mono.empty();
    }
    TaskStatus newTaskStatus = newTaskStatusOrError.getValue();
    if (TaskStatus.areEquivalent(task.getStatus(), newTaskStatus)) {
        logger.info("Pod change notification does not change task status: taskId={}, status={}, eventSequenceNumber={}", task.getId(), newTaskStatus, event.getSequenceNumber());
    } else {
        logger.info("Pod notification changes task status: taskId={}, fromStatus={}, toStatus={}, eventSequenceNumber={}", task.getId(), task.getStatus(), newTaskStatus, event.getSequenceNumber());
    }
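    // Check up front whether the pod update changes the task; if nothing changes, return immediately.
    // Otherwise the check is repeated inside the change action below, against the most up-to-date task version.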
    if (!updateTaskStatus(podWrapper, newTaskStatus, node, task, true).isPresent()) {
        return Mono.empty();
    }
    return ReactorExt.toMono(v3JobOperations.updateTask(task.getId(), current -> updateTaskStatus(podWrapper, newTaskStatus, node, current, false), V3JobOperations.Trigger.Kube, "Pod status updated from kubernetes node (k8phase='" + event.getPod().getStatus().getPhase() + "', taskState=" + task.getStatus().getState() + ")", KUBE_CALL_METADATA));
}
Also used : Retry(reactor.util.retry.Retry) Task(com.netflix.titus.api.jobmanager.model.job.Task) CollectionsExt(com.netflix.titus.common.util.CollectionsExt) LoggerFactory(org.slf4j.LoggerFactory) V1PodStatus(io.kubernetes.client.openapi.models.V1PodStatus) ReactorExt(com.netflix.titus.common.util.rx.ReactorExt) KubeUtil(com.netflix.titus.master.kubernetes.KubeUtil) TITUS_NODE_DOMAIN(com.netflix.titus.runtime.kubernetes.KubeConstants.TITUS_NODE_DOMAIN) Duration(java.time.Duration) Map(java.util.Map) DirectKubeApiServerIntegrator(com.netflix.titus.master.kubernetes.client.DirectKubeApiServerIntegrator) Either(com.netflix.titus.common.util.tuple.Either) CallMetadata(com.netflix.titus.api.model.callmetadata.CallMetadata) PodEvent(com.netflix.titus.master.kubernetes.client.model.PodEvent) Job(com.netflix.titus.api.jobmanager.model.job.Job) TaskStatus(com.netflix.titus.api.jobmanager.model.job.TaskStatus) JobFunctions(com.netflix.titus.api.jobmanager.model.job.JobFunctions) TaskState(com.netflix.titus.api.jobmanager.model.job.TaskState) PodNotFoundEvent(com.netflix.titus.master.kubernetes.client.model.PodNotFoundEvent) Timer(com.netflix.spectator.api.Timer) List(java.util.List) Optional(java.util.Optional) PodWrapper(com.netflix.titus.master.kubernetes.client.model.PodWrapper) Gauge(com.netflix.spectator.api.Gauge) Disposable(reactor.core.Disposable) Stopwatch(com.google.common.base.Stopwatch) PodDeletedEvent(com.netflix.titus.master.kubernetes.client.model.PodDeletedEvent) Counter(com.netflix.spectator.api.Counter) HashMap(java.util.HashMap) MetricConstants(com.netflix.titus.master.MetricConstants) V1Node(io.kubernetes.client.openapi.models.V1Node) Singleton(javax.inject.Singleton) Scheduler(reactor.core.scheduler.Scheduler) ArrayList(java.util.ArrayList) Inject(javax.inject.Inject) Pair(com.netflix.titus.common.util.tuple.Pair) ContainerResultCodeResolver(com.netflix.titus.master.kubernetes.ContainerResultCodeResolver) Schedulers(reactor.core.scheduler.Schedulers) 
Evaluators.acceptNotNull(com.netflix.titus.common.util.Evaluators.acceptNotNull) KubeJobManagementReconciler(com.netflix.titus.master.kubernetes.controller.KubeJobManagementReconciler) ExecutorService(java.util.concurrent.ExecutorService) ExecutorsExt(com.netflix.titus.common.util.ExecutorsExt) Logger(org.slf4j.Logger) PodUpdatedEvent(com.netflix.titus.master.kubernetes.client.model.PodUpdatedEvent) Mono(reactor.core.publisher.Mono) Activator(com.netflix.titus.common.util.guice.annotation.Activator) TimeUnit(java.util.concurrent.TimeUnit) AtomicLong(java.util.concurrent.atomic.AtomicLong) ExecutableStatus(com.netflix.titus.api.jobmanager.model.job.ExecutableStatus) V3JobOperations(com.netflix.titus.api.jobmanager.service.V3JobOperations) TaskAttributes(com.netflix.titus.api.jobmanager.TaskAttributes) PodToTaskMapper(com.netflix.titus.master.kubernetes.PodToTaskMapper) V1ContainerState(io.kubernetes.client.openapi.models.V1ContainerState) VisibleForTesting(com.google.common.annotations.VisibleForTesting) TitusRuntime(com.netflix.titus.common.runtime.TitusRuntime) Comparator(java.util.Comparator) Evaluators(com.netflix.titus.common.util.Evaluators)
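
The end of the method above evaluates the same change function twice: once against the task snapshot it was handed, to skip events that change nothing, and once inside the update callback against whatever the task is at that moment. A minimal sketch of this check-then-update pattern follows. It uses an `AtomicReference<String>` as a hypothetical stand-in for the job service, and plain strings in place of task status objects.

```java
import java.util.Optional;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in; the real code goes through V3JobOperations.updateTask.
final class CheckThenUpdateSketch {
    private final AtomicReference<String> state;

    CheckThenUpdateSketch(String initial) {
        this.state = new AtomicReference<>(initial);
    }

    /** Returns the new status if it differs from the given one, otherwise empty. */
    static Optional<String> change(String current, String newStatus) {
        return current.equals(newStatus) ? Optional.empty() : Optional.of(newStatus);
    }

    /** Returns true if an update was submitted, false if the event was skipped early. */
    boolean onEvent(String snapshot, String newStatus) {
        // Cheap early check against the caller's (possibly stale) snapshot.
        if (!change(snapshot, newStatus).isPresent()) {
            return false;
        }
        // Re-evaluate against the latest value; keep it as-is if nothing changes after all.
        state.updateAndGet(current -> change(current, newStatus).orElse(current));
        return true;
    }

    String current() {
        return state.get();
    }
}
```

The early check avoids submitting a no-op update, while the second evaluation guards against the snapshot having gone stale between the two steps.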

Aggregations

Task (com.netflix.titus.api.jobmanager.model.job.Task): 222
Test (org.junit.Test): 98
ArrayList (java.util.ArrayList): 63
List (java.util.List): 62
Job (com.netflix.titus.api.jobmanager.model.job.Job): 58
BatchJobTask (com.netflix.titus.api.jobmanager.model.job.BatchJobTask): 45
TaskStatus (com.netflix.titus.api.jobmanager.model.job.TaskStatus): 45
TaskState (com.netflix.titus.api.jobmanager.model.job.TaskState): 42
TitusRuntime (com.netflix.titus.common.runtime.TitusRuntime): 38
BatchJobExt (com.netflix.titus.api.jobmanager.model.job.ext.BatchJobExt): 34
Pair (com.netflix.titus.common.util.tuple.Pair): 32
V1Pod (io.kubernetes.client.openapi.models.V1Pod): 32
V3JobOperations (com.netflix.titus.api.jobmanager.service.V3JobOperations): 31
ServiceJobTask (com.netflix.titus.api.jobmanager.model.job.ServiceJobTask): 29
Optional (java.util.Optional): 27
Collections (java.util.Collections): 26
Collectors (java.util.stream.Collectors): 25
CallMetadata (com.netflix.titus.api.model.callmetadata.CallMetadata): 24
HashMap (java.util.HashMap): 24
TaskUpdateEvent (com.netflix.titus.api.jobmanager.model.job.event.TaskUpdateEvent): 23