Example 36 with BatchJobExt

Use of com.netflix.titus.api.jobmanager.model.job.ext.BatchJobExt in project titus-control-plane by Netflix.

From the class ExtendedJobSanitizerTest, method testLegacyBatchJobDisruptionBudgetRewrite.

@Test
public void testLegacyBatchJobDisruptionBudgetRewrite() {
    JobDescriptor<BatchJobExt> jobDescriptor = newBatchJob().getValue().toBuilder().withDisruptionBudget(DisruptionBudget.none()).build();
    ExtendedJobSanitizer sanitizer = new ExtendedJobSanitizer(configuration, jobAssertions, entitySanitizer, disruptionBudgetSanitizer, jd -> false, jd -> false, titusRuntime);
    Optional<JobDescriptor<BatchJobExt>> sanitizedOpt = sanitizer.sanitize(jobDescriptor);
    assertThat(sanitizedOpt).isNotEmpty();
    JobDescriptor<BatchJobExt> sanitized = sanitizedOpt.get();
    String nonCompliant = sanitized.getAttributes().get(TITUS_NON_COMPLIANT_FEATURES);
    assertThat(nonCompliant).contains(JobFeatureComplianceChecks.DISRUPTION_BUDGET_FEATURE);
    SelfManagedDisruptionBudgetPolicy policy = (SelfManagedDisruptionBudgetPolicy) sanitized.getDisruptionBudget().getDisruptionBudgetPolicy();
    assertThat(policy.getRelocationTimeMs()).isEqualTo((long) ((jobDescriptor.getExtensions()).getRuntimeLimitMs() * BATCH_RUNTIME_LIMIT_FACTOR));
}
Also used : SelfManagedDisruptionBudgetPolicy(com.netflix.titus.api.jobmanager.model.job.disruptionbudget.SelfManagedDisruptionBudgetPolicy) JobDescriptor(com.netflix.titus.api.jobmanager.model.job.JobDescriptor) BatchJobExt(com.netflix.titus.api.jobmanager.model.job.ext.BatchJobExt) Test(org.junit.Test)

Example 37 with BatchJobExt

Use of com.netflix.titus.api.jobmanager.model.job.ext.BatchJobExt in project titus-control-plane by Netflix.

From the class BatchDifferenceResolver, method applyStore.

private List<ChangeAction> applyStore(ReconciliationEngine<JobManagerReconcilerEvent> engine, BatchJobView refJobView, EntityHolder storeJob, AtomicInteger allowedNewTasks) {
    if (!storeWriteRetryInterceptor.executionLimits(storeJob)) {
        return Collections.emptyList();
    }
    List<ChangeAction> actions = new ArrayList<>();
    EntityHolder refJobHolder = refJobView.getJobHolder();
    Job<BatchJobExt> refJob = refJobHolder.getEntity();
    if (!refJobHolder.getEntity().equals(storeJob.getEntity())) {
        actions.add(storeWriteRetryInterceptor.apply(BasicJobActions.updateJobInStore(engine, jobStore)));
    }
    boolean isJobTerminating = refJob.getStatus().getState() == JobState.KillInitiated;
    for (EntityHolder referenceTask : refJobHolder.getChildren()) {
        Optional<EntityHolder> storeHolder = storeJob.findById(referenceTask.getId());
        boolean refAndStoreInSync = storeHolder.isPresent() && DifferenceResolverUtils.areEquivalent(storeHolder.get(), referenceTask);
        boolean shouldRetry = !isJobTerminating && DifferenceResolverUtils.shouldRetry(refJob, referenceTask.getEntity()) && allowedNewTasks.get() > 0;
        if (refAndStoreInSync) {
            BatchJobTask storeTask = storeHolder.get().getEntity();
            if (shouldRetry && TaskRetryers.shouldRetryNow(referenceTask, clock)) {
                logger.info("Retrying task: oldTaskId={}, index={}", referenceTask.getId(), storeTask.getIndex());
                createNewTaskAction(refJobView, storeTask.getIndex(), Optional.of(referenceTask), Collections.emptyList(), Collections.emptyList()).ifPresent(actions::add);
            }
        } else {
            Task task = referenceTask.getEntity();
            CallMetadata callMetadata = RECONCILER_CALLMETADATA.toBuilder().withCallReason("Writing runtime state changes to store").build();
            actions.add(storeWriteRetryInterceptor.apply(BasicTaskActions.writeReferenceTaskToStore(jobStore, engine, task.getId(), callMetadata, titusRuntime)));
        }
        // Both current and delayed retries are counted
        if (shouldRetry) {
            allowedNewTasks.decrementAndGet();
        }
    }
    return actions;
}
Also used : Task(com.netflix.titus.api.jobmanager.model.job.Task) BatchJobTask(com.netflix.titus.api.jobmanager.model.job.BatchJobTask) TitusChangeAction(com.netflix.titus.master.jobmanager.service.common.action.TitusChangeAction) ChangeAction(com.netflix.titus.common.framework.reconciler.ChangeAction) CallMetadata(com.netflix.titus.api.model.callmetadata.CallMetadata) BatchJobExt(com.netflix.titus.api.jobmanager.model.job.ext.BatchJobExt) ArrayList(java.util.ArrayList) EntityHolder(com.netflix.titus.common.framework.reconciler.EntityHolder)
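
The comment "Both current and delayed retries are counted" in applyStore describes a shared retry budget: every retry-eligible task consumes one slot from allowedNewTasks, whether it is retried in this pass or only later. A minimal, self-contained sketch of that accounting follows. The class RetryBudgetSketch and its Candidate type are hypothetical stand-ins, not Titus APIs, and the store-sync check from the original method is left out for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public final class RetryBudgetSketch {

    /** Hypothetical stand-in for a finished task that may need a replacement. */
    public static final class Candidate {
        final String id;
        final boolean retryable;     // the retry policy allows another attempt
        final boolean delayElapsed;  // the retry delay has already passed

        public Candidate(String id, boolean retryable, boolean delayElapsed) {
            this.id = id;
            this.retryable = retryable;
            this.delayElapsed = delayElapsed;
        }
    }

    /**
     * Returns the ids of tasks to retry in this pass. The budget is decremented
     * for every retry-eligible task, including those whose delay has not elapsed yet.
     */
    public static List<String> retryNow(List<Candidate> candidates, AtomicInteger allowedNewTasks) {
        List<String> retried = new ArrayList<>();
        for (Candidate c : candidates) {
            boolean shouldRetry = c.retryable && allowedNewTasks.get() > 0;
            if (shouldRetry && c.delayElapsed) {
                retried.add(c.id);
            }
            // Both current and delayed retries consume budget.
            if (shouldRetry) {
                allowedNewTasks.decrementAndGet();
            }
        }
        return retried;
    }
}
```

With a budget of two, a delayed candidate followed by two ready candidates yields a single retry: the delayed one reserves a slot, the first ready one takes the last slot, and the third is skipped.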

Example 38 with BatchJobExt

Use of com.netflix.titus.api.jobmanager.model.job.ext.BatchJobExt in project titus-control-plane by Netflix.

From the class DifferenceResolverUtils, method findTaskStateTimeouts.

/**
 * Find all tasks that are stuck in a specific state. The number of {@link ChangeAction changes} will be limited
 * by the {@link TokenBucket stuckInStateRateLimiter}
 */
public static List<ChangeAction> findTaskStateTimeouts(ReconciliationEngine<JobManagerReconcilerEvent> engine, JobView runningJobView, JobManagerConfiguration configuration, JobServiceRuntime runtime, JobStore jobStore, VersionSupplier versionSupplier, TokenBucket stuckInStateRateLimiter, TitusRuntime titusRuntime) {
    Clock clock = titusRuntime.getClock();
    List<ChangeAction> actions = new ArrayList<>();
    runningJobView.getJobHolder().getChildren().forEach(taskHolder -> {
        Task task = taskHolder.getEntity();
        TaskState taskState = task.getStatus().getState();
        if (JobFunctions.isBatchJob(runningJobView.getJob()) && taskState == TaskState.Started) {
            Job<BatchJobExt> batchJob = runningJobView.getJob();
            // We expect runtime limit to be always set, so this is just extra safety measure.
            long runtimeLimitMs = Math.max(BatchJobExt.RUNTIME_LIMIT_MIN, batchJob.getJobDescriptor().getExtensions().getRuntimeLimitMs());
            long deadline = task.getStatus().getTimestamp() + runtimeLimitMs;
            if (deadline < clock.wallTime()) {
                actions.add(KillInitiatedActions.reconcilerInitiatedTaskKillInitiated(engine, task, runtime, jobStore, versionSupplier, TaskStatus.REASON_RUNTIME_LIMIT_EXCEEDED, "Task running too long (runtimeLimit=" + runtimeLimitMs + "ms)", titusRuntime));
            }
            return;
        }
        TaskTimeoutChangeActions.TimeoutStatus timeoutStatus = TaskTimeoutChangeActions.getTimeoutStatus(taskHolder, clock);
        switch(timeoutStatus) {
            case Ignore:
            case Pending:
                break;
            case NotSet:
                long timeoutMs = -1;
                switch(taskState) {
                    case Launched:
                        timeoutMs = configuration.getTaskInLaunchedStateTimeoutMs();
                        break;
                    case StartInitiated:
                        timeoutMs = isBatch(runningJobView.getJob()) ? configuration.getBatchTaskInStartInitiatedStateTimeoutMs() : configuration.getServiceTaskInStartInitiatedStateTimeoutMs();
                        break;
                    case KillInitiated:
                        timeoutMs = configuration.getTaskInKillInitiatedStateTimeoutMs();
                        break;
                }
                if (timeoutMs > 0) {
                    actions.add(TaskTimeoutChangeActions.setTimeout(taskHolder.getId(), task.getStatus().getState(), timeoutMs, clock));
                }
                break;
            case TimedOut:
                if (!stuckInStateRateLimiter.tryTake()) {
                    break;
                }
                if (task.getStatus().getState() == TaskState.KillInitiated) {
                    int attempts = TaskTimeoutChangeActions.getKillInitiatedAttempts(taskHolder) + 1;
                    if (attempts >= configuration.getTaskKillAttempts()) {
                        actions.add(BasicTaskActions.updateTaskInRunningModel(task.getId(), V3JobOperations.Trigger.Reconciler, configuration, engine, taskParam -> Optional.of(taskParam.toBuilder().withStatus(taskParam.getStatus().toBuilder().withState(TaskState.Finished).withReasonCode(TaskStatus.REASON_STUCK_IN_KILLING_STATE).withReasonMessage("stuck in " + taskState + " state").build()).build()), "TimedOut in KillInitiated state", versionSupplier, titusRuntime, JobManagerConstants.RECONCILER_CALLMETADATA.toBuilder().withCallReason("Kill initiated").build()));
                    } else {
                        actions.add(TaskTimeoutChangeActions.incrementTaskKillAttempt(task.getId(), configuration.getTaskInKillInitiatedStateTimeoutMs(), clock));
                        actions.add(KillInitiatedActions.reconcilerInitiatedTaskKillInitiated(engine, task, runtime, jobStore, versionSupplier, TaskStatus.REASON_STUCK_IN_KILLING_STATE, "Another kill attempt (" + (attempts + 1) + ')', titusRuntime));
                    }
                } else {
                    actions.add(KillInitiatedActions.reconcilerInitiatedTaskKillInitiated(engine, task, runtime, jobStore, versionSupplier, TaskStatus.REASON_STUCK_IN_STATE, "Task stuck in " + taskState + " state", titusRuntime));
                }
                break;
        }
    });
    return actions;
}
Also used : JobManagerConstants(com.netflix.titus.api.jobmanager.service.JobManagerConstants) JobServiceRuntime(com.netflix.titus.master.jobmanager.service.JobServiceRuntime) Task(com.netflix.titus.api.jobmanager.model.job.Task) HashMap(java.util.HashMap) Function(java.util.function.Function) TaskTimeoutChangeActions(com.netflix.titus.master.jobmanager.service.common.action.task.TaskTimeoutChangeActions) ArrayList(java.util.ArrayList) EbsVolume(com.netflix.titus.api.jobmanager.model.job.ebs.EbsVolume) TASK_ATTRIBUTES_EBS_VOLUME_ID(com.netflix.titus.api.jobmanager.TaskAttributes.TASK_ATTRIBUTES_EBS_VOLUME_ID) HashSet(java.util.HashSet) Map(java.util.Map) JobState(com.netflix.titus.api.jobmanager.model.job.JobState) BatchJobExt(com.netflix.titus.api.jobmanager.model.job.ext.BatchJobExt) ChangeAction(com.netflix.titus.common.framework.reconciler.ChangeAction) JobManagerConfiguration(com.netflix.titus.master.jobmanager.service.JobManagerConfiguration) JobStore(com.netflix.titus.api.jobmanager.store.JobStore) JobDescriptor(com.netflix.titus.api.jobmanager.model.job.JobDescriptor) Job(com.netflix.titus.api.jobmanager.model.job.Job) ServiceJobExt(com.netflix.titus.api.jobmanager.model.job.ext.ServiceJobExt) TaskStatus(com.netflix.titus.api.jobmanager.model.job.TaskStatus) Set(java.util.Set) JobFunctions(com.netflix.titus.api.jobmanager.model.job.JobFunctions) Collectors(java.util.stream.Collectors) TaskState(com.netflix.titus.api.jobmanager.model.job.TaskState) EntityHolder(com.netflix.titus.common.framework.reconciler.EntityHolder) Consumer(java.util.function.Consumer) List(java.util.List) ExecutableStatus(com.netflix.titus.api.jobmanager.model.job.ExecutableStatus) V3JobOperations(com.netflix.titus.api.jobmanager.service.V3JobOperations) VersionSupplier(com.netflix.titus.master.jobmanager.service.VersionSupplier) ReconciliationEngine(com.netflix.titus.common.framework.reconciler.ReconciliationEngine) Optional(java.util.Optional) 
BasicTaskActions(com.netflix.titus.master.jobmanager.service.common.action.task.BasicTaskActions) JobManagerReconcilerEvent(com.netflix.titus.master.jobmanager.service.event.JobManagerReconcilerEvent) TitusRuntime(com.netflix.titus.common.runtime.TitusRuntime) TokenBucket(com.netflix.titus.common.util.limiter.tokenbucket.TokenBucket) Clock(com.netflix.titus.common.util.time.Clock) KillInitiatedActions(com.netflix.titus.master.jobmanager.service.common.action.task.KillInitiatedActions) TASK_ATTRIBUTES_IP_ALLOCATION_ID(com.netflix.titus.api.jobmanager.TaskAttributes.TASK_ATTRIBUTES_IP_ALLOCATION_ID)
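
The batch branch of findTaskStateTimeouts reduces to a small deadline computation: clamp the configured runtime limit to a minimum, add it to the time the task entered Started, and compare against the wall clock. The sketch below isolates that arithmetic. RuntimeLimitCheck and ASSUMED_RUNTIME_LIMIT_MIN_MS are hypothetical names; the actual value of BatchJobExt.RUNTIME_LIMIT_MIN is not shown in the excerpt, so the floor used here is only an assumption.

```java
public final class RuntimeLimitCheck {

    // Assumed floor for illustration only; the real constant is BatchJobExt.RUNTIME_LIMIT_MIN.
    public static final long ASSUMED_RUNTIME_LIMIT_MIN_MS = 1_000L;

    /**
     * Returns true when a task that entered the Started state at startedAtMs
     * has run strictly longer than its (clamped) runtime limit.
     */
    public static boolean isPastRuntimeLimit(long startedAtMs, long configuredLimitMs, long nowMs) {
        // A missing or too-small limit falls back to the floor, mirroring the Math.max in the excerpt.
        long runtimeLimitMs = Math.max(ASSUMED_RUNTIME_LIMIT_MIN_MS, configuredLimitMs);
        long deadline = startedAtMs + runtimeLimitMs;
        // Strict comparison: a task exactly at its deadline is not killed yet.
        return deadline < nowMs;
    }
}
```

The comparison is strict, as in the excerpt, so a task is flagged only once the clock has moved past the deadline.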

Example 39 with BatchJobExt

Use of com.netflix.titus.api.jobmanager.model.job.ext.BatchJobExt in project titus-control-plane by Netflix.

From the class JobRuntimePredictionSanitizer, method capPredictionToRuntimeLimit.

/**
 * Use the prediction when available and shorter than the runtime limit, otherwise the runtime limit becomes
 * the prediction if within {@link JobRuntimePredictionConfiguration#getMaxOpportunisticRuntimeLimitMs()}
 */
@SuppressWarnings("unchecked")
private JobDescriptor capPredictionToRuntimeLimit(JobDescriptor jobDescriptor) {
    // non-batch jobs have been filtered before this point, it is safe to cast
    BatchJobExt extensions = ((JobDescriptor<BatchJobExt>) jobDescriptor).getExtensions();
    long runtimeLimitMs = extensions.getRuntimeLimitMs();
    if (runtimeLimitMs <= 0 || runtimeLimitMs > configuration.getMaxOpportunisticRuntimeLimitMs()) {
        // no runtime limit or too high to be used, noop
        return jobDescriptor;
    }
    return JobFunctions.getJobRuntimePrediction(jobDescriptor).filter(prediction -> runtimeLimitMs > prediction.toMillis()).map(ignored -> jobDescriptor).orElseGet(() -> JobFunctions.appendJobDescriptorAttributes(jobDescriptor, ImmutableMap.<String, String>builder().put(JOB_ATTRIBUTES_RUNTIME_PREDICTION_SEC, Double.toString(runtimeLimitMs / 1000.0)).put(JOB_ATTRIBUTES_RUNTIME_PREDICTION_CONFIDENCE, Double.toString(1.0)).build()));
}
Also used : JobRuntimePredictions(com.netflix.titus.runtime.connector.prediction.JobRuntimePredictions) JOB_ATTRIBUTES_SANITIZATION_SKIPPED_RUNTIME_PREDICTION(com.netflix.titus.api.jobmanager.JobAttributes.JOB_ATTRIBUTES_SANITIZATION_SKIPPED_RUNTIME_PREDICTION) CollectionsExt(com.netflix.titus.common.util.CollectionsExt) LoggerFactory(org.slf4j.LoggerFactory) JobRuntimePrediction(com.netflix.titus.runtime.connector.prediction.JobRuntimePrediction) HashMap(java.util.HashMap) UnaryOperator(java.util.function.UnaryOperator) Singleton(javax.inject.Singleton) Inject(javax.inject.Inject) JobRuntimePredictionClient(com.netflix.titus.runtime.connector.prediction.JobRuntimePredictionClient) JOB_ATTRIBUTES_RUNTIME_PREDICTION_VERSION(com.netflix.titus.api.jobmanager.JobAttributes.JOB_ATTRIBUTES_RUNTIME_PREDICTION_VERSION) JOB_ATTRIBUTES_RUNTIME_PREDICTION_AVAILABLE(com.netflix.titus.api.jobmanager.JobAttributes.JOB_ATTRIBUTES_RUNTIME_PREDICTION_AVAILABLE) Map(java.util.Map) BatchJobExt(com.netflix.titus.api.jobmanager.model.job.ext.BatchJobExt) AdmissionSanitizer(com.netflix.titus.common.model.admission.AdmissionSanitizer) JobFunctions.appendJobDescriptorAttributes(com.netflix.titus.api.jobmanager.model.job.JobFunctions.appendJobDescriptorAttributes) JobDescriptor(com.netflix.titus.api.jobmanager.model.job.JobDescriptor) Logger(org.slf4j.Logger) JOB_ATTRIBUTES_RUNTIME_PREDICTION_SEC(com.netflix.titus.api.jobmanager.JobAttributes.JOB_ATTRIBUTES_RUNTIME_PREDICTION_SEC) ImmutableMap(com.google.common.collect.ImmutableMap) JOB_ATTRIBUTES_RUNTIME_PREDICTION_CONFIDENCE(com.netflix.titus.api.jobmanager.JobAttributes.JOB_ATTRIBUTES_RUNTIME_PREDICTION_CONFIDENCE) JobFunctions(com.netflix.titus.api.jobmanager.model.job.JobFunctions) Mono(reactor.core.publisher.Mono) JOB_PARAMETER_SKIP_RUNTIME_PREDICTION(com.netflix.titus.api.jobmanager.JobAttributes.JOB_PARAMETER_SKIP_RUNTIME_PREDICTION) 
JOB_ATTRIBUTES_RUNTIME_PREDICTION_MODEL_ID(com.netflix.titus.api.jobmanager.JobAttributes.JOB_ATTRIBUTES_RUNTIME_PREDICTION_MODEL_ID) FunctionExt(com.netflix.titus.common.util.FunctionExt) Optional(java.util.Optional) TitusRuntime(com.netflix.titus.common.runtime.TitusRuntime)
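
The decision made by capPredictionToRuntimeLimit can be read as a three-way rule: leave everything untouched when the runtime limit is unset or above the opportunistic maximum, keep a prediction that is shorter than the limit, and otherwise substitute the limit itself. The sketch below restates that rule on plain values instead of job descriptor attributes. PredictionCap and effectivePredictionMs are hypothetical names introduced for illustration, and the confidence attribute written by the original is omitted.

```java
import java.time.Duration;
import java.util.Optional;

public final class PredictionCap {

    /**
     * Returns the runtime estimate in milliseconds after applying the cap:
     * the prediction when it is shorter than the limit, the limit when it is
     * usable and no shorter prediction exists, or the prediction unchanged
     * (possibly empty) when the limit is unset or too high.
     */
    public static Optional<Long> effectivePredictionMs(Optional<Duration> prediction,
                                                       long runtimeLimitMs,
                                                       long maxOpportunisticLimitMs) {
        if (runtimeLimitMs <= 0 || runtimeLimitMs > maxOpportunisticLimitMs) {
            // No runtime limit, or too high to be used: no-op.
            return prediction.map(Duration::toMillis);
        }
        return Optional.of(prediction
                .filter(p -> runtimeLimitMs > p.toMillis())
                .map(Duration::toMillis)
                .orElse(runtimeLimitMs));
    }
}
```

Taking an Optional parameter mirrors the Optional returned by JobFunctions.getJobRuntimePrediction in the excerpt; a nullable Duration would work equally well in standalone code.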

Example 40 with BatchJobExt

Use of com.netflix.titus.api.jobmanager.model.job.ext.BatchJobExt in project titus-control-plane by Netflix.

From the class JobModelSanitizationTest, method testBatchJobWithIncompleteEfsDefinition.

@Test
public void testBatchJobWithIncompleteEfsDefinition() {
    JobDescriptor<BatchJobExt> jobDescriptor = oneTaskBatchJobDescriptor();
    JobDescriptor<BatchJobExt> incompleteEfsDefinition = JobModel.newJobDescriptor(jobDescriptor).withContainer(JobModel.newContainer(jobDescriptor.getContainer()).withContainerResources(JobModel.newContainerResources(jobDescriptor.getContainer().getContainerResources()).withEfsMounts(Collections.singletonList(new EfsMount("efsId#1", "/data", null, null))).build()).build()).build();
    Job<BatchJobExt> job = JobGenerator.batchJobs(incompleteEfsDefinition).getValue();
    // EFS violation expected
    assertThat(entitySanitizer.validate(job)).hasSize(1);
    // Now do cleanup
    Job<BatchJobExt> sanitized = entitySanitizer.sanitize(job).get();
    assertThat(entitySanitizer.validate(sanitized)).isEmpty();
}
Also used : BatchJobExt(com.netflix.titus.api.jobmanager.model.job.ext.BatchJobExt) EfsMount(com.netflix.titus.api.model.EfsMount) Test(org.junit.Test)

Aggregations

BatchJobExt (com.netflix.titus.api.jobmanager.model.job.ext.BatchJobExt) 73
Test (org.junit.Test) 55
Task (com.netflix.titus.api.jobmanager.model.job.Task) 30
BatchJobTask (com.netflix.titus.api.jobmanager.model.job.BatchJobTask) 25
List (java.util.List) 20
ArrayList (java.util.ArrayList) 19
JobStore (com.netflix.titus.api.jobmanager.store.JobStore) 17
HashMap (java.util.HashMap) 16
V1Affinity (io.kubernetes.client.openapi.models.V1Affinity) 14
IntegrationNotParallelizableTest (com.netflix.titus.testkit.junit.category.IntegrationNotParallelizableTest) 13
ServiceJobTask (com.netflix.titus.api.jobmanager.model.job.ServiceJobTask) 11
V1Pod (io.kubernetes.client.openapi.models.V1Pod) 11
Job (com.netflix.titus.api.jobmanager.model.job.Job) 10
JobDescriptor (com.netflix.titus.api.jobmanager.model.job.JobDescriptor) 10
Container (com.netflix.titus.api.jobmanager.model.job.Container) 6
Map (java.util.Map) 6
Assertions.assertThat (org.assertj.core.api.Assertions.assertThat) 6
V1Container (io.kubernetes.client.openapi.models.V1Container) 5
BasicContainer (com.netflix.titus.api.jobmanager.model.job.BasicContainer) 4
Image (com.netflix.titus.api.jobmanager.model.job.Image) 4