Example 1 with ExecutionContext

Use of com.netflix.titus.common.framework.scheduler.ExecutionContext in project titus-control-plane by Netflix.

In class DefaultClusterMembershipService, method clusterStateEvaluator:

private Mono<Void> clusterStateEvaluator(ExecutionContext context) {
    return Mono.defer(() -> {
        ClusterMember localMember = connector.getLocalClusterMemberRevision().getCurrent();
        ClusterMemberLeadershipState localLeadershipState = connector.getLocalLeadershipRevision().getCurrent().getLeadershipState();
        HealthStatus health = healthIndicator.health();
        // Explicitly disabled
        if (!configuration.isLeaderElectionEnabled() || !localMember.isEnabled()) {
            if (localLeadershipState == ClusterMemberLeadershipState.NonLeader) {
                logger.info("Local member excluded from the leader election. Leaving the leader election process");
                return connector.leaveLeadershipGroup(true).flatMap(success -> success ? connector.register(current -> toInactive(current, "Marked by a user as disabled")).ignoreElement().cast(Void.class) : Mono.empty());
            }
            if (localLeadershipState == ClusterMemberLeadershipState.Disabled && localMember.isActive()) {
                return connector.register(current -> toInactive(current, "Marked by a user as disabled")).ignoreElement().cast(Void.class);
            }
            return Mono.empty();
        }
        // Re-enable if healthy
        if (health.getHealthState() == HealthState.Healthy) {
            if (localLeadershipState == ClusterMemberLeadershipState.Disabled) {
                logger.info("Re-enabling local member which is in the disabled state");
                return connector.joinLeadershipGroup().then(connector.register(this::toActive).ignoreElement().cast(Void.class));
            }
            if (!localMember.isActive()) {
                return connector.register(this::toActive).ignoreElement().cast(Void.class);
            }
            return Mono.empty();
        }
        // Disable if unhealthy (and not the leader)
        if (localLeadershipState != ClusterMemberLeadershipState.Disabled && localLeadershipState != ClusterMemberLeadershipState.Leader) {
            logger.info("Disabling local member as it is unhealthy: {}", health);
            return connector.leaveLeadershipGroup(true).flatMap(success -> success ? connector.register(current -> toInactive(current, "Unhealthy: " + health)).ignoreElement().cast(Void.class) : Mono.empty());
        }
        if (localLeadershipState == ClusterMemberLeadershipState.Disabled && localMember.isActive()) {
            return connector.register(current -> toInactive(current, "Unhealthy: " + health)).ignoreElement().cast(Void.class);
        }
        return Mono.empty();
    }).doOnError(error -> {
        logger.info("Cluster membership health evaluation error: {}", error.getMessage());
        logger.debug("Stack trace", error);
    }).doOnTerminate(() -> {
        metrics.updateLocal(connector.getLocalLeadershipRevision().getCurrent().getLeadershipState(), healthIndicator.health());
        metrics.updateSiblings(connector.getClusterMemberSiblings());
    });
}
Also used : ExecutionContext(com.netflix.titus.common.framework.scheduler.ExecutionContext) Disposable(reactor.core.Disposable) LoggerFactory(org.slf4j.LoggerFactory) Singleton(javax.inject.Singleton) ReactorExt(com.netflix.titus.common.util.rx.ReactorExt) Function(java.util.function.Function) ScheduleReference(com.netflix.titus.common.framework.scheduler.ScheduleReference) Inject(javax.inject.Inject) ClusterMembershipService(com.netflix.titus.api.clustermembership.service.ClusterMembershipService) HealthStatus(com.netflix.titus.api.health.HealthStatus) ClusterMembershipConnector(com.netflix.titus.api.clustermembership.connector.ClusterMembershipConnector) Duration(java.time.Duration) Map(java.util.Map) Schedulers(reactor.core.scheduler.Schedulers) ClusterMember(com.netflix.titus.api.clustermembership.model.ClusterMember) ClusterMembershipRevision(com.netflix.titus.api.clustermembership.model.ClusterMembershipRevision) Logger(org.slf4j.Logger) ClusterMemberLeadershipState(com.netflix.titus.api.clustermembership.model.ClusterMemberLeadershipState) Retryers(com.netflix.titus.common.util.retry.Retryers) ClusterMemberLeadership(com.netflix.titus.api.clustermembership.model.ClusterMemberLeadership) Mono(reactor.core.publisher.Mono) ClusterMembershipServiceException(com.netflix.titus.api.clustermembership.service.ClusterMembershipServiceException) HealthIndicator(com.netflix.titus.api.health.HealthIndicator) Flux(reactor.core.publisher.Flux) ClusterMembershipEvent(com.netflix.titus.api.clustermembership.model.event.ClusterMembershipEvent) HealthState(com.netflix.titus.api.health.HealthState) ScheduleDescriptor(com.netflix.titus.common.framework.scheduler.model.ScheduleDescriptor) Optional(java.util.Optional) TitusRuntime(com.netflix.titus.common.runtime.TitusRuntime) Clock(com.netflix.titus.common.util.time.Clock)
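
The branching in clusterStateEvaluator is easier to follow when separated from the reactive plumbing. The following is a minimal sketch, not the Titus implementation: the class MembershipDecisionSketch, its enums, and the decide method are hypothetical stand-ins that only mirror the order of the checks in the method above.

```java
// Hypothetical sketch: mirrors the branch order of clusterStateEvaluator,
// but uses stand-in types instead of the real Titus connector and model classes.
final class MembershipDecisionSketch {

    // Stand-in for ClusterMemberLeadershipState (only the values used above).
    enum LeadershipState { Disabled, NonLeader, Leader }

    // What the evaluator would ask the connector to do in each branch.
    enum Action {
        NONE,
        LEAVE_ELECTION_THEN_DEACTIVATE, // in the original, deactivation runs only if leaving succeeds
        DEACTIVATE,
        JOIN_ELECTION_THEN_ACTIVATE,
        ACTIVATE
    }

    static Action decide(boolean electionEnabled, boolean memberEnabled, boolean memberActive,
                         LeadershipState state, boolean healthy) {
        // Explicitly disabled by configuration or by a user
        if (!electionEnabled || !memberEnabled) {
            if (state == LeadershipState.NonLeader) {
                return Action.LEAVE_ELECTION_THEN_DEACTIVATE;
            }
            if (state == LeadershipState.Disabled && memberActive) {
                return Action.DEACTIVATE;
            }
            return Action.NONE;
        }
        // Re-enable if healthy
        if (healthy) {
            if (state == LeadershipState.Disabled) {
                return Action.JOIN_ELECTION_THEN_ACTIVATE;
            }
            if (!memberActive) {
                return Action.ACTIVATE;
            }
            return Action.NONE;
        }
        // Unhealthy: step out unless already disabled or currently the leader
        if (state != LeadershipState.Disabled && state != LeadershipState.Leader) {
            return Action.LEAVE_ELECTION_THEN_DEACTIVATE;
        }
        if (state == LeadershipState.Disabled && memberActive) {
            return Action.DEACTIVATE;
        }
        return Action.NONE;
    }

    public static void main(String[] args) {
        // An unhealthy leader is left alone by this logic.
        System.out.println(decide(true, true, true, LeadershipState.Leader, false));
    }
}
```

Keeping the decision as a pure function like this makes each branch testable without a connector or a health indicator; the original method instead returns the corresponding Mono directly from each branch.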

Example 2 with ExecutionContext

Use of com.netflix.titus.common.framework.scheduler.ExecutionContext in project titus-control-plane by Netflix.

In class DefaultNodeConditionControllerTest, method checkTasksTerminatedDueToBadNodeConditions:

@Test
public void checkTasksTerminatedDueToBadNodeConditions() {
    // Mock jobs, tasks & nodes
    Map<String, TitusNode> nodeMap = buildNodes();
    List<Job<BatchJobExt>> jobs = getJobs(true);
    Map<String, List<Task>> tasksByJobIdMap = buildTasksForJobAndNodeAssignment(new ArrayList<>(nodeMap.values()), jobs);
    TitusRuntime titusRuntime = mock(TitusRuntime.class);
    when(titusRuntime.getRegistry()).thenReturn(new DefaultRegistry());
    RelocationConfiguration configuration = mock(RelocationConfiguration.class);
    when(configuration.getBadNodeConditionPattern()).thenReturn(".*Failure");
    when(configuration.isTaskTerminationOnBadNodeConditionEnabled()).thenReturn(true);
    NodeDataResolver nodeDataResolver = mock(NodeDataResolver.class);
    when(nodeDataResolver.resolve()).thenReturn(nodeMap);
    JobDataReplicator jobDataReplicator = mock(JobDataReplicator.class);
    when(jobDataReplicator.getStalenessMs()).thenReturn(0L);
    ReadOnlyJobOperations readOnlyJobOperations = mock(ReadOnlyJobOperations.class);
    when(readOnlyJobOperations.getJobs()).thenReturn(new ArrayList<>(jobs));
    tasksByJobIdMap.forEach((key, value) -> when(readOnlyJobOperations.getTasks(key)).thenReturn(value));
    JobManagementClient jobManagementClient = mock(JobManagementClient.class);
    Set<String> terminatedTaskIds = new HashSet<>();
    when(jobManagementClient.killTask(anyString(), anyBoolean(), any())).thenAnswer(invocation -> {
        String taskIdToBeTerminated = invocation.getArgument(0);
        terminatedTaskIds.add(taskIdToBeTerminated);
        return Mono.empty();
    });
    DefaultNodeConditionController nodeConditionCtrl = new DefaultNodeConditionController(configuration, nodeDataResolver, jobDataReplicator, readOnlyJobOperations, jobManagementClient, titusRuntime);
    ExecutionContext executionContext = ExecutionContext.newBuilder().withIteration(ExecutionId.initial()).build();
    StepVerifier.create(nodeConditionCtrl.handleNodesWithBadCondition(executionContext)).verifyComplete();
    assertThat(terminatedTaskIds).hasSize(2);
    verifyTerminatedTasksOnBadNodes(terminatedTaskIds, tasksByJobIdMap, nodeMap);
}
Also used : JobDataReplicator(com.netflix.titus.runtime.connector.jobmanager.JobDataReplicator) ReadOnlyJobOperations(com.netflix.titus.api.jobmanager.service.ReadOnlyJobOperations) JobManagementClient(com.netflix.titus.runtime.connector.jobmanager.JobManagementClient) NodeDataResolver(com.netflix.titus.supplementary.relocation.connector.NodeDataResolver) ArgumentMatchers.anyString(org.mockito.ArgumentMatchers.anyString) TitusRuntime(com.netflix.titus.common.runtime.TitusRuntime) ExecutionContext(com.netflix.titus.common.framework.scheduler.ExecutionContext) DefaultRegistry(com.netflix.spectator.api.DefaultRegistry) ArrayList(java.util.ArrayList) List(java.util.List) TitusNode(com.netflix.titus.supplementary.relocation.connector.TitusNode) Job(com.netflix.titus.api.jobmanager.model.job.Job) RelocationConfiguration(com.netflix.titus.supplementary.relocation.RelocationConfiguration) HashSet(java.util.HashSet) Test(org.junit.Test)
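
The test above configures the bad node condition pattern as ".*Failure". How the controller applies that pattern is not shown here, so the following is only a sketch under an assumption: that a condition counts as bad when its name matches the pattern in full. The class BadConditionPatternSketch, the badConditions helper, and the condition names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical sketch: selects node condition names that match a configured regex.
final class BadConditionPatternSketch {

    // Assumption: full-match semantics (Matcher.matches), not substring search (Matcher.find).
    static List<String> badConditions(String badConditionPattern, List<String> conditionNames) {
        Pattern pattern = Pattern.compile(badConditionPattern);
        return conditionNames.stream()
                .filter(name -> pattern.matcher(name).matches())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Condition names are made up for illustration.
        List<String> names = Arrays.asList("Ready", "DiskFailure", "NetworkProblem");
        System.out.println(badConditions(".*Failure", names));
        System.out.println(badConditions(".*Problem", names));
    }
}
```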

Example 3 with ExecutionContext

Use of com.netflix.titus.common.framework.scheduler.ExecutionContext in project titus-control-plane by Netflix.

In class DefaultNodeConditionControllerTest, method noTerminationsOnDataStaleness:

@Test
public void noTerminationsOnDataStaleness() {
    TitusRuntime titusRuntime = mock(TitusRuntime.class);
    when(titusRuntime.getRegistry()).thenReturn(new DefaultRegistry());
    RelocationConfiguration configuration = mock(RelocationConfiguration.class);
    when(configuration.getBadNodeConditionPattern()).thenReturn(".*Problem");
    when(configuration.isTaskTerminationOnBadNodeConditionEnabled()).thenReturn(true);
    when(configuration.getDataStalenessThresholdMs()).thenReturn(8000L);
    NodeDataResolver nodeDataResolver = mock(NodeDataResolver.class);
    when(nodeDataResolver.getStalenessMs()).thenReturn(5L);
    JobDataReplicator jobDataReplicator = mock(JobDataReplicator.class);
    when(jobDataReplicator.getStalenessMs()).thenReturn(10L);
    ReadOnlyJobOperations readOnlyJobOperations = mock(ReadOnlyJobOperations.class);
    JobManagementClient jobManagementClient = mock(JobManagementClient.class);
    Set<String> terminatedTaskIds = new HashSet<>();
    when(jobManagementClient.killTask(anyString(), anyBoolean(), any())).thenAnswer(invocation -> {
        String taskIdToBeTerminated = invocation.getArgument(0);
        terminatedTaskIds.add(taskIdToBeTerminated);
        return Mono.empty();
    });
    DefaultNodeConditionController nodeConditionCtrl = new DefaultNodeConditionController(configuration, nodeDataResolver, jobDataReplicator, readOnlyJobOperations, jobManagementClient, titusRuntime);
    ExecutionContext executionContext = ExecutionContext.newBuilder().withIteration(ExecutionId.initial()).build();
    StepVerifier.create(nodeConditionCtrl.handleNodesWithBadCondition(executionContext)).verifyComplete();
    // No tasks terminated
    assertThat(terminatedTaskIds).isEmpty();
}
Also used : JobDataReplicator(com.netflix.titus.runtime.connector.jobmanager.JobDataReplicator) ReadOnlyJobOperations(com.netflix.titus.api.jobmanager.service.ReadOnlyJobOperations) ExecutionContext(com.netflix.titus.common.framework.scheduler.ExecutionContext) DefaultRegistry(com.netflix.spectator.api.DefaultRegistry) JobManagementClient(com.netflix.titus.runtime.connector.jobmanager.JobManagementClient) NodeDataResolver(com.netflix.titus.supplementary.relocation.connector.NodeDataResolver) ArgumentMatchers.anyString(org.mockito.ArgumentMatchers.anyString) TitusRuntime(com.netflix.titus.common.runtime.TitusRuntime) RelocationConfiguration(com.netflix.titus.supplementary.relocation.RelocationConfiguration) HashSet(java.util.HashSet) Test(org.junit.Test)

Example 4 with ExecutionContext

Use of com.netflix.titus.common.framework.scheduler.ExecutionContext in project titus-control-plane by Netflix.

In class DefaultNodeConditionControllerTest, method badNodeConditionsIgnoredForJobsNotOptingIn:

@Test
public void badNodeConditionsIgnoredForJobsNotOptingIn() {
    Map<String, TitusNode> nodeMap = buildNodes();
    List<Job<BatchJobExt>> jobs = getJobs(false);
    Map<String, List<Task>> stringListMap = buildTasksForJobAndNodeAssignment(new ArrayList<>(nodeMap.values()), jobs);
    TitusRuntime titusRuntime = mock(TitusRuntime.class);
    when(titusRuntime.getRegistry()).thenReturn(new DefaultRegistry());
    RelocationConfiguration configuration = mock(RelocationConfiguration.class);
    when(configuration.getBadNodeConditionPattern()).thenReturn(".*Failure");
    when(configuration.isTaskTerminationOnBadNodeConditionEnabled()).thenReturn(true);
    NodeDataResolver nodeDataResolver = mock(NodeDataResolver.class);
    when(nodeDataResolver.resolve()).thenReturn(nodeMap);
    JobDataReplicator jobDataReplicator = mock(JobDataReplicator.class);
    when(jobDataReplicator.getStalenessMs()).thenReturn(0L);
    // Job attribute "terminateContainerOnBadAgent" = False
    ReadOnlyJobOperations readOnlyJobOperations = mock(ReadOnlyJobOperations.class);
    when(readOnlyJobOperations.getJobs()).thenReturn(new ArrayList<>(jobs));
    stringListMap.forEach((key, value) -> when(readOnlyJobOperations.getTasks(key)).thenReturn(value));
    JobManagementClient jobManagementClient = mock(JobManagementClient.class);
    Set<String> terminatedTaskIds = new HashSet<>();
    when(jobManagementClient.killTask(anyString(), anyBoolean(), any())).thenAnswer(invocation -> {
        String taskIdToBeTerminated = invocation.getArgument(0);
        terminatedTaskIds.add(taskIdToBeTerminated);
        return Mono.empty();
    });
    DefaultNodeConditionController nodeConditionController = new DefaultNodeConditionController(configuration, nodeDataResolver, jobDataReplicator, readOnlyJobOperations, jobManagementClient, titusRuntime);
    ExecutionContext executionContext = ExecutionContext.newBuilder().withIteration(ExecutionId.initial()).build();
    StepVerifier.create(nodeConditionController.handleNodesWithBadCondition(executionContext)).verifyComplete();
    // No tasks should be terminated for jobs that have not opted in
    assertThat(terminatedTaskIds).isEmpty();
}
Also used : JobDataReplicator(com.netflix.titus.runtime.connector.jobmanager.JobDataReplicator) ReadOnlyJobOperations(com.netflix.titus.api.jobmanager.service.ReadOnlyJobOperations) JobManagementClient(com.netflix.titus.runtime.connector.jobmanager.JobManagementClient) NodeDataResolver(com.netflix.titus.supplementary.relocation.connector.NodeDataResolver) ArgumentMatchers.anyString(org.mockito.ArgumentMatchers.anyString) TitusRuntime(com.netflix.titus.common.runtime.TitusRuntime) ExecutionContext(com.netflix.titus.common.framework.scheduler.ExecutionContext) DefaultRegistry(com.netflix.spectator.api.DefaultRegistry) ArrayList(java.util.ArrayList) List(java.util.List) TitusNode(com.netflix.titus.supplementary.relocation.connector.TitusNode) Job(com.netflix.titus.api.jobmanager.model.job.Job) RelocationConfiguration(com.netflix.titus.supplementary.relocation.RelocationConfiguration) HashSet(java.util.HashSet) Test(org.junit.Test)
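
The comment in the test above refers to a job attribute "terminateContainerOnBadAgent" that is false for these jobs. A minimal sketch of such an opt-in check follows, assuming the attribute lives in a string-to-string map and defaults to false when absent. The class OptInSketch and the optedIn method are hypothetical, and the attribute key is copied from the comment; the key actually used in Titus may differ.

```java
import java.util.Collections;
import java.util.Map;

// Hypothetical sketch: decides whether a job has opted in to task termination
// on bad node conditions, based on a boolean-valued job attribute.
final class OptInSketch {

    // Key taken from the test comment; treat it as an assumption, not the real Titus key.
    static final String OPT_IN_ATTRIBUTE = "terminateContainerOnBadAgent";

    static boolean optedIn(Map<String, String> jobAttributes) {
        // Boolean.parseBoolean is case-insensitive and returns false for anything but "true".
        return Boolean.parseBoolean(jobAttributes.getOrDefault(OPT_IN_ATTRIBUTE, "false"));
    }

    public static void main(String[] args) {
        System.out.println(optedIn(Collections.singletonMap(OPT_IN_ATTRIBUTE, "true")));
        System.out.println(optedIn(Collections.<String, String>emptyMap()));
    }
}
```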

Aggregations

ExecutionContext (com.netflix.titus.common.framework.scheduler.ExecutionContext) 4
TitusRuntime (com.netflix.titus.common.runtime.TitusRuntime) 4
DefaultRegistry (com.netflix.spectator.api.DefaultRegistry) 3
ReadOnlyJobOperations (com.netflix.titus.api.jobmanager.service.ReadOnlyJobOperations) 3
JobDataReplicator (com.netflix.titus.runtime.connector.jobmanager.JobDataReplicator) 3
JobManagementClient (com.netflix.titus.runtime.connector.jobmanager.JobManagementClient) 3
RelocationConfiguration (com.netflix.titus.supplementary.relocation.RelocationConfiguration) 3
NodeDataResolver (com.netflix.titus.supplementary.relocation.connector.NodeDataResolver) 3
HashSet (java.util.HashSet) 3
Test (org.junit.Test) 3
ArgumentMatchers.anyString (org.mockito.ArgumentMatchers.anyString) 3
Job (com.netflix.titus.api.jobmanager.model.job.Job) 2
TitusNode (com.netflix.titus.supplementary.relocation.connector.TitusNode) 2
ArrayList (java.util.ArrayList) 2
List (java.util.List) 2
ClusterMembershipConnector (com.netflix.titus.api.clustermembership.connector.ClusterMembershipConnector) 1
ClusterMember (com.netflix.titus.api.clustermembership.model.ClusterMember) 1
ClusterMemberLeadership (com.netflix.titus.api.clustermembership.model.ClusterMemberLeadership) 1
ClusterMemberLeadershipState (com.netflix.titus.api.clustermembership.model.ClusterMemberLeadershipState) 1
ClusterMembershipRevision (com.netflix.titus.api.clustermembership.model.ClusterMembershipRevision) 1