
Example 1 with RecoveryStep

Use of com.mesosphere.sdk.scheduler.recovery.RecoveryStep in the dcos-commons project by mesosphere.

From the class ServiceTest, method transientToCustomPermanentFailureTransition.

@Test
public void transientToCustomPermanentFailureTransition() throws Exception {
    // Offer that the service is not expected to accept; it is sent only to trigger plan evaluation.
    Protos.Offer unacceptableOffer = Protos.Offer.newBuilder()
            .setId(Protos.OfferID.newBuilder().setValue(UUID.randomUUID().toString()))
            .setFrameworkId(TestConstants.FRAMEWORK_ID)
            .setSlaveId(TestConstants.AGENT_ID)
            .setHostname(TestConstants.HOSTNAME)
            .addResources(Protos.Resource.newBuilder()
                    .setName("mem")
                    .setType(Protos.Value.Type.SCALAR)
                    .setScalar(Protos.Value.Scalar.newBuilder().setValue(1.0)))
            .build();
    Collection<SimulationTick> ticks = new ArrayList<>();
    ticks.add(Send.register());
    ticks.add(Expect.reconciledImplicitly());
    // Verify that service launches 1 hello pod then 2 world pods.
    ticks.add(Send.offerBuilder("hello").build());
    ticks.add(Expect.launchedTasks("hello-0-server"));
    // Send another offer before hello-0 is finished:
    ticks.add(Send.offerBuilder("world").build());
    ticks.add(Expect.declinedLastOffer());
    // Running, no readiness check is applicable:
    ticks.add(Send.taskStatus("hello-0-server", Protos.TaskState.TASK_RUNNING).build());
    // Now world-0 will deploy:
    ticks.add(Send.offerBuilder("world").build());
    ticks.add(Expect.launchedTasks("world-0-server"));
    // With world-0's readiness check passing, world-1 still won't launch due to a hostname placement constraint:
    ticks.add(Send.taskStatus("world-0-server", Protos.TaskState.TASK_RUNNING).setReadinessCheckExitCode(0).build());
    // world-1 will finally launch if the offered hostname is different:
    ticks.add(Send.offerBuilder("world").setHostname("host-foo").build());
    ticks.add(Expect.launchedTasks("world-1-server"));
    ticks.add(Send.taskStatus("world-1-server", Protos.TaskState.TASK_RUNNING).setReadinessCheckExitCode(0).build());
    // *** Complete initial deployment. ***
    ticks.add(Expect.allPlansComplete());
    // Kill hello-0 to trigger transient recovery
    ticks.add(Send.taskStatus("hello-0-server", Protos.TaskState.TASK_FAILED).build());
    // Send an unused offer to trigger an evaluation of the recovery plan
    ticks.add(Send.offer(unacceptableOffer));
    // Expect default transient recovery triggered
    ticks.add(Expect.recoveryStepStatus("hello-0:[server]", "hello-0:[server]", Status.PREPARED));
    // Now trigger custom permanent replacement of that pod
    ticks.add(Send.replacePod("hello-0"));
    // Send an unused offer to trigger an evaluation of the recovery plan
    ticks.add(Send.offer(unacceptableOffer));
    // Custom expectation not relevant to other tests
    Expect expectSingleRecoveryPhase = new Expect() {

        @Override
        public void expect(ClusterState state, SchedulerDriver mockDriver) throws AssertionError {
            Plan recoveryPlan = state.getPlans().stream().filter(plan -> plan.getName().equals("recovery")).findAny().get();
            Assert.assertEquals(1, recoveryPlan.getChildren().size());
        }

        @Override
        public String getDescription() {
            return "Single recovery phase";
        }
    };
    ticks.add(expectSingleRecoveryPhase);
    ticks.add(Expect.recoveryStepStatus("custom-hello-recovery", "hello-0", Status.PREPARED));
    // Complete recovery
    ticks.add(Send.offerBuilder("hello").build());
    ticks.add(Expect.launchedTasks("hello-0-server"));
    ticks.add(Send.taskStatus("hello-0-server", Protos.TaskState.TASK_RUNNING).build());
    ticks.add(Expect.allPlansComplete());
    new ServiceTestRunner().setRecoveryManagerFactory(new RecoveryPlanOverriderFactory() {

        @Override
        public RecoveryPlanOverrider create(StateStore stateStore, Collection<Plan> plans) {
            return new RecoveryPlanOverrider() {

                @Override
                public Optional<Phase> override(PodInstanceRequirement podInstanceRequirement) {
                    if (podInstanceRequirement.getPodInstance().getPod().getType().equals("hello")
                            && podInstanceRequirement.getRecoveryType().equals(RecoveryType.PERMANENT)) {
                        Phase phase = new DefaultPhase(
                                "custom-hello-recovery",
                                Arrays.asList(new RecoveryStep(
                                        podInstanceRequirement.getPodInstance().getName(),
                                        podInstanceRequirement,
                                        new UnconstrainedLaunchConstrainer(),
                                        stateStore)),
                                new SerialStrategy<>(),
                                Collections.emptyList());
                        return Optional.of(phase);
                    }
                    return Optional.empty();
                }
            };
        }
    }).run(ticks);
}
Also used : StateStore(com.mesosphere.sdk.state.StateStore) RecoveryPlanOverriderFactory(com.mesosphere.sdk.scheduler.recovery.RecoveryPlanOverriderFactory) RecoveryStep(com.mesosphere.sdk.scheduler.recovery.RecoveryStep) RecoveryPlanOverrider(com.mesosphere.sdk.scheduler.recovery.RecoveryPlanOverrider) Protos(org.apache.mesos.Protos) UnconstrainedLaunchConstrainer(com.mesosphere.sdk.scheduler.recovery.constrain.UnconstrainedLaunchConstrainer) SchedulerDriver(org.apache.mesos.SchedulerDriver) Test(org.junit.Test)
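
The override decision in the test above can be sketched in isolation. This is a minimal sketch, not the SDK's API: Requirement and Phase are hypothetical stand-ins for PodInstanceRequirement and Phase, and the class name OverrideSketch is invented for illustration. It only shows the pattern of returning a custom phase for permanent recovery of "hello" pods and an empty Optional otherwise, which presumably leaves the default recovery behaviour in place.

```java
import java.util.Collections;
import java.util.List;
import java.util.Optional;

public class OverrideSketch {
    enum RecoveryType { TRANSIENT, PERMANENT }

    // Hypothetical stand-in for the SDK's PodInstanceRequirement.
    static final class Requirement {
        final String podType;
        final String podName;
        final RecoveryType recoveryType;

        Requirement(String podType, String podName, RecoveryType recoveryType) {
            this.podType = podType;
            this.podName = podName;
            this.recoveryType = recoveryType;
        }
    }

    // Hypothetical stand-in for the SDK's Phase: a phase name plus its step names.
    static final class Phase {
        final String name;
        final List<String> stepNames;

        Phase(String name, List<String> stepNames) {
            this.name = name;
            this.stepNames = stepNames;
        }
    }

    // Return a custom phase only for permanent recovery of "hello" pods.
    static Optional<Phase> override(Requirement requirement) {
        if ("hello".equals(requirement.podType)
                && requirement.recoveryType == RecoveryType.PERMANENT) {
            return Optional.of(new Phase("custom-hello-recovery",
                    Collections.singletonList(requirement.podName)));
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Requirement permanent = new Requirement("hello", "hello-0", RecoveryType.PERMANENT);
        Requirement transientFailure = new Requirement("hello", "hello-0", RecoveryType.TRANSIENT);
        System.out.println(override(permanent).map(p -> p.name).orElse("default recovery"));
        System.out.println(override(transientFailure).map(p -> p.name).orElse("default recovery"));
    }
}
```

Running main prints the custom phase name for the permanent case and "default recovery" for the transient one, mirroring the two recoveryStepStatus expectations in the test.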

Example 2 with RecoveryStep

Use of com.mesosphere.sdk.scheduler.recovery.RecoveryStep in the dcos-commons project by mesosphere.

From the class HdfsRecoveryPlanOverrider, method getRecoveryPhase.

private Phase getRecoveryPhase(Plan inputPlan, int index, String phaseName) {
    Phase inputPhase = getPhaseForNodeType(inputPlan, phaseName);
    int offset = index * 2;
    // Bootstrap
    Step inputBootstrapStep = inputPhase.getChildren().get(offset + 0);
    PodInstanceRequirement bootstrapPodInstanceRequirement = PodInstanceRequirement.newBuilder(
                    inputBootstrapStep.start().get().getPodInstance(),
                    inputBootstrapStep.start().get().getTasksToLaunch())
            .recoveryType(RecoveryType.PERMANENT)
            .build();
    Step bootstrapStep = new RecoveryStep(
            inputBootstrapStep.getName(),
            bootstrapPodInstanceRequirement,
            new UnconstrainedLaunchConstrainer(),
            stateStore);
    // JournalNode or NameNode
    Step inputNodeStep = inputPhase.getChildren().get(offset + 1);
    PodInstanceRequirement nameNodePodInstanceRequirement = PodInstanceRequirement.newBuilder(
                    inputNodeStep.start().get().getPodInstance(),
                    inputNodeStep.start().get().getTasksToLaunch())
            .recoveryType(RecoveryType.TRANSIENT)
            .build();
    Step nodeStep = new RecoveryStep(
            inputNodeStep.getName(),
            nameNodePodInstanceRequirement,
            new UnconstrainedLaunchConstrainer(),
            stateStore);
    return new DefaultPhase(
            String.format(PHASE_NAME_TEMPLATE, phaseName),
            Arrays.asList(bootstrapStep, nodeStep),
            new SerialStrategy<>(),
            Collections.emptyList());
}
Also used : RecoveryStep(com.mesosphere.sdk.scheduler.recovery.RecoveryStep) UnconstrainedLaunchConstrainer(com.mesosphere.sdk.scheduler.recovery.constrain.UnconstrainedLaunchConstrainer)
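
The index arithmetic in getRecoveryPhase reads two adjacent steps per node: the step at offset index * 2 is treated as the bootstrap step and the one after it as the node step. The sketch below illustrates only that lookup, assuming the input phase lists its steps as bootstrap/node pairs in node order (an inference from the code above, not something the source states). The class name StepPairSketch and the step names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

public class StepPairSketch {
    // Returns the [bootstrap, node] pair for the node at the given index,
    // assuming steps are laid out as bootstrap-0, node-0, bootstrap-1, node-1, ...
    static List<String> stepPair(List<String> phaseSteps, int index) {
        int offset = index * 2;
        return Arrays.asList(phaseSteps.get(offset), phaseSteps.get(offset + 1));
    }

    public static void main(String[] args) {
        // Hypothetical step names, for illustration only.
        List<String> steps = Arrays.asList("bootstrap-0", "node-0", "bootstrap-1", "node-1");
        System.out.println(stepPair(steps, 0));
        System.out.println(stepPair(steps, 1));
    }
}
```

With this layout, an index that has no matching pair makes get throw IndexOutOfBoundsException, so a caller would need to pass an index that exists in the phase.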

Example 3 with RecoveryStep

Use of com.mesosphere.sdk.scheduler.recovery.RecoveryStep in the dcos-commons project by mesosphere.

From the class CassandraRecoveryPlanOverrider, method getNodeRecoveryPhase.

private Phase getNodeRecoveryPhase(Plan inputPlan, int index) {
    Phase inputPhase = inputPlan.getChildren().get(0);
    Step inputLaunchStep = inputPhase.getChildren().get(index);
    // Dig all the way down into the command, so we can append the replace_address option to it.
    PodInstance podInstance = inputLaunchStep.start().get().getPodInstance();
    PodSpec podSpec = podInstance.getPod();
    TaskSpec taskSpec = podSpec.getTasks().stream().filter(t -> t.getName().equals("server")).findFirst().get();
    CommandSpec command = taskSpec.getCommand().get();
    // Get IP address for the pre-existing node.
    Optional<Protos.TaskStatus> status = StateStoreUtils.getTaskStatusFromProperty(stateStore, TaskSpec.getInstanceName(podInstance, taskSpec));
    if (!status.isPresent()) {
        logger.error("No previously stored TaskStatus to pull IP address from in Cassandra recovery");
        return null;
    }
    String replaceIp = status.get().getContainerStatus().getNetworkInfos(0).getIpAddresses(0).getIpAddress();
    DefaultCommandSpec.Builder builder = DefaultCommandSpec.newBuilder(command);
    builder.value(String.format("%s -Dcassandra.replace_address=%s -Dcassandra.consistent.rangemovement=false%n", command.getValue().trim(), replaceIp));
    // Rebuild a new PodSpec with the modified command, and add it to the phase we return.
    TaskSpec newTaskSpec = DefaultTaskSpec.newBuilder(taskSpec).commandSpec(builder.build()).build();
    List<TaskSpec> tasks = podSpec.getTasks().stream().map(t -> {
        if (t.getName().equals(newTaskSpec.getName())) {
            return newTaskSpec;
        }
        return t;
    }).collect(Collectors.toList());
    PodSpec newPodSpec = DefaultPodSpec.newBuilder(podSpec).tasks(tasks).build();
    PodInstance newPodInstance = new DefaultPodInstance(newPodSpec, index);
    PodInstanceRequirement replacePodInstanceRequirement = PodInstanceRequirement.newBuilder(newPodInstance, inputLaunchStep.getPodInstanceRequirement().get().getTasksToLaunch()).recoveryType(RecoveryType.PERMANENT).build();
    Step replaceStep = new RecoveryStep(inputLaunchStep.getName(), replacePodInstanceRequirement, new UnconstrainedLaunchConstrainer(), stateStore);
    List<Step> steps = new ArrayList<>();
    steps.add(replaceStep);
    // Restart all other nodes if replacing a seed node to refresh IP resolution
    int replaceIndex = replaceStep.getPodInstanceRequirement().get().getPodInstance().getIndex();
    if (CassandraSeedUtils.isSeedNode(replaceIndex)) {
        logger.info("Scheduling restart of all nodes other than 'node-{}' to refresh seed node address.", replaceIndex);
        List<Step> restartSteps = inputPhase.getChildren().stream()
                .filter(step -> step.getPodInstanceRequirement().get().getPodInstance().getIndex() != replaceIndex)
                .map(step -> {
                    PodInstanceRequirement restartPodInstanceRequirement = PodInstanceRequirement.newBuilder(
                                    step.getPodInstanceRequirement().get().getPodInstance(),
                                    step.getPodInstanceRequirement().get().getTasksToLaunch())
                            .recoveryType(RecoveryType.TRANSIENT)
                            .build();
                    return new RecoveryStep(
                            step.getName(),
                            restartPodInstanceRequirement,
                            new UnconstrainedLaunchConstrainer(),
                            stateStore);
                })
                .collect(Collectors.toList());
        steps.addAll(restartSteps);
    }
    return new DefaultPhase(RECOVERY_PHASE_NAME, steps, new SerialStrategy<>(), Collections.emptyList());
}
Also used : Protos(org.apache.mesos.Protos) java.util(java.util) Logger(org.slf4j.Logger) TaskSpec(com.mesosphere.sdk.specification.TaskSpec) LoggerFactory(org.slf4j.LoggerFactory) StateStoreUtils(com.mesosphere.sdk.state.StateStoreUtils) RecoveryType(com.mesosphere.sdk.scheduler.recovery.RecoveryType) DefaultPodSpec(com.mesosphere.sdk.specification.DefaultPodSpec) RecoveryPlanOverrider(com.mesosphere.sdk.scheduler.recovery.RecoveryPlanOverrider) DefaultTaskSpec(com.mesosphere.sdk.specification.DefaultTaskSpec) Collectors(java.util.stream.Collectors) PodSpec(com.mesosphere.sdk.specification.PodSpec) CommandSpec(com.mesosphere.sdk.specification.CommandSpec) SerialStrategy(com.mesosphere.sdk.scheduler.plan.strategy.SerialStrategy) UnconstrainedLaunchConstrainer(com.mesosphere.sdk.scheduler.recovery.constrain.UnconstrainedLaunchConstrainer) StateStore(com.mesosphere.sdk.state.StateStore) com.mesosphere.sdk.scheduler.plan(com.mesosphere.sdk.scheduler.plan) RecoveryStep(com.mesosphere.sdk.scheduler.recovery.RecoveryStep) PodInstance(com.mesosphere.sdk.specification.PodInstance) DefaultCommandSpec(com.mesosphere.sdk.specification.DefaultCommandSpec)
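
The command rewrite in getNodeRecoveryPhase is a plain String.format call: the existing command is trimmed, the two JVM options are appended, and %n adds the platform line separator. The sketch below reproduces only that string construction. The class name ReplaceCommandSketch, the sample command "cassandra -f", and the IP address are hypothetical values chosen for illustration.

```java
public class ReplaceCommandSketch {
    // Appends the replace_address options to an existing command line,
    // using the same format string as the example above.
    static String withReplaceAddress(String command, String replaceIp) {
        return String.format(
                "%s -Dcassandra.replace_address=%s -Dcassandra.consistent.rangemovement=false%n",
                command.trim(), replaceIp);
    }

    public static void main(String[] args) {
        // Hypothetical command and address, for illustration only.
        System.out.print(withReplaceAddress("  cassandra -f  ", "10.0.0.5"));
    }
}
```

Because the format string ends in %n, the returned command ends with a line separator; any surrounding whitespace in the original command is removed by trim() before the options are appended.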

Aggregations

RecoveryStep (com.mesosphere.sdk.scheduler.recovery.RecoveryStep): 3
UnconstrainedLaunchConstrainer (com.mesosphere.sdk.scheduler.recovery.constrain.UnconstrainedLaunchConstrainer): 3
RecoveryPlanOverrider (com.mesosphere.sdk.scheduler.recovery.RecoveryPlanOverrider): 2
StateStore (com.mesosphere.sdk.state.StateStore): 2
Protos (org.apache.mesos.Protos): 2
com.mesosphere.sdk.scheduler.plan (com.mesosphere.sdk.scheduler.plan): 1
SerialStrategy (com.mesosphere.sdk.scheduler.plan.strategy.SerialStrategy): 1
RecoveryPlanOverriderFactory (com.mesosphere.sdk.scheduler.recovery.RecoveryPlanOverriderFactory): 1
RecoveryType (com.mesosphere.sdk.scheduler.recovery.RecoveryType): 1
CommandSpec (com.mesosphere.sdk.specification.CommandSpec): 1
DefaultCommandSpec (com.mesosphere.sdk.specification.DefaultCommandSpec): 1
DefaultPodSpec (com.mesosphere.sdk.specification.DefaultPodSpec): 1
DefaultTaskSpec (com.mesosphere.sdk.specification.DefaultTaskSpec): 1
PodInstance (com.mesosphere.sdk.specification.PodInstance): 1
PodSpec (com.mesosphere.sdk.specification.PodSpec): 1
TaskSpec (com.mesosphere.sdk.specification.TaskSpec): 1
StateStoreUtils (com.mesosphere.sdk.state.StateStoreUtils): 1
java.util (java.util): 1
Collectors (java.util.stream.Collectors): 1
SchedulerDriver (org.apache.mesos.SchedulerDriver): 1