
Example 21 with HeliosRuntimeException

use of com.spotify.helios.common.HeliosRuntimeException in project helios by spotify.

the class ZooKeeperMasterModel method stopDeploymentGroup.

@Override
public void stopDeploymentGroup(final String deploymentGroupName) throws DeploymentGroupDoesNotExistException {
    checkNotNull(deploymentGroupName, "name");
    log.info("stop deployment-group: name={}", deploymentGroupName);
    final ZooKeeperClient client = provider.get("stopDeploymentGroup");
    // Delete deployment group tasks (if any) and set DG state to FAILED
    final DeploymentGroupStatus status = DeploymentGroupStatus.newBuilder().setState(FAILED).setError("Stopped by user").build();
    final String statusPath = Paths.statusDeploymentGroup(deploymentGroupName);
    final String tasksPath = Paths.statusDeploymentGroupTasks(deploymentGroupName);
    try {
        client.ensurePath(Paths.statusDeploymentGroupTasks());
        final List<ZooKeeperOperation> operations = Lists.newArrayList();
        // NOTE: This remove operation is racy. If tasks exist and the rollout finishes before the
        // delete() is executed then this will fail. Conversely, if it doesn't exist but is created
        // before the transaction is executed it will also fail. This is annoying for users, but at
        // least means we won't have inconsistent state.
        //
        // That the set() is first in the list of operations is important because of the
        // kludgy error checking we do below to disambiguate "doesn't exist" failures from the race
        // condition mentioned above.
        operations.add(set(statusPath, status));
        final Stat tasksStat = client.exists(tasksPath);
        if (tasksStat != null) {
            operations.add(delete(tasksPath));
        } else {
            // There doesn't seem to be a "check that node doesn't exist" operation so we
            // do a create and a delete on the same path to emulate it.
            operations.add(create(tasksPath));
            operations.add(delete(tasksPath));
        }
        client.transaction(operations);
    } catch (final NoNodeException e) {
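        // The only way to tell which operation in the transaction failed is to inspect the
        // per-operation results. Index 0 is the set() on the status path, so NONODE there means
        // the deployment group itself does not exist.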
        if (((OpResult.ErrorResult) e.getResults().get(0)).getErr() == KeeperException.Code.NONODE.intValue()) {
            throw new DeploymentGroupDoesNotExistException(deploymentGroupName);
        } else {
            throw new HeliosRuntimeException("stop deployment-group " + deploymentGroupName + " failed due to a race condition, please retry", e);
        }
    } catch (final KeeperException e) {
        throw new HeliosRuntimeException("stop deployment-group " + deploymentGroupName + " failed", e);
    }
}
Also used : Stat(org.apache.zookeeper.data.Stat) NoNodeException(org.apache.zookeeper.KeeperException.NoNodeException) ZooKeeperClient(com.spotify.helios.servicescommon.coordination.ZooKeeperClient) ZooKeeperOperation(com.spotify.helios.servicescommon.coordination.ZooKeeperOperation) HeliosRuntimeException(com.spotify.helios.common.HeliosRuntimeException) DeploymentGroupStatus(com.spotify.helios.common.descriptors.DeploymentGroupStatus) KeeperException(org.apache.zookeeper.KeeperException)
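
The catch block in stopDeploymentGroup works only because the set() on the status path is the first operation in the transaction: a NONODE code at index 0 means the group is missing, and a failure at any later index is treated as the race. The following is a minimal sketch of that check using plain integer error codes instead of ZooKeeper's OpResult objects. The class and method names are hypothetical and not part of Helios or ZooKeeper; the constants mirror KeeperException.Code.OK and KeeperException.Code.NONODE, and the sketch assumes that operations preceding the failed one report OK.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration; not part of Helios or ZooKeeper. */
final class TransactionFailureSketch {
    // Numeric values mirror KeeperException.Code.OK and KeeperException.Code.NONODE.
    static final int OK = 0;
    static final int NONODE = -101;

    /** Returns the index of the first operation whose code is not OK, or -1 if none failed. */
    static int firstFailedIndex(final List<Integer> errorCodes) {
        for (int i = 0; i < errorCodes.size(); i++) {
            if (errorCodes.get(i) != OK) {
                return i;
            }
        }
        return -1;
    }

    /** True when the first operation (the set() on the status path) failed with NONODE. */
    static boolean groupDoesNotExist(final List<Integer> errorCodes) {
        return !errorCodes.isEmpty() && errorCodes.get(0) == NONODE;
    }

    public static void main(final String[] args) {
        // The set() succeeded but a later operation hit NONODE: the race case.
        final List<Integer> race = Arrays.asList(OK, NONODE);
        // The set() itself hit NONODE: the deployment group is missing.
        final List<Integer> missing = Arrays.asList(NONODE, OK);
        System.out.println(groupDoesNotExist(race));
        System.out.println(groupDoesNotExist(missing));
        System.out.println(firstFailedIndex(race));
    }
}
```

Keeping the status update at index 0 is what makes the single get(0) lookup sufficient; reordering the operations would silently change which failure is reported to the caller.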

Example 22 with HeliosRuntimeException

use of com.spotify.helios.common.HeliosRuntimeException in project helios by spotify.

the class ZooKeeperMasterModel method getTasks.

private Map<JobId, Deployment> getTasks(final ZooKeeperClient client, final String host) {
    final Map<JobId, Deployment> jobs = Maps.newHashMap();
    try {
        final String folder = Paths.configHostJobs(host);
        final List<String> jobIds;
        try {
            jobIds = client.getChildren(folder);
        } catch (KeeperException.NoNodeException e) {
            log.warn("Unable to get deployment config for {}", host, e);
            return ImmutableMap.of();
        }
        for (final String jobIdString : jobIds) {
            final JobId jobId = JobId.fromString(jobIdString);
            final String containerPath = Paths.configHostJob(host, jobId);
            try {
                final byte[] data = client.getData(containerPath);
                final Task task = parse(data, Task.class);
                jobs.put(jobId, Deployment.of(jobId, task.getGoal(), task.getDeployerUser(), task.getDeployerMaster(), task.getDeploymentGroupName()));
            } catch (KeeperException.NoNodeException ignored) {
                log.debug("deployment config node disappeared: {}", jobIdString);
            }
        }
    } catch (KeeperException | IOException e) {
        throw new HeliosRuntimeException("getting deployment config failed", e);
    }
    return jobs;
}
Also used : Task(com.spotify.helios.common.descriptors.Task) RolloutTask(com.spotify.helios.common.descriptors.RolloutTask) NoNodeException(org.apache.zookeeper.KeeperException.NoNodeException) HeliosRuntimeException(com.spotify.helios.common.HeliosRuntimeException) Deployment(com.spotify.helios.common.descriptors.Deployment) IOException(java.io.IOException) JobId(com.spotify.helios.common.descriptors.JobId) KeeperException(org.apache.zookeeper.KeeperException)

Example 23 with HeliosRuntimeException

use of com.spotify.helios.common.HeliosRuntimeException in project helios by spotify.

the class TaskRunner method startContainer.

private String startContainer(final String image, final Optional<String> dockerVersion) throws InterruptedException, DockerException {
    // Get container image info
    final ImageInfo imageInfo = docker.inspectImage(image);
    if (imageInfo == null) {
        throw new HeliosRuntimeException("docker inspect image returned null on image " + image);
    }
    // Create container
    final HostConfig hostConfig = config.hostConfig(dockerVersion);
    final ContainerConfig containerConfig = config.containerConfig(imageInfo, dockerVersion).toBuilder().hostConfig(hostConfig).build();
    listener.creating();
    final ContainerCreation container = docker.createContainer(containerConfig, containerName);
    log.info("created container: {}: {}, {}", config, container, containerConfig);
    listener.created(container.id());
    // Start container
    log.info("starting container: {}: {} {}", config, container.id(), hostConfig);
    listener.starting();
    docker.startContainer(container.id());
    log.info("started container: {}: {}", config, container.id());
    listener.started();
    return container.id();
}
Also used : ContainerConfig(com.spotify.docker.client.messages.ContainerConfig) ContainerCreation(com.spotify.docker.client.messages.ContainerCreation) HeliosRuntimeException(com.spotify.helios.common.HeliosRuntimeException) HostConfig(com.spotify.docker.client.messages.HostConfig) ImageInfo(com.spotify.docker.client.messages.ImageInfo)

Example 24 with HeliosRuntimeException

use of com.spotify.helios.common.HeliosRuntimeException in project helios by spotify.

the class ZooKeeperMasterModel method getJobs.

/**
 * Returns a {@link Map} of {@link JobId} to {@link Job} objects for all of the jobs known.
 */
@Override
public Map<JobId, Job> getJobs() {
    log.debug("getting jobs");
    final String folder = Paths.configJobs();
    final ZooKeeperClient client = provider.get("getJobs");
    try {
        final List<String> ids;
        try {
            ids = client.getChildren(folder);
        } catch (NoNodeException e) {
            return Maps.newHashMap();
        }
        final Map<JobId, Job> descriptors = Maps.newHashMap();
        for (final String id : ids) {
            final JobId jobId = JobId.fromString(id);
            final String path = Paths.configJob(jobId);
            try {
                final byte[] data = client.getData(path);
                final Job descriptor = parse(data, Job.class);
                descriptors.put(descriptor.getId(), descriptor);
            } catch (NoNodeException e) {
                // Ignore, the job was deleted before we had a chance to read it.
                log.debug("Ignoring deleted job {}", jobId);
            }
        }
        return descriptors;
    } catch (KeeperException | IOException e) {
        throw new HeliosRuntimeException("getting jobs failed", e);
    }
}
Also used : NoNodeException(org.apache.zookeeper.KeeperException.NoNodeException) ZooKeeperClient(com.spotify.helios.servicescommon.coordination.ZooKeeperClient) HeliosRuntimeException(com.spotify.helios.common.HeliosRuntimeException) IOException(java.io.IOException) Job(com.spotify.helios.common.descriptors.Job) JobId(com.spotify.helios.common.descriptors.JobId) KeeperException(org.apache.zookeeper.KeeperException)
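
Both getTasks and getJobs list the children of a node and then read each child, skipping any child that disappears between the two steps. A minimal sketch of that list-then-read pattern follows, written against a plain function rather than ZooKeeperClient. The class name, the readAll method and the use of NoSuchElementException as a stand-in for NoNodeException are assumptions made for illustration only.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.function.Function;

/** Hypothetical helper for illustration; not part of Helios. */
final class SkipVanishedSketch {
    /**
     * Reads every id through the reader and collects the results.
     * An id whose read throws NoSuchElementException is skipped,
     * on the assumption that it was deleted after the listing was taken.
     */
    static <V> Map<String, V> readAll(final List<String> ids, final Function<String, V> reader) {
        final Map<String, V> result = new HashMap<>();
        for (final String id : ids) {
            try {
                result.put(id, reader.apply(id));
            } catch (NoSuchElementException ignored) {
                // The entry vanished between listing and reading; leave it out.
            }
        }
        return result;
    }

    public static void main(final String[] args) {
        final Map<String, String> store = new HashMap<>();
        store.put("job-a", "data-a");
        store.put("job-c", "data-c");
        // "job-b" was listed but is no longer in the store.
        final List<String> listed = Arrays.asList("job-a", "job-b", "job-c");
        final Map<String, String> read = readAll(listed, id -> {
            final String data = store.get(id);
            if (data == null) {
                throw new NoSuchElementException(id);
            }
            return data;
        });
        System.out.println(read.keySet());
    }
}
```

Only the "entry is gone" failure is swallowed here; any other exception from the reader still propagates, which matches how the examples above rethrow other KeeperException cases as HeliosRuntimeException.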

Example 25 with HeliosRuntimeException

use of com.spotify.helios.common.HeliosRuntimeException in project helios by spotify.

the class ZooKeeperMasterModel method deployJobRetry.

private void deployJobRetry(final ZooKeeperClient client, final String host, final Deployment deployment, int count, final String token) throws JobDoesNotExistException, JobAlreadyDeployedException, HostNotFoundException, JobPortAllocationConflictException, TokenVerificationException {
    if (count == 3) {
        throw new HeliosRuntimeException("3 failures (possibly concurrent modifications) while deploying. Giving up.");
    }
    log.info("deploying {}: {} (retry={})", deployment, host, count);
    final JobId id = deployment.getJobId();
    final Job job = getJob(id);
    if (job == null) {
        throw new JobDoesNotExistException(id);
    }
    verifyToken(token, job);
    final UUID operationId = UUID.randomUUID();
    final String jobPath = Paths.configJob(id);
    try {
        Paths.configHostJob(host, id);
    } catch (IllegalArgumentException e) {
        throw new HostNotFoundException("Could not find Helios host '" + host + "'");
    }
    final String taskPath = Paths.configHostJob(host, id);
    final String taskCreationPath = Paths.configHostJobCreation(host, id, operationId);
    final List<Integer> staticPorts = staticPorts(job);
    final Map<String, byte[]> portNodes = Maps.newHashMap();
    final byte[] idJson = id.toJsonBytes();
    for (final int port : staticPorts) {
        final String path = Paths.configHostPort(host, port);
        portNodes.put(path, idJson);
    }
    final Task task = new Task(job, deployment.getGoal(), deployment.getDeployerUser(), deployment.getDeployerMaster(), deployment.getDeploymentGroupName());
    final List<ZooKeeperOperation> operations = Lists.newArrayList(check(jobPath), create(portNodes), create(Paths.configJobHost(id, host)));
    // Attempt to read a task here.
    try {
        client.getNode(taskPath);
        // if we get here the node exists already
        throw new JobAlreadyDeployedException(host, id);
    } catch (NoNodeException e) {
        operations.add(create(taskPath, task));
        operations.add(create(taskCreationPath));
    } catch (KeeperException e) {
        throw new HeliosRuntimeException("reading existing task description failed", e);
    }
    // TODO (dano): Failure handling is racy wrt agent and job modifications.
    try {
        client.transaction(operations);
        log.info("deployed {}: {} (retry={})", deployment, host, count);
    } catch (NoNodeException e) {
        // Either the job, the host or the task went away
        assertJobExists(client, id);
        assertHostExists(client, host);
        // If the job and host still exist, we likely tried to redeploy a job that had an UNDEPLOY
        // goal and lost the race with the agent removing the task before we could set it. Retry.
        deployJobRetry(client, host, deployment, count + 1, token);
    } catch (NodeExistsException e) {
        // Check for conflict due to transaction retry
        try {
            if (client.exists(taskCreationPath) != null) {
                // Our creation operation node existed, we're done here
                return;
            }
        } catch (KeeperException ex) {
            throw new HeliosRuntimeException("checking job deployment failed", ex);
        }
        try {
            // Check if the job was already deployed
            if (client.stat(taskPath) != null) {
                throw new JobAlreadyDeployedException(host, id);
            }
        } catch (KeeperException ex) {
            throw new HeliosRuntimeException("checking job deployment failed", ex);
        }
        // Check for static port collisions
        for (final int port : staticPorts) {
            checkForPortConflicts(client, host, port, id);
        }
        // Catch all for logic and ephemeral issues
        throw new HeliosRuntimeException("deploying job failed", e);
    } catch (KeeperException e) {
        throw new HeliosRuntimeException("deploying job failed", e);
    }
}
Also used : Task(com.spotify.helios.common.descriptors.Task) RolloutTask(com.spotify.helios.common.descriptors.RolloutTask) NoNodeException(org.apache.zookeeper.KeeperException.NoNodeException) ZooKeeperOperation(com.spotify.helios.servicescommon.coordination.ZooKeeperOperation) NodeExistsException(org.apache.zookeeper.KeeperException.NodeExistsException) HeliosRuntimeException(com.spotify.helios.common.HeliosRuntimeException) Job(com.spotify.helios.common.descriptors.Job) UUID(java.util.UUID) JobId(com.spotify.helios.common.descriptors.JobId) KeeperException(org.apache.zookeeper.KeeperException)
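
deployJobRetry retries by calling itself with count + 1 and gives up once count reaches 3. The same bounded retry can be written as a loop; the sketch below shows one way to do that. BoundedRetrySketch, LostRaceException and callWithRetry are hypothetical names introduced for illustration, and the unchecked LostRaceException stands in for the NoNodeException that signals a lost race in the example above.

```java
import java.util.function.Supplier;

/** Hypothetical helper for illustration; not part of Helios. */
final class BoundedRetrySketch {
    static final int MAX_ATTEMPTS = 3;

    /** Stand-in for a failure that is worth retrying, such as a lost race. */
    static final class LostRaceException extends RuntimeException {
        LostRaceException(final String message) {
            super(message);
        }
    }

    /** Runs the action up to MAX_ATTEMPTS times, retrying only on LostRaceException. */
    static <T> T callWithRetry(final Supplier<T> action) {
        LostRaceException last = null;
        for (int attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
            try {
                return action.get();
            } catch (LostRaceException e) {
                last = e; // remember the cause and try again
            }
        }
        throw new IllegalStateException(
            MAX_ATTEMPTS + " failures (possibly concurrent modifications). Giving up.", last);
    }

    public static void main(final String[] args) {
        final int[] calls = {0};
        // Fails on the first two attempts, succeeds on the third.
        final String result = callWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new LostRaceException("lost race");
            }
            return "deployed";
        });
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

A loop keeps the attempt limit in one place and avoids growing the call stack, at the cost of moving the per-attempt setup (building the operation list, generating a fresh operation id) inside the supplier.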

Aggregations

HeliosRuntimeException (com.spotify.helios.common.HeliosRuntimeException): 27
KeeperException (org.apache.zookeeper.KeeperException): 23
NoNodeException (org.apache.zookeeper.KeeperException.NoNodeException): 20
ZooKeeperClient (com.spotify.helios.servicescommon.coordination.ZooKeeperClient): 16
JobId (com.spotify.helios.common.descriptors.JobId): 10
IOException (java.io.IOException): 10
ZooKeeperOperation (com.spotify.helios.servicescommon.coordination.ZooKeeperOperation): 9
Job (com.spotify.helios.common.descriptors.Job): 7
RolloutTask (com.spotify.helios.common.descriptors.RolloutTask): 5
Task (com.spotify.helios.common.descriptors.Task): 5
Deployment (com.spotify.helios.common.descriptors.Deployment): 4
DeploymentGroup (com.spotify.helios.common.descriptors.DeploymentGroup): 4
UUID (java.util.UUID): 4
NodeExistsException (org.apache.zookeeper.KeeperException.NodeExistsException): 4
ImmutableList (com.google.common.collect.ImmutableList): 3
Node (com.spotify.helios.servicescommon.coordination.Node): 3
DeploymentGroupStatus (com.spotify.helios.common.descriptors.DeploymentGroupStatus): 2
HostNotFoundException (com.spotify.helios.master.HostNotFoundException): 2
RollingUpdateOp (com.spotify.helios.rollingupdate.RollingUpdateOp): 2
DefaultZooKeeperClient (com.spotify.helios.servicescommon.coordination.DefaultZooKeeperClient): 2