Example 6 with TimeoutException

use of edu.iu.dsc.tws.api.exceptions.TimeoutException in project twister2 by DSC-SPIDAL.

the class MPIWorkerManager method execute.

public boolean execute(Config config, JobAPI.Job job, IWorkerController workerController, IPersistentVolume persistentVolume, IVolatileVolume volatileVolume, IWorker managedWorker) {
    int workerID = workerController.getWorkerInfo().getWorkerID();
    LOG.info("Waiting on the init barrier before starting IWorker: " + workerID + " with restartCount: " + workerController.workerRestartCount() + " and with re-executionCount: " + JobProgress.getWorkerExecuteCount());
    try {
        workerController.waitOnInitBarrier();
        firstInitBarrierProceeded = true;
    } catch (TimeoutException e) {
        throw new Twister2RuntimeException("Could not pass through the init barrier", e);
    }
    // if it is executing for the first time, release worker ports
    if (JobProgress.getWorkerExecuteCount() == 0) {
        NetworkUtils.releaseWorkerPorts();
    }
    JobProgressImpl.setJobStatus(JobProgress.JobStatus.EXECUTING);
    JobProgressImpl.increaseWorkerExecuteCount();
    try {
        managedWorker.execute(config, job, workerController, persistentVolume, volatileVolume);
        return true;
    } catch (JobFaultyException jfe) {
        // a worker in the cluster should have failed
        JobProgressImpl.setJobStatus(JobProgress.JobStatus.FAULTY);
        throw jfe;
    }
}
Also used : Twister2RuntimeException(edu.iu.dsc.tws.api.exceptions.Twister2RuntimeException) JobFaultyException(edu.iu.dsc.tws.api.exceptions.JobFaultyException) TimeoutException(edu.iu.dsc.tws.api.exceptions.TimeoutException)

Example 7 with TimeoutException

use of edu.iu.dsc.tws.api.exceptions.TimeoutException in project twister2 by DSC-SPIDAL.

the class WorkerManager method execute.

/**
 * Execute IWorker
 * return false if IWorker fails fully after retries
 * return true if execution successful
 * throw an exception if execution fails and the worker needs to be restarted from jvm
 */
public boolean execute() {
    while (JobProgress.getWorkerExecuteCount() < maxRetries) {
        LOG.info("Waiting on the init barrier before starting IWorker: " + workerID + " with restartCount: " + workerController.workerRestartCount() + " and with re-executionCount: " + JobProgress.getWorkerExecuteCount());
        try {
            workerController.waitOnInitBarrier();
            firstInitBarrierProceeded = true;
        } catch (TimeoutException e) {
            throw new Twister2RuntimeException("Could not pass through the init barrier", e);
        }
        LOG.fine("Proceeded through INIT barrier. Starting Worker: " + workerID);
        JobProgressImpl.setJobStatus(JobProgress.JobStatus.EXECUTING);
        JobProgressImpl.increaseWorkerExecuteCount();
        JobProgressImpl.setRestartedWorkers(restartedWorkers.values());
        try {
            managedWorker.execute(config, job, workerController, persistentVolume, volatileVolume);
        } catch (JobFaultyException cue) {
            // a worker in the cluster should have failed
            // we will try to re-execute this worker
            JobProgressImpl.setJobStatus(JobProgress.JobStatus.FAULTY);
            LOG.warning("thrown JobFaultyException. Some workers should have failed.");
        }
        // we also need to make sure that all other workers finished successfully
        if (JobProgress.isJobHealthy()) {
            try {
                // wait on the barrier indefinitely until all workers arrive
                // or the barrier is broken by a job fault
                LOG.info("Worker completed, waiting for other workers to finish at the final barrier.");
                workerController.waitOnBarrier(Long.MAX_VALUE);
                LOG.info("Worker finished successfully");
                return true;
            } catch (TimeoutException e) {
                // this should never happen
                throw new Twister2RuntimeException("Could not pass through the final barrier", e);
            } catch (JobFaultyException e) {
                JobProgressImpl.setJobStatus(JobProgress.JobStatus.FAULTY);
                LOG.warning("thrown JobFaultyException. Some workers failed before finishing.");
            }
        }
    }
    LOG.info(String.format("Re-executed IWorker %d times and failed, we are exiting", maxRetries));
    return false;
}
Also used : Twister2RuntimeException(edu.iu.dsc.tws.api.exceptions.Twister2RuntimeException) JobFaultyException(edu.iu.dsc.tws.api.exceptions.JobFaultyException) TimeoutException(edu.iu.dsc.tws.api.exceptions.TimeoutException)
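
The return contract described in the Javadoc above (true on success, false once the retries are exhausted, an exception when the init barrier cannot be passed) can be sketched with JDK classes only. This is a minimal illustration, not the Twister2 implementation: `BarrierRetrySketch`, `Task`, and `executeWithRetries` are hypothetical names, `java.util.concurrent.CyclicBarrier` stands in for the worker controller's barrier, and `java.util.concurrent.TimeoutException` stands in for `edu.iu.dsc.tws.api.exceptions.TimeoutException`.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BarrierRetrySketch {

    /** Hypothetical stand-in for IWorker: returns normally on success, throws on a fault. */
    public interface Task {
        void run(int attempt) throws Exception;
    }

    /**
     * Returns true if the task succeeds within maxRetries attempts, false otherwise.
     * Throws IllegalStateException if the init barrier cannot be passed in time.
     */
    public static boolean executeWithRetries(CyclicBarrier initBarrier, Task task,
                                             int maxRetries, long timeoutMs) {
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            try {
                // Wait for all parties before each (re-)execution.
                initBarrier.await(timeoutMs, TimeUnit.MILLISECONDS);
            } catch (TimeoutException | BrokenBarrierException e) {
                // Not recoverable by retrying locally: surface it to the caller.
                throw new IllegalStateException("Could not pass through the init barrier", e);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted at the init barrier", e);
            }
            try {
                task.run(attempt);
                return true;
            } catch (Exception fault) {
                // Treated as a recoverable fault in this sketch: loop and try again.
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // A single-party barrier trips immediately, so no other threads are needed here.
        CyclicBarrier barrier = new CyclicBarrier(1);
        boolean ok = executeWithRetries(barrier, attempt -> {
            if (attempt < 2) {
                throw new Exception("simulated fault");
            }
        }, 3, 100);
        System.out.println(ok);
    }
}
```

In this sketch the task fails on attempts 0 and 1 and succeeds on attempt 2, so the call returns true. Unlike the method above, the sketch does not wait on a final barrier after a successful run.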

Example 8 with TimeoutException

use of edu.iu.dsc.tws.api.exceptions.TimeoutException in project twister2 by DSC-SPIDAL.

the class TaskWorker method execute.

@Override
public void execute(Config cfg, JobAPI.Job job, IWorkerController wController, IPersistentVolume pVolume, IVolatileVolume vVolume) {
    this.config = cfg;
    this.workerId = wController.getWorkerInfo().getWorkerID();
    this.workerController = wController;
    this.persistentVolume = pVolume;
    this.volatileVolume = vVolume;
    ISenderToDriver senderToDriver = JMWorkerAgent.getJMWorkerAgent().getDriverAgent();
    workerEnvironment = WorkerEnvironment.init(config, job, workerController, pVolume, vVolume);
    computeEnvironment = ComputeEnvironment.init(workerEnvironment);
    // to keep backward compatibility
    taskExecutor = computeEnvironment.getTaskExecutor();
    // call execute
    execute();
    // wait for the sync
    try {
        workerEnvironment.getWorkerController().waitOnBarrier();
    } catch (TimeoutException timeoutException) {
        LOG.log(Level.SEVERE, timeoutException.getMessage(), timeoutException);
    }
    computeEnvironment.close();
    // lets terminate the network
    workerEnvironment.close();
    // we are done executing
    // If the execute returns without any errors we assume that the job completed properly
    JobExecutionState.WorkerJobState workerState = JobExecutionState.WorkerJobState.newBuilder().setFailure(false).setJobName(config.getStringValue(Context.JOB_ID)).setWorkerMessage("Worker Completed").build();
    senderToDriver.sendToDriver(workerState);
    LOG.log(Level.FINE, String.format("%d Worker done", workerId));
}
Also used : ISenderToDriver(edu.iu.dsc.tws.api.resource.ISenderToDriver) JobExecutionState(edu.iu.dsc.tws.proto.system.JobExecutionState) TimeoutException(edu.iu.dsc.tws.api.exceptions.TimeoutException)

Example 9 with TimeoutException

use of edu.iu.dsc.tws.api.exceptions.TimeoutException in project twister2 by DSC-SPIDAL.

the class NomadWorkerStarter method startWorker.

private void startWorker() {
    LOG.log(Level.INFO, "A worker process is starting...");
    // lets create the resource plan
    this.workerController = createWorkerController();
    JobMasterAPI.WorkerInfo workerNetworkInfo = workerController.getWorkerInfo();
    try {
        LOG.log(Level.INFO, "Worker IP..:" + Inet4Address.getLocalHost().getHostAddress());
    } catch (UnknownHostException e) {
        e.printStackTrace();
    }
    try {
        List<JobMasterAPI.WorkerInfo> workerInfos = workerController.getAllWorkers();
    } catch (TimeoutException timeoutException) {
        LOG.log(Level.SEVERE, timeoutException.getMessage(), timeoutException);
        return;
    }
    IWorker worker = JobUtils.initializeIWorker(job);
    MPIWorkerManager workerManager = new MPIWorkerManager();
    workerManager.execute(config, job, workerController, null, null, worker);
}
Also used : JobMasterAPI(edu.iu.dsc.tws.proto.jobmaster.JobMasterAPI) UnknownHostException(java.net.UnknownHostException) MPIWorkerManager(edu.iu.dsc.tws.rsched.worker.MPIWorkerManager) IWorker(edu.iu.dsc.tws.api.resource.IWorker) TimeoutException(edu.iu.dsc.tws.api.exceptions.TimeoutException)

Example 10 with TimeoutException

use of edu.iu.dsc.tws.api.exceptions.TimeoutException in project twister2 by DSC-SPIDAL.

the class CDFWRuntime method reinitialize.

private boolean reinitialize() {
    communicator.close();
    List<JobMasterAPI.WorkerInfo> workerInfoList = null;
    try {
        workerInfoList = controller.getAllWorkers();
    } catch (TimeoutException timeoutException) {
        LOG.log(Level.SEVERE, timeoutException.getMessage(), timeoutException);
    }
    // create the channel
    channel = Network.initializeChannel(config, controller);
    String persistent = null;
    // create the communicator
    communicator = new Communicator(config, channel, persistent);
    taskExecutor = new TaskExecutor(config, workerId, workerInfoList, communicator, null);
    return true;
}
Also used : Communicator(edu.iu.dsc.tws.api.comms.Communicator) TaskExecutor(edu.iu.dsc.tws.task.impl.TaskExecutor) TimeoutException(edu.iu.dsc.tws.api.exceptions.TimeoutException)
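
Note that the method above logs the timeout and then continues with `workerInfoList` still `null`. One possible alternative, shown here only as a sketch and not as the project's behavior, is to report the failure to the caller instead. `FailFastSketch` and `WorkerLookup` are hypothetical names, worker information is reduced to plain strings, and `java.util.concurrent.TimeoutException` stands in for the Twister2 exception.

```java
import java.util.List;
import java.util.Optional;
import java.util.concurrent.TimeoutException;

public class FailFastSketch {

    /** Hypothetical stand-in for the controller's getAllWorkers() call. */
    public interface WorkerLookup {
        List<String> getAllWorkers() throws TimeoutException;
    }

    /** Converts the checked timeout into an empty Optional instead of a null list. */
    public static Optional<List<String>> lookup(WorkerLookup controller) {
        try {
            return Optional.of(controller.getAllWorkers());
        } catch (TimeoutException e) {
            return Optional.empty();
        }
    }

    /** Returns false when the worker list could not be fetched, true otherwise. */
    public static boolean reinitialize(WorkerLookup controller) {
        Optional<List<String>> workers = lookup(controller);
        if (!workers.isPresent()) {
            // Stop here rather than building downstream objects from a missing list.
            return false;
        }
        // Downstream objects would be constructed from workers.get() at this point.
        return true;
    }
}
```

With this shape the boolean return value carries the outcome of the lookup, so a caller can decide whether to retry or abort.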

Aggregations

TimeoutException (edu.iu.dsc.tws.api.exceptions.TimeoutException): 17
JobMasterAPI (edu.iu.dsc.tws.proto.jobmaster.JobMasterAPI): 6
JobFaultyException (edu.iu.dsc.tws.api.exceptions.JobFaultyException): 5
Twister2RuntimeException (edu.iu.dsc.tws.api.exceptions.Twister2RuntimeException): 4
Communicator (edu.iu.dsc.tws.api.comms.Communicator): 3
TWSChannel (edu.iu.dsc.tws.api.comms.channel.TWSChannel): 2
ExecutionPlan (edu.iu.dsc.tws.api.compute.executor.ExecutionPlan): 2
TaskSchedulePlan (edu.iu.dsc.tws.api.compute.schedule.elements.TaskSchedulePlan): 2
WorkerPlan (edu.iu.dsc.tws.api.compute.schedule.elements.WorkerPlan): 2
Config (edu.iu.dsc.tws.api.config.Config): 2
Twister2Exception (edu.iu.dsc.tws.api.exceptions.Twister2Exception): 2
ISenderToDriver (edu.iu.dsc.tws.api.resource.ISenderToDriver): 2
IWorker (edu.iu.dsc.tws.api.resource.IWorker): 2
BenchmarkResultsRecorder (edu.iu.dsc.tws.examples.utils.bench.BenchmarkResultsRecorder): 2
ExperimentData (edu.iu.dsc.tws.examples.verification.ExperimentData): 2
ExecutionPlanBuilder (edu.iu.dsc.tws.executor.core.ExecutionPlanBuilder): 2
ExecutorFactory (edu.iu.dsc.tws.executor.threading.ExecutorFactory): 2
JobAPI (edu.iu.dsc.tws.proto.system.job.JobAPI): 2
RoundRobinTaskScheduler (edu.iu.dsc.tws.tsched.streaming.roundrobin.RoundRobinTaskScheduler): 2
UnknownHostException (java.net.UnknownHostException): 2