Example 1 with TimeoutException

Use of edu.iu.dsc.tws.api.exceptions.TimeoutException in project twister2 by DSC-SPIDAL.

Class ZKWorkerController, method waitOnBarrier.

/**
 * All workers create a znode in the barrier directory.
 * The job master watches znode creations and removals in this directory.
 * When the number of znodes in that directory reaches the number of workers in the job,
 * the job master publishes an AllArrivedOnBarrier event.
 * Workers proceed when they get this event or when they time out.
 * <p>
 * Workers remove their znodes after they proceed through the barrier
 * so that they can wait on the barrier again.
 * Workers are responsible for creating and removing their znodes on the barrier.
 * The job master removes the barrier znode after job completion or scale-down.
 *
 * @throws TimeoutException if the time limit is reached before all workers arrive
 */
@Override
public void waitOnBarrier(long timeLimit) throws TimeoutException {
    // do not wait on the barrier if the job is already faulty
    if (JobProgress.isJobFaulty()) {
        throw new JobFaultyException("Can not wait on the barrier, since the job is faulty.");
    }
    defaultBarrierProceeded = false;
    try {
        ZKBarrierManager.createWorkerZNodeAtDefault(client, rootPath, jobID, workerInfo.getWorkerID(), timeLimit);
    } catch (Twister2Exception e) {
        LOG.log(Level.SEVERE, e.getMessage(), e);
        return;
    }
    // wait until all workers joined or time limit is reached
    long startTime = System.currentTimeMillis();
    long tl = timeLimit > Long.MAX_VALUE / 2 ? Long.MAX_VALUE : timeLimit * 2;
    long delay = 0;
    while (delay < tl) {
        synchronized (defaultBarrierWaitObject) {
            // leave the loop once the barrier event has arrived;
            // checking first also covers an event that arrived before we started waiting
            if (defaultBarrierProceeded) {
                break;
            }
            try {
                defaultBarrierWaitObject.wait(tl - delay);
            } catch (InterruptedException e) {
                // interrupted: fall through, recompute the elapsed time and wait again
            }
            delay = System.currentTimeMillis() - startTime;
        }
    }
    // delete barrier znode in any case
    try {
        ZKBarrierManager.deleteWorkerZNodeFromDefault(client, rootPath, jobID, workerInfo.getWorkerID());
    } catch (Twister2Exception e) {
        LOG.log(Level.SEVERE, e.getMessage(), e);
    }
    if (defaultBarrierProceeded) {
        if (defaultBarrierResult == JobMasterAPI.BarrierResult.SUCCESS) {
            return;
        } else if (defaultBarrierResult == JobMasterAPI.BarrierResult.JOB_FAULTY) {
            throw new JobFaultyException("Barrier broken since a fault occurred in the job.");
        } else if (defaultBarrierResult == JobMasterAPI.BarrierResult.TIMED_OUT) {
            throw new TimeoutException("Barrier timed out. Not all workers arrived on the time limit: " + timeLimit + "ms.");
        }
        // this should never happen, since we have only these three options
        return;
    } else {
        throw new TimeoutException("Barrier timed out on the worker. " + tl + "ms.");
    }
}
Also used: Twister2Exception (edu.iu.dsc.tws.api.exceptions.Twister2Exception), JobFaultyException (edu.iu.dsc.tws.api.exceptions.JobFaultyException), TimeoutException (edu.iu.dsc.tws.api.exceptions.TimeoutException)
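
The wait in waitOnBarrier is an instance of the guarded-wait idiom: check a flag under a lock, wait with a bounded timeout, and re-check after every wakeup. The following is a minimal, self-contained sketch of that idiom, not the twister2 implementation. The class TimedFlag and its methods signal and await are hypothetical names introduced only for illustration.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper (not part of twister2): a one-shot flag that can be awaited with a timeout. */
final class TimedFlag {
    private final Object lock = new Object();
    private boolean proceeded = false;

    /** Called by the notifying thread, e.g. an event handler, to release all waiters. */
    void signal() {
        synchronized (lock) {
            proceeded = true;
            lock.notifyAll();
        }
    }

    /**
     * Waits until signal() has been called or timeoutMs has elapsed.
     * Returns true if the flag was set, false on timeout.
     */
    boolean await(long timeoutMs) {
        long startNanos = System.nanoTime();
        long timeoutNanos = TimeUnit.MILLISECONDS.toNanos(timeoutMs); // saturates instead of overflowing
        boolean interrupted = false;
        synchronized (lock) {
            try {
                // Re-check the flag after every wakeup: this handles a signal that
                // arrived before we started waiting, spurious wakeups, and interrupts.
                while (!proceeded) {
                    long remainingNanos = timeoutNanos - (System.nanoTime() - startNanos);
                    if (remainingNanos <= 0) {
                        return false;
                    }
                    // wait(0) would block indefinitely, so wait at least 1 ms
                    long remainingMs = Math.max(1L, TimeUnit.NANOSECONDS.toMillis(remainingNanos));
                    try {
                        lock.wait(remainingMs);
                    } catch (InterruptedException e) {
                        interrupted = true; // remember the interrupt and keep waiting
                    }
                }
                return true;
            } finally {
                if (interrupted) {
                    Thread.currentThread().interrupt(); // restore the interrupt status for the caller
                }
            }
        }
    }
}
```

A caller would translate a false return value into its own timeout exception. Keeping the flag check inside the loop is what lets the method return immediately when the event arrives before the wait starts.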

Example 2 with TimeoutException

Use of edu.iu.dsc.tws.api.exceptions.TimeoutException in project twister2 by DSC-SPIDAL.

Class ZKWorkerController, method waitOnInitBarrier.

/**
 * Init barrier.
 * Uses the same algorithm as the default barrier.
 *
 * @throws TimeoutException if the time limit is reached before all workers arrive at the barrier
 */
public void waitOnInitBarrier() throws TimeoutException {
    initBarrierProceeded = false;
    long timeLimit = ControllerContext.maxWaitTimeOnInitBarrier(config);
    try {
        ZKBarrierManager.createWorkerZNodeAtInit(client, rootPath, jobID, workerInfo.getWorkerID(), timeLimit);
    } catch (Twister2Exception e) {
        LOG.log(Level.SEVERE, e.getMessage(), e);
        return;
    }
    // wait until all workers joined or the time limit is reached
    long startTime = System.currentTimeMillis();
    long tl = timeLimit > Long.MAX_VALUE / 2 ? Long.MAX_VALUE : timeLimit * 2;
    long delay = 0;
    while (delay < tl) {
        synchronized (initBarrierWaitObject) {
            // leave the loop once the barrier event has arrived;
            // checking first also covers an event that arrived before we started waiting
            if (initBarrierProceeded) {
                break;
            }
            try {
                initBarrierWaitObject.wait(tl - delay);
            } catch (InterruptedException e) {
                // interrupted: fall through, recompute the elapsed time and wait again
            }
            delay = System.currentTimeMillis() - startTime;
        }
    }
    // delete barrier znode in any case
    try {
        ZKBarrierManager.deleteWorkerZNodeFromInit(client, rootPath, jobID, workerInfo.getWorkerID());
    } catch (Twister2Exception e) {
        LOG.log(Level.SEVERE, e.getMessage(), e);
    }
    if (initBarrierProceeded) {
        if (initBarrierResult == JobMasterAPI.BarrierResult.SUCCESS) {
            return;
        } else if (initBarrierResult == JobMasterAPI.BarrierResult.JOB_FAULTY) {
            throw new JobFaultyException("Barrier broken since a fault occurred in the job.");
        } else if (initBarrierResult == JobMasterAPI.BarrierResult.TIMED_OUT) {
            throw new TimeoutException("Barrier timed out. Not all workers arrived on the time limit: " + timeLimit + "ms.");
        }
        // this should never happen, since we have only these three options
        return;
    } else {
        throw new TimeoutException("Barrier timed out on the worker. " + tl + "ms.");
    }
}
Also used: Twister2Exception (edu.iu.dsc.tws.api.exceptions.Twister2Exception), JobFaultyException (edu.iu.dsc.tws.api.exceptions.JobFaultyException), TimeoutException (edu.iu.dsc.tws.api.exceptions.TimeoutException)
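
Both barrier methods above end by mapping a barrier result to either a normal return or an exception, and fall through to a plain return for a result that "should never happen". The sketch below shows one way to make that fall-through case explicit. It is an illustration under stated assumptions: the enum Result and the two exception classes are hypothetical stand-ins defined locally, not the twister2 types JobMasterAPI.BarrierResult, TimeoutException and JobFaultyException.

```java
/** Hypothetical stand-ins for the barrier result and exception types; names are illustrative only. */
final class BarrierResults {

    enum Result { SUCCESS, JOB_FAULTY, TIMED_OUT }

    /** Checked, mirroring the way the methods above declare "throws TimeoutException". */
    static final class BarrierTimeoutException extends Exception {
        BarrierTimeoutException(String message) {
            super(message);
        }
    }

    /** Unchecked, mirroring the way the methods above throw JobFaultyException without declaring it. */
    static final class BarrierFaultException extends RuntimeException {
        BarrierFaultException(String message) {
            super(message);
        }
    }

    /** Returns normally on SUCCESS and throws for every other outcome, including unexpected ones. */
    static void check(Result result, long timeLimitMs) throws BarrierTimeoutException {
        if (result == null) {
            throw new IllegalStateException("Barrier proceeded without a result.");
        }
        switch (result) {
            case SUCCESS:
                return;
            case JOB_FAULTY:
                throw new BarrierFaultException("Barrier broken since a fault occurred in the job.");
            case TIMED_OUT:
                throw new BarrierTimeoutException(
                    "Barrier timed out. Not all workers arrived within the time limit: "
                        + timeLimitMs + "ms.");
            default:
                // reached only if a new enum constant is added without updating this switch
                throw new IllegalStateException("Unexpected barrier result: " + result);
        }
    }
}
```

Failing loudly on an unexpected result is a design choice: a silent return would let a worker pass the barrier even though no success was reported.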

Example 3 with TimeoutException

Use of edu.iu.dsc.tws.api.exceptions.TimeoutException in project twister2 by DSC-SPIDAL.

Class TaskUtils, method execute.

public static void execute(Config config, int workerID, ComputeGraph graph, IWorkerController workerController) {
    RoundRobinTaskScheduler roundRobinTaskScheduler = new RoundRobinTaskScheduler();
    roundRobinTaskScheduler.initialize(config);
    List<JobMasterAPI.WorkerInfo> workerList = null;
    try {
        workerList = workerController.getAllWorkers();
    } catch (TimeoutException timeoutException) {
        LOG.log(Level.SEVERE, timeoutException.getMessage(), timeoutException);
        return;
    }
    WorkerPlan workerPlan = createWorkerPlan(workerList);
    TaskSchedulePlan taskSchedulePlan = roundRobinTaskScheduler.schedule(graph, workerPlan);
    TWSChannel network = Network.initializeChannel(config, workerController);
    ExecutionPlanBuilder executionPlanBuilder = new ExecutionPlanBuilder(workerID, workerList, new Communicator(config, network), workerController.getCheckpointingClient());
    ExecutionPlan plan = executionPlanBuilder.build(config, graph, taskSchedulePlan);
    ExecutorFactory executor = new ExecutorFactory(config, workerID, network);
    executor.getExecutor(config, plan, graph.getOperationMode()).execute();
}
Also used: TaskSchedulePlan (edu.iu.dsc.tws.api.compute.schedule.elements.TaskSchedulePlan), Communicator (edu.iu.dsc.tws.api.comms.Communicator), TWSChannel (edu.iu.dsc.tws.api.comms.channel.TWSChannel), ExecutionPlan (edu.iu.dsc.tws.api.compute.executor.ExecutionPlan), ExecutionPlanBuilder (edu.iu.dsc.tws.executor.core.ExecutionPlanBuilder), ExecutorFactory (edu.iu.dsc.tws.executor.threading.ExecutorFactory), RoundRobinTaskScheduler (edu.iu.dsc.tws.tsched.streaming.roundrobin.RoundRobinTaskScheduler), TimeoutException (edu.iu.dsc.tws.api.exceptions.TimeoutException), WorkerPlan (edu.iu.dsc.tws.api.compute.schedule.elements.WorkerPlan)

Example 4 with TimeoutException

Use of edu.iu.dsc.tws.api.exceptions.TimeoutException in project twister2 by DSC-SPIDAL.

Class TaskUtils, method executeBatch.

public static void executeBatch(Config config, int workerID, ComputeGraph graph, IWorkerController workerController) {
    RoundRobinTaskScheduler roundRobinTaskScheduler = new RoundRobinTaskScheduler();
    roundRobinTaskScheduler.initialize(config);
    WorkerPlan workerPlan = null;
    List<JobMasterAPI.WorkerInfo> workerList = null;
    try {
        workerList = workerController.getAllWorkers();
    } catch (TimeoutException timeoutException) {
        LOG.log(Level.SEVERE, timeoutException.getMessage(), timeoutException);
        return;
    }
    workerPlan = createWorkerPlan(workerList);
    TaskSchedulePlan taskSchedulePlan = roundRobinTaskScheduler.schedule(graph, workerPlan);
    TWSChannel network = Network.initializeChannel(config, workerController);
    ExecutionPlanBuilder executionPlanBuilder = new ExecutionPlanBuilder(workerID, workerList, new Communicator(config, network), workerController.getCheckpointingClient());
    ExecutionPlan plan = executionPlanBuilder.build(config, graph, taskSchedulePlan);
    ExecutorFactory executor = new ExecutorFactory(config, workerID, network);
    executor.getExecutor(config, plan, graph.getOperationMode()).execute();
}
Also used: TaskSchedulePlan (edu.iu.dsc.tws.api.compute.schedule.elements.TaskSchedulePlan), Communicator (edu.iu.dsc.tws.api.comms.Communicator), TWSChannel (edu.iu.dsc.tws.api.comms.channel.TWSChannel), ExecutionPlan (edu.iu.dsc.tws.api.compute.executor.ExecutionPlan), ExecutionPlanBuilder (edu.iu.dsc.tws.executor.core.ExecutionPlanBuilder), ExecutorFactory (edu.iu.dsc.tws.executor.threading.ExecutorFactory), RoundRobinTaskScheduler (edu.iu.dsc.tws.tsched.streaming.roundrobin.RoundRobinTaskScheduler), WorkerPlan (edu.iu.dsc.tws.api.compute.schedule.elements.WorkerPlan), TimeoutException (edu.iu.dsc.tws.api.exceptions.TimeoutException)

Example 5 with TimeoutException

Use of edu.iu.dsc.tws.api.exceptions.TimeoutException in project twister2 by DSC-SPIDAL.

Class JMWorkerController, method sendBarrierRequest.

private void sendBarrierRequest(JobMasterAPI.BarrierType barrierType, long timeLimit) throws TimeoutException {
    JobMasterAPI.BarrierRequest barrierRequest = JobMasterAPI.BarrierRequest.newBuilder().setWorkerID(workerInfo.getWorkerID()).setBarrierType(barrierType).setTimeout(timeLimit).build();
    LOG.fine("Sending BarrierRequest message: \n" + barrierRequest.toString());
    try {
        // set the local wait time for the barrier response to (2 * timeLimit)
        // if the requested time limit is more than half of the long max value,
        // set it to the long max value
        long tl = timeLimit > Long.MAX_VALUE / 2 ? Long.MAX_VALUE : timeLimit * 2;
        Tuple<RequestID, Message> response = rrClient.sendRequestWaitResponse(barrierRequest, tl);
        JobMasterAPI.BarrierResponse barrierResponse = (JobMasterAPI.BarrierResponse) response.getValue();
        if (barrierResponse.getResult() == JobMasterAPI.BarrierResult.SUCCESS) {
            return;
        } else if (barrierResponse.getResult() == JobMasterAPI.BarrierResult.JOB_FAULTY) {
            throw new JobFaultyException("Job became faulty and Default Barrier failed.");
        } else if (barrierResponse.getResult() == JobMasterAPI.BarrierResult.TIMED_OUT) {
            throw new TimeoutException("Barrier timed out. Not all workers arrived at the barrier " + "on the time limit: " + timeLimit + "ms");
        }
    } catch (BlockingSendException e) {
        throw new TimeoutException("Not all workers arrived at the barrier on the time limit: " + timeLimit + "ms.", e);
    }
}
Also used: JobMasterAPI (edu.iu.dsc.tws.proto.jobmaster.JobMasterAPI), BlockingSendException (edu.iu.dsc.tws.api.exceptions.net.BlockingSendException), RequestID (edu.iu.dsc.tws.api.net.request.RequestID), Message (com.google.protobuf.Message), JobFaultyException (edu.iu.dsc.tws.api.exceptions.JobFaultyException), TimeoutException (edu.iu.dsc.tws.api.exceptions.TimeoutException)
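
The local wait time in sendBarrierRequest, like the tl value in the two ZKWorkerController methods, is twice the requested limit, capped at Long.MAX_VALUE so that the multiplication cannot overflow. The sketch below isolates that expression so the edge cases are easy to see. WaitLimits and doubledOrMax are hypothetical names used only for this illustration.

```java
/** Hypothetical helper (not part of twister2) that isolates the overflow-safe doubling. */
final class WaitLimits {

    /**
     * Returns 2 * timeLimitMs, or Long.MAX_VALUE when doubling would overflow.
     * Assumes a non-negative limit; negative values are not guarded here.
     */
    static long doubledOrMax(long timeLimitMs) {
        return timeLimitMs > Long.MAX_VALUE / 2 ? Long.MAX_VALUE : timeLimitMs * 2;
    }
}
```

For example, doubledOrMax(1000) yields 2000 and doubledOrMax(Long.MAX_VALUE) yields Long.MAX_VALUE, whereas the unguarded product Long.MAX_VALUE * 2 wraps around to -2.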

Aggregations

TimeoutException (edu.iu.dsc.tws.api.exceptions.TimeoutException): 17
JobMasterAPI (edu.iu.dsc.tws.proto.jobmaster.JobMasterAPI): 6
JobFaultyException (edu.iu.dsc.tws.api.exceptions.JobFaultyException): 5
Twister2RuntimeException (edu.iu.dsc.tws.api.exceptions.Twister2RuntimeException): 4
Communicator (edu.iu.dsc.tws.api.comms.Communicator): 3
TWSChannel (edu.iu.dsc.tws.api.comms.channel.TWSChannel): 2
ExecutionPlan (edu.iu.dsc.tws.api.compute.executor.ExecutionPlan): 2
TaskSchedulePlan (edu.iu.dsc.tws.api.compute.schedule.elements.TaskSchedulePlan): 2
WorkerPlan (edu.iu.dsc.tws.api.compute.schedule.elements.WorkerPlan): 2
Config (edu.iu.dsc.tws.api.config.Config): 2
Twister2Exception (edu.iu.dsc.tws.api.exceptions.Twister2Exception): 2
ISenderToDriver (edu.iu.dsc.tws.api.resource.ISenderToDriver): 2
IWorker (edu.iu.dsc.tws.api.resource.IWorker): 2
BenchmarkResultsRecorder (edu.iu.dsc.tws.examples.utils.bench.BenchmarkResultsRecorder): 2
ExperimentData (edu.iu.dsc.tws.examples.verification.ExperimentData): 2
ExecutionPlanBuilder (edu.iu.dsc.tws.executor.core.ExecutionPlanBuilder): 2
ExecutorFactory (edu.iu.dsc.tws.executor.threading.ExecutorFactory): 2
JobAPI (edu.iu.dsc.tws.proto.system.job.JobAPI): 2
RoundRobinTaskScheduler (edu.iu.dsc.tws.tsched.streaming.roundrobin.RoundRobinTaskScheduler): 2
UnknownHostException (java.net.UnknownHostException): 2