Example 16 with Twister2Exception

use of edu.iu.dsc.tws.api.exceptions.Twister2Exception in project twister2 by DSC-SPIDAL.

the class ZKMasterController method workerFailed.

@Override
public void workerFailed(int workerID) {
    JobMasterAPI.WorkerFailed workerFailed = JobMasterAPI.WorkerFailed.newBuilder().setWorkerID(workerID).build();
    JobMasterAPI.JobEvent jobEvent = JobMasterAPI.JobEvent.newBuilder().setFailed(workerFailed).build();
    try {
        ZKEventsManager.publishEvent(client, rootPath, jobID, jobEvent);
    } catch (Twister2Exception e) {
        throw new Twister2RuntimeException(e);
    }
}
Also used : Twister2Exception(edu.iu.dsc.tws.api.exceptions.Twister2Exception) JobMasterAPI(edu.iu.dsc.tws.proto.jobmaster.JobMasterAPI) Twister2RuntimeException(edu.iu.dsc.tws.api.exceptions.Twister2RuntimeException)
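
The method above converts a checked exception into an unchecked one because the overridden callback cannot declare `throws`. A minimal, self-contained sketch of that pattern follows. `CheckedJobException`, `UncheckedJobException`, and `publish` are hypothetical stand-ins, not twister2 classes, and the failure condition is invented purely for illustration.

```java
public class WrapCheckedExample {

    // Hypothetical stand-in for a checked exception such as Twister2Exception.
    static class CheckedJobException extends Exception {
        CheckedJobException(String message) {
            super(message);
        }
    }

    // Hypothetical stand-in for an unchecked wrapper such as Twister2RuntimeException.
    static class UncheckedJobException extends RuntimeException {
        UncheckedJobException(Throwable cause) {
            super(cause);
        }
    }

    // Hypothetical publisher; the negative-ID check is an assumption made for the demo.
    static void publish(int workerID) throws CheckedJobException {
        if (workerID < 0) {
            throw new CheckedJobException("invalid worker ID: " + workerID);
        }
    }

    // Callback-style method whose signature does not allow checked exceptions.
    static void workerFailed(int workerID) {
        try {
            publish(workerID);
        } catch (CheckedJobException e) {
            // Passing the original exception as the cause keeps its stack trace.
            throw new UncheckedJobException(e);
        }
    }

    public static void main(String[] args) {
        workerFailed(3);
        try {
            workerFailed(-1);
        } catch (UncheckedJobException e) {
            System.out.println("wrapped cause: " + e.getCause().getMessage());
        }
    }
}
```

Keeping the checked exception as the cause means callers that catch the unchecked wrapper can still inspect the original failure.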

Example 17 with Twister2Exception

use of edu.iu.dsc.tws.api.exceptions.Twister2Exception in project twister2 by DSC-SPIDAL.

the class ZKMasterController method initialize.

/**
 * Initializes the ZKMasterController and
 * creates znode children caches so that the job master can watch worker events.
 */
public void initialize(JobMasterState initialState) throws Twister2Exception {
    if (!(initialState == JobMasterState.JM_STARTED || initialState == JobMasterState.JM_RESTARTED)) {
        throw new Twister2Exception("initialState has to be either JobMasterState.JM_STARTED or " + "JobMasterState.JM_RESTARTED. Supplied value: " + initialState);
    }
    try {
        String zkServerAddresses = ZKContext.serverAddresses(config);
        int sessionTimeoutMs = FaultToleranceContext.sessionTimeout(config);
        client = ZKUtils.connectToServer(zkServerAddresses, sessionTimeoutMs);
        // with scaling up/down, it may have been changed
        if (initialState == JobMasterState.JM_RESTARTED) {
            initRestarting();
        } else {
            // We listen for join/remove events for ephemeral children
            ephemChildrenCache = new PathChildrenCache(client, ephemDir, true);
            addEphemChildrenCacheListener(ephemChildrenCache);
            ephemChildrenCache.start();
            // We listen for status updates for persistent path
            persChildrenCache = new PathChildrenCache(client, persDir, true);
            addPersChildrenCacheListener(persChildrenCache);
            persChildrenCache.start();
        }
        // TODO: we may need to create an ephemeral job master znode so that
        // workers can know when jm fails
        // createJobMasterZnode(initialState);
        LOG.info("Job Master: " + jmAddress + " initialized successfully.");
    } catch (Twister2Exception e) {
        throw e;
    } catch (Exception e) {
        throw new Twister2Exception("Exception when initializing ZKMasterController.", e);
    }
}
Also used : Twister2Exception(edu.iu.dsc.tws.api.exceptions.Twister2Exception) PathChildrenCache(org.apache.curator.framework.recipes.cache.PathChildrenCache) Twister2RuntimeException(edu.iu.dsc.tws.api.exceptions.Twister2RuntimeException)
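
The initialize method combines three steps: it validates the initial state up front, rethrows its own exception type unchanged, and wraps every other exception so that callers see a single checked type. The sketch below isolates that control flow. `InitException`, `State`, and `Step` are hypothetical names introduced for illustration; no ZooKeeper or Curator calls are involved.

```java
import java.io.IOException;

public class InitTranslationExample {

    // Hypothetical checked exception standing in for Twister2Exception.
    static class InitException extends Exception {
        InitException(String message) {
            super(message);
        }

        InitException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // Hypothetical states; only STARTED and RESTARTED are valid for initialization.
    enum State { STARTED, RESTARTED, FINISHED }

    // Hypothetical initialization step that may throw any exception.
    interface Step {
        void run() throws Exception;
    }

    static void initialize(State initialState, Step connect) throws InitException {
        // Reject invalid states before doing any work.
        if (!(initialState == State.STARTED || initialState == State.RESTARTED)) {
            throw new InitException(
                "initialState has to be STARTED or RESTARTED. Supplied value: " + initialState);
        }
        try {
            connect.run();
        } catch (InitException e) {
            // Already the right type: rethrow without adding another wrapper layer.
            throw e;
        } catch (Exception e) {
            // Translate everything else into the single declared exception type.
            throw new InitException("Exception when initializing.", e);
        }
    }

    public static void main(String[] args) throws InitException {
        initialize(State.STARTED, () -> { });
        try {
            initialize(State.RESTARTED, () -> {
                throw new IOException("connection refused");
            });
        } catch (InitException e) {
            System.out.println(e.getMessage() + " Cause: " + e.getCause().getMessage());
        }
    }
}
```

The more specific `catch (InitException e)` clause has to come before `catch (Exception e)`. Without it, an exception that is already of the declared type would be wrapped a second time.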

Example 18 with Twister2Exception

use of edu.iu.dsc.tws.api.exceptions.Twister2Exception in project twister2 by DSC-SPIDAL.

the class ZKMasterController method jmRestarted.

public void jmRestarted() {
    // generate an event and inform all other workers
    JobMasterAPI.JobMasterRestarted jmRestarted = JobMasterAPI.JobMasterRestarted.newBuilder().setNumberOfWorkers(numberOfWorkers).setJmAddress(jmAddress).build();
    JobMasterAPI.JobEvent jobEvent = JobMasterAPI.JobEvent.newBuilder().setJmRestarted(jmRestarted).build();
    try {
        ZKEventsManager.publishEvent(client, rootPath, jobID, jobEvent);
    } catch (Twister2Exception e) {
        throw new Twister2RuntimeException(e);
    }
}
Also used : Twister2Exception(edu.iu.dsc.tws.api.exceptions.Twister2Exception) JobMasterAPI(edu.iu.dsc.tws.proto.jobmaster.JobMasterAPI) Twister2RuntimeException(edu.iu.dsc.tws.api.exceptions.Twister2RuntimeException)

Example 19 with Twister2Exception

use of edu.iu.dsc.tws.api.exceptions.Twister2Exception in project twister2 by DSC-SPIDAL.

the class MPILauncher method launch.

@Override
public Twister2JobState launch(JobAPI.Job job) {
    LOG.log(Level.INFO, "Launching job for cluster {0}", MPIContext.clusterType(config));
    Twister2JobState state = new Twister2JobState(false);
    if (!configsOK()) {
        return state;
    }
    // distributing bundle if not running in shared file system
    if (!MPIContext.isSharedFs(config)) {
        LOG.info("Configured as NON SHARED file system. " + "Running bootstrap procedure to distribute files...");
        try {
            this.distributeJobFiles(job);
        } catch (IOException e) {
            LOG.log(Level.SEVERE, "Error in distributing job files", e);
            throw new RuntimeException("Error in distributing job files", e);
        }
    } else {
        LOG.info("Configured as SHARED file system. " + "Skipping bootstrap procedure & setting up working directory");
        if (!setupWorkingDirectory(job.getJobId())) {
            throw new RuntimeException("Failed to setup the directory");
        }
    }
    config = Config.newBuilder().putAll(config).put(SchedulerContext.WORKING_DIRECTORY, jobWorkingDirectory).build();
    JobMaster jobMaster = null;
    Thread jmThread = null;
    if (JobMasterContext.isJobMasterUsed(config) && JobMasterContext.jobMasterRunsInClient(config)) {
        // Since the job master is running on client we can collect job information
        state.setDetached(false);
        try {
            int port = NetworkUtils.getFreePort();
            String hostAddress = JobMasterContext.jobMasterIP(config);
            if (hostAddress == null) {
                hostAddress = ResourceSchedulerUtils.getHostIP(config);
            }
            // add the port and ip to config
            config = Config.newBuilder().putAll(config).put("__job_master_port__", port).put("__job_master_ip__", hostAddress).build();
            LOG.log(Level.INFO, String.format("Starting the job master: %s:%d", hostAddress, port));
            JobMasterAPI.NodeInfo jobMasterNodeInfo = NodeInfoUtils.createNodeInfo(hostAddress, "default", "default");
            IScalerPerCluster nullScaler = new NullScaler();
            JobMasterAPI.JobMasterState initialState = JobMasterAPI.JobMasterState.JM_STARTED;
            NullTerminator nt = new NullTerminator();
            jobMaster = new JobMaster(config, "0.0.0.0", port, nt, job, jobMasterNodeInfo, nullScaler, initialState);
            jobMaster.addShutdownHook(true);
            jmThread = jobMaster.startJobMasterThreaded();
        } catch (Twister2Exception e) {
            LOG.log(Level.SEVERE, "Exception when starting Job master: ", e);
            throw new RuntimeException(e);
        }
    }
    final boolean[] start = { false };
    // now start the controller, which will get the resources and start
    Thread controllerThread = new Thread(() -> {
        IController controller = new MPIController(true);
        controller.initialize(config);
        start[0] = controller.start(job);
    });
    controllerThread.setName("MPIController");
    controllerThread.start();
    // wait until the controller finishes
    try {
        controllerThread.join();
    } catch (InterruptedException ignore) {
    }
    // now let's wait on the client
    if (jmThread != null && JobMasterContext.isJobMasterUsed(config) && JobMasterContext.jobMasterRunsInClient(config)) {
        try {
            jmThread.join();
        } catch (InterruptedException ignore) {
        }
    }
    if (jobMaster != null && jobMaster.getDriver() != null) {
        if (jobMaster.getDriver().getState() != DriverJobState.FAILED) {
            state.setJobstate(DriverJobState.COMPLETED);
        } else {
            state.setJobstate(jobMaster.getDriver().getState());
        }
        state.setFinalMessages(jobMaster.getDriver().getMessages());
    }
    state.setRequestGranted(start[0]);
    return state;
}
Also used : JobMaster(edu.iu.dsc.tws.master.server.JobMaster) Twister2Exception(edu.iu.dsc.tws.api.exceptions.Twister2Exception) IController(edu.iu.dsc.tws.api.scheduler.IController) NullScaler(edu.iu.dsc.tws.api.driver.NullScaler) IOException(java.io.IOException) IScalerPerCluster(edu.iu.dsc.tws.api.driver.IScalerPerCluster) JobMasterAPI(edu.iu.dsc.tws.proto.jobmaster.JobMasterAPI) Twister2JobState(edu.iu.dsc.tws.api.scheduler.Twister2JobState) NullTerminator(edu.iu.dsc.tws.rsched.schedulers.NullTerminator)
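
The launch method starts a controller on a separate thread, passes the result back through a one-element `boolean[]`, and ignores `InterruptedException` while joining. The following is a hedged sketch of one alternative, not the project's code: it uses an `AtomicBoolean` as the result holder and restores the interrupt flag instead of discarding it. `runAndWait` and the thread name are hypothetical.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

public class JoinResultExample {

    // Runs the task on its own thread, waits for it, and returns the task's result.
    static boolean runAndWait(BooleanSupplier task) {
        // Holder for the result; a lambda can only capture effectively final variables.
        AtomicBoolean started = new AtomicBoolean(false);

        Thread worker = new Thread(() -> started.set(task.getAsBoolean()));
        worker.setName("controller-sketch");
        worker.start();

        try {
            // join() blocks until the worker thread has finished.
            worker.join();
        } catch (InterruptedException e) {
            // Restore the interrupt flag so code further up the stack can react to it.
            Thread.currentThread().interrupt();
        }
        return started.get();
    }

    public static void main(String[] args) {
        System.out.println("request granted: " + runAndWait(() -> true));
    }
}
```

Restoring the interrupt flag has a side effect worth noting: any later blocking call on the same thread, such as a second `join()`, will throw `InterruptedException` immediately. Whether that is desirable depends on how the caller is expected to shut down.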

Example 20 with Twister2Exception

use of edu.iu.dsc.tws.api.exceptions.Twister2Exception in project twister2 by DSC-SPIDAL.

the class MPIWorkerStarter method startMaster.

/**
 * Start the JobMaster
 */
private void startMaster() {
    try {
        // init the logger
        initJMLogger(config);
        // release the port for JM
        NetworkUtils.releaseWorkerPorts();
        int port = JobMasterContext.jobMasterPort(config);
        String hostAddress = ResourceSchedulerUtils.getHostIP(config);
        LOG.log(Level.INFO, String.format("Starting the job master: %s:%d", hostAddress, port));
        JobMasterAPI.NodeInfo jobMasterNodeInfo = null;
        IScalerPerCluster clusterScaler = new NullScaler();
        JobMasterAPI.JobMasterState initialState = JobMasterAPI.JobMasterState.JM_STARTED;
        NullTerminator nt = new NullTerminator();
        jobMaster = new JobMaster(config, "0.0.0.0", port, nt, job, jobMasterNodeInfo, clusterScaler, initialState);
        jobMaster.startJobMasterBlocking();
        LOG.log(Level.INFO, "JobMaster done... ");
    } catch (Twister2Exception e) {
        LOG.log(Level.SEVERE, "Exception when starting Job master: ", e);
        throw new RuntimeException(e);
    }
}
Also used : JobMaster(edu.iu.dsc.tws.master.server.JobMaster) Twister2Exception(edu.iu.dsc.tws.api.exceptions.Twister2Exception) JobMasterAPI(edu.iu.dsc.tws.proto.jobmaster.JobMasterAPI) Twister2RuntimeException(edu.iu.dsc.tws.api.exceptions.Twister2RuntimeException) NullScaler(edu.iu.dsc.tws.api.driver.NullScaler) IScalerPerCluster(edu.iu.dsc.tws.api.driver.IScalerPerCluster) NullTerminator(edu.iu.dsc.tws.rsched.schedulers.NullTerminator)

Aggregations

Twister2Exception (edu.iu.dsc.tws.api.exceptions.Twister2Exception): 36
Twister2RuntimeException (edu.iu.dsc.tws.api.exceptions.Twister2RuntimeException): 24
JobMasterAPI (edu.iu.dsc.tws.proto.jobmaster.JobMasterAPI): 14
JobMaster (edu.iu.dsc.tws.master.server.JobMaster): 7
NullTerminator (edu.iu.dsc.tws.rsched.schedulers.NullTerminator): 5
IScalerPerCluster (edu.iu.dsc.tws.api.driver.IScalerPerCluster): 4
NullScaler (edu.iu.dsc.tws.api.driver.NullScaler): 4
UnknownHostException (java.net.UnknownHostException): 4
Config (edu.iu.dsc.tws.api.config.Config): 3
K8sScaler (edu.iu.dsc.tws.rsched.schedulers.k8s.driver.K8sScaler): 3
InvalidProtocolBufferException (com.google.protobuf.InvalidProtocolBufferException): 2
JobFaultyException (edu.iu.dsc.tws.api.exceptions.JobFaultyException): 2
TimeoutException (edu.iu.dsc.tws.api.exceptions.TimeoutException): 2
IController (edu.iu.dsc.tws.api.scheduler.IController): 2
KubernetesController (edu.iu.dsc.tws.rsched.schedulers.k8s.KubernetesController): 2
LinkedList (java.util.LinkedList): 2
ChildData (org.apache.curator.framework.recipes.cache.ChildData): 2
PathChildrenCache (org.apache.curator.framework.recipes.cache.PathChildrenCache): 2
Twister2Job (edu.iu.dsc.tws.api.Twister2Job): 1
StateStore (edu.iu.dsc.tws.api.checkpointing.StateStore): 1