Example 16 with AMTaskManager

Use of com.tencent.angel.master.task.AMTaskManager in project angel by Tencent.

From the class TaskCalPerfChecker, method check.

@Override
public List<Id> check(AMContext context) {
    double slowestDiscount = context.getConf().getDouble(AngelConf.ANGEL_AM_TASK_SLOWEST_DISCOUNT, AngelConf.DEFAULT_ANGEL_AM_TASK_SLOWEST_DISCOUNT);
    LOG.info("start to check slow workers use TaskCalPerfChecker policy, slowestDiscount = " + slowestDiscount);
    Set<Id> slowWorkers = new HashSet<Id>();
    AMTaskManager taskManage = context.getTaskManager();
    WorkerManager workerManager = context.getWorkerManager();
    Collection<AMTask> tasks = taskManage.getTasks();
    long totalSamples = 0;
    long totalCalTimeMs = 0;
    double averageRate = 0.0;
    Map<TaskId, Double> taskIdToRateMap = new HashMap<TaskId, Double>(tasks.size());
    for (AMTask task : tasks) {
        if (task.getMetrics().containsKey(TaskCounter.TOTAL_CALCULATE_SAMPLES) && task.getMetrics().containsKey(TaskCounter.TOTAL_CALCULATE_TIME_MS)) {
            // Metric values are stored as strings; parseLong avoids boxing to Long.
            long sampleNum = Long.parseLong(task.getMetrics().get(TaskCounter.TOTAL_CALCULATE_SAMPLES));
            double calTimeMs = Long.parseLong(task.getMetrics().get(TaskCounter.TOTAL_CALCULATE_TIME_MS));
            LOG.info("for task " + task.getTaskId() + ", sampleNum = " + sampleNum + ", calTimeMs = " + calTimeMs);
            totalSamples += sampleNum;
            totalCalTimeMs += calTimeMs;
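            // Only tasks with more than 5,000,000 samples get a per-task rate;
            // smaller tasks still count toward the overall average.
            if (sampleNum > 5000000) {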
                LOG.info("task " + task.getTaskId() + " calculate rate = " + (calTimeMs * 10000 / sampleNum));
                taskIdToRateMap.put(task.getTaskId(), calTimeMs * 10000 / sampleNum);
            }
        }
    }
    if (totalSamples != 0) {
        averageRate = (double) totalCalTimeMs * 10000 / totalSamples;
    }
    LOG.info("totalSamples = " + totalSamples + ", totalCalTimeMs = " + totalCalTimeMs + ", average calculate time for 10000 samples = " + averageRate + ", the maximum calculate time for 10000 samples = " + averageRate / slowestDiscount);
    for (Map.Entry<TaskId, Double> rateEntry : taskIdToRateMap.entrySet()) {
        // Rates are milliseconds per 10000 samples, so a larger value means a slower task.
        if (averageRate < rateEntry.getValue() * slowestDiscount) {
            LOG.info("task " + rateEntry.getKey() + " rate = " + rateEntry.getValue() + " is > " + averageRate / slowestDiscount);
            AMWorker worker = workerManager.getWorker(rateEntry.getKey());
            if (worker != null) {
                LOG.info("put worker " + worker.getId() + " to slow worker list");
                slowWorkers.add(worker.getId());
            }
        }
    }
    // Copy the de-duplicated worker ids into a list for the caller.
    return new ArrayList<>(slowWorkers);
}
Also used: TaskId (com.tencent.angel.worker.task.TaskId), WorkerManager (com.tencent.angel.master.worker.WorkerManager), AMTaskManager (com.tencent.angel.master.task.AMTaskManager), AMWorker (com.tencent.angel.master.worker.worker.AMWorker), Id (com.tencent.angel.common.Id), AMTask (com.tencent.angel.master.task.AMTask)
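
The threshold test in check is easy to lose among the logging and manager lookups. The following is a minimal stand-alone sketch of the same arithmetic, written without any Angel classes. SlowTaskSketch, TaskStats, findSlowTasks, the String task ids, and the minSamples parameter are hypothetical names introduced for illustration; they are not part of the Angel API. The sketch assumes slowestDiscount is a positive fraction, which the original does not state explicitly.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-alone sketch of the slow-task test; not part of Angel. */
final class SlowTaskSketch {

    /** Per-task counters: samples processed and time spent calculating, in ms. */
    static final class TaskStats {
        final long samples;
        final long calTimeMs;

        TaskStats(long samples, long calTimeMs) {
            this.samples = samples;
            this.calTimeMs = calTimeMs;
        }
    }

    /**
     * Returns the ids of tasks whose time per 10000 samples is greater than
     * averageRate / slowestDiscount. Tasks with at most minSamples samples
     * count toward the average but are never flagged.
     */
    static List<String> findSlowTasks(Map<String, TaskStats> stats,
                                      double slowestDiscount,
                                      long minSamples) {
        long totalSamples = 0;
        long totalCalTimeMs = 0;
        Map<String, Double> rates = new LinkedHashMap<>();
        for (Map.Entry<String, TaskStats> e : stats.entrySet()) {
            TaskStats s = e.getValue();
            totalSamples += s.samples;
            totalCalTimeMs += s.calTimeMs;
            if (s.samples > minSamples) {
                // Milliseconds per 10000 samples: larger means slower.
                rates.put(e.getKey(), s.calTimeMs * 10000.0 / s.samples);
            }
        }
        List<String> slow = new ArrayList<>();
        if (totalSamples == 0) {
            return slow; // nothing measured yet, so nothing can be flagged
        }
        double averageRate = totalCalTimeMs * 10000.0 / totalSamples;
        for (Map.Entry<String, Double> e : rates.entrySet()) {
            // Same comparison as in check: averageRate < rate * slowestDiscount.
            if (averageRate < e.getValue() * slowestDiscount) {
                slow.add(e.getKey());
            }
        }
        return slow;
    }
}
```

With three tasks of 10,000,000 samples each that take 1000 ms, 1000 ms, and 4000 ms, the average is 2.0 ms per 10000 samples. A discount of 0.7 flags only the third task (4.0 * 0.7 = 2.8 > 2.0), while a discount of 0.4 flags none (4.0 * 0.4 = 1.6 < 2.0). A smaller discount therefore makes the check more tolerant.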

Aggregations

AMTaskManager (com.tencent.angel.master.task.AMTaskManager) 16
Worker (com.tencent.angel.worker.Worker) 13
Test (org.junit.Test) 13
WorkerManager (com.tencent.angel.master.worker.WorkerManager) 11
AngelApplicationMaster (com.tencent.angel.master.AngelApplicationMaster) 9
AMTask (com.tencent.angel.master.task.AMTask) 6
Location (com.tencent.angel.common.location.Location) 5
MasterClient (com.tencent.angel.psagent.client.MasterClient) 5
PSAgentMatrixMetaManager (com.tencent.angel.psagent.matrix.PSAgentMatrixMetaManager) 4
Matcher (java.util.regex.Matcher) 4
Pattern (java.util.regex.Pattern) 4
AngelException (com.tencent.angel.exception.AngelException) 3
AMWorker (com.tencent.angel.master.worker.worker.AMWorker) 3
TaskContext (com.tencent.angel.psagent.task.TaskContext) 3
Int2IntOpenHashMap (it.unimi.dsi.fastutil.ints.Int2IntOpenHashMap) 3
ServiceException (com.google.protobuf.ServiceException) 2
TConnection (com.tencent.angel.ipc.TConnection) 2
WorkerAttempt (com.tencent.angel.master.worker.attempt.WorkerAttempt) 2
ParameterServerId (com.tencent.angel.ps.ParameterServerId) 2
PSAgentLocationManager (com.tencent.angel.psagent.matrix.PSAgentLocationManager) 2