Example 1 with JobShuttingDownException

Use of com.vip.saturn.job.exception.JobShuttingDownException in project Saturn by vipshop.

From the class ShardingService, method shardingIfNecessary.

/**
 * Shards the job if sharding is necessary and the current node is the leader.
 */
public synchronized void shardingIfNecessary() throws JobShuttingDownException {
    if (isShutdown) {
        return;
    }
    GetDataStat getDataStat = null;
    if (getJobNodeStorage().isJobNodeExisted(ShardingNode.NECESSARY)) {
        getDataStat = getNecessaryDataStat();
    }
    // If the sharding/necessary content is empty or "0", return; otherwise sharding must be performed
    if (getDataStat == null || SHARDING_UN_NECESSARY.equals(getDataStat.getData())) {
        return;
    }
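    // If this node is not the leader, wait for the leader to finish (this is also an endless loop that exits only when: 1. the service is shut down, or 2. sharding is no longer necessary and not in the processing state)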
    if (blockUntilShardingComplatedIfNotLeader()) {
        return;
    }
    // If any shard of the job is still running, wait (indefinitely)
    waitingOtherJobCompleted();
    // Create an ephemeral node to mark that sharding is in progress
    getJobNodeStorage().fillEphemeralJobNode(ShardingNode.PROCESSING, "");
    try {
        // Delete the sharding nodes of all JobServers under this job
        clearShardingInfo();
        int maxRetryTime = 3;
        int retryCount = 0;
        while (!isShutdown) {
            int version = getDataStat.getVersion();
            // First try to read from the job/leader/sharding/necessary node; if that fails, read from $SaturnExecutors/sharding/content
            // key is executor, value is sharding items
            Map<String, List<Integer>> shardingItems = namespaceShardingContentService.getShardContent(jobName, getDataStat.getData());
            try {
                // The (check + create) for every jobserver, plus setting the sharding/necessary content to 0, form a single transaction
                CuratorTransactionFinal curatorTransactionFinal = getJobNodeStorage().getClient().inTransaction().check().forPath("/").and();
                for (Entry<String, List<Integer>> entry : shardingItems.entrySet()) {
                    curatorTransactionFinal.create().forPath(JobNodePath.getNodeFullPath(jobName, ShardingNode.getShardingNode(entry.getKey())), ItemUtils.toItemsString(entry.getValue()).getBytes(StandardCharsets.UTF_8)).and();
                }
                curatorTransactionFinal.setData().withVersion(version).forPath(JobNodePath.getNodeFullPath(jobName, ShardingNode.NECESSARY), SHARDING_UN_NECESSARY.getBytes(StandardCharsets.UTF_8)).and();
                curatorTransactionFinal.commit();
                break;
            } catch (BadVersionException e) {
                LogUtils.warn(log, jobName, "zookeeper bad version exception happens", e);
                if (++retryCount <= maxRetryTime) {
                    LogUtils.info(log, jobName, "bad version because of concurrency, will retry to get shards from sharding/necessary later");
                    // NOSONAR
                    Thread.sleep(200L);
                    getDataStat = getNecessaryDataStat();
                }
            } catch (Exception e) {
                LogUtils.warn(log, jobName, "commit shards failed", e);
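                /**
                 * Known scenario:
                 *   The exception is NoNodeException; the namespace has a large number of jobs and business containers are going online and offline.
                 *   Cause: the large number of sharding tasks delays the computed result while some servers are removed, so the commit fails with a NoNode exception.
                 *
                 * Is a retry needed:
                 *   If the job stays enabled, necessary will eventually be updated correctly, so an active retry is not needed. A retry may fetch the data early and then fetch it again later, which is not a big problem.
                 *   If the job was disabled midway, necessary will not be updated, so the data read from necessary is stale and the commit still fails; in that case a retry with data from content is needed.
                 *   If the commit failed because of some other unknown scenario, retrying with data from content is also acceptable.
                 *   So, to be safe, always retry with data from content.
                 */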
                if (++retryCount <= maxRetryTime) {
                    LogUtils.info(log, jobName, "unexpected error, will retry to get shards from sharding/content later");
                    // Sleep briefly; there is no need to retry immediately. This reduces pressure on zk.
                    // NOSONAR
                    Thread.sleep(500L);
                    /**
                     * Note:
                     *   With data set to GET_SHARD_FROM_CONTENT_NODE_FLAG, the shards are read from sharding/content.
                     *   The version used is the version of necessary.
                     */
                    getDataStat = new GetDataStat(NamespaceShardingContentService.GET_SHARD_FROM_CONTENT_NODE_FLAG, version);
                }
            }
            if (retryCount > maxRetryTime) {
                LogUtils.warn(log, jobName, "retry time exceed {}, will give up to get shards", maxRetryTime);
                break;
            }
        }
    } catch (Exception e) {
        LogUtils.error(log, jobName, e.getMessage(), e);
    } finally {
        getJobNodeStorage().removeJobNodeIfExisted(ShardingNode.PROCESSING);
    }
}
Also used : CuratorTransactionFinal(org.apache.curator.framework.api.transaction.CuratorTransactionFinal) List(java.util.List) BadVersionException(org.apache.zookeeper.KeeperException.BadVersionException) JobShuttingDownException(com.vip.saturn.job.exception.JobShuttingDownException)
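
The retry loop in shardingIfNecessary is a form of optimistic concurrency: write with the version that was read, and on a version conflict re-read and try again up to a fixed limit. The following is a minimal sketch of that idea only, not Saturn's implementation. It does not use ZooKeeper or Curator; Node, Versioned, StaleVersionException and resetFlagWithRetry are hypothetical names introduced for illustration, with an in-memory object standing in for the sharding/necessary node. The sleep between attempts and the shutdown check of the original are left out.

```java
public class VersionedRetrySketch {

    /** Snapshot of a node's data together with the version it was read at. */
    static final class Versioned {
        final String data;
        final int version;

        Versioned(String data, int version) {
            this.data = data;
            this.version = version;
        }
    }

    /** Hypothetical stand-in for a bad-version error from the store. */
    static final class StaleVersionException extends RuntimeException {
        StaleVersionException(String message) {
            super(message);
        }
    }

    /** Hypothetical in-memory node; every successful write bumps the version. */
    static final class Node {
        private Versioned current = new Versioned("1", 0);

        synchronized Versioned read() {
            return current;
        }

        /** Writes only if the caller's expected version matches the current one. */
        synchronized void setData(String data, int expectedVersion) {
            if (expectedVersion != current.version) {
                throw new StaleVersionException(
                        "expected version " + expectedVersion + " but was " + current.version);
            }
            current = new Versioned(data, current.version + 1);
        }
    }

    /**
     * Tries to set the flag to "0" using the version in {@code seen}.
     * On a version conflict it re-reads the node and retries, giving up
     * once more than {@code maxRetryTime} conflicts have occurred.
     */
    static boolean resetFlagWithRetry(Node node, Versioned seen, int maxRetryTime) {
        int retryCount = 0;
        while (true) {
            try {
                node.setData("0", seen.version);
                return true;
            } catch (StaleVersionException e) {
                if (++retryCount > maxRetryTime) {
                    return false; // give up, as the original does after its retry limit
                }
                seen = node.read(); // refresh data and version before the next attempt
            }
        }
    }

    public static void main(String[] args) {
        Node node = new Node();
        Versioned stale = node.read();
        node.setData("1", stale.version); // simulate a concurrent writer
        boolean committed = resetFlagWithRetry(node, stale, 3);
        System.out.println("committed=" + committed + ", data=" + node.read().data);
    }
}
```

In the sketch the first attempt is made with a stale version on purpose, so the conflict path runs once before the write succeeds. The real method additionally distinguishes a bad-version failure (re-read necessary) from any other failure (fall back to sharding/content).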
