Examples with JobShuttingDownException - com.vip.saturn.job.exception.JobShuttingDownException

Example 1 with JobShuttingDownException

Use of com.vip.saturn.job.exception.JobShuttingDownException in project Saturn by vipshop.

From the class ShardingService, method shardingIfNecessary.

/**
 * Shards the job if sharding is necessary and the current node is the leader.
 */
public synchronized void shardingIfNecessary() throws JobShuttingDownException {
    if (isShutdown) {
        return;
    }
    GetDataStat getDataStat = null;
    if (getJobNodeStorage().isJobNodeExisted(ShardingNode.NECESSARY)) {
        getDataStat = getNecessaryDataStat();
    }
    // If the sharding necessary node is empty or its content is "0", return; otherwise sharding must be performed
    if (getDataStat == null || SHARDING_UN_NECESSARY.equals(getDataStat.getData())) {
        return;
    }
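    // If this node is not the leader, wait for the leader to finish (this is also an endless loop that exits only when: 1. shutdown is requested, or 2. sharding is not necessary and the job is not in the processing state)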
    if (blockUntilShardingComplatedIfNotLeader()) {
        return;
    }
    // Wait (indefinitely) while any shard of the job is still in the running state
    waitingOtherJobCompleted();
    // Create an ephemeral node to mark that sharding is in progress
    getJobNodeStorage().fillEphemeralJobNode(ShardingNode.PROCESSING, "");
    try {
        // Delete the sharding nodes of all JobServers under this job
        clearShardingInfo();
        int maxRetryTime = 3;
        int retryCount = 0;
        while (!isShutdown) {
            int version = getDataStat.getVersion();
            // First try to read from the job/leader/sharding/necessary node; if that fails, read from $SaturnExecutors/sharding/content
            // key is executor, value is sharding items
            Map<String, List<Integer>> shardingItems = namespaceShardingContentService.getShardContent(jobName, getDataStat.getData());
            try {
                // The (check + create) for every job server, plus setting the sharding necessary content to 0, all run in a single transaction
                CuratorTransactionFinal curatorTransactionFinal = getJobNodeStorage().getClient().inTransaction().check().forPath("/").and();
                for (Entry<String, List<Integer>> entry : shardingItems.entrySet()) {
                    curatorTransactionFinal.create().forPath(JobNodePath.getNodeFullPath(jobName, ShardingNode.getShardingNode(entry.getKey())), ItemUtils.toItemsString(entry.getValue()).getBytes(StandardCharsets.UTF_8)).and();
                }
                curatorTransactionFinal.setData().withVersion(version).forPath(JobNodePath.getNodeFullPath(jobName, ShardingNode.NECESSARY), SHARDING_UN_NECESSARY.getBytes(StandardCharsets.UTF_8)).and();
                curatorTransactionFinal.commit();
                break;
            } catch (BadVersionException e) {
                LogUtils.warn(log, jobName, "zookeeper bad version exception happens", e);
                if (++retryCount <= maxRetryTime) {
                    LogUtils.info(log, jobName, "bad version because of concurrency, will retry to get shards from sharding/necessary later");
                    // NOSONAR
                    Thread.sleep(200L);
                    getDataStat = getNecessaryDataStat();
                }
            } catch (Exception e) {
                LogUtils.warn(log, jobName, "commit shards failed", e);
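                /**
                 * Known scenario:
                 *   The exception is NoNodeException; the namespace has many jobs and business containers go online and offline.
                 *   The cause: a large number of sharding tasks makes the computed result lag while some servers are deleted, so the commit fails with a NoNode exception.
                 *
                 * Whether to retry:
                 *   If the job stays enabled, necessary will eventually be updated correctly, so no active retry is needed. Retrying may fetch the data early and then fetch it again later, but that does little harm.
                 *   If the job was disabled midway, necessary will not be updated, so the data read from necessary is stale and the commit will fail again; in that case, retry with data from content.
                 *   If the commit failed for some other unknown reason, retrying with data from content is also worth trying.
                 *   So, to be safe, always retry with data from content.
                 */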
                if (++retryCount <= maxRetryTime) {
                    LogUtils.info(log, jobName, "unexpected error, will retry to get shards from sharding/content later");
                    // Sleep briefly; there is no need to retry immediately. This reduces pressure on zk.
                    // NOSONAR
                    Thread.sleep(500L);
                    /**
                     * Note:
                     *   When data is GET_SHARD_FROM_CONTENT_NODE_FLAG, the shards are read from sharding/content.
                     *   version reuses the version of necessary.
                     */
                    getDataStat = new GetDataStat(NamespaceShardingContentService.GET_SHARD_FROM_CONTENT_NODE_FLAG, version);
                }
            }
            if (retryCount > maxRetryTime) {
                LogUtils.warn(log, jobName, "retry time exceed {}, will give up to get shards", maxRetryTime);
                break;
            }
        }
    } catch (Exception e) {
        LogUtils.error(log, jobName, e.getMessage(), e);
    } finally {
        getJobNodeStorage().removeJobNodeIfExisted(ShardingNode.PROCESSING);
    }
}
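
The retry loop in the listing above follows an optimistic-concurrency pattern: read a version, attempt a version-checked write, and on a version conflict re-read and try again up to a fixed number of retries. The following is a minimal, self-contained sketch of that pattern only. It does not use ZooKeeper or Curator: VersionedNode, StaleVersionException, commitWithRetry, and the beforeEachAttempt hook are hypothetical names introduced here for illustration, standing in for a ZooKeeper node, BadVersionException, and a concurrent writer.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class VersionedCommitRetrySketch {

    /** Hypothetical stand-in for KeeperException.BadVersionException. */
    static final class StaleVersionException extends Exception {
        StaleVersionException(String message) {
            super(message);
        }
    }

    /** Hypothetical in-memory stand-in for a versioned ZooKeeper node. */
    static final class VersionedNode {
        private String data;
        private int version;

        VersionedNode(String data) {
            this.data = data;
        }

        synchronized String getData() {
            return data;
        }

        synchronized int getVersion() {
            return version;
        }

        /** Writes only if expectedVersion still matches, then bumps the version. */
        synchronized void setData(String newData, int expectedVersion) throws StaleVersionException {
            if (expectedVersion != version) {
                throw new StaleVersionException("expected " + expectedVersion + " but was " + version);
            }
            data = newData;
            version++;
        }
    }

    static final int MAX_RETRY_TIME = 3;

    /**
     * Attempts a version-checked write, re-reading the version after each conflict.
     * Returns true if the write was committed, false once the retries are exhausted.
     * beforeEachAttempt is a hook used only to simulate a concurrent writer.
     */
    static boolean commitWithRetry(VersionedNode node, String newData, Runnable beforeEachAttempt)
            throws InterruptedException {
        int retryCount = 0;
        int version = node.getVersion();
        while (true) {
            beforeEachAttempt.run();
            try {
                node.setData(newData, version);
                return true;
            } catch (StaleVersionException e) {
                if (++retryCount > MAX_RETRY_TIME) {
                    return false; // give up, as the listing does after maxRetryTime retries
                }
                Thread.sleep(20L); // brief pause before retrying
                version = node.getVersion(); // refresh the version, then try again
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        VersionedNode node = new VersionedNode("1");
        AtomicInteger interference = new AtomicInteger(2);
        // Simulate another writer that bumps the version before the first two attempts.
        boolean committed = commitWithRetry(node, "0", () -> {
            if (interference.getAndDecrement() > 0) {
                try {
                    node.setData("1", node.getVersion());
                } catch (StaleVersionException ignored) {
                    // cannot happen in this single-threaded simulation
                }
            }
        });
        System.out.println(committed + " " + node.getData() + " v" + node.getVersion()); // prints "true 0 v3"
    }
}
```

With MAX_RETRY_TIME set to 3, the sketch makes at most four attempts (the initial one plus three retries), which mirrors the `++retryCount <= maxRetryTime` check in the listing.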

Also used: CuratorTransactionFinal (org.apache.curator.framework.api.transaction.CuratorTransactionFinal), List (java.util.List), BadVersionException (org.apache.zookeeper.KeeperException.BadVersionException), JobShuttingDownException (com.vip.saturn.job.exception.JobShuttingDownException)
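
The listing declares `throws JobShuttingDownException` but never throws it directly, so the exception presumably originates in one of the blocking helpers it calls. As a rough illustration of how a blocking wait can signal shutdown through a checked exception, and how a caller might handle it, here is a self-contained sketch. Everything in it is an assumption made for illustration: the nested JobShuttingDownException is a hypothetical stand-in for the Saturn class (assumed here to be a checked exception), and waitUntilDone, runOnce, and the poll counter are not Saturn APIs.

```java
import java.util.concurrent.atomic.AtomicBoolean;

final class ShutdownSignalSketch {

    /** Hypothetical stand-in; the real class is com.vip.saturn.job.exception.JobShuttingDownException. */
    static final class JobShuttingDownException extends Exception {
        JobShuttingDownException(String message) {
            super(message);
        }
    }

    private final AtomicBoolean shutdown = new AtomicBoolean(false);
    private int remainingPolls;

    /** pollsUntilDone simulates how many polls the awaited work still needs. */
    ShutdownSignalSketch(int pollsUntilDone) {
        this.remainingPolls = pollsUntilDone;
    }

    void shutdown() {
        shutdown.set(true);
    }

    /** Polls until the simulated work is done; aborts with a checked exception on shutdown. */
    void waitUntilDone() throws JobShuttingDownException, InterruptedException {
        while (remainingPolls > 0) {
            if (shutdown.get()) {
                throw new JobShuttingDownException("shutting down, stop waiting");
            }
            remainingPolls--;
            Thread.sleep(10L);
        }
    }

    /** Returns "completed" if the wait finished normally, "aborted" if shutdown interrupted it. */
    static String runOnce(ShutdownSignalSketch service) throws InterruptedException {
        try {
            service.waitUntilDone();
            return "completed";
        } catch (JobShuttingDownException e) {
            // The caller skips the rest of its work instead of treating this as a failure.
            return "aborted";
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ShutdownSignalSketch normal = new ShutdownSignalSketch(2);
        System.out.println(runOnce(normal)); // prints "completed"

        ShutdownSignalSketch stopping = new ShutdownSignalSketch(2);
        stopping.shutdown();
        System.out.println(runOnce(stopping)); // prints "aborted"
    }
}
```

Using a dedicated checked exception forces every caller of the blocking method to decide explicitly what to do when shutdown interrupts the wait.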

Aggregations

JobShuttingDownException (com.vip.saturn.job.exception.JobShuttingDownException): 1
List (java.util.List): 1
CuratorTransactionFinal (org.apache.curator.framework.api.transaction.CuratorTransactionFinal): 1
BadVersionException (org.apache.zookeeper.KeeperException.BadVersionException): 1