Example use in project anomaly-detection by opensearch-project.
From the class AnomalyResultTests, method testNullPointerRCFResult.
public void testNullPointerRCFResult() {
    AnomalyResultTransportAction action = new AnomalyResultTransportAction(
        new ActionFilters(Collections.emptySet()), transportService, settings, client, stateManager, featureQuery,
        normalModelManager, hashRing, clusterService, indexNameResolver, adCircuitBreakerService, adStats, threadPool,
        NamedXContentRegistry.EMPTY, adTaskManager
    );
    ActionListener<AnomalyResultResponse> listener = mock(ActionListener.class);
    // A null detector causes a NullPointerException inside the RCF listener.
    // RCFActionListener is an inner class, so it is created through the enclosing action instance.
    AnomalyResultTransportAction.RCFActionListener rcfListener = action.new RCFActionListener("123-rcf-0", null, "nodeID", null, listener, null, adID);
    double[] attribution = new double[] { 1. };
    long totalUpdates = 32;
    double grade = 0.5;
    ArgumentCaptor<Exception> failureCaptor = ArgumentCaptor.forClass(Exception.class);
    rcfListener.onResponse(new RCFResultResponse(0.3, 0, 26, attribution, totalUpdates, grade, Version.CURRENT, 0, null, null, null, 1.1));
    // The failure must reach the caller exactly once, reported as an InternalFailure
    verify(listener, times(1)).onFailure(failureCaptor.capture());
    Exception failure = failureCaptor.getValue();
    assertTrue(failure instanceof InternalFailure);
}
Example use in project anomaly-detection by opensearch-project.
From the class AnomalyResultTransportAction, method doExecute.
/**
 * All exceptions thrown by AD are subclasses of AnomalyDetectionException.
 * ClientException is a subclass of AnomalyDetectionException. All exceptions visible to
 * the client are under ClientVisible. Two classes directly extend ClientException:
 * - InternalFailure for "root cause unknown failure. Maybe transient." We can keep the
 *   detector running.
 * - EndRunException for "failures that might impact the customer." The method endNow()
 *   indicates whether the client should immediately terminate running a detector.
 *   + endNow() returns true for "unrecoverable issue". We want to terminate the detector
 *     run immediately.
 *   + endNow() returns false for "maybe unrecoverable issue but worth retrying a few more
 *     times." We want to retry on a few more requests before terminating the detector run.
 *
 * AD may not be able to get an anomaly grade but can still find a feature vector. Consider
 * the case when the shingle is not ready. In that case, AD puts NaN as the anomaly grade
 * and returns the feature vector. If AD cannot even find a feature vector, AD throws
 * EndRunException if there is an issue; otherwise it returns an empty response (all
 * numeric fields are Double.NaN and the feature array is empty, so that customers can
 * write Painless scripts against it).
 *
 * Known causes of EndRunException with endNow returning false:
 * + training data for cold start not available
 * + cold start cannot succeed
 * + unknown prediction error
 * + memory circuit breaker tripped
 * + invalid search query
 *
 * Known causes of EndRunException with endNow returning true:
 * + a model partition's memory size reached the limit
 * + the models' total memory size reached the limit
 * + trouble querying feature data because
 *   * the index does not exist
 *   * all features have been disabled
 * + anomaly detector is not available
 * + AD plugin is disabled
 * + training data is invalid due to serious internal bug(s)
 *
 * Known causes of InternalFailure:
 * + threshold model node is not available
 * + cluster read/write is blocked
 * + cold start hasn't finished
 * + failure to get the responses of all RCF model nodes
 * + failure to get the threshold model node's response
 * + RCF/Threshold model node failing to get a checkpoint to restore the model before timeout
 * + detection is throttled because the previous detection query is still running
 */
protected void doExecute(Task task, ActionRequest actionRequest, ActionListener<AnomalyResultResponse> listener) {
    try (ThreadContext.StoredContext context = client.threadPool().getThreadContext().stashContext()) {
        AnomalyResultRequest request = AnomalyResultRequest.fromActionRequest(actionRequest);
        String adID = request.getAdID();
        ActionListener<AnomalyResultResponse> original = listener;
        listener = ActionListener.wrap(r -> {
            // ...
        }, e -> {
            // An AnomalyDetectionException that is not counted in stats
            // is left out of the failure stats.
            if (!(e instanceof AnomalyDetectionException) || ((AnomalyDetectionException) e).isCountedInStats()) {
                // ...
                if (hcDetectors.contains(adID)) {
                    // ...
                }
            }
            // ...
        });
        if (!EnabledSetting.isADPluginEnabled()) {
            throw new EndRunException(adID, CommonErrorMessages.DISABLED_ERR_MSG, true).countedInStats(false);
        }
        if (adCircuitBreakerService.isOpen()) {
            listener.onFailure(new LimitExceededException(adID, CommonErrorMessages.MEMORY_CIRCUIT_BROKEN_ERR_MSG, false));
            // ...
        }
        try {
            stateManager.getAnomalyDetector(adID, onGetDetector(listener, adID, request));
        } catch (Exception ex) {
            handleExecuteException(ex, listener, adID);
        }
    } catch (Exception e) {
        // ...
    }
}
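The exception policy described in the comment above can be sketched in isolation. This is a minimal illustration, not the plugin's code: the nested classes are simplified stand-ins that only mirror the names in the comment, and the `Decision` enum and `decide` method are hypothetical helpers introduced here. Treating every non-`EndRunException` as "keep running" is also an assumption.

```java
class ExceptionPolicySketch {
    // Simplified stand-ins for the AD exception types (not the project's real classes).
    static class AnomalyDetectionException extends RuntimeException {
        AnomalyDetectionException(String message) { super(message); }
    }

    static class ClientException extends AnomalyDetectionException {
        ClientException(String message) { super(message); }
    }

    // Root cause unknown, maybe transient.
    static class InternalFailure extends ClientException {
        InternalFailure(String message) { super(message); }
    }

    // Failure that might impact the customer; endNow says how urgently to stop.
    static class EndRunException extends ClientException {
        private final boolean endNow;

        EndRunException(String message, boolean endNow) {
            super(message);
            this.endNow = endNow;
        }

        boolean isEndNow() { return endNow; }
    }

    // Hypothetical outcome type, introduced only for this sketch.
    enum Decision { KEEP_RUNNING, RETRY_BEFORE_STOPPING, STOP_NOW }

    static Decision decide(Exception e) {
        if (e instanceof EndRunException) {
            return ((EndRunException) e).isEndNow() ? Decision.STOP_NOW : Decision.RETRY_BEFORE_STOPPING;
        }
        // InternalFailure: the detector can continue. Assumption: other exceptions are treated the same.
        return Decision.KEEP_RUNNING;
    }

    public static void main(String[] args) {
        System.out.println(decide(new InternalFailure("cluster read/write is blocked")));
        System.out.println(decide(new EndRunException("cold start cannot succeed", false)));
        System.out.println(decide(new EndRunException("AD plugin is disabled", true)));
    }
}
```

Keeping the stop/retry choice on the exception itself, through `isEndNow()`, lets callers branch without knowing every concrete failure cause.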
Example use in project anomaly-detection by opensearch-project.
From the class AnomalyResultTransportAction, method executeAnomalyDetection.
private void executeAnomalyDetection(ActionListener<AnomalyResultResponse> listener, String adID, AnomalyResultRequest request, AnomalyDetector anomalyDetector, long dataStartTime, long dataEndTime) {
    // HC logic starts here
    if (anomalyDetector.isMultientityDetector()) {
        Optional<Exception> previousException = stateManager.fetchExceptionAndClear(adID);
        if (previousException.isPresent()) {
            Exception exception = previousException.get();
            LOG.error(new ParameterizedMessage("Previous exception of [{}]", adID), exception);
            if (exception instanceof EndRunException) {
                EndRunException endRunException = (EndRunException) exception;
                if (endRunException.isEndNow()) {
                    // ...
                }
            }
        }
        // assume request times are in epoch milliseconds
        long nextDetectionStartTime = request.getEnd() + (long) (anomalyDetector.getDetectorIntervalInMilliseconds() * intervalRatioForRequest);
        CompositeRetriever compositeRetriever = new CompositeRetriever(dataStartTime, dataEndTime, anomalyDetector, xContentRegistry, client, nextDetectionStartTime, settings, maxEntitiesPerInterval, pageSize);
        PageIterator pageIterator = null;
        try {
            pageIterator = compositeRetriever.iterator();
        } catch (Exception e) {
            listener.onFailure(new EndRunException(anomalyDetector.getDetectorId(), CommonErrorMessages.INVALID_SEARCH_QUERY_MSG, e, false));
            // ...
        }
        PageListener getEntityFeatureslistener = new PageListener(pageIterator, adID, dataStartTime, dataEndTime);
        if (pageIterator.hasNext()) {
            // ...
        }
        // Pagination will stop itself when the time is up.
        if (previousException.isPresent()) {
            // ...
        } else {
            listener.onResponse(new AnomalyResultResponse(new ArrayList<FeatureData>(), null, null, anomalyDetector.getDetectorIntervalInMinutes(), true));
        }
        // ...
    }
    // HC logic ends and single entity logic starts here
    // We are going to use only 1 model partition for a single stream detector.
    // That's why we use 0 here.
    String rcfModelID = SingleStreamModelIdMapper.getRcfModelId(adID, 0);
    Optional<DiscoveryNode> asRCFNode = hashRing.getOwningNodeWithSameLocalAdVersionForRealtimeAD(rcfModelID);
    if (!asRCFNode.isPresent()) {
        listener.onFailure(new InternalFailure(adID, "RCF model node is not available."));
        // ...
    }
    DiscoveryNode rcfNode = asRCFNode.get();
    if (!shouldStart(listener, adID, anomalyDetector, rcfNode.getId(), rcfModelID)) {
        // ...
    }
    featureManager.getCurrentFeatures(anomalyDetector, dataStartTime, dataEndTime, onFeatureResponseForSingleEntityDetector(adID, anomalyDetector, listener, rcfModelID, rcfNode, dataStartTime, dataEndTime));
}
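The `nextDetectionStartTime` computation in the multi-entity branch is plain arithmetic on epoch milliseconds, and a standalone sketch makes the cast explicit. The class and method names below are hypothetical, and the ratio of 0.5 is an arbitrary illustrative value, not the plugin's configured `intervalRatioForRequest`.

```java
import java.time.Duration;

class NextDetectionStartSketch {
    // Mirrors the expression in the excerpt: request end time plus a fraction of the detector interval.
    // The (long) cast truncates the fractional milliseconds toward zero.
    static long nextDetectionStartTime(long requestEndMillis, long intervalMillis, double intervalRatio) {
        return requestEndMillis + (long) (intervalMillis * intervalRatio);
    }

    public static void main(String[] args) {
        long requestEnd = 1_700_000_000_000L;               // epoch milliseconds
        long interval = Duration.ofMinutes(10).toMillis();  // 600000 ms
        double ratio = 0.5;                                 // hypothetical ratio for illustration
        // 600000 * 0.5 = 300000, so this prints 1700000300000
        System.out.println(nextDetectionStartTime(requestEnd, interval, ratio));
    }
}
```

In the excerpt this value is passed to `CompositeRetriever` together with the paging parameters; how the retriever uses it is not shown there.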
Example use in project anomaly-detection by opensearch-project.
From the class AnomalyResultTransportAction, method coldStartIfNoModel.
/**
 * Verify failure of RCF or threshold models. If there is no model, trigger cold
 * start. If there is an exception from the previous cold start of this detector,
 * return the exception to the caller.
 *
 * @param failure object that may contain exceptions thrown
 * @param detector detector object
 * @return exception if AD job execution gets resource not found exception
 * @throws Exception when the input failure is not a ResourceNotFoundException.
 *   List of exceptions we can throw:
 *   1. Exception from cold start:
 *     1) InternalFailure due to
 *       a. OpenSearchTimeoutException thrown by putModelCheckpoint during cold start
 *     2) EndRunException with endNow equal to false
 *       a. training data not available
 *       b. cold start cannot succeed
 *       c. invalid training data
 *     3) EndRunException with endNow equal to true
 *       a. invalid search query
 *   2. LimitExceededException from one of the RCF model nodes when the total size of the
 *      models is more than X% of heap memory.
 *   3. InternalFailure wrapping an OpenSearchTimeoutException caused by an
 *      RCF/Threshold model node failing to get a checkpoint to restore the model before timeout.
 */
private Exception coldStartIfNoModel(AtomicReference<Exception> failure, AnomalyDetector detector) throws Exception {
    Exception exp = failure.get();
    if (exp == null) {
        return null;
    }
    // return exceptions like LimitExceededException to caller
    if (!(exp instanceof ResourceNotFoundException)) {
        return exp;
    }
    // fetch previous cold start exception
    String adID = detector.getDetectorId();
    final Optional<Exception> previousException = stateManager.fetchExceptionAndClear(adID);
    if (previousException.isPresent()) {
        Exception exception = previousException.get();
        LOG.error("Previous exception of {}: {}", () -> adID, () -> exception);
        if (exception instanceof EndRunException && ((EndRunException) exception).isEndNow()) {
            return exception;
        }
    }
    LOG.info("Trigger cold start for {}", detector.getDetectorId());
    // ... (the cold start call itself is not shown in this excerpt)
    return previousException.orElse(new InternalFailure(adID, NO_MODEL_ERR_MSG));
}
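The decision flow of `coldStartIfNoModel` can be exercised without the plugin. The sketch below is an illustration under stated assumptions: the exception classes are simplified stand-ins, the map-backed `fetchExceptionAndClear` is a hypothetical replacement for the state manager, and the `coldStartsTriggered` counter stands in for the real cold start call.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

class ColdStartDecisionSketch {
    // Simplified stand-ins for the project's exception types.
    static class ResourceNotFoundException extends RuntimeException {
        ResourceNotFoundException(String message) { super(message); }
    }

    static class InternalFailure extends RuntimeException {
        InternalFailure(String message) { super(message); }
    }

    static class EndRunException extends RuntimeException {
        private final boolean endNow;

        EndRunException(String message, boolean endNow) {
            super(message);
            this.endNow = endNow;
        }

        boolean isEndNow() { return endNow; }
    }

    // Hypothetical per-detector store of the most recent exception.
    private final Map<String, Exception> lastException = new ConcurrentHashMap<>();
    int coldStartsTriggered = 0; // stand-in for actually starting a cold start

    void setException(String detectorId, Exception e) {
        lastException.put(detectorId, e);
    }

    // Returns the stored exception at most once: remove() reads and clears in one step.
    Optional<Exception> fetchExceptionAndClear(String detectorId) {
        return Optional.ofNullable(lastException.remove(detectorId));
    }

    Exception coldStartIfNoModel(Exception failure, String detectorId) {
        if (failure == null) {
            return null; // nothing went wrong
        }
        if (!(failure instanceof ResourceNotFoundException)) {
            return failure; // not a missing-model problem, hand it back unchanged
        }
        Optional<Exception> previous = fetchExceptionAndClear(detectorId);
        if (previous.isPresent()
            && previous.get() instanceof EndRunException
            && ((EndRunException) previous.get()).isEndNow()) {
            return previous.get(); // unrecoverable, do not retry cold start
        }
        coldStartsTriggered++;
        return previous.orElseGet(() -> new InternalFailure("no model for " + detectorId));
    }
}
```

The sketch uses `orElseGet` so the fallback exception is only constructed when no previous exception exists; the excerpt uses `orElse`, which builds the fallback eagerly but returns the same result.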
Example use in project anomaly-detection by opensearch-project.
From the class MultiEntityResultTests, method testTimeOutExceptionInModelNode.
/**
 * Test the case where, in the model node, the previously recorded exception is an
 * OpenSearchTimeoutException.
 *
 * @throws IOException when failing to set up transport layer
 * @throws InterruptedException when failing to wait for inProgress to finish
 */
public void testTimeOutExceptionInModelNode() throws IOException, InterruptedException {
    Pair<NodeStateManager, CountDownLatch> preparedFixture = setUpTestExceptionTestingInModelNode();
    NodeStateManager modelNodeStateManager = preparedFixture.getLeft();
    CountDownLatch inProgress = preparedFixture.getRight();
    when(modelNodeStateManager.fetchExceptionAndClear(anyString())).thenReturn(Optional.of(new OpenSearchTimeoutException("blah")));
    setUpEntityResult(1, modelNodeStateManager);
    PlainActionFuture<AnomalyResultResponse> listener = new PlainActionFuture<>();
    action.doExecute(null, request, listener);
    AnomalyResultResponse response = listener.actionGet(10000L);
    assertEquals(Double.NaN, response.getAnomalyGrade(), 0.01);
    assertTrue(inProgress.await(10000L, TimeUnit.MILLISECONDS));
    // Since OpenSearchTimeoutException is not an EndRunException with endNow = true, the normal workflow continues
    verify(resultWriteQueue, times(3)).put(any());
    ArgumentCaptor<Exception> exceptionCaptor = ArgumentCaptor.forClass(Exception.class);
    verify(stateManager).setException(anyString(), exceptionCaptor.capture());
    Exception actual = exceptionCaptor.getValue();
    assertTrue("actual exception is " + actual, actual instanceof InternalFailure);
}