Fixing hanging tasks for correlations#874
Conversation
Signed-off-by: Megha Goyal <goyamegh@amazon.com>
0ee0288 to
98fbc25
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #874 +/- ##
============================================
+ Coverage 24.82% 25.07% +0.24%
+ Complexity 1030 1029 -1
============================================
Files 277 277
Lines 12717 12572 -145
Branches 1401 1372 -29
============================================
- Hits 3157 3152 -5
+ Misses 9296 9156 -140
Partials 264 264 ☔ View full report in Codecov by Sentry. |
| long totalHits = response.getResponse().getHits().getTotalHits().value; | ||
| long totalHits = response.getResponse().getHits().getHits().length; |
There was a problem hiding this comment.
What's the rationale to swap from the total hits value to the hits array length?
There was a problem hiding this comment.
We are seeing inconsistent results from search response where the getTotalHits().value is non zero, but the actual response array has no objects. The idea is to use the same array for both, validation and then fetching data.
f653429 to
98fbc25
Compare
| client.search(searchRequest, new ActionListener<>() { | ||
| @Override | ||
| public void onResponse(SearchResponse response) { | ||
| if (response.getHits().getHits().length == 0) { |
There was a problem hiding this comment.
is not getting any hits really an exception? i thought if totalHits>0 && response.getHits().getHits().length == 0 only then its an exceptional case
There was a problem hiding this comment.
The search request expects the result to have one doc (size is set to 1), we can anyway get only two responses from the hits array length (0 or 1). Total hits isn't a required check for this invocation.
| } | ||
|
|
||
| public void onFailures(Exception t) { | ||
| log.error("Exception occurred while processing correlations", t); |
There was a problem hiding this comment.
redundant log line. the correct message should be logged at caller.
There was a problem hiding this comment.
This line will be the only place we log errors before throwing the exception out to Alerting, I made sure of this while testing for most happy cases, will follow up and remove them in future PRs.
Signed-off-by: Megha Goyal <goyamegh@amazon.com> (cherry picked from commit db025ce)
Signed-off-by: Megha Goyal <goyamegh@amazon.com> (cherry picked from commit db025ce)
Signed-off-by: Megha Goyal <goyamegh@amazon.com>
Signed-off-by: Megha Goyal <goyamegh@amazon.com>
|
The backport to To backport manually, run these commands in your terminal: # Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-2.9 2.9
# Navigate to the new working tree
cd .worktrees/backport-2.9
# Create a new branch
git switch --create backport/backport-874-to-2.9
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 db025ce69c0201798b9e862a9156a658d2f2d241
# Push it to GitHub
git push --set-upstream origin backport/backport-874-to-2.9
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-2.9Then, create a pull request where the |
|
The backport to To backport manually, run these commands in your terminal: # Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-2.7 2.7
# Navigate to the new working tree
cd .worktrees/backport-2.7
# Create a new branch
git switch --create backport/backport-874-to-2.7
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 db025ce69c0201798b9e862a9156a658d2f2d241
# Push it to GitHub
git push --set-upstream origin backport/backport-874-to-2.7
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-2.7Then, create a pull request where the |
Description
While dealing with correlations, we noticed unexpectedly high resource consumption in CPU and JVM, resulting in node drops. Upon closer examination, we found a subset of subscribe/findings tasks consistently in a hanging state throughout the data node's lifecycle until it either dropped or required manual restart. This information can be corroborated through the _cat/tasks API.
The root cause of these hanging tasks was the way action listeners handle exceptions today, causing the parent listener to not close in some cases, if the exceptions are not caught correctly.
In this specific case, the exception causing the contention of resources was:
java.lang.ArrayIndexOutOfBoundsException: Index 0 out of bounds for length 0 at org.opensearch.securityanalytics.correlation.VectorEmbeddingsEngine$1$1.onResponse(VectorEmbeddingsEngine.java:142) at org.opensearch.securityanalytics.correlation.VectorEmbeddingsEngine$1$1.onResponse(VectorEmbeddingsEngine.java:115) at org.opensearch.action.support.TransportAction$1.onResponse(TransportAction.java:113) at org.opensearch.action.support.TransportAction$1.onResponse(TransportAction.java:107) at org.opensearch.action.search.TransportMultiSearchAction$1.finish(TransportMultiSearchAction.java:209) at org.opensearch.action.search.TransportMultiSearchAction$1.handleResponse(TransportMultiSearchAction.java:195) at org.opensearch.action.search.TransportMultiSearchAction$1.onResponse(TransportMultiSearchAction.java:183) at org.opensearch.action.search.TransportMultiSearchAction$1.onResponse(TransportMultiSearchAction.java:180) at org.opensearch.action.support.TransportAction$1.onResponse(TransportAction.java:113) at org.opensearch.action.support.TransportAction$1.onResponse(TransportAction.java:107) at org.opensearch.performanceanalyzer.action.PerformanceAnalyzerActionListener.onResponse(PerformanceAnalyzerActionListener.java:55) at org.opensearch.action.support.TimeoutTaskCancellationUtility$TimeoutRunnableListener.onResponse(TimeoutTaskCancellationUtility.java:132)This will PR will be followed up by two more solutions:
andgetTotalHits().value` are different and if we need to modify the way the search requests have been created in order to restore this usage in future.Issues Resolved
This PR aims to fix the current exception being thrown by using
getHits().lengthinstead ofgetTotalHits().valueas a mitigation to avoid any tasks to hang.Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.