Skip to content

Fixing hanging tasks for correlations#874

Merged
goyamegh merged 1 commit intoopensearch-project:mainfrom
goyamegh:main-correlations
Mar 5, 2024
Merged

Fixing hanging tasks for correlations#874
goyamegh merged 1 commit intoopensearch-project:mainfrom
goyamegh:main-correlations

Conversation

@goyamegh
Copy link
Copy Markdown
Collaborator

@goyamegh goyamegh commented Mar 1, 2024

Description

While dealing with correlations, we noticed unexpectedly high resource consumption in CPU and JVM, resulting in node drops. Upon closer examination, we found a subset of subscribe/findings tasks consistently in a hanging state throughout the data node's lifecycle until it either dropped or required manual restart. This information can be corroborated through the _cat/tasks API.

The root cause of these hanging tasks was the way action listeners handle exceptions today, causing the parent listener to not close in some cases, if the exceptions are not caught correctly.

In this specific case, the exception causing the contention of resources was:
java.lang.ArrayIndexOutOfBoundsException: Index 0 out of bounds for length 0 at org.opensearch.securityanalytics.correlation.VectorEmbeddingsEngine$1$1.onResponse(VectorEmbeddingsEngine.java:142) at org.opensearch.securityanalytics.correlation.VectorEmbeddingsEngine$1$1.onResponse(VectorEmbeddingsEngine.java:115) at org.opensearch.action.support.TransportAction$1.onResponse(TransportAction.java:113) at org.opensearch.action.support.TransportAction$1.onResponse(TransportAction.java:107) at org.opensearch.action.search.TransportMultiSearchAction$1.finish(TransportMultiSearchAction.java:209) at org.opensearch.action.search.TransportMultiSearchAction$1.handleResponse(TransportMultiSearchAction.java:195) at org.opensearch.action.search.TransportMultiSearchAction$1.onResponse(TransportMultiSearchAction.java:183) at org.opensearch.action.search.TransportMultiSearchAction$1.onResponse(TransportMultiSearchAction.java:180) at org.opensearch.action.support.TransportAction$1.onResponse(TransportAction.java:113) at org.opensearch.action.support.TransportAction$1.onResponse(TransportAction.java:107) at org.opensearch.performanceanalyzer.action.PerformanceAnalyzerActionListener.onResponse(PerformanceAnalyzerActionListener.java:55) at org.opensearch.action.support.TimeoutTaskCancellationUtility$TimeoutRunnableListener.onResponse(TimeoutTaskCancellationUtility.java:132)

This will PR will be followed up by two more solutions:

  1. Fix the working of child action listeners across Security Analytics to handle the closing of parent listener tasks correctly
  2. Analyze and see how the methods 'getHits().lengthand getTotalHits().value` are different and if we need to modify the way the search requests have been created in order to restore this usage in future.

Issues Resolved

This PR aims to fix the current exception being thrown by using getHits().length instead of getTotalHits().value as a mitigation to avoid any tasks to hang.

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Megha Goyal <goyamegh@amazon.com>
@goyamegh goyamegh force-pushed the main-correlations branch from 0ee0288 to 98fbc25 Compare March 1, 2024 08:46
@codecov
Copy link
Copy Markdown

codecov bot commented Mar 1, 2024

Codecov Report

Attention: Patch coverage is 0% with 646 lines in your changes are missing coverage. Please review.

Project coverage is 25.07%. Comparing base (8ef0a3f) to head (f653429).
Report is 1 commits behind head on main.

❗ Current head f653429 differs from pull request most recent head 98fbc25. Consider uploading reports for the commit 98fbc25 to get more accurate results

Files Patch % Lines
...yanalytics/correlation/VectorEmbeddingsEngine.java 0.00% 280 Missing ⚠️
...arch/securityanalytics/correlation/JoinEngine.java 0.00% 191 Missing ⚠️
...ics/transport/TransportCorrelateFindingAction.java 0.00% 175 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main     #874      +/-   ##
============================================
+ Coverage     24.82%   25.07%   +0.24%     
+ Complexity     1030     1029       -1     
============================================
  Files           277      277              
  Lines         12717    12572     -145     
  Branches       1401     1372      -29     
============================================
- Hits           3157     3152       -5     
+ Misses         9296     9156     -140     
  Partials        264      264              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

engechas
engechas previously approved these changes Mar 1, 2024
Comment on lines -128 to +134
long totalHits = response.getResponse().getHits().getTotalHits().value;
long totalHits = response.getResponse().getHits().getHits().length;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the rationale to swap from the total hits value to the hits array length?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are seeing inconsistent results from search response where the getTotalHits().value is non zero, but the actual response array has no objects. The idea is to use the same array for both, validation and then fetching data.

client.search(searchRequest, new ActionListener<>() {
@Override
public void onResponse(SearchResponse response) {
if (response.getHits().getHits().length == 0) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is not getting any hits really an exception? i thought if totalHits>0 && response.getHits().getHits().length == 0 only then its an exceptional case

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The search request expects the result to have one doc (size is set to 1), we can anyway get only two responses from the hits array length (0 or 1). Total hits isn't a required check for this invocation.

}

public void onFailures(Exception t) {
log.error("Exception occurred while processing correlations", t);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

redundant log line. the correct message should be logged at caller.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line will be the only place we log errors before throwing the exception out to Alerting, I made sure of this while testing for most happy cases, will follow up and remove them in future PRs.

@goyamegh goyamegh merged commit db025ce into opensearch-project:main Mar 5, 2024
opensearch-trigger-bot bot pushed a commit that referenced this pull request Mar 7, 2024
Signed-off-by: Megha Goyal <goyamegh@amazon.com>
(cherry picked from commit db025ce)
opensearch-trigger-bot bot pushed a commit that referenced this pull request Mar 7, 2024
Signed-off-by: Megha Goyal <goyamegh@amazon.com>
(cherry picked from commit db025ce)
eirsep pushed a commit to eirsep/security-analytics that referenced this pull request Mar 7, 2024
Signed-off-by: Megha Goyal <goyamegh@amazon.com>
eirsep pushed a commit to eirsep/security-analytics that referenced this pull request Mar 8, 2024
Signed-off-by: Megha Goyal <goyamegh@amazon.com>
@opensearch-trigger-bot
Copy link
Copy Markdown
Contributor

The backport to 2.9 failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-2.9 2.9
# Navigate to the new working tree
cd .worktrees/backport-2.9
# Create a new branch
git switch --create backport/backport-874-to-2.9
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 db025ce69c0201798b9e862a9156a658d2f2d241
# Push it to GitHub
git push --set-upstream origin backport/backport-874-to-2.9
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-2.9

Then, create a pull request where the base branch is 2.9 and the compare/head branch is backport/backport-874-to-2.9.

@opensearch-trigger-bot
Copy link
Copy Markdown
Contributor

The backport to 2.7 failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-2.7 2.7
# Navigate to the new working tree
cd .worktrees/backport-2.7
# Create a new branch
git switch --create backport/backport-874-to-2.7
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 db025ce69c0201798b9e862a9156a658d2f2d241
# Push it to GitHub
git push --set-upstream origin backport/backport-874-to-2.7
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-2.7

Then, create a pull request where the base branch is 2.7 and the compare/head branch is backport/backport-874-to-2.7.

goyamegh added a commit that referenced this pull request Mar 12, 2024
Signed-off-by: Megha Goyal <goyamegh@amazon.com>
(cherry picked from commit db025ce)

Co-authored-by: Megha Goyal <56077967+goyamegh@users.noreply.github.com>
opensearch-trigger-bot bot added a commit that referenced this pull request Mar 14, 2024
Signed-off-by: Megha Goyal <goyamegh@amazon.com>
(cherry picked from commit db025ce)

Co-authored-by: Megha Goyal <56077967+goyamegh@users.noreply.github.com>
(cherry picked from commit 73de979)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants