Added leader/follower check attempt failure metrics #17254

Open
patelsmit32123 wants to merge 3 commits into opensearch-project:main from patelsmit32123:leader-follower-check-metrics

@patelsmit32123

Description

This PR adds a metric for each individual leader and follower check attempt failure. These metrics help show how frequently, and how intermittently, the checks fail.

Related Issues

Resolves #17253

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Smit Patel <patelsmit32123@gmail.com>
@github-actions
Contributor

github-actions Bot commented Feb 5, 2025

❌ Gradle check result for b80b110: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

github-actions Bot commented Feb 5, 2025

❕ Gradle check result for 7463a9d: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@codecov

codecov Bot commented Feb 5, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.32%. Comparing base (865704b) to head (984cde6).
⚠️ Report is 1376 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #17254      +/-   ##
============================================
- Coverage     72.43%   72.32%   -0.12%     
- Complexity    65725    65742      +17     
============================================
  Files          5318     5318              
  Lines        305675   305681       +6     
  Branches      44350    44350              
============================================
- Hits         221408   221073     -335     
- Misses        66055    66479     +424     
+ Partials      18212    18129      -83     

@patelsmit32123
Author

@shwetathareja please review

Member

@andrross andrross left a comment

Should we add some tests here? Maybe in LeaderCheckerTests?

Contributor

@Bukhtawar Bukhtawar left a comment

Should we reset it to zero once we have a successful check?

@patelsmit32123
Author

patelsmit32123 commented Feb 6, 2025

> Should we reset it to zero once we have a successful check?

We want a total view of all failures, not just consecutive ones.
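
To illustrate the difference between the two views discussed above, here is a minimal sketch. It is not the code in this PR and does not use the OpenSearch metrics API; `CheckFailureTracker` and all of its methods are hypothetical names introduced only for illustration. It keeps one counter that only ever grows (the total view) alongside one that resets on success (the consecutive view).

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical illustration only; not the OpenSearch ClusterManagerMetrics API. */
class CheckFailureTracker {
    // Cumulative count of failed check attempts; never reset.
    private final AtomicLong totalFailures = new AtomicLong();
    // Failures since the last successful check; reset to zero on success.
    private final AtomicLong consecutiveFailures = new AtomicLong();

    void onCheckFailure() {
        totalFailures.incrementAndGet();
        consecutiveFailures.incrementAndGet();
    }

    void onCheckSuccess() {
        // Only the consecutive view is cleared; the total is preserved.
        consecutiveFailures.set(0);
    }

    long totalFailures() {
        return totalFailures.get();
    }

    long consecutiveFailures() {
        return consecutiveFailures.get();
    }

    /** Replays a sequence of check outcomes (true = success) into a fresh tracker. */
    static CheckFailureTracker replay(boolean... outcomes) {
        CheckFailureTracker tracker = new CheckFailureTracker();
        for (boolean ok : outcomes) {
            if (ok) {
                tracker.onCheckSuccess();
            } else {
                tracker.onCheckFailure();
            }
        }
        return tracker;
    }
}
```

For the sequence fail, fail, success, fail, the total view reports 3 failures while the consecutive view reports 1, which is why a counter that resets on success would hide intermittent failures.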

@github-actions
Contributor

github-actions Bot commented Feb 7, 2025

❌ Gradle check result for 04507e7: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Smit Patel <patelsmit32123@gmail.com>
@patelsmit32123 force-pushed the leader-follower-check-metrics branch from 04507e7 to 984cde6 on February 7, 2025 10:26
@github-actions
Contributor

github-actions Bot commented Feb 7, 2025

❕ Gradle check result for 984cde6: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

Member

@shwetathareja shwetathareja left a comment

Thanks @patelsmit32123 for the changes!

FollowerChecker followerChecker = followerCheckers.get(discoveryNode);
if (followerChecker != null) {
    logger.info(() -> new ParameterizedMessage("{} disconnected", followerChecker));
    clusterManagerMetrics.incrementCounter(clusterManagerMetrics.followerCheckAttemptFailureCounter, 1.0);
Member

This is not a follower check attempt; this code path takes the action of failing the node.

void handleDisconnectedNode(DiscoveryNode discoveryNode) {
    if (discoveryNode.equals(leader)) {
        logger.debug("leader [{}] disconnected", leader);
        clusterManagerMetrics.incrementCounter(clusterManagerMetrics.leaderCheckAttemptFailureCounter, 1.0);
Member

Not needed here. This is not a leader check attempt.
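
The distinction raised in the two review comments above can be sketched as follows. This is a simplified model under stated assumptions, not the actual `FollowerChecker` or `LeaderChecker` code: `FollowerCheckSketch`, its fields, and the `retryCount` threshold are hypothetical names used only for illustration. The attempt-failure counter is incremented only where a single check request fails, while the disconnect handler fails the node without touching that counter.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical, simplified model; names and threshold logic are assumptions. */
class FollowerCheckSketch {
    // Incremented once for every individual check attempt that fails.
    final AtomicLong attemptFailures = new AtomicLong();
    // Incremented once for every decision to fail the node.
    final AtomicLong nodeFailures = new AtomicLong();

    private final int retryCount; // assumed number of consecutive failures tolerated
    private int failuresSinceLastSuccess = 0;

    FollowerCheckSketch(int retryCount) {
        this.retryCount = retryCount;
    }

    /** Called when one check request completes, successfully or not. */
    void onCheckResult(boolean success) {
        if (success) {
            failuresSinceLastSuccess = 0;
            return;
        }
        // This is a check attempt failure, so the attempt metric belongs here.
        attemptFailures.incrementAndGet();
        failuresSinceLastSuccess++;
        if (failuresSinceLastSuccess >= retryCount) {
            failNode();
        }
    }

    /** Called on a disconnect event; no check attempt was made on this path. */
    void onDisconnect() {
        // Fails the node directly, without counting an attempt failure.
        failNode();
    }

    private void failNode() {
        nodeFailures.incrementAndGet();
        failuresSinceLastSuccess = 0;
    }
}
```

With `retryCount` set to 3, the sequence fail, success, fail, fail, fail followed by a disconnect yields 4 attempt failures and 2 node failures in this model: the disconnect adds a node failure but no attempt failure.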

@opensearch-trigger-bot
Contributor

This PR is stalled because it has been open for 30 days with no activity.

@opensearch-trigger-bot added and removed the stalled label on Mar 19, 2025
@opensearch-trigger-bot
Contributor

This PR is stalled because it has been open for 30 days with no activity.

@opensearch-trigger-bot added the stalled label on Apr 20, 2025
Labels

Cluster Manager, enhancement, stalled

Development

Successfully merging this pull request may close these issues.

[Feature Request] Add each individual leader/follower check failure metrics

4 participants