Added leader/follower check attempt failure metrics #17254
patelsmit32123 wants to merge 3 commits into opensearch-project:main
Signed-off-by: Smit Patel <patelsmit32123@gmail.com>
❌ Gradle check result for b80b110: FAILURE. Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
❕ Gradle check result for 7463a9d: UNSTABLE. Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@             Coverage Diff              @@
##               main   #17254      +/-   ##
============================================
- Coverage     72.43%   72.32%   -0.12%
- Complexity    65725    65742      +17
============================================
  Files          5318     5318
  Lines        305675   305681       +6
  Branches      44350    44350
============================================
- Hits         221408   221073     -335
- Misses        66055    66479     +424
+ Partials      18212    18129      -83
@shwetathareja please review
andrross left a comment:
Should we add some tests here? Maybe in LeaderCheckerTests?
Bukhtawar left a comment:
Should we reset it to zero once we have a successful check?
We want to have a total view of all the failures, not just consecutive ones.
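The distinction in this thread, a running total versus a count of consecutive failures, can be sketched as follows. This is a minimal illustration only: `CheckFailureTracker` and its methods are hypothetical names and are not part of OpenSearch or of this PR.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical tracker (not OpenSearch code): contrasts a cumulative failure
// total with a consecutive-failure streak that resets on a successful check.
class CheckFailureTracker {
    private final AtomicLong totalFailures = new AtomicLong();       // never reset
    private final AtomicLong consecutiveFailures = new AtomicLong(); // reset on success

    void onCheckFailed() {
        totalFailures.incrementAndGet();
        consecutiveFailures.incrementAndGet();
    }

    void onCheckSucceeded() {
        // Only the streak is cleared; the total keeps the full history.
        consecutiveFailures.set(0);
    }

    long totalFailures() {
        return totalFailures.get();
    }

    long consecutiveFailures() {
        return consecutiveFailures.get();
    }
}

class CheckFailureTrackerDemo {
    public static void main(String[] args) {
        CheckFailureTracker tracker = new CheckFailureTracker();
        tracker.onCheckFailed();
        tracker.onCheckFailed();
        tracker.onCheckSucceeded();
        tracker.onCheckFailed();
        // prints "total=3 consecutive=1"
        System.out.println("total=" + tracker.totalFailures()
            + " consecutive=" + tracker.consecutiveFailures());
    }
}
```

A counter that is never reset keeps intermittent failures visible even when every streak is short, which is the "total view" asked for in the reply.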
❌ Gradle check result for 04507e7: FAILURE. Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Signed-off-by: Smit Patel <patelsmit32123@gmail.com>
Force-pushed from 04507e7 to 984cde6
❕ Gradle check result for 984cde6: UNSTABLE. Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.
shwetathareja left a comment:
Thanks @patelsmit32123 for the changes!
    FollowerChecker followerChecker = followerCheckers.get(discoveryNode);
    if (followerChecker != null) {
        logger.info(() -> new ParameterizedMessage("{} disconnected", followerChecker));
        clusterManagerMetrics.incrementCounter(clusterManagerMetrics.followerCheckAttemptFailureCounter, 1.0);
This is not a follower check attempt; this code path takes the action of failing the node.
    void handleDisconnectedNode(DiscoveryNode discoveryNode) {
        if (discoveryNode.equals(leader)) {
            logger.debug("leader [{}] disconnected", leader);
            clusterManagerMetrics.incrementCounter(clusterManagerMetrics.leaderCheckAttemptFailureCounter, 1.0);
Not needed here. This is not a leader check attempt.
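The two review comments above separate a failed check attempt from the disconnect handler that acts on a node. A rough sketch of that separation, under stated assumptions: `SketchChecker`, `onCheckResult`, `onDisconnected`, and the retry threshold are hypothetical stand-ins and do not mirror the real FollowersChecker or LeaderChecker code.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for a leader/follower checker. Names and structure
// are illustrative assumptions, not the OpenSearch implementation.
class SketchChecker {
    final AtomicLong checkAttemptFailures = new AtomicLong();
    private final int retryCount;
    private int failedAttemptsSinceSuccess = 0;
    private boolean nodeFailed = false;

    SketchChecker(int retryCount) {
        this.retryCount = retryCount;
    }

    // Called once per check response: this is the place a per-attempt
    // failure metric would be incremented.
    void onCheckResult(boolean success) {
        if (success) {
            failedAttemptsSinceSuccess = 0;
            return;
        }
        checkAttemptFailures.incrementAndGet();
        failedAttemptsSinceSuccess++;
        if (failedAttemptsSinceSuccess >= retryCount) {
            failNode();
        }
    }

    // Called when the transport layer reports a disconnect. This acts on the
    // node directly and is not a check attempt, so the metric is untouched.
    void onDisconnected() {
        failNode();
    }

    private void failNode() {
        nodeFailed = true;
    }

    boolean isNodeFailed() {
        return nodeFailed;
    }
}
```

With a retry threshold of 3, two failed attempts followed by a disconnect would leave the attempt counter at 2 while still marking the node as failed.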
This PR is stalled because it has been open for 30 days with no activity.
Description
This PR adds metrics that count each individual leader/follower check attempt failure. They can help show how frequently or intermittently the checks are failing.
Related Issues
Resolves #17253
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.