Fail fast LeaderCheck on CoordinationStateRejectedException #17400

Open
anuragrai16 wants to merge 10 commits into opensearch-project:main from anuragrai16:fail-fast-leader-check

@anuragrai16
Contributor

anuragrai16 commented Feb 20, 2025

Description

This PR makes the leader check fail fast when a CoordinationStateRejectedException is received, instead of waiting for further retries. Please see the related issue for more details. The fail-fast behavior sits behind a dynamic setting, "cluster.fault_detection.leader_check.fail_fast_on_state_rejection", which is false by default.
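
A minimal sketch of the idea, not the code in this PR: the classes below are hypothetical stand-ins (a local `CoordinationStateRejectedException` and a simplified `FailureTracker`), and the retry budget of 3 is an assumed value chosen for illustration. With the flag off, a rejection counts as one more consecutive failure; with the flag on, a rejection marks the leader as failed immediately.

```java
import java.util.concurrent.atomic.AtomicInteger;

class LeaderCheckFailFastSketch {

    /** Hypothetical local stand-in for OpenSearch's CoordinationStateRejectedException. */
    static class CoordinationStateRejectedException extends RuntimeException {
        CoordinationStateRejectedException(String message) {
            super(message);
        }
    }

    /** Hypothetical, simplified model of how a follower reacts to failed leader checks. */
    static class FailureTracker {
        private final int retryCount; // consecutive failures tolerated before giving up (assumed value)
        private volatile boolean failFastOnStateRejection; // models the dynamic setting, false by default
        private final AtomicInteger consecutiveFailures = new AtomicInteger();

        FailureTracker(int retryCount, boolean failFastOnStateRejection) {
            this.retryCount = retryCount;
            this.failFastOnStateRejection = failFastOnStateRejection;
        }

        /** Models a dynamic settings update arriving at runtime. */
        void setFailFastOnStateRejection(boolean enabled) {
            this.failFastOnStateRejection = enabled;
        }

        /** Returns true if the leader should be treated as failed after this check failure. */
        boolean onCheckFailure(Exception e) {
            if (failFastOnStateRejection && e instanceof CoordinationStateRejectedException) {
                return true; // fail fast: do not wait for the retry budget to run out
            }
            return consecutiveFailures.incrementAndGet() >= retryCount;
        }

        /** A successful check resets the consecutive-failure counter. */
        void onCheckSuccess() {
            consecutiveFailures.set(0);
        }
    }

    public static void main(String[] args) {
        FailureTracker tracker = new FailureTracker(3, false);
        Exception rejection = new CoordinationStateRejectedException("rejecting leader check");

        System.out.println(tracker.onCheckFailure(rejection)); // false: flag off, failure 1 of 3

        tracker.onCheckSuccess();
        tracker.setFailFastOnStateRejection(true);
        System.out.println(tracker.onCheckFailure(rejection)); // true: flag on, fails immediately
    }
}
```

Keeping the flag off by default preserves the existing retry-based behavior unless an operator opts in.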

Related Issues

Resolves #17155

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions
Contributor

❌ Gradle check result for c60fc77: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

anuragrai16 force-pushed the fail-fast-leader-check branch from 72452c7 to 9292aa3 on February 20, 2025 15:47
@github-actions
Contributor

❌ Gradle check result for 9292aa3: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@opensearch-trigger-bot
Contributor

This PR is stalled because it has been open for 30 days with no activity.

opensearch-trigger-bot added the stalled label (Issues that have stalled) on Mar 30, 2025
anuragrai16 requested a review from a team as a code owner on July 16, 2025 15:30
Signed-off-by: Anurag Rai <anurag.rai@uber.com>
Signed-off-by: Anurag Rai <anurag.rai@uber.com>
Signed-off-by: Anurag Rai <anurag.rai@uber.com>
Signed-off-by: Anurag Rai <anurag.rai@uber.com>
anuragrai16 force-pushed the fail-fast-leader-check branch from 67e8bef to a385bf3 on July 16, 2025 15:43
@github-actions
Contributor

❌ Gradle check result for a385bf3: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Anurag Rai <anurag.rai@uber.com>
Contributor

yupeng9 left a comment

Overall looks good. Can you add a changelog entry for this?

anuragrai16 and others added 2 commits July 16, 2025 21:36
Signed-off-by: Anurag Rai <anurag.rai@uber.com>
Signed-off-by: Anurag Rai <91844619+anuragrai16@users.noreply.github.com>
@github-actions
Contributor

❌ Gradle check result for c5b0861: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Anurag Rai <anurag.rai@uber.com>
@github-actions
Contributor

❌ Gradle check result for b9571f6: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Anurag Rai <anurag.rai@uber.com>
@github-actions
Contributor

❌ Gradle check result for beec41a: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@anuragrai16
Contributor Author

Flaky tests:

org.opensearch.upgrades.FullClusterRestartIT.testRecovery

org.opensearch.remotestore.RestoreShallowSnapshotV2IT.testHashedPrefixTranslogMetadataCombination {p0={"opensearch.experimental.feature.writable_warm_index.enabled":"true"}}

org.opensearch.remotestore.RestoreShallowSnapshotV2IT.testContinuousIndexing {p0={"opensearch.experimental.feature.writable_warm_index.enabled":"true"}}

Signed-off-by: Anurag Rai <anurag.rai@uber.com>
@github-actions
Contributor

❕ Gradle check result for a26e390: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@codecov

codecov bot commented Jul 17, 2025

Codecov Report

Attention: Patch coverage is 77.77778% with 2 lines in your changes missing coverage. Please review.

Project coverage is 72.76%. Comparing base (fc6b08e) to head (a26e390).
Report is 5 commits behind head on main.

Files with missing lines                                 Patch %   Lines
...opensearch/cluster/coordination/LeaderChecker.java    77.77%    2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #17400      +/-   ##
============================================
- Coverage     72.86%   72.76%   -0.10%     
+ Complexity    68571    68459     -112     
============================================
  Files          5566     5566              
  Lines        314513   314678     +165     
  Branches      45636    45653      +17     
============================================
- Hits         229167   228984     -183     
- Misses        66789    67079     +290     
- Partials      18557    18615      +58     

@yupeng9
Contributor

yupeng9 commented Jul 17, 2025

@Bukhtawar thoughts on this change behind the flag?

Labels

bug (Something isn't working), Cluster Manager, stalled (Issues that have stalled)

Development

Successfully merging this pull request may close these issues.

[BUG] Node disconnection for long duration due to Encrypting network mesh during Mesh deployment

3 participants