Skip to content

Fix a flaky test for #20107#20237

Closed
liuguoqingfz wants to merge 1 commit intoopensearch-project:mainfrom
liuguoqingfz:flakytest-20107
Closed

Fix a flaky test for #20107#20237
liuguoqingfz wants to merge 1 commit intoopensearch-project:mainfrom
liuguoqingfz:flakytest-20107

Conversation

@liuguoqingfz
Copy link
Copy Markdown
Contributor

@liuguoqingfz liuguoqingfz commented Dec 15, 2025

Description

Fix a flaky test for a race: seeing the task "cluster:admin/snapshot/delete" in the task list does not guarantee the delete has progressed far enough to mark the repository “in use” in cluster state. If you call updateRepository(...) before SnapshotDeletionsInProgress is populated, the update can legitimately succeed, and then nothing is thrown.

Related Issues

Resolves #20107

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Summary by CodeRabbit

  • Tests
    • Enhanced validation of snapshot deletion operations to ensure proper cluster state ordering.
    • Improved exception handling and error messaging for repository modification attempts during concurrent operations.

✏️ Tip: You can customize this high-level summary in your review settings.

@liuguoqingfz liuguoqingfz requested a review from a team as a code owner December 15, 2025 14:40
@github-actions github-actions bot added >test-failure Test failure from CI, local build, etc. autocut flaky-test Random test failure that succeeds on second run labels Dec 15, 2025
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Dec 15, 2025

Walkthrough

A single test file is modified to stabilize flaky behavior in testSettingsUpdateFailWhenDeleteSnapshotInProgress. The changes add stronger ordering validation by checking the cluster state for repository deletion progress and enhance exception assertion specificity.

Changes

Cohort / File(s) Summary
Test stabilization
server/src/internalClusterTest/java/org/opensearch/snapshots/ConcurrentSnapshotsIT.java
Added ordering check to wait for cluster state verification of repository in-use status via SnapshotDeletionsInProgress inspection. Enhanced exception handling to unwrap root cause, assert IllegalStateException type, and validate exact error message.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Single test file with localized changes
  • Assertions and timing checks are straightforward to verify
  • Exception handling improvements follow standard patterns but require confirmation that timing assertions address test flakiness

Suggested labels

bug

Suggested reviewers

  • dbwiddis
  • sachinpkale
  • cwperks
  • msfroh
  • andrross
  • Rishikesh1159
  • sohami

Poem

🐰 A test that flickered in the night,
Now ordered carefully, burning bright,
With waits and checks to catch the flow,
The deletions dance—now we'll know!
No more flakes upon the scene,
Assertions sharp and ordering keen! ✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Fix a flaky test for #20107' clearly and specifically describes the main change—addressing a flaky test by referencing the related issue.
Description check ✅ Passed The description explains the root cause of the flaky test (race condition in cluster state synchronization) and references the related issue #20107.
Linked Issues check ✅ Passed The code changes address the flaky test reported in #20107 by adding stronger ordering checks for cluster state and improved exception handling during repository updates.
Out of Scope Changes check ✅ Passed All changes are narrowly focused on fixing the flaky test in ConcurrentSnapshotsIT without introducing unrelated modifications.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

…e shows an entry in SnapshotDeletionsInProgress for repoName

change the formatting

Signed-off-by: Joe Liu <guoqing4@illinois.edu>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
server/src/internalClusterTest/java/org/opensearch/snapshots/ConcurrentSnapshotsIT.java (2)

205-218: Good fix for the race condition, but consider using existing imports.

The stronger ordering check correctly addresses the flaky test by ensuring SnapshotDeletionsInProgress contains the repository before attempting the update. However, the fully qualified names are unnecessary since ClusterState, SnapshotDeletionsInProgress, and Matchers are already imported at the top of the file (lines 44-45, 93).

         // Stronger ordering: wait until cluster state shows the repository is in-use due to deletion
         assertBusy(() -> {
-            final org.opensearch.cluster.ClusterState state = client().admin().cluster().prepareState().get().getState();
-            final org.opensearch.cluster.SnapshotDeletionsInProgress deletions = state.custom(
-                org.opensearch.cluster.SnapshotDeletionsInProgress.TYPE
-            );
+            final ClusterState state = client().admin().cluster().prepareState().get().getState();
+            final SnapshotDeletionsInProgress deletions = state.custom(SnapshotDeletionsInProgress.TYPE);

-            assertThat("SnapshotDeletionsInProgress must be present once delete starts", deletions, org.hamcrest.Matchers.notNullValue());
+            assertThat("SnapshotDeletionsInProgress must be present once delete starts", deletions, notNullValue());
             assertThat(
-                deletions.getEntries().stream().map(org.opensearch.cluster.SnapshotDeletionsInProgress.Entry::repository).toList(),
+                deletions.getEntries().stream().map(SnapshotDeletionsInProgress.Entry::repository).toList(),
                 hasItem(equalTo(repoName))
             );
         });

You'll need to add the notNullValue static import:

import static org.hamcrest.Matchers.notNullValue;

224-228: Correct handling of wrapped exceptions.

The approach of catching a generic Exception and unwrapping the root cause is appropriate since the transport layer can wrap the real exception. The instanceOf matcher is already imported at line 93, so the FQN can be simplified.

         // Transport can wrap the real exception; assert on the unwrapped root cause
         final Exception ex = assertThrows(Exception.class, () -> updateRepository(repoName, "mock", newSettings));
         final Throwable cause = org.opensearch.ExceptionsHelper.unwrapCause(ex);
-        assertThat(cause, org.hamcrest.Matchers.instanceOf(IllegalStateException.class));
+        assertThat(cause, instanceOf(IllegalStateException.class));
         assertEquals("trying to modify or unregister repository that is currently used", cause.getMessage());
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ee59cb0 and fcdd76d.

📒 Files selected for processing (1)
  • server/src/internalClusterTest/java/org/opensearch/snapshots/ConcurrentSnapshotsIT.java (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
  • GitHub Check: gradle-check
  • GitHub Check: Analyze (java)
  • GitHub Check: assemble (21, ubuntu-24.04-arm)
  • GitHub Check: assemble (25, ubuntu-latest)
  • GitHub Check: assemble (25, ubuntu-24.04-arm)
  • GitHub Check: assemble (21, ubuntu-latest)
  • GitHub Check: assemble (21, windows-latest)
  • GitHub Check: assemble (25, windows-latest)
  • GitHub Check: detect-breaking-change

@github-actions
Copy link
Copy Markdown
Contributor

❌ Gradle check result for fcdd76d: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Copy link
Copy Markdown
Contributor

❌ Gradle check result for fcdd76d: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Copy link
Copy Markdown
Contributor

❌ Gradle check result for fcdd76d: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Copy link
Copy Markdown
Contributor

❌ Gradle check result for fcdd76d: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@liuguoqingfz
Copy link
Copy Markdown
Contributor Author

Can you merge the PR, the gradle check keep failing because some other tests failing

@github-actions
Copy link
Copy Markdown
Contributor

❌ Gradle check result for fcdd76d: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@opensearch-trigger-bot
Copy link
Copy Markdown
Contributor

This PR is stalled because it has been open for 30 days with no activity.

@opensearch-trigger-bot opensearch-trigger-bot bot added the stalled Issues that have stalled label Jan 20, 2026
@sandeshkr419 sandeshkr419 reopened this Jan 27, 2026
Copy link
Copy Markdown
Member

@sandeshkr419 sandeshkr419 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, a minor nitpick - @liuguoqingfz Can you please address that.

Post that once CI is green, I'll merge it.


// Stronger ordering: wait until cluster state shows the repository is in-use due to deletion
assertBusy(() -> {
final org.opensearch.cluster.ClusterState state = client().admin().cluster().prepareState().get().getState();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can you please import the classes ClusterState, Matchers, ExceptionsHelper instead of full qualified name. Although a test, but that should make it more readable.

@github-actions
Copy link
Copy Markdown
Contributor

✅ Gradle check result for fcdd76d: SUCCESS

@codecov
Copy link
Copy Markdown

codecov bot commented Jan 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.25%. Comparing base (1022486) to head (fcdd76d).
⚠️ Report is 137 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #20237      +/-   ##
============================================
+ Coverage     73.20%   73.25%   +0.04%     
- Complexity    71766    71790      +24     
============================================
  Files          5795     5795              
  Lines        328302   328303       +1     
  Branches      47283    47283              
============================================
+ Hits         240345   240488     +143     
+ Misses        68628    68561      -67     
+ Partials      19329    19254      -75     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

autocut flaky-test Random test failure that succeeds on second run >test-failure Test failure from CI, local build, etc.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[AUTOCUT] Gradle Check Flaky Test Report for ConcurrentSnapshotsIT

3 participants