Skip to content

[Test] Fix ConcurrentSnapshotsIT testDeletesAreBatched#144331

Merged
elasticsearchmachine merged 6 commits intoelastic:mainfrom
albertzaharovits:fix-test-144034
Mar 17, 2026
Merged

[Test] Fix ConcurrentSnapshotsIT testDeletesAreBatched#144331
elasticsearchmachine merged 6 commits intoelastic:mainfrom
albertzaharovits:fix-test-144034

Conversation

@albertzaharovits
Copy link
Copy Markdown
Contributor

I think the test fails because some cluster state tasks that delete snapshots might actually be scheduled after the snapshot that holds up all deletions executes (i.e. after it is unblocked). This creates multiple (2) snapshot deletion batches.

This fix is ensuring that snapshot deletions are all batched up before unblocking the snapshot that's holding them up.

Fixes #144034

@albertzaharovits albertzaharovits self-assigned this Mar 16, 2026
@albertzaharovits albertzaharovits added >test Issues or PRs that are addressing/adding tests :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v9.4.0 labels Mar 16, 2026
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Meta label for distributed team. label Mar 16, 2026
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

Copy link
Copy Markdown
Member

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++ this does look to be racy - suggested a simpler/faster fix

Comment on lines +237 to +238
assertBusy(() -> {
SnapshotDeletionsInProgress snapshotDeletionsInProgress = SnapshotDeletionsInProgress.get(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would be sufficient and would avoid the busy-wait:

ClusterServiceUtils.addTemporaryStateListener(cs -> SnapshotDeletionsInProgress.get(cs).getEntries().getFirst().snapshots().size() == numSnapshots);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I mean, safeAwait on the listener that addTemporaryStateListener returns)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slightly subtle: you have to add the listener first, and handle the empty case too:

diff --git a/server/src/internalClusterTest/java/org/elasticsearch/snapshots/ConcurrentSnapshotsIT.java b/server/src/internalClusterTest/java/org/elasticsearch/snapshots/ConcurrentSnapshotsIT.java
index d28b29e6d5e0..e1d9c0966189 100644
--- a/server/src/internalClusterTest/java/org/elasticsearch/snapshots/ConcurrentSnapshotsIT.java
+++ b/server/src/internalClusterTest/java/org/elasticsearch/snapshots/ConcurrentSnapshotsIT.java
@@ -213,6 +213,12 @@ public class ConcurrentSnapshotsIT extends AbstractSnapshotIntegTestCase {
             dataNode
         );

+        final var deletesEnqueuedListener = ClusterServiceUtils.addTemporaryStateListener(cs -> {
+            final var entries = SnapshotDeletionsInProgress.get(cs).getEntries();
+            assertThat(entries, anyOf(hasSize(0), hasSize(1)));
+            return entries.isEmpty() == false && entries.getFirst().snapshots().size() == numSnapshots;
+        });
+
         final Collection<ListenableFuture<AcknowledgedResponse>> deleteFutures = new ArrayList<>();
         while (snapshotNames.isEmpty() == false) {
             final Collection<String> toDelete = randomSubsetOf(snapshotNames);
@@ -224,6 +230,7 @@ public class ConcurrentSnapshotsIT extends AbstractSnapshotIntegTestCase {
             clusterAdmin().prepareDeleteSnapshot(TEST_REQUEST_TIMEOUT, repoName, toDelete.toArray(Strings.EMPTY_ARRAY)).execute(future);
             deleteFutures.add(future);
         }
+        safeAwait(deletesEnqueuedListener);

         assertThat(createSlowFuture.isDone(), is(false));

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, yeah good point! TIL org.elasticsearch.test.ClusterServiceUtils#addTemporaryStateListener
Pushed 6875de8

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's right, pushed: 7edb9e6

Copy link
Copy Markdown
Member

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@albertzaharovits albertzaharovits added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Mar 16, 2026
@elasticsearchmachine elasticsearchmachine merged commit 3f6b94e into elastic:main Mar 17, 2026
36 checks passed
@albertzaharovits albertzaharovits deleted the fix-test-144034 branch March 17, 2026 10:44
michalborek pushed a commit to michalborek/elasticsearch that referenced this pull request Mar 23, 2026
I think the test fails because some cluster state tasks that delete
snapshots might actually be scheduled after the snapshot that holds up
all deletions executes (i.e. after it is unblocked). This creates
multiple (2) snapshot deletion batches.

This fix is ensuring that snapshot deletions are all batched up before
unblocking the snapshot that's holding them up.

Fixes elastic#144034
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs Team:Distributed Meta label for distributed team. >test Issues or PRs that are addressing/adding tests v9.4.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[CI] ConcurrentSnapshotsIT testDeletesAreBatched failing

3 participants