Batching of snapshot-delete start updates #141998
Conversation
Today each request to delete a collection of snapshots is inserted into the `SnapshotDeletionsInProgress` in the `ClusterState` with a separate cluster state update. There can be a lot of these things if, say, ILM is re-enabled after being turned off for a while. This commit makes it so that requests received for each repository will be batched together into a single cluster state update instead. Closes ES-14001
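The per-repository batching described above can be sketched as a simple queue keyed by repository name. This is an illustrative simplification, not the actual Elasticsearch implementation: the names `PerRepositoryBatcher`, `DeleteRequest`, `enqueue` and `drainBatch` are hypothetical, and the real code coordinates with the master service's cluster state update machinery.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of per-repository batching: instead of one cluster state
// update per delete request, requests are queued per repository and drained
// into a single update when the processor runs.
class PerRepositoryBatcher {
    record DeleteRequest(String repository, List<String> snapshotNames) {}

    private final Map<String, List<DeleteRequest>> queues = new HashMap<>();

    // returns true if this is the first queued item, i.e. the caller should
    // schedule exactly one cluster state update for this repository
    synchronized boolean enqueue(DeleteRequest request) {
        List<DeleteRequest> queue = queues.computeIfAbsent(request.repository(), r -> new ArrayList<>());
        queue.add(request);
        return queue.size() == 1;
    }

    // drains everything accumulated for the repository into one batch
    synchronized List<DeleteRequest> drainBatch(String repository) {
        List<DeleteRequest> queue = queues.remove(repository);
        return queue == null ? List.of() : queue;
    }
}
```

With this shape, a burst of ILM-driven delete requests against one repository collapses into a single drained batch rather than many separate state updates.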
Hi @DaveCTurner, I've created a changelog YAML for you.
If the execution of a cluster state update task throws an exception then all the tasks in the batch must be failed with that exception, even if some of them have been marked as successfully completed.
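The failure semantics just described can be sketched in isolation. This is a hedged, self-contained illustration, not the Elasticsearch task-queue API: `BatchRunner`, `Task`, and `runBatch` are made-up names standing in for the real cluster state task executor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Sketch of the invariant: if executing the batch throws, every task in the
// batch must be failed with that exception, even tasks that already appeared
// to complete successfully, because the cluster state update as a whole failed.
class BatchRunner {
    interface Task {
        void onSuccess();
        void onFailure(Exception e);
    }

    static void runBatch(List<Task> batch, Consumer<Task> executor) {
        List<Task> succeeded = new ArrayList<>();
        try {
            for (Task task : batch) {
                executor.accept(task);
                succeeded.add(task); // provisionally marked successful
            }
        } catch (Exception e) {
            // the whole update failed: fail ALL tasks, including the ones
            // that were provisionally marked successful above
            for (Task task : batch) {
                task.onFailure(e);
            }
            return;
        }
        succeeded.forEach(Task::onSuccess);
    }
}
```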
This change uncovered a longstanding bug in …
I can't find a much easier way to express this for review purposes, unfortunately - it's doing quite a complicated job. Extracting the process out to a separate class is noisy work, but doing that separately from the change that adds batching wouldn't really make the batching change significantly simpler. If it's any consolation, 60% of the lines added in this PR are the new unit test suite, which I think covers everything and also puts us in a better place for adding future tests of this specific component.
Pinging @elastic/es-distributed (Team:Distributed)
ywangd left a comment
This involves quite heavy machinery to get the batching to work. It does look correct to me. But as you commented inline, I wonder how it compares to reusing MasterService's batch queue and creating tasks with a SubscribableListener<RepositoryData>. We can resubmit the tasks when their repositoryData is not ready or is stale. BlobRepository#getRepositoryData already has deduplication which can be leveraged. We would still need much of the PR for processing multiple items. But maybe it's overall a bit simpler and less cognitive load?
server/src/main/java/org/elasticsearch/snapshots/SnapshotDeletionStartBatcher.java (four resolved review threads, outdated)
```java
for (var entry : abortedAndNeedingFinalization) {
    snapshotEnder.endSnapshot(entry, metadata, repositoryData);
}

if (entryToStart != null) {
    deletionStarter.startDeletion(projectId, repositoryName, entryToStart, repositoryData, maxDataNodeCompatibleIndexVersion);
}
```
This change might be OK. Though in the existing code, starting the deletion and finalizing snapshots are mutually exclusive, and deletion is preferred. It relies on the state transition after each operation to go through all operations. It might be OK to add them explicitly since tryEnterRepoLoop should keep them linearized, though I am still not certain about this code for clones.
Maybe we should keep the existing logic, or at least run startDeletion before endSnapshot.
I think these two things have to be mutually exclusive - if there is a running snapshot that we've just put into a finalizing state then any new deletion entry will be in state WAITING so entryToStart will be null. I added an assertion to this effect.
Ok, thanks, the relationship is quite subtle here. Writing it down here to remind myself: an entry can only be added to abortedAndNeedingFinalization if it is completed. A completed entry has isWritingToRepository() == true, which must make a new deletion entry (if there is one) WAITING.
👍 are you ok with the assertion I added or do you want some additional comments to record this in the code too?
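The invariant worked out in this thread can be stated as a tiny function, purely for illustration. The names below (`DeletionState`, `stateForNewDeletion`) are hypothetical; the real logic lives in the snapshot deletion state machine.

```java
// Sketch of the invariant discussed above: if any snapshot entry is still
// writing to the repository (i.e. finalizing), a newly added deletion entry
// must start out WAITING rather than STARTED, so entryToStart stays null.
enum DeletionState { WAITING, STARTED }

class DeletionInvariant {
    static DeletionState stateForNewDeletion(boolean anyEntryWritingToRepository) {
        // a STARTED deletion would race with in-flight repository writes,
        // so it must wait until finalization completes
        return anyEntryWritingToRepository ? DeletionState.WAITING : DeletionState.STARTED;
    }
}
```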
```java
    }
}, "delete snapshot [" + repository + "]" + Arrays.toString(snapshotNames), listener::onFailure);

private void notifyAbortedByDeletion(Snapshot snapshot) {
    if (endingSnapshots.add(snapshot)) {
```
This now runs after publication instead of as part of master task execution. I think it should be fine since these are snapshots directly removed from the cluster state, so it's not possible for them to compete with any other operations. Just writing it down to clear my thoughts.
Yes, this one took a bit of thinking. The way it worked previously was very ugly: we adjusted this externally-visible state here and then tried to revert it if the cluster state update failed, but that was very much an antipattern vs waiting for the cluster state update to be committed before reacting.
```java
for (var item : batch) {
    assert item.itemCompletionHandler != null;
    item.itemCompletionHandler.onCompletion(
        item.startedDeletionUuid != null ? item.startedDeletionUuid : deletionUuid,
```
Can we get rid of deletionUuid and just rely on startedDeletionUuid? We do need an assignment for each item but it seems conceptually simpler?
Mmm not very easily - at the point we're deciding whether each item can subscribe to the existing STARTED deletion or must wait for a new one we don't know what the new one's UUID will be, and then later on we would have to do some slightly strange checks to decide whether each item needs to subscribe to the new deletion or whether it's already bound for some other completion path.
I was thinking that we "just" replace all batchCompletionHandler.deletionUuid assignments with a method that internally assigns it to each item's startedDeletionUuid if it is null. But we can leave it as is if you prefer.
I think it'd be confusing to have a non-null task.deletionUuid if the task wasn't actually waiting for that deletion (e.g. its listener was DO_NOTHING or COMPLETE_LISTENER_IMMEDIATELY).
I worry that this would end up retaining an awful lot of slightly different (stale and ultimately unused) copies of …
```java
    return;
}

repository.getRepositoryData(snapshotExecutor, new ActionListener<>() {
```
If I understand this PR correctly, this expensive async getRepositoryData call is the whole reason why we chose not to use the master queue to handle the batching, right?
Because we need the RepositoryData to compute the cluster state when executing the batch, and we can't block on the getRepositoryData response within the executor because that could (catastrophically) stall cluster state updates. And based on your response here: #141998 (comment), we also don't want to pass it for "each task in a master batch" because it would consume more memory than we need.
Right - we need to be sure we're resolving the patterns in the request against the actual snapshots in the repository, and although usually the master will hold a copy of the latest RepositoryData in-memory, it may be that we have to read it from the repository and that's not something for which we should block the master thread. Best to spin and try again rather than holding up the queue like that.
Previously this was done with SnapshotsService#executeConsistentStateUpdate but this utility only supports a plain ClusterStateUpdateTask.
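The "don't block the master thread" pattern described in this thread can be sketched as follows. This is a hedged illustration only: `AsyncRepositoryDataFetcher` and `fetchThenSubmit` are made-up names, and the real code uses Elasticsearch's ActionListener plumbing rather than CompletableFuture.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Sketch: load the (potentially slow, blob-store-backed) repository data on a
// snapshot-pool thread, and only once it is available enqueue the cluster
// state update - rather than blocking the master thread on the read.
class AsyncRepositoryDataFetcher {
    static <T> void fetchThenSubmit(
        Executor snapshotExecutor,          // stand-in for the snapshot thread pool
        Supplier<T> loadRepositoryData,     // may read from the repository, so must be off the master thread
        Consumer<T> submitClusterStateUpdate
    ) {
        CompletableFuture
            .supplyAsync(loadRepositoryData, snapshotExecutor) // runs off the caller's thread
            .thenAccept(submitClusterStateUpdate);             // submit the update when the data is ready
    }
}
```

If the data turns out to be stale by the time the update executes, the real implementation spins and retries rather than holding up the queue.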
```java
final SnapshotDeletionsItem[] batch;
final RepositoryMetadata repositoryMetadata;
final RepositoryData repositoryData;
// TODO could we discard the repositoryData eagerly if repositoryMetadata changes to release that memory sooner?
```
So detect this case after the batch was submitted to the queue but before we execute it? Do you have an idea of how we would practically do that? Via some kind of listener on the batch?
On successful completion of the second cluster state update in BlobStoreRepository#writeIndexGen (the one that says set safe repository generation) we'd want to notify the enqueued batch of the new RepositoryData, without pulling the batch out of the queue. This is the only place where we write a new RepositoryData, and it's happening as part of a cluster state update (e.g. it's on the master thread) so we can be sure that we're not currently processing the batch at the time.
However, there are some subtleties around the failure handling. If the master stops being the master then we do not guarantee to clear out its queue immediately, so a different node could be elected master, update the RepositoryData, and then fail over back to the original master. Since a different node updated the RepositoryData, the original master's queue entries wouldn't have been notified about the change. Likewise if either the writeAtomic(OperationPurpose.SNAPSHOT_METADATA, blobContainer(), indexBlob, ... or the following set safe repository generation update fail, then we can't know the actual state of the repository without re-reading it.
Since a different node updated the RepositoryData the original master's queue entries wouldn't have been notified about the change. Likewise if either the writeAtomic(OperationPurpose.SNAPSHOT_METADATA, blobContainer(), indexBlob, ... or the following set safe repository generation update fail then we can't know the actual state of the repository without re-reading it.
This is an issue if we require the notification process to guarantee that every batch executes with the most recent repositoryData, right? If we treat notifications as "best effort" and invalidate queued batches (e.g., setting their repositoryData field to null) when a change occurs, to eagerly discard them, but keep the current check logic, would this be fine? But I guess the performance gains might not be as significant.
Yes that's right, these things are surmountable, the point is that there's more complexity to it than you might first think.
```java
 * removes the last entry from {@link SnapshotsInProgress}.
 */
void startDeletion(String[] snapshots, boolean waitForCompletion, TimeValue timeout, ActionListener<Void> listener) {
    logger.trace("startDeletion[{}]", Arrays.toString(snapshots));
```
Unrelated to this specific PR, but are there established conventions that we follow around logging in the ES repo, i.e. the amount of internal logs we emit and the level, specifically in the distributed path? Or is it more of a "best judgment" situation (e.g. don't emit noisy logs on hot paths)?
We do have a guide for logging best practice that you may want to consult:
https://github.com/elastic/elasticsearch/blob/main/CONTRIBUTING.md#logging
```java
        return ItemCompletionHandler.COMPLETE_LISTENER_IMMEDIATELY;
    }
}
// else add all the item's snapshots to the new deletion, even though there may be duplicates, because the running
```
We may add the snapshotIds to an existing WAITING deletion entry instead of creating a new deletion?
That's right - I take it you just mean the comment needs to be more precise, no actual code changes needed, right? Comment improved in 8151d5d.
```java
final var updatedDeletionsInProgress = deletionsInProgress.withReplacedEntry(
    entry.withAddedSnapshots(snapshotIdsRequiringCleanup)
);
```
withAddedSnapshots(...) returns this when nothing is added, but withReplacedEntry(...) always returns a new object even when the newEntry is the same object as the old entry. Therefore updatedDeletionsInProgress is always updated, which seems unnecessary.
Yes this was my mistaken understanding of snapshotIdsRequiringCleanup too - I made withReplacedEntry do the no-op thing correctly (and added some more assertions therein), see 0e09e6d.
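The identity-preserving update discussed here is a common pattern for immutable cluster-state pieces: mutators return `this` when nothing changed, so callers can use `==` to detect a no-op and skip publishing a new state. The sketch below uses made-up names (`Deletions`, `withReplacedEntry` over plain strings) purely to show the shape, not the real ES types.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: an immutable holder whose "replace" operation preserves object
// identity on a no-op, so `updated == original` reliably detects no change.
final class Deletions {
    final List<String> entries;

    Deletions(List<String> entries) {
        this.entries = List.copyOf(entries);
    }

    Deletions withReplacedEntry(String oldEntry, String newEntry) {
        if (oldEntry.equals(newEntry)) {
            return this; // no-op: keep identity so callers can short-circuit
        }
        List<String> copy = new ArrayList<>(entries);
        copy.set(copy.indexOf(oldEntry), newEntry);
        return new Deletions(copy);
    }
}
```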
```java
if (updatedSnapshots == snapshotsInProgress && updatedDeletionsInProgress == deletionsInProgress) {
    // NB copied over from the unbatched implementation, may now already be handled in resolveItemAndAddSnapshotIds?
    assert false : "should be already handled"; // TODO remove this branch if tests are reliably passing
    return initialState;
```
Is this true? We don't check the WAITING state in resolveItemAndAddSnapshotIds. It seems possible that a first delete request creates a WAITING entry and a second delete request targeting the same snapshots gets added to the first deletion entry but is a no-op?
Aha ok yes I was misunderstanding the meaning of snapshotIdsRequiringCleanup - this includes snapshots that are already marked for deletion, it isn't only the new deletion entries. I added a test case covering this code fragment more thoroughly and expanded some comments in 0e09e6d.
```java
if (deleteUUID == null) {
    listener.onResponse(null);
```
Can we assert false in this branch? It seems to me that we should always have a non-null UUID for this callback.
No, this can still be null if the snapshots hadn't really started and were just immediately removed from the cluster state. I added a comment in 6bcaead.
ywangd left a comment
LGTM
The tests are quite massive and impressive. Took me a while to read through them. TIL.
```java
// the listener plumbing is kinda delicate here - done like this to ensure that we verify that failBatch calls runQueueProcessor
final var deletionMissingFuture = new SubscribableListener<Void>();
final var deletionFailFuture = new SubscribableListener<Void>();
deletionFailFuture.addListener(ActionTestUtils.assertNoSuccessListener(ignoredException -> {
    repository.repositoryDataListener = SubscribableListener.newSucceeded(RepositoryData.EMPTY);
    batcher.startDeletion(
        new String[] { randomSnapshotName() },
        true,
        MasterNodeRequest.INFINITE_MASTER_NODE_TIMEOUT,
        deletionMissingFuture
    );
```
Would be nice if we could assert that runQueueProcessor is not triggered after startDeletion, to double-confirm it is called by failBatch. Doesn't seem easy though.
Yes I didn't find this test very satisfactory. I looked at it again with a fresh mind and rewrote it in 792d1a5 to fail at the master-publication stage instead, which has the same effect (we call failBatch either way).
```java
for (int i = 0; i < iterations; i++) {
    startDeletion(snapshotName);
```
It seems unnecessary to log all names even when they are identical. I guess it matters only for tests since we don't really expect such usages in production?
Yeah it didn't really seem worth tracking duplicates and worrying about memory usage for that; today because it's unbatched we would log each of these in a separate log line because there's not even a concept of "duplicate". Also in production I think it'd be interesting if we were sometimes seeing lots of duplicates here. Something to come back to if it turns out to be an issue but I expect it won't.
This test fails if it only has a single snapshot thread and that thread gets blocked on the repository before the second snapshot has been enqueued: since #141998 landed, enqueuing batches of deletes requires a snapshot thread. This commit fixes the problem by adding a second snapshot thread. Closes #143387