Skip to content

Limit heap used tracking IndexMetadata deletions#140394

Merged
DaveCTurner merged 7 commits intoelastic:mainfrom
DaveCTurner:2026/01/08/snapshot-deletions-bound-index-metadata-usage
Jan 12, 2026
Merged

Limit heap used tracking IndexMetadata deletions#140394
DaveCTurner merged 7 commits intoelastic:mainfrom
DaveCTurner:2026/01/08/snapshot-deletions-bound-index-metadata-usage

Conversation

@DaveCTurner
Copy link
Copy Markdown
Member

In #133558 we imposed a limit on the heap used to keep track of
shard-level blobs to clean up after the commit of a snapshot deletion.
This commit makes use of the same mechanism to track IndexMetadata
blobs for future deletion.

Closes #140018

In elastic#133558 we imposed a limit on the heap used to keep track of
shard-level blobs to clean up after the commit of a snapshot deletion.
This commit makes use of the same mechanism to track `IndexMetadata`
blobs for future deletion.

Closes elastic#140018
@DaveCTurner DaveCTurner added >bug :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v9.4.0 labels Jan 8, 2026
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination (obsolete) Meta label for Distributed Coordination team. Obsolete. Please do not use. label Jan 8, 2026
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Hi @DaveCTurner, I've created a changelog YAML for you.

Copy link
Copy Markdown
Member Author

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is a little noisy because I've renamed things to avoid referring to shard-level things now that they also include index-level ones.

private void writeUpdatedShardMetadataAndComputeDeletes(ActionListener<Void> listener) {
// noinspection resource -- closed safely at the end of the iteration
final var listeners = new RefCountingListener(listener);
final var listeners = new RefCountingListener(ActionListener.runBefore(listener, this::recordUnreferencedIndicesMetadata));
Copy link
Copy Markdown
Member Author

@DaveCTurner DaveCTurner Jan 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NB moves the computation here, before we commit the new RepositoryData blob, because this is the right place to enhance it in future with the idea to list the actual blobs to be deleted. Even before doing that, if we compute these blobs here we can be sure there's no other snapshot deletions ongoing so we don't have to worry about concurrent memory usage.

Comment on lines +1826 to +1828
final var indexPath = indexPath(indexId).buildAsString();
assert indexPath.startsWith(basePath);
final var truncatedIndexPath = indexPath.substring(basePathLen);
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously we were adding the full path including basePath to each blob and then removing it again later on, but we can cut that out here.

final Map<IndexId, Collection<String>> toRemove = new HashMap<>();
while (indicesForSnapshot.hasNext()) {
final var indexId = indicesForSnapshot.next();
return Iterators.flatMap(indicesToUpdateAfterRemovingSnapshot(snapshotIds), indexId -> {
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the significant change: make this an Iterator so that we don't need to materialize more than one index's worth of blobs at once.

Copy link
Copy Markdown
Contributor

@joshua-adams-1 joshua-adams-1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM in general, just a few nits

.get();

final var repo = setupRepo();
try (var shardBlobsToDelete = repo.new ShardBlobsToDelete()) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Nit] The test name and javadoc need updating since it is no longer ShardBlobsToDelete

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++ fixed


final var repo = setupRepo();
try (var shardBlobsToDelete = repo.new ShardBlobsToDelete()) {
try (var shardBlobsToDelete = repo.new BlobsToDelete()) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Nit] Should this variable still be called shardBlobsToDelete?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++ fixed

final var shardId = between(1, 30);
final var shardGeneration = new ShardGeneration(randomUUID());

// Always write at least one blob, guaranteeing that the shardDeleteResults stream increases in size
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Nit] AFAICT from the git diff, the final var blobsToDelete variable this comment belongs to has moved to line 798 so can this comment go there too? (I also see a similar need for it above line 814). Also I think shardDeleteResults needs renaming

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++ fixed; it's actually not important that we always write one blob because we always e.g. write the IndexId

final List<String> blobsToDelete;
final CheckedRunnable<Exception> addResult;
final UnaryOperator<String> blobNameOperator;
if (randomBoolean()) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps a one line comment here explaining what is happening so that the next person who reads the test understands we're testing two branches. Or, and probably my preferred option, would be updating the Javadoc comment above the test

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++ fixed

@DaveCTurner DaveCTurner enabled auto-merge (squash) January 12, 2026 09:34
Copy link
Copy Markdown
Contributor

@joshua-adams-1 joshua-adams-1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@DaveCTurner DaveCTurner merged commit e3caa46 into elastic:main Jan 12, 2026
35 checks passed
@DaveCTurner DaveCTurner deleted the 2026/01/08/snapshot-deletions-bound-index-metadata-usage branch January 12, 2026 12:58
jimczi pushed a commit to jimczi/elasticsearch that referenced this pull request Jan 12, 2026
In elastic#133558 we imposed a limit on the heap used to keep track of
shard-level blobs to clean up after the commit of a snapshot deletion.
This commit makes use of the same mechanism to track `IndexMetadata`
blobs for future deletion.

Closes elastic#140018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

>bug :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs Team:Distributed Coordination (obsolete) Meta label for Distributed Coordination team. Obsolete. Please do not use. v9.4.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Excessive heap usage of post-snapshot-delete index metadata cleanup

3 participants