Fixes testSnapshotShutdownProgressTracker by joshua-adams-1 · Pull Request #134926 · elastic/elasticsearch

joshua-adams-1 · 2025-09-17T16:15:44Z

I was able to reproduce the error by following the steps outlined by @JeremyDahlgren here

The error occurs due to the following sequence of events:

One of the snapshot threads has a sleep when executing snapshot tasks
The other is not delayed, so it will get blocked in the repository to allow the test to get to the putShutdownForRemovalMetadata() call
After the delays the first thread processes the other snapshot tasks, which are now paused, which throws a PausedSnapshotException, and then immediately notifies the master with the failure
The first four SnapshotShutdownProgressTracker log messages are verified in the test and then the masterTransportService.addRequestHandlingBehavior() is called to block the notifications
The repo is unblocked on the nodeForRemoval, but by this time it is too late and one or more updates have already been sent, possibly leaving only the single blocked shard, causing the assertion at 581 to fail

The proposed change:

Ensures that there are n shards on otherNode with data. At the moment, there is only 1.
I have reordered the blocking on the master node to come earlier. This is so we are always capturing the notifications back to the master, either by the snapshot thread being blocked, or if there was some delay on a thread due to other concurrent scheduling.

A full explanation of the changes can be found here

Testing

Having been able to reproducible make the test fail, it now succeeds 100 times in a row
The original gradle command ./gradlew ":server:internalClusterTest" --tests "org.elasticsearch.snapshots.SnapshotShutdownIT.testSnapshotShutdownProgressTracker" -Dtests.seed=C44FB8C3BA8AC78E -Dtests.locale=dz -Dtests.timezone=Australia/Queensland -Druntime.java=23 succeeds
CI passes

Resolves: #134620

Addresses a mocklog assertion inside the testSnapshotShutdownProgressTracker. Resolves: 134620

joshua-adams-1 · 2025-09-17T16:20:04Z

I have added a backport label for 8.18.9 too - let me know if this is incorrect

DaveCTurner

Hmm but startFullSnapshotBlockedOnDataNode(randomIdentifier(), repoName, nodeForRemoval); should block all the shard snapshots before they get as far as notifying the master that they have PAUSED. Can you explain how some of these shards bypass that block?

joshua-adams-1 · 2025-09-18T09:25:17Z

@DaveCTurner I have updated the message above to include further clarity. I can't say why that particular line doesn't work, but there is a code comment here alluding to a possible race condition that likely answers your question

elasticsearchmachine · 2025-09-18T12:58:03Z

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

JeremyDahlgren

Hmm but startFullSnapshotBlockedOnDataNode(randomIdentifier(), repoName, nodeForRemoval); should block all the shard snapshots before they get as far as notifying the master that they have PAUSED. Can you explain how some of these shards bypass that block?

I dug into this and was able to reliably reproduce the error, along with an explanation for David's question, posted here in the test failure issue. In brief, it is possible for a snapshot pool thread to not be blocked after putShutdownForRemovalMetadata() is called, and the notifications back to the master will execute before the CountDownLatch and masterTransportService.addRequestHandlingBehavior() have been set up.

An alternative to changing the log message is to move the CountDownLatch and masterTransportService.addRequestHandlingBehavior() setup to before the putShutdownForRemovalMetadata() call. This has been running ok on my machine in a loop.

DaveCTurner · 2025-09-19T09:38:29Z

I suggested a preferable fix here: make sure that every shard contains data so that none of them can escape the block.

joshua-adams-1 · 2025-09-24T13:02:58Z

The testSnapshotShutdownProgressTracker test is present in both branches 9.0 and 9.1 so am backporting to them both.

JeremyDahlgren

LGTM

elasticsearchmachine · 2025-09-25T08:33:28Z

💔 Backport failed

Status	Branch	Result
❌	9.1	Commit could not be cherrypicked due to conflicts
❌	9.0	Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 134926

…-dls * upstream/main: (100 commits) ES|QL: Add FUSE operator tests (elastic#135307) [D0CS] Revise connector setup steps in documentation (elastic#135426) Fix DiscoveryDisruptionIT.testElectMasterWithLatestVersion (elastic#135396) [DOCS] Marks the change point agg as GA (elastic#134898) Rework ShardSearchContextId to explain use of searcher id better (elastic#135233) [CI] Handle caching bwc dependencies more gracefully (elastic#135417) Mute org.elasticsearch.gradle.TestClustersPluginFuncTest override jdk usage via ES_JAVA_HOME for known jdk os incompatibilities elastic#135413 [Build] update eclipse formatter used by spotless (elastic#135382) [Test] Fix typo in build tool tests (elastic#135405) Fixes testSnapshotShutdownProgressTracker (elastic#134926) Mute org.elasticsearch.upgrades.StandardToLogsDbIndexModeRollingUpgradeIT testLogsIndexing {upgradedNodes=1} elastic#135313 OTLP: remove feature flag (elastic#135401) [Docs] Convert asciidoc lifecycle markers into Docs V3 syntax (elastic#135347) Mute org.elasticsearch.upgrades.QueryableBuiltInRolesUpgradeIT testBuiltInRolesSyncedOnClusterUpgrade elastic#135194 Mute org.elasticsearch.upgrades.IndexingIT testIndexing elastic#135407 Mute org.elasticsearch.upgrades.DataStreamsUpgradeIT testDataStreamValidationDoesNotBreakUpgrade elastic#135406 [CI] Handle git snapshot BWC versions correctly when calculating jdk fallback (elastic#135399) [Build] Update checkstyle from 10.3 to 11.0.1 (elastic#135381) Mute org.elasticsearch.upgrades.TextRollingUpgradeIT testIndexing {upgradedNodes=2} elastic#135238 Mute org.elasticsearch.upgrades.MatchOnlyTextRollingUpgradeIT testIndexing {upgradedNodes=2} elastic#135325 ...

Partial revert of elastic#134926

Partial revert of #134926

…ic#135835) Partial revert of elastic#134926

Partial revert of elastic#134926 Backport of elastic#135835 to 9.1

Partial revert of elastic#134926 Backport of elastic#135835 to 9.0

…) (#136014) Partial revert of #134926

Partial revert of #134926 Backport of #135835 to 9.1

Partial revert of #134926 Backport of #135835 to 9.0

Fixes testSnapshotShutdownProgressTracker

41c9787

Addresses a mocklog assertion inside the testSnapshotShutdownProgressTracker. Resolves: 134620

joshua-adams-1 requested review from DaveCTurner, DiannaHohensee and JeremyDahlgren September 17, 2025 16:15

joshua-adams-1 self-assigned this Sep 17, 2025

joshua-adams-1 added >non-issue :Distributed Coordination/Distributed labels Sep 17, 2025

elasticsearchmachine added the v9.2.0 label Sep 17, 2025

joshua-adams-1 added v8.18.9 auto-backport Automatically create backport pull requests when merged labels Sep 17, 2025

[CI] Auto commit changes from spotless

57a6616

DaveCTurner reviewed Sep 17, 2025

View reviewed changes

Split log line

a038678

joshua-adams-1 marked this pull request as ready for review September 18, 2025 12:57

elasticsearchmachine added the Team:Distributed Coordination (obsolete) Meta label for Distributed Coordination team. Obsolete. Please do not use. label Sep 18, 2025

JeremyDahlgren reviewed Sep 19, 2025

View reviewed changes

joshua-adams-1 and others added 3 commits September 24, 2025 12:17

Moves master node transport action blocks

48b4aa8

[CI] Update transport version definitions

654e8cb

Merge branch 'main' into snapshot-shutdown-it-test-failure

a600ed5

joshua-adams-1 requested a review from JeremyDahlgren September 24, 2025 12:38

joshua-adams-1 mentioned this pull request Sep 24, 2025

[CI] SnapshotShutdownIT testSnapshotShutdownProgressTracker failing #134620

Closed

joshua-adams-1 added v9.1.5 v9.0.8 and removed v8.18.9 labels Sep 24, 2025

JeremyDahlgren approved these changes Sep 24, 2025

View reviewed changes

joshua-adams-1 added >test Issues or PRs that are addressing/adding tests and removed >non-issue labels Sep 24, 2025

joshua-adams-1 merged commit 5643435 into elastic:main Sep 25, 2025
34 checks passed

elasticsearchmachine added the backport pending label Sep 25, 2025

This was referenced Sep 25, 2025

[9.0] Fixes testSnapshotShutdownProgressTracker (#134926) #135415

Merged

[9.1] Fixes testSnapshotShutdownProgressTracker (#134926) #135416

Merged

joshua-adams-1 deleted the snapshot-shutdown-it-test-failure branch September 25, 2025 09:27

joshua-adams-1 removed the backport pending label Sep 25, 2025

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Oct 2, 2025

Reinstate block order in testSnapshotShutdownProgressTracker

c30ec06

Partial revert of elastic#134926

DaveCTurner mentioned this pull request Oct 2, 2025

Reinstate block order in testSnapshotShutdownProgressTracker #135835

Merged

DaveCTurner added a commit that referenced this pull request Oct 6, 2025

Reinstate block order in testSnapshotShutdownProgressTracker (#135835)

2586fbd

Partial revert of #134926

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Oct 6, 2025

Reinstate block order in testSnapshotShutdownProgressTracker (elast…

4b2b780

…ic#135835) Partial revert of elastic#134926

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Oct 6, 2025

Reinstate block order in testSnapshotShutdownProgressTracker

9d44a24

Partial revert of elastic#134926 Backport of elastic#135835 to 9.1

DaveCTurner mentioned this pull request Oct 6, 2025

Reinstate block order in testSnapshotShutdownProgressTracker #136015

Merged

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Oct 6, 2025

Reinstate block order in testSnapshotShutdownProgressTracker

6765b34

Partial revert of elastic#134926 Backport of elastic#135835 to 9.0

DaveCTurner mentioned this pull request Oct 6, 2025

Reinstate block order in testSnapshotShutdownProgressTracker #136016

Merged

elasticsearchmachine pushed a commit that referenced this pull request Oct 6, 2025

Reinstate block order in testSnapshotShutdownProgressTracker (#135835…

8a779c4

…) (#136014) Partial revert of #134926

elasticsearchmachine pushed a commit that referenced this pull request Oct 6, 2025

Reinstate block order in testSnapshotShutdownProgressTracker (#136015)

8778a2f

Partial revert of #134926 Backport of #135835 to 9.1

elasticsearchmachine pushed a commit that referenced this pull request Oct 6, 2025

Reinstate block order in testSnapshotShutdownProgressTracker (#136016)

267c6bf

Partial revert of #134926 Backport of #135835 to 9.0

repantis added :Distributed/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. and removed :Distributed Coordination/Distributed labels Jan 28, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fixes testSnapshotShutdownProgressTracker#134926

Fixes testSnapshotShutdownProgressTracker#134926
joshua-adams-1 merged 6 commits intoelastic:mainfrom
joshua-adams-1:snapshot-shutdown-it-test-failure

joshua-adams-1 commented Sep 17, 2025 •

edited

Loading

Uh oh!

joshua-adams-1 commented Sep 17, 2025

Uh oh!

DaveCTurner left a comment

Uh oh!

joshua-adams-1 commented Sep 18, 2025

Uh oh!

elasticsearchmachine commented Sep 18, 2025

Uh oh!

JeremyDahlgren left a comment

Uh oh!

DaveCTurner commented Sep 19, 2025

Uh oh!

joshua-adams-1 commented Sep 24, 2025

Uh oh!

JeremyDahlgren left a comment

Uh oh!

Uh oh!

elasticsearchmachine commented Sep 25, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

joshua-adams-1 commented Sep 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Testing

Uh oh!

joshua-adams-1 commented Sep 17, 2025

Uh oh!

DaveCTurner left a comment

Choose a reason for hiding this comment

Uh oh!

joshua-adams-1 commented Sep 18, 2025

Uh oh!

elasticsearchmachine commented Sep 18, 2025

Uh oh!

JeremyDahlgren left a comment

Choose a reason for hiding this comment

Uh oh!

DaveCTurner commented Sep 19, 2025

Uh oh!

joshua-adams-1 commented Sep 24, 2025

Uh oh!

JeremyDahlgren left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

elasticsearchmachine commented Sep 25, 2025

💔 Backport failed

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

joshua-adams-1 commented Sep 17, 2025 •

edited

Loading