Skip to content

Fixes testSnapshotShutdownProgressTracker#134926

Merged
joshua-adams-1 merged 6 commits intoelastic:mainfrom
joshua-adams-1:snapshot-shutdown-it-test-failure
Sep 25, 2025
Merged

Fixes testSnapshotShutdownProgressTracker#134926
joshua-adams-1 merged 6 commits intoelastic:mainfrom
joshua-adams-1:snapshot-shutdown-it-test-failure

Conversation

@joshua-adams-1
Copy link
Copy Markdown
Contributor

@joshua-adams-1 joshua-adams-1 commented Sep 17, 2025

I was able to reproduce the error by following the steps outlined by @JeremyDahlgren here

The error occurs due to the following sequence of events:

  1. One of the snapshot threads has a sleep when executing snapshot tasks
  2. The other is not delayed, so it will get blocked in the repository to allow the test to get to the putShutdownForRemovalMetadata() call
  3. After the delays the first thread processes the other snapshot tasks, which are now paused, which throws a PausedSnapshotException, and then immediately notifies the master with the failure
  4. The first four SnapshotShutdownProgressTracker log messages are verified in the test and then the masterTransportService.addRequestHandlingBehavior() is called to block the notifications
  5. The repo is unblocked on the nodeForRemoval, but by this time it is too late and one or more updates have already been sent, possibly leaving only the single blocked shard, causing the assertion at 581 to fail

The proposed change:

  1. Ensures that there are n shards on otherNode with data. At the moment, there is only 1.
  2. I have reordered the blocking on the master node to come earlier. This is so we are always capturing the notifications back to the master, either by the snapshot thread being blocked, or if there was some delay on a thread due to other concurrent scheduling.

A full explanation of the changes can be found here


Testing

  • Having been able to reproducible make the test fail, it now succeeds 100 times in a row
  • The original gradle command ./gradlew ":server:internalClusterTest" --tests "org.elasticsearch.snapshots.SnapshotShutdownIT.testSnapshotShutdownProgressTracker" -Dtests.seed=C44FB8C3BA8AC78E -Dtests.locale=dz -Dtests.timezone=Australia/Queensland -Druntime.java=23 succeeds
  • CI passes

Resolves: #134620

Addresses a mocklog assertion inside the
testSnapshotShutdownProgressTracker.

Resolves: 134620
@joshua-adams-1
Copy link
Copy Markdown
Contributor Author

I have added a backport label for 8.18.9 too - let me know if this is incorrect

Copy link
Copy Markdown
Member

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm but startFullSnapshotBlockedOnDataNode(randomIdentifier(), repoName, nodeForRemoval); should block all the shard snapshots before they get as far as notifying the master that they have PAUSED. Can you explain how some of these shards bypass that block?

@joshua-adams-1
Copy link
Copy Markdown
Contributor Author

@DaveCTurner I have updated the message above to include further clarity. I can't say why that particular line doesn't work, but there is a code comment here alluding to a possible race condition that likely answers your question

@joshua-adams-1 joshua-adams-1 marked this pull request as ready for review September 18, 2025 12:57
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination (obsolete) Meta label for Distributed Coordination team. Obsolete. Please do not use. label Sep 18, 2025
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

Copy link
Copy Markdown
Contributor

@JeremyDahlgren JeremyDahlgren left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm but startFullSnapshotBlockedOnDataNode(randomIdentifier(), repoName, nodeForRemoval); should block all the shard snapshots before they get as far as notifying the master that they have PAUSED. Can you explain how some of these shards bypass that block?

I dug into this and was able to reliably reproduce the error, along with an explanation for David's question, posted here in the test failure issue. In brief, it is possible for a snapshot pool thread to not be blocked after putShutdownForRemovalMetadata() is called, and the notifications back to the master will execute before the CountDownLatch and masterTransportService.addRequestHandlingBehavior() have been set up.

An alternative to changing the log message is to move the CountDownLatch and masterTransportService.addRequestHandlingBehavior() setup to before the putShutdownForRemovalMetadata() call. This has been running ok on my machine in a loop.

@DaveCTurner
Copy link
Copy Markdown
Member

I suggested a preferable fix here: make sure that every shard contains data so that none of them can escape the block.

@joshua-adams-1
Copy link
Copy Markdown
Contributor Author

The testSnapshotShutdownProgressTracker test is present in both branches 9.0 and 9.1 so am backporting to them both.

Copy link
Copy Markdown
Contributor

@JeremyDahlgren JeremyDahlgren left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@joshua-adams-1 joshua-adams-1 added >test Issues or PRs that are addressing/adding tests and removed >non-issue labels Sep 24, 2025
@joshua-adams-1 joshua-adams-1 merged commit 5643435 into elastic:main Sep 25, 2025
34 checks passed
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

💔 Backport failed

Status Branch Result
9.1 Commit could not be cherrypicked due to conflicts
9.0 Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 134926

@joshua-adams-1 joshua-adams-1 deleted the snapshot-shutdown-it-test-failure branch September 25, 2025 09:27
szybia added a commit to szybia/elasticsearch that referenced this pull request Sep 25, 2025
…-dls

* upstream/main: (100 commits)
  ES|QL: Add FUSE operator tests (elastic#135307)
  [D0CS] Revise connector setup steps in documentation (elastic#135426)
  Fix DiscoveryDisruptionIT.testElectMasterWithLatestVersion (elastic#135396)
  [DOCS] Marks the change point agg as GA (elastic#134898)
  Rework ShardSearchContextId to explain use of searcher id better (elastic#135233)
  [CI] Handle caching bwc dependencies more gracefully (elastic#135417)
  Mute org.elasticsearch.gradle.TestClustersPluginFuncTest override jdk usage via ES_JAVA_HOME for known jdk os incompatibilities elastic#135413
  [Build] update eclipse formatter used by spotless (elastic#135382)
  [Test] Fix typo in build tool tests (elastic#135405)
  Fixes testSnapshotShutdownProgressTracker (elastic#134926)
  Mute org.elasticsearch.upgrades.StandardToLogsDbIndexModeRollingUpgradeIT testLogsIndexing {upgradedNodes=1} elastic#135313
  OTLP: remove feature flag (elastic#135401)
  [Docs] Convert asciidoc lifecycle markers into Docs V3 syntax (elastic#135347)
  Mute org.elasticsearch.upgrades.QueryableBuiltInRolesUpgradeIT testBuiltInRolesSyncedOnClusterUpgrade elastic#135194
  Mute org.elasticsearch.upgrades.IndexingIT testIndexing elastic#135407
  Mute org.elasticsearch.upgrades.DataStreamsUpgradeIT testDataStreamValidationDoesNotBreakUpgrade elastic#135406
  [CI] Handle git snapshot BWC versions correctly when calculating jdk fallback (elastic#135399)
  [Build] Update checkstyle from 10.3 to 11.0.1 (elastic#135381)
  Mute org.elasticsearch.upgrades.TextRollingUpgradeIT testIndexing {upgradedNodes=2} elastic#135238
  Mute org.elasticsearch.upgrades.MatchOnlyTextRollingUpgradeIT testIndexing {upgradedNodes=2} elastic#135325
  ...
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Oct 2, 2025
DaveCTurner added a commit that referenced this pull request Oct 6, 2025
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Oct 6, 2025
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Oct 6, 2025
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Oct 6, 2025
elasticsearchmachine pushed a commit that referenced this pull request Oct 6, 2025
elasticsearchmachine pushed a commit that referenced this pull request Oct 6, 2025
elasticsearchmachine pushed a commit that referenced this pull request Oct 6, 2025
@repantis repantis added :Distributed/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. and removed :Distributed Coordination/Distributed labels Jan 28, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-backport Automatically create backport pull requests when merged :Distributed/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. Team:Distributed Coordination (obsolete) Meta label for Distributed Coordination team. Obsolete. Please do not use. >test Issues or PRs that are addressing/adding tests v9.0.8 v9.1.5 v9.2.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[CI] SnapshotShutdownIT testSnapshotShutdownProgressTracker failing

5 participants