Fixes testSnapshotShutdownProgressTracker (#134926)
joshua-adams-1 merged 6 commits into elastic:main
Conversation
Addresses a MockLog assertion failure inside testSnapshotShutdownProgressTracker. Resolves: #134620
I have added a backport label for
DaveCTurner
left a comment
Hmm but `startFullSnapshotBlockedOnDataNode(randomIdentifier(), repoName, nodeForRemoval);` should block all the shard snapshots before they get as far as notifying the master that they have PAUSED. Can you explain how some of these shards bypass that block?
@DaveCTurner I have updated the message above to include further clarity. I can't say why that particular line doesn't work, but there is a code comment here alluding to a possible race condition that likely answers your question.
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)
JeremyDahlgren
left a comment
> Hmm but `startFullSnapshotBlockedOnDataNode(randomIdentifier(), repoName, nodeForRemoval);` should block all the shard snapshots before they get as far as notifying the master that they have PAUSED. Can you explain how some of these shards bypass that block?
I dug into this and was able to reliably reproduce the error, along with an explanation for David's question, posted here in the test failure issue. In brief, it is possible for a snapshot pool thread to not be blocked after `putShutdownForRemovalMetadata()` is called, and the notifications back to the master will execute before the `CountDownLatch` and `masterTransportService.addRequestHandlingBehavior()` have been set up.

An alternative to changing the log message is to move the `CountDownLatch` and `masterTransportService.addRequestHandlingBehavior()` setup to before the `putShutdownForRemovalMetadata()` call. This has been running ok on my machine in a loop.
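The ordering hazard described above can be shown with a deliberately simplified, hypothetical sketch in plain Java. This is not the actual Elasticsearch test code; `blockNotifications` stands in for the `addRequestHandlingBehavior()` interception and the worker thread stands in for a shard snapshot thread reacting to the shutdown metadata.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Simplified illustration of the race: if the notification block is installed
// only AFTER the shutdown trigger, a fast shard thread can notify the master first.
public class OrderingRace {
    static final AtomicInteger escapedNotifications = new AtomicInteger();
    static volatile boolean blockNotifications = false;

    // Stand-in for a shard snapshot thread reacting to the shutdown signal.
    static void shardReactsToShutdown() {
        if (!blockNotifications) {
            escapedNotifications.incrementAndGet(); // notification escaped the block
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Buggy order: trigger first, install the block afterwards (too late).
        Thread shard = new Thread(OrderingRace::shardReactsToShutdown);
        shard.start();              // analogue of putShutdownForRemovalMetadata()
        shard.join();
        blockNotifications = true;  // analogue of addRequestHandlingBehavior()
        System.out.println("escaped=" + escapedNotifications.get());

        // Fixed order: install the block before triggering the shutdown.
        escapedNotifications.set(0);
        blockNotifications = true;
        Thread shard2 = new Thread(OrderingRace::shardReactsToShutdown);
        shard2.start();
        shard2.join();
        System.out.println("escaped=" + escapedNotifications.get());
    }
}
```

In the first ordering one notification slips through (`escaped=1`); reordering the setup, as suggested, yields `escaped=0`.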
I suggested a preferable fix here: make sure that every shard contains data so that none of them can escape the block.
💔 Backport failed
You can use sqren/backport to manually backport by running
Partial revert of elastic#134926 (backport of elastic#135835 to 9.1)
Partial revert of elastic#134926 (backport of elastic#135835 to 9.0)
I was able to reproduce the error by following the steps outlined by @JeremyDahlgren here
The error occurs due to the following sequence of events:
1. `putShutdownForRemovalMetadata()` is called. A snapshot pool thread that is not yet blocked throws a `PausedSnapshotException`, and then immediately notifies the master with the failure.
2. The `SnapshotShutdownProgressTracker` log messages are verified in the test, and then `masterTransportService.addRequestHandlingBehavior()` is called to block the notifications from `nodeForRemoval`, but by this time it is too late and one or more updates have already been sent, possibly leaving only the single blocked shard, causing the assertion at line 581 to fail.

The proposed change: create `n` shards on `otherNode` with data. At the moment, there is only 1. A full explanation of the changes can be found here.
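Why indexing data into every shard closes the race can be sketched with a small hypothetical Java example (not the real snapshot code): a shard with no documents has nothing to write, so it never reaches the blocked repository write and completes, and notifies the master, immediately.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Illustration of the proposed fix: only shards that actually hold data
// reach the (blocked) repository write; empty shards escape the block.
public class EmptyShardEscape {
    static final AtomicInteger blockedShards = new AtomicInteger();

    // Stand-in for a shard snapshot: empty shards skip the repository write.
    static void snapshotShard(int docCount) {
        if (docCount > 0) {
            blockedShards.incrementAndGet(); // would park on the repo block here
        }
        // docCount == 0: completes instantly and notifies the master right away
    }

    public static void main(String[] args) {
        // Before the fix: one shard is empty and escapes the block.
        for (int docs : List.of(0, 5, 3)) snapshotShard(docs);
        System.out.println("blocked=" + blockedShards.get() + " of 3");

        // After the fix: every shard contains data, so all of them block.
        blockedShards.set(0);
        for (int docs : List.of(4, 5, 3)) snapshotShard(docs);
        System.out.println("blocked=" + blockedShards.get() + " of 3");
    }
}
```

With data in every shard, all three block (`blocked=3 of 3`) and none can notify the master before the test installs its interception.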
Testing
```
./gradlew ":server:internalClusterTest" --tests "org.elasticsearch.snapshots.SnapshotShutdownIT.testSnapshotShutdownProgressTracker" -Dtests.seed=C44FB8C3BA8AC78E -Dtests.locale=dz -Dtests.timezone=Australia/Queensland -Druntime.java=23
```

succeeds.

Resolves: #134620