Add Snapshot Resiliency Test for Master Failover during Delete #54866
original-brownbear merged 8 commits into elastic:master from original-brownbear:add-snapshot-delete-master-failover-test
Conversation
We only have very indirect coverage of master failovers during snapshot delete at the moment. This commit adds a direct test of this scenario, plus an assertion that makes sure we are not leaking any snapshot completion listeners in the snapshots service in this scenario. This gives us better coverage of scenarios like #54256 and makes the diff to the upcoming, more consistent snapshot delete implementation in #54705 smaller.
Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)
    assertThat(finalSnapshotsInProgress.entries(), empty());
    final Repository repository = randomMaster.repositoriesService.repository(repoName);
    Collection<SnapshotId> snapshotIds = getRepositoryData(repository).getSnapshotIds();
    assertThat(snapshotIds, either(hasSize(1)).or(hasSize(0)));
Shouldn't it be always hasSize(0) when waitForSnapshot is true?
Right :) I shouldn't have mindlessly copied that from the concurrent snapshots branch. Thanks for spotting :)
As a matter of fact, thanks to recent fixes this is now always 0. Even on master fail-over the deletes are now properly retried :) Adjusted the tests accordingly. Interestingly enough, this turned up one strange spot: a one-in-a-million seed where the cleanup logic would take multiple minutes to complete on the fake threadpool (so it's just fake minutes, but still interesting) => that's why I had to up the timeout there.
I'm investigating why that is now.
@tlrx This was a really stupid bug ... forgot to start the node connections service. This had some strange side effects since it resulted in some transport handlers never failing, causing some CS publications on the failing over master to never complete, causing this test to only move on once the failing master was again removed from the cluster after the 1.5m publication timeout ... behaves much better now.
Should be good for review with 3370c84 now :)
    continueOrDie(cleanupResponse, r -> cleanedUp.set(true));

    - runUntil(cleanedUp::get, TimeUnit.MINUTES.toMillis(1L));
    + runUntil(cleanedUp::get, TimeUnit.MINUTES.toMillis(5L));
I'm investigating why this is taking so long; until then I'm upping the value here, because even though I only found a seed that fails at this spot, I bet there's one where this could fail in one of the other master failover tests as well.
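For readers unfamiliar with the deterministic test framework used here: the `runUntil` idiom drains a task queue against a fake clock, so a "5 minute" timeout costs no real wall-clock time. A minimal, self-contained sketch of that idea (the class and method names below are illustrative, not the actual SnapshotResiliencyTests code):

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.BooleanSupplier;

// Simplified sketch of a deterministic task pool with a fake clock:
// tasks run synchronously and "time" advances only by bookkeeping, so
// a multi-minute timeout completes instantly in real time.
public class RunUntilSketch {
    private final Queue<Runnable> tasks = new ArrayDeque<>();
    private long fakeTimeMillis = 0;

    void schedule(Runnable task) {
        tasks.add(task);
    }

    // Drain queued tasks until the condition holds or the fake deadline passes.
    void runUntil(BooleanSupplier condition, long timeoutMillis) {
        final long deadline = fakeTimeMillis + timeoutMillis;
        while (condition.getAsBoolean() == false) {
            if (fakeTimeMillis >= deadline) {
                throw new AssertionError("condition not met within fake timeout");
            }
            Runnable next = tasks.poll();
            if (next != null) {
                next.run();
            }
            fakeTimeMillis += 1; // advance the fake clock, never real time
        }
    }

    public static void main(String[] args) {
        RunUntilSketch pool = new RunUntilSketch();
        boolean[] cleanedUp = new boolean[1];
        pool.schedule(() -> cleanedUp[0] = true);
        // "5 fake minutes" of budget, consumed without any real sleeping.
        pool.runUntil(() -> cleanedUp[0], 5 * 60 * 1000L);
        System.out.println("cleanedUp=" + cleanedUp[0]);
    }
}
```

This is why bumping the timeout from one to five minutes is cheap: the extra budget only matters for the rare seed whose task interleaving takes many fake-clock steps.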
Jenkins run elasticsearch-ci/1 (known task cancel test failure)

Jenkins run elasticsearch-ci/2
    new NodeConnectionsService(clusterService.getSettings(), threadPool, transportService));
    clusterService.getClusterApplierService().start();
    clusterService.getClusterApplierService().setNodeConnectionsService(nodeConnectionsService);
    nodeConnectionsService.start();
Should we also call close in the stop method?
Yea let's do it to prevent stray connection checks.
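The design point here can be shown with a tiny, hypothetical lifecycle sketch (this is not the real NodeConnectionsService API, just an analogue): closing the connections service in the test node's stop path guarantees its periodic connection checks cannot outlive the node.

```java
// Hypothetical minimal sketch: a service whose close() must run during
// node teardown so no stray connection checks keep firing afterwards.
public class LifecycleSketch {
    static class ConnectionsService {
        private boolean started;
        private boolean closed;

        void start() { started = true; }

        // Closing permanently stops the periodic connection checker.
        void close() { started = false; closed = true; }

        boolean isCheckingConnections() { return started && closed == false; }
    }

    static class TestNode {
        final ConnectionsService connections = new ConnectionsService();

        void start() {
            connections.start();
        }

        void stop() {
            // Close here, as suggested in the review, to prevent stray checks.
            connections.close();
        }
    }

    public static void main(String[] args) {
        TestNode node = new TestNode();
        node.start();
        System.out.println("checking=" + node.connections.isCheckingConnections());
        node.stop();
        System.out.println("checking=" + node.connections.isCheckingConnections());
    }
}
```

In a deterministic test harness this matters more than in production code: a leaked background check would keep adding tasks to the shared fake threadpool and could make unrelated assertions flaky.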
    assertThat(snapshotIds, hasSize(1));
    }

    public void testSnapshotDeleteWithMasterFailOvers() {
Thanks Yannick!
… (#55456) * Add Snapshot Resiliency Test for Master Failover during Delete

We only have very indirect coverage of master failovers during snapshot delete at the moment. This commit adds a direct test of this scenario and also an assertion that makes sure we are not leaking any snapshot completion listeners in the snapshots service in this scenario. This gives us better coverage of scenarios like #54256 and makes the diff to the upcoming more consistent snapshot delete implementation in #54705 smaller.