Add Snapshot Resiliency Test for Master Failover during Delete #54866
original-brownbear merged 8 commits into elastic:master from original-brownbear:add-snapshot-delete-master-failover-test
Conversation
We only have very indirect coverage of master failovers during snapshot delete at the moment. This commit adds a direct test of this scenario, plus an assertion that makes sure we are not leaking any snapshot completion listeners in the snapshots service in this scenario. This gives us better coverage of scenarios like #54256 and makes the diff to the upcoming, more consistent snapshot delete implementation in #54705 smaller.
Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)
    assertThat(finalSnapshotsInProgress.entries(), empty());
    final Repository repository = randomMaster.repositoriesService.repository(repoName);
    Collection<SnapshotId> snapshotIds = getRepositoryData(repository).getSnapshotIds();
    assertThat(snapshotIds, either(hasSize(1)).or(hasSize(0)));
Shouldn't it be always hasSize(0) when waitForSnapshot is true?
Right :) I shouldn't have mindlessly copied that from the concurrent snapshots branch. Thanks for spotting :)
As a matter of fact, thanks to recent fixes this is now always 0. Even on master fail-over the deletes are now properly retried :) Adjusted the tests accordingly. Interestingly enough, this turned up one strange spot: a one-in-a-million seed where the cleanup logic would take multiple minutes to complete on the fake threadpool (so it's just fake minutes, but still interesting) => that's why I had to up the timeout there.
I'm investigating why that is now.
@tlrx This was a really stupid bug ... forgot to start the node connections service. This had some strange side effects since it resulted in some transport handlers never failing, causing some CS publications on the failing over master to never complete, causing this test to only move on once the failing master was again removed from the cluster after the 1.5m publication timeout ... behaves much better now.
Should be good for review with 3370c84 now :)
    continueOrDie(cleanupResponse, r -> cleanedUp.set(true));

    - runUntil(cleanedUp::get, TimeUnit.MINUTES.toMillis(1L));
    + runUntil(cleanedUp::get, TimeUnit.MINUTES.toMillis(5L));
I'm investigating why this is taking so long; until then I'm upping the value here, because even though I only found a seed that fails at this spot, I bet there's one where this could fail in one of the other master failover tests as well.
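For readers unfamiliar with the deterministic test framework used here: the `runUntil` idiom drains a task queue against a fake clock, so a "5 minute" timeout costs no real wall-clock time. A minimal, self-contained sketch of that idea (the class and method names below are illustrative, not the actual SnapshotResiliencyTests code):

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.BooleanSupplier;

// Simplified sketch of a deterministic task pool with a fake clock:
// tasks run synchronously and "time" advances only by bookkeeping, so
// a multi-minute timeout completes instantly in real time.
public class RunUntilSketch {
    private final Queue<Runnable> tasks = new ArrayDeque<>();
    private long fakeTimeMillis = 0;

    void schedule(Runnable task) {
        tasks.add(task);
    }

    // Drain queued tasks until the condition holds or the fake deadline passes.
    void runUntil(BooleanSupplier condition, long timeoutMillis) {
        final long deadline = fakeTimeMillis + timeoutMillis;
        while (condition.getAsBoolean() == false) {
            if (fakeTimeMillis >= deadline) {
                throw new AssertionError("condition not met within fake timeout");
            }
            Runnable next = tasks.poll();
            if (next != null) {
                next.run();
            }
            fakeTimeMillis += 1; // advance the fake clock, never real time
        }
    }

    public static void main(String[] args) {
        RunUntilSketch pool = new RunUntilSketch();
        boolean[] cleanedUp = new boolean[1];
        pool.schedule(() -> cleanedUp[0] = true);
        // "5 fake minutes" of budget, consumed without any real sleeping.
        pool.runUntil(() -> cleanedUp[0], 5 * 60 * 1000L);
        System.out.println("cleanedUp=" + cleanedUp[0]);
    }
}
```

This is why bumping the timeout from one to five minutes is cheap: the extra budget only matters for the rare seed whose task interleaving takes many fake-clock steps.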
Jenkins run elasticsearch-ci/1 (known task cancel test failure)

Jenkins run elasticsearch-ci/2
    new NodeConnectionsService(clusterService.getSettings(), threadPool, transportService));
    clusterService.getClusterApplierService().start();
    clusterService.getClusterApplierService().setNodeConnectionsService(nodeConnectionsService);
    nodeConnectionsService.start();
Should we also call close in the stop method?
Yea let's do it to prevent stray connection checks.
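The design point here can be shown with a tiny, hypothetical lifecycle sketch (this is not the real NodeConnectionsService API, just an analogue): closing the connections service in the test node's stop path guarantees its periodic connection checks cannot outlive the node.

```java
// Hypothetical minimal sketch: a service whose close() must run during
// node teardown so no stray connection checks keep firing afterwards.
public class LifecycleSketch {
    static class ConnectionsService {
        private boolean started;
        private boolean closed;

        void start() { started = true; }

        // Closing permanently stops the periodic connection checker.
        void close() { started = false; closed = true; }

        boolean isCheckingConnections() { return started && closed == false; }
    }

    static class TestNode {
        final ConnectionsService connections = new ConnectionsService();

        void start() {
            connections.start();
        }

        void stop() {
            // Close here, as suggested in the review, to prevent stray checks.
            connections.close();
        }
    }

    public static void main(String[] args) {
        TestNode node = new TestNode();
        node.start();
        System.out.println("checking=" + node.connections.isCheckingConnections());
        node.stop();
        System.out.println("checking=" + node.connections.isCheckingConnections());
    }
}
```

In a deterministic test harness this matters more than in production code: a leaked background check would keep adding tasks to the shared fake threadpool and could make unrelated assertions flaky.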
    assertThat(snapshotIds, hasSize(1));
    }

    public void testSnapshotDeleteWithMasterFailOvers() {
Thanks Yannick!
… (#55456) * Add Snapshot Resiliency Test for Master Failover during Delete

We only have very indirect coverage of master failovers during snapshot delete at the moment. This commit adds a direct test of this scenario and also an assertion that makes sure we are not leaking any snapshot completion listeners in the snapshots service in this scenario. This gives us better coverage of scenarios like #54256 and makes the diff to the upcoming more consistent snapshot delete implementation in #54705 smaller.