Fix CoordinatorTests.testUnresponsiveLeaderDetectedEventually#64462

Merged
fcofdez merged 2 commits into elastic:master from
fcofdez:fix-test-unresponsive-leader-detected-eventually-edge-case
Nov 2, 2020

Conversation

Contributor

@fcofdez fcofdez commented Nov 2, 2020

Take into account messy election scenarios in 5-node clusters,
where multiple nodes can trigger an election concurrently, meaning
that it takes longer to stabilize the cluster and elect a leader.

Fixes #63918
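The intuition behind the fix can be sketched as a tiny model. This is hypothetical illustration only (`ConcurrentElections` and `worstCaseRounds` are not names from CoordinatorTests):

```java
// Hypothetical model, not Elasticsearch code: in a 5-node cluster several
// nodes may become candidates at once, but only one candidate can win a
// given term, so each extra simultaneous candidate can cost the cluster
// another election round before a leader is established.
public class ConcurrentElections {
    // Worst case: the clashing candidates fail one at a time, and each
    // aborted attempt costs roughly one full election cycle.
    static int worstCaseRounds(int simultaneousCandidates) {
        return simultaneousCandidates;
    }

    public static void main(String[] args) {
        // If up to 3 of the 5 nodes trigger elections concurrently, the
        // test must budget for ~3 rounds rather than 1.
        System.out.println(worstCaseRounds(3)); // prints 3
    }
}
```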

@fcofdez fcofdez added >test Issues or PRs that are addressing/adding tests :Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. Team:Distributed Meta label for distributed team. v7.11.0 labels Nov 2, 2020
@fcofdez fcofdez requested a review from DaveCTurner November 2, 2020 09:45
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Cluster Coordination)

Member

@DaveCTurner DaveCTurner left a comment

LGTM

Member

@DaveCTurner DaveCTurner left a comment


Hmm actually I think we need another defaultMillis(PUBLISH_TIMEOUT_SETTING) too -- the failing elections all try and publish to the unresponsive leader and must therefore wait for a timeout.

@fcofdez
Contributor Author

fcofdez commented Nov 2, 2020

I'm not sure if we need the additional publish timeout as those publications go through as expected (since they have quorum) or fail due to term bumps. Am I missing something?

@DaveCTurner
Member

Yes, until the master is properly established each publication will go to the unresponsive node and therefore wait for the publish timeout before proceeding. That node is only removed from the cluster once the elections have settled down.

In the failing test, we blackhole node4 at 433835372ms, and then attempt the following publications:

[433867822ms][+    32450ms][node4][o.e.c.c.C.CoordinatorPublication          ] publishing PublishRequest{term=3, version=6, state=cluster uuid: H5lUTwAAQACNnOGK_____w [committed: true]
[433868232ms][+    32860ms][node3][o.e.c.c.C.CoordinatorPublication          ] publishing PublishRequest{term=5, version=6, state=cluster uuid: H5lUTwAAQACNnOGK_____w [committed: true]
[433897846ms][+    62474ms][node4][o.e.c.c.C.CoordinatorPublication          ] publication ended unsuccessfully: Publication{term=3, version=6}
[433898290ms][+    62918ms][node3][o.e.c.c.C.CoordinatorPublication          ] publication ended successfully: Publication{term=5, version=6}
[433898645ms][+    63273ms][node3][o.e.c.c.C.CoordinatorPublication          ] publishing PublishRequest{term=8, version=7, state=cluster uuid: H5lUTwAAQACNnOGK_____w [committed: true]
[433898735ms][+    63363ms][node3][o.e.c.c.C.CoordinatorPublication          ] publication ended unsuccessfully: Publication{term=8, version=7}
[433899213ms][+    63841ms][node1][o.e.c.c.C.CoordinatorPublication          ] publishing PublishRequest{term=10, version=8, state=cluster uuid: H5lUTwAAQACNnOGK_____w [committed: true]
[433929341ms][+    93969ms][node1][o.e.c.c.C.CoordinatorPublication          ] publication ended successfully: Publication{term=10, version=8}
[433929686ms][+    94314ms][node1][o.e.c.c.C.CoordinatorPublication          ] publishing PublishRequest{term=13, version=9, state=cluster uuid: H5lUTwAAQACNnOGK_____w [committed: true]
[433959731ms][+   124359ms][node1][o.e.c.c.C.CoordinatorPublication          ] publication ended successfully: Publication{term=13, version=9}
[433959731ms][+   124359ms][node1][o.e.c.c.C.CoordinatorPublication          ] publishing PublishRequest{term=13, version=10, state=cluster uuid: H5lUTwAAQACNnOGK_____w [committed: true]
[433989746ms][+   154374ms][node1][o.e.c.c.C.CoordinatorPublication          ] publication ended successfully: Publication{term=13, version=10}
[433989826ms][+   154454ms][node1][o.e.c.c.C.CoordinatorPublication          ] publishing PublishRequest{term=13, version=11, state=cluster uuid: H5lUTwAAQACNnOGK_____w [committed: true]
[433989975ms][+   154603ms][node1][o.e.c.c.C.CoordinatorPublication          ] publication ended successfully: Publication{term=13, version=11}
[433990058ms][+   154686ms][node1][o.e.c.c.C.CoordinatorPublication          ] publishing PublishRequest{term=13, version=12, state=cluster uuid: H5lUTwAAQACNnOGK_____w [committed: true]
[433990366ms][+   154994ms][node1][o.e.c.c.C.CoordinatorPublication          ] publication ended successfully: Publication{term=13, version=12}

(the second column shows times relative to the blackhole time)

Until version 11, all publications go to node4 and therefore take ≥30 seconds because they hit the publish timeout. In particular, this includes the clashing publication of version 6 which we want to account for here.
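To make the arithmetic concrete, here is a hedged sketch, not the actual CoordinatorTests code: the names are invented, and the 30s figure is the default value of `cluster.publish.timeout` baked in as an assumption.

```java
// Hypothetical sketch, not the actual test code. Each publication whose
// target set includes the blackholed node blocks its publisher for the
// full publish timeout (assumed 30s default) before the next election
// round can make progress, so the stabilisation budget must absorb one
// publish timeout per such publication.
public class StabilisationBudget {
    static final long PUBLISH_TIMEOUT_MILLIS = 30_000; // assumed default

    static long extraDelayMillis(int publicationsToUnresponsiveNode) {
        return publicationsToUnresponsiveNode * PUBLISH_TIMEOUT_MILLIS;
    }

    public static void main(String[] args) {
        // In the log above, the publications up to version 10 include
        // node4: on the order of five ~30s waits before node4 is removed.
        System.out.println(extraDelayMillis(5)); // prints 150000
    }
}
```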

@fcofdez
Contributor Author

fcofdez commented Nov 2, 2020

Thanks for the explanation @DaveCTurner! I was missing that a publication waits until the timeout to complete even if it has achieved a quorum. I've updated the PR.

Member

@DaveCTurner DaveCTurner left a comment


LGTM

@fcofdez fcofdez merged commit e29fc62 into elastic:master Nov 2, 2020
fcofdez added a commit to fcofdez/elasticsearch that referenced this pull request Nov 2, 2020
Take into account messy election scenarios in 5-node clusters,
where multiple nodes can trigger an election concurrently, meaning
that it takes longer to stabilize the cluster and elect a leader.

Fixes elastic#63918
Backport of elastic#64462
@fcofdez fcofdez added the v8.0.0 label Nov 2, 2020
fcofdez added a commit that referenced this pull request Nov 2, 2020
Take into account messy election scenarios in 5-node clusters,
where multiple nodes can trigger an election concurrently, meaning
that it takes longer to stabilize the cluster and elect a leader.

Fixes #63918
Backport of #64462
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Oct 20, 2021
Today we require the cluster to stabilise in a time period that allows
time for the first election to encounter conflicts. However on very rare
occasions there might be an election conflict in the second election
too. This commit extends the stabilisation timeout to allow for this.

Similar to elastic#64462
Closes elastic#78370
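The budget extension could be sketched like this. This is a hypothetical helper with assumed default timeouts (`cluster.election.max_timeout` 10s, `cluster.publish.timeout` 30s); the real test derives its values from the settings' `defaultMillis`:

```java
// Hypothetical sketch of the budget extension, not the actual test code.
// Allowing for a second conflicting election means budgeting two
// conflict rounds instead of one, each costing up to one election
// backoff plus one publish timeout.
public class ElectionRounds {
    static final long ELECTION_MAX_TIMEOUT_MILLIS = 10_000; // assumed default
    static final long PUBLISH_TIMEOUT_MILLIS = 30_000;      // assumed default

    static long stabilisationBudgetMillis(long baseMillis, int conflictingRounds) {
        return baseMillis + conflictingRounds
                * (ELECTION_MAX_TIMEOUT_MILLIS + PUBLISH_TIMEOUT_MILLIS);
    }

    public static void main(String[] args) {
        long oneConflict  = stabilisationBudgetMillis(60_000, 1); // 100000
        long twoConflicts = stabilisationBudgetMillis(60_000, 2); // 140000
        // Extra budget needed for the rare second conflict:
        System.out.println(twoConflicts - oneConflict); // prints 40000
    }
}
```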
elasticsearchmachine pushed a commit that referenced this pull request Oct 28, 2021
Today we require the cluster to stabilise in a time period that allows
time for the first election to encounter conflicts. However on very rare
occasions there might be an election conflict in the second election
too. This commit extends the stabilisation timeout to allow for this.
Similar to #64462. Closes #78370.
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Oct 28, 2021
Today we require the cluster to stabilise in a time period that allows
time for the first election to encounter conflicts. However on very rare
occasions there might be an election conflict in the second election
too. This commit extends the stabilisation timeout to allow for this.
Similar to elastic#64462. Closes elastic#78370.
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Oct 28, 2021
Today we require the cluster to stabilise in a time period that allows
time for the first election to encounter conflicts. However on very rare
occasions there might be an election conflict in the second election
too. This commit extends the stabilisation timeout to allow for this.
Similar to elastic#64462. Closes elastic#78370.
elasticsearchmachine pushed a commit that referenced this pull request Nov 5, 2021
Today we require the cluster to stabilise in a time period that allows
time for the first election to encounter conflicts. However on very rare
occasions there might be an election conflict in the second election
too. This commit extends the stabilisation timeout to allow for this.
Similar to #64462. Closes #78370.

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
elasticsearchmachine pushed a commit that referenced this pull request Nov 5, 2021
Today we require the cluster to stabilise in a time period that allows
time for the first election to encounter conflicts. However on very rare
occasions there might be an election conflict in the second election
too. This commit extends the stabilisation timeout to allow for this.
Similar to #64462. Closes #78370.

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
