Fix CoordinatorTests.testUnresponsiveLeaderDetectedEventually#64462

Merged
fcofdez merged 2 commits into elastic:master from
fcofdez:fix-test-unresponsive-leader-detected-eventually-edge-case
Nov 2, 2020

Conversation

Contributor

@fcofdez fcofdez commented Nov 2, 2020

Take into account messy election scenarios in 5-node clusters,
where multiple nodes can trigger an election concurrently, meaning
that it takes longer to stabilize the cluster and elect a leader.

Fixes #63918
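The intuition behind the fix can be sketched as a tiny model. This is hypothetical illustration only (`ConcurrentElections` and `worstCaseRounds` are not names from CoordinatorTests):

```java
// Hypothetical model, not Elasticsearch code: in a 5-node cluster several
// nodes may become candidates at once, but only one candidate can win a
// given term, so each extra simultaneous candidate can cost the cluster
// another election round before a leader is established.
public class ConcurrentElections {
    // Worst case: the clashing candidates fail one at a time, and each
    // aborted attempt costs roughly one full election cycle.
    static int worstCaseRounds(int simultaneousCandidates) {
        return simultaneousCandidates;
    }

    public static void main(String[] args) {
        // If up to 3 of the 5 nodes trigger elections concurrently, the
        // test must budget for ~3 rounds rather than 1.
        System.out.println(worstCaseRounds(3)); // prints 3
    }
}
```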

@fcofdez fcofdez added >test Issues or PRs that are addressing/adding tests :Distributed/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. Team:Distributed Meta label for distributed team. v7.11.0 labels Nov 2, 2020
@fcofdez fcofdez requested a review from DaveCTurner November 2, 2020 09:45
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Cluster Coordination)

Member

@DaveCTurner DaveCTurner left a comment

LGTM

Member

@DaveCTurner DaveCTurner left a comment


Hmm actually I think we need another defaultMillis(PUBLISH_TIMEOUT_SETTING) too -- the failing elections all try and publish to the unresponsive leader and must therefore wait for a timeout.

@fcofdez
Contributor Author

fcofdez commented Nov 2, 2020

I'm not sure if we need the additional publish timeout as those publications go through as expected (since they have quorum) or fail due to term bumps. Am I missing something?

@DaveCTurner
Member

Yes, until the master is properly established each publication will go to the unresponsive node and therefore wait for the publish timeout before proceeding. That node is only removed from the cluster once the elections have settled down.

In the failing test, we blackhole node4 at 433835372ms, and then attempt the following publications:

[433867822ms][+    32450ms][node4][o.e.c.c.C.CoordinatorPublication          ] publishing PublishRequest{term=3, version=6, state=cluster uuid: H5lUTwAAQACNnOGK_____w [committed: true]
[433868232ms][+    32860ms][node3][o.e.c.c.C.CoordinatorPublication          ] publishing PublishRequest{term=5, version=6, state=cluster uuid: H5lUTwAAQACNnOGK_____w [committed: true]
[433897846ms][+    62474ms][node4][o.e.c.c.C.CoordinatorPublication          ] publication ended unsuccessfully: Publication{term=3, version=6}
[433898290ms][+    62918ms][node3][o.e.c.c.C.CoordinatorPublication          ] publication ended successfully: Publication{term=5, version=6}
[433898645ms][+    63273ms][node3][o.e.c.c.C.CoordinatorPublication          ] publishing PublishRequest{term=8, version=7, state=cluster uuid: H5lUTwAAQACNnOGK_____w [committed: true]
[433898735ms][+    63363ms][node3][o.e.c.c.C.CoordinatorPublication          ] publication ended unsuccessfully: Publication{term=8, version=7}
[433899213ms][+    63841ms][node1][o.e.c.c.C.CoordinatorPublication          ] publishing PublishRequest{term=10, version=8, state=cluster uuid: H5lUTwAAQACNnOGK_____w [committed: true]
[433929341ms][+    93969ms][node1][o.e.c.c.C.CoordinatorPublication          ] publication ended successfully: Publication{term=10, version=8}
[433929686ms][+    94314ms][node1][o.e.c.c.C.CoordinatorPublication          ] publishing PublishRequest{term=13, version=9, state=cluster uuid: H5lUTwAAQACNnOGK_____w [committed: true]
[433959731ms][+   124359ms][node1][o.e.c.c.C.CoordinatorPublication          ] publication ended successfully: Publication{term=13, version=9}
[433959731ms][+   124359ms][node1][o.e.c.c.C.CoordinatorPublication          ] publishing PublishRequest{term=13, version=10, state=cluster uuid: H5lUTwAAQACNnOGK_____w [committed: true]
[433989746ms][+   154374ms][node1][o.e.c.c.C.CoordinatorPublication          ] publication ended successfully: Publication{term=13, version=10}
[433989826ms][+   154454ms][node1][o.e.c.c.C.CoordinatorPublication          ] publishing PublishRequest{term=13, version=11, state=cluster uuid: H5lUTwAAQACNnOGK_____w [committed: true]
[433989975ms][+   154603ms][node1][o.e.c.c.C.CoordinatorPublication          ] publication ended successfully: Publication{term=13, version=11}
[433990058ms][+   154686ms][node1][o.e.c.c.C.CoordinatorPublication          ] publishing PublishRequest{term=13, version=12, state=cluster uuid: H5lUTwAAQACNnOGK_____w [committed: true]
[433990366ms][+   154994ms][node1][o.e.c.c.C.CoordinatorPublication          ] publication ended successfully: Publication{term=13, version=12}

(the second column shows times relative to the blackhole time)

Until version 11, all publications go to node4 and therefore take ≥30 seconds because they hit the publish timeout. In particular, this includes the clashing publication of version 6 which we want to account for here.
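To make the arithmetic concrete, here is a hedged sketch, not the actual CoordinatorTests code: the names are invented, and the 30s figure is the default value of `cluster.publish.timeout` baked in as an assumption.

```java
// Hypothetical sketch, not the actual test code. Each publication whose
// target set includes the blackholed node blocks its publisher for the
// full publish timeout (assumed 30s default) before the next election
// round can make progress, so the stabilisation budget must absorb one
// publish timeout per such publication.
public class StabilisationBudget {
    static final long PUBLISH_TIMEOUT_MILLIS = 30_000; // assumed default

    static long extraDelayMillis(int publicationsToUnresponsiveNode) {
        return publicationsToUnresponsiveNode * PUBLISH_TIMEOUT_MILLIS;
    }

    public static void main(String[] args) {
        // In the log above, the publications up to version 10 include
        // node4: on the order of five ~30s waits before node4 is removed.
        System.out.println(extraDelayMillis(5)); // prints 150000
    }
}
```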

@fcofdez
Contributor Author

fcofdez commented Nov 2, 2020

Thanks for the explanation @DaveCTurner! I was missing that a publication waits until the timeout to complete even if it has achieved a quorum. I've updated the PR.

Member

@DaveCTurner DaveCTurner left a comment


LGTM

@fcofdez fcofdez merged commit e29fc62 into elastic:master Nov 2, 2020
fcofdez added a commit to fcofdez/elasticsearch that referenced this pull request Nov 2, 2020
Take into account messy election scenarios in 5-node clusters,
where multiple nodes can trigger an election concurrently, meaning
that it takes longer to stabilize the cluster and elect a leader.

Fixes elastic#63918
Backport of elastic#64462
@fcofdez fcofdez added the v8.0.0 label Nov 2, 2020
fcofdez added a commit that referenced this pull request Nov 2, 2020
Take into account messy election scenarios in 5-node clusters,
where multiple nodes can trigger an election concurrently, meaning
that it takes longer to stabilize the cluster and elect a leader.

Fixes #63918
Backport of #64462
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Oct 20, 2021
Today we require the cluster to stabilise in a time period that allows
time for the first election to encounter conflicts. However on very rare
occasions there might be an election conflict in the second election
too. This commit extends the stabilisation timeout to allow for this.

Similar to elastic#64462
Closes elastic#78370
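The budget extension could be sketched like this. This is a hypothetical helper with assumed default timeouts (`cluster.election.max_timeout` 10s, `cluster.publish.timeout` 30s); the real test derives its values from the settings' `defaultMillis`:

```java
// Hypothetical sketch of the budget extension, not the actual test code.
// Allowing for a second conflicting election means budgeting two
// conflict rounds instead of one, each costing up to one election
// backoff plus one publish timeout.
public class ElectionRounds {
    static final long ELECTION_MAX_TIMEOUT_MILLIS = 10_000; // assumed default
    static final long PUBLISH_TIMEOUT_MILLIS = 30_000;      // assumed default

    static long stabilisationBudgetMillis(long baseMillis, int conflictingRounds) {
        return baseMillis + conflictingRounds
                * (ELECTION_MAX_TIMEOUT_MILLIS + PUBLISH_TIMEOUT_MILLIS);
    }

    public static void main(String[] args) {
        long oneConflict  = stabilisationBudgetMillis(60_000, 1); // 100000
        long twoConflicts = stabilisationBudgetMillis(60_000, 2); // 140000
        // Extra budget needed for the rare second conflict:
        System.out.println(twoConflicts - oneConflict); // prints 40000
    }
}
```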
elasticsearchmachine pushed a commit that referenced this pull request Oct 28, 2021
Today we require the cluster to stabilise in a time period that allows
time for the first election to encounter conflicts. However on very rare
occasions there might be an election conflict in the second election
too. This commit extends the stabilisation timeout to allow for this.
Similar to #64462. Closes #78370.
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Oct 28, 2021
Today we require the cluster to stabilise in a time period that allows
time for the first election to encounter conflicts. However on very rare
occasions there might be an election conflict in the second election
too. This commit extends the stabilisation timeout to allow for this.
Similar to elastic#64462. Closes elastic#78370.
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Oct 28, 2021
Today we require the cluster to stabilise in a time period that allows
time for the first election to encounter conflicts. However on very rare
occasions there might be an election conflict in the second election
too. This commit extends the stabilisation timeout to allow for this.
Similar to elastic#64462. Closes elastic#78370.
elasticsearchmachine pushed a commit that referenced this pull request Nov 5, 2021
Today we require the cluster to stabilise in a time period that allows
time for the first election to encounter conflicts. However on very rare
occasions there might be an election conflict in the second election
too. This commit extends the stabilisation timeout to allow for this.
Similar to #64462. Closes #78370.

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
elasticsearchmachine pushed a commit that referenced this pull request Nov 5, 2021
Today we require the cluster to stabilise in a time period that allows
time for the first election to encounter conflicts. However on very rare
occasions there might be an election conflict in the second election
too. This commit extends the stabilisation timeout to allow for this.
Similar to #64462. Closes #78370.

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
