Fix CoordinatorTests.testUnresponsiveLeaderDetectedEventually #64462
Take into account messy scenarios in 5-node cluster elections, where multiple nodes can trigger an election concurrently, meaning that it takes longer to stabilise the cluster and elect a leader. Fixes elastic#63918
Pinging @elastic/es-distributed (:Distributed/Cluster Coordination)
DaveCTurner
left a comment
Hmm actually I think we need another defaultMillis(PUBLISH_TIMEOUT_SETTING) too -- the failing elections all try and publish to the unresponsive leader and must therefore wait for a timeout.
I'm not sure if we need the additional publish timeout, as those publications go through as expected (since they have quorum) or fail due to term bumps. Am I missing something?
Yes, until the master is properly established each publication will go to the unresponsive node and therefore wait for the publish timeout before proceeding. That node is only removed from the cluster once the elections have settled down. In the failing test, we blackhole (second column is times relative to the blackhole time) Until version 11, all publications go to
Thanks for the explanation, @DaveCTurner! I was missing that a publication waits until the publish timeout before completing, even if it already has a quorum. I've updated the PR.
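For readers following the thread, the arithmetic behind the extra term can be sketched as below. This is only an illustration under assumptions: `StabilisationBudget`, its method, and its parameters are hypothetical names that do not exist in Elasticsearch, and the actual expression in `CoordinatorTests` is built from the test's own helpers such as `defaultMillis(PUBLISH_TIMEOUT_SETTING)`. The idea is simply that each publication that has to wait on the unresponsive node adds one publish timeout to the time the cluster may need to stabilise.

```java
import java.time.Duration;

// Hypothetical helper, NOT part of Elasticsearch: it only illustrates the
// arithmetic discussed in this conversation.
final class StabilisationBudget {
    private StabilisationBudget() {}

    /**
     * @param baseStabilisation   time allowed if every publication completed promptly (assumed input)
     * @param publishTimeout      how long a publication waits on an unresponsive node
     * @param stalledPublications number of publications expected to hit that timeout (assumed input)
     * @return the base allowance plus one publish timeout per stalled publication
     */
    static Duration withPublishTimeouts(Duration baseStabilisation,
                                        Duration publishTimeout,
                                        int stalledPublications) {
        if (stalledPublications < 0) {
            throw new IllegalArgumentException("stalledPublications must be non-negative");
        }
        // Each stalled publication delays progress by a full publish timeout.
        return baseStabilisation.plus(publishTimeout.multipliedBy(stalledPublications));
    }
}
```

With made-up inputs of a 10 second base allowance and a 30 second publish timeout, one stalled publication gives a 40 second budget and two give 70 seconds, which is why a run with extra election conflicts needs a longer stabilisation window.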
Take into account messy scenarios in 5-node cluster elections, where multiple nodes can trigger an election concurrently, meaning that it takes longer to stabilize the cluster and elect a leader. Fixes elastic#63918 Backport of elastic#64462
Today we require the cluster to stabilise in a time period that allows time for the first election to encounter conflicts. However on very rare occasions there might be an election conflict in the second election too. This commit extends the stabilisation timeout to allow for this. Similar to elastic#64462 Closes elastic#78370
Today we require the cluster to stabilise in a time period that allows time for the first election to encounter conflicts. However on very rare occasions there might be an election conflict in the second election too. This commit extends the stabilisation timeout to allow for this. Similar to #64462 Closes #78370
Today we require the cluster to stabilise in a time period that allows time for the first election to encounter conflicts. However on very rare occasions there might be an election conflict in the second election too. This commit extends the stabilisation timeout to allow for this. Similar to #64462 Closes #78370 Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
Take into account messy scenarios in 5-node cluster elections
where multiple nodes can trigger an election concurrently, meaning
that it takes longer to stabilize the cluster and elect a leader.
Fixes #63918