Start failover without waiting for the next cluster cron cycle #2209
An automatic failover has a fixed delay of 500ms. Its main purpose is
to wait for the FAIL packet to propagate in the cluster so that other
primaries can respond to the vote request.
```
server.cluster->failover_auth_time =
now +
500 + /* Fixed delay of 500 milliseconds, let FAIL msg propagate. */
random() % 500; /* Random delay between 0 and 500 milliseconds. */
```
A fixed delay of 500ms is rather long nowadays, so we are looking into
removing it for a faster failover, provided we can ensure that removing
it is safe.
Currently, a replica can only learn of the death of its primary
node from two places:
1. Collecting PFAIL reports from other primary nodes, and then checking
in markNodeAsFailingIfNeeded whether we have the quorum.
2. Receiving FAIL messages from other primary nodes (or replica nodes).
In both cases, the failing state is triggered by collecting failure reports
from primaries.
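The quorum check mentioned in item 1 can be sketched as follows. This is only an illustration that assumes a simple majority of voting primaries is required; the names `failure_view`, `needed_quorum` and `has_fail_quorum` are hypothetical and are not taken from the Valkey source.

```c
#include <stdbool.h>

/* Hypothetical, simplified view of what a node knows about a suspected peer. */
typedef struct {
    int voting_primaries; /* primaries that take part in voting (assumed > 0) */
    int failure_reports;  /* distinct primaries that reported the peer as PFAIL */
} failure_view;

/* Majority quorum: more than half of the voting primaries. */
static int needed_quorum(int voting_primaries) {
    return voting_primaries / 2 + 1;
}

/* True when enough primaries agree, so the peer can be flagged FAIL. */
static bool has_fail_quorum(const failure_view *v) {
    return v->failure_reports >= needed_quorum(v->voting_primaries);
}
```

With five voting primaries, for example, two reports are not enough and the third one reaches the quorum.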
In markNodeAsFailingIfNeeded, if `nodeIsReplica(myself) && myself->replicaof == node`
is true, we can try to trigger a failover in clusterBeforeSleep as soon as
possible without waiting for clusterCron. Moreover, we can remove the 500ms delay, because
we broadcast a FAIL here, which means that myself's FAILOVER_AUTH request is always
sent after the FAIL, so we don't have to wait 500ms for the FAIL to propagate.
```
void markNodeAsFailingIfNeeded(clusterNode *node) {
...
/* Broadcast the failing node name to everybody, forcing all the other
* reachable nodes to flag the node as FAIL.
* We do that even if this node is a replica and not a primary: anyway
* the failing state is triggered collecting failure reports from primaries,
* so here the replica is only helping propagating this status. */
clusterSendFail(node->name);
clusterDoBeforeSleep(CLUSTER_TODO_UPDATE_STATE | CLUSTER_TODO_SAVE_CONFIG);
```
Moreover, if we do the same when receiving a FAIL (check whether the failing node is
myself's primary and call clusterSendFail to broadcast the FAIL), it looks like
we can also remove the 500ms delay. This resembles a worst-case version of the first
case, as if all the replicas had collected the PFAIL reports and then sent the FAIL.
```
} else if (type == CLUSTERMSG_TYPE_FAIL) {
clusterNode *failing;
if (sender) {
failing = clusterLookupNode(hdr->data.fail.about.nodename, CLUSTER_NAMELEN);
if (failing && !(failing->flags & (CLUSTER_NODE_FAIL | CLUSTER_NODE_MYSELF))) {
serverLog(LL_NOTICE, "FAIL message received from %.40s (%s) about %.40s (%s)", hdr->sender,
sender->human_nodename, hdr->data.fail.about.nodename, failing->human_nodename);
failing->flags |= CLUSTER_NODE_FAIL;
failing->fail_time = now;
failing->flags &= ~CLUSTER_NODE_PFAIL;
clusterDoBeforeSleep(CLUSTER_TODO_SAVE_CONFIG | CLUSTER_TODO_UPDATE_STATE);
}
```
So in this commit, we try to remove the fixed delay of 500ms. For now, we
keep the random 0-500ms delay and use the ranking to avoid vote conflicts.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Codecov Report
Coverage diff, unstable vs. #2209:
Coverage: 71.54% -> 71.53% (-0.01%)
Files: 122 -> 122
Lines: 66491 -> 66501 (+10)
Hits: 47570 -> 47573 (+3)
Misses: 18921 -> 18928 (+7)
In the time between scheduling a failover and starting the vote, the replica sends its offset to the other replicas in the same shard, to make sure the nodes agree on the rank before the failover starts.

    /* Now that we have a scheduled election, broadcast our offset
     * to all the other replicas so that they'll updated their offsets
     * if our offset is better. */
    clusterBroadcastPong(CLUSTER_BROADCAST_LOCAL_REPLICAS);

Assuming the other replicas also receive the FAIL, they will also broadcast their offsets to the other replicas in the same shard, so all replicas will have updated offsets before any replica actually starts the vote.
This code updates the rank and the delay if some replica reports a better offset:

    /* It is possible that we received more updated offsets from other
     * replicas for the same primary since we computed our election delay.
     * Update the delay if our rank changed.
     *
     * It is also possible that we received the message that telling a
     * shard is up. Update the delay if our failed_primary_rank changed.
     *
     * Not performed if this is a manual failover. */
    if (server.cluster->failover_auth_sent == 0 && server.cluster->mf_end == 0) {
        int newrank = clusterGetReplicaRank();
        if (newrank != server.cluster->failover_auth_rank) {
            long long added_delay = (newrank - server.cluster->failover_auth_rank) * 1000;
            server.cluster->failover_auth_time += added_delay;
            server.cluster->failover_auth_rank = newrank;
            serverLog(LL_NOTICE, "Replica rank updated to #%d, added %lld milliseconds of delay.", newrank,
                      added_delay);
        }

With this PR, the delay can be near 0, so there is not enough time for the replicas to agree about the rank. In the worst case, two replicas both believe they are rank 0 and start the failover at the same time, and none of them will win the election, causing even more downtime before we have a successful failover.
Do we need a fixed delay for this case, or is there some other mechanism to avoid this race between replicas?
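To make the arithmetic in the quoted rank update concrete, here is a small stand-alone sketch. It only mirrors the two fields used in that snippet; `election_state` and `apply_rank_change` are hypothetical names, and the real code operates on `server.cluster` instead.

```c
/* Hypothetical, stripped-down election state. The field names mirror the
 * quoted snippet, but this is not the real cluster state struct. */
typedef struct {
    long long failover_auth_time; /* scheduled election start, in ms */
    int failover_auth_rank;       /* rank used when the delay was computed */
} election_state;

/* Applies the rule from the quoted code: shift the scheduled start by
 * 1000 ms per rank position gained or lost. Returns the delay that was
 * added (negative when the rank improved). */
static long long apply_rank_change(election_state *st, int newrank) {
    long long added_delay = (long long)(newrank - st->failover_auth_rank) * 1000;
    st->failover_auth_time += added_delay;
    st->failover_auth_rank = newrank;
    return added_delay;
}
```

A replica scheduled at t = 10750 ms with rank 0 that learns it is actually rank 1 moves its election to t = 11750 ms. If the scheduled time has already passed before the better offset arrives, the adjustment comes too late, which is the race described in the comment.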
We will always race without a consensus algorithm like Raft or Paxos with the data flowing through it. It's just a question of what degree of raciness we are willing to tolerate. We could wait for an affirmative ping from the other replicas that they believe their primary has failed (it will be in their gossip as PFAIL) before starting to nominate ourselves, if we have the highest offset or 500ms has elapsed.
Yes, Viktor and I talked about this offline yesterday. I think we can do it with another fixed number, like the [...]. And with this fixed 500ms delay, I wonder whether the current mechanism will be safe enough; if we determine that it is indeed safe, then I would love to remove it.
@madolson Sure, but we try to make the race less likely to happen. We have the rank, and now also the primary rank, which helps avoid a race if two primaries fail at the same time.
Do we do this? Whenever we get a pong from all replicas while waiting, we abort the 500ms wait and start the election immediately? I haven't seen this. Can you point to the code?
My understanding of this requirement is to change the current 500 ms delay (the fixed value) into a config parameter and let the user decide the value (keeping the default at 500 ms). Added to the Valkey weekly meeting.
No, I was proposing that as something we could do. I think I missed a word in my original comment. We are waiting for the FAIL message from the other replicas to make sure we have their repl offset to compute the rank. If a given replica has received that information from all other replicas, it doesn't need to wait.
We discussed this in the weekly meeting. The discussion was that we don't want to fully remove the 500ms logic, since we believe we do need to wait to get the messages from the other replicas to verify we have the best replication offset. We need some verification that we have the offsets from the other replicas before starting the election.
By the way, if we only have one replica, that means my rank is definitely 0 and my offset is definitely the highest one. The waiting is just a waste in this case.
Assuming the comments are correct and we are only waiting for the FAIL to propagate, or if we otherwise consider it safe, then I would like to remove it completely. Or there are other options:
Anyway, bottom line: if we feel safe, then I want to remove it. If we are not sure, that's fine; after all, stability is also very important.
@enjoy-binbin we can do Madelyn's idea. We wait but when we receive information from all replicas that they know the primary is failing and we get their offset, we start immediately. It can be very fast.
OK, that is a good idea, and I think we are making progress step by step. It can definitely solve the one-replica case and help other cases. I will do it in a new PR. New PR: #2227
In valkey-io#2209, we are exploring ways to make failover faster, that is, to minimize the delay.

The purpose of this delay is to allow us to have a scheduled election, so that during the delay, there is a way for the cluster to get more detailed information. For example, other primary nodes need to receive the propagation of FAIL in order to vote, and we need to know the offset of other replica nodes in order to rank.

We want to minimize delay while ensuring safety. It is very useful, for example, if there is only one replica: then we don't need any delay.

In this PR, when we find that the replica is the best ranked replica, we let it initiate a failover immediately and completely remove the delay.

How to ensure safety?

1. We try to broadcast a FAIL message to the cluster before the replica initiates the election, so that other primaries will not refuse to vote.
2. Each replica node will mark in flags whether its primary node is in FAIL state, that is, a new CLUSTER_NODE_MY_PRIMARY_FAIL is introduced. And when we gossip with others, the replica can know whether all replicas under the primary node have reached a consensus on the FAIL state based on this flag. If they have reached a consensus, it means that we know the offset information of all replicas, which means that we can calculate the rank based on the existing information without worrying about the offset being outdated.

Signed-off-by: Binbin <binloveplay1314@qq.com>
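Point 1 above exists because of a delivery-order race on each voting primary. The sketch below illustrates that race under a simplifying assumption: a voter grants its vote only if it already considers the requester's primary to be FAIL. The flag value and the names `node_view`, `voter_accepts_request` and `vote_granted` are hypothetical; the real voting logic checks more conditions than this.

```c
#include <stdbool.h>

#define NODE_FLAG_FAIL (1 << 0) /* hypothetical bit: voter considers the node failed */

/* What one voting primary knows about the requester's primary. */
typedef struct {
    int flags;
} node_view;

/* Simplified voter-side rule: only vote for a replica whose primary the
 * voter has already flagged as FAIL. */
static bool voter_accepts_request(const node_view *requesters_primary) {
    return (requesters_primary->flags & NODE_FLAG_FAIL) != 0;
}

/* Simulates the two possible delivery orders on a single voter:
 * the FAIL arrives before the auth request, or after it. */
static bool vote_granted(bool fail_arrives_first) {
    node_view primary = {0};
    if (fail_arrives_first) primary.flags |= NODE_FLAG_FAIL;
    return voter_accepts_request(&primary);
}
```

In this model the vote is lost whenever the auth request overtakes the FAIL, which is why the replica wants a FAIL to be on its way to every primary before it asks for votes.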
…r failover" This reverts commit 00d4b4c.
Signed-off-by: Binbin <binloveplay1314@qq.com>
…s FAIL Signed-off-by: Binbin <binloveplay1314@qq.com>
@zuiderkwast I reverted the 500ms change and only kept the TODO FAILOVER logic. (I guess we can ignore the DCO for now; I can squash in a sign-off when I merge it.)
zuiderkwast left a comment
Change LGTM. Now it's only to skip the wait for the next clusterCron cycle.
I think you can change the title from "as soon as possible" to for example something like "Start failover without waiting for the next cluster cron cycle".
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Signed-off-by: Binbin <binloveplay1314@qq.com>
@enjoy-binbin @zuiderkwast Could you explain the rationale behind this change better? I don't understand the benefit of this very clearly. Let me share my reasoning. I believe that in a single primary failure scenario with just one replica, this change could potentially help with faster failover; however, that is also not guaranteed if the healthy primaries are not yet aware of the failure of the primary for which the vote is requested. Further, if there are multiple replicas in a given shard, it's very likely they would conflict with each other and divide the votes amongst them, which isn't ideal. It's also possible that a replica with a lower offset could be elected. And in a multiple primary failure scenario, we would again divide the votes for a given epoch amongst different shards, wasting an election cycle and introducing delay. I would also like the @valkey-io/core-team to share their opinion on this.
@hpatro This PR is now very minor. It doesn't remove the 500 delay nor the random delay. It only removed the wait for next cluster cron cycle, which I believe is not an intentional delay. Maybe you want to discuss this one instead? |
…#2227) In #2023 (#2209, etc.), we are exploring ways to make failover faster, that is, to minimize the delay.

When a node is marked as FAIL and before the failover starts, there is a delay of 500-1000ms. The original purpose of this delay:

1. Allow FAIL to propagate to at least a majority of the primaries. This makes sure they will vote when a replica sends a failover auth request.
2. Allow replicas to exchange their offsets, so they will have a correct view of their own rank.

We want to minimize this delay while ensuring safety. It is useful, for example, in these cases:

1. If there is only one replica, then we don't need any delay, or
2. If there are more replicas, with a fast network, the replicas can exchange the offsets very quickly and start the failover within a few milliseconds instead of 500-1000ms.

In this PR, when we can be sure that the replica is the best ranked replica, we let it initiate a failover immediately and completely remove the delay.

### How to ensure safety?

1. To make sure this replica has the best rank, it only skips the delay if it is sure that it has the best rank and that all replicas in the same shard agree that the primary is failing. A new flag `CLUSTER_NODE_MY_PRIMARY_FAIL` is introduced to indicate that each replica has marked its primary as FAIL. If all replicas say that the primary is failing, we also know that the offset is not being updated, because the offset does not increment while the primary is failing. We can skip the delay only if we have received a message from all replicas and they all have set this flag.
2. To make sure the primaries will vote even if they didn't receive the FAIL yet, we use the `CLUSTERMSG_FLAG0_FORCEACK` flag. This is equivalent to broadcasting a FAIL message to all primaries before we broadcast the failover auth request (but cheaper).

The race between FAIL (broadcast by A) and AUTH REQUEST (broadcast by R) is illustrated in the following sequence diagram:

```
A      R        B      C
|      |        |      |
| FAIL |        |      |
|----->| AUTH R.|      |
|      |------->|      |
| FAIL |        |      |
|-------------->|      |
|      | AUTH R.|      |
|      |-------------->|
| FAIL |        |      |
|--------------------->|
```

### Details

This is how the failover is initiated, with new steps marked with **(new)**:

1. A majority of primaries have marked another primary as PFAIL.
2. Some node counts failure reports and marks the failing primary as FAIL. The node that detects FAIL broadcasts it to all nodes in the cluster.
3. When a replica receives FAIL (or detects FAIL itself by counting PFAIL reports) it schedules a failover:
   a. It sets a timeout (500ms + random 0-500ms).
   b. It broadcasts pong to the other replicas in the same shard.
   c. **(new)** The pong (actually the clusterMsg header) has a new flag `CLUSTER_NODE_MY_PRIMARY_FAIL`. When the replicas broadcast pong to each other here, this flag is set.
4. **(new)** When the following conditions are met, skip the remaining delay and start the failover using AUTH REQUEST with the FORCE ACK flag set, that is if
   a. a PONG is received from every other replica in the same shard (broadcast within the shard) and
   b. all replicas have marked that their primary is FAIL in their last message (the new `CLUSTER_NODE_MY_PRIMARY_FAIL` flag is set) and
   c. this is the best replica (rank = 0) and
   d. my replication offset != 0.
5. When the delay has passed and no other replica has initiated failover, then initiate failover.

Notes:

* With 3(c), we don't need to wait for FAIL to propagate to all voting primaries. At this point, a FAIL has already been broadcast by some node, but there is a race so our auth request may arrive at some node before the FAIL. Using the FORCE ACK flag ensures the primaries will vote for us. (It is equivalent to broadcasting another FAIL just before broadcasting the auth request.)
* 4(b) ensures that we have received the replication offset from all other replicas and that it's up to date. If a replica says that its primary is failing, it also means that the replication from the primary to that replica has stopped.
* 4(d) is to avoid a special bad case. It can happen that not all replicas know about each other. In this case, two replicas can think they are both the best replica and start the failover at the same time. This can already happen without this PR. When it happens, it usually means that a new replica has just joined and it has no data (offset = 0) and if it wins the election, there is a problem of data loss (discussed and partially mitigated for the replica migration case in #885). To avoid this case, skip this fast failover path if the replica has offset = 0.

---------

Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
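The four conditions of step 4 can be read as a single predicate. The following is a rough sketch of that predicate, not the code from #2227: `peer_replica`, `replica_rank` and `can_skip_failover_delay` are hypothetical names, and the rank is assumed to be the number of other replicas in the shard that report a strictly greater replication offset.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-replica bookkeeping; not the real clusterNode struct. */
typedef struct {
    bool pong_received;        /* 4(a): a PONG arrived since the failover was scheduled */
    bool my_primary_fail_flag; /* 4(b): last message carried CLUSTER_NODE_MY_PRIMARY_FAIL */
    long long repl_offset;     /* replication offset this replica reported */
} peer_replica;

/* Assumed rank definition: number of peers with a strictly greater offset. */
static int replica_rank(long long my_offset, const peer_replica *peers, size_t n) {
    int rank = 0;
    for (size_t i = 0; i < n; i++)
        if (peers[i].repl_offset > my_offset) rank++;
    return rank;
}

/* True when the remaining delay can be skipped according to step 4. */
static bool can_skip_failover_delay(long long my_offset, const peer_replica *peers, size_t n) {
    if (my_offset == 0) return false;                     /* 4(d) */
    for (size_t i = 0; i < n; i++) {
        if (!peers[i].pong_received) return false;        /* 4(a) */
        if (!peers[i].my_primary_fail_flag) return false; /* 4(b) */
    }
    return replica_rank(my_offset, peers, n) == 0;        /* 4(c) */
}
```

With no other replicas in the shard (`n == 0`) and a non-zero offset, the predicate is immediately true, which corresponds to the single-replica case where no delay is needed.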
When myself is a replica and we receive a message that my primary is
FAIL, we can set CLUSTER_TODO_HANDLE_FAILOVER so that we
try to fail over as soon as possible in beforeSleep, without waiting
for clusterCron to kick in.
Add a new markNodeAsFailing function and move the FAIL flag related code
there, so we can easily check whether the failing node is my primary.
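A minimal model of this mechanism is sketched below, assuming a bitmask of pending work that is drained once per event loop iteration. All names, flag values and struct layouts are illustrative and do not match the Valkey source; only the idea (set a "handle failover" bit when the failing node is my primary, and act on it before sleeping) comes from the description above.

```c
#include <stddef.h>

/* Flag values are illustrative, not the ones from the Valkey source. */
#define TODO_HANDLE_FAILOVER (1 << 0)
#define TODO_UPDATE_STATE    (1 << 1)
#define TODO_SAVE_CONFIG     (1 << 2)
#define NODE_FAIL            (1 << 3)

typedef struct node {
    int flags;
    struct node *replicaof; /* NULL for a primary */
} node;

typedef struct {
    node *myself;
    int todo_before_sleep; /* work to run before the event loop sleeps */
    int failover_attempts; /* counts runs of the (stubbed) failover handler */
} cluster;

static void do_before_sleep(cluster *c, int flags) {
    c->todo_before_sleep |= flags;
}

/* Sets FAIL on the node and, if it is my primary, requests that the
 * failover handler runs in the next before-sleep pass. */
static void mark_node_as_failing(cluster *c, node *n) {
    n->flags |= NODE_FAIL;
    do_before_sleep(c, TODO_UPDATE_STATE | TODO_SAVE_CONFIG);
    if (c->myself->replicaof == n) do_before_sleep(c, TODO_HANDLE_FAILOVER);
}

/* Runs once per event loop iteration, before blocking on I/O. */
static void before_sleep(cluster *c) {
    int flags = c->todo_before_sleep;
    c->todo_before_sleep = 0;
    if (flags & TODO_HANDLE_FAILOVER) c->failover_attempts++;
}
```

The point of the design is latency: the pending bit is handled at the end of the current event loop iteration instead of at the next periodic cron tick.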
The following is all old content. It was deleted during the discussion, but the content is worth keeping. (Don't merge this text)
================================================================================
Remove the 500ms fixed delay for automatic failover for faster failover
An automatic failover has a fixed delay of 500ms. Its main purpose is
to wait for the FAIL packet to propagate in the cluster so that other
primaries can respond to the vote request.
A fixed delay of 500ms is rather long nowadays, so we are looking into
removing it for a faster failover, provided we can ensure that removing
it is safe.
Currently, a replica can only learn of the death of its primary
node from two places:
1. Collecting PFAIL reports from other primary nodes, and then checking
in markNodeAsFailingIfNeeded whether we have the quorum.
2. Receiving FAIL messages from other primary nodes (or replica nodes).
In both cases, the failing state is triggered by collecting failure reports
from primaries.
In markNodeAsFailingIfNeeded, if `nodeIsReplica(myself) && myself->replicaof == node`
is true, we can try to trigger a failover in clusterBeforeSleep as soon as
possible without waiting for clusterCron. Moreover, we can remove the 500ms delay, because
we broadcast a FAIL here, which means that myself's FAILOVER_AUTH request is always
sent after the FAIL, so we don't have to wait 500ms for the FAIL to propagate.
Moreover, if we do the same when receiving a FAIL (check whether the failing node is
myself's primary and call clusterSendFail to broadcast the FAIL), it looks like
we can also remove the 500ms delay. This resembles a worst-case version of the first
case, as if all the replicas had collected the PFAIL reports and then sent the FAIL.
So in this commit, we try to remove the fixed delay of 500ms. For now, we
keep the random 0-500ms delay and use the ranking to avoid vote conflicts.
Related to #2023.