Start failover without waiting for the next cluster cron cycle by enjoy-binbin · Pull Request #2209 · valkey-io/valkey

enjoy-binbin · 2025-06-12T16:55:48Z

When myself is a replica, if we receive a message that my primary is
FAIL, we can try to set CLUSTER_TODO_HANDLE_FAILOVER so that we can
try to failover as soon as possible in beforeSleep without waiting
for clusterCron to kick in.

Add a new markNodeAsFailing method, we move FAIL flag related code to
here so we can easily check whether the failing node is my primary.

The following is all old content. It was deleted during the discussion, but the content is worth keeping. (Don't merge this text)

================================================================================

Remove the 500ms fixed delay for automatic failover for faster failover

An automatic failover has a fixed delay of 500ms, its main purpose is
to wait for the FAIL packet to propagate in the cluster so that other
primaries can respond to the vote request.

server.cluster->failover_auth_time =
now +
500 +           /* Fixed delay of 500 milliseconds, let FAIL msg propagate. */
random() % 500; /* Random delay between 0 and 500 milliseconds. */

A fixed delay of 500ms is a bit long in now days, we are looking into
removing it for a faster failover. If we can ensure it is safe to remove,
then we can remove it.

Currently a replica now can only learn of the death of its primary
node from two places:

Collecting pfail messages from other primary nodes, and then check
in markNodeAsFailingIfNeeded to see if we have the quorum.
Receiving FAIL messages from other primary nodes (or replica nodes).

Anyway both the failing state is triggered collecting failure reports
from primaries.

In markNodeAsFailingIfNeeded, if nodeIsReplica(myself) && myself->replicaof == node
is true, then we can try to trigger a failover in clusterBeforeSleep as soon as
possible without waiting for clusterCron. Moreover, we can remove 500ms, because
we broadcast a FAIL here, which means that myself's FAILOVER_AUTH will definitely
be after the FAIL, which means we don't have to wait 500ms for FAIL to propagate.

void markNodeAsFailingIfNeeded(clusterNode *node) {
    ...
    /* Broadcast the failing node name to everybody, forcing all the other
     * reachable nodes to flag the node as FAIL.
     * We do that even if this node is a replica and not a primary: anyway
     * the failing state is triggered collecting failure reports from primaries,
     * so here the replica is only helping propagating this status. */
    clusterSendFail(node->name);
    clusterDoBeforeSleep(CLUSTER_TODO_UPDATE_STATE | CLUSTER_TODO_SAVE_CONFIG);

And moreover, if we do the same when getting a FAIL, we check if failing is
myself's primary, and we call clusterSendFail to send the FAIL, this look like
we can also remove the 500ms delay. It is somehow like the first case in a very
badly way, like all the replicas got the pfail and then send the fail.

    } else if (type == CLUSTERMSG_TYPE_FAIL) {
        clusterNode *failing;

        if (sender) {
            failing = clusterLookupNode(hdr->data.fail.about.nodename, CLUSTER_NAMELEN);
            if (failing && !(failing->flags & (CLUSTER_NODE_FAIL | CLUSTER_NODE_MYSELF))) {
                serverLog(LL_NOTICE, "FAIL message received from %.40s (%s) about %.40s (%s)", hdr->sender,
                          sender->human_nodename, hdr->data.fail.about.nodename, failing->human_nodename);
                failing->flags |= CLUSTER_NODE_FAIL;
                failing->fail_time = now;
                failing->flags &= ~CLUSTER_NODE_PFAIL;
                clusterDoBeforeSleep(CLUSTER_TODO_SAVE_CONFIG | CLUSTER_TODO_UPDATE_STATE);
            }

So in this commit, we try to remove the fixed delay of 500ms. For now, we
keep the random 500ms and use the ranking to avoid the vote confilct.

Related to #2023.

An automatic failover has a fixed delay of 500ms, its main purpose is to wait for the FAIL packet to propagate in the cluster so that other primaries can respond to the vote request. ``` server.cluster->failover_auth_time = now + 500 + /* Fixed delay of 500 milliseconds, let FAIL msg propagate. */ random() % 500; /* Random delay between 0 and 500 milliseconds. */ ``` A fixed delay of 500ms is a bit long in now days, we are looking into removing it for a faster failover. If we can ensure it is safe to remove, then we can remove it. Currently a replica now can only learn of the death of its primary node from two places: 1. Collecting pfail messages from other primary nodes, and then check in markNodeAsFailingIfNeeded to see if we have the quorum. 2. Receiving FAIL messages from other primary nodes (or replica nodes). Anyway both the failing state is triggered collecting failure reports from primaries. In markNodeAsFailingIfNeeded, if `nodeIsReplica(myself) && myself->replicaof == node` is true, then we can try to trigger a failover in clusterBeforeSleep as soon as possible without waiting for clusterCron. Moreover, we can remove 500ms, because we broadcast a FAIL here, which means that myself's FAILOVER_AUTH will definitely be after the FAIL, which means we don't have to wait 500ms for FAIL to propagate. ``` void markNodeAsFailingIfNeeded(clusterNode *node) { ... /* Broadcast the failing node name to everybody, forcing all the other * reachable nodes to flag the node as FAIL. * We do that even if this node is a replica and not a primary: anyway * the failing state is triggered collecting failure reports from primaries, * so here the replica is only helping propagating this status. */ clusterSendFail(node->name); clusterDoBeforeSleep(CLUSTER_TODO_UPDATE_STATE | CLUSTER_TODO_SAVE_CONFIG); ``` And moreover, if we do the same when getting a FAIL, we check if failing is myself's primary, and we call clusterSendFail to send the FAIL, this look like we can also remove the 500ms delay. It is somehow like the first case in a very badly way, like all the replicas got the pfail and then send the fail. ``` } else if (type == CLUSTERMSG_TYPE_FAIL) { clusterNode *failing; if (sender) { failing = clusterLookupNode(hdr->data.fail.about.nodename, CLUSTER_NAMELEN); if (failing && !(failing->flags & (CLUSTER_NODE_FAIL | CLUSTER_NODE_MYSELF))) { serverLog(LL_NOTICE, "FAIL message received from %.40s (%s) about %.40s (%s)", hdr->sender, sender->human_nodename, hdr->data.fail.about.nodename, failing->human_nodename); failing->flags |= CLUSTER_NODE_FAIL; failing->fail_time = now; failing->flags &= ~CLUSTER_NODE_PFAIL; clusterDoBeforeSleep(CLUSTER_TODO_SAVE_CONFIG | CLUSTER_TODO_UPDATE_STATE); } ``` So in this commit, we try to remove the fixed delay of 500ms. For now, we keep the random 500ms and use the ranking to avoid the vote confilct. Signed-off-by: Binbin <binloveplay1314@qq.com>

codecov · 2025-06-12T17:11:21Z

Codecov Report

Attention: Patch coverage is 90.90909% with 1 line in your changes missing coverage. Please review.

Project coverage is 71.53%. Comparing base (2287261) to head (00d4b4c).

Files with missing lines	Patch %	Lines
src/cluster_legacy.c	90.90%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable    #2209      +/-   ##
============================================
- Coverage     71.54%   71.53%   -0.01%     
============================================
  Files           122      122              
  Lines         66491    66501      +10     
============================================
+ Hits          47570    47573       +3     
- Misses        18921    18928       +7

Files with missing lines	Coverage Δ
src/cluster_legacy.c	`86.82% <90.90%> (+0.18%)`	⬆️

... and 14 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

zuiderkwast

In the time between scheduling a failover and starting the vote, the replica sends it's offset to other replicas in the same shard, to make sure the nodes agree on the rank before the failover starts.

        /* Now that we have a scheduled election, broadcast our offset
         * to all the other replicas so that they'll updated their offsets
         * if our offset is better. */
        clusterBroadcastPong(CLUSTER_BROADCAST_LOCAL_REPLICAS);

Assuming the other replicas will also receive the FAIL, they will also broadcast their offset to other replicas in the same shard, so all replicas will have updated offset before any replica actually starts the vote.

This code updates the rank and the delay if some replica reports a better offset:

    /* It is possible that we received more updated offsets from other
     * replicas for the same primary since we computed our election delay.
     * Update the delay if our rank changed.
     *
     * It is also possible that we received the message that telling a
     * shard is up. Update the delay if our failed_primary_rank changed.
     *
     * Not performed if this is a manual failover. */
    if (server.cluster->failover_auth_sent == 0 && server.cluster->mf_end == 0) {
        int newrank = clusterGetReplicaRank();
        if (newrank != server.cluster->failover_auth_rank) {
            long long added_delay = (newrank - server.cluster->failover_auth_rank) * 1000;
            server.cluster->failover_auth_time += added_delay;
            server.cluster->failover_auth_rank = newrank;
            serverLog(LL_NOTICE, "Replica rank updated to #%d, added %lld milliseconds of delay.", newrank,
                      added_delay);
        }

With this PR, the delay can be near 0, so there is not enough time for the replicas to agree about the rank. In the worst case, two replicas both believe they are rank 0 and start the failover at the same time, and none of them will win the election, causing even more downtime before we have a successful failover.

Do we need a fixed delay for this case, or is there some other mechanism to avoid this race between replicas?

madolson · 2025-06-12T19:30:43Z

Do we need a fixed delay for this case, or is there some other mechanism to avoid this race between replicas?

We will always race without a consensus algorithm like raft or paxos with the data flowing through it. It's just what degree of raciness are we willing to tolerate.

We could wait for an affirmative ping from the other replicas that they believe their primary is failed (it will be in their gossip pfail) before starting to nominate ourselves if we have the highest offset or 500ms has elapsed.

hpatro

My assumption from #2023 we wanted to make this delay dynamic w.r.t. to cluster-node-timeout. I'm apprehensive to having a separate flow to handle faster failover.

enjoy-binbin · 2025-06-13T02:31:56Z

My assumption from #2023 we wanted to make this delay dynamic w.r.t. to cluster-node-timeout. I'm apprehensive to having a separate flow to handle faster failover.

yes, Viktor and i talked about this offline yesterday, i think we can do it with other fixed number, like the random() % 500, server.cluster->failover_auth_rank * 1000, server.cluster->failover_failed_primary_rank * 500 something like these.

And with this fixed 500ms delay, i wonder if the current mechanism will be safe enough, and if we determine that it is indeed safe, then i would love to remove it.

zuiderkwast · 2025-06-13T07:43:11Z

We will always race without a consensus algorithm like raft or paxos with the data flowing through it. It's just what degree of raciness are we willing to tolerate.

@madolson Sure, but we try to make the race less likely to happen. We have rank and now also primary rank that helps avoid a race if two primaries fail at the same time.

We wait for an affirmative ping from the other replicas that they believe their primary is failed (it will be in their gossip pfail) before starting to nominate ourselves if we have the highest offset or 500ms has elapsed.

Do we do this? Whenever we get a pong from all replicas while waiting, we abort the 500ms wait and start the election immediately? I haven't seen this. Can you point to the code?

hwware · 2025-06-13T17:42:49Z

My understanding for this requirement is to change current 500 ms delay (the fixed value) to a config parameter and let user decide the value (the default value keeps as 500 ms). Added in Valkey weekly meeting.

madolson · 2025-06-13T22:52:14Z

Do we do this? Whenever we get a pong from all replicas while waiting, we abort the 500ms wait and start the election immediately? I haven't seen this. Can you point to the code?

No, I was proposing that was something we could do. I think I missed a word in my original comment.

We are waiting from the fail message from the other replicas to make sure we have their repl offset to compute the rank. If a given replica has received that information from all other replicas, it doesn't need to wait.

madolson · 2025-06-16T15:15:33Z

We discussed this in the weekly meeting. The discussion was that we don't want to fully remove the 500ms logic, since we believe we do need to wait to get the messages from the other replicas to verify we have the best replication offset. We need some verification that we have the offsets from the other replicas before starting the election.

enjoy-binbin · 2025-06-16T15:20:53Z

btw, if we only have one replica, that means, my rank is definitely is 0, and my offset is definitely the higher one. The waitting is just a waste in this case.

enjoy-binbin · 2025-06-16T16:39:16Z

Assuming the comments are correct and we are waiting for FAIL to propagate. Or we consider it safe, then I would like to remove it completely.

        server.cluster->failover_auth_time = now +
                                             500 +           /* Fixed delay of 500 milliseconds, let FAIL msg propagate. */
                                             random() % 500; /* Random delay between 0 and 500 milliseconds. */

Or there are other options:

Make it become a configuration item, an experienced administrator can use any value he wants, zero or 500ms, one can set it based on network (if you have a small node timeout, you can change it) or cluster size (if you only have one replica, you can just go zero) etc. This work for everyone, except the disadvantage is that a new configuration item needs to be introduced, which is not easy for ordinary people to use, but to some extent many of our configuration items rely on users to understand some details.
zuiderkwast idea, doing some math like node_timeout / 30, i am fine with that, but I should mention that we don't modify this timeout value often, so it is unlikely to be used for us.

Anyway, bottom line, if we feel safe, then I want to remove it. If we are not sure, that's fine, after all, stability is also very important.

zuiderkwast · 2025-06-17T09:17:07Z

@enjoy-binbin we can do Madelyn's idea. We wait but when we receive information from all replicas that they know the primary is failing and we get their offset, we start immediately. It can be very fast.

enjoy-binbin · 2025-06-17T14:36:26Z

ok, that is a good idea and i think we are making a progress step by step, it can definitely solve the one replica case and help other cases, i will do it in a new PR.

new PR: #2227

In valkey-io#2209, we are exploring ways to make failover faster, that is, to minimize the delay. The purpose of this delay is to allow us to have a scheduled election, so that during the delay, there is a way for the cluster to get more detailed information. For example, other primary nodes need to receive the propagation of FAIL in order to vote, and we need to know the offset of other replica nodes in order to rank. We want to minimize delay while ensuring safety. It is very useful for example, if there is only one replica, then we don't need any delay. In this PR, when we find that the replica is the best ranked replica, we let it initiate a failover immediately and completely remove the delay. How to ensure safety? 1. We try to broadcast a FAIL message to the cluster before the replica initiates the election, so that other primaries will not refuse to vote. 2. Each replica node will mark in flags whether its primary node is in FAIL state, that is, a new CLUSTER_NODE_MY_PRIMARY_FAIL is introduced. And when we gossip with others, the replica can know whether all replicas under the primary node have reached a consensus on the FAIL state based on this flag. If they have reached a consensus, it means that we know the offset information of all replicas, which means that we can calculate the rank based on the existing information without worrying about the offset being outdated. Signed-off-by: Binbin <binloveplay1314@qq.com>

…r failover" This reverts commit 00d4b4c.

Signed-off-by: Binbin <binloveplay1314@qq.com>

…s FAIL Signed-off-by: Binbin <binloveplay1314@qq.com>

enjoy-binbin · 2025-06-27T03:49:44Z

@zuiderkwast I revert the 500ms change, and only keep the TODO FAILOVER logic. (I guess we can ignore the DCO for now, i can squash a signoff when i merge it)

zuiderkwast

Change LGTM. Now it's only to skip the wait for the next clusterCron cycle.

I think you can change the title from "as soon as possible" to for example something like "Start failover without waiting for the next cluster cron cycle".

Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Signed-off-by: Binbin <binloveplay1314@qq.com>

hpatro · 2025-06-27T17:24:51Z

@enjoy-binbin @zuiderkwast Could you explain the rational behind this change better?

I don't understand the benefit of this very clearly. Let me share my reasoning on this. I believe on a single primary failure scenario with just one replica this change could potentially help with faster failover, however that is also not guaranteed if the healthy primaries are not yet aware about the failure of the primary for which vote is requested.

Further, if there are multiple replicas in a given shard, it's very likely they would conflict with each other and divide the votes amongst them, which isn't ideal. It's also possible that a replica with lower offset could be elected.

And in multiple primary failure scenario, we would again divide the votes for a given epoch amongst different shard causing wastage of election cycle and introduce delay.

I would also like the @valkey-io/core-team to share their opinion on this.

zuiderkwast · 2025-06-27T17:34:33Z

@hpatro This PR is now very minor. It doesn't remove the 500 delay nor the random delay. It only removed the wait for next cluster cron cycle, which I believe is not an intentional delay.

Maybe you want to discuss this one instead?

Do the failover immediately if the replica is the best ranked replica #2227

…#2227) In #2023 (#2209, etc.), we are exploring ways to make failover faster, that is, to minimize the delay. When a node is marked as FAIL and before the failover starts, there is a delay of 500-1000ms. The original purpose of this delay: 1. Allow FAIL to propagate to at least a majority of the primaries. This makes sure they will vote when a replica sends failover auth request. 2. Allow replicas to exchange their offsets, so they will have a correct view of their own rank. We want to minimize this delay while ensuring safety. It is useful for example in these cases: 1. If there is only one replica, then we don't need any delay, or 2. If there are more replicas, with a fast network, the replicas can exchange the offsets very quickly and start the failover within a few milliseconds instead of 500-1000ms. In this PR, when we can be sure that the replica is the best ranked replica, we let it initiate a failover immediately and completely remove the delay. ### How to ensure safety? 1. To make sure this replica has the best rank, it only skips the delay if it is sure that it have the best rank and that all replicas in the same shard agree that the primary is failing. A new flag `CLUSTER_NODE_MY_PRIMARY_FAIL` is introduced to indicate that each replica has marked its primary as FAIL. If all replicas say that the primary is failing, we also know that the offset is not updated, because the offset is not incrementing when the primary is failing. We can skip the delay only if we have received a message from all replicas and they all have set this flag. 2. To make sure the primaries will vote even if they didn't receive the FAIL yet, we use the `CLUSTERMSG_FLAG0_FORCEACK` to make sure they will vote. This is equivalent to broadcasting a FAIL message to all primaries before we broadcast the failover auth request (but cheaper). The race between FAIL (broadcast by A) and AUTH REQUEST (broadcast by R) is illustrated in the following sequence diagram: ``` A R B C | | | | | FAIL | | | |----->| AUTH R.| | | |------->| | | FAIL | | | |-------------->| | | | AUTH R.| | | |-------------->| | FAIL | | | |--------------------->| ``` ### Details This is the how the failover is initiated, with new steps marked with **(new)**: 1. A majority of primaries have marked another primary as PFAIL. 2. Some nodes counts failure reports and marks the failing primary as FAIL. The node that detects FAIL broadcasts it to all nodes in the cluster. 3. When a replica receives FAIL (or detects FAIL itself by counting PFAIL reports) it schedules a failover: a. It sets a timeout (500ms + random 0-500ms). b. It broadcasts pong to the other replicas in the same shard. c. **(new)** The pong (actually the clusterMsg header) has a new flag `CLUSTER_NODE_MY_PRIMARY_FAIL`. When the replicas broadcast pong to each other here, this flag is set. 4. **(new)** When the following conditions are met, skip the remaining delay and start the failover using AUTH REQUEST with the FORCE ACK flag set, that is if a. a PONG is received from every other replica in the same shard (broadcast within the shard) and b. all replicas have marked that its primary is FAIL in their last message (the new `CLUSTER_NODE_MY_PRIMARY_FAIL` flag is set) and c. this is the best replica (rank = 0) and d. my replication offset != 0. 5. When the delay has passed and no other replica has initiated failover, then initiate failover. Notes: * With 3(c), we don't need to wait for FAIL to propagate to all voting primaries. At this point, a FAIL has already been broadcast by some node, but there is a race so our auth request may arrive to some node before the FAIL. Using the FORCE ACK flag ensures the primaries will vote for us. (It is equivalent to broacasting another FAIL just before broadcasting auth request.) * 4(b) ensures that we have received the replication offset from all other replicas and that it's up to date. If a replica says that it's primary is failing, it also means that the replication from the primary to that replica has stopped. * 4(c) is to avoid a special bad case. It can happen that not all replicas know about each other. In this case, two replicas can think they are both the best replica and start the failover at the same time. This can already happen without this PR. When it happens, it usually means that a new replica has just joined and it has no data (offset = 0) and if it wins the election, there is a problem of dataloss (discussed and partially mitigated for the replica migration case in #885). To avoid this case, skip this fast failover path if the replica has offset = 0. --------- Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

…valkey-io#2227) In valkey-io#2023 (valkey-io#2209, etc.), we are exploring ways to make failover faster, that is, to minimize the delay. When a node is marked as FAIL and before the failover starts, there is a delay of 500-1000ms. The original purpose of this delay: 1. Allow FAIL to propagate to at least a majority of the primaries. This makes sure they will vote when a replica sends failover auth request. 2. Allow replicas to exchange their offsets, so they will have a correct view of their own rank. We want to minimize this delay while ensuring safety. It is useful for example in these cases: 1. If there is only one replica, then we don't need any delay, or 2. If there are more replicas, with a fast network, the replicas can exchange the offsets very quickly and start the failover within a few milliseconds instead of 500-1000ms. In this PR, when we can be sure that the replica is the best ranked replica, we let it initiate a failover immediately and completely remove the delay. ### How to ensure safety? 1. To make sure this replica has the best rank, it only skips the delay if it is sure that it have the best rank and that all replicas in the same shard agree that the primary is failing. A new flag `CLUSTER_NODE_MY_PRIMARY_FAIL` is introduced to indicate that each replica has marked its primary as FAIL. If all replicas say that the primary is failing, we also know that the offset is not updated, because the offset is not incrementing when the primary is failing. We can skip the delay only if we have received a message from all replicas and they all have set this flag. 2. To make sure the primaries will vote even if they didn't receive the FAIL yet, we use the `CLUSTERMSG_FLAG0_FORCEACK` to make sure they will vote. This is equivalent to broadcasting a FAIL message to all primaries before we broadcast the failover auth request (but cheaper). The race between FAIL (broadcast by A) and AUTH REQUEST (broadcast by R) is illustrated in the following sequence diagram: ``` A R B C | | | | | FAIL | | | |----->| AUTH R.| | | |------->| | | FAIL | | | |-------------->| | | | AUTH R.| | | |-------------->| | FAIL | | | |--------------------->| ``` ### Details This is the how the failover is initiated, with new steps marked with **(new)**: 1. A majority of primaries have marked another primary as PFAIL. 2. Some nodes counts failure reports and marks the failing primary as FAIL. The node that detects FAIL broadcasts it to all nodes in the cluster. 3. When a replica receives FAIL (or detects FAIL itself by counting PFAIL reports) it schedules a failover: a. It sets a timeout (500ms + random 0-500ms). b. It broadcasts pong to the other replicas in the same shard. c. **(new)** The pong (actually the clusterMsg header) has a new flag `CLUSTER_NODE_MY_PRIMARY_FAIL`. When the replicas broadcast pong to each other here, this flag is set. 4. **(new)** When the following conditions are met, skip the remaining delay and start the failover using AUTH REQUEST with the FORCE ACK flag set, that is if a. a PONG is received from every other replica in the same shard (broadcast within the shard) and b. all replicas have marked that its primary is FAIL in their last message (the new `CLUSTER_NODE_MY_PRIMARY_FAIL` flag is set) and c. this is the best replica (rank = 0) and d. my replication offset != 0. 5. When the delay has passed and no other replica has initiated failover, then initiate failover. Notes: * With 3(c), we don't need to wait for FAIL to propagate to all voting primaries. At this point, a FAIL has already been broadcast by some node, but there is a race so our auth request may arrive to some node before the FAIL. Using the FORCE ACK flag ensures the primaries will vote for us. (It is equivalent to broacasting another FAIL just before broadcasting auth request.) * 4(b) ensures that we have received the replication offset from all other replicas and that it's up to date. If a replica says that it's primary is failing, it also means that the replication from the primary to that replica has stopped. * 4(c) is to avoid a special bad case. It can happen that not all replicas know about each other. In this case, two replicas can think they are both the best replica and start the failover at the same time. This can already happen without this PR. When it happens, it usually means that a new replica has just joined and it has no data (offset = 0) and if it wins the election, there is a problem of dataloss (discussed and partially mitigated for the replica migration case in valkey-io#885). To avoid this case, skip this fast failover path if the replica has offset = 0. --------- Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

…#2227) In #2023 (#2209, etc.), we are exploring ways to make failover faster, that is, to minimize the delay. When a node is marked as FAIL and before the failover starts, there is a delay of 500-1000ms. The original purpose of this delay: 1. Allow FAIL to propagate to at least a majority of the primaries. This makes sure they will vote when a replica sends failover auth request. 2. Allow replicas to exchange their offsets, so they will have a correct view of their own rank. We want to minimize this delay while ensuring safety. It is useful for example in these cases: 1. If there is only one replica, then we don't need any delay, or 2. If there are more replicas, with a fast network, the replicas can exchange the offsets very quickly and start the failover within a few milliseconds instead of 500-1000ms. In this PR, when we can be sure that the replica is the best ranked replica, we let it initiate a failover immediately and completely remove the delay. ### How to ensure safety? 1. To make sure this replica has the best rank, it only skips the delay if it is sure that it have the best rank and that all replicas in the same shard agree that the primary is failing. A new flag `CLUSTER_NODE_MY_PRIMARY_FAIL` is introduced to indicate that each replica has marked its primary as FAIL. If all replicas say that the primary is failing, we also know that the offset is not updated, because the offset is not incrementing when the primary is failing. We can skip the delay only if we have received a message from all replicas and they all have set this flag. 2. To make sure the primaries will vote even if they didn't receive the FAIL yet, we use the `CLUSTERMSG_FLAG0_FORCEACK` to make sure they will vote. This is equivalent to broadcasting a FAIL message to all primaries before we broadcast the failover auth request (but cheaper). The race between FAIL (broadcast by A) and AUTH REQUEST (broadcast by R) is illustrated in the following sequence diagram: ``` A R B C | | | | | FAIL | | | |----->| AUTH R.| | | |------->| | | FAIL | | | |-------------->| | | | AUTH R.| | | |-------------->| | FAIL | | | |--------------------->| ``` ### Details This is the how the failover is initiated, with new steps marked with **(new)**: 1. A majority of primaries have marked another primary as PFAIL. 2. Some nodes counts failure reports and marks the failing primary as FAIL. The node that detects FAIL broadcasts it to all nodes in the cluster. 3. When a replica receives FAIL (or detects FAIL itself by counting PFAIL reports) it schedules a failover: a. It sets a timeout (500ms + random 0-500ms). b. It broadcasts pong to the other replicas in the same shard. c. **(new)** The pong (actually the clusterMsg header) has a new flag `CLUSTER_NODE_MY_PRIMARY_FAIL`. When the replicas broadcast pong to each other here, this flag is set. 4. **(new)** When the following conditions are met, skip the remaining delay and start the failover using AUTH REQUEST with the FORCE ACK flag set, that is if a. a PONG is received from every other replica in the same shard (broadcast within the shard) and b. all replicas have marked that its primary is FAIL in their last message (the new `CLUSTER_NODE_MY_PRIMARY_FAIL` flag is set) and c. this is the best replica (rank = 0) and d. my replication offset != 0. 5. When the delay has passed and no other replica has initiated failover, then initiate failover. Notes: * With 3(c), we don't need to wait for FAIL to propagate to all voting primaries. At this point, a FAIL has already been broadcast by some node, but there is a race so our auth request may arrive to some node before the FAIL. Using the FORCE ACK flag ensures the primaries will vote for us. (It is equivalent to broacasting another FAIL just before broadcasting auth request.) * 4(b) ensures that we have received the replication offset from all other replicas and that it's up to date. If a replica says that it's primary is failing, it also means that the replication from the primary to that replica has stopped. * 4(c) is to avoid a special bad case. It can happen that not all replicas know about each other. In this case, two replicas can think they are both the best replica and start the failover at the same time. This can already happen without this PR. When it happens, it usually means that a new replica has just joined and it has no data (offset = 0) and if it wins the election, there is a problem of dataloss (discussed and partially mitigated for the replica migration case in #885). To avoid this case, skip this fast failover path if the replica has offset = 0. --------- Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

enjoy-binbin added the major-decision-pending Major decision pending by TSC team label Jun 12, 2025

enjoy-binbin added this to Valkey 9.0 Jun 12, 2025

enjoy-binbin requested review from PingXie, hpatro, madolson and zuiderkwast June 12, 2025 16:56

zuiderkwast reviewed Jun 12, 2025

View reviewed changes

Comment thread src/cluster_legacy.c Outdated

hpatro reviewed Jun 12, 2025

View reviewed changes

enjoy-binbin added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Jun 13, 2025

zuiderkwast reviewed Jun 13, 2025

View reviewed changes

Comment thread src/cluster_legacy.c Outdated

zuiderkwast reviewed Jun 13, 2025

View reviewed changes

Comment thread src/cluster_legacy.c

enjoy-binbin mentioned this pull request Jun 17, 2025

Do the failover immediately if the replica is the best ranked replica #2227

Merged

enjoy-binbin added 3 commits June 25, 2025 12:32

Revert "Remove the 500ms fixed delay for automatic failover for faste…

01a1cf3

…r failover" This reverts commit 00d4b4c.

Merge remote-tracking branch 'upstream/unstable' into faster_failover

50daefd

Signed-off-by: Binbin <binloveplay1314@qq.com>

Trigger the election as soon as possible when myself marked primary a…

2f07693

…s FAIL Signed-off-by: Binbin <binloveplay1314@qq.com>

enjoy-binbin removed the major-decision-pending Major decision pending by TSC team label Jun 27, 2025

enjoy-binbin changed the title ~~Remove the 500ms fixed delay for automatic failover for faster failover~~ Trigger the election as soon as possible when myself marked primary as FAIL Jun 27, 2025

zuiderkwast approved these changes Jun 27, 2025

View reviewed changes

Comment thread src/cluster_legacy.c Outdated

Update src/cluster_legacy.c

e98cd2d

Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Signed-off-by: Binbin <binloveplay1314@qq.com>

enjoy-binbin changed the title ~~Trigger the election as soon as possible when myself marked primary as FAIL~~ Start failover without waiting for the next cluster cron cycle Jun 27, 2025

enjoy-binbin merged commit 23fb9f2 into valkey-io:unstable Jun 27, 2025
51 of 58 checks passed

github-project-automation Bot moved this to Done in Valkey 9.0 Jun 27, 2025

enjoy-binbin deleted the faster_failover branch June 27, 2025 16:04

enjoy-binbin added the cluster label Sep 19, 2025

Uh oh!

Conversation

enjoy-binbin commented Jun 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

The following is all old content. It was deleted during the discussion, but the content is worth keeping. (Don't merge this text)

Uh oh!

codecov Bot commented Jun 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

zuiderkwast left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

madolson commented Jun 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

hpatro left a comment

Choose a reason for hiding this comment

Uh oh!

enjoy-binbin commented Jun 13, 2025

Uh oh!

zuiderkwast commented Jun 13, 2025

Uh oh!

Uh oh!

Uh oh!

hwware commented Jun 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

madolson commented Jun 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

madolson commented Jun 16, 2025

Uh oh!

enjoy-binbin commented Jun 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

enjoy-binbin commented Jun 16, 2025

Uh oh!

zuiderkwast commented Jun 17, 2025

Uh oh!

enjoy-binbin commented Jun 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

enjoy-binbin commented Jun 27, 2025

Uh oh!

zuiderkwast left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

hpatro commented Jun 27, 2025

Uh oh!

zuiderkwast commented Jun 27, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

enjoy-binbin commented Jun 12, 2025 •

edited

Loading

codecov Bot commented Jun 12, 2025 •

edited

Loading

zuiderkwast left a comment •

edited

Loading

madolson commented Jun 12, 2025 •

edited

Loading

hwware commented Jun 13, 2025 •

edited

Loading

madolson commented Jun 13, 2025 •

edited

Loading

enjoy-binbin commented Jun 16, 2025 •

edited

Loading

enjoy-binbin commented Jun 17, 2025 •

edited

Loading