Do the failover immediately if the replica is the best ranked replica by enjoy-binbin · Pull Request #2227 · valkey-io/valkey

enjoy-binbin · 2025-06-17T15:52:04Z

In #2023 (#2209, etc.), we are exploring ways to make failover faster, that is,
to minimize the delay.

When a node is marked as FAIL and before the failover starts, there is a delay of
500-1000ms. The original purpose of this delay:

Allow FAIL to propagate to at least a majority of the primaries. This makes sure
they will vote when a replica sends failover auth request.
Allow replicas to exchange their offsets, so they will have a correct view of
their own rank.

We want to minimize this delay while ensuring safety. It is useful for example in
these cases:

If there is only one replica, then we don't need any delay, or
If there are more replicas, with a fast network, the replicas can exchange the
offsets very quickly and start the failover within a few milliseconds instead
of 500-1000ms.

In this PR, when we can be sure that the replica is the best ranked replica, we let
it initiate a failover immediately and completely remove the delay.

How to ensure safety?

To make sure this replica has the best rank, it only skips the delay if it is
sure that it have the best rank and that all replicas in the same shard agree that
the primary is failing. A new flag CLUSTER_NODE_MY_PRIMARY_FAIL is introduced to
indicate that each replica has marked its primary as FAIL. If all replicas say that
the primary is failing, we also know that the offset is not updated, because the
offset is not incrementing when the primary is failing. We can skip the delay only
if we have received a message from all replicas and they all have set this flag.
To make sure the primaries will vote even if they didn't receive the FAIL yet,
we use the CLUSTERMSG_FLAG0_FORCEACK to make sure they will vote. This is equivalent
to broadcasting a FAIL message to all primaries before we broadcast the failover
auth request (but cheaper).

The race between FAIL (broadcast by A) and AUTH REQUEST (broadcast by R) is illustrated
in the following sequence diagram:

A      R        B      C
|      |        |      |
| FAIL |        |      |
|----->| AUTH R.|      |
|      |------->|      |
| FAIL |        |      |
|-------------->|      |
|      | AUTH R.|      |
|      |-------------->|
| FAIL |        |      |
|--------------------->|

Details

This is the how the failover is initiated, with new steps marked with (new):

A majority of primaries have marked another primary as PFAIL.
Some nodes counts failure reports and marks the failing primary as FAIL. The node
that detects FAIL broadcasts it to all nodes in the cluster.
When a replica receives FAIL (or detects FAIL itself by counting PFAIL reports) it
schedules a failover:
a. It sets a timeout (500ms + random 0-500ms).
b. It broadcasts pong to the other replicas in the same shard.
c. (new) The pong (actually the clusterMsg header) has a new flag CLUSTER_NODE_MY_PRIMARY_FAIL.
When the replicas broadcast pong to each other here, this flag is set.
(new) When the following conditions are met, skip the remaining delay and start
the failover using AUTH REQUEST with the FORCE ACK flag set, that is if
a. a PONG is received from every other replica in the same shard (broadcast within
the shard) and
b. all replicas have marked that its primary is FAIL in their last message (the new
CLUSTER_NODE_MY_PRIMARY_FAIL flag is set) and
c. this is the best replica (rank = 0) and
d. my replication offset != 0.
When the delay has passed and no other replica has initiated failover, then initiate failover.

Notes:

With 3(c), we don't need to wait for FAIL to propagate to all voting primaries.
At this point, a FAIL has already been broadcast by some node, but there is a race
so our auth request may arrive to some node before the FAIL. Using the FORCE ACK flag
ensures the primaries will vote for us. (It is equivalent to broacasting another FAIL
just before broadcasting auth request.)
4(b) ensures that we have received the replication offset from all other replicas and
that it's up to date. If a replica says that it's primary is failing, it also means that
the replication from the primary to that replica has stopped.
4(c) is to avoid a special bad case. It can happen that not all replicas know about each
other. In this case, two replicas can think they are both the best replica and start the
failover at the same time. This can already happen without this PR. When it happens, it
usually means that a new replica has just joined and it has no data (offset = 0) and if
it wins the election, there is a problem of dataloss (discussed and partially mitigated
for the replica migration case in Fix data loss when replica do a failover with a old history repl offset #885). To avoid this case, skip this fast failover path
if the replica has offset = 0.

In valkey-io#2209, we are exploring ways to make failover faster, that is, to minimize the delay. The purpose of this delay is to allow us to have a scheduled election, so that during the delay, there is a way for the cluster to get more detailed information. For example, other primary nodes need to receive the propagation of FAIL in order to vote, and we need to know the offset of other replica nodes in order to rank. We want to minimize delay while ensuring safety. It is very useful for example, if there is only one replica, then we don't need any delay. In this PR, when we find that the replica is the best ranked replica, we let it initiate a failover immediately and completely remove the delay. How to ensure safety? 1. We try to broadcast a FAIL message to the cluster before the replica initiates the election, so that other primaries will not refuse to vote. 2. Each replica node will mark in flags whether its primary node is in FAIL state, that is, a new CLUSTER_NODE_MY_PRIMARY_FAIL is introduced. And when we gossip with others, the replica can know whether all replicas under the primary node have reached a consensus on the FAIL state based on this flag. If they have reached a consensus, it means that we know the offset information of all replicas, which means that we can calculate the rank based on the existing information without worrying about the offset being outdated. Signed-off-by: Binbin <binloveplay1314@qq.com>

Signed-off-by: Binbin <binloveplay1314@qq.com>

codecov · 2025-06-17T16:08:56Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.53%. Comparing base (f7a64c5) to head (9ecb0ab).
⚠️ Report is 2 commits behind head on unstable.

Additional details and impacted files

@@            Coverage Diff            @@
##           unstable    #2227   +/-   ##
=========================================
  Coverage     76.53%   76.53%           
=========================================
  Files           159      159           
  Lines         79652    79679   +27     
=========================================
+ Hits          60958    60980   +22     
- Misses        18694    18699    +5

Files with missing lines	Coverage Δ
src/cluster_legacy.c	`88.24% <100.00%> (+0.03%)`	⬆️

... and 21 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

zuiderkwast

Good job!

I have some ideas. See comments.

Some test case would be nice to see, to check that we can do a failover faster than the delay that we skip. For example with node timeout 2000, check that we can do failover in 2500 ms? I guess a test like this can be flaky but at least try it to see if the feature works, would be good.

I'm a little confused about CLUSTER_TODO_HANDLE_FASTER_FAILOVER. Why do we need it? Why is it different to CLUSTER_TODO_HANDLE_FAILOVER, when can we use it, when can we not use it, etc. It is not clear to me.

IIUC, it's not really needed for the the feature skip delay if clusterAllReplicasThinkPrimaryIsFail() returns true and my rank is 0, is it? If it's a separate optimization, then I guess we skip the FASTER_FAILOVER idea in this PR, to keep it simpler. WDYT?

enjoy-binbin · 2025-06-18T02:42:40Z

Some test case would be nice to see, to check that we can do a failover faster than the delay that we skip. For example with node timeout 2000, check that we can do failover in 2500 ms? I guess a test like this can be flaky but at least try it to see if the feature works, would be good.

Yes, i will add the test later, i test it locally yesterday and it work.

I'm a little confused about CLUSTER_TODO_HANDLE_FASTER_FAILOVER. Why do we need it? Why is it different to CLUSTER_TODO_HANDLE_FAILOVER, when can we use it, when can we not use it, etc. It is not clear to me.

Yes, it is indeed a bit confusing, sorry i did not mention it. I want to put CLUSTER_TODO_HANDLE_FASTER_FAILOVER after other flags, like after CLUSTER_TODO_UPDATE_STATE, so that we will first print CLUSTER IS DOWN and then try to trigger the failover. That is now we will trigger failover faster, in the past, we will update the cluster state and then try to failover in ClusterCron.

IIUC, it's not really needed for the the feature skip delay if clusterAllReplicasThinkPrimaryIsFail() returns true and my rank is 0, is it? If it's a separate optimization, then I guess we skip the FASTER_FAILOVER idea in this PR, to keep it simpler. WDYT?

Sure, i can remove FASTER_FAILOVER and then only put it in #2209

Signed-off-by: Binbin <binloveplay1314@qq.com>

zuiderkwast · 2025-07-02T14:54:47Z

@hpatro I have discussed with @enjoy-binbin and updated the top comment with a more thorough reasoning about the possible scenarios. PTAL.

The special case that a new replica joins and doesn't know about the other replicas is a serious problem, but it's not new. It's not solved here but we can avoid making it worse. (I hope we can solve it in another PR, to make sure a replica is made aware of the other replicas in the shard ASAP, for example it starts replicating.)

Signed-off-by: Binbin <binloveplay1314@qq.com>

enjoy-binbin · 2025-07-11T04:03:06Z

Let's also include a check that myself's replication offset != 0 here?

It makes sure that if myself is a new replica just connected that don't yet know about other replicas, we should not try to do this fast failover.

The special case that a new replica joins and doesn't know about the other replicas is a serious problem, but it's not new. It's not solved here but we can avoid making it worse. (I hope we can solve it in another PR, to make sure a replica is made aware of the other replicas in the shard ASAP, for example it starts replicating.)

Ok, i pushed the new commit, sorry for the late response, i somehow forgot it and thought we are waitting response from @hpatro

zuiderkwast · 2025-07-11T04:47:11Z

@madolson This is based on your idea to keep track of other replicas' view of the primay's fail status. Great if you can take a look.

hpatro

Haven't taken a complete look.

@enjoy-binbin Could you please share if we still continue the primary rank logic and not bypass that with fast failover logic?

enjoy-binbin · 2025-07-12T14:00:10Z

Could you please share if we still continue the primary rank logic and not bypass that with fast failover logic?

What do you means? sorry i don't quite get the question.

zuiderkwast

I would like to have this in 9.0. I hope others will agree. Maybe we can merge it after RC1.

Signed-off-by: Binbin <binloveplay1314@qq.com>

…valkey-io#2227) In valkey-io#2023 (valkey-io#2209, etc.), we are exploring ways to make failover faster, that is, to minimize the delay. When a node is marked as FAIL and before the failover starts, there is a delay of 500-1000ms. The original purpose of this delay: 1. Allow FAIL to propagate to at least a majority of the primaries. This makes sure they will vote when a replica sends failover auth request. 2. Allow replicas to exchange their offsets, so they will have a correct view of their own rank. We want to minimize this delay while ensuring safety. It is useful for example in these cases: 1. If there is only one replica, then we don't need any delay, or 2. If there are more replicas, with a fast network, the replicas can exchange the offsets very quickly and start the failover within a few milliseconds instead of 500-1000ms. In this PR, when we can be sure that the replica is the best ranked replica, we let it initiate a failover immediately and completely remove the delay. ### How to ensure safety? 1. To make sure this replica has the best rank, it only skips the delay if it is sure that it have the best rank and that all replicas in the same shard agree that the primary is failing. A new flag `CLUSTER_NODE_MY_PRIMARY_FAIL` is introduced to indicate that each replica has marked its primary as FAIL. If all replicas say that the primary is failing, we also know that the offset is not updated, because the offset is not incrementing when the primary is failing. We can skip the delay only if we have received a message from all replicas and they all have set this flag. 2. To make sure the primaries will vote even if they didn't receive the FAIL yet, we use the `CLUSTERMSG_FLAG0_FORCEACK` to make sure they will vote. This is equivalent to broadcasting a FAIL message to all primaries before we broadcast the failover auth request (but cheaper). The race between FAIL (broadcast by A) and AUTH REQUEST (broadcast by R) is illustrated in the following sequence diagram: ``` A R B C | | | | | FAIL | | | |----->| AUTH R.| | | |------->| | | FAIL | | | |-------------->| | | | AUTH R.| | | |-------------->| | FAIL | | | |--------------------->| ``` ### Details This is the how the failover is initiated, with new steps marked with **(new)**: 1. A majority of primaries have marked another primary as PFAIL. 2. Some nodes counts failure reports and marks the failing primary as FAIL. The node that detects FAIL broadcasts it to all nodes in the cluster. 3. When a replica receives FAIL (or detects FAIL itself by counting PFAIL reports) it schedules a failover: a. It sets a timeout (500ms + random 0-500ms). b. It broadcasts pong to the other replicas in the same shard. c. **(new)** The pong (actually the clusterMsg header) has a new flag `CLUSTER_NODE_MY_PRIMARY_FAIL`. When the replicas broadcast pong to each other here, this flag is set. 4. **(new)** When the following conditions are met, skip the remaining delay and start the failover using AUTH REQUEST with the FORCE ACK flag set, that is if a. a PONG is received from every other replica in the same shard (broadcast within the shard) and b. all replicas have marked that its primary is FAIL in their last message (the new `CLUSTER_NODE_MY_PRIMARY_FAIL` flag is set) and c. this is the best replica (rank = 0) and d. my replication offset != 0. 5. When the delay has passed and no other replica has initiated failover, then initiate failover. Notes: * With 3(c), we don't need to wait for FAIL to propagate to all voting primaries. At this point, a FAIL has already been broadcast by some node, but there is a race so our auth request may arrive to some node before the FAIL. Using the FORCE ACK flag ensures the primaries will vote for us. (It is equivalent to broacasting another FAIL just before broadcasting auth request.) * 4(b) ensures that we have received the replication offset from all other replicas and that it's up to date. If a replica says that it's primary is failing, it also means that the replication from the primary to that replica has stopped. * 4(c) is to avoid a special bad case. It can happen that not all replicas know about each other. In this case, two replicas can think they are both the best replica and start the failover at the same time. This can already happen without this PR. When it happens, it usually means that a new replica has just joined and it has no data (offset = 0) and if it wins the election, there is a problem of dataloss (discussed and partially mitigated for the replica migration case in valkey-io#885). To avoid this case, skip this fast failover path if the replica has offset = 0. --------- Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

zuiderkwast · 2026-03-31T20:03:00Z

Test case failed in Daily:

*** [err]: The best replica can initiate an election immediately in an automatic failover in tests/unit/cluster/faster-failover.tcl
log message of '"*Successful partial resynchronization with primary*"' not found in ./tests/tmp/server.10822.103/stdout after line: 0 till line: 155

https://github.com/valkey-io/valkey/actions/runs/23774277197/job/69272507658#step:9:10598

The test case "The best replica can initiate an election immediately test" has been failing in CI jobs. Increase the timeout to account for slow runners. Old waiting time: 50 seconds. New waiting time: 120 seconds, with valgrind: 600 seconds. Intoduced in #2227. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

…#3424) The test case "The best replica can initiate an election immediately test" has been failing in CI jobs. Increase the timeout to account for slow runners. Old waiting time: 50 seconds. New waiting time: 120 seconds, with valgrind: 600 seconds. Intoduced in valkey-io#2227. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

Two changes to tests/unit/cluster/faster-failover.tcl: 1. FAIL detection timeout: `wait_for_condition 1000 10` → `1000 50` (10s → 50s) 2. psync_max_retries: 1200 → 2400 normal (120s → 240s), 6000 → 12000 valgrind (600s → 1200s) The test `The best replica can initiate an election immediately in an automatic failover` in `tests/unit/cluster/faster-failover.tcl` has been flaky since it was introduced on March 27, 2026 by #2227. **Frequency:** 8 out of 15 days (Mar 27 – Apr 8), across valgrind, sanitizer, and slow CI runners. **Common errors:** - `log message of "Successful partial resynchronization with primary" not found` (timeout waiting for psync) - `expected pattern found in srv -N log file: *best ranked replica*` (timeout waiting for FAIL propagation) The test spins up a 12-node cluster (5 primaries + 7 replicas), pauses nodes, and waits for FAIL detection to propagate across all nodes before failover + partial resync. A previous fix attempt #3424 increased the psync timeout from 50s to 120s (600s valgrind), which reduced frequency but did not eliminate it. Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>

The test introduced in valkey-io#2227 is fragile because it unconditionally asserted that "best ranked replica" would be logged for every shard's replica. This assumption was incorrect for two reasons: 1. myselfIsBestRankedReplica() requires failover_failed_primary_rank == 0. When two primaries fail simultaneously, only the shard with the lowest shard_id gets failed_primary_rank == 0. The other shard's replicas can only trigger "Myself become the best ranked replica" after the first shard completes failover and the result propagates via gossip. 2. clusterAllReplicasThinkPrimaryIsFail() requires all sibling replicas to have their MY_PRIMARY_FAIL flag propagated via gossip. If the election delay is shorter than the gossip propagation time, the replica may start the election before receiving the flag. It's also too strict and checks too many things. Like we can check that there is no delay before starting the failover, but we shouldn't check that the rank 0 replica actually wins, because maybe it doesn't. Also we're not sure the other replica does a psync afterwards, it can also do a full sync. Restructure the test into three focused scenarios with retry logic: - Test 0 (3 primaries + 1 replica): The sole replica is deterministically the best ranked replica. All conditions are trivially satisfied. - Test 1 (3 primaries + 2 replicas): Single primary failure with two competing replicas. failed_primary_rank is deterministically 0, but MY_PRIMARY_FAIL gossip timing may prevent triggering in some runs. Uses internal retry with cluster recovery to ensure at least one successful trigger. - Test 2 (5 primaries + 7 replicas): Two primaries failure. Verifies no failover timeout and uses retry to cover the "best ranked replica" path for at least one shard. Signed-off-by: Binbin <binloveplay1314@qq.com>

The test introduced in #2227 is fragile because it unconditionally asserted that "best ranked replica" would be logged for every shard's replica. This assumption was incorrect for two reasons: 1. myselfIsBestRankedReplica() requires failover_failed_primary_rank == 0. When two primaries fail simultaneously, only the shard with the lowest shard_id gets failed_primary_rank == 0. The other shard's replicas can only trigger "Myself become the best ranked replica" after the first shard completes failover and the result propagates via gossip. 2. clusterAllReplicasThinkPrimaryIsFail() requires all sibling replicas to have their MY_PRIMARY_FAIL flag propagated via gossip. If the election delay is shorter than the gossip propagation time, the replica may start the election before receiving the flag. It's also too strict and checks too many things. Like we can check that there is no delay before starting the failover, but we shouldn't check that the rank 0 replica actually wins, because maybe it doesn't. Also we're not sure the other replica does a psync afterwards, it can also do a full sync. Restructure the test into three focused scenarios with retry logic: - Test 0 (3 primaries + 1 replica): The sole replica is deterministically the best ranked replica. All conditions are trivially satisfied. - Test 1 (3 primaries + 2 replicas): Single primary failure with two competing replicas. failed_primary_rank is deterministically 0, but MY_PRIMARY_FAIL gossip timing may prevent triggering in some runs. Uses internal retry with cluster recovery to ensure at least one successful trigger. - Test 2 (5 primaries + 7 replicas): Two primaries failure. Verifies no failover timeout and uses retry to cover the "best ranked replica" path for at least one shard. Signed-off-by: Binbin <binloveplay1314@qq.com>

…valkey-io#2227) In valkey-io#2023 (valkey-io#2209, etc.), we are exploring ways to make failover faster, that is, to minimize the delay. When a node is marked as FAIL and before the failover starts, there is a delay of 500-1000ms. The original purpose of this delay: 1. Allow FAIL to propagate to at least a majority of the primaries. This makes sure they will vote when a replica sends failover auth request. 2. Allow replicas to exchange their offsets, so they will have a correct view of their own rank. We want to minimize this delay while ensuring safety. It is useful for example in these cases: 1. If there is only one replica, then we don't need any delay, or 2. If there are more replicas, with a fast network, the replicas can exchange the offsets very quickly and start the failover within a few milliseconds instead of 500-1000ms. In this PR, when we can be sure that the replica is the best ranked replica, we let it initiate a failover immediately and completely remove the delay. ### How to ensure safety? 1. To make sure this replica has the best rank, it only skips the delay if it is sure that it have the best rank and that all replicas in the same shard agree that the primary is failing. A new flag `CLUSTER_NODE_MY_PRIMARY_FAIL` is introduced to indicate that each replica has marked its primary as FAIL. If all replicas say that the primary is failing, we also know that the offset is not updated, because the offset is not incrementing when the primary is failing. We can skip the delay only if we have received a message from all replicas and they all have set this flag. 2. To make sure the primaries will vote even if they didn't receive the FAIL yet, we use the `CLUSTERMSG_FLAG0_FORCEACK` to make sure they will vote. This is equivalent to broadcasting a FAIL message to all primaries before we broadcast the failover auth request (but cheaper). The race between FAIL (broadcast by A) and AUTH REQUEST (broadcast by R) is illustrated in the following sequence diagram: ``` A R B C | | | | | FAIL | | | |----->| AUTH R.| | | |------->| | | FAIL | | | |-------------->| | | | AUTH R.| | | |-------------->| | FAIL | | | |--------------------->| ``` ### Details This is the how the failover is initiated, with new steps marked with **(new)**: 1. A majority of primaries have marked another primary as PFAIL. 2. Some nodes counts failure reports and marks the failing primary as FAIL. The node that detects FAIL broadcasts it to all nodes in the cluster. 3. When a replica receives FAIL (or detects FAIL itself by counting PFAIL reports) it schedules a failover: a. It sets a timeout (500ms + random 0-500ms). b. It broadcasts pong to the other replicas in the same shard. c. **(new)** The pong (actually the clusterMsg header) has a new flag `CLUSTER_NODE_MY_PRIMARY_FAIL`. When the replicas broadcast pong to each other here, this flag is set. 4. **(new)** When the following conditions are met, skip the remaining delay and start the failover using AUTH REQUEST with the FORCE ACK flag set, that is if a. a PONG is received from every other replica in the same shard (broadcast within the shard) and b. all replicas have marked that its primary is FAIL in their last message (the new `CLUSTER_NODE_MY_PRIMARY_FAIL` flag is set) and c. this is the best replica (rank = 0) and d. my replication offset != 0. 5. When the delay has passed and no other replica has initiated failover, then initiate failover. Notes: * With 3(c), we don't need to wait for FAIL to propagate to all voting primaries. At this point, a FAIL has already been broadcast by some node, but there is a race so our auth request may arrive to some node before the FAIL. Using the FORCE ACK flag ensures the primaries will vote for us. (It is equivalent to broacasting another FAIL just before broadcasting auth request.) * 4(b) ensures that we have received the replication offset from all other replicas and that it's up to date. If a replica says that it's primary is failing, it also means that the replication from the primary to that replica has stopped. * 4(c) is to avoid a special bad case. It can happen that not all replicas know about each other. In this case, two replicas can think they are both the best replica and start the failover at the same time. This can already happen without this PR. When it happens, it usually means that a new replica has just joined and it has no data (offset = 0) and if it wins the election, there is a problem of dataloss (discussed and partially mitigated for the replica migration case in valkey-io#885). To avoid this case, skip this fast failover path if the replica has offset = 0. --------- Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

…#3424) The test case "The best replica can initiate an election immediately test" has been failing in CI jobs. Increase the timeout to account for slow runners. Old waiting time: 50 seconds. New waiting time: 120 seconds, with valgrind: 600 seconds. Intoduced in valkey-io#2227. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

…-io#3463) Two changes to tests/unit/cluster/faster-failover.tcl: 1. FAIL detection timeout: `wait_for_condition 1000 10` → `1000 50` (10s → 50s) 2. psync_max_retries: 1200 → 2400 normal (120s → 240s), 6000 → 12000 valgrind (600s → 1200s) The test `The best replica can initiate an election immediately in an automatic failover` in `tests/unit/cluster/faster-failover.tcl` has been flaky since it was introduced on March 27, 2026 by valkey-io#2227. **Frequency:** 8 out of 15 days (Mar 27 – Apr 8), across valgrind, sanitizer, and slow CI runners. **Common errors:** - `log message of "Successful partial resynchronization with primary" not found` (timeout waiting for psync) - `expected pattern found in srv -N log file: *best ranked replica*` (timeout waiting for FAIL propagation) The test spins up a 12-node cluster (5 primaries + 7 replicas), pauses nodes, and waits for FAIL detection to propagate across all nodes before failover + partial resync. A previous fix attempt valkey-io#3424 increased the psync timeout from 50s to 120s (600s valgrind), which reduced frequency but did not eliminate it. Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>

The test introduced in valkey-io#2227 is fragile because it unconditionally asserted that "best ranked replica" would be logged for every shard's replica. This assumption was incorrect for two reasons: 1. myselfIsBestRankedReplica() requires failover_failed_primary_rank == 0. When two primaries fail simultaneously, only the shard with the lowest shard_id gets failed_primary_rank == 0. The other shard's replicas can only trigger "Myself become the best ranked replica" after the first shard completes failover and the result propagates via gossip. 2. clusterAllReplicasThinkPrimaryIsFail() requires all sibling replicas to have their MY_PRIMARY_FAIL flag propagated via gossip. If the election delay is shorter than the gossip propagation time, the replica may start the election before receiving the flag. It's also too strict and checks too many things. Like we can check that there is no delay before starting the failover, but we shouldn't check that the rank 0 replica actually wins, because maybe it doesn't. Also we're not sure the other replica does a psync afterwards, it can also do a full sync. Restructure the test into three focused scenarios with retry logic: - Test 0 (3 primaries + 1 replica): The sole replica is deterministically the best ranked replica. All conditions are trivially satisfied. - Test 1 (3 primaries + 2 replicas): Single primary failure with two competing replicas. failed_primary_rank is deterministically 0, but MY_PRIMARY_FAIL gossip timing may prevent triggering in some runs. Uses internal retry with cluster recovery to ensure at least one successful trigger. - Test 2 (5 primaries + 7 replicas): Two primaries failure. Verifies no failover timeout and uses retry to cover the "best ranked replica" path for at least one shard. Signed-off-by: Binbin <binloveplay1314@qq.com>

…#2227) In #2023 (#2209, etc.), we are exploring ways to make failover faster, that is, to minimize the delay. When a node is marked as FAIL and before the failover starts, there is a delay of 500-1000ms. The original purpose of this delay: 1. Allow FAIL to propagate to at least a majority of the primaries. This makes sure they will vote when a replica sends failover auth request. 2. Allow replicas to exchange their offsets, so they will have a correct view of their own rank. We want to minimize this delay while ensuring safety. It is useful for example in these cases: 1. If there is only one replica, then we don't need any delay, or 2. If there are more replicas, with a fast network, the replicas can exchange the offsets very quickly and start the failover within a few milliseconds instead of 500-1000ms. In this PR, when we can be sure that the replica is the best ranked replica, we let it initiate a failover immediately and completely remove the delay. ### How to ensure safety? 1. To make sure this replica has the best rank, it only skips the delay if it is sure that it have the best rank and that all replicas in the same shard agree that the primary is failing. A new flag `CLUSTER_NODE_MY_PRIMARY_FAIL` is introduced to indicate that each replica has marked its primary as FAIL. If all replicas say that the primary is failing, we also know that the offset is not updated, because the offset is not incrementing when the primary is failing. We can skip the delay only if we have received a message from all replicas and they all have set this flag. 2. To make sure the primaries will vote even if they didn't receive the FAIL yet, we use the `CLUSTERMSG_FLAG0_FORCEACK` to make sure they will vote. This is equivalent to broadcasting a FAIL message to all primaries before we broadcast the failover auth request (but cheaper). The race between FAIL (broadcast by A) and AUTH REQUEST (broadcast by R) is illustrated in the following sequence diagram: ``` A R B C | | | | | FAIL | | | |----->| AUTH R.| | | |------->| | | FAIL | | | |-------------->| | | | AUTH R.| | | |-------------->| | FAIL | | | |--------------------->| ``` ### Details This is the how the failover is initiated, with new steps marked with **(new)**: 1. A majority of primaries have marked another primary as PFAIL. 2. Some nodes counts failure reports and marks the failing primary as FAIL. The node that detects FAIL broadcasts it to all nodes in the cluster. 3. When a replica receives FAIL (or detects FAIL itself by counting PFAIL reports) it schedules a failover: a. It sets a timeout (500ms + random 0-500ms). b. It broadcasts pong to the other replicas in the same shard. c. **(new)** The pong (actually the clusterMsg header) has a new flag `CLUSTER_NODE_MY_PRIMARY_FAIL`. When the replicas broadcast pong to each other here, this flag is set. 4. **(new)** When the following conditions are met, skip the remaining delay and start the failover using AUTH REQUEST with the FORCE ACK flag set, that is if a. a PONG is received from every other replica in the same shard (broadcast within the shard) and b. all replicas have marked that its primary is FAIL in their last message (the new `CLUSTER_NODE_MY_PRIMARY_FAIL` flag is set) and c. this is the best replica (rank = 0) and d. my replication offset != 0. 5. When the delay has passed and no other replica has initiated failover, then initiate failover. Notes: * With 3(c), we don't need to wait for FAIL to propagate to all voting primaries. At this point, a FAIL has already been broadcast by some node, but there is a race so our auth request may arrive to some node before the FAIL. Using the FORCE ACK flag ensures the primaries will vote for us. (It is equivalent to broacasting another FAIL just before broadcasting auth request.) * 4(b) ensures that we have received the replication offset from all other replicas and that it's up to date. If a replica says that it's primary is failing, it also means that the replication from the primary to that replica has stopped. * 4(c) is to avoid a special bad case. It can happen that not all replicas know about each other. In this case, two replicas can think they are both the best replica and start the failover at the same time. This can already happen without this PR. When it happens, it usually means that a new replica has just joined and it has no data (offset = 0) and if it wins the election, there is a problem of dataloss (discussed and partially mitigated for the replica migration case in #885). To avoid this case, skip this fast failover path if the replica has offset = 0. --------- Signed-off-by: Binbin <binloveplay1314@qq.com> Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

The test case "The best replica can initiate an election immediately test" has been failing in CI jobs. Increase the timeout to account for slow runners. Old waiting time: 50 seconds. New waiting time: 120 seconds, with valgrind: 600 seconds. Intoduced in #2227. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

Two changes to tests/unit/cluster/faster-failover.tcl: 1. FAIL detection timeout: `wait_for_condition 1000 10` → `1000 50` (10s → 50s) 2. psync_max_retries: 1200 → 2400 normal (120s → 240s), 6000 → 12000 valgrind (600s → 1200s) The test `The best replica can initiate an election immediately in an automatic failover` in `tests/unit/cluster/faster-failover.tcl` has been flaky since it was introduced on March 27, 2026 by #2227. **Frequency:** 8 out of 15 days (Mar 27 – Apr 8), across valgrind, sanitizer, and slow CI runners. **Common errors:** - `log message of "Successful partial resynchronization with primary" not found` (timeout waiting for psync) - `expected pattern found in srv -N log file: *best ranked replica*` (timeout waiting for FAIL propagation) The test spins up a 12-node cluster (5 primaries + 7 replicas), pauses nodes, and waits for FAIL detection to propagate across all nodes before failover + partial resync. A previous fix attempt #3424 increased the psync timeout from 50s to 120s (600s valgrind), which reduced frequency but did not eliminate it. Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>

The test introduced in #2227 is fragile because it unconditionally asserted that "best ranked replica" would be logged for every shard's replica. This assumption was incorrect for two reasons: 1. myselfIsBestRankedReplica() requires failover_failed_primary_rank == 0. When two primaries fail simultaneously, only the shard with the lowest shard_id gets failed_primary_rank == 0. The other shard's replicas can only trigger "Myself become the best ranked replica" after the first shard completes failover and the result propagates via gossip. 2. clusterAllReplicasThinkPrimaryIsFail() requires all sibling replicas to have their MY_PRIMARY_FAIL flag propagated via gossip. If the election delay is shorter than the gossip propagation time, the replica may start the election before receiving the flag. It's also too strict and checks too many things. Like we can check that there is no delay before starting the failover, but we shouldn't check that the rank 0 replica actually wins, because maybe it doesn't. Also we're not sure the other replica does a psync afterwards, it can also do a full sync. Restructure the test into three focused scenarios with retry logic: - Test 0 (3 primaries + 1 replica): The sole replica is deterministically the best ranked replica. All conditions are trivially satisfied. - Test 1 (3 primaries + 2 replicas): Single primary failure with two competing replicas. failed_primary_rank is deterministically 0, but MY_PRIMARY_FAIL gossip timing may prevent triggering in some runs. Uses internal retry with cluster recovery to ensure at least one successful trigger. - Test 2 (5 primaries + 7 replicas): Two primaries failure. Verifies no failover timeout and uses retry to cover the "best ranked replica" path for at least one shard. Signed-off-by: Binbin <binloveplay1314@qq.com>

…#3424) The test case "The best replica can initiate an election immediately test" has been failing in CI jobs. Increase the timeout to account for slow runners. Old waiting time: 50 seconds. New waiting time: 120 seconds, with valgrind: 600 seconds. Intoduced in valkey-io#2227. Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech> Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>

enjoy-binbin mentioned this pull request Jun 17, 2025

Start failover without waiting for the next cluster cron cycle #2209

Merged

enjoy-binbin requested review from madolson and zuiderkwast June 17, 2025 15:52

Remove spaces

35d28dc

Signed-off-by: Binbin <binloveplay1314@qq.com>

enjoy-binbin added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Jun 17, 2025

zuiderkwast reviewed Jun 17, 2025

View reviewed changes

Comment thread src/cluster_legacy.c Outdated

Comment thread src/cluster_legacy.c Outdated

Comment thread src/cluster_legacy.c Outdated

enjoy-binbin added 5 commits June 20, 2025 10:37

Merge remote-tracking branch 'upstream/unstable' into faster_failover2

555d556

Signed-off-by: Binbin <binloveplay1314@qq.com>

code review

20e6f5b

Signed-off-by: Binbin <binloveplay1314@qq.com>

add test

a4e4cb4

Signed-off-by: Binbin <binloveplay1314@qq.com>

Add markNodeAsFailing helper function

146482c

Signed-off-by: Binbin <binloveplay1314@qq.com>

Set CLUSTERMSG_FLAG0_FORCEACK flag if myself is the best ranked replica

22c63ef

Signed-off-by: Binbin <binloveplay1314@qq.com>

enjoy-binbin commented Jun 25, 2025

View reviewed changes

Comment thread src/cluster_legacy.c

enjoy-binbin added 2 commits July 2, 2025 19:57

Merge remote-tracking branch 'upstream/unstable' into faster_failover2

034dac0

Signed-off-by: Binbin <binloveplay1314@qq.com>

revert change

d8eb504

Signed-off-by: Binbin <binloveplay1314@qq.com>

zuiderkwast reviewed Jul 2, 2025

View reviewed changes

Comment thread src/cluster_legacy.c Outdated

enjoy-binbin added 2 commits July 11, 2025 11:57

skip newly empty replica (offset == 0)

3925892

Signed-off-by: Binbin <binloveplay1314@qq.com>

Merge remote-tracking branch 'upstream/unstable' into faster_failover2

2c0b049

Signed-off-by: Binbin <binloveplay1314@qq.com>

hpatro reviewed Jul 12, 2025

View reviewed changes

sungming2 mentioned this pull request Jul 18, 2025

Increase fixed delay with small randomness in failover delay logic to avoid same election #1669

Closed

zuiderkwast approved these changes Aug 4, 2025

View reviewed changes

Comment thread tests/unit/cluster/faster-failover.tcl Outdated

zuiderkwast added this to Valkey 9.0 Aug 4, 2025

zuiderkwast moved this to Optional for next release candidate in Valkey 9.0 Aug 4, 2025

Merge remote-tracking branch 'upstream/unstable' into faster_failover2

92e9629

Signed-off-by: Binbin <binloveplay1314@qq.com>

zuiderkwast merged commit 6822a67 into valkey-io:unstable Mar 26, 2026
151 of 162 checks passed

github-project-automation Bot moved this from Todo to To be backported in Valkey 9.1 Mar 26, 2026

enjoy-binbin deleted the faster_failover2 branch March 27, 2026 01:52

enjoy-binbin added the release-notes This issue should get a line item in the release notes label Mar 27, 2026

zuiderkwast mentioned this pull request Mar 31, 2026

Increase timeout in flaky "failover immediately" test case #3424

Merged

rainsupreme mentioned this pull request Mar 31, 2026

flaky immediate-failover test fix: match full sync as well as partial sync #3426

Closed

rainsupreme mentioned this pull request Apr 1, 2026

Revert "Do the failover immediately if the replica is the best ranked replica" #3431

Closed

roshkhatri mentioned this pull request Apr 8, 2026

Increase timeouts in faster-failover test for slow CI runners #3463

Merged

enjoy-binbin mentioned this pull request Apr 13, 2026

Rewrite the faster failover test case #3495

Merged

sarthakaggarwal97 mentioned this pull request Apr 14, 2026

Merge unstable into 9.1 #3507

Closed

sarthakaggarwal97 mentioned this pull request Apr 16, 2026

Backport Unstable to 9.1 for RC2 #3519

Merged

This was referenced May 4, 2026

[Eval] Agent review bench sarthakaggarwal97/valkey#218

Closed

[backport] Weekly backport sweep for 9.1 sarthakaggarwal97/valkey#225

Closed

sarthakaggarwal97 moved this from To be backported to Done in Valkey 9.1 May 16, 2026

Uh oh!

Conversation

enjoy-binbin commented Jun 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

How to ensure safety?

Details

Uh oh!

codecov Bot commented Jun 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

zuiderkwast left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

enjoy-binbin commented Jun 18, 2025 • edited by zuiderkwast Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

zuiderkwast commented Jul 2, 2025

Uh oh!

enjoy-binbin commented Jul 11, 2025

Uh oh!

zuiderkwast commented Jul 11, 2025

Uh oh!

hpatro left a comment

Choose a reason for hiding this comment

Uh oh!

enjoy-binbin commented Jul 12, 2025

Uh oh!

zuiderkwast left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

zuiderkwast commented Mar 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

enjoy-binbin commented Jun 17, 2025 •

edited

Loading

codecov Bot commented Jun 17, 2025 •

edited

Loading

enjoy-binbin commented Jun 18, 2025 •

edited by zuiderkwast

Loading