Skip to content

Do the failover immediately if the replica is the best ranked replica#2227

Merged
zuiderkwast merged 16 commits into
valkey-io:unstablefrom
enjoy-binbin:faster_failover2
Mar 26, 2026
Merged

Do the failover immediately if the replica is the best ranked replica#2227
zuiderkwast merged 16 commits into
valkey-io:unstablefrom
enjoy-binbin:faster_failover2

Conversation

@enjoy-binbin

@enjoy-binbin enjoy-binbin commented Jun 17, 2025

Copy link
Copy Markdown
Member

In #2023 (#2209, etc.), we are exploring ways to make failover faster, that is,
to minimize the delay.

When a node is marked as FAIL and before the failover starts, there is a delay of
500-1000ms. The original purpose of this delay:

  1. Allow FAIL to propagate to at least a majority of the primaries. This makes sure
    they will vote when a replica sends failover auth request.
  2. Allow replicas to exchange their offsets, so they will have a correct view of
    their own rank.

We want to minimize this delay while ensuring safety. It is useful for example in
these cases:

  1. If there is only one replica, then we don't need any delay, or
  2. If there are more replicas, with a fast network, the replicas can exchange the
    offsets very quickly and start the failover within a few milliseconds instead
    of 500-1000ms.

In this PR, when we can be sure that the replica is the best ranked replica, we let
it initiate a failover immediately and completely remove the delay.

How to ensure safety?

  1. To make sure this replica has the best rank, it only skips the delay if it is
    sure that it have the best rank and that all replicas in the same shard agree that
    the primary is failing. A new flag CLUSTER_NODE_MY_PRIMARY_FAIL is introduced to
    indicate that each replica has marked its primary as FAIL. If all replicas say that
    the primary is failing, we also know that the offset is not updated, because the
    offset is not incrementing when the primary is failing. We can skip the delay only
    if we have received a message from all replicas and they all have set this flag.

  2. To make sure the primaries will vote even if they didn't receive the FAIL yet,
    we use the CLUSTERMSG_FLAG0_FORCEACK to make sure they will vote. This is equivalent
    to broadcasting a FAIL message to all primaries before we broadcast the failover
    auth request (but cheaper).

The race between FAIL (broadcast by A) and AUTH REQUEST (broadcast by R) is illustrated
in the following sequence diagram:

A      R        B      C
|      |        |      |
| FAIL |        |      |
|----->| AUTH R.|      |
|      |------->|      |
| FAIL |        |      |
|-------------->|      |
|      | AUTH R.|      |
|      |-------------->|
| FAIL |        |      |
|--------------------->|

Details

This is the how the failover is initiated, with new steps marked with (new):

  1. A majority of primaries have marked another primary as PFAIL.
  2. Some nodes counts failure reports and marks the failing primary as FAIL. The node
    that detects FAIL broadcasts it to all nodes in the cluster.
  3. When a replica receives FAIL (or detects FAIL itself by counting PFAIL reports) it
    schedules a failover:
    a. It sets a timeout (500ms + random 0-500ms).
    b. It broadcasts pong to the other replicas in the same shard.
    c. (new) The pong (actually the clusterMsg header) has a new flag CLUSTER_NODE_MY_PRIMARY_FAIL.
    When the replicas broadcast pong to each other here, this flag is set.
  4. (new) When the following conditions are met, skip the remaining delay and start
    the failover using AUTH REQUEST with the FORCE ACK flag set, that is if
    a. a PONG is received from every other replica in the same shard (broadcast within
    the shard) and
    b. all replicas have marked that its primary is FAIL in their last message (the new
    CLUSTER_NODE_MY_PRIMARY_FAIL flag is set) and
    c. this is the best replica (rank = 0) and
    d. my replication offset != 0.
  5. When the delay has passed and no other replica has initiated failover, then initiate failover.

Notes:

  • With 3(c), we don't need to wait for FAIL to propagate to all voting primaries.
    At this point, a FAIL has already been broadcast by some node, but there is a race
    so our auth request may arrive to some node before the FAIL. Using the FORCE ACK flag
    ensures the primaries will vote for us. (It is equivalent to broacasting another FAIL
    just before broadcasting auth request.)
  • 4(b) ensures that we have received the replication offset from all other replicas and
    that it's up to date. If a replica says that it's primary is failing, it also means that
    the replication from the primary to that replica has stopped.
  • 4(c) is to avoid a special bad case. It can happen that not all replicas know about each
    other. In this case, two replicas can think they are both the best replica and start the
    failover at the same time. This can already happen without this PR. When it happens, it
    usually means that a new replica has just joined and it has no data (offset = 0) and if
    it wins the election, there is a problem of dataloss (discussed and partially mitigated
    for the replica migration case in Fix data loss when replica do a failover with a old history repl offset #885). To avoid this case, skip this fast failover path
    if the replica has offset = 0.

In valkey-io#2209, we are exploring ways to make failover faster, that is, to
minimize the delay.

The purpose of this delay is to allow us to have a scheduled election,
so that during the delay, there is a way for the cluster to get more
detailed information. For example, other primary nodes need to receive
the propagation of FAIL in order to vote, and we need to know the offset
of other replica nodes in order to rank.

We want to minimize delay while ensuring safety. It is very useful for
example, if there is only one replica, then we don't need any delay.

In this PR, when we find that the replica is the best ranked replica,
we let it initiate a failover immediately and completely remove the delay.

How to ensure safety?

1. We try to broadcast a FAIL message to the cluster before the replica
initiates the election, so that other primaries will not refuse to vote.

2. Each replica node will mark in flags whether its primary node is in
FAIL state, that is, a new CLUSTER_NODE_MY_PRIMARY_FAIL is introduced.
And when we gossip with others, the replica can know whether all replicas
under the primary node have reached a consensus on the FAIL state based
on this flag. If they have reached a consensus, it means that we know the
offset information of all replicas, which means that we can calculate the
rank based on the existing information without worrying about the offset
being outdated.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
@enjoy-binbin enjoy-binbin added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Jun 17, 2025
@codecov

codecov Bot commented Jun 17, 2025

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.53%. Comparing base (f7a64c5) to head (9ecb0ab).
⚠️ Report is 2 commits behind head on unstable.

Additional details and impacted files
@@            Coverage Diff            @@
##           unstable    #2227   +/-   ##
=========================================
  Coverage     76.53%   76.53%           
=========================================
  Files           159      159           
  Lines         79652    79679   +27     
=========================================
+ Hits          60958    60980   +22     
- Misses        18694    18699    +5     
Files with missing lines Coverage Δ
src/cluster_legacy.c 88.24% <100.00%> (+0.03%) ⬆️

... and 21 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@zuiderkwast zuiderkwast left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job!

I have some ideas. See comments.

Some test case would be nice to see, to check that we can do a failover faster than the delay that we skip. For example with node timeout 2000, check that we can do failover in 2500 ms? I guess a test like this can be flaky but at least try it to see if the feature works, would be good.

I'm a little confused about CLUSTER_TODO_HANDLE_FASTER_FAILOVER. Why do we need it? Why is it different to CLUSTER_TODO_HANDLE_FAILOVER, when can we use it, when can we not use it, etc. It is not clear to me.

IIUC, it's not really needed for the the feature skip delay if clusterAllReplicasThinkPrimaryIsFail() returns true and my rank is 0, is it? If it's a separate optimization, then I guess we skip the FASTER_FAILOVER idea in this PR, to keep it simpler. WDYT?

Comment thread src/cluster_legacy.c Outdated
Comment thread src/cluster_legacy.c Outdated
Comment thread src/cluster_legacy.c Outdated
@enjoy-binbin

enjoy-binbin commented Jun 18, 2025

Copy link
Copy Markdown
Member Author

Some test case would be nice to see, to check that we can do a failover faster than the delay that we skip. For example with node timeout 2000, check that we can do failover in 2500 ms? I guess a test like this can be flaky but at least try it to see if the feature works, would be good.

Yes, i will add the test later, i test it locally yesterday and it work.

I'm a little confused about CLUSTER_TODO_HANDLE_FASTER_FAILOVER. Why do we need it? Why is it different to CLUSTER_TODO_HANDLE_FAILOVER, when can we use it, when can we not use it, etc. It is not clear to me.

Yes, it is indeed a bit confusing, sorry i did not mention it. I want to put CLUSTER_TODO_HANDLE_FASTER_FAILOVER after other flags, like after CLUSTER_TODO_UPDATE_STATE, so that we will first print CLUSTER IS DOWN and then try to trigger the failover. That is now we will trigger failover faster, in the past, we will update the cluster state and then try to failover in ClusterCron.

IIUC, it's not really needed for the the feature skip delay if clusterAllReplicasThinkPrimaryIsFail() returns true and my rank is 0, is it? If it's a separate optimization, then I guess we skip the FASTER_FAILOVER idea in this PR, to keep it simpler. WDYT?

Sure, i can remove FASTER_FAILOVER and then only put it in #2209

Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Comment thread src/cluster_legacy.c
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Comment thread src/cluster_legacy.c Outdated
@zuiderkwast

Copy link
Copy Markdown
Contributor

@hpatro I have discussed with @enjoy-binbin and updated the top comment with a more thorough reasoning about the possible scenarios. PTAL.

The special case that a new replica joins and doesn't know about the other replicas is a serious problem, but it's not new. It's not solved here but we can avoid making it worse. (I hope we can solve it in another PR, to make sure a replica is made aware of the other replicas in the shard ASAP, for example it starts replicating.)

Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
@enjoy-binbin

Copy link
Copy Markdown
Member Author

Let's also include a check that myself's replication offset != 0 here?

It makes sure that if myself is a new replica just connected that don't yet know about other replicas, we should not try to do this fast failover.

The special case that a new replica joins and doesn't know about the other replicas is a serious problem, but it's not new. It's not solved here but we can avoid making it worse. (I hope we can solve it in another PR, to make sure a replica is made aware of the other replicas in the shard ASAP, for example it starts replicating.)

Ok, i pushed the new commit, sorry for the late response, i somehow forgot it and thought we are waitting response from @hpatro

@zuiderkwast

Copy link
Copy Markdown
Contributor

@madolson This is based on your idea to keep track of other replicas' view of the primay's fail status. Great if you can take a look.

@hpatro hpatro left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't taken a complete look.

@enjoy-binbin Could you please share if we still continue the primary rank logic and not bypass that with fast failover logic?

@enjoy-binbin

Copy link
Copy Markdown
Member Author

Could you please share if we still continue the primary rank logic and not bypass that with fast failover logic?

What do you means? sorry i don't quite get the question.

@zuiderkwast zuiderkwast left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to have this in 9.0. I hope others will agree. Maybe we can merge it after RC1.

Comment thread tests/unit/cluster/faster-failover.tcl Outdated
@zuiderkwast zuiderkwast moved this to Optional for next release candidate in Valkey 9.0 Aug 4, 2025
Signed-off-by: Binbin <binloveplay1314@qq.com>
@zuiderkwast zuiderkwast merged commit 6822a67 into valkey-io:unstable Mar 26, 2026
151 of 162 checks passed
@github-project-automation github-project-automation Bot moved this from Todo to To be backported in Valkey 9.1 Mar 26, 2026
@enjoy-binbin enjoy-binbin deleted the faster_failover2 branch March 27, 2026 01:52
@enjoy-binbin enjoy-binbin added the release-notes This issue should get a line item in the release notes label Mar 27, 2026
charsyam pushed a commit to charsyam/valkey that referenced this pull request Mar 28, 2026
…valkey-io#2227)

In valkey-io#2023 (valkey-io#2209, etc.), we are exploring ways to make failover faster,
that is, to minimize the delay.

When a node is marked as FAIL and before the failover starts, there is a
delay of 500-1000ms. The original purpose of this delay:

1. Allow FAIL to propagate to at least a majority of the primaries. This
   makes sure they will vote when a replica sends failover auth request.
2. Allow replicas to exchange their offsets, so they will have a correct
   view of their own rank.

We want to minimize this delay while ensuring safety. It is useful for
example in these cases:

1. If there is only one replica, then we don't need any delay, or
2. If there are more replicas, with a fast network, the replicas can
   exchange the offsets very quickly and start the failover within a few
   milliseconds instead of 500-1000ms.

In this PR, when we can be sure that the replica is the best ranked
replica, we let
it initiate a failover immediately and completely remove the delay.

### How to ensure safety?

1. To make sure this replica has the best rank, it only skips the delay
   if it is sure that it have the best rank and that all replicas in the
   same shard agree that the primary is failing. A new flag
   `CLUSTER_NODE_MY_PRIMARY_FAIL` is introduced to indicate that each
   replica has marked its primary as FAIL. If all replicas say that the
   primary is failing, we also know that the offset is not updated,
   because the offset is not incrementing when the primary is failing.
   We can skip the delay only if we have received a message from all
   replicas and they all have set this flag.

2. To make sure the primaries will vote even if they didn't receive the
   FAIL yet, we use the `CLUSTERMSG_FLAG0_FORCEACK` to make sure they
   will vote. This is equivalent to broadcasting a FAIL message to all
   primaries before we broadcast the failover auth request (but
   cheaper).

The race between FAIL (broadcast by A) and AUTH REQUEST (broadcast by R)
is illustrated in the following sequence diagram:

```
A      R        B      C
|      |        |      |
| FAIL |        |      |
|----->| AUTH R.|      |
|      |------->|      |
| FAIL |        |      |
|-------------->|      |
|      | AUTH R.|      |
|      |-------------->|
| FAIL |        |      |
|--------------------->|
```

### Details

This is the how the failover is initiated, with new steps marked with
**(new)**:

1. A majority of primaries have marked another primary as PFAIL.

2. Some nodes counts failure reports and marks the failing primary as
   FAIL. The node
   that detects FAIL broadcasts it to all nodes in the cluster.

3. When a replica receives FAIL (or detects FAIL itself by counting
   PFAIL reports) it schedules a failover:
   
   a. It sets a timeout (500ms + random 0-500ms).
   b. It broadcasts pong to the other replicas in the same shard.   
   c. **(new)** The pong (actually the clusterMsg header) has a new flag
      `CLUSTER_NODE_MY_PRIMARY_FAIL`. When the replicas broadcast pong
      to each other here, this flag is set.

4. **(new)** When the following conditions are met, skip the remaining
   delay and start the failover using AUTH REQUEST with the FORCE ACK
   flag set, that is if
   
   a. a PONG is received from every other replica in the same shard
      (broadcast within the shard) and
   b. all replicas have marked that its primary is FAIL in their last
      message (the new `CLUSTER_NODE_MY_PRIMARY_FAIL` flag is set) and
   c. this is the best replica (rank = 0) and
   d. my replication offset != 0.

5. When the delay has passed and no other replica has initiated
   failover, then initiate failover.

Notes:

* With 3(c), we don't need to wait for FAIL to propagate to all voting
  primaries. At this point, a FAIL has already been broadcast by some
  node, but there is a race so our auth request may arrive to some node
  before the FAIL. Using the FORCE ACK flag ensures the primaries will
  vote for us. (It is equivalent to broacasting another FAIL just before
  broadcasting auth request.)
  
* 4(b) ensures that we have received the replication offset from all
  other replicas and that it's up to date. If a replica says that it's
  primary is failing, it also means that the replication from the
  primary to that replica has stopped.

* 4(c) is to avoid a special bad case. It can happen that not all
  replicas know about each other. In this case, two replicas can think
  they are both the best replica and start the failover at the same
  time. This can already happen without this PR. When it happens, it
  usually means that a new replica has just joined and it has no data
  (offset = 0) and if it wins the election, there is a problem of
  dataloss (discussed and partially mitigated for the replica migration
  case in valkey-io#885). To avoid this case, skip this fast failover path if the
  replica has offset = 0.

---------

Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
@zuiderkwast

Copy link
Copy Markdown
Contributor

Test case failed in Daily:

*** [err]: The best replica can initiate an election immediately in an automatic failover in tests/unit/cluster/faster-failover.tcl
log message of '"*Successful partial resynchronization with primary*"' not found in ./tests/tmp/server.10822.103/stdout after line: 0 till line: 155

https://github.com/valkey-io/valkey/actions/runs/23774277197/job/69272507658#step:9:10598

zuiderkwast added a commit that referenced this pull request Apr 1, 2026
The test case "The best replica can initiate an election immediately
test" has been failing in CI jobs.

Increase the timeout to account for slow runners.

Old waiting time: 50 seconds. New waiting time: 120 seconds, with
valgrind: 600 seconds.

Intoduced in #2227.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Nikhil-Manglore pushed a commit to Nikhil-Manglore/valkey that referenced this pull request Apr 7, 2026
…#3424)

The test case "The best replica can initiate an election immediately
test" has been failing in CI jobs.

Increase the timeout to account for slow runners.

Old waiting time: 50 seconds. New waiting time: 120 seconds, with
valgrind: 600 seconds.

Intoduced in valkey-io#2227.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
zuiderkwast pushed a commit that referenced this pull request Apr 9, 2026
Two changes to tests/unit/cluster/faster-failover.tcl:

1. FAIL detection timeout: `wait_for_condition 1000 10` → `1000 50`
   (10s → 50s)
2. psync_max_retries: 1200 → 2400 normal (120s → 240s),
   6000 → 12000 valgrind (600s → 1200s)

The test `The best replica can initiate an election immediately in an
automatic failover` in `tests/unit/cluster/faster-failover.tcl` has been
flaky since it was introduced on March 27, 2026 by #2227.

**Frequency:** 8 out of 15 days (Mar 27 – Apr 8), across valgrind,
sanitizer, and slow CI runners.

**Common errors:**
- `log message of "Successful partial resynchronization with primary"
  not found` (timeout waiting for psync)
- `expected pattern found in srv -N log file: *best ranked replica*`
  (timeout waiting for FAIL propagation)

The test spins up a 12-node cluster (5 primaries + 7 replicas), pauses
nodes, and waits for FAIL detection to propagate across all nodes before
failover + partial resync. 

A previous fix attempt #3424 increased the psync timeout from 50s to
120s (600s valgrind), which reduced frequency but did not eliminate it.

Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
enjoy-binbin added a commit to enjoy-binbin/valkey that referenced this pull request Apr 13, 2026
The test introduced in valkey-io#2227 is fragile because it unconditionally asserted
that "best ranked replica" would be logged for every shard's replica. This
assumption was incorrect for two reasons:

1. myselfIsBestRankedReplica() requires failover_failed_primary_rank == 0.
   When two primaries fail simultaneously, only the shard with the lowest
   shard_id gets failed_primary_rank == 0. The other shard's replicas can
   only trigger "Myself become the best ranked replica" after the first
   shard completes failover and the result propagates via gossip.

2. clusterAllReplicasThinkPrimaryIsFail() requires all sibling replicas
   to have their MY_PRIMARY_FAIL flag propagated via gossip. If the
   election delay is shorter than the gossip propagation time, the
   replica may start the election before receiving the flag.

It's also too strict and checks too many things. Like we can check that there
is no delay before starting the failover, but we shouldn't check that the rank
0 replica actually wins, because maybe it doesn't. Also we're not sure the other
replica does a psync afterwards, it can also do a full sync.

Restructure the test into three focused scenarios with retry logic:

- Test 0 (3 primaries + 1 replica): The sole replica is deterministically
  the best ranked replica. All conditions are trivially satisfied.

- Test 1 (3 primaries + 2 replicas): Single primary failure with two
  competing replicas. failed_primary_rank is deterministically 0, but
  MY_PRIMARY_FAIL gossip timing may prevent triggering in some runs.
  Uses internal retry with cluster recovery to ensure at least one
  successful trigger.

- Test 2 (5 primaries + 7 replicas): Two primaries failure. Verifies no
  failover timeout and uses retry to cover the "best ranked replica"
  path for at least one shard.

Signed-off-by: Binbin <binloveplay1314@qq.com>
enjoy-binbin added a commit that referenced this pull request Apr 16, 2026
The test introduced in #2227 is fragile because it unconditionally asserted
that "best ranked replica" would be logged for every shard's replica. This
assumption was incorrect for two reasons:

1. myselfIsBestRankedReplica() requires failover_failed_primary_rank == 0.
   When two primaries fail simultaneously, only the shard with the lowest
   shard_id gets failed_primary_rank == 0. The other shard's replicas can
   only trigger "Myself become the best ranked replica" after the first
   shard completes failover and the result propagates via gossip.

2. clusterAllReplicasThinkPrimaryIsFail() requires all sibling replicas
   to have their MY_PRIMARY_FAIL flag propagated via gossip. If the
   election delay is shorter than the gossip propagation time, the
   replica may start the election before receiving the flag.

It's also too strict and checks too many things. Like we can check that there
is no delay before starting the failover, but we shouldn't check that the rank
0 replica actually wins, because maybe it doesn't. Also we're not sure the other
replica does a psync afterwards, it can also do a full sync.

Restructure the test into three focused scenarios with retry logic:

- Test 0 (3 primaries + 1 replica): The sole replica is deterministically
  the best ranked replica. All conditions are trivially satisfied.

- Test 1 (3 primaries + 2 replicas): Single primary failure with two
  competing replicas. failed_primary_rank is deterministically 0, but
  MY_PRIMARY_FAIL gossip timing may prevent triggering in some runs.
  Uses internal retry with cluster recovery to ensure at least one
  successful trigger.

- Test 2 (5 primaries + 7 replicas): Two primaries failure. Verifies no
  failover timeout and uses retry to cover the "best ranked replica"
  path for at least one shard.

Signed-off-by: Binbin <binloveplay1314@qq.com>
sarthakaggarwal97 pushed a commit to sarthakaggarwal97/valkey that referenced this pull request Apr 16, 2026
…valkey-io#2227)

In valkey-io#2023 (valkey-io#2209, etc.), we are exploring ways to make failover faster,
that is, to minimize the delay.

When a node is marked as FAIL and before the failover starts, there is a
delay of 500-1000ms. The original purpose of this delay:

1. Allow FAIL to propagate to at least a majority of the primaries. This
   makes sure they will vote when a replica sends failover auth request.
2. Allow replicas to exchange their offsets, so they will have a correct
   view of their own rank.

We want to minimize this delay while ensuring safety. It is useful for
example in these cases:

1. If there is only one replica, then we don't need any delay, or
2. If there are more replicas, with a fast network, the replicas can
   exchange the offsets very quickly and start the failover within a few
   milliseconds instead of 500-1000ms.

In this PR, when we can be sure that the replica is the best ranked
replica, we let
it initiate a failover immediately and completely remove the delay.

### How to ensure safety?

1. To make sure this replica has the best rank, it only skips the delay
   if it is sure that it have the best rank and that all replicas in the
   same shard agree that the primary is failing. A new flag
   `CLUSTER_NODE_MY_PRIMARY_FAIL` is introduced to indicate that each
   replica has marked its primary as FAIL. If all replicas say that the
   primary is failing, we also know that the offset is not updated,
   because the offset is not incrementing when the primary is failing.
   We can skip the delay only if we have received a message from all
   replicas and they all have set this flag.

2. To make sure the primaries will vote even if they didn't receive the
   FAIL yet, we use the `CLUSTERMSG_FLAG0_FORCEACK` to make sure they
   will vote. This is equivalent to broadcasting a FAIL message to all
   primaries before we broadcast the failover auth request (but
   cheaper).

The race between FAIL (broadcast by A) and AUTH REQUEST (broadcast by R)
is illustrated in the following sequence diagram:

```
A      R        B      C
|      |        |      |
| FAIL |        |      |
|----->| AUTH R.|      |
|      |------->|      |
| FAIL |        |      |
|-------------->|      |
|      | AUTH R.|      |
|      |-------------->|
| FAIL |        |      |
|--------------------->|
```

### Details

This is the how the failover is initiated, with new steps marked with
**(new)**:

1. A majority of primaries have marked another primary as PFAIL.

2. Some nodes counts failure reports and marks the failing primary as
   FAIL. The node
   that detects FAIL broadcasts it to all nodes in the cluster.

3. When a replica receives FAIL (or detects FAIL itself by counting
   PFAIL reports) it schedules a failover:
   
   a. It sets a timeout (500ms + random 0-500ms).
   b. It broadcasts pong to the other replicas in the same shard.   
   c. **(new)** The pong (actually the clusterMsg header) has a new flag
      `CLUSTER_NODE_MY_PRIMARY_FAIL`. When the replicas broadcast pong
      to each other here, this flag is set.

4. **(new)** When the following conditions are met, skip the remaining
   delay and start the failover using AUTH REQUEST with the FORCE ACK
   flag set, that is if
   
   a. a PONG is received from every other replica in the same shard
      (broadcast within the shard) and
   b. all replicas have marked that its primary is FAIL in their last
      message (the new `CLUSTER_NODE_MY_PRIMARY_FAIL` flag is set) and
   c. this is the best replica (rank = 0) and
   d. my replication offset != 0.

5. When the delay has passed and no other replica has initiated
   failover, then initiate failover.

Notes:

* With 3(c), we don't need to wait for FAIL to propagate to all voting
  primaries. At this point, a FAIL has already been broadcast by some
  node, but there is a race so our auth request may arrive to some node
  before the FAIL. Using the FORCE ACK flag ensures the primaries will
  vote for us. (It is equivalent to broacasting another FAIL just before
  broadcasting auth request.)
  
* 4(b) ensures that we have received the replication offset from all
  other replicas and that it's up to date. If a replica says that it's
  primary is failing, it also means that the replication from the
  primary to that replica has stopped.

* 4(c) is to avoid a special bad case. It can happen that not all
  replicas know about each other. In this case, two replicas can think
  they are both the best replica and start the failover at the same
  time. This can already happen without this PR. When it happens, it
  usually means that a new replica has just joined and it has no data
  (offset = 0) and if it wins the election, there is a problem of
  dataloss (discussed and partially mitigated for the replica migration
  case in valkey-io#885). To avoid this case, skip this fast failover path if the
  replica has offset = 0.

---------

Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
sarthakaggarwal97 pushed a commit to sarthakaggarwal97/valkey that referenced this pull request Apr 16, 2026
…#3424)

The test case "The best replica can initiate an election immediately
test" has been failing in CI jobs.

Increase the timeout to account for slow runners.

Old waiting time: 50 seconds. New waiting time: 120 seconds, with
valgrind: 600 seconds.

Intoduced in valkey-io#2227.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
sarthakaggarwal97 pushed a commit to sarthakaggarwal97/valkey that referenced this pull request Apr 16, 2026
…-io#3463)

Two changes to tests/unit/cluster/faster-failover.tcl:

1. FAIL detection timeout: `wait_for_condition 1000 10` → `1000 50`
   (10s → 50s)
2. psync_max_retries: 1200 → 2400 normal (120s → 240s),
   6000 → 12000 valgrind (600s → 1200s)

The test `The best replica can initiate an election immediately in an
automatic failover` in `tests/unit/cluster/faster-failover.tcl` has been
flaky since it was introduced on March 27, 2026 by valkey-io#2227.

**Frequency:** 8 out of 15 days (Mar 27 – Apr 8), across valgrind,
sanitizer, and slow CI runners.

**Common errors:**
- `log message of "Successful partial resynchronization with primary"
  not found` (timeout waiting for psync)
- `expected pattern found in srv -N log file: *best ranked replica*`
  (timeout waiting for FAIL propagation)

The test spins up a 12-node cluster (5 primaries + 7 replicas), pauses
nodes, and waits for FAIL detection to propagate across all nodes before
failover + partial resync. 

A previous fix attempt valkey-io#3424 increased the psync timeout from 50s to
120s (600s valgrind), which reduced frequency but did not eliminate it.

Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
sarthakaggarwal97 pushed a commit to sarthakaggarwal97/valkey that referenced this pull request Apr 16, 2026
The test introduced in valkey-io#2227 is fragile because it unconditionally asserted
that "best ranked replica" would be logged for every shard's replica. This
assumption was incorrect for two reasons:

1. myselfIsBestRankedReplica() requires failover_failed_primary_rank == 0.
   When two primaries fail simultaneously, only the shard with the lowest
   shard_id gets failed_primary_rank == 0. The other shard's replicas can
   only trigger "Myself become the best ranked replica" after the first
   shard completes failover and the result propagates via gossip.

2. clusterAllReplicasThinkPrimaryIsFail() requires all sibling replicas
   to have their MY_PRIMARY_FAIL flag propagated via gossip. If the
   election delay is shorter than the gossip propagation time, the
   replica may start the election before receiving the flag.

It's also too strict and checks too many things. Like we can check that there
is no delay before starting the failover, but we shouldn't check that the rank
0 replica actually wins, because maybe it doesn't. Also we're not sure the other
replica does a psync afterwards, it can also do a full sync.

Restructure the test into three focused scenarios with retry logic:

- Test 0 (3 primaries + 1 replica): The sole replica is deterministically
  the best ranked replica. All conditions are trivially satisfied.

- Test 1 (3 primaries + 2 replicas): Single primary failure with two
  competing replicas. failed_primary_rank is deterministically 0, but
  MY_PRIMARY_FAIL gossip timing may prevent triggering in some runs.
  Uses internal retry with cluster recovery to ensure at least one
  successful trigger.

- Test 2 (5 primaries + 7 replicas): Two primaries failure. Verifies no
  failover timeout and uses retry to cover the "best ranked replica"
  path for at least one shard.

Signed-off-by: Binbin <binloveplay1314@qq.com>
madolson pushed a commit that referenced this pull request Apr 27, 2026
…#2227)

In #2023 (#2209, etc.), we are exploring ways to make failover faster,
that is, to minimize the delay.

When a node is marked as FAIL and before the failover starts, there is a
delay of 500-1000ms. The original purpose of this delay:

1. Allow FAIL to propagate to at least a majority of the primaries. This
   makes sure they will vote when a replica sends failover auth request.
2. Allow replicas to exchange their offsets, so they will have a correct
   view of their own rank.

We want to minimize this delay while ensuring safety. It is useful for
example in these cases:

1. If there is only one replica, then we don't need any delay, or
2. If there are more replicas, with a fast network, the replicas can
   exchange the offsets very quickly and start the failover within a few
   milliseconds instead of 500-1000ms.

In this PR, when we can be sure that the replica is the best ranked
replica, we let
it initiate a failover immediately and completely remove the delay.

### How to ensure safety?

1. To make sure this replica has the best rank, it only skips the delay
   if it is sure that it have the best rank and that all replicas in the
   same shard agree that the primary is failing. A new flag
   `CLUSTER_NODE_MY_PRIMARY_FAIL` is introduced to indicate that each
   replica has marked its primary as FAIL. If all replicas say that the
   primary is failing, we also know that the offset is not updated,
   because the offset is not incrementing when the primary is failing.
   We can skip the delay only if we have received a message from all
   replicas and they all have set this flag.

2. To make sure the primaries will vote even if they didn't receive the
   FAIL yet, we use the `CLUSTERMSG_FLAG0_FORCEACK` to make sure they
   will vote. This is equivalent to broadcasting a FAIL message to all
   primaries before we broadcast the failover auth request (but
   cheaper).

The race between FAIL (broadcast by A) and AUTH REQUEST (broadcast by R)
is illustrated in the following sequence diagram:

```
A      R        B      C
|      |        |      |
| FAIL |        |      |
|----->| AUTH R.|      |
|      |------->|      |
| FAIL |        |      |
|-------------->|      |
|      | AUTH R.|      |
|      |-------------->|
| FAIL |        |      |
|--------------------->|
```

### Details

This is the how the failover is initiated, with new steps marked with
**(new)**:

1. A majority of primaries have marked another primary as PFAIL.

2. Some nodes counts failure reports and marks the failing primary as
   FAIL. The node
   that detects FAIL broadcasts it to all nodes in the cluster.

3. When a replica receives FAIL (or detects FAIL itself by counting
   PFAIL reports) it schedules a failover:
   
   a. It sets a timeout (500ms + random 0-500ms).
   b. It broadcasts pong to the other replicas in the same shard.   
   c. **(new)** The pong (actually the clusterMsg header) has a new flag
      `CLUSTER_NODE_MY_PRIMARY_FAIL`. When the replicas broadcast pong
      to each other here, this flag is set.

4. **(new)** When the following conditions are met, skip the remaining
   delay and start the failover using AUTH REQUEST with the FORCE ACK
   flag set, that is if
   
   a. a PONG is received from every other replica in the same shard
      (broadcast within the shard) and
   b. all replicas have marked that its primary is FAIL in their last
      message (the new `CLUSTER_NODE_MY_PRIMARY_FAIL` flag is set) and
   c. this is the best replica (rank = 0) and
   d. my replication offset != 0.

5. When the delay has passed and no other replica has initiated
   failover, then initiate failover.

Notes:

* With 3(c), we don't need to wait for FAIL to propagate to all voting
  primaries. At this point, a FAIL has already been broadcast by some
  node, but there is a race so our auth request may arrive to some node
  before the FAIL. Using the FORCE ACK flag ensures the primaries will
  vote for us. (It is equivalent to broacasting another FAIL just before
  broadcasting auth request.)
  
* 4(b) ensures that we have received the replication offset from all
  other replicas and that it's up to date. If a replica says that it's
  primary is failing, it also means that the replication from the
  primary to that replica has stopped.

* 4(c) is to avoid a special bad case. It can happen that not all
  replicas know about each other. In this case, two replicas can think
  they are both the best replica and start the failover at the same
  time. This can already happen without this PR. When it happens, it
  usually means that a new replica has just joined and it has no data
  (offset = 0) and if it wins the election, there is a problem of
  dataloss (discussed and partially mitigated for the replica migration
  case in #885). To avoid this case, skip this fast failover path if the
  replica has offset = 0.

---------

Signed-off-by: Binbin <binloveplay1314@qq.com>
Co-authored-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
madolson pushed a commit that referenced this pull request Apr 27, 2026
The test case "The best replica can initiate an election immediately
test" has been failing in CI jobs.

Increase the timeout to account for slow runners.

Old waiting time: 50 seconds. New waiting time: 120 seconds, with
valgrind: 600 seconds.

Intoduced in #2227.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
madolson pushed a commit that referenced this pull request Apr 27, 2026
Two changes to tests/unit/cluster/faster-failover.tcl:

1. FAIL detection timeout: `wait_for_condition 1000 10` → `1000 50`
   (10s → 50s)
2. psync_max_retries: 1200 → 2400 normal (120s → 240s),
   6000 → 12000 valgrind (600s → 1200s)

The test `The best replica can initiate an election immediately in an
automatic failover` in `tests/unit/cluster/faster-failover.tcl` has been
flaky since it was introduced on March 27, 2026 by #2227.

**Frequency:** 8 out of 15 days (Mar 27 – Apr 8), across valgrind,
sanitizer, and slow CI runners.

**Common errors:**
- `log message of "Successful partial resynchronization with primary"
  not found` (timeout waiting for psync)
- `expected pattern found in srv -N log file: *best ranked replica*`
  (timeout waiting for FAIL propagation)

The test spins up a 12-node cluster (5 primaries + 7 replicas), pauses
nodes, and waits for FAIL detection to propagate across all nodes before
failover + partial resync. 

A previous fix attempt #3424 increased the psync timeout from 50s to
120s (600s valgrind), which reduced frequency but did not eliminate it.

Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
madolson pushed a commit that referenced this pull request Apr 27, 2026
The test introduced in #2227 is fragile because it unconditionally asserted
that "best ranked replica" would be logged for every shard's replica. This
assumption was incorrect for two reasons:

1. myselfIsBestRankedReplica() requires failover_failed_primary_rank == 0.
   When two primaries fail simultaneously, only the shard with the lowest
   shard_id gets failed_primary_rank == 0. The other shard's replicas can
   only trigger "Myself become the best ranked replica" after the first
   shard completes failover and the result propagates via gossip.

2. clusterAllReplicasThinkPrimaryIsFail() requires all sibling replicas
   to have their MY_PRIMARY_FAIL flag propagated via gossip. If the
   election delay is shorter than the gossip propagation time, the
   replica may start the election before receiving the flag.

It's also too strict and checks too many things. Like we can check that there
is no delay before starting the failover, but we shouldn't check that the rank
0 replica actually wins, because maybe it doesn't. Also we're not sure the other
replica does a psync afterwards, it can also do a full sync.

Restructure the test into three focused scenarios with retry logic:

- Test 0 (3 primaries + 1 replica): The sole replica is deterministically
  the best ranked replica. All conditions are trivially satisfied.

- Test 1 (3 primaries + 2 replicas): Single primary failure with two
  competing replicas. failed_primary_rank is deterministically 0, but
  MY_PRIMARY_FAIL gossip timing may prevent triggering in some runs.
  Uses internal retry with cluster recovery to ensure at least one
  successful trigger.

- Test 2 (5 primaries + 7 replicas): Two primaries failure. Verifies no
  failover timeout and uses retry to cover the "best ranked replica"
  path for at least one shard.

Signed-off-by: Binbin <binloveplay1314@qq.com>
@sarthakaggarwal97 sarthakaggarwal97 moved this from To be backported to Done in Valkey 9.1 May 16, 2026
roshkhatri pushed a commit to roshkhatri/valkey that referenced this pull request May 26, 2026
…#3424)

The test case "The best replica can initiate an election immediately
test" has been failing in CI jobs.

Increase the timeout to account for slow runners.

Old waiting time: 50 seconds. New waiting time: 120 seconds, with
valgrind: 600 seconds.

Intoduced in valkey-io#2227.

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cluster release-notes This issue should get a line item in the release notes run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP)

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

6 participants