Fix replica (the old primary) claims to still have slots after manual failover by enjoy-binbin · Pull Request #2301 · valkey-io/valkey

enjoy-binbin · 2025-07-03T04:50:51Z

Update: if you want to pick this one, please also pick #2370, #2431, #2441, and #2477.

================

After a failover, if the PING/PONG of the replica (that is, the old
primary) reaches nodes in other shards before the new primary's does,
this code path is hit. We adjust the sender to be a replica and
sender_claimed_primary to be the primary node, but in myself's view the
slots still belong to the sender, which is now a replica.
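
The intended behavior can be sketched as follows. This is a minimal illustration, not the code in src/cluster_legacy.c: `Node`, `SlotTable`, `move_node_slots`, and `sender_became_replica` are hypothetical names invented for this example. It only shows the idea that the role change and the slot ownership change are applied together.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_SLOTS 16384

/* Hypothetical, simplified stand-in for a cluster node (not Valkey's clusterNode). */
typedef struct Node {
    const char *name;
    bool is_primary;
    struct Node *replicaof; /* NULL for a primary */
    int numslots;
} Node;

/* Hypothetical slot table: owner[i] is the node serving slot i, or NULL. */
typedef struct {
    Node *owner[NUM_SLOTS];
} SlotTable;

/* Move every slot owned by `from` to `to`; returns the number of slots moved. */
static int move_node_slots(SlotTable *t, Node *from, Node *to) {
    assert(from != to);
    int moved = 0;
    for (int i = 0; i < NUM_SLOTS; i++) {
        if (t->owner[i] != from) continue;
        t->owner[i] = to;
        from->numslots--;
        to->numslots++;
        moved++;
    }
    return moved;
}

/* `sender` was known locally as a primary and now claims to be a replica of
 * `claimed_primary`. Update both roles and hand over the slots in one step,
 * so the local view never shows a replica that still owns slots. */
static void sender_became_replica(SlotTable *t, Node *sender, Node *claimed_primary) {
    sender->is_primary = false;
    sender->replicaof = claimed_primary;
    claimed_primary->is_primary = true;
    claimed_primary->replicaof = NULL;
    move_node_slots(t, sender, claimed_primary);
}
```

With this shape, a node that hears from the old primary first ends up with the same slot view it would have reached after hearing from the new primary.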

This actually restores part of the code in 28976a9.
It was lost in the changes to #445 and #754.

These 3 changes are about avoiding the replicaof cycle but ended up creating
a "cycle" of assumptions themselves. When #445 removed this logic, it was
assumed that a replica could also drive clusterUpdateSlotsConfigWith
but this was actually a regression that resulted in a replicaof cycle (#753).
#754 fixed the replicaof cycle by disallowing a replica to drive
clusterUpdateSlotsConfigWith but missed restoring the original logic.

Added a new DEBUG DISABLE-CLUSTER-RECONNECTION <0|1> command, which prevents
cluster nodes from reconnecting so that the issue can be reproduced in a test.

… failover Signed-off-by: Binbin <binloveplay1314@qq.com>

enjoy-binbin · 2025-07-03T04:52:08Z

In the test, there is a short time window in which the replica still claims to have slots.

start_cluster 3 1 {tags {external:skip cluster} overrides {cluster-replica-validity-factor 0}} {
    set R0_nodeid [R 0 cluster myid]
    set R1_nodeid [R 1 cluster myid]
    set R2_nodeid [R 2 cluster myid]
    set R3_nodeid [R 3 cluster myid]


    R 3 multi
    R 3 debug clusterlink kill all $R1_nodeid
    R 3 debug clusterlink kill all $R2_nodeid
    R 3 cluster failover takeover
    R 3 exec

    puts [R 1 cluster nodes]
    after 10
    puts [R 1 cluster nodes]
    after 10
    puts [R 1 cluster nodes]
    after 10
    puts [R 1 cluster nodes]
    after 10
    puts [R 1 cluster nodes]
    ...
}

We can see that 84088c54fcae0666b552229b806e4f96eaba2a0b has become a replica but still owns slots 0-5461:

84088c54fcae0666b552229b806e4f96eaba2a0b 127.0.0.1:21114@31114 master - 0 1752039102266 1 connected 0-5461
f58610cd23c75be51a27c717e9f4997830bcfadf 127.0.0.1:21111@31111 slave 84088c54fcae0666b552229b806e4f96eaba2a0b 0 1752039102266 1 disconnected
bc07472c09d9e93fc7feb98a846acda76f97e67f 127.0.0.1:21112@31112 master - 0 1752039102266 3 connected 10923-16383
0d74d311b9432a64eeb1dfaefae5a25acf8a2300 127.0.0.1:21113@31113 myself,master - 0 0 2 connected 5462-10922

84088c54fcae0666b552229b806e4f96eaba2a0b 127.0.0.1:21114@31114 slave f58610cd23c75be51a27c717e9f4997830bcfadf 0 1752039102266 4 connected 0-5461
f58610cd23c75be51a27c717e9f4997830bcfadf 127.0.0.1:21111@31111 master - 0 1752039102266 4 disconnected
bc07472c09d9e93fc7feb98a846acda76f97e67f 127.0.0.1:21112@31112 master - 0 1752039102266 3 connected 10923-16383
0d74d311b9432a64eeb1dfaefae5a25acf8a2300 127.0.0.1:21113@31113 myself,master - 0 0 2 connected 5462-10922

{0 5461 {127.0.0.1 21114 84088c54fcae0666b552229b806e4f96eaba2a0b {}}} {5462 10922 {127.0.0.1 21113 0d74d311b9432a64eeb1dfaefae5a25acf8a2300 {}}} {10923 16383 {127.0.0.1 21112 bc07472c09d9e93fc7feb98a846acda76f97e67f {}}}

I will try to think of a way to add a test; there are currently no stable steps to reproduce it.

Signed-off-by: Binbin <binloveplay1314@qq.com>

codecov · 2025-07-03T08:28:25Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 71.46%. Comparing base (a62f83d) to head (ca44db7).
⚠️ Report is 166 commits behind head on unstable.

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable    #2301      +/-   ##
============================================
- Coverage     71.49%   71.46%   -0.04%     
============================================
  Files           123      123              
  Lines         66937    67104     +167     
============================================
+ Hits          47859    47954      +95     
- Misses        19078    19150      +72

| Files with missing lines | Coverage Δ | |
|---|---|---|
| src/cluster_legacy.c | `86.78% <100.00%> (-0.12%)` | ⬇️ |
| src/debug.c | `53.75% <100.00%> (+0.16%)` | ⬆️ |
| src/server.c | `88.06% <100.00%> (-0.03%)` | ⬇️ |
| src/server.h | `100.00% <ø> (ø)` | |

... and 21 files with indirect coverage changes

zuiderkwast

LGTM

Are these the failures we saw in Daily a few days ago?

enjoy-binbin · 2025-07-05T02:59:34Z

Are these the failures we saw in Daily a few days ago?

I guess not. I found this from another code review.

enjoy-binbin · 2025-07-05T05:06:06Z

btw, there is a saying that we should not update sender->replicaof info based on the sender information. Here, though, we already update the sender_claimed_primary flag, so I guess it's OK if we update the slots info as well.

But I also want to hear your thoughts, like, should sender->replicaof information be updated based on sender information?

zuiderkwast · 2025-07-05T17:07:55Z

btw, there is a saying that we should not update sender->replicaof info based on the sender information

I don't know this saying. Is it written anywhere?

I think information about sender directly from the sender can be trusted. It is direct information, so it should be correct. Why not?

enjoy-binbin · 2025-07-06T15:44:01Z

I think information about sender directly from the sender can be trusted

Ohh, maybe I gave the wrong impression. I mean updating the other node's (sender->replicaof) information based on the sender's information.

I don't know this saying. Is it written anywhere?

I guess not; I have not seen it written down. A person on my team said it: he mentioned that we usually don't update the info of another node based on the info of one node. Like here, where we update the info of sender->replicaof (flags / slots info) based on the info of the sender.

PingXie · 2025-07-07T05:21:07Z

In the test, there will be a short time window that the replica claims to still have slots.

I think this should be a transitional state, and eventually the replica drops these slots. That being said, this PR makes sense to me and is the right thing to do, since the role change and the slot ownership change should've been an atomic operation.

This actually restores part of the code in 28976a9.
It was lost in the changes to #445 and #754.

Yeah, these 3 changes are about avoiding the replicaof cycle but ended up creating a "cycle" of assumptions themselves. When #445 removed this logic, it was assumed that a replica could also drive clusterUpdateSlotsConfigWith, but this was actually a regression that resulted in a replicaof cycle (#753). #754 fixed the replicaof cycle by disallowing a replica to drive clusterUpdateSlotsConfigWith but missed restoring the original logic.

PingXie

LGTM - thanks @enjoy-binbin. great investigation.

Signed-off-by: Binbin <binloveplay1314@qq.com>

enjoy-binbin · 2025-07-09T06:52:41Z

@PingXie @zuiderkwast Please check the new changes. I added some asserts, a new debug command, and the test (the test fails reliably before the fix).

PingXie

great test!

Co-authored-by: Ping Xie <pingxie@outlook.com> Signed-off-by: Binbin <binloveplay1314@qq.com>

Signed-off-by: Binbin <binloveplay1314@qq.com>

zuiderkwast · 2025-07-11T12:13:03Z

Merged, very nice. Backport or not?

PingXie · 2025-07-12T07:14:49Z

Merged, very nice. Backport or not?

This is a safe change so I vote for backport

enjoy-binbin · 2025-07-12T13:27:07Z

Added to 8.0/8.1 project for backport.

…r manual failover (valkey-io#2301)" This reverts commit 507042d. Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>

…er manual failover (valkey-io#2301)" This reverts commit 6ce01c1. Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>

…g slots (#2370) In #2301, we added clusterMoveNodeSlots to implement the logic of moving slots from the old primary to the new primary when, in a shard failover, myself receives the replica's (old primary's) message first and the new primary's message later. However, because of this, when myself later receives the new primary's message, there is no way to call clusterUpdateSlotsConfigWith, because we have already updated the slots of the new primary. This results in, for example, importing slots and migrating slots not being updated, see #445. In this commit, we also make clusterMoveNodeSlots move importing slots and migrating slots. Fixes #2363. Signed-off-by: Binbin <binloveplay1314@qq.com>
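
One plausible reading of "also move importing slots and migrating slots" is sketched below. It assumes that the local node keeps, per slot, a pointer to the peer it is migrating to or importing from, and that a failover in the peer's shard should repoint those entries from the old primary to the new one. `Peer`, `MigrationState`, and `retarget_migrations` are hypothetical names for this example; the actual change in #2370 may differ.

```c
#include <stddef.h>

#define NUM_SLOTS 16384

/* Hypothetical peer node; only its identity matters here. */
typedef struct Peer {
    const char *name;
} Peer;

/* Hypothetical per-slot migration state kept by the local node:
 * migrating_to[i]   is the peer that slot i is being migrated to, or NULL;
 * importing_from[i] is the peer that slot i is being imported from, or NULL. */
typedef struct {
    Peer *migrating_to[NUM_SLOTS];
    Peer *importing_from[NUM_SLOTS];
} MigrationState;

/* After `from` (old primary) was replaced by `to` (new primary), repoint any
 * in-progress migration or import that referenced `from`.
 * Returns the number of entries updated. */
static int retarget_migrations(MigrationState *s, Peer *from, Peer *to) {
    int updated = 0;
    for (int i = 0; i < NUM_SLOTS; i++) {
        if (s->migrating_to[i] == from) {
            s->migrating_to[i] = to;
            updated++;
        }
        if (s->importing_from[i] == from) {
            s->importing_from[i] = to;
            updated++;
        }
    }
    return updated;
}
```

Entries that reference other peers are left untouched, so an unrelated migration is not affected by a failover elsewhere.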

madolson · 2025-08-04T17:24:41Z

This is a safe change so I vote for backport

@PingXie, unless we are extremely confident, I wouldn't consider backporting novel serverAsserts to be safe. This can introduce new crashes, which have caused CVEs in the past.

…2431) The assert was added in #2301, and we found that some situations trigger the assert and crash the server.

The reason we added the assert is that, in the code:

1. sender_claimed_primary and sender are in the same shard,
2. sender is the old primary and sender_claimed_primary is the old replica, and
3. sender has now become a replica and sender_claimed_primary has become a primary.

That means a failover happened in the shard, and sender should be the primary of sender_claimed_primary. But this assumption may be wrong: we rely on shard_id to determine whether two nodes are in the same shard, and we assume that a shard can only have one primary. From #2279 we know there is a case in which two primaries can exist in the same shard due to the untimely update of shard_id.

So we can create a test that triggers the assert in this way:

1. Precondition: two primaries in the same shard, one with slots and one empty.
2. The replica does a cluster failover.
3. The empty primary does a cluster replicate with the replica (the new primary).

We change the assert to an if condition to fix it. Closes #2423. Note that the test written here also exposes the issue in #2441, so these two may need to be addressed together. Signed-off-by: Binbin <binloveplay1314@qq.com>
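
The difference between asserting and checking can be sketched as follows. This is an illustration under assumptions, not the code from #2431: `ShardNode` and `is_same_shard_failover` are hypothetical names, and the condition shown (the claimed primary was recorded as a replica of the sender) is inferred from the description above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified node: a shard id plus the locally recorded primary. */
typedef struct ShardNode {
    char shard_id[41];
    struct ShardNode *replicaof; /* NULL for a primary */
} ShardNode;

/* Returns true only if `sender` and `claimed_primary` share a shard id AND
 * `claimed_primary` was recorded as a replica of `sender`, i.e. this really
 * looks like a failover inside the shard.
 *
 * The second condition was the invariant that had been asserted. Because two
 * primaries can transiently share a shard id, it is checked here instead, and
 * the caller simply skips the slot move when it does not hold. */
static bool is_same_shard_failover(const ShardNode *sender, const ShardNode *claimed_primary) {
    if (strcmp(sender->shard_id, claimed_primary->shard_id) != 0) return false;
    return claimed_primary->replicaof == sender;
}
```

In the reproduction steps above, the empty primary that starts replicating the new primary shares its shard id but was never its primary, so the check returns false where an assertion would have aborted the server.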

madolson · 2025-09-30T18:22:31Z

I don't think we should backport these. I removed from the 8.0 and 8.1 projects. Let me know if anyone disagrees.

…g slots (valkey-io#2370) In valkey-io#2301, we added clusterMoveNodeSlots to implement the logic of moving slots from old primary to new primary, when myself receives the replica (old primary) message first and the new primary message later in a shard failover. However due to this, when myself receives the new primary message later next time, there is no way to call clusterUpdateSlotsConfigWith, because we have already updated the slots of the new primary before. This result in, for example, importing slots and migrating slots not being updated, see valkey-io#445. In this commit, we also make clusterMoveNodeSlots to move importing slots and migrating slots. Fixes valkey-io#2363. Signed-off-by: Binbin <binloveplay1314@qq.com> (cherry picked from commit a3907ad)

Fix replica (the old primary) claims to sitll have slots after manual…

e85d1d1

… failover Signed-off-by: Binbin <binloveplay1314@qq.com>

enjoy-binbin requested a review from PingXie July 3, 2025 04:58

Add a assert

210df4c

Signed-off-by: Binbin <binloveplay1314@qq.com>

zuiderkwast approved these changes Jul 4, 2025

enjoy-binbin added this to Valkey 9.0 Jul 5, 2025

enjoy-binbin moved this to In Progress in Valkey 9.0 Jul 5, 2025

PingXie approved these changes Jul 7, 2025

change position, add process slot info, add test, add debug command

3c793ae

Signed-off-by: Binbin <binloveplay1314@qq.com>

PingXie approved these changes Jul 9, 2025

Comment thread src/cluster_legacy.c Outdated

Comment thread tests/unit/cluster/manual-failover.tcl Outdated

Update tests/unit/cluster/manual-failover.tcl

01351eb

Co-authored-by: Ping Xie <pingxie@outlook.com> Signed-off-by: Binbin <binloveplay1314@qq.com>

enjoy-binbin added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Jul 9, 2025

enjoy-binbin added 2 commits July 9, 2025 15:47

update tmp name

94aa0d0

Signed-off-by: Binbin <binloveplay1314@qq.com>

changes for assert

ca44db7

Signed-off-by: Binbin <binloveplay1314@qq.com>

enjoy-binbin merged commit 507042d into valkey-io:unstable Jul 11, 2025
128 of 140 checks passed

github-project-automation Bot moved this from In Progress to Done in Valkey 9.0 Jul 11, 2025

enjoy-binbin deleted the replica_slots branch July 11, 2025 11:00

enjoy-binbin added the release-notes This issue should get a line item in the release notes label Jul 11, 2025

enjoy-binbin added this to Valkey 8.0 and Valkey 8.1 Jul 12, 2025

enjoy-binbin moved this to To be backported in Valkey 8.0 Jul 12, 2025

enjoy-binbin moved this to To be backported in Valkey 8.1 Jul 12, 2025

roshkhatri mentioned this pull request Jul 16, 2025

[test-failure] Slot-migration related #2363

Closed

sarthakaggarwal97 added a commit to sarthakaggarwal97/valkey that referenced this pull request Jul 21, 2025

Revert "Fix replica (the old primary) claims to sitll have slots afte…

0b501e0

…r manual failover (valkey-io#2301)" This reverts commit 507042d.

enjoy-binbin mentioned this pull request Jul 22, 2025

Update clusterMoveNodeSlots to also move importing slots and migrating slots #2370

Merged

sarthakaggarwal97 added a commit to sarthakaggarwal97/valkey that referenced this pull request Jul 28, 2025

Revert "Fix replica (the old primary) claims to sitll have slots afte…

a4e50de

…r manual failover (valkey-io#2301)" This reverts commit 507042d.

madolson reviewed Aug 4, 2025

Comment thread src/cluster_legacy.c

enjoy-binbin mentioned this pull request Aug 5, 2025

Change the same shard failover assert to if condition to avoid crash #2431

Merged

enjoy-binbin added the cluster label Sep 19, 2025

zuiderkwast moved this from To be backported to Don't backport yet in Valkey 8.0 Sep 30, 2025

madolson removed this from Valkey 8.0 Sep 30, 2025

madolson removed this from Valkey 8.1 Sep 30, 2025
