Skip to content

Fix replica (the old primary) claims to sitll have slots after manual failover#2301

Merged
enjoy-binbin merged 6 commits into
valkey-io:unstablefrom
enjoy-binbin:replica_slots
Jul 11, 2025
Merged

Fix replica (the old primary) claims to sitll have slots after manual failover#2301
enjoy-binbin merged 6 commits into
valkey-io:unstablefrom
enjoy-binbin:replica_slots

Conversation

@enjoy-binbin

@enjoy-binbin enjoy-binbin commented Jul 3, 2025

Copy link
Copy Markdown
Member

update

Please also pick #2370 and #2431 and #2441 and #2477 you want to pick this one.

================

After a failover occurs, when the PING-PONG of the replica, that is,
the old primary, reaches other shard nodes faster, the corresponding
code path will be reached. We will adjust the sender to be a replica
and sender_claimed_primary to be the primary node, but in myself view,
slots still belong to the sender, which is a replica.

This actually restores part of the code in 28976a9.
It was lost in the changes to #445 and #754.

These 3 changes are about avoiding the replicaof cycle but ended up creating
a "cycle" of assumptions themselves. when #445 removed this logic, it was
assumed that a replica could also drive clusterUpdateSlotsConfigWith
but this was actually a regression that resulted in a replicaof cycle (#753).
#754 fixed the replicaof cycle by disallowing a replica to drive
clusterUpdateSlotsConfigWith but missed restoring the original logic.

Added a new DEBUG DISABLE-CLUSTER-RECONNECTION <0|1> this will prevent
cluster nodes from reconnecting so that the issue can be reproduce in test.

… failover

Signed-off-by: Binbin <binloveplay1314@qq.com>
@enjoy-binbin

enjoy-binbin commented Jul 3, 2025

Copy link
Copy Markdown
Member Author

In the test, there will be a short time window that the replica claims to still have slots.

start_cluster 3 1 {tags {external:skip cluster} overrides {cluster-replica-validity-factor 0}} {
    set R0_nodeid [R 0 cluster myid]
    set R1_nodeid [R 1 cluster myid]
    set R2_nodeid [R 2 cluster myid]
    set R3_nodeid [R 3 cluster myid]


    R 3 multi
    R 3 debug clusterlink kill all $R1_nodeid
    R 3 debug clusterlink kill all $R2_nodeid
    R 3 cluster failover takeover
    R 3 exec

    puts [R 1 cluster nodes]
    after 10
    puts [R 1 cluster nodes]
    after 10
    puts [R 1 cluster nodes]
    after 10
    puts [R 1 cluster nodes]
    after 10
    puts [R 1 cluster nodes]
    ...
}

We can see 84088c54fcae0666b552229b806e4f96eaba2a0b become a replica but own the slots

84088c54fcae0666b552229b806e4f96eaba2a0b 127.0.0.1:21114@31114 master - 0 1752039102266 1 connected 0-5461
f58610cd23c75be51a27c717e9f4997830bcfadf 127.0.0.1:21111@31111 slave 84088c54fcae0666b552229b806e4f96eaba2a0b 0 1752039102266 1 disconnected
bc07472c09d9e93fc7feb98a846acda76f97e67f 127.0.0.1:21112@31112 master - 0 1752039102266 3 connected 10923-16383
0d74d311b9432a64eeb1dfaefae5a25acf8a2300 127.0.0.1:21113@31113 myself,master - 0 0 2 connected 5462-10922

84088c54fcae0666b552229b806e4f96eaba2a0b 127.0.0.1:21114@31114 slave f58610cd23c75be51a27c717e9f4997830bcfadf 0 1752039102266 4 connected 0-5461
f58610cd23c75be51a27c717e9f4997830bcfadf 127.0.0.1:21111@31111 master - 0 1752039102266 4 disconnected
bc07472c09d9e93fc7feb98a846acda76f97e67f 127.0.0.1:21112@31112 master - 0 1752039102266 3 connected 10923-16383
0d74d311b9432a64eeb1dfaefae5a25acf8a2300 127.0.0.1:21113@31113 myself,master - 0 0 2 connected 5462-10922

{0 5461 {127.0.0.1 21114 84088c54fcae0666b552229b806e4f96eaba2a0b {}}} {5462 10922 {127.0.0.1 21113 0d74d311b9432a64eeb1dfaefae5a25acf8a2300 {}}} {10923 16383 {127.0.0.1 21112 bc07472c09d9e93fc7feb98a846acda76f97e67f {}}}

I will try to think of a way to add the test, there are currently no stable steps to reproduce it.

@enjoy-binbin enjoy-binbin requested a review from PingXie July 3, 2025 04:58
Signed-off-by: Binbin <binloveplay1314@qq.com>
@codecov

codecov Bot commented Jul 3, 2025

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 71.46%. Comparing base (a62f83d) to head (ca44db7).
⚠️ Report is 166 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2301      +/-   ##
============================================
- Coverage     71.49%   71.46%   -0.04%     
============================================
  Files           123      123              
  Lines         66937    67104     +167     
============================================
+ Hits          47859    47954      +95     
- Misses        19078    19150      +72     
Files with missing lines Coverage Δ
src/cluster_legacy.c 86.78% <100.00%> (-0.12%) ⬇️
src/debug.c 53.75% <100.00%> (+0.16%) ⬆️
src/server.c 88.06% <100.00%> (-0.03%) ⬇️
src/server.h 100.00% <ø> (ø)

... and 21 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@zuiderkwast zuiderkwast left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Is this the failures we saw in Daily a few days ago?

@enjoy-binbin

Copy link
Copy Markdown
Member Author

Is this the failures we saw in Daily a few days ago?

I guess not. I found this from another code review.

@enjoy-binbin

Copy link
Copy Markdown
Member Author

btw, there is a saying that, we should not update sender->replicaof info based on the sender infomartion, like in here, we updated the sender_claimed_primary flag, so i guess it’s OK if we update the slots info.

But I also want to hear your thoughts, like, should sender->replicaof information be updated based on sender information?

@enjoy-binbin enjoy-binbin moved this to In Progress in Valkey 9.0 Jul 5, 2025
@zuiderkwast

Copy link
Copy Markdown
Contributor

btw, there is a saying that, we should not update sender->replicaof info based on the sender infomartion

I don'tknow this saying. Is it written anywhere?

I think information about sender directly from the sender can be trusted. It is direct information, so it should be correct. Why not?

@enjoy-binbin

Copy link
Copy Markdown
Member Author

I think information about sender directly from the sender can be trusted

ohh, maybe i gave the wrond understanding, i mean, update the other node (sender->replicaof) infomation based on the sender infomartion.

I don'tknow this saying. Is it written anywhere?

I guess not, i did not see this, a person from my team said this, he mentioned something that we usually don't update the info of another node based on the info of one node. Like in here, we update the info of sender->replicaof (flags / slots info) based on the info of sender.

@PingXie

PingXie commented Jul 7, 2025

Copy link
Copy Markdown
Member

In the test, there will be a short time window that the replica claims to still have slots.

I think this should be a transitional state and eventually the replica drops these slots. that being said, this pr makes sense to me and is the right thing to do, since the role change and the slot ownership change should've been an atomic operation.

This actually restores part of the code in 28976a9.
It was lost in the changes to #445 and #754.

yeah these 3 changes are about avoiding the replicaof cycle but ended up creating a "cycle" of assumptions themselves. when #445 removed this logic, it was assumed that a replica could also drive clusterUpdateSlotsConfigWith but this was actually a regression that resulted in a replicaof cycle (#753). #754 fixed the replicaof cycle by disallowing a replica to drive clusterUpdateSlotsConfigWith but missed restoring the original logic.

@PingXie PingXie left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - thanks @enjoy-binbin. great investigation.

Signed-off-by: Binbin <binloveplay1314@qq.com>
@enjoy-binbin

Copy link
Copy Markdown
Member Author

@PingXie @zuiderkwast Please check the new changes, i added some assert, a new debug command, and the test (Tests can fail reliably before the fix).

@PingXie PingXie left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great test!

Comment thread src/cluster_legacy.c Outdated
Comment thread tests/unit/cluster/manual-failover.tcl Outdated
Co-authored-by: Ping Xie <pingxie@outlook.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
@enjoy-binbin enjoy-binbin added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Jul 9, 2025
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
@enjoy-binbin enjoy-binbin merged commit 507042d into valkey-io:unstable Jul 11, 2025
128 of 140 checks passed
@github-project-automation github-project-automation Bot moved this from In Progress to Done in Valkey 9.0 Jul 11, 2025
@enjoy-binbin enjoy-binbin deleted the replica_slots branch July 11, 2025 11:00
@enjoy-binbin enjoy-binbin added the release-notes This issue should get a line item in the release notes label Jul 11, 2025
@zuiderkwast

Copy link
Copy Markdown
Contributor

Merged, very nice. Backport or not?

@PingXie

PingXie commented Jul 12, 2025

Copy link
Copy Markdown
Member

Merged, very nice. Backport or not?

This is a safe change so I vote for backport

@enjoy-binbin enjoy-binbin moved this to To be backported in Valkey 8.0 Jul 12, 2025
@enjoy-binbin enjoy-binbin moved this to To be backported in Valkey 8.1 Jul 12, 2025
@enjoy-binbin

Copy link
Copy Markdown
Member Author

Added to 8.0/8.1 project for backport.

roshkhatri added a commit to roshkhatri/valkey that referenced this pull request Jul 15, 2025
…r manual failover (valkey-io#2301)"

This reverts commit 507042d.

Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
roshkhatri added a commit to roshkhatri/valkey that referenced this pull request Jul 18, 2025
…er manual failover (valkey-io#2301)"

This reverts commit 6ce01c1.

Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
sarthakaggarwal97 added a commit to sarthakaggarwal97/valkey that referenced this pull request Jul 21, 2025
enjoy-binbin added a commit that referenced this pull request Jul 28, 2025
…g slots (#2370)

In #2301, we added clusterMoveNodeSlots to implement the logic of
moving slots from old primary to new primary, when myself receives
the replica (old primary) message first and the new primary message
later in a shard failover.

However due to this, when myself receives the new primary message
later next time, there is no way to call clusterUpdateSlotsConfigWith,
because we have already updated the slots of the new primary before.
This result in, for example, importing slots and migrating slots
not being updated, see #445.

In this commit, we also make clusterMoveNodeSlots to move importing
slots and migrating slots.

Fixes #2363.

Signed-off-by: Binbin <binloveplay1314@qq.com>
sarthakaggarwal97 added a commit to sarthakaggarwal97/valkey that referenced this pull request Jul 28, 2025
Comment thread src/cluster_legacy.c
@madolson

madolson commented Aug 4, 2025

Copy link
Copy Markdown
Member

This is a safe change so I vote for backport

@PingXie, unless we are extremely confident, I wouldn't consider backporting novel severAsserts to be safe. This can introduce new crashes that have caused CVEs in the past.

enjoy-binbin added a commit to enjoy-binbin/valkey that referenced this pull request Aug 5, 2025
The assert was added in valkey-io#2301 and we found that there are some situations
would trigger assert and crash the server.

The reason we added the assert is because, in the code:
1. sender_claimed_primary and sender are in the same shard
2. and sender is the old primary, sender_claimed_primary is the old replica
3. and now sender become a replica, sender_claimed_primary become a primary

That means a failover happend in the shard, and sender should be the primary
of sender_claimed_primary. But obviously this assumption may be wrong, we rely
on shard_id to determine whether it is in a same shard, and assume that a shard
can only have one primary.

But this is wrong, from valkey-io#2279 we can know there will be a case that we can
create two primaries in the same shard due to the untimely update of shard_id.
So we can create a test that trigger the assert in this way:
1. pre condition: two primaries in the same shard, one has slots and one is empty.
2. replica doing a cluster failover
3. the empty primary doing a cluster replicate with the replica (new primary)

We change the assert to an if condition to fix it.

Signed-off-by: Binbin <binloveplay1314@qq.com>
enjoy-binbin added a commit that referenced this pull request Aug 12, 2025
…2431)

The assert was added in #2301 and we found that there are some situations
would trigger assert and crash the server.

The reason we added the assert is because, in the code:
1. sender_claimed_primary and sender are in the same shard
2. and sender is the old primary, sender_claimed_primary is the old replica
3. and now sender become a replica, sender_claimed_primary become a primary

That means a failover happend in the shard, and sender should be the primary
of sender_claimed_primary. But obviously this assumption may be wrong, we rely
on shard_id to determine whether it is in a same shard, and assume that a shard
can only have one primary.

But this is wrong, from #2279 we can know there will be a case that we can
create two primaries in the same shard due to the untimely update of shard_id.
So we can create a test that trigger the assert in this way:
1. pre condition: two primaries in the same shard, one has slots and one is empty.
2. replica doing a cluster failover
3. the empty primary doing a cluster replicate with the replica (new primary)

We change the assert to an if condition to fix it.

Closes #2423.

Note that the test written here also exposes the issue in #2441, so these two
may need to be addressed together.

Signed-off-by: Binbin <binloveplay1314@qq.com>
allenss-amazon pushed a commit to allenss-amazon/valkey-core that referenced this pull request Aug 19, 2025
…alkey-io#2431)

The assert was added in valkey-io#2301 and we found that there are some situations
would trigger assert and crash the server.

The reason we added the assert is because, in the code:
1. sender_claimed_primary and sender are in the same shard
2. and sender is the old primary, sender_claimed_primary is the old replica
3. and now sender become a replica, sender_claimed_primary become a primary

That means a failover happend in the shard, and sender should be the primary
of sender_claimed_primary. But obviously this assumption may be wrong, we rely
on shard_id to determine whether it is in a same shard, and assume that a shard
can only have one primary.

But this is wrong, from valkey-io#2279 we can know there will be a case that we can
create two primaries in the same shard due to the untimely update of shard_id.
So we can create a test that trigger the assert in this way:
1. pre condition: two primaries in the same shard, one has slots and one is empty.
2. replica doing a cluster failover
3. the empty primary doing a cluster replicate with the replica (new primary)

We change the assert to an if condition to fix it.

Closes valkey-io#2423.

Note that the test written here also exposes the issue in valkey-io#2441, so these two
may need to be addressed together.

Signed-off-by: Binbin <binloveplay1314@qq.com>
@zuiderkwast zuiderkwast moved this from To be backported to Don't backport yet in Valkey 8.0 Sep 30, 2025
@madolson madolson removed this from Valkey 8.0 Sep 30, 2025
@madolson madolson removed this from Valkey 8.1 Sep 30, 2025
@madolson

Copy link
Copy Markdown
Member

I don't think we should backport these. I removed from the 8.0 and 8.1 projects. Let me know if anyone disagrees.

hpatro pushed a commit to hpatro/valkey that referenced this pull request Oct 3, 2025
…alkey-io#2431)

The assert was added in valkey-io#2301 and we found that there are some situations
would trigger assert and crash the server.

The reason we added the assert is because, in the code:
1. sender_claimed_primary and sender are in the same shard
2. and sender is the old primary, sender_claimed_primary is the old replica
3. and now sender become a replica, sender_claimed_primary become a primary

That means a failover happend in the shard, and sender should be the primary
of sender_claimed_primary. But obviously this assumption may be wrong, we rely
on shard_id to determine whether it is in a same shard, and assume that a shard
can only have one primary.

But this is wrong, from valkey-io#2279 we can know there will be a case that we can
create two primaries in the same shard due to the untimely update of shard_id.
So we can create a test that trigger the assert in this way:
1. pre condition: two primaries in the same shard, one has slots and one is empty.
2. replica doing a cluster failover
3. the empty primary doing a cluster replicate with the replica (new primary)

We change the assert to an if condition to fix it.

Closes valkey-io#2423.

Note that the test written here also exposes the issue in valkey-io#2441, so these two
may need to be addressed together.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
sarthakaggarwal97 pushed a commit to sarthakaggarwal97/valkey that referenced this pull request Apr 24, 2026
…g slots (valkey-io#2370)

In valkey-io#2301, we added clusterMoveNodeSlots to implement the logic of
moving slots from old primary to new primary, when myself receives
the replica (old primary) message first and the new primary message
later in a shard failover.

However due to this, when myself receives the new primary message
later next time, there is no way to call clusterUpdateSlotsConfigWith,
because we have already updated the slots of the new primary before.
This result in, for example, importing slots and migrating slots
not being updated, see valkey-io#445.

In this commit, we also make clusterMoveNodeSlots to move importing
slots and migrating slots.

Fixes valkey-io#2363.

Signed-off-by: Binbin <binloveplay1314@qq.com>
sarthakaggarwal97 pushed a commit to sarthakaggarwal97/valkey that referenced this pull request Apr 27, 2026
…g slots (valkey-io#2370)

In valkey-io#2301, we added clusterMoveNodeSlots to implement the logic of
moving slots from old primary to new primary, when myself receives
the replica (old primary) message first and the new primary message
later in a shard failover.

However due to this, when myself receives the new primary message
later next time, there is no way to call clusterUpdateSlotsConfigWith,
because we have already updated the slots of the new primary before.
This result in, for example, importing slots and migrating slots
not being updated, see valkey-io#445.

In this commit, we also make clusterMoveNodeSlots to move importing
slots and migrating slots.

Fixes valkey-io#2363.

Signed-off-by: Binbin <binloveplay1314@qq.com>
(cherry picked from commit a3907ad)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cluster release-notes This issue should get a line item in the release notes run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP)

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

4 participants