Fix primary replica chain cycle, always update sender_claimed_primary in advance by enjoy-binbin · Pull Request #2441 · valkey-io/valkey

enjoy-binbin · 2025-08-06T17:14:39Z

When writing test case for #2431, i found that the corresponding test
case occasionally triggered this debug assert:

Node ef54ff291855537b7f4f61c815955310e55bc9e2 () is no longer primary of shard ab1e16d7e11cbb0d81acf0bab4191d84e19df092; removed all 5462 slot(s) it used to own
Node ef54ff291855537b7f4f61c815955310e55bc9e2 () is now part of shard aad3ac4bf806c2bd5441ed5c421893be20f02797
Node ef54ff291855537b7f4f61c815955310e55bc9e2 () is now a replica of node a1d09af28a10d58deb43ccc8bddd8c8d095a5af5 () in shard aad3ac4bf806c2bd5441ed5c421893be20f02797

=== VALKEY BUG REPORT START: Cut & paste starting from here ===
=== ASSERTION FAILED ===
==> cluster_legacy.c:7004 'primary->replicaof == ((void*)0)' is not true

------ STACK TRACE ------

Backtrace:
0   valkey-server                       0x000000010507ac70 clusterUpdateSlotsConfigWith + 0
1   valkey-server                       0x0000000105077564 clusterReadHandler + 6180
2   valkey-server                       0x00000001050f3b0c connSocketEventHandler + 180
3   valkey-server                       0x0000000104fb7f9c aeProcessEvents + 372
4   valkey-server                       0x0000000104fe2d30 main + 27676
5   dyld                                0x0000000199177154 start + 2476

And we can see from the cluster nodes output, there are indeed a replication loop:

sender: ef54ff291855537b7f4f61c815955310e55bc9e2
sender_claimed_primary: a1d09af28a10d58deb43ccc8bddd8c8d095a5af5

ef54ff291855537b7f4f61c815955310e55bc9e2 shard-id=ab1e16d7e11cbb0d81acf0bab4191d84e19df092 slave a1d09af28a10d58deb43ccc8bddd8c8d095a5af5
a1d09af28a10d58deb43ccc8bddd8c8d095a5af5 shard-id=aad3ac4bf806c2bd5441ed5c421893be20f02797 slave ef54ff291855537b7f4f61c815955310e55bc9e2

The sender and sender_claimed_primary are not in the same shard, and
sender_claimed_primary is a replica in myself view, if we don't update
sender_claimed_primary role, in the next sender->replicaof = sender_claimed_primary,
we will create the replication loop.

… in advance When writing test case for valkey-io#2431, i found that the corresponding test case occasionally triggered this debug assert: ``` Node ef54ff291855537b7f4f61c815955310e55bc9e2 () is no longer primary of shard ab1e16d7e11cbb0d81acf0bab4191d84e19df092; removed all 5462 slot(s) it used to own Node ef54ff291855537b7f4f61c815955310e55bc9e2 () is now part of shard aad3ac4bf806c2bd5441ed5c421893be20f02797 Node ef54ff291855537b7f4f61c815955310e55bc9e2 () is now a replica of node a1d09af28a10d58deb43ccc8bddd8c8d095a5af5 () in shard aad3ac4bf806c2bd5441ed5c421893be20f02797 === VALKEY BUG REPORT START: Cut & paste starting from here === === ASSERTION FAILED === ==> cluster_legacy.c:7004 'primary->replicaof == ((void*)0)' is not true ------ STACK TRACE ------ Backtrace: 0 valkey-server 0x000000010507ac70 clusterUpdateSlotsConfigWith + 0 1 valkey-server 0x0000000105077564 clusterReadHandler + 6180 2 valkey-server 0x00000001050f3b0c connSocketEventHandler + 180 3 valkey-server 0x0000000104fb7f9c aeProcessEvents + 372 4 valkey-server 0x0000000104fe2d30 main + 27676 5 dyld 0x0000000199177154 start + 2476 ``` And we can see from the cluster nodes output, there are indeed a replication loop: ``` sender: ef54ff291855537b7f4f61c815955310e55bc9e2 sender_claimed_primary: a1d09af28a10d58deb43ccc8bddd8c8d095a5af5 ef54ff291855537b7f4f61c815955310e55bc9e2 shard-id=ab1e16d7e11cbb0d81acf0bab4191d84e19df092 slave a1d09af28a10d58deb43ccc8bddd8c8d095a5af5 a1d09af28a10d58deb43ccc8bddd8c8d095a5af5 shard-id=aad3ac4bf806c2bd5441ed5c421893be20f02797 slave ef54ff291855537b7f4f61c815955310e55bc9e2 ``` The sender and sender_claimed_primary are not in the same shard, and sender_claimed_primary is a replica in myself view, if we don't update sender_claimed_primary role, in the next `sender->replicaof = sender_claimed_primary`, we will create the replication loop. Signed-off-by: Binbin <binloveplay1314@qq.com>

codecov · 2025-08-06T17:30:25Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 71.93%. Comparing base (d7993b7) to head (3571ab9).
⚠️ Report is 318 commits behind head on unstable.

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable    #2441      +/-   ##
============================================
- Coverage     72.03%   71.93%   -0.11%     
============================================
  Files           126      126              
  Lines         70435    70444       +9     
============================================
- Hits          50740    50673      -67     
- Misses        19695    19771      +76

Files with missing lines	Coverage Δ
src/cluster_legacy.c	`87.69% <100.00%> (-0.06%)`	⬇️

... and 15 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

madolson · 2025-08-06T18:45:42Z

+                if (sender_claimed_primary && nodeIsReplica(sender_claimed_primary)) {
+                    /* `primary` is still a `replica` in this observer node's view;
+                     * update its role and configEpoch */
+                    clusterSetNodeAsPrimary(sender_claimed_primary);
+                    sender_claimed_primary->configEpoch = sender_claimed_config_epoch;
+                    clusterDoBeforeSleep(CLUSTER_TODO_SAVE_CONFIG | CLUSTER_TODO_FSYNC_CONFIG | CLUSTER_TODO_UPDATE_STATE);
+                }
+


I'm worried that we don't have the ability to reason about if this is right anymore. This code seems like it should be hit in the earlier case when the primary was previously a replica and is no longer a replica. I guess we had other special code for matching the shard, so maybe it's covering migration cases as well.

We opened the gate at 28976a9, it did fix a tricky issue at the time (which we encountered a few times as well). We may need @PingXie to take a closer look, i am a little worried. Now the relationship between shard_id and processPacket is getting closer and closer, but our shard_id is not updated so in time. (I suspect this debug assertion is because the shard_id hasn't been updated correctly during the pause or the packet drop, causing it to pass the same shard condition, if we want to avoid the crash, we also need to set sender_claimed_primary to primary in the else condition, or at least fix it when we see a replication cycle.)

If the sender says it's a primary and it's epoch is greater than the one we think is the primary, we should trust it, right?

this change also exposing a problem, that is, we modified the epoch of a node here.
https://github.com/valkey-io/valkey/actions/runs/16897847310/job/47870978571?pr=2441#step:7:7237

22304:S 12 Aug 2025 03:11:23.390 * Node 7db2919fe69b33a5bf8a70f1dec1fff19a77c8ff () is now a replica of node a0ad02b553db9024f194b8e9d2a2e8031e4428c5 () in shard 0b5882b4b1ae8066956a3875479b42f0db44e75c 22304:S 12 Aug 2025 03:11:23.397 * Replica rank updated to #0, added -1000 milliseconds of delay. 22304:S 12 Aug 2025 03:11:23.397 * Starting a failover election for epoch 11, node config epoch is 9

like in the log, myself update sender_claimed_primary config epoch, and sender_claimed_primary happen is myself's primary. And in the next failover cron, myself start the election with the right node epoch, and it win the election.

/* Ask for votes if needed. */ if (server.cluster->failover_auth_sent == 0) { server.cluster->currentEpoch++; server.cluster->failover_auth_epoch = server.cluster->currentEpoch; serverLog(LL_NOTICE, "Starting a failover election for epoch %llu, node config epoch is %llu", (unsigned long long)server.cluster->currentEpoch, (unsigned long long)nodeEpoch(myself)); clusterRequestFailoverAuth();

Previously, in this situation, its nodeEpoch would be an old value, and it would not be able to win the election.

So there are two ways to handle this i guess:

We check that myself->replicaof == node and cancel the election in here.

We record the current epoch in the failover cron. If the nodeEpoch changes during the delay, abort the failover.

I pushed a new commit for the option 2, see if you guys will like it. We are patching the code again and again...

....

the reason is that after clusterSetNodeAsPrimary(sender_claimed_primary);, myself replica rank become No.0 since sender_claimed_primary is not a replica anymore and then myself trigger the election immediately

pushed another new commit for the option 1 and see if i can make it right (make the CI happy)

enjoy-binbin · 2025-08-07T08:17:20Z

A test fail example: https://github.com/valkey-io/valkey/actions/runs/16754537649/job/47433388245?pr=2431#step:5:6004

zuiderkwast

Seems safe to me.

@enjoy-binbin We have some PRs that seem to affect each other. Shall we create a branch with this + #2431 + (some more PR?) to run the tests and see if solves the problems?

…2431) The assert was added in #2301 and we found that there are some situations would trigger assert and crash the server. The reason we added the assert is because, in the code: 1. sender_claimed_primary and sender are in the same shard 2. and sender is the old primary, sender_claimed_primary is the old replica 3. and now sender become a replica, sender_claimed_primary become a primary That means a failover happend in the shard, and sender should be the primary of sender_claimed_primary. But obviously this assumption may be wrong, we rely on shard_id to determine whether it is in a same shard, and assume that a shard can only have one primary. But this is wrong, from #2279 we can know there will be a case that we can create two primaries in the same shard due to the untimely update of shard_id. So we can create a test that trigger the assert in this way: 1. pre condition: two primaries in the same shard, one has slots and one is empty. 2. replica doing a cluster failover 3. the empty primary doing a cluster replicate with the replica (new primary) We change the assert to an if condition to fix it. Closes #2423. Note that the test written here also exposes the issue in #2441, so these two may need to be addressed together. Signed-off-by: Binbin <binloveplay1314@qq.com>

enjoy-binbin · 2025-08-12T02:39:02Z

i have the memory this will solve the problem in my local, i have a fresh day today, i will work on it (merge it one by one and test it)

…hain Signed-off-by: Binbin <binloveplay1314@qq.com>

Signed-off-by: Binbin <binloveplay1314@qq.com>

This reverts commit a5ee57b. Signed-off-by: Binbin <binloveplay1314@qq.com>

This reverts commit 71bd1dc. Signed-off-by: Binbin <binloveplay1314@qq.com>

Signed-off-by: Binbin <binloveplay1314@qq.com>

madolson · 2025-08-12T19:25:18Z

                }

+                if (sender_claimed_primary && nodeIsReplica(sender_claimed_primary)) {
+                    if (nodeIsReplica(myself) && areInSameShard(myself, sender_claimed_primary) &&


Is it possible that the sender_claimed_primary actually has a different shard_id, since we are just learning about this through gossip? In this specific instance the sender_claimed_primary object has our our local view of its shard_id, but it may in reality have a different shard_id.

Also below, clusterSetNodeAsPrimary doesn't configure us to do replication, so we'll mark it as our primary but we won't start replicating from it.

In valkey-io#2431 we changed the assert to a if condition, and the test cause some trouble, now we just remove the assert (if condition) and disable the test for now due to valkey-io#2441. Signed-off-by: Binbin <binloveplay1314@qq.com>

enjoy-binbin · 2025-08-13T07:52:35Z

After merging #2477, unstable now has these issue:

replication loop

clusterNode *clusterNodeGetPrimary(clusterNode *node) {
    clusterNode *primary = node;
    while (primary->replicaof != NULL) {
        primary = primary->replicaof;
        if (primary == node) break;
    }
    /* Assert that a node's replicaof/primary chain does not form a cycle. */
    debugServerAssert(primary->replicaof == NULL);
    return primary;
}

The #2477 test case will create a loop which trigger the assert. (This replicaof will eventually recover after some packets).

TO fix it, maybe we should always update sender_claimed_primary's flag even though we are not in the same shard.

And if we fix 1, alwasy update sender_claimed_primary's flag, we now meet other issue.
Do we need to consider the timeliness of epochs?

if (nodeEpoch(sender_claimed_primary) > sender_claimed_config_epoch) {
       serverLog(LL_NOTICE,
                 "Ignore stale message from %.40s (%s) in shard %.40s;"
                 " gossip config epoch: %llu, current config epoch: %llu",
                 sender->name, sender->human_nodename, sender->shard_id,
                 (unsigned long long)sender_claimed_config_epoch,
                 (unsigned long long)nodeEpoch(sender_claimed_primary));

we have this code, and in the old slot migration case, we will incr epoch when we import a slot, the new epoch may not reach all nodes in time, so some nodes will know the new epoch and some other won't. So some node will drop the packet since it think the packet is stale.

Since now we will update sender_claimed_primary's flag by calling clusterSetNodeAsPrimary, it will also remove the replica from the primary.
So there are a test case, that replica1 and replica2 doing the failover, replica1 won the election, and replica2 are still doing the delay. At this moment, replica2 learn the message and call this function, and then it's replica_rank become 0 and trigger the new election. The new election will win again since it have the newly epoch during the packet. This is somehow like two replicas winning the election in the same epoch.

void clusterSetNodeAsPrimary(clusterNode *n) {
    if (clusterNodeIsPrimary(n)) return;

    if (n->replicaof) {
        clusterNodeRemoveReplica(n->replicaof, n);
        if (n != myself) n->flags |= CLUSTER_NODE_MIGRATE_TO;
    }
    n->flags &= ~CLUSTER_NODE_REPLICA;
    n->flags |= CLUSTER_NODE_PRIMARY;
    n->replicaof = NULL;

    /* Update config and state. */
    clusterDoBeforeSleep(CLUSTER_TODO_SAVE_CONFIG | CLUSTER_TODO_UPDATE_STATE);
}

In #2431 we changed the assert to a if condition, and the test cause some trouble, now we just remove the assert (if condition) and disable the test for now due to #2441. Signed-off-by: Binbin <binloveplay1314@qq.com>

madolson · 2025-08-18T17:53:50Z

TO fix it, maybe we should always update sender_claimed_primary's flag even though we are not in the same shard.

Can we consider not marking a node to be a primary based off of another replica node claiming it as a primary? These problems have been endemic since we introduced the logic, since there are a bunch of odd cases.

…alkey-io#2431) The assert was added in valkey-io#2301 and we found that there are some situations would trigger assert and crash the server. The reason we added the assert is because, in the code: 1. sender_claimed_primary and sender are in the same shard 2. and sender is the old primary, sender_claimed_primary is the old replica 3. and now sender become a replica, sender_claimed_primary become a primary That means a failover happend in the shard, and sender should be the primary of sender_claimed_primary. But obviously this assumption may be wrong, we rely on shard_id to determine whether it is in a same shard, and assume that a shard can only have one primary. But this is wrong, from valkey-io#2279 we can know there will be a case that we can create two primaries in the same shard due to the untimely update of shard_id. So we can create a test that trigger the assert in this way: 1. pre condition: two primaries in the same shard, one has slots and one is empty. 2. replica doing a cluster failover 3. the empty primary doing a cluster replicate with the replica (new primary) We change the assert to an if condition to fix it. Closes valkey-io#2423. Note that the test written here also exposes the issue in valkey-io#2441, so these two may need to be addressed together. Signed-off-by: Binbin <binloveplay1314@qq.com>

In valkey-io#2431 we changed the assert to a if condition, and the test cause some trouble, now we just remove the assert (if condition) and disable the test for now due to valkey-io#2441. Signed-off-by: Binbin <binloveplay1314@qq.com>

In #2431 we changed the assert to a if condition, and the test cause some trouble, now we just remove the assert (if condition) and disable the test for now due to #2441. Signed-off-by: Binbin <binloveplay1314@qq.com>

…alkey-io#2431) The assert was added in valkey-io#2301 and we found that there are some situations would trigger assert and crash the server. The reason we added the assert is because, in the code: 1. sender_claimed_primary and sender are in the same shard 2. and sender is the old primary, sender_claimed_primary is the old replica 3. and now sender become a replica, sender_claimed_primary become a primary That means a failover happend in the shard, and sender should be the primary of sender_claimed_primary. But obviously this assumption may be wrong, we rely on shard_id to determine whether it is in a same shard, and assume that a shard can only have one primary. But this is wrong, from valkey-io#2279 we can know there will be a case that we can create two primaries in the same shard due to the untimely update of shard_id. So we can create a test that trigger the assert in this way: 1. pre condition: two primaries in the same shard, one has slots and one is empty. 2. replica doing a cluster failover 3. the empty primary doing a cluster replicate with the replica (new primary) We change the assert to an if condition to fix it. Closes valkey-io#2423. Note that the test written here also exposes the issue in valkey-io#2441, so these two may need to be addressed together. Signed-off-by: Binbin <binloveplay1314@qq.com> Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>

In valkey-io#2431 we changed the assert to a if condition, and the test cause some trouble, now we just remove the assert (if condition) and disable the test for now due to valkey-io#2441. Signed-off-by: Binbin <binloveplay1314@qq.com> Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>

enjoy-binbin requested a review from PingXie August 6, 2025 17:14

madolson reviewed Aug 6, 2025

View reviewed changes

enjoy-binbin added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Aug 7, 2025

enjoy-binbin added this to Valkey 9.0 Aug 7, 2025

enjoy-binbin moved this to Optional for next release candidate in Valkey 9.0 Aug 7, 2025

enjoy-binbin mentioned this pull request Aug 8, 2025

Change the same shard failover assert to if condition to avoid crash #2431

Merged

madolson moved this from Optional for next release candidate to Todo in Valkey 9.0 Aug 11, 2025

zuiderkwast approved these changes Aug 11, 2025

View reviewed changes

enjoy-binbin mentioned this pull request Aug 12, 2025

Fix replica (the old primary) claims to sitll have slots after manual failover #2301

Merged

enjoy-binbin added 5 commits August 12, 2025 10:41

Merge remote-tracking branch 'upstream/unstable' into fix_replicaof_c…

7d51e35

…hain Signed-off-by: Binbin <binloveplay1314@qq.com>

Fix test

5550ee9

Signed-off-by: Binbin <binloveplay1314@qq.com>

revert test case

5365e38

Signed-off-by: Binbin <binloveplay1314@qq.com>

Cancel the election if the node config epoch changed

71bd1dc

Signed-off-by: Binbin <binloveplay1314@qq.com>

fix make

a5ee57b

Signed-off-by: Binbin <binloveplay1314@qq.com>

zuiderkwast reviewed Aug 12, 2025

View reviewed changes

Comment thread src/cluster_legacy.c

Comment thread src/cluster_legacy.h

enjoy-binbin added 3 commits August 12, 2025 15:57

Revert "fix make"

adffa25

This reverts commit a5ee57b. Signed-off-by: Binbin <binloveplay1314@qq.com>

Revert "Cancel the election if the node config epoch changed"

e1dea3f

This reverts commit 71bd1dc. Signed-off-by: Binbin <binloveplay1314@qq.com>

take a try

3571ab9

Signed-off-by: Binbin <binloveplay1314@qq.com>

madolson reviewed Aug 12, 2025

View reviewed changes

enjoy-binbin mentioned this pull request Aug 13, 2025

Remove if condition and disable the new failover test #2477

Merged

enjoy-binbin marked this pull request as draft August 13, 2025 07:42

enjoy-binbin added the cluster label Sep 19, 2025

zuiderkwast added this to Valkey 9.1 Dec 9, 2025

zuiderkwast removed this from Valkey 9.0 Dec 9, 2025

zuiderkwast moved this to Todo in Valkey 9.1 Dec 9, 2025

enjoy-binbin closed this Feb 9, 2026

enjoy-binbin deleted the fix_replicaof_chain branch February 9, 2026 04:44

github-project-automation Bot moved this from Todo to Done in Valkey 9.1 Feb 9, 2026

Uh oh!

Conversation

enjoy-binbin commented Aug 6, 2025

Uh oh!

codecov Bot commented Aug 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

madolson Aug 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

enjoy-binbin Aug 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zuiderkwast Aug 11, 2025

Choose a reason for hiding this comment

Uh oh!

enjoy-binbin Aug 12, 2025

Choose a reason for hiding this comment

Uh oh!

enjoy-binbin Aug 12, 2025

Choose a reason for hiding this comment

Uh oh!

enjoy-binbin Aug 12, 2025

Choose a reason for hiding this comment

Uh oh!

enjoy-binbin Aug 12, 2025

Choose a reason for hiding this comment

Uh oh!

enjoy-binbin Aug 12, 2025

Choose a reason for hiding this comment

Uh oh!

enjoy-binbin commented Aug 7, 2025

Uh oh!

zuiderkwast left a comment

Choose a reason for hiding this comment

Uh oh!

enjoy-binbin commented Aug 12, 2025

Uh oh!

Uh oh!

Uh oh!

madolson Aug 12, 2025

Choose a reason for hiding this comment

Uh oh!

enjoy-binbin commented Aug 13, 2025

Uh oh!

madolson commented Aug 18, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

codecov Bot commented Aug 6, 2025 •

edited

Loading

madolson Aug 6, 2025 •

edited

Loading

enjoy-binbin Aug 7, 2025 •

edited

Loading