Skip to content

Fix primary replica chain cycle, always update sender_claimed_primary in advance#2441

Closed
enjoy-binbin wants to merge 9 commits into
valkey-io:unstablefrom
enjoy-binbin:fix_replicaof_chain
Closed

Fix primary replica chain cycle, always update sender_claimed_primary in advance#2441
enjoy-binbin wants to merge 9 commits into
valkey-io:unstablefrom
enjoy-binbin:fix_replicaof_chain

Conversation

@enjoy-binbin

Copy link
Copy Markdown
Member

When writing test case for #2431, i found that the corresponding test
case occasionally triggered this debug assert:

Node ef54ff291855537b7f4f61c815955310e55bc9e2 () is no longer primary of shard ab1e16d7e11cbb0d81acf0bab4191d84e19df092; removed all 5462 slot(s) it used to own
Node ef54ff291855537b7f4f61c815955310e55bc9e2 () is now part of shard aad3ac4bf806c2bd5441ed5c421893be20f02797
Node ef54ff291855537b7f4f61c815955310e55bc9e2 () is now a replica of node a1d09af28a10d58deb43ccc8bddd8c8d095a5af5 () in shard aad3ac4bf806c2bd5441ed5c421893be20f02797

=== VALKEY BUG REPORT START: Cut & paste starting from here ===
=== ASSERTION FAILED ===
==> cluster_legacy.c:7004 'primary->replicaof == ((void*)0)' is not true

------ STACK TRACE ------

Backtrace:
0   valkey-server                       0x000000010507ac70 clusterUpdateSlotsConfigWith + 0
1   valkey-server                       0x0000000105077564 clusterReadHandler + 6180
2   valkey-server                       0x00000001050f3b0c connSocketEventHandler + 180
3   valkey-server                       0x0000000104fb7f9c aeProcessEvents + 372
4   valkey-server                       0x0000000104fe2d30 main + 27676
5   dyld                                0x0000000199177154 start + 2476

And we can see from the cluster nodes output, there are indeed a replication loop:

sender: ef54ff291855537b7f4f61c815955310e55bc9e2
sender_claimed_primary: a1d09af28a10d58deb43ccc8bddd8c8d095a5af5

ef54ff291855537b7f4f61c815955310e55bc9e2 shard-id=ab1e16d7e11cbb0d81acf0bab4191d84e19df092 slave a1d09af28a10d58deb43ccc8bddd8c8d095a5af5
a1d09af28a10d58deb43ccc8bddd8c8d095a5af5 shard-id=aad3ac4bf806c2bd5441ed5c421893be20f02797 slave ef54ff291855537b7f4f61c815955310e55bc9e2

The sender and sender_claimed_primary are not in the same shard, and
sender_claimed_primary is a replica in myself view, if we don't update
sender_claimed_primary role, in the next sender->replicaof = sender_claimed_primary,
we will create the replication loop.

… in advance

When writing test case for valkey-io#2431, i found that the corresponding test
case occasionally triggered this debug assert:
```
Node ef54ff291855537b7f4f61c815955310e55bc9e2 () is no longer primary of shard ab1e16d7e11cbb0d81acf0bab4191d84e19df092; removed all 5462 slot(s) it used to own
Node ef54ff291855537b7f4f61c815955310e55bc9e2 () is now part of shard aad3ac4bf806c2bd5441ed5c421893be20f02797
Node ef54ff291855537b7f4f61c815955310e55bc9e2 () is now a replica of node a1d09af28a10d58deb43ccc8bddd8c8d095a5af5 () in shard aad3ac4bf806c2bd5441ed5c421893be20f02797

=== VALKEY BUG REPORT START: Cut & paste starting from here ===
=== ASSERTION FAILED ===
==> cluster_legacy.c:7004 'primary->replicaof == ((void*)0)' is not true

------ STACK TRACE ------

Backtrace:
0   valkey-server                       0x000000010507ac70 clusterUpdateSlotsConfigWith + 0
1   valkey-server                       0x0000000105077564 clusterReadHandler + 6180
2   valkey-server                       0x00000001050f3b0c connSocketEventHandler + 180
3   valkey-server                       0x0000000104fb7f9c aeProcessEvents + 372
4   valkey-server                       0x0000000104fe2d30 main + 27676
5   dyld                                0x0000000199177154 start + 2476
```

And we can see from the cluster nodes output, there are indeed a replication loop:
```
sender: ef54ff291855537b7f4f61c815955310e55bc9e2
sender_claimed_primary: a1d09af28a10d58deb43ccc8bddd8c8d095a5af5

ef54ff291855537b7f4f61c815955310e55bc9e2 shard-id=ab1e16d7e11cbb0d81acf0bab4191d84e19df092 slave a1d09af28a10d58deb43ccc8bddd8c8d095a5af5
a1d09af28a10d58deb43ccc8bddd8c8d095a5af5 shard-id=aad3ac4bf806c2bd5441ed5c421893be20f02797 slave ef54ff291855537b7f4f61c815955310e55bc9e2
```

The sender and sender_claimed_primary are not in the same shard, and
sender_claimed_primary is a replica in myself view, if we don't update
sender_claimed_primary role, in the next `sender->replicaof = sender_claimed_primary`,
we will create the replication loop.

Signed-off-by: Binbin <binloveplay1314@qq.com>
@enjoy-binbin enjoy-binbin requested a review from PingXie August 6, 2025 17:14
@codecov

codecov Bot commented Aug 6, 2025

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 71.93%. Comparing base (d7993b7) to head (3571ab9).
⚠️ Report is 318 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2441      +/-   ##
============================================
- Coverage     72.03%   71.93%   -0.11%     
============================================
  Files           126      126              
  Lines         70435    70444       +9     
============================================
- Hits          50740    50673      -67     
- Misses        19695    19771      +76     
Files with missing lines Coverage Δ
src/cluster_legacy.c 87.69% <100.00%> (-0.06%) ⬇️

... and 15 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Comment thread src/cluster_legacy.c
Comment on lines +3829 to +3836
if (sender_claimed_primary && nodeIsReplica(sender_claimed_primary)) {
/* `primary` is still a `replica` in this observer node's view;
* update its role and configEpoch */
clusterSetNodeAsPrimary(sender_claimed_primary);
sender_claimed_primary->configEpoch = sender_claimed_config_epoch;
clusterDoBeforeSleep(CLUSTER_TODO_SAVE_CONFIG | CLUSTER_TODO_FSYNC_CONFIG | CLUSTER_TODO_UPDATE_STATE);
}

@madolson madolson Aug 6, 2025

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm worried that we don't have the ability to reason about if this is right anymore. This code seems like it should be hit in the earlier case when the primary was previously a replica and is no longer a replica. I guess we had other special code for matching the shard, so maybe it's covering migration cases as well.

@enjoy-binbin enjoy-binbin Aug 7, 2025

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We opened the gate at 28976a9, it did fix a tricky issue at the time (which we encountered a few times as well). We may need @PingXie to take a closer look, i am a little worried. Now the relationship between shard_id and processPacket is getting closer and closer, but our shard_id is not updated so in time. (I suspect this debug assertion is because the shard_id hasn't been updated correctly during the pause or the packet drop, causing it to pass the same shard condition, if we want to avoid the crash, we also need to set sender_claimed_primary to primary in the else condition, or at least fix it when we see a replication cycle.)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the sender says it's a primary and it's epoch is greater than the one we think is the primary, we should trust it, right?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this change also exposing a problem, that is, we modified the epoch of a node here.
https://github.com/valkey-io/valkey/actions/runs/16897847310/job/47870978571?pr=2441#step:7:7237

22304:S 12 Aug 2025 03:11:23.390 * Node 7db2919fe69b33a5bf8a70f1dec1fff19a77c8ff () is now a replica of node a0ad02b553db9024f194b8e9d2a2e8031e4428c5 () in shard 0b5882b4b1ae8066956a3875479b42f0db44e75c
22304:S 12 Aug 2025 03:11:23.397 * Replica rank updated to #0, added -1000 milliseconds of delay.
22304:S 12 Aug 2025 03:11:23.397 * Starting a failover election for epoch 11, node config epoch is 9

like in the log, myself update sender_claimed_primary config epoch, and sender_claimed_primary happen is myself's primary. And in the next failover cron, myself start the election with the right node epoch, and it win the election.

    /* Ask for votes if needed. */
    if (server.cluster->failover_auth_sent == 0) {
        server.cluster->currentEpoch++;
        server.cluster->failover_auth_epoch = server.cluster->currentEpoch;
        serverLog(LL_NOTICE, "Starting a failover election for epoch %llu, node config epoch is %llu",
                  (unsigned long long)server.cluster->currentEpoch, (unsigned long long)nodeEpoch(myself));
        clusterRequestFailoverAuth();

Previously, in this situation, its nodeEpoch would be an old value, and it would not be able to win the election.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So there are two ways to handle this i guess:

  1. We check that myself->replicaof == node and cancel the election in here.
  2. We record the current epoch in the failover cron. If the nodeEpoch changes during the delay, abort the failover.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed a new commit for the option 2, see if you guys will like it. We are patching the code again and again...

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

....

the reason is that after clusterSetNodeAsPrimary(sender_claimed_primary);, myself replica rank become No.0 since sender_claimed_primary is not a replica anymore and then myself trigger the election immediately

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pushed another new commit for the option 1 and see if i can make it right (make the CI happy)

@enjoy-binbin enjoy-binbin added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Aug 7, 2025
@enjoy-binbin

Copy link
Copy Markdown
Member Author

@enjoy-binbin enjoy-binbin moved this to Optional for next release candidate in Valkey 9.0 Aug 7, 2025
@madolson madolson moved this from Optional for next release candidate to Todo in Valkey 9.0 Aug 11, 2025

@zuiderkwast zuiderkwast left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems safe to me.

@enjoy-binbin We have some PRs that seem to affect each other. Shall we create a branch with this + #2431 + (some more PR?) to run the tests and see if solves the problems?

enjoy-binbin added a commit that referenced this pull request Aug 12, 2025
…2431)

The assert was added in #2301 and we found that there are some situations
would trigger assert and crash the server.

The reason we added the assert is because, in the code:
1. sender_claimed_primary and sender are in the same shard
2. and sender is the old primary, sender_claimed_primary is the old replica
3. and now sender become a replica, sender_claimed_primary become a primary

That means a failover happend in the shard, and sender should be the primary
of sender_claimed_primary. But obviously this assumption may be wrong, we rely
on shard_id to determine whether it is in a same shard, and assume that a shard
can only have one primary.

But this is wrong, from #2279 we can know there will be a case that we can
create two primaries in the same shard due to the untimely update of shard_id.
So we can create a test that trigger the assert in this way:
1. pre condition: two primaries in the same shard, one has slots and one is empty.
2. replica doing a cluster failover
3. the empty primary doing a cluster replicate with the replica (new primary)

We change the assert to an if condition to fix it.

Closes #2423.

Note that the test written here also exposes the issue in #2441, so these two
may need to be addressed together.

Signed-off-by: Binbin <binloveplay1314@qq.com>
@enjoy-binbin

Copy link
Copy Markdown
Member Author

i have the memory this will solve the problem in my local, i have a fresh day today, i will work on it (merge it one by one and test it)

…hain

Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Comment thread src/cluster_legacy.c
Comment thread src/cluster_legacy.h
This reverts commit a5ee57b.

Signed-off-by: Binbin <binloveplay1314@qq.com>
This reverts commit 71bd1dc.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Comment thread src/cluster_legacy.c
}

if (sender_claimed_primary && nodeIsReplica(sender_claimed_primary)) {
if (nodeIsReplica(myself) && areInSameShard(myself, sender_claimed_primary) &&

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible that the sender_claimed_primary actually has a different shard_id, since we are just learning about this through gossip? In this specific instance the sender_claimed_primary object has our our local view of its shard_id, but it may in reality have a different shard_id.

Also below, clusterSetNodeAsPrimary doesn't configure us to do replication, so we'll mark it as our primary but we won't start replicating from it.

enjoy-binbin added a commit to enjoy-binbin/valkey that referenced this pull request Aug 13, 2025
In valkey-io#2431 we changed the assert to a if condition, and the test
cause some trouble, now we just remove the assert (if condition)
and disable the test for now due to valkey-io#2441.

Signed-off-by: Binbin <binloveplay1314@qq.com>
@enjoy-binbin enjoy-binbin marked this pull request as draft August 13, 2025 07:42
@enjoy-binbin

Copy link
Copy Markdown
Member Author

After merging #2477, unstable now has these issue:

  1. replication loop
clusterNode *clusterNodeGetPrimary(clusterNode *node) {
    clusterNode *primary = node;
    while (primary->replicaof != NULL) {
        primary = primary->replicaof;
        if (primary == node) break;
    }
    /* Assert that a node's replicaof/primary chain does not form a cycle. */
    debugServerAssert(primary->replicaof == NULL);
    return primary;
}

The #2477 test case will create a loop which trigger the assert. (This replicaof will eventually recover after some packets).

TO fix it, maybe we should always update sender_claimed_primary's flag even though we are not in the same shard.

  1. And if we fix 1, alwasy update sender_claimed_primary's flag, we now meet other issue.
    Do we need to consider the timeliness of epochs?
if (nodeEpoch(sender_claimed_primary) > sender_claimed_config_epoch) {
       serverLog(LL_NOTICE,
                 "Ignore stale message from %.40s (%s) in shard %.40s;"
                 " gossip config epoch: %llu, current config epoch: %llu",
                 sender->name, sender->human_nodename, sender->shard_id,
                 (unsigned long long)sender_claimed_config_epoch,
                 (unsigned long long)nodeEpoch(sender_claimed_primary));

we have this code, and in the old slot migration case, we will incr epoch when we import a slot, the new epoch may not reach all nodes in time, so some nodes will know the new epoch and some other won't. So some node will drop the packet since it think the packet is stale.

  1. Since now we will update sender_claimed_primary's flag by calling clusterSetNodeAsPrimary, it will also remove the replica from the primary.
    So there are a test case, that replica1 and replica2 doing the failover, replica1 won the election, and replica2 are still doing the delay. At this moment, replica2 learn the message and call this function, and then it's replica_rank become 0 and trigger the new election. The new election will win again since it have the newly epoch during the packet. This is somehow like two replicas winning the election in the same epoch.
void clusterSetNodeAsPrimary(clusterNode *n) {
    if (clusterNodeIsPrimary(n)) return;

    if (n->replicaof) {
        clusterNodeRemoveReplica(n->replicaof, n);
        if (n != myself) n->flags |= CLUSTER_NODE_MIGRATE_TO;
    }
    n->flags &= ~CLUSTER_NODE_REPLICA;
    n->flags |= CLUSTER_NODE_PRIMARY;
    n->replicaof = NULL;

    /* Update config and state. */
    clusterDoBeforeSleep(CLUSTER_TODO_SAVE_CONFIG | CLUSTER_TODO_UPDATE_STATE);
}

enjoy-binbin added a commit that referenced this pull request Aug 13, 2025
In #2431 we changed the assert to a if condition, and the test cause
some trouble, now we just remove the assert (if condition) and disable
the test for now due to #2441.

Signed-off-by: Binbin <binloveplay1314@qq.com>
@madolson

Copy link
Copy Markdown
Member

TO fix it, maybe we should always update sender_claimed_primary's flag even though we are not in the same shard.

Can we consider not marking a node to be a primary based off of another replica node claiming it as a primary? These problems have been endemic since we introduced the logic, since there are a bunch of odd cases.

allenss-amazon pushed a commit to allenss-amazon/valkey-core that referenced this pull request Aug 19, 2025
…alkey-io#2431)

The assert was added in valkey-io#2301 and we found that there are some situations
would trigger assert and crash the server.

The reason we added the assert is because, in the code:
1. sender_claimed_primary and sender are in the same shard
2. and sender is the old primary, sender_claimed_primary is the old replica
3. and now sender become a replica, sender_claimed_primary become a primary

That means a failover happend in the shard, and sender should be the primary
of sender_claimed_primary. But obviously this assumption may be wrong, we rely
on shard_id to determine whether it is in a same shard, and assume that a shard
can only have one primary.

But this is wrong, from valkey-io#2279 we can know there will be a case that we can
create two primaries in the same shard due to the untimely update of shard_id.
So we can create a test that trigger the assert in this way:
1. pre condition: two primaries in the same shard, one has slots and one is empty.
2. replica doing a cluster failover
3. the empty primary doing a cluster replicate with the replica (new primary)

We change the assert to an if condition to fix it.

Closes valkey-io#2423.

Note that the test written here also exposes the issue in valkey-io#2441, so these two
may need to be addressed together.

Signed-off-by: Binbin <binloveplay1314@qq.com>
allenss-amazon pushed a commit to allenss-amazon/valkey-core that referenced this pull request Aug 19, 2025
In valkey-io#2431 we changed the assert to a if condition, and the test cause
some trouble, now we just remove the assert (if condition) and disable
the test for now due to valkey-io#2441.

Signed-off-by: Binbin <binloveplay1314@qq.com>
rjd15372 pushed a commit to rjd15372/valkey that referenced this pull request Sep 19, 2025
In valkey-io#2431 we changed the assert to a if condition, and the test cause
some trouble, now we just remove the assert (if condition) and disable
the test for now due to valkey-io#2441.

Signed-off-by: Binbin <binloveplay1314@qq.com>
rjd15372 pushed a commit that referenced this pull request Sep 23, 2025
In #2431 we changed the assert to a if condition, and the test cause
some trouble, now we just remove the assert (if condition) and disable
the test for now due to #2441.

Signed-off-by: Binbin <binloveplay1314@qq.com>
hpatro pushed a commit to hpatro/valkey that referenced this pull request Oct 3, 2025
…alkey-io#2431)

The assert was added in valkey-io#2301 and we found that there are some situations
would trigger assert and crash the server.

The reason we added the assert is because, in the code:
1. sender_claimed_primary and sender are in the same shard
2. and sender is the old primary, sender_claimed_primary is the old replica
3. and now sender become a replica, sender_claimed_primary become a primary

That means a failover happend in the shard, and sender should be the primary
of sender_claimed_primary. But obviously this assumption may be wrong, we rely
on shard_id to determine whether it is in a same shard, and assume that a shard
can only have one primary.

But this is wrong, from valkey-io#2279 we can know there will be a case that we can
create two primaries in the same shard due to the untimely update of shard_id.
So we can create a test that trigger the assert in this way:
1. pre condition: two primaries in the same shard, one has slots and one is empty.
2. replica doing a cluster failover
3. the empty primary doing a cluster replicate with the replica (new primary)

We change the assert to an if condition to fix it.

Closes valkey-io#2423.

Note that the test written here also exposes the issue in valkey-io#2441, so these two
may need to be addressed together.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
hpatro pushed a commit to hpatro/valkey that referenced this pull request Oct 3, 2025
In valkey-io#2431 we changed the assert to a if condition, and the test cause
some trouble, now we just remove the assert (if condition) and disable
the test for now due to valkey-io#2441.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
@zuiderkwast zuiderkwast removed this from Valkey 9.0 Dec 9, 2025
@zuiderkwast zuiderkwast moved this to Todo in Valkey 9.1 Dec 9, 2025
@enjoy-binbin enjoy-binbin deleted the fix_replicaof_chain branch February 9, 2026 04:44
@github-project-automation github-project-automation Bot moved this from Todo to Done in Valkey 9.1 Feb 9, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cluster run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP)

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

3 participants