Fix primary replica chain cycle, always update sender_claimed_primary in advance#2441
Fix primary replica chain cycle, always update sender_claimed_primary in advance#2441enjoy-binbin wants to merge 9 commits into
Conversation
… in advance When writing test case for valkey-io#2431, i found that the corresponding test case occasionally triggered this debug assert: ``` Node ef54ff291855537b7f4f61c815955310e55bc9e2 () is no longer primary of shard ab1e16d7e11cbb0d81acf0bab4191d84e19df092; removed all 5462 slot(s) it used to own Node ef54ff291855537b7f4f61c815955310e55bc9e2 () is now part of shard aad3ac4bf806c2bd5441ed5c421893be20f02797 Node ef54ff291855537b7f4f61c815955310e55bc9e2 () is now a replica of node a1d09af28a10d58deb43ccc8bddd8c8d095a5af5 () in shard aad3ac4bf806c2bd5441ed5c421893be20f02797 === VALKEY BUG REPORT START: Cut & paste starting from here === === ASSERTION FAILED === ==> cluster_legacy.c:7004 'primary->replicaof == ((void*)0)' is not true ------ STACK TRACE ------ Backtrace: 0 valkey-server 0x000000010507ac70 clusterUpdateSlotsConfigWith + 0 1 valkey-server 0x0000000105077564 clusterReadHandler + 6180 2 valkey-server 0x00000001050f3b0c connSocketEventHandler + 180 3 valkey-server 0x0000000104fb7f9c aeProcessEvents + 372 4 valkey-server 0x0000000104fe2d30 main + 27676 5 dyld 0x0000000199177154 start + 2476 ``` And we can see from the cluster nodes output, there are indeed a replication loop: ``` sender: ef54ff291855537b7f4f61c815955310e55bc9e2 sender_claimed_primary: a1d09af28a10d58deb43ccc8bddd8c8d095a5af5 ef54ff291855537b7f4f61c815955310e55bc9e2 shard-id=ab1e16d7e11cbb0d81acf0bab4191d84e19df092 slave a1d09af28a10d58deb43ccc8bddd8c8d095a5af5 a1d09af28a10d58deb43ccc8bddd8c8d095a5af5 shard-id=aad3ac4bf806c2bd5441ed5c421893be20f02797 slave ef54ff291855537b7f4f61c815955310e55bc9e2 ``` The sender and sender_claimed_primary are not in the same shard, and sender_claimed_primary is a replica in myself view, if we don't update sender_claimed_primary role, in the next `sender->replicaof = sender_claimed_primary`, we will create the replication loop. Signed-off-by: Binbin <binloveplay1314@qq.com>
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## unstable #2441 +/- ##
============================================
- Coverage 72.03% 71.93% -0.11%
============================================
Files 126 126
Lines 70435 70444 +9
============================================
- Hits 50740 50673 -67
- Misses 19695 19771 +76
🚀 New features to boost your workflow:
|
| if (sender_claimed_primary && nodeIsReplica(sender_claimed_primary)) { | ||
| /* `primary` is still a `replica` in this observer node's view; | ||
| * update its role and configEpoch */ | ||
| clusterSetNodeAsPrimary(sender_claimed_primary); | ||
| sender_claimed_primary->configEpoch = sender_claimed_config_epoch; | ||
| clusterDoBeforeSleep(CLUSTER_TODO_SAVE_CONFIG | CLUSTER_TODO_FSYNC_CONFIG | CLUSTER_TODO_UPDATE_STATE); | ||
| } | ||
|
|
There was a problem hiding this comment.
I'm worried that we don't have the ability to reason about if this is right anymore. This code seems like it should be hit in the earlier case when the primary was previously a replica and is no longer a replica. I guess we had other special code for matching the shard, so maybe it's covering migration cases as well.
There was a problem hiding this comment.
We opened the gate at 28976a9, it did fix a tricky issue at the time (which we encountered a few times as well). We may need @PingXie to take a closer look, i am a little worried. Now the relationship between shard_id and processPacket is getting closer and closer, but our shard_id is not updated so in time. (I suspect this debug assertion is because the shard_id hasn't been updated correctly during the pause or the packet drop, causing it to pass the same shard condition, if we want to avoid the crash, we also need to set sender_claimed_primary to primary in the else condition, or at least fix it when we see a replication cycle.)
There was a problem hiding this comment.
If the sender says it's a primary and it's epoch is greater than the one we think is the primary, we should trust it, right?
There was a problem hiding this comment.
this change also exposing a problem, that is, we modified the epoch of a node here.
https://github.com/valkey-io/valkey/actions/runs/16897847310/job/47870978571?pr=2441#step:7:7237
22304:S 12 Aug 2025 03:11:23.390 * Node 7db2919fe69b33a5bf8a70f1dec1fff19a77c8ff () is now a replica of node a0ad02b553db9024f194b8e9d2a2e8031e4428c5 () in shard 0b5882b4b1ae8066956a3875479b42f0db44e75c
22304:S 12 Aug 2025 03:11:23.397 * Replica rank updated to #0, added -1000 milliseconds of delay.
22304:S 12 Aug 2025 03:11:23.397 * Starting a failover election for epoch 11, node config epoch is 9
like in the log, myself update sender_claimed_primary config epoch, and sender_claimed_primary happen is myself's primary. And in the next failover cron, myself start the election with the right node epoch, and it win the election.
/* Ask for votes if needed. */
if (server.cluster->failover_auth_sent == 0) {
server.cluster->currentEpoch++;
server.cluster->failover_auth_epoch = server.cluster->currentEpoch;
serverLog(LL_NOTICE, "Starting a failover election for epoch %llu, node config epoch is %llu",
(unsigned long long)server.cluster->currentEpoch, (unsigned long long)nodeEpoch(myself));
clusterRequestFailoverAuth();
Previously, in this situation, its nodeEpoch would be an old value, and it would not be able to win the election.
There was a problem hiding this comment.
So there are two ways to handle this i guess:
- We check that myself->replicaof == node and cancel the election in here.
- We record the current epoch in the failover cron. If the nodeEpoch changes during the delay, abort the failover.
There was a problem hiding this comment.
I pushed a new commit for the option 2, see if you guys will like it. We are patching the code again and again...
There was a problem hiding this comment.
....
the reason is that after clusterSetNodeAsPrimary(sender_claimed_primary);, myself replica rank become No.0 since sender_claimed_primary is not a replica anymore and then myself trigger the election immediately
There was a problem hiding this comment.
pushed another new commit for the option 1 and see if i can make it right (make the CI happy)
zuiderkwast
left a comment
There was a problem hiding this comment.
Seems safe to me.
@enjoy-binbin We have some PRs that seem to affect each other. Shall we create a branch with this + #2431 + (some more PR?) to run the tests and see if solves the problems?
…2431) The assert was added in #2301 and we found that there are some situations would trigger assert and crash the server. The reason we added the assert is because, in the code: 1. sender_claimed_primary and sender are in the same shard 2. and sender is the old primary, sender_claimed_primary is the old replica 3. and now sender become a replica, sender_claimed_primary become a primary That means a failover happend in the shard, and sender should be the primary of sender_claimed_primary. But obviously this assumption may be wrong, we rely on shard_id to determine whether it is in a same shard, and assume that a shard can only have one primary. But this is wrong, from #2279 we can know there will be a case that we can create two primaries in the same shard due to the untimely update of shard_id. So we can create a test that trigger the assert in this way: 1. pre condition: two primaries in the same shard, one has slots and one is empty. 2. replica doing a cluster failover 3. the empty primary doing a cluster replicate with the replica (new primary) We change the assert to an if condition to fix it. Closes #2423. Note that the test written here also exposes the issue in #2441, so these two may need to be addressed together. Signed-off-by: Binbin <binloveplay1314@qq.com>
|
i have the memory this will solve the problem in my local, i have a fresh day today, i will work on it (merge it one by one and test it) |
…hain Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
| } | ||
|
|
||
| if (sender_claimed_primary && nodeIsReplica(sender_claimed_primary)) { | ||
| if (nodeIsReplica(myself) && areInSameShard(myself, sender_claimed_primary) && |
There was a problem hiding this comment.
Is it possible that the sender_claimed_primary actually has a different shard_id, since we are just learning about this through gossip? In this specific instance the sender_claimed_primary object has our our local view of its shard_id, but it may in reality have a different shard_id.
Also below, clusterSetNodeAsPrimary doesn't configure us to do replication, so we'll mark it as our primary but we won't start replicating from it.
In valkey-io#2431 we changed the assert to a if condition, and the test cause some trouble, now we just remove the assert (if condition) and disable the test for now due to valkey-io#2441. Signed-off-by: Binbin <binloveplay1314@qq.com>
|
After merging #2477, unstable now has these issue:
The #2477 test case will create a loop which trigger the assert. (This replicaof will eventually recover after some packets). TO fix it, maybe we should always update sender_claimed_primary's flag even though we are not in the same shard.
we have this code, and in the old slot migration case, we will incr epoch when we import a slot, the new epoch may not reach all nodes in time, so some nodes will know the new epoch and some other won't. So some node will drop the packet since it think the packet is stale.
|
Can we consider not marking a node to be a primary based off of another replica node claiming it as a primary? These problems have been endemic since we introduced the logic, since there are a bunch of odd cases. |
…alkey-io#2431) The assert was added in valkey-io#2301 and we found that there are some situations would trigger assert and crash the server. The reason we added the assert is because, in the code: 1. sender_claimed_primary and sender are in the same shard 2. and sender is the old primary, sender_claimed_primary is the old replica 3. and now sender become a replica, sender_claimed_primary become a primary That means a failover happend in the shard, and sender should be the primary of sender_claimed_primary. But obviously this assumption may be wrong, we rely on shard_id to determine whether it is in a same shard, and assume that a shard can only have one primary. But this is wrong, from valkey-io#2279 we can know there will be a case that we can create two primaries in the same shard due to the untimely update of shard_id. So we can create a test that trigger the assert in this way: 1. pre condition: two primaries in the same shard, one has slots and one is empty. 2. replica doing a cluster failover 3. the empty primary doing a cluster replicate with the replica (new primary) We change the assert to an if condition to fix it. Closes valkey-io#2423. Note that the test written here also exposes the issue in valkey-io#2441, so these two may need to be addressed together. Signed-off-by: Binbin <binloveplay1314@qq.com>
In valkey-io#2431 we changed the assert to a if condition, and the test cause some trouble, now we just remove the assert (if condition) and disable the test for now due to valkey-io#2441. Signed-off-by: Binbin <binloveplay1314@qq.com>
In valkey-io#2431 we changed the assert to a if condition, and the test cause some trouble, now we just remove the assert (if condition) and disable the test for now due to valkey-io#2441. Signed-off-by: Binbin <binloveplay1314@qq.com>
…alkey-io#2431) The assert was added in valkey-io#2301 and we found that there are some situations would trigger assert and crash the server. The reason we added the assert is because, in the code: 1. sender_claimed_primary and sender are in the same shard 2. and sender is the old primary, sender_claimed_primary is the old replica 3. and now sender become a replica, sender_claimed_primary become a primary That means a failover happend in the shard, and sender should be the primary of sender_claimed_primary. But obviously this assumption may be wrong, we rely on shard_id to determine whether it is in a same shard, and assume that a shard can only have one primary. But this is wrong, from valkey-io#2279 we can know there will be a case that we can create two primaries in the same shard due to the untimely update of shard_id. So we can create a test that trigger the assert in this way: 1. pre condition: two primaries in the same shard, one has slots and one is empty. 2. replica doing a cluster failover 3. the empty primary doing a cluster replicate with the replica (new primary) We change the assert to an if condition to fix it. Closes valkey-io#2423. Note that the test written here also exposes the issue in valkey-io#2441, so these two may need to be addressed together. Signed-off-by: Binbin <binloveplay1314@qq.com> Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
In valkey-io#2431 we changed the assert to a if condition, and the test cause some trouble, now we just remove the assert (if condition) and disable the test for now due to valkey-io#2441. Signed-off-by: Binbin <binloveplay1314@qq.com> Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
When writing test case for #2431, i found that the corresponding test
case occasionally triggered this debug assert:
And we can see from the cluster nodes output, there are indeed a replication loop:
The sender and sender_claimed_primary are not in the same shard, and
sender_claimed_primary is a replica in myself view, if we don't update
sender_claimed_primary role, in the next
sender->replicaof = sender_claimed_primary,we will create the replication loop.