The smaller config epoch primary will become the replica when two primaries in the same shard#2279
The smaller config epoch primary will become the replica when two primaries in the same shard#2279enjoy-binbin wants to merge 9 commits into
Conversation
…maries in the same shard Signed-off-by: Binbin <binloveplay1314@qq.com>
hpatro
left a comment
There was a problem hiding this comment.
Overall, the fix looks good to me.
Liked the GPT generated (slightly modified) ASCII seq diagram for this, pasting it here for others as well.
+------+ +------+ +------+
|Node A| |Node B| |Node C|
+------+ +------+ +------+
| | |
| <Primary> | <Replica> |
| | |
X (fails) | |
|---------------->|
| (promoted to primary)
| |
X (fails) |
|
|---------->
| (promoted to primary)
| (rejoins) |
| (no extensions shared) |
|---------------------------------->|
| (assigns random shard_id)|
|<----------------------------------|
| (direct ping: slot info) |
| Updates C’s shard_id correctly |
| Detects slot conflict |
|
|
+-----------------------------------------------+
| Outcome: Nodes A & C both see themselves as |
| primary temporarily until shard_id is updated |
| and converged correctly based on epoch value |
+-----------------------------------------------+
Signed-off-by: Binbin <binloveplay1314@qq.com>
|
I see the test myself cluster nodes: |
|
The problem seems to be, a primary used to had a larger config epoch, and then it became a replica, and then it update the shard_id, and then it doing the cluster reset soft and became a empty primary, but remain the larger config epoch and its shard_id, and causing this problem. IT is also the same issue, two primaries in the same shard, but the empty primary had a larger config epoch. See ##2283. I think when a replica doing the CLUSTER RESET SOFT, we should generate a new shard_id for it. |
This test case doesn's use CLUSTER RESET SOFT. ❓ |
This is the other test case, i handle it in #2283 |
Signed-off-by: Binbin <binloveplay1314@qq.com>
| /* If after processing everything, we find that myself and sender are on the | ||
| * same shard and are both primaries, if myself config epoch is smaller, make | ||
| * it a replica. */ | ||
| if (sender && nodeIsPrimary(myself) && nodeIsPrimary(sender) && areInSameShard(myself, sender) && | ||
| nodeEpoch(sender) > nodeEpoch(myself)) { | ||
| clusterHandlePrimariesSameShardCollision(sender); | ||
| } |
There was a problem hiding this comment.
The new test added in ##2283 now currently will fail in here:
86095:M 05 Jul 2025 12:57:44.202 # === ASSERTION FAILED ===
86095:M 05 Jul 2025 12:57:44.202 # ==> cluster_legacy.c:6020 'myself->numslots == 0' is not true
------ STACK TRACE ------
Backtrace:
0 valkey-server 0x000000010227faac updateShardId + 0
1 valkey-server 0x000000010227d12c clusterReadHandler + 7736
2 valkey-server 0x00000001022f8040 connSocketEventHandler + 180
3 valkey-server 0x00000001021bedd8 aeProcessEvents + 372
4 valkey-server 0x00000001021e8f50 main + 26864
5 dyld 0x0000000180b13154 start + 2476
That is because we drop the link->support_extension in here #2283 (comment)
The shard_id was not updated in time, which led to a misjudgment. So we either need to support link->support_extension or add node->numslots judgment here.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
… faster during handshake (#2310) Add support_extension to cluster link, so that when myself receives a message from sender, even if myself does not know the sender, for example during handshake the sender node is still treated as random node, in thi case, the packet we reply to sender can also carry the extension, since the sender link explicitly indicates that it supports the extension, by doing this, we can speed up the spread of the extension. Also see #2283 and #2279 for more details.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## unstable #2279 +/- ##
============================================
+ Coverage 71.42% 71.46% +0.03%
============================================
Files 123 123
Lines 67135 67147 +12
============================================
+ Hits 47952 47987 +35
+ Misses 19183 19160 -23
🚀 New features to boost your workflow:
|
Signed-off-by: Binbin <binloveplay1314@qq.com>
The assert was added in valkey-io#2301 and we found that there are some situations would trigger assert and crash the server. The reason we added the assert is because, in the code: 1. sender_claimed_primary and sender are in the same shard 2. and sender is the old primary, sender_claimed_primary is the old replica 3. and now sender become a replica, sender_claimed_primary become a primary That means a failover happend in the shard, and sender should be the primary of sender_claimed_primary. But obviously this assumption may be wrong, we rely on shard_id to determine whether it is in a same shard, and assume that a shard can only have one primary. But this is wrong, from valkey-io#2279 we can know there will be a case that we can create two primaries in the same shard due to the untimely update of shard_id. So we can create a test that trigger the assert in this way: 1. pre condition: two primaries in the same shard, one has slots and one is empty. 2. replica doing a cluster failover 3. the empty primary doing a cluster replicate with the replica (new primary) We change the assert to an if condition to fix it. Signed-off-by: Binbin <binloveplay1314@qq.com>
…2431) The assert was added in #2301 and we found that there are some situations would trigger assert and crash the server. The reason we added the assert is because, in the code: 1. sender_claimed_primary and sender are in the same shard 2. and sender is the old primary, sender_claimed_primary is the old replica 3. and now sender become a replica, sender_claimed_primary become a primary That means a failover happend in the shard, and sender should be the primary of sender_claimed_primary. But obviously this assumption may be wrong, we rely on shard_id to determine whether it is in a same shard, and assume that a shard can only have one primary. But this is wrong, from #2279 we can know there will be a case that we can create two primaries in the same shard due to the untimely update of shard_id. So we can create a test that trigger the assert in this way: 1. pre condition: two primaries in the same shard, one has slots and one is empty. 2. replica doing a cluster failover 3. the empty primary doing a cluster replicate with the replica (new primary) We change the assert to an if condition to fix it. Closes #2423. Note that the test written here also exposes the issue in #2441, so these two may need to be addressed together. Signed-off-by: Binbin <binloveplay1314@qq.com>
…alkey-io#2431) The assert was added in valkey-io#2301 and we found that there are some situations would trigger assert and crash the server. The reason we added the assert is because, in the code: 1. sender_claimed_primary and sender are in the same shard 2. and sender is the old primary, sender_claimed_primary is the old replica 3. and now sender become a replica, sender_claimed_primary become a primary That means a failover happend in the shard, and sender should be the primary of sender_claimed_primary. But obviously this assumption may be wrong, we rely on shard_id to determine whether it is in a same shard, and assume that a shard can only have one primary. But this is wrong, from valkey-io#2279 we can know there will be a case that we can create two primaries in the same shard due to the untimely update of shard_id. So we can create a test that trigger the assert in this way: 1. pre condition: two primaries in the same shard, one has slots and one is empty. 2. replica doing a cluster failover 3. the empty primary doing a cluster replicate with the replica (new primary) We change the assert to an if condition to fix it. Closes valkey-io#2423. Note that the test written here also exposes the issue in valkey-io#2441, so these two may need to be addressed together. Signed-off-by: Binbin <binloveplay1314@qq.com>
…alkey-io#2431) The assert was added in valkey-io#2301 and we found that there are some situations would trigger assert and crash the server. The reason we added the assert is because, in the code: 1. sender_claimed_primary and sender are in the same shard 2. and sender is the old primary, sender_claimed_primary is the old replica 3. and now sender become a replica, sender_claimed_primary become a primary That means a failover happend in the shard, and sender should be the primary of sender_claimed_primary. But obviously this assumption may be wrong, we rely on shard_id to determine whether it is in a same shard, and assume that a shard can only have one primary. But this is wrong, from valkey-io#2279 we can know there will be a case that we can create two primaries in the same shard due to the untimely update of shard_id. So we can create a test that trigger the assert in this way: 1. pre condition: two primaries in the same shard, one has slots and one is empty. 2. replica doing a cluster failover 3. the empty primary doing a cluster replicate with the replica (new primary) We change the assert to an if condition to fix it. Closes valkey-io#2423. Note that the test written here also exposes the issue in valkey-io#2441, so these two may need to be addressed together. Signed-off-by: Binbin <binloveplay1314@qq.com> Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
|
A bit too late to include this change to 9.0 GA. Let's target it for 9.1 |
📝 WalkthroughWalkthroughThis PR adds automatic collision resolution when two primary nodes are detected in the same shard. A new handler reconfigures the local node as a replica when the sender has a higher config epoch, and an integration test validates the scenario across multiple failovers and reconnections. ChangesPrimary Collision Resolution
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Warning Review ran into problems🔥 ProblemsGit: Failed to clone repository. Please run the Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/cluster_legacy.c`:
- Around line 4416-4422: This change in cluster_legacy.c modifies shard
collision/failover behavior around the block that checks nodeIsPrimary(myself),
nodeIsPrimary(sender), areInSameShard(myself, sender) and calls
clusterHandlePrimariesSameShardCollision(sender); before merging, update the PR
description and add an explicit request for an `@core-team` architectural review
referencing the changed logic and the symbols
clusterHandlePrimariesSameShardCollision, nodeIsPrimary, areInSameShard, and
nodeEpoch so core architects can evaluate failover semantics; ensure the PR
summary explains the behavioral impact and links to this diff for clear review.
- Around line 2469-2475: The function clusterHandlePrimariesSameShardCollision
is a helper used only within this translation unit and should have internal
linkage; make it file-local by adding the static keyword to its declaration
(change the definition of clusterHandlePrimariesSameShardCollision to be static
void clusterHandlePrimariesSameShardCollision(clusterNode *sender)) so the
symbol is not exported, leaving the body and calls to clusterSetPrimary and
clusterDoBeforeSleep unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro Plus
Run ID: d792cf65-87a4-4fe6-8f47-c4bd699fa8af
📒 Files selected for processing (2)
src/cluster_legacy.ctests/unit/cluster/shardid-propagation.tcl
| void clusterHandlePrimariesSameShardCollision(clusterNode *sender) { | ||
| serverLog(LL_NOTICE, "Two primaries in same shard, and the sender has a greater config epoch. " | ||
| "Reconfiguring myself as a replica of %.40s (%s)", | ||
| sender->name, sender->human_nodename); | ||
| clusterSetPrimary(sender, 1, 0); | ||
| clusterDoBeforeSleep(CLUSTER_TODO_SAVE_CONFIG | CLUSTER_TODO_UPDATE_STATE | CLUSTER_TODO_FSYNC_CONFIG); | ||
| } |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win
Make the collision handler file-local with static.
Line 2469 introduces a helper that is only used inside src/cluster_legacy.c; it should be declared static to keep linkage local.
Proposed fix
-void clusterHandlePrimariesSameShardCollision(clusterNode *sender) {
+static void clusterHandlePrimariesSameShardCollision(clusterNode *sender) {As per coding guidelines: "Use static keyword for file-local functions in C code".
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| void clusterHandlePrimariesSameShardCollision(clusterNode *sender) { | |
| serverLog(LL_NOTICE, "Two primaries in same shard, and the sender has a greater config epoch. " | |
| "Reconfiguring myself as a replica of %.40s (%s)", | |
| sender->name, sender->human_nodename); | |
| clusterSetPrimary(sender, 1, 0); | |
| clusterDoBeforeSleep(CLUSTER_TODO_SAVE_CONFIG | CLUSTER_TODO_UPDATE_STATE | CLUSTER_TODO_FSYNC_CONFIG); | |
| } | |
| static void clusterHandlePrimariesSameShardCollision(clusterNode *sender) { | |
| serverLog(LL_NOTICE, "Two primaries in same shard, and the sender has a greater config epoch. " | |
| "Reconfiguring myself as a replica of %.40s (%s)", | |
| sender->name, sender->human_nodename); | |
| clusterSetPrimary(sender, 1, 0); | |
| clusterDoBeforeSleep(CLUSTER_TODO_SAVE_CONFIG | CLUSTER_TODO_UPDATE_STATE | CLUSTER_TODO_FSYNC_CONFIG); | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/cluster_legacy.c` around lines 2469 - 2475, The function
clusterHandlePrimariesSameShardCollision is a helper used only within this
translation unit and should have internal linkage; make it file-local by adding
the static keyword to its declaration (change the definition of
clusterHandlePrimariesSameShardCollision to be static void
clusterHandlePrimariesSameShardCollision(clusterNode *sender)) so the symbol is
not exported, leaving the body and calls to clusterSetPrimary and
clusterDoBeforeSleep unchanged.
| /* If after processing everything, we find that myself and sender are on the | ||
| * same shard and are both primaries, if myself is a empty primary and myself | ||
| * config epoch is smaller, make it become a replica of sender. */ | ||
| if (sender && nodeIsPrimary(myself) && nodeIsPrimary(sender) && areInSameShard(myself, sender) && | ||
| myself->numslots == 0 && nodeEpoch(sender) > nodeEpoch(myself)) { | ||
| clusterHandlePrimariesSameShardCollision(sender); | ||
| } |
There was a problem hiding this comment.
Request @core-team architectural review before merge.
This change alters shard collision/failover behavior in src/cluster_legacy.c and should explicitly get @core-team review.
As per coding guidelines: "Request @core-team architectural review for changes to cluster*.c, replication.c, rdb.c, or aof.c".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/cluster_legacy.c` around lines 4416 - 4422, This change in
cluster_legacy.c modifies shard collision/failover behavior around the block
that checks nodeIsPrimary(myself), nodeIsPrimary(sender), areInSameShard(myself,
sender) and calls clusterHandlePrimariesSameShardCollision(sender); before
merging, update the PR description and add an explicit request for an `@core-team`
architectural review referencing the changed logic and the symbols
clusterHandlePrimariesSameShardCollision, nodeIsPrimary, areInSameShard, and
nodeEpoch so core architects can evaluate failover semantics; ensure the PR
summary explains the behavioral impact and links to this diff for clear review.
In #2261, there is a scenario that results in two primaries in a shard.
Let's say a shard has a primary(A) and a replica(B). Also let's assume that
cluster-allow-replica-migration is disabled.
shard_id.a. Node C advertises the same set of slots that Node A was earlier owning.
b. Since Node A assigns a random shard ID to Node C, Node A thinks that it is still a
primary and it lost all its slots to Node C, which is in another shard.
shard_idof Node C while processingshard_idin ping extensions.So in the step 8, Node A and Node C has the same shard_id and both report themselves
as a primary, but Node C has a bigger config epoch, i guess in this case, we can try
to make Node A become a replica of Node C.
Fixes #2261.
An ASCII seq diagram: