Skip to content

The smaller config epoch primary will become the replica when two primaries in the same shard#2279

Open
enjoy-binbin wants to merge 9 commits into
valkey-io:unstablefrom
enjoy-binbin:issue_2261
Open

The smaller config epoch primary will become the replica when two primaries in the same shard#2279
enjoy-binbin wants to merge 9 commits into
valkey-io:unstablefrom
enjoy-binbin:issue_2261

Conversation

@enjoy-binbin

@enjoy-binbin enjoy-binbin commented Jun 26, 2025

Copy link
Copy Markdown
Member

In #2261, there is a scenario that results in two primaries in a shard.

Let's say a shard has a primary(A) and a replica(B). Also let's assume that
cluster-allow-replica-migration is disabled.

  1. Node A goes down, and Node B takes over primaryship.
  2. Node A continues to be down while another Node C is added as a replica of B.
  3. Node B goes down, and Node C takes over primaryship.
  4. Node A and Node B come back up and start learning about the topology.
  5. Node A comes up thinking it was the primary (but has an older config epoch compared to C).
  6. Node A learns about Node C via gossip and assigns it a random shard_id.
  7. Node A receives a direct ping from Node C.
    a. Node C advertises the same set of slots that Node A was earlier owning.
    b. Since Node A assigns a random shard ID to Node C, Node A thinks that it is still a
    primary and it lost all its slots to Node C, which is in another shard.
  8. Node A then updates the actual shard_id of Node C while processing shard_id in ping extensions.
  9. Node A and Node C end up being primaries in the same shard while Node C continues to own slots.

So in the step 8, Node A and Node C has the same shard_id and both report themselves
as a primary, but Node C has a bigger config epoch, i guess in this case, we can try
to make Node A become a replica of Node C.

Fixes #2261.

An ASCII seq diagram:

+------+          +------+          +------+
|Node A|          |Node B|          |Node C|
+------+          +------+          +------+
   |                 |                 |
   | <Primary>       | <Replica>       |
   |                 |                 |
   X (fails)         |                 |
                     |---------------->|
                     |  (promoted to primary)
                     |                 |
                     X (fails)         |
                                       |
                                       |---------->
                                       | (promoted to primary)
   | (rejoins)                         |
   | (no extensions shared)            |
   |---------------------------------->|
   | (assigns random shard_id)|
   |<----------------------------------|
   | (direct ping: slot info)          |
   | Updates C’s shard_id correctly    |
   | Detects slot conflict             |
   |                                  
   |                                  
+-----------------------------------------------+
| Outcome: Nodes A & C both see themselves as   |
| primary temporarily until shard_id is updated |
| and converged correctly based on epoch value  |
+-----------------------------------------------+

…maries in the same shard

Signed-off-by: Binbin <binloveplay1314@qq.com>
Comment thread tests/unit/cluster/shardid-propagation.tcl Outdated
Comment thread tests/unit/cluster/shardid-propagation.tcl Outdated

@hpatro hpatro left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, the fix looks good to me.

Liked the GPT generated (slightly modified) ASCII seq diagram for this, pasting it here for others as well.

+------+          +------+          +------+
|Node A|          |Node B|          |Node C|
+------+          +------+          +------+
   |                 |                 |
   | <Primary>       | <Replica>       |
   |                 |                 |
   X (fails)         |                 |
                     |---------------->|
                     |  (promoted to primary)
                     |                 |
                     X (fails)         |
                                       |
                                       |---------->
                                       | (promoted to primary)
   | (rejoins)                         |
   | (no extensions shared)            |
   |---------------------------------->|
   | (assigns random shard_id)|
   |<----------------------------------|
   | (direct ping: slot info)          |
   | Updates C’s shard_id correctly    |
   | Detects slot conflict             |
   |                                  
   |                                  
+-----------------------------------------------+
| Outcome: Nodes A & C both see themselves as   |
| primary temporarily until shard_id is updated |
| and converged correctly based on epoch value  |
+-----------------------------------------------+

Comment thread src/cluster_legacy.c Outdated
Comment thread src/cluster_legacy.c Outdated
Signed-off-by: Binbin <binloveplay1314@qq.com>
@enjoy-binbin

Copy link
Copy Markdown
Member Author

I see the test New replica inherits multiple migrating slots is failing:

47461:M 28 Jun 2025 12:18:24.199 * Two primaries in same shard, and the sender has a greater config epoch. Reconfiguring myself as a replica of 8af67730a16d4a47a947271cca28a03ef9fdf337 ()
cluster_legacy.c:5995 'myself->numslots == 0' is not true

myself cluster nodes:

d2081b715fb4b174127ec7a99de4e180de345fbd 127.0.0.1:21120@31120,,tls-port=0,shard-id=9d3ebac222cc295191331bd3dbba06e607b42019 master - 0 1751084304199 3 connected 10923-16383
ee0d3a2618bf64993ce1e71569f845aa6e33f46b 127.0.0.1:21121@31121,,tls-port=0,shard-id=d3f24d198dac6233c89454ed22f1a3a85b1e0e9f master - 0 1751084304199 2 connected 5462-10922
8af67730a16d4a47a947271cca28a03ef9fdf337 127.0.0.1:21119@31119,,tls-port=0,shard-id=d5ada9160b2d0a5196dcdfe0325f6accf187dec0 master - 0 1751084304199 4 connected
21fe04bcaaf3eb6dfb6217e27b3100dccb0993e4 127.0.0.1:21118@31118,,tls-port=0,shard-id=d3f24d198dac6233c89454ed22f1a3a85b1e0e9f slave ee0d3a2618bf64993ce1e71569f845aa6e33f46b 0 1751084304199 2 connected
1497ec4b6d1ecdc419741306f8d1ae44d9e8fc9d 127.0.0.1:21117@31117,,tls-port=0,shard-id=9d3ebac222cc295191331bd3dbba06e607b42019 slave d2081b715fb4b174127ec7a99de4e180de345fbd 0 1751084304199 3 connected
7c0a9f9b0610f3a3c4d895282ed41f51f9f97a5d 127.0.0.1:21122@31122,,tls-port=0,shard-id=d5ada9160b2d0a5196dcdfe0325f6accf187dec0 myself,master - 0 0 1 connected 0-5461 [7->-ee0d3a2618bf64993ce1e71569f845aa6e33f46b] [13->-ee0d3a2618bf64993ce1e71569f845aa6e33f46b] [17->-ee0d3a2618bf64993ce1e71569f845aa6e33f46b]
```

I will take a look later this day.

@enjoy-binbin

enjoy-binbin commented Jun 28, 2025

Copy link
Copy Markdown
Member Author

The problem seems to be, a primary used to had a larger config epoch, and then it became a replica, and then it update the shard_id, and then it doing the cluster reset soft and became a empty primary, but remain the larger config epoch and its shard_id, and causing this problem. IT is also the same issue, two primaries in the same shard, but the empty primary had a larger config epoch.

See ##2283. I think when a replica doing the CLUSTER RESET SOFT, we should generate a new shard_id for it.

@enjoy-binbin enjoy-binbin marked this pull request as draft June 28, 2025 15:07
@zuiderkwast

Copy link
Copy Markdown
Contributor

and then it doing the cluster reset soft and became a empty primary, but remain the larger config epoch and its shard_id, and causing this problem

This test case doesn's use CLUSTER RESET SOFT. ❓

@zuiderkwast zuiderkwast left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM in general.

Comment thread tests/unit/cluster/shardid-propagation.tcl
@enjoy-binbin

Copy link
Copy Markdown
Member Author

This test case doesn's use CLUSTER RESET SOFT. ❓

This is the other test case, i handle it in #2283

    test "New replica inherits multiple migrating slots" {
        # Reset R3 to turn it into an empty node
        assert_equal {OK} [R 3 CLUSTER RESET]
        # Add R3 back as a replica of R0
        assert_equal {OK} [R 3 CLUSTER MEET [srv 0 "host"] [srv 0 "port"]]
        wait_for_role 0 master
        assert_equal {OK} [R 3 CLUSTER REPLICATE $R0_id]
        wait_for_role 3 slave
        # Validate final states
        wait_for_slot_state 3 "\[7->-$R1_id\] \[13->-$R1_id\] \[17->-$R1_id\]"
    }

Signed-off-by: Binbin <binloveplay1314@qq.com>
Comment thread src/cluster_legacy.c
Comment on lines +3813 to +3819
/* If after processing everything, we find that myself and sender are on the
* same shard and are both primaries, if myself config epoch is smaller, make
* it a replica. */
if (sender && nodeIsPrimary(myself) && nodeIsPrimary(sender) && areInSameShard(myself, sender) &&
nodeEpoch(sender) > nodeEpoch(myself)) {
clusterHandlePrimariesSameShardCollision(sender);
}

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new test added in ##2283 now currently will fail in here:

86095:M 05 Jul 2025 12:57:44.202 # === ASSERTION FAILED ===
86095:M 05 Jul 2025 12:57:44.202 # ==> cluster_legacy.c:6020 'myself->numslots == 0' is not true

------ STACK TRACE ------

Backtrace:
0   valkey-server                       0x000000010227faac updateShardId + 0
1   valkey-server                       0x000000010227d12c clusterReadHandler + 7736
2   valkey-server                       0x00000001022f8040 connSocketEventHandler + 180
3   valkey-server                       0x00000001021bedd8 aeProcessEvents + 372
4   valkey-server                       0x00000001021e8f50 main + 26864
5   dyld                                0x0000000180b13154 start + 2476

That is because we drop the link->support_extension in here #2283 (comment)

The shard_id was not updated in time, which led to a misjudgment. So we either need to support link->support_extension or add node->numslots judgment here.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
@enjoy-binbin enjoy-binbin added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Jul 11, 2025
enjoy-binbin added a commit that referenced this pull request Jul 16, 2025
… faster during handshake (#2310)

Add support_extension to cluster link, so that when myself receives a message from sender,
even if myself does not know the sender, for example during handshake the sender node is
still treated as random node, in thi case, the packet we reply to sender can also carry the
extension, since the sender link explicitly indicates that it supports the extension, by doing
this, we can speed up the spread of the extension.

Also see #2283 and #2279 for more details.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
@codecov

codecov Bot commented Jul 16, 2025

Copy link
Copy Markdown

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 71.46%. Comparing base (78e7208) to head (9c749e6).
Report is 7 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2279      +/-   ##
============================================
+ Coverage     71.42%   71.46%   +0.03%     
============================================
  Files           123      123              
  Lines         67135    67147      +12     
============================================
+ Hits          47952    47987      +35     
+ Misses        19183    19160      -23     
Files with missing lines Coverage Δ
src/cluster_legacy.c 86.94% <100.00%> (-0.16%) ⬇️

... and 23 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Signed-off-by: Binbin <binloveplay1314@qq.com>
@enjoy-binbin enjoy-binbin marked this pull request as ready for review July 23, 2025 08:35
@enjoy-binbin enjoy-binbin requested a review from PingXie July 23, 2025 08:35
@hpatro hpatro self-requested a review August 1, 2025 20:34
enjoy-binbin added a commit to enjoy-binbin/valkey that referenced this pull request Aug 5, 2025
The assert was added in valkey-io#2301 and we found that there are some situations
would trigger assert and crash the server.

The reason we added the assert is because, in the code:
1. sender_claimed_primary and sender are in the same shard
2. and sender is the old primary, sender_claimed_primary is the old replica
3. and now sender become a replica, sender_claimed_primary become a primary

That means a failover happend in the shard, and sender should be the primary
of sender_claimed_primary. But obviously this assumption may be wrong, we rely
on shard_id to determine whether it is in a same shard, and assume that a shard
can only have one primary.

But this is wrong, from valkey-io#2279 we can know there will be a case that we can
create two primaries in the same shard due to the untimely update of shard_id.
So we can create a test that trigger the assert in this way:
1. pre condition: two primaries in the same shard, one has slots and one is empty.
2. replica doing a cluster failover
3. the empty primary doing a cluster replicate with the replica (new primary)

We change the assert to an if condition to fix it.

Signed-off-by: Binbin <binloveplay1314@qq.com>
enjoy-binbin added a commit that referenced this pull request Aug 12, 2025
…2431)

The assert was added in #2301 and we found that there are some situations
would trigger assert and crash the server.

The reason we added the assert is because, in the code:
1. sender_claimed_primary and sender are in the same shard
2. and sender is the old primary, sender_claimed_primary is the old replica
3. and now sender become a replica, sender_claimed_primary become a primary

That means a failover happend in the shard, and sender should be the primary
of sender_claimed_primary. But obviously this assumption may be wrong, we rely
on shard_id to determine whether it is in a same shard, and assume that a shard
can only have one primary.

But this is wrong, from #2279 we can know there will be a case that we can
create two primaries in the same shard due to the untimely update of shard_id.
So we can create a test that trigger the assert in this way:
1. pre condition: two primaries in the same shard, one has slots and one is empty.
2. replica doing a cluster failover
3. the empty primary doing a cluster replicate with the replica (new primary)

We change the assert to an if condition to fix it.

Closes #2423.

Note that the test written here also exposes the issue in #2441, so these two
may need to be addressed together.

Signed-off-by: Binbin <binloveplay1314@qq.com>
allenss-amazon pushed a commit to allenss-amazon/valkey-core that referenced this pull request Aug 19, 2025
…alkey-io#2431)

The assert was added in valkey-io#2301 and we found that there are some situations
would trigger assert and crash the server.

The reason we added the assert is because, in the code:
1. sender_claimed_primary and sender are in the same shard
2. and sender is the old primary, sender_claimed_primary is the old replica
3. and now sender become a replica, sender_claimed_primary become a primary

That means a failover happend in the shard, and sender should be the primary
of sender_claimed_primary. But obviously this assumption may be wrong, we rely
on shard_id to determine whether it is in a same shard, and assume that a shard
can only have one primary.

But this is wrong, from valkey-io#2279 we can know there will be a case that we can
create two primaries in the same shard due to the untimely update of shard_id.
So we can create a test that trigger the assert in this way:
1. pre condition: two primaries in the same shard, one has slots and one is empty.
2. replica doing a cluster failover
3. the empty primary doing a cluster replicate with the replica (new primary)

We change the assert to an if condition to fix it.

Closes valkey-io#2423.

Note that the test written here also exposes the issue in valkey-io#2441, so these two
may need to be addressed together.

Signed-off-by: Binbin <binloveplay1314@qq.com>
hpatro pushed a commit to hpatro/valkey that referenced this pull request Oct 3, 2025
…alkey-io#2431)

The assert was added in valkey-io#2301 and we found that there are some situations
would trigger assert and crash the server.

The reason we added the assert is because, in the code:
1. sender_claimed_primary and sender are in the same shard
2. and sender is the old primary, sender_claimed_primary is the old replica
3. and now sender become a replica, sender_claimed_primary become a primary

That means a failover happend in the shard, and sender should be the primary
of sender_claimed_primary. But obviously this assumption may be wrong, we rely
on shard_id to determine whether it is in a same shard, and assume that a shard
can only have one primary.

But this is wrong, from valkey-io#2279 we can know there will be a case that we can
create two primaries in the same shard due to the untimely update of shard_id.
So we can create a test that trigger the assert in this way:
1. pre condition: two primaries in the same shard, one has slots and one is empty.
2. replica doing a cluster failover
3. the empty primary doing a cluster replicate with the replica (new primary)

We change the assert to an if condition to fix it.

Closes valkey-io#2423.

Note that the test written here also exposes the issue in valkey-io#2441, so these two
may need to be addressed together.

Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
@hpatro

hpatro commented Oct 10, 2025

Copy link
Copy Markdown
Contributor

A bit too late to include this change to 9.0 GA. Let's target it for 9.1

@coderabbitai

coderabbitai Bot commented May 28, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

This PR adds automatic collision resolution when two primary nodes are detected in the same shard. A new handler reconfigures the local node as a replica when the sender has a higher config epoch, and an integration test validates the scenario across multiple failovers and reconnections.

Changes

Primary Collision Resolution

Layer / File(s) Summary
Collision handler implementation and integration
src/cluster_legacy.c
New clusterHandlePrimariesSameShardCollision() logs and reconfigures the local node as the sender's replica. Integrated into message processing after gossip and ping extension handling, with context comment on ping extension timing.
Collision scenario test
tests/unit/cluster/shardid-propagation.tcl
Integration test drives topology through isolation, forced failovers, and reconnection control to create two primaries in the same shard, then asserts both nodes transition to replica roles.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • zuiderkwast
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: handling the scenario where two primaries exist in the same shard by making the smaller config epoch primary become a replica.
Description check ✅ Passed The description is directly related to the changeset, providing detailed context about issue #2261, the reproduction scenario, and the proposed solution with an ASCII diagram.
Linked Issues check ✅ Passed The PR implements the core requirement from #2261: when two primaries are detected in the same shard, the node with the smaller config epoch becomes a replica of the node with the larger epoch.
Out of Scope Changes check ✅ Passed All changes are in-scope: the new collision handler addresses #2261's primary issue, the test validates the fix, and a comment clarification does not alter runtime behavior.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cluster_legacy.c`:
- Around line 4416-4422: This change in cluster_legacy.c modifies shard
collision/failover behavior around the block that checks nodeIsPrimary(myself),
nodeIsPrimary(sender), areInSameShard(myself, sender) and calls
clusterHandlePrimariesSameShardCollision(sender); before merging, update the PR
description and add an explicit request for an `@core-team` architectural review
referencing the changed logic and the symbols
clusterHandlePrimariesSameShardCollision, nodeIsPrimary, areInSameShard, and
nodeEpoch so core architects can evaluate failover semantics; ensure the PR
summary explains the behavioral impact and links to this diff for clear review.
- Around line 2469-2475: The function clusterHandlePrimariesSameShardCollision
is a helper used only within this translation unit and should have internal
linkage; make it file-local by adding the static keyword to its declaration
(change the definition of clusterHandlePrimariesSameShardCollision to be static
void clusterHandlePrimariesSameShardCollision(clusterNode *sender)) so the
symbol is not exported, leaving the body and calls to clusterSetPrimary and
clusterDoBeforeSleep unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: d792cf65-87a4-4fe6-8f47-c4bd699fa8af

📥 Commits

Reviewing files that changed from the base of the PR and between 7c3821f and 76f3a4e.

📒 Files selected for processing (2)
  • src/cluster_legacy.c
  • tests/unit/cluster/shardid-propagation.tcl

Comment thread src/cluster_legacy.c
Comment on lines +2469 to +2475
void clusterHandlePrimariesSameShardCollision(clusterNode *sender) {
serverLog(LL_NOTICE, "Two primaries in same shard, and the sender has a greater config epoch. "
"Reconfiguring myself as a replica of %.40s (%s)",
sender->name, sender->human_nodename);
clusterSetPrimary(sender, 1, 0);
clusterDoBeforeSleep(CLUSTER_TODO_SAVE_CONFIG | CLUSTER_TODO_UPDATE_STATE | CLUSTER_TODO_FSYNC_CONFIG);
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Make the collision handler file-local with static.

Line 2469 introduces a helper that is only used inside src/cluster_legacy.c; it should be declared static to keep linkage local.

Proposed fix
-void clusterHandlePrimariesSameShardCollision(clusterNode *sender) {
+static void clusterHandlePrimariesSameShardCollision(clusterNode *sender) {

As per coding guidelines: "Use static keyword for file-local functions in C code".

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
void clusterHandlePrimariesSameShardCollision(clusterNode *sender) {
serverLog(LL_NOTICE, "Two primaries in same shard, and the sender has a greater config epoch. "
"Reconfiguring myself as a replica of %.40s (%s)",
sender->name, sender->human_nodename);
clusterSetPrimary(sender, 1, 0);
clusterDoBeforeSleep(CLUSTER_TODO_SAVE_CONFIG | CLUSTER_TODO_UPDATE_STATE | CLUSTER_TODO_FSYNC_CONFIG);
}
static void clusterHandlePrimariesSameShardCollision(clusterNode *sender) {
serverLog(LL_NOTICE, "Two primaries in same shard, and the sender has a greater config epoch. "
"Reconfiguring myself as a replica of %.40s (%s)",
sender->name, sender->human_nodename);
clusterSetPrimary(sender, 1, 0);
clusterDoBeforeSleep(CLUSTER_TODO_SAVE_CONFIG | CLUSTER_TODO_UPDATE_STATE | CLUSTER_TODO_FSYNC_CONFIG);
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cluster_legacy.c` around lines 2469 - 2475, The function
clusterHandlePrimariesSameShardCollision is a helper used only within this
translation unit and should have internal linkage; make it file-local by adding
the static keyword to its declaration (change the definition of
clusterHandlePrimariesSameShardCollision to be static void
clusterHandlePrimariesSameShardCollision(clusterNode *sender)) so the symbol is
not exported, leaving the body and calls to clusterSetPrimary and
clusterDoBeforeSleep unchanged.

Comment thread src/cluster_legacy.c
Comment on lines +4416 to +4422
/* If after processing everything, we find that myself and sender are on the
* same shard and are both primaries, if myself is a empty primary and myself
* config epoch is smaller, make it become a replica of sender. */
if (sender && nodeIsPrimary(myself) && nodeIsPrimary(sender) && areInSameShard(myself, sender) &&
myself->numslots == 0 && nodeEpoch(sender) > nodeEpoch(myself)) {
clusterHandlePrimariesSameShardCollision(sender);
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Request @core-team architectural review before merge.

This change alters shard collision/failover behavior in src/cluster_legacy.c and should explicitly get @core-team review.

As per coding guidelines: "Request @core-team architectural review for changes to cluster*.c, replication.c, rdb.c, or aof.c".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cluster_legacy.c` around lines 4416 - 4422, This change in
cluster_legacy.c modifies shard collision/failover behavior around the block
that checks nodeIsPrimary(myself), nodeIsPrimary(sender), areInSameShard(myself,
sender) and calls clusterHandlePrimariesSameShardCollision(sender); before
merging, update the PR description and add an explicit request for an `@core-team`
architectural review referencing the changed logic and the symbols
clusterHandlePrimariesSameShardCollision, nodeIsPrimary, areInSameShard, and
nodeEpoch so core architects can evaluate failover semantics; ensure the PR
summary explains the behavioral impact and links to this diff for clear review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP)

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

[BUG] Two primaries in same shard

4 participants