Skip to content

Fix two primaries scenario due to unknown shard_id#2586

Open
deepakrn wants to merge 15 commits into
valkey-io:unstablefrom
deepakrn:two-primaries
Open

Fix two primaries scenario due to unknown shard_id#2586
deepakrn wants to merge 15 commits into
valkey-io:unstablefrom
deepakrn:two-primaries

Conversation

@deepakrn

@deepakrn deepakrn commented Sep 5, 2025

Copy link
Copy Markdown
Contributor

This PR tries to address issue #2261 where a certain order of events leads to two primaries in the same shard.

Fundamentally, the problem with the current state of the system is - if a node hasn't directly been PINGed by another node and has learnt about its presence via gossip, it gets associated with a random shard_id. This means that whenever, the other node with random shard_id advertises its slot, this node initially thinks that it lost all its slots to the other node from another shard and then processes the shard_id change. This leaves the node in a state where it is also a primary in the same shard. Similarly, when this node receives slot configs via an UPDATE message, the node won't know about shard_id until it receives a direct PING.

This indicates that, shard_id is a pre-requisite to processing any slot config updates from a node in the cluster. This change introduces a flag to track if the shard_id is initialized and if not, it allows the rest of the code to ignore certain updates until it has learnt about the shard_id. Introducing this new flag allows the code to be robust and extensible. Therefore, if we see any other scenarios in the future that needs shard_id to be set, this flag makes it possible.

Fixes #2261

@deepakrn

deepakrn commented Sep 5, 2025

Copy link
Copy Markdown
Contributor Author

@murphyjacob4 - could you please help review this change? I do plan to clean up the tests and remove all the sleeps. Other than this, I want to get your inputs on the high level approach for solving this.

#2261 describes the sequence of events.

In this change:

  1. I am preventing a node from processing any slot config updates if the shard_id of the sender node is still not initialized.
  2. I am changing the order of processing the ping extensions (and hence shard_id) and then updating slot configs.

@PingXie PingXie requested a review from Copilot September 5, 2025 18:04

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR addresses a "two primaries scenario" issue related to unknown shard_id values in cluster environments. The fix introduces a new flag to track nodes with uninitialized shard IDs and modifies processing order to ensure shard IDs are updated before slot configurations.

  • Introduces CLUSTER_NODE_SHARD_ID_UNINITIALIZED flag to track nodes with unknown shard IDs
  • Reorders gossip processing before slot configuration updates
  • Adds safeguards to prevent slot updates when shard ID is uninitialized

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
src/cluster_legacy.h Defines new flag constant for tracking uninitialized shard IDs
src/cluster_legacy.c Implements shard ID tracking logic, reorders processing, and adds validation
tests/unit/cluster/shardid-propagation.tcl Adds comprehensive test cases for shard ID propagation scenarios

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Comment thread src/cluster_legacy.c Outdated
* from pre-7.2 releases */
clusterRemoveNodeFromShard(myself);
memcpy(myself->shard_id, shard_id, CLUSTER_NAMELEN);
node->flags &= ~CLUSTER_NODE_SHARD_ID_UNINITIALIZED;

Copilot AI Sep 5, 2025

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The variable node should be myself to match the context. The code is updating myself->shard_id but clearing flags on node.

Suggested change
node->flags &= ~CLUSTER_NODE_SHARD_ID_UNINITIALIZED;
myself->flags &= ~CLUSTER_NODE_SHARD_ID_UNINITIALIZED;

Copilot uses AI. Check for mistakes.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copilot got this wrong.

however, there is a different issue here. one of the call sites of updateShardId on line 3932 in clusterProcessPacket might be looking at an uninitialized shard-id in some core cases. I am also not super sure about clusterSetPrimary. I think the other two call sites, clusterProcessPingExtensions and clusterCommandSpecial should be good. so rather than take a guess here, I wonder if updateShortId should take a clusterNode* instead so it could check if the reference node's shard id is indeed initialized or not. for the case where we need to set the shard-id explicitly, we can just set the shard-id directly and clear the uninitialized flag, as opposed to calling updateShardId. thoughts?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To elaborate the scenario where updateShardId in clusterProcessPacket might be looking at an uninitialized shard-id:

  1. A node A in node N has uninitialized shard-id because it hasn't directly pinged N yet.
  2. Node B(which was a replica of another node) becomes a replica of node A and hence the sender claimed primary changed. This is where the shard-id of node A which is uninitialized would be associated with B.

So in step 2, we would ideally want node B to be updated to the shard-id of its primary only if it is initialized?

For clusterSetPrimary, it seems to be exercised during SET SLOT command, when replica migration is activated etc. But in those scenarios, I would expect the target node to be reachable and must have directly pinged this node and hence should have the shard_id initialized.

But, I do like your suggestion. We could stall the update until the shard_id of the target node is initialized. I'll make that change. So, there will be two flavors of updateShardId - one that takes the clusterNode reference and another that takes the shard_id string.

@@ -1,3 +1,4 @@
if {0} {

Copilot AI Sep 5, 2025

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The if {0} statement will disable the entire first test block, making it unreachable. This appears to be leftover debugging code that should be removed.

Copilot uses AI. Check for mistakes.
Comment thread src/cluster_legacy.c Outdated
if (sender) {
clusterProcessGossipSection(hdr, link);
clusterProcessPingExtensions(hdr, link);
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed with @PingXie offline. We could simply not move this processing to earlier and instead not update slot configs until the shard_id is initialized. I'll work on making that change.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm aligned with this suggestion as well. I don't want to open another pandora box by moving them around.

@codecov

codecov Bot commented Sep 5, 2025

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 94.73684% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 72.23%. Comparing base (78060cb) to head (07c84f5).
⚠️ Report is 346 commits behind head on unstable.

Files with missing lines Patch % Lines
src/debug.c 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2586      +/-   ##
============================================
+ Coverage     72.21%   72.23%   +0.02%     
============================================
  Files           127      127              
  Lines         70936    70948      +12     
============================================
+ Hits          51227    51252      +25     
+ Misses        19709    19696      -13     
Files with missing lines Coverage Δ
src/cluster.c 89.98% <100.00%> (ø)
src/cluster_legacy.c 87.27% <100.00%> (+0.12%) ⬆️
src/debug.c 53.90% <0.00%> (ø)

... and 16 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@murphyjacob4 murphyjacob4 self-requested a review September 5, 2025 18:56
@murphyjacob4 murphyjacob4 added the major-decision-pending Major decision pending by TSC team label Sep 5, 2025
See valkey-io#2261 for more details.

Signed-off-by: Deepak Nandihalli <deepak.nandihalli@gmail.com>
… extensions

Signed-off-by: Deepak Nandihalli <deepak.nandihalli@gmail.com>
Signed-off-by: Deepak Nandihalli <deepak.nandihalli@gmail.com>
…but not in nodes.conf

Signed-off-by: Deepak Nandihalli <deepak.nandihalli@gmail.com>
@PingXie

PingXie commented Sep 9, 2025

Copy link
Copy Markdown
Member

@deepakrn can you fix the DCO by git commit --amend -s? since you have multiple commits already, I think the easiest thing to do is to force-push at the end. and please make sure future commits are one with -s.

the clang formatting also seems off.

Comment thread src/cluster_legacy.c Outdated
Comment thread src/cluster_legacy.c Outdated
* from pre-7.2 releases */
clusterRemoveNodeFromShard(myself);
memcpy(myself->shard_id, shard_id, CLUSTER_NAMELEN);
node->flags &= ~CLUSTER_NODE_SHARD_ID_UNINITIALIZED;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copilot got this wrong.

however, there is a different issue here. one of the call sites of updateShardId on line 3932 in clusterProcessPacket might be looking at an uninitialized shard-id in some core cases. I am also not super sure about clusterSetPrimary. I think the other two call sites, clusterProcessPingExtensions and clusterCommandSpecial should be good. so rather than take a guess here, I wonder if updateShortId should take a clusterNode* instead so it could check if the reference node's shard id is indeed initialized or not. for the case where we need to set the shard-id explicitly, we can just set the shard-id directly and clear the uninitialized flag, as opposed to calling updateShardId. thoughts?

Signed-off-by: Deepak Nandihalli <deepak.nandihalli@gmail.com>
Signed-off-by: Deepak Nandihalli <deepak.nandihalli@gmail.com>
…d shard-ids

Signed-off-by: Deepak Nandihalli <deepak.nandihalli@gmail.com>
Signed-off-by: Deepak Nandihalli <deepak.nandihalli@gmail.com>
Signed-off-by: Deepak Nandihalli <deepak.nandihalli@gmail.com>
Signed-off-by: Deepak Nandihalli <deepak.nandihalli@gmail.com>
Signed-off-by: Deepak Nandihalli <deepak.nandihalli@gmail.com>
@zuiderkwast

Copy link
Copy Markdown
Contributor

@murphyjacob4 Why is this a major decision? I can't see any new API or any breaking change here. It looks more like a bug fix to me.

@deepakrn

Copy link
Copy Markdown
Contributor Author

@zuiderkwast - My initial solution had a flag (unknownshard) for indicating that a node has uninitialized shard ID. Introducing this new flag was what we thought could be a change in interface. However, I later updated the solution to simply filter out nodes with uninitialized shard IDs while saving things to nodes.conf.

For CLUSTER NODES and other topology related commands, there is no change in behavior. So, I think this should not be breaking anymore.

cc - @murphyjacob4

Comment thread src/cluster_legacy.h Outdated
#define CLUSTER_NODE_EXTENSIONS_SUPPORTED (1 << 10) /* This node supports extensions. */
#define CLUSTER_NODE_LIGHT_HDR_PUBLISH_SUPPORTED (1 << 11) /* This node supports light message header for publish type. */
#define CLUSTER_NODE_LIGHT_HDR_MODULE_SUPPORTED (1 << 12) /* This node supports light message header for module type. */
#define CLUSTER_NODE_SHARD_ID_UNINITIALIZED (1 << 13) /* This node currently has a random shard_id assigned. */

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think CLUSTER_NODE_SHARD_ID_UNINITIALIZED should be a local state and we don't need to use the precious node flags space.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing this out, @PingXie . I have updated the changes to use a local flag.

Signed-off-by: Deepak Nandihalli <deepak.nandihalli@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

major-decision-pending Major decision pending by TSC team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Two primaries in same shard

6 participants