Fix two primaries scenario due to unknown shard_id#2586
Conversation
|
@murphyjacob4 - could you please help review this change? I do plan to clean up the tests and remove all the sleeps. Other than this, I want to get your inputs on the high level approach for solving this. #2261 describes the sequence of events. In this change:
|
There was a problem hiding this comment.
Pull Request Overview
This PR addresses a "two primaries scenario" issue related to unknown shard_id values in cluster environments. The fix introduces a new flag to track nodes with uninitialized shard IDs and modifies processing order to ensure shard IDs are updated before slot configurations.
- Introduces
CLUSTER_NODE_SHARD_ID_UNINITIALIZEDflag to track nodes with unknown shard IDs - Reorders gossip processing before slot configuration updates
- Adds safeguards to prevent slot updates when shard ID is uninitialized
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/cluster_legacy.h | Defines new flag constant for tracking uninitialized shard IDs |
| src/cluster_legacy.c | Implements shard ID tracking logic, reorders processing, and adds validation |
| tests/unit/cluster/shardid-propagation.tcl | Adds comprehensive test cases for shard ID propagation scenarios |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
| * from pre-7.2 releases */ | ||
| clusterRemoveNodeFromShard(myself); | ||
| memcpy(myself->shard_id, shard_id, CLUSTER_NAMELEN); | ||
| node->flags &= ~CLUSTER_NODE_SHARD_ID_UNINITIALIZED; |
There was a problem hiding this comment.
The variable node should be myself to match the context. The code is updating myself->shard_id but clearing flags on node.
| node->flags &= ~CLUSTER_NODE_SHARD_ID_UNINITIALIZED; | |
| myself->flags &= ~CLUSTER_NODE_SHARD_ID_UNINITIALIZED; |
There was a problem hiding this comment.
copilot got this wrong.
however, there is a different issue here. one of the call sites of updateShardId on line 3932 in clusterProcessPacket might be looking at an uninitialized shard-id in some core cases. I am also not super sure about clusterSetPrimary. I think the other two call sites, clusterProcessPingExtensions and clusterCommandSpecial should be good. so rather than take a guess here, I wonder if updateShortId should take a clusterNode* instead so it could check if the reference node's shard id is indeed initialized or not. for the case where we need to set the shard-id explicitly, we can just set the shard-id directly and clear the uninitialized flag, as opposed to calling updateShardId. thoughts?
There was a problem hiding this comment.
To elaborate the scenario where updateShardId in clusterProcessPacket might be looking at an uninitialized shard-id:
- A node A in node N has uninitialized shard-id because it hasn't directly pinged N yet.
- Node B(which was a replica of another node) becomes a replica of node A and hence the sender claimed primary changed. This is where the shard-id of node A which is uninitialized would be associated with B.
So in step 2, we would ideally want node B to be updated to the shard-id of its primary only if it is initialized?
For clusterSetPrimary, it seems to be exercised during SET SLOT command, when replica migration is activated etc. But in those scenarios, I would expect the target node to be reachable and must have directly pinged this node and hence should have the shard_id initialized.
But, I do like your suggestion. We could stall the update until the shard_id of the target node is initialized. I'll make that change. So, there will be two flavors of updateShardId - one that takes the clusterNode reference and another that takes the shard_id string.
| @@ -1,3 +1,4 @@ | |||
| if {0} { | |||
There was a problem hiding this comment.
The if {0} statement will disable the entire first test block, making it unreachable. This appears to be leftover debugging code that should be removed.
| if (sender) { | ||
| clusterProcessGossipSection(hdr, link); | ||
| clusterProcessPingExtensions(hdr, link); | ||
| } |
There was a problem hiding this comment.
Discussed with @PingXie offline. We could simply not move this processing to earlier and instead not update slot configs until the shard_id is initialized. I'll work on making that change.
There was a problem hiding this comment.
I'm aligned with this suggestion as well. I don't want to open another pandora box by moving them around.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## unstable #2586 +/- ##
============================================
+ Coverage 72.21% 72.23% +0.02%
============================================
Files 127 127
Lines 70936 70948 +12
============================================
+ Hits 51227 51252 +25
+ Misses 19709 19696 -13
🚀 New features to boost your workflow:
|
See valkey-io#2261 for more details. Signed-off-by: Deepak Nandihalli <deepak.nandihalli@gmail.com>
… extensions Signed-off-by: Deepak Nandihalli <deepak.nandihalli@gmail.com>
e3dbfbc to
a67719f
Compare
Signed-off-by: Deepak Nandihalli <deepak.nandihalli@gmail.com>
…but not in nodes.conf Signed-off-by: Deepak Nandihalli <deepak.nandihalli@gmail.com>
|
@deepakrn can you fix the DCO by the clang formatting also seems off. |
| * from pre-7.2 releases */ | ||
| clusterRemoveNodeFromShard(myself); | ||
| memcpy(myself->shard_id, shard_id, CLUSTER_NAMELEN); | ||
| node->flags &= ~CLUSTER_NODE_SHARD_ID_UNINITIALIZED; |
There was a problem hiding this comment.
copilot got this wrong.
however, there is a different issue here. one of the call sites of updateShardId on line 3932 in clusterProcessPacket might be looking at an uninitialized shard-id in some core cases. I am also not super sure about clusterSetPrimary. I think the other two call sites, clusterProcessPingExtensions and clusterCommandSpecial should be good. so rather than take a guess here, I wonder if updateShortId should take a clusterNode* instead so it could check if the reference node's shard id is indeed initialized or not. for the case where we need to set the shard-id explicitly, we can just set the shard-id directly and clear the uninitialized flag, as opposed to calling updateShardId. thoughts?
Signed-off-by: Deepak Nandihalli <deepak.nandihalli@gmail.com>
477e7e9 to
8354ddd
Compare
Signed-off-by: Deepak Nandihalli <deepak.nandihalli@gmail.com>
…d shard-ids Signed-off-by: Deepak Nandihalli <deepak.nandihalli@gmail.com>
Signed-off-by: Deepak Nandihalli <deepak.nandihalli@gmail.com>
Signed-off-by: Deepak Nandihalli <deepak.nandihalli@gmail.com>
Signed-off-by: Deepak Nandihalli <deepak.nandihalli@gmail.com>
Signed-off-by: Deepak Nandihalli <deepak.nandihalli@gmail.com>
|
@murphyjacob4 Why is this a major decision? I can't see any new API or any breaking change here. It looks more like a bug fix to me. |
|
@zuiderkwast - My initial solution had a flag ( For cc - @murphyjacob4 |
| #define CLUSTER_NODE_EXTENSIONS_SUPPORTED (1 << 10) /* This node supports extensions. */ | ||
| #define CLUSTER_NODE_LIGHT_HDR_PUBLISH_SUPPORTED (1 << 11) /* This node supports light message header for publish type. */ | ||
| #define CLUSTER_NODE_LIGHT_HDR_MODULE_SUPPORTED (1 << 12) /* This node supports light message header for module type. */ | ||
| #define CLUSTER_NODE_SHARD_ID_UNINITIALIZED (1 << 13) /* This node currently has a random shard_id assigned. */ |
There was a problem hiding this comment.
I think CLUSTER_NODE_SHARD_ID_UNINITIALIZED should be a local state and we don't need to use the precious node flags space.
There was a problem hiding this comment.
Thanks for pointing this out, @PingXie . I have updated the changes to use a local flag.
cbbb318 to
334a39b
Compare
Signed-off-by: Deepak Nandihalli <deepak.nandihalli@gmail.com>
334a39b to
07c84f5
Compare
This PR tries to address issue #2261 where a certain order of events leads to two primaries in the same shard.
Fundamentally, the problem with the current state of the system is - if a node hasn't directly been PINGed by another node and has learnt about its presence via gossip, it gets associated with a random shard_id. This means that whenever, the other node with random shard_id advertises its slot, this node initially thinks that it lost all its slots to the other node from another shard and then processes the shard_id change. This leaves the node in a state where it is also a primary in the same shard. Similarly, when this node receives slot configs via an UPDATE message, the node won't know about shard_id until it receives a direct PING.
This indicates that, shard_id is a pre-requisite to processing any slot config updates from a node in the cluster. This change introduces a flag to track if the shard_id is initialized and if not, it allows the rest of the code to ignore certain updates until it has learnt about the shard_id. Introducing this new flag allows the code to be robust and extensible. Therefore, if we see any other scenarios in the future that needs shard_id to be set, this flag makes it possible.
Fixes #2261