Generate a new shard_id when the replica executes CLUSTER RESET SOFT by enjoy-binbin · Pull Request #2283 · valkey-io/valkey

enjoy-binbin · 2025-06-28T15:01:12Z

When a replica doing the CLUSTER RESET SOFT, we should generate a
new shard_id for it. Since if the node was a replica, it inherits
the shard_id of its primary, and after the CLUSTER RESET SOFT, the
replica become a new empty primary, it should has its own shard_id.

Otherwise, when it joins back into the cluster or responds the PONG,
we will have two primaries in the same shard.

When a replica doing the CLUSTER RESET SOFT, we should generate a new shard_id for it. Since if the node was a replica, it inherits the shard_id of its primary, and after the CLUSTER RESET SOFT, the replica become a new empty primary, it should has its own shard_id. Otherwise, when it joins back into the cluster or responds the PONG, we will have two primaries in the same shard. Signed-off-by: Binbin <binloveplay1314@qq.com>

Signed-off-by: Binbin <binloveplay1314@qq.com>

codecov · 2025-06-28T15:22:20Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 71.46%. Comparing base (0b0b813) to head (e8f2d6c).
⚠️ Report is 172 commits behind head on unstable.

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable    #2283      +/-   ##
============================================
- Coverage     71.52%   71.46%   -0.06%     
============================================
  Files           123      123              
  Lines         66926    67032     +106     
============================================
+ Hits          47868    47905      +37     
- Misses        19058    19127      +69

Files with missing lines	Coverage Δ
src/cluster_legacy.c	`86.76% <100.00%> (-0.14%)`	⬇️

... and 22 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

hpatro

Agree with the change around shard id generation. But would like the extension flag marking part left as is.

Signed-off-by: Binbin <binloveplay1314@qq.com>

enjoy-binbin · 2025-07-03T08:54:55Z

@hpatro and I discussed and decided to temporarily remove link->support_extension for now. After this merge, we will discuss it in another new PR.

The new link->extension is aim for this case:

Node B cluster reset soft, and remove all the nodes in its cluster_nodes_dict
Node A sends the PING, and Node B will reply a PONG even it does not know A, and in here, since Node B does not know Node A, it won't send the extension. But the link has mf_flags that we all know Node A can handle extension. So in these packets, Node A will know B change the role but not able to learn the shard-id

That is something like that. So if we don't have link->extension, Node B needs to wait until the cluster meets (I remember we have a compensation mechanism that will allow node B to eventually meet back into the cluster.) again before sending the extension

with link->support_extension time:

Testing unit/cluster/shardid-propagation
[ok]: The replica will have a new shard_id after cluster reset soft (65 ms)
[ok]: Check for memory leaks (pid 51984) (587 ms)
[ok]: Check for memory leaks (pid 51954) (312 ms)
[ok]: Check for memory leaks (pid 51924) (313 ms)
[ok]: Check for memory leaks (pid 51881) (293 ms)
[1/1 done]: unit/cluster/shardid-propagation (13 seconds)

without the link->support_extension

[ok]: The replica will have a new shard_id after cluster reset soft (3102 ms)
[ok]: Check for memory leaks (pid 54164) (301 ms)
[ok]: Check for memory leaks (pid 54127) (291 ms)
[ok]: Check for memory leaks (pid 54103) (294 ms)
[ok]: Check for memory leaks (pid 54050) (292 ms)
[1/1 done]: unit/cluster/shardid-propagation (16 seconds)

… faster during handshake (#2310) Add support_extension to cluster link, so that when myself receives a message from sender, even if myself does not know the sender, for example during handshake the sender node is still treated as random node, in thi case, the packet we reply to sender can also carry the extension, since the sender link explicitly indicates that it supports the extension, by doing this, we can speed up the spread of the extension. Also see #2283 and #2279 for more details.

When a cluster reset is performed on a replica node, a new shard ID is generated because the node is about to become an empty primary node, see valkey-io#2283. However, the log added in valkey-io#2510 caused some confusions. In clusterSetNodeAsPrimary we will print: ``` serverLog(LL_NOTICE, "Reconfiguring node %.40s (%s) as primary for shard %.40s", n->name, humanNodename(n), n->shard_id); ``` In clusterReset, we first call clusterSetNodeAsPrimary and then generate a new shard ID, which causes us to print an error shard ID log first. There is an exmaple, when a replica node performs a cluster reset, we will print: ``` xxx * Cluster reset (user request from 'xxx'). xxx * Reconfiguring node af76a3e0ffcd77bd14fa47ce4d07ab2bdc78702f (xxx) as primary for shard ea528667634af8beed83adac2b9af8360769a1b4 ``` But the node shard id is actually: ``` xxx> cluster myshardid "52ede26d1554dd203161ba09011af14574b2cc84" ``` Now after a new shard ID is generated we will print a log, and we also move the call to clusterSetNodeAsPrimary after the new shard id, so that we can have the right one. After this PR: ``` xxx * Cluster reset (user request from 'xxx'). xxx * Moving myself to a new shard bd31870ce73f5977084e6a46e337a4a1ad38fc66. xxx * Reconfiguring node 1d54b904efd30cd9d7d1abbfd63c8fafbb62e1c8 (xxx) as primary for shard bd31870ce73f5977084e6a46e337a4a1ad38fc66 ``` This is part of valkey-io#2989, but i guess we won't merge the extension fix in a short time, so i am gonna extracting it separately as a log fix (or improvement). Signed-off-by: Binbin <binloveplay1314@qq.com>

When a cluster reset is performed on a replica node, a new shard ID is generated because the node is about to become an empty primary node, see #2283. However, the log added in #2510 caused some confusions. In clusterSetNodeAsPrimary we will print: ``` serverLog(LL_NOTICE, "Reconfiguring node %.40s (%s) as primary for shard %.40s", n->name, humanNodename(n), n->shard_id); ``` In clusterReset, we first call clusterSetNodeAsPrimary and then generate a new shard ID, which causes us to print an error shard ID log first. There is an exmaple, when a replica node performs a cluster reset, we will print: ``` xxx * Cluster reset (user request from 'xxx'). xxx * Reconfiguring node af76a3e0ffcd77bd14fa47ce4d07ab2bdc78702f (xxx) as primary for shard ea528667634af8beed83adac2b9af8360769a1b4 ``` But the node shard id is actually: ``` xxx> cluster myshardid "52ede26d1554dd203161ba09011af14574b2cc84" ``` Now after a new shard ID is generated we will print a log, and we also move the call to clusterSetNodeAsPrimary after the new shard id, so that we can have the right one. After this PR: ``` xxx * Cluster reset (user request from 'xxx'). xxx * Moving myself to a new shard bd31870ce73f5977084e6a46e337a4a1ad38fc66. xxx * Reconfiguring node 1d54b904efd30cd9d7d1abbfd63c8fafbb62e1c8 (xxx) as primary for shard bd31870ce73f5977084e6a46e337a4a1ad38fc66 ``` This is part of #2989, but i guess we won't merge the extension fix in a short time, so i am gonna extracting it separately as a log fix (or improvement). Signed-off-by: Binbin <binloveplay1314@qq.com>

…3192) When a cluster reset is performed on a replica node, a new shard ID is generated because the node is about to become an empty primary node, see valkey-io#2283. However, the log added in valkey-io#2510 caused some confusions. In clusterSetNodeAsPrimary we will print: ``` serverLog(LL_NOTICE, "Reconfiguring node %.40s (%s) as primary for shard %.40s", n->name, humanNodename(n), n->shard_id); ``` In clusterReset, we first call clusterSetNodeAsPrimary and then generate a new shard ID, which causes us to print an error shard ID log first. There is an exmaple, when a replica node performs a cluster reset, we will print: ``` xxx * Cluster reset (user request from 'xxx'). xxx * Reconfiguring node af76a3e0ffcd77bd14fa47ce4d07ab2bdc78702f (xxx) as primary for shard ea528667634af8beed83adac2b9af8360769a1b4 ``` But the node shard id is actually: ``` xxx> cluster myshardid "52ede26d1554dd203161ba09011af14574b2cc84" ``` Now after a new shard ID is generated we will print a log, and we also move the call to clusterSetNodeAsPrimary after the new shard id, so that we can have the right one. After this PR: ``` xxx * Cluster reset (user request from 'xxx'). xxx * Moving myself to a new shard bd31870ce73f5977084e6a46e337a4a1ad38fc66. xxx * Reconfiguring node 1d54b904efd30cd9d7d1abbfd63c8fafbb62e1c8 (xxx) as primary for shard bd31870ce73f5977084e6a46e337a4a1ad38fc66 ``` This is part of valkey-io#2989, but i guess we won't merge the extension fix in a short time, so i am gonna extracting it separately as a log fix (or improvement). Signed-off-by: Binbin <binloveplay1314@qq.com> Signed-off-by: Harkrishn Patro <bunty.hari@gmail.com>

enjoy-binbin requested a review from hpatro June 28, 2025 15:01

enjoy-binbin mentioned this pull request Jun 28, 2025

The smaller config epoch primary will become the replica when two primaries in the same shard #2279

Open

enjoy-binbin added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Jun 28, 2025

enjoy-binbin commented Jun 28, 2025

View reviewed changes

Comment thread src/cluster_legacy.c Outdated

Update src/cluster_legacy.c

6c9c3c9

Signed-off-by: Binbin <binloveplay1314@qq.com>

hpatro reviewed Jun 30, 2025

View reviewed changes

Comment thread src/cluster_legacy.h Outdated

Comment thread src/cluster_legacy.c

Add propagate extension support back

611a260

Signed-off-by: Binbin <binloveplay1314@qq.com>

enjoy-binbin commented Jul 2, 2025

View reviewed changes

Comment thread src/cluster_legacy.c Outdated

enjoy-binbin added 3 commits July 2, 2025 17:35

Fix format

c0a2861

Signed-off-by: Binbin <binloveplay1314@qq.com>

Drop link->support_extension, will move it to another PR

80de51a

Signed-off-by: Binbin <binloveplay1314@qq.com>

Drop the comment

e8f2d6c

Signed-off-by: Binbin <binloveplay1314@qq.com>

hpatro approved these changes Jul 3, 2025

View reviewed changes

hpatro merged commit e279b00 into valkey-io:unstable Jul 3, 2025
60 of 61 checks passed

enjoy-binbin deleted the replica_reset_shardid branch July 4, 2025 01:54

enjoy-binbin added this to Valkey 9.0 Jul 4, 2025

enjoy-binbin added the release-notes This issue should get a line item in the release notes label Jul 4, 2025

enjoy-binbin moved this to Done in Valkey 9.0 Jul 4, 2025

enjoy-binbin mentioned this pull request Jul 5, 2025

Add extensions supported to cluster link layer to propagate extension faster during handshake #2310

Merged

enjoy-binbin added the cluster label Sep 19, 2025

enjoy-binbin mentioned this pull request Feb 12, 2026

Logging fix or improvement around new shard ID generation #3192

Merged

zuiderkwast mentioned this pull request Jun 2, 2026

Enhance cluster stale packet detection to prevent sub-replica and empty primary #2811

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Generate a new shard_id when the replica executes CLUSTER RESET SOFT#2283

Generate a new shard_id when the replica executes CLUSTER RESET SOFT#2283
hpatro merged 6 commits into
valkey-io:unstablefrom
enjoy-binbin:replica_reset_shardid

enjoy-binbin commented Jun 28, 2025 •

edited

Loading

Uh oh!

Uh oh!

codecov Bot commented Jun 28, 2025 •

edited

Loading

Uh oh!

hpatro left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

enjoy-binbin commented Jul 3, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

enjoy-binbin commented Jun 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

codecov Bot commented Jun 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

hpatro left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

enjoy-binbin commented Jul 3, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

enjoy-binbin commented Jun 28, 2025 •

edited

Loading

codecov Bot commented Jun 28, 2025 •

edited

Loading