Skip to content

Converge divergent shard-id persisted in nodes.conf to primary's shard id#2174

Merged
hpatro merged 4 commits into
valkey-io:unstablefrom
hpatro:handle-divergent-shard-id-nodes-conf
Jun 20, 2025
Merged

Converge divergent shard-id persisted in nodes.conf to primary's shard id#2174
hpatro merged 4 commits into
valkey-io:unstablefrom
hpatro:handle-divergent-shard-id-nodes-conf

Conversation

@hpatro

@hpatro hpatro commented Jun 4, 2025

Copy link
Copy Markdown
Contributor

Fixes #2171

Handle divergent shard-id across primary and replica from nodes.conf and reconcile all the nodes in the shard to the primary node's shard-id.

@hpatro hpatro requested a review from PingXie June 4, 2025 20:25
@codecov

codecov Bot commented Jun 4, 2025

Copy link
Copy Markdown

Codecov Report

Attention: Patch coverage is 0% with 8 lines in your changes missing coverage. Please review.

Project coverage is 73.57%. Comparing base (3789b29) to head (8f658fe).
Report is 14 commits behind head on unstable.

Files with missing lines Patch % Lines
src/cluster_legacy.c 0.00% 8 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2174      +/-   ##
============================================
+ Coverage     71.43%   73.57%   +2.13%     
============================================
  Files           122      122              
  Lines         66210    73777    +7567     
============================================
+ Hits          47300    54281    +6981     
- Misses        18910    19496     +586     
Files with missing lines Coverage Δ
src/cluster_legacy.c 89.50% <0.00%> (+2.77%) ⬆️

... and 32 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Comment thread src/cluster_legacy.c
hpatro added 2 commits June 10, 2025 20:26
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
@hpatro hpatro force-pushed the handle-divergent-shard-id-nodes-conf branch from 5c52a9d to 8f658fe Compare June 10, 2025 20:27
@hpatro

hpatro commented Jun 11, 2025

Copy link
Copy Markdown
Contributor Author

@martinrvisser I tried few scenarios manually, they seem to run fine. Could you try running this patch and see if it helps ?

@hpatro

hpatro commented Jun 11, 2025

Copy link
Copy Markdown
Contributor Author

Test scenarios:

43ee1cacd6948ee96bb367eb8795e62e8d153f05 10.125.153.181:0@28007,,tls-port=18007,shard-id=f91532eb722943e035f34292cf586f3f750d65bd myself,master - 0 1749488968682 13 connected 5461-16383
8d89f4d4e7c57a2819277732f86213241c3ec0d3 10.124.45.182:0@28007,,tls-port=18007,shard-id=a91532eb722943e035f34292cf586f3f750d65bd slave 43ee1cacd6948ee96bb367eb8795e62e8d153f05 0 1749488968682 13 connected
vars currentEpoch 13 lastVoteEpoch 9
8d89f4d4e7c57a2819277732f86213241c3ec0d3 10.124.45.182:0@28007,,tls-port=18007,shard-id=a91532eb722943e035f34292cf586f3f750d65bd slave 43ee1cacd6948ee96bb367eb8795e62e8d153f05 0 1749488968682 13 connected
43ee1cacd6948ee96bb367eb8795e62e8d153f05 10.125.153.181:0@28007,,tls-port=18007,shard-id=f91532eb722943e035f34292cf586f3f750d65bd myself,master - 0 1749488968682 13 connected 5461-16383
vars currentEpoch 13 lastVoteEpoch 9
43ee1cacd6948ee96bb367eb8795e62e8d153f05 10.125.153.181:0@28007,,tls-port=18007,shard-id=f91532eb722943e035f34292cf586f3f750d65bd master - 0 1749488968682 13 connected 5461-16383
8d89f4d4e7c57a2819277732f86213241c3ec0d3 10.124.45.182:0@28007,,tls-port=18007,shard-id=a91532eb722943e035f34292cf586f3f750d65bd myself,slave 43ee1cacd6948ee96bb367eb8795e62e8d153f05 0 1749488968682 13 connected
vars currentEpoch 13 lastVoteEpoch 9
8d89f4d4e7c57a2819277732f86213241c3ec0d3 10.124.45.182:0@28007,,tls-port=18007,shard-id=a91532eb722943e035f34292cf586f3f750d65bd myself,slave 43ee1cacd6948ee96bb367eb8795e62e8d153f05 0 1749488968682 13 connected
43ee1cacd6948ee96bb367eb8795e62e8d153f05 10.125.153.181:0@28007,,tls-port=18007,shard-id=f91532eb722943e035f34292cf586f3f750d65bd master - 0 1749488968682 13 connected 5461-16383
vars currentEpoch 13 lastVoteEpoch 9

All of these load fine.

@martinrvisser

Copy link
Copy Markdown
Contributor

@martinrvisser I tried few scenarios manually, they seem to run fine. Could you try running this patch and see if it helps ?

@hpatro fine in my tests also, nice work
2804808:M 13 Jun 2025 12:13:25.599 * Node 386bac3e5337540f2bce4181ff8980f33a989667 has a different shard id (4009ad6c27a3fe91b709a6502bd111b5d58eada0) than its primary's shard id bb85c3d913bb6f9e3d02e68ae89b5c5e61111fb4 (f713a7d0f3079e30fcc47348260a09de0c9b8b72). Updating replica's shard id to match primary's shard id.
2804808:M 13 Jun 2025 12:13:25.599 * Node configuration loaded, I'm 386bac3e5337540f2bce4181ff8980f33a989667

Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>

@enjoy-binbin enjoy-binbin left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This look correct, please also update the top comment to desc it. So it is somehow like a corrupted (or a outupdated) nodes.conf cause the trouble?

@hpatro

hpatro commented Jun 18, 2025

Copy link
Copy Markdown
Contributor Author

This look correct, please also update the top comment to desc it. So it is somehow like a corrupted (or a outupdated) nodes.conf cause the trouble?

Yes. Updated.

@enjoy-binbin enjoy-binbin left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@hpatro hpatro added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Jun 20, 2025
@hpatro hpatro merged commit 4a3e713 into valkey-io:unstable Jun 20, 2025
81 checks passed
stockholmux pushed a commit to stockholmux/valkey that referenced this pull request Aug 13, 2025
Signed-off-by: Kyle J. Davis <kyledvs@amazon.com>
@hpatro hpatro added the release-notes This issue should get a line item in the release notes label Aug 13, 2025
@zuiderkwast zuiderkwast moved this to 8.0.5 (to be released) in Valkey 8.0 Aug 21, 2025
@zuiderkwast zuiderkwast moved this to To be backported in Valkey 8.1 Aug 21, 2025
@zuiderkwast zuiderkwast moved this to To be backported in Valkey 7.2 Aug 21, 2025
zuiderkwast pushed a commit to vitarb/valkey that referenced this pull request Aug 21, 2025
…d id (valkey-io#2174)

Fixes valkey-io#2171

Handle divergent shard-id across primary and replica from nodes.conf and
reconcile all the nodes in the shard to the primary node's shard-id.

---------

Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
zuiderkwast pushed a commit that referenced this pull request Aug 22, 2025
…d id (#2174)

Fixes #2171

Handle divergent shard-id across primary and replica from nodes.conf and
reconcile all the nodes in the shard to the primary node's shard-id.

---------

Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
sarthakaggarwal97 pushed a commit to sarthakaggarwal97/valkey that referenced this pull request Sep 16, 2025
…d id (valkey-io#2174)

Fixes valkey-io#2171

Handle divergent shard-id across primary and replica from nodes.conf and
reconcile all the nodes in the shard to the primary node's shard-id.

---------

Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
@ranshid ranshid moved this from To be backported to In Progress in Valkey 8.1 Sep 30, 2025
ranshid pushed a commit to ranshid/valkey that referenced this pull request Sep 30, 2025
…d id (valkey-io#2174)

Fixes valkey-io#2171

Handle divergent shard-id across primary and replica from nodes.conf and
reconcile all the nodes in the shard to the primary node's shard-id.

---------

Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
@ranshid ranshid moved this from In Progress to To be backported in Valkey 8.1 Sep 30, 2025
ranshid pushed a commit to ranshid/valkey that referenced this pull request Sep 30, 2025
…d id (valkey-io#2174)

Fixes valkey-io#2171

Handle divergent shard-id across primary and replica from nodes.conf and
reconcile all the nodes in the shard to the primary node's shard-id.

---------

Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
ranshid pushed a commit to ranshid/valkey that referenced this pull request Sep 30, 2025
…d id (valkey-io#2174)

Fixes valkey-io#2171

Handle divergent shard-id across primary and replica from nodes.conf and
reconcile all the nodes in the shard to the primary node's shard-id.

---------

Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
@zuiderkwast

Copy link
Copy Markdown
Contributor

I tried backporting this one to 7.2 but I couldn't get the test case to pass, so I'm skipping it for now. If anyone wants to help backporting it, feel free to open a PR targeting the 7.2 branch. Some difficulties to note:

  • src/cluster_legacy.c doesn't exist in 7.2. It's still called src/cluster.c here.
  • Variables for primaries and replicas are using the old names masters and slaves.

zuiderkwast pushed a commit that referenced this pull request Oct 1, 2025
…d id (#2174)

Fixes #2171

Handle divergent shard-id across primary and replica from nodes.conf and
reconcile all the nodes in the shard to the primary node's shard-id.

---------

Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
@zuiderkwast zuiderkwast moved this from To be backported to 8.1.4 in Valkey 8.1 Oct 1, 2025
roshkhatri added a commit to roshkhatri/valkey that referenced this pull request Feb 18, 2026
… id (valkey-io#2174)

Handle divergent shard-id across master and replica from nodes.conf and
reconcile all the nodes in the shard to the master node's shard-id.

Fixes valkey-io#2171

(cherry picked from commit 4a3e713)

Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
roshkhatri added a commit to roshkhatri/valkey that referenced this pull request Feb 18, 2026
… id (valkey-io#2174)

Handle divergent shard-id across master and replica from nodes.conf and
reconcile all the nodes in the shard to the master node's shard-id.

Fixes valkey-io#2171

(cherry picked from commit 4a3e713)

Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
roshkhatri added a commit to roshkhatri/valkey that referenced this pull request Feb 18, 2026
… id (valkey-io#2174)

Handle divergent shard-id across master and replica from nodes.conf and
reconcile all the nodes in the shard to the master node's shard-id.

Fixes valkey-io#2171

(cherry picked from commit 4a3e713)

Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
@roshkhatri

Copy link
Copy Markdown
Member

I tried backporting this one to 7.2 but I couldn't get the test case to pass, so I'm skipping it for now. If anyone wants to help backporting it, feel free to open a PR targeting the 7.2 branch. Some difficulties to note:
src/cluster_legacy.c doesn't exist in 7.2. It's still called src/cluster.c here.
Variables for primaries and replicas are using the old names masters and slaves.

@zuiderkwast opened a PR for this here: #3229

zuiderkwast pushed a commit that referenced this pull request Feb 19, 2026
#3229)

Backporting #2174 to 7.2 branch

Handle divergent shard-id across master and replica from nodes.conf and
reconcile all the nodes in the shard to the master node's shard-id.

Fixes #2171

(cherry picked from commit 4a3e713)

Signed-off-by: Harkrishn Patro <harkrisp@amazon.com>
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release-notes This issue should get a line item in the release notes run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP)

Projects

Status: To be backported
Status: 8.0.5
Status: 8.1.4

Development

Successfully merging this pull request may close these issues.

[BUG] "Unrecoverable error: corrupted cluster config file" due to incorrect shard-id

8 participants