Deflake cluster-shards "same shard id after restart" test by sarthakaggarwal97 · Pull Request #3949 · valkey-io/valkey

sarthakaggarwal97 · 2026-06-09T17:40:43Z

This test runs start_cluster 4 5 and manually attaches an extra replica with CLUSTER REPLICATE, so one primary ends up with two replicas while the others have one. It then pauses and restarts all 8 nodes and asserts that every node's shard id is unchanged.

This is why the test fails:

A primary is termed "orphaned" when it owns slots and is flagged MIGRATE_TO, but currently has zero working replicas.
When a primary is orphaned, the cluster moves a spare replica to it. A replica is only spare if its current primary would still have more than cluster-migration-barrier replicas after it leaves (default barrier is 1).
A migrating replica adopts the target primary's shard id, so its shard id changes.

During the staggered restart, a primary can be left with no reachable replica for a short window and become orphaned. Once the cluster returns to OK, the primary that has two replicas is the only one eligible to donate, so one of its replicas migrates to the orphaned primary and takes on its shard id.

The fix is to set cluster-allow-replica-migration no on this block so replicas stay with their original primaries across the restart. Replica migration is not what this test is checking.

Stress Test Passing 100 times: https://github.com/sarthakaggarwal97/valkey/actions/runs/27186417222/job/80256227970

Fixes #3914

The "CLUSTER MYSHARDID reports same shard id after cluster restart" test runs in a 4x5 block that manually attaches an extra replica (R8) to a primary via CLUSTER REPLICATE, giving that primary two replicas. During the staggered restart, a primary becomes transiently orphaned and one of the two replicas legitimately migrates to it (the max_replicas >= 2 migration gate in cluster_legacy.c), adopting the new shard id. The assertion that the shard id is unchanged then fails on a correct behavior. Set cluster-allow-replica-migration no on this block so topology stays fixed across the restart. Replica migration is not what this test covers, and the manual CLUSTER REPLICATE attachment is unaffected. Same override is used in unit/cluster/cluster-migrateslots.tcl for the same purpose. Fixes valkey-io#3914 Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>

coderabbitai · 2026-06-09T17:40:58Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 1fa26ec7-b7b1-4bd0-b81d-baed3e35e977

📥 Commits

Reviewing files that changed from the base of the PR and between d0ffbab and 19886b5.

📒 Files selected for processing (1)

tests/unit/cluster/cluster-shards.tcl

📝 Walkthrough

Walkthrough

A cluster restart test in tests/unit/cluster/cluster-shards.tcl is updated to disable automatic replica migration during initialization. Comments clarify that replica R8 is manually attached to prevent unintended shard-id changes caused by migration semantics.

Changes

Cluster Restart Configuration

Layer / File(s)	Summary
Disable automatic replica migration in cluster restart test `tests/unit/cluster/cluster-shards.tcl`	Cluster configuration now passes `cluster-allow-replica-migration no` override to `start_cluster 4 5`. Comments note that R8 replica is manually attached during restart to avoid shard-id changes from automatic migration.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Possibly related PRs

valkey-io/valkey#3904: Both PRs modify tests/unit/cluster/cluster-shards.tcl restart-test setup to keep MYSHARDID stable across restarts—one by disabling replica migration during start_cluster, the other by adding cluster saveconfig before pausing nodes.

Suggested reviewers

zuiderkwast

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly describes the main change: deflaking a specific test by addressing a flakiness issue related to shard IDs after cluster restart.
Description check	✅ Passed	The description thoroughly explains the test failure, root cause, and the implemented fix related to replica migration, directly addressing the changeset.
Linked Issues check	✅ Passed	The PR fully addresses issue `#3914` by disabling replica migration during the test to prevent shard ID changes caused by transient orphaning during restart.
Out of Scope Changes check	✅ Passed	All changes are scoped to the failing test, adding a configuration override to disable replica migration and explanatory comments, with no unrelated modifications.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

codecov · 2026-06-09T18:09:02Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.62%. Comparing base (d9a80af) to head (19886b5).
⚠️ Report is 2 commits behind head on unstable.

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable    #3949      +/-   ##
============================================
- Coverage     76.64%   76.62%   -0.03%     
============================================
  Files           162      162              
  Lines         80733    80733              
============================================
- Hits          61880    61860      -20     
- Misses        18853    18873      +20

see 24 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

github-actions Bot assigned sarthakaggarwal97 Jun 9, 2026

sarthakaggarwal97 requested a review from enjoy-binbin June 9, 2026 17:40

enjoy-binbin approved these changes Jun 10, 2026

View reviewed changes

enjoy-binbin merged commit 948aaf8 into valkey-io:unstable Jun 10, 2026
64 checks passed

ranshid mentioned this pull request Jun 11, 2026

[backport] Backport sweep for 9.1 #3774

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Deflake cluster-shards "same shard id after restart" test#3949

Deflake cluster-shards "same shard id after restart" test#3949
enjoy-binbin merged 1 commit into
valkey-io:unstablefrom
sarthakaggarwal97:deflake-3914-cluster-shards

sarthakaggarwal97 commented Jun 9, 2026

Uh oh!

coderabbitai Bot commented Jun 9, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Uh oh!

codecov Bot commented Jun 9, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

sarthakaggarwal97 commented Jun 9, 2026

Uh oh!

coderabbitai Bot commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

Uh oh!

codecov Bot commented Jun 9, 2026

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai Bot commented Jun 9, 2026 •

edited

Loading