Skip to content

Deflake cluster-shards "same shard id after restart" test#3949

Merged
enjoy-binbin merged 1 commit into
valkey-io:unstablefrom
sarthakaggarwal97:deflake-3914-cluster-shards
Jun 10, 2026
Merged

Deflake cluster-shards "same shard id after restart" test#3949
enjoy-binbin merged 1 commit into
valkey-io:unstablefrom
sarthakaggarwal97:deflake-3914-cluster-shards

Conversation

@sarthakaggarwal97

Copy link
Copy Markdown
Contributor

This test runs start_cluster 4 5 and manually attaches an extra replica with CLUSTER REPLICATE, so one primary ends up with two replicas while the others have one. It then pauses and restarts all 8 nodes and asserts that every node's shard id is unchanged.

This is why the test fails:

  • A primary is termed "orphaned" when it owns slots and is flagged MIGRATE_TO, but currently has zero working replicas.
  • When a primary is orphaned, the cluster moves a spare replica to it. A replica is only spare if its current primary would still have more than cluster-migration-barrier replicas after it leaves (default barrier is 1).
  • A migrating replica adopts the target primary's shard id, so its shard id changes.

During the staggered restart, a primary can be left with no reachable replica for a short window and become orphaned. Once the cluster returns to OK, the primary that has two replicas is the only one eligible to donate, so one of its replicas migrates to the orphaned primary and takes on its shard id.

The fix is to set cluster-allow-replica-migration no on this block so replicas stay with their original primaries across the restart. Replica migration is not what this test is checking.

Stress Test Passing 100 times: https://github.com/sarthakaggarwal97/valkey/actions/runs/27186417222/job/80256227970

Fixes #3914

The "CLUSTER MYSHARDID reports same shard id after cluster restart" test
runs in a 4x5 block that manually attaches an extra replica (R8) to a
primary via CLUSTER REPLICATE, giving that primary two replicas. During
the staggered restart, a primary becomes transiently orphaned and one of
the two replicas legitimately migrates to it (the max_replicas >= 2
migration gate in cluster_legacy.c), adopting the new shard id. The
assertion that the shard id is unchanged then fails on a correct behavior.

Set cluster-allow-replica-migration no on this block so topology stays
fixed across the restart. Replica migration is not what this test covers,
and the manual CLUSTER REPLICATE attachment is unaffected. Same override
is used in unit/cluster/cluster-migrateslots.tcl for the same purpose.

Fixes valkey-io#3914

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 1fa26ec7-b7b1-4bd0-b81d-baed3e35e977

📥 Commits

Reviewing files that changed from the base of the PR and between d0ffbab and 19886b5.

📒 Files selected for processing (1)
  • tests/unit/cluster/cluster-shards.tcl

📝 Walkthrough

Walkthrough

A cluster restart test in tests/unit/cluster/cluster-shards.tcl is updated to disable automatic replica migration during initialization. Comments clarify that replica R8 is manually attached to prevent unintended shard-id changes caused by migration semantics.

Changes

Cluster Restart Configuration

Layer / File(s) Summary
Disable automatic replica migration in cluster restart test
tests/unit/cluster/cluster-shards.tcl
Cluster configuration now passes cluster-allow-replica-migration no override to start_cluster 4 5. Comments note that R8 replica is manually attached during restart to avoid shard-id changes from automatic migration.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Possibly related PRs

  • valkey-io/valkey#3904: Both PRs modify tests/unit/cluster/cluster-shards.tcl restart-test setup to keep MYSHARDID stable across restarts—one by disabling replica migration during start_cluster, the other by adding cluster saveconfig before pausing nodes.

Suggested reviewers

  • zuiderkwast
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly describes the main change: deflaking a specific test by addressing a flakiness issue related to shard IDs after cluster restart.
Description check ✅ Passed The description thoroughly explains the test failure, root cause, and the implemented fix related to replica migration, directly addressing the changeset.
Linked Issues check ✅ Passed The PR fully addresses issue #3914 by disabling replica migration during the test to prevent shard ID changes caused by transient orphaning during restart.
Out of Scope Changes check ✅ Passed All changes are scoped to the failing test, adding a configuration override to disable replica migration and explanatory comments, with no unrelated modifications.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@codecov

codecov Bot commented Jun 9, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.62%. Comparing base (d9a80af) to head (19886b5).
⚠️ Report is 2 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #3949      +/-   ##
============================================
- Coverage     76.64%   76.62%   -0.03%     
============================================
  Files           162      162              
  Lines         80733    80733              
============================================
- Hits          61880    61860      -20     
- Misses        18853    18873      +20     

see 24 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@enjoy-binbin enjoy-binbin merged commit 948aaf8 into valkey-io:unstable Jun 10, 2026
64 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[TEST-FAILURE] CLUSTER MYSHARDID reports same shard id after cluster restart in tests/unit/cluster/cluster-shards.tcl

2 participants