Deflake cluster-shards "same shard id after restart" test#3949
Conversation
The "CLUSTER MYSHARDID reports same shard id after cluster restart" test runs in a 4x5 block that manually attaches an extra replica (R8) to a primary via CLUSTER REPLICATE, giving that primary two replicas. During the staggered restart, a primary becomes transiently orphaned and one of the two replicas legitimately migrates to it (the max_replicas >= 2 migration gate in cluster_legacy.c), adopting the new shard id. The assertion that the shard id is unchanged then fails on a correct behavior. Set cluster-allow-replica-migration no on this block so topology stays fixed across the restart. Replica migration is not what this test covers, and the manual CLUSTER REPLICATE attachment is unaffected. Same override is used in unit/cluster/cluster-migrateslots.tcl for the same purpose. Fixes valkey-io#3914 Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughA cluster restart test in ChangesCluster Restart Configuration
Estimated code review effort🎯 1 (Trivial) | ⏱️ ~3 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## unstable #3949 +/- ##
============================================
- Coverage 76.64% 76.62% -0.03%
============================================
Files 162 162
Lines 80733 80733
============================================
- Hits 61880 61860 -20
- Misses 18853 18873 +20 🚀 New features to boost your workflow:
|
This test runs
start_cluster 4 5and manually attaches an extra replica withCLUSTER REPLICATE, so one primary ends up with two replicas while the others have one. It then pauses and restarts all 8 nodes and asserts that every node's shard id is unchanged.This is why the test fails:
MIGRATE_TO, but currently has zero working replicas.cluster-migration-barrierreplicas after it leaves (default barrier is 1).During the staggered restart, a primary can be left with no reachable replica for a short window and become orphaned. Once the cluster returns to OK, the primary that has two replicas is the only one eligible to donate, so one of its replicas migrates to the orphaned primary and takes on its shard id.
The fix is to set
cluster-allow-replica-migration noon this block so replicas stay with their original primaries across the restart. Replica migration is not what this test is checking.Stress Test Passing 100 times: https://github.com/sarthakaggarwal97/valkey/actions/runs/27186417222/job/80256227970
Fixes #3914