fix: add CLUSTER SAVECONFIG before pause in shard restart tests#3904
Conversation
Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>
📝 WalkthroughWalkthroughThe PR adds explicit ChangesCluster restart test reliability
Estimated code review effort🎯 1 (Trivial) | ⏱️ ~3 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
21e75cf to
db19803
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
tests/unit/cluster/cluster-shards.tcl (1)
284-298: 💤 Low valueAdd a brief comment explaining why
cluster saveconfigis required here.The save-before-pause is the entire point of this fix, but the reason is non-obvious: under valgrind the
SIGSTOPfrompause_processcan arrive beforeclusterBeforeSleep()flushes the deferredCLUSTER_TODO_SAVE_CONFIGwrite, so the restarted node would otherwise read a stale/new random shard id. A one-line comment preserves that intent for future readers.As per coding guidelines: "Comment complex test scenarios to explain intent (test documentation)".📝 Suggested comment
for {set i 0} {$i < 8} {incr i 4} { dict set node_ids $i [R $i cluster myshardid] + # Force a synchronous nodes.conf flush before SIGSTOP; otherwise the + # deferred shard-id save may not reach disk before pause_process (esp. under valgrind). R $i cluster saveconfig pause_process [srv [expr -1*$i] pid] }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/cluster/cluster-shards.tcl` around lines 284 - 298, Add a one-line comment above the loop that calls "cluster saveconfig" explaining that the explicit save is required because under valgrind a SIGSTOP from pause_process can occur before clusterBeforeSleep() flushes the deferred CLUSTER_TODO_SAVE_CONFIG write; without this forced save the restarted node (after restart_server) may read a stale or random shard id. Mention the race with SIGSTOP/valgrind and reference cluster saveconfig, pause_process, clusterBeforeSleep() and CLUSTER_TODO_SAVE_CONFIG so future readers understand the intent.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@tests/unit/cluster/cluster-shards.tcl`:
- Around line 284-298: Add a one-line comment above the loop that calls "cluster
saveconfig" explaining that the explicit save is required because under valgrind
a SIGSTOP from pause_process can occur before clusterBeforeSleep() flushes the
deferred CLUSTER_TODO_SAVE_CONFIG write; without this forced save the restarted
node (after restart_server) may read a stale or random shard id. Mention the
race with SIGSTOP/valgrind and reference cluster saveconfig, pause_process,
clusterBeforeSleep() and CLUSTER_TODO_SAVE_CONFIG so future readers understand
the intent.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 8f126a5e-2a63-41d8-a4e0-d3193e0634a1
📒 Files selected for processing (1)
tests/unit/cluster/cluster-shards.tcl
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## unstable #3904 +/- ##
============================================
- Coverage 76.66% 76.62% -0.04%
============================================
Files 162 162
Lines 80717 80728 +11
============================================
- Hits 61878 61860 -18
- Misses 18839 18868 +29 🚀 New features to boost your workflow:
|
…#3904) ## Problem The shard ID restart tests in `cluster-shards.tcl` fail intermittently under valgrind because `pause_process` (SIGSTOP) can hit before the deferred config save is flushed to `nodes.conf`. When a replica learns its primary's shard ID via gossip, it updates `myself->shard_id` and sets `CLUSTER_TODO_SAVE_CONFIG` but the actual write only happens in `clusterBeforeSleep()` during the next event loop iteration. Under valgrind's slower execution, the test's SIGSTOP can arrive before that flush, so on restart the node gets a fresh random shard ID instead of the persisted one. ## Fix Add `CLUSTER SAVECONFIG` before pausing nodes in both restart tests, ensuring `nodes.conf` contains the current shard IDs on disk before SIGSTOP. This follows the same pattern used in `latency-monitor.tcl`. Fixes valkey-io#3883. Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com> Signed-off-by: Harkrishn Patro <h_patro@apple.com>
Fixes #3883
Problem
The shard ID restart tests in
cluster-shards.tclfail intermittently under valgrind becausepause_process(SIGSTOP) can hit before the deferred config save is flushed tonodes.conf.When a replica learns its primary's shard ID via gossip, it updates
myself->shard_idand setsCLUSTER_TODO_SAVE_CONFIGbut the actual write only happens inclusterBeforeSleep()during the next event loop iteration. Under valgrind's slower execution, the test's SIGSTOP can arrive before that flush, so on restart the node gets a fresh random shard ID instead of the persisted one.Fix
Add
CLUSTER SAVECONFIGbefore pausing nodes in both restart tests, ensuringnodes.confcontains the current shard IDs on disk before SIGSTOP. This follows the same pattern used inlatency-monitor.tcl.