fix: add CLUSTER SAVECONFIG before pause in shard restart tests by jjuleslasarte · Pull Request #3904 · valkey-io/valkey

jjuleslasarte · 2026-06-02T23:09:07Z

Problem

The shard ID restart tests in cluster-shards.tcl fail intermittently under valgrind because pause_process (SIGSTOP) can hit before the deferred config save is flushed to nodes.conf.

When a replica learns its primary's shard ID via gossip, it updates myself->shard_id and sets CLUSTER_TODO_SAVE_CONFIG but the actual write only happens in clusterBeforeSleep() during the next event loop iteration. Under valgrind's slower execution, the test's SIGSTOP can arrive before that flush, so on restart the node gets a fresh random shard ID instead of the persisted one.

Fix

Add CLUSTER SAVECONFIG before pausing nodes in both restart tests, ensuring nodes.conf contains the current shard IDs on disk before SIGSTOP. This follows the same pattern used in latency-monitor.tcl.

Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>

coderabbitai · 2026-06-02T23:09:33Z

📝 Walkthrough

Walkthrough

The PR adds explicit cluster saveconfig calls in two cluster restart test scenarios to ensure shard IDs persist consistently. Before restarting or pausing cluster nodes, the tests now save configuration state, allowing CLUSTER MYSHARDID assertions to validate against persisted data.

Changes

Cluster restart test reliability

Layer / File(s)	Summary
Config persistence in restart test checkpoints `tests/unit/cluster/cluster-shards.tcl`	Two test loops now invoke `cluster saveconfig` before pausing and restarting nodes in shard restart and full cluster restart scenarios.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Suggested reviewers

zuiderkwast

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and specifically describes the main change: adding CLUSTER SAVECONFIG calls before pause operations in shard restart tests.
Description check	✅ Passed	The description is directly related to the changeset, explaining the problem (race condition with config save under valgrind) and the fix (adding CLUSTER SAVECONFIG before pausing).
Linked Issues check	✅ Passed	The changes directly address issue `#3883` by adding CLUSTER SAVECONFIG before pause operations to ensure shard IDs persist to nodes.conf before process suspension.
Out of Scope Changes check	✅ Passed	All changes are scoped to the cluster-shards.tcl test file and directly address the race condition identified in issue `#3883`, with no extraneous modifications.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (1)

tests/unit/cluster/cluster-shards.tcl (1)
284-298: 💤 Low value

Add a brief comment explaining why cluster saveconfig is required here.

The save-before-pause is the entire point of this fix, but the reason is non-obvious: under valgrind the SIGSTOP from pause_process can arrive before clusterBeforeSleep() flushes the deferred CLUSTER_TODO_SAVE_CONFIG write, so the restarted node would otherwise read a stale/new random shard id. A one-line comment preserves that intent for future readers.
📝 Suggested comment
     for {set i 0} {$i < 8} {incr i 4} {
         dict set node_ids $i [R $i cluster myshardid]
+        # Force a synchronous nodes.conf flush before SIGSTOP; otherwise the
+        # deferred shard-id save may not reach disk before pause_process (esp. under valgrind).
         R $i cluster saveconfig
         pause_process [srv [expr -1*$i] pid]
     }
As per coding guidelines: "Comment complex test scenarios to explain intent (test documentation)".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/cluster/cluster-shards.tcl` around lines 284 - 298, Add a one-line
comment above the loop that calls "cluster saveconfig" explaining that the
explicit save is required because under valgrind a SIGSTOP from pause_process
can occur before clusterBeforeSleep() flushes the deferred
CLUSTER_TODO_SAVE_CONFIG write; without this forced save the restarted node
(after restart_server) may read a stale or random shard id. Mention the race
with SIGSTOP/valgrind and reference cluster saveconfig, pause_process,
clusterBeforeSleep() and CLUSTER_TODO_SAVE_CONFIG so future readers understand
the intent.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@tests/unit/cluster/cluster-shards.tcl`:
- Around line 284-298: Add a one-line comment above the loop that calls "cluster
saveconfig" explaining that the explicit save is required because under valgrind
a SIGSTOP from pause_process can occur before clusterBeforeSleep() flushes the
deferred CLUSTER_TODO_SAVE_CONFIG write; without this forced save the restarted
node (after restart_server) may read a stale or random shard id. Mention the
race with SIGSTOP/valgrind and reference cluster saveconfig, pause_process,
clusterBeforeSleep() and CLUSTER_TODO_SAVE_CONFIG so future readers understand
the intent.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 8f126a5e-2a63-41d8-a4e0-d3193e0634a1

📥 Commits

Reviewing files that changed from the base of the PR and between 3cfaf57 and db19803.

📒 Files selected for processing (1)

tests/unit/cluster/cluster-shards.tcl

codecov · 2026-06-02T23:43:43Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.62%. Comparing base (c15ecf1) to head (db19803).
⚠️ Report is 2 commits behind head on unstable.

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable    #3904      +/-   ##
============================================
- Coverage     76.66%   76.62%   -0.04%     
============================================
  Files           162      162              
  Lines         80717    80728      +11     
============================================
- Hits          61878    61860      -18     
- Misses        18839    18868      +29

see 28 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

…#3904) ## Problem The shard ID restart tests in `cluster-shards.tcl` fail intermittently under valgrind because `pause_process` (SIGSTOP) can hit before the deferred config save is flushed to `nodes.conf`. When a replica learns its primary's shard ID via gossip, it updates `myself->shard_id` and sets `CLUSTER_TODO_SAVE_CONFIG` but the actual write only happens in `clusterBeforeSleep()` during the next event loop iteration. Under valgrind's slower execution, the test's SIGSTOP can arrive before that flush, so on restart the node gets a fresh random shard ID instead of the persisted one. ## Fix Add `CLUSTER SAVECONFIG` before pausing nodes in both restart tests, ensuring `nodes.conf` contains the current shard IDs on disk before SIGSTOP. This follows the same pattern used in `latency-monitor.tcl`. Fixes valkey-io#3883. Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com> Signed-off-by: Harkrishn Patro <h_patro@apple.com>

fix: add CLUSTER SAVECONFIG before pause in shard restart tests

db19803

Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>

jjuleslasarte force-pushed the fix/cluster-shards-saveconfig-race branch from 21e75cf to db19803 Compare June 2, 2026 23:09

coderabbitai Bot reviewed Jun 2, 2026

View reviewed changes

github-actions Bot assigned jjuleslasarte Jun 2, 2026

enjoy-binbin approved these changes Jun 4, 2026

View reviewed changes

enjoy-binbin merged commit 3100f9d into valkey-io:unstable Jun 4, 2026
62 of 63 checks passed

coderabbitai Bot mentioned this pull request Jun 9, 2026

Deflake cluster-shards "same shard id after restart" test #3949

Merged

ranshid mentioned this pull request Jun 11, 2026

[backport] Backport sweep for 9.1 #3774

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: add CLUSTER SAVECONFIG before pause in shard restart tests#3904

fix: add CLUSTER SAVECONFIG before pause in shard restart tests#3904
enjoy-binbin merged 1 commit into
valkey-io:unstablefrom
jjuleslasarte:fix/cluster-shards-saveconfig-race

jjuleslasarte commented Jun 2, 2026

Uh oh!

coderabbitai Bot commented Jun 2, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Uh oh!

codecov Bot commented Jun 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

jjuleslasarte commented Jun 2, 2026

Problem

Fix

Uh oh!

coderabbitai Bot commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Suggested reviewers

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

codecov Bot commented Jun 2, 2026

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai Bot commented Jun 2, 2026 •

edited

Loading