Skip to content

fix: add CLUSTER SAVECONFIG before pause in shard restart tests#3904

Merged
enjoy-binbin merged 1 commit into
valkey-io:unstablefrom
jjuleslasarte:fix/cluster-shards-saveconfig-race
Jun 4, 2026
Merged

fix: add CLUSTER SAVECONFIG before pause in shard restart tests#3904
enjoy-binbin merged 1 commit into
valkey-io:unstablefrom
jjuleslasarte:fix/cluster-shards-saveconfig-race

Conversation

@jjuleslasarte

Copy link
Copy Markdown
Contributor

Fixes #3883

Problem

The shard ID restart tests in cluster-shards.tcl fail intermittently under valgrind because pause_process (SIGSTOP) can hit before the deferred config save is flushed to nodes.conf.

When a replica learns its primary's shard ID via gossip, it updates myself->shard_id and sets CLUSTER_TODO_SAVE_CONFIG but the actual write only happens in clusterBeforeSleep() during the next event loop iteration. Under valgrind's slower execution, the test's SIGSTOP can arrive before that flush, so on restart the node gets a fresh random shard ID instead of the persisted one.

Fix

Add CLUSTER SAVECONFIG before pausing nodes in both restart tests, ensuring nodes.conf contains the current shard IDs on disk before SIGSTOP. This follows the same pattern used in latency-monitor.tcl.

Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>
@coderabbitai

coderabbitai Bot commented Jun 2, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

The PR adds explicit cluster saveconfig calls in two cluster restart test scenarios to ensure shard IDs persist consistently. Before restarting or pausing cluster nodes, the tests now save configuration state, allowing CLUSTER MYSHARDID assertions to validate against persisted data.

Changes

Cluster restart test reliability

Layer / File(s) Summary
Config persistence in restart test checkpoints
tests/unit/cluster/cluster-shards.tcl
Two test loops now invoke cluster saveconfig before pausing and restarting nodes in shard restart and full cluster restart scenarios.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Suggested reviewers

  • zuiderkwast
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main change: adding CLUSTER SAVECONFIG calls before pause operations in shard restart tests.
Description check ✅ Passed The description is directly related to the changeset, explaining the problem (race condition with config save under valgrind) and the fix (adding CLUSTER SAVECONFIG before pausing).
Linked Issues check ✅ Passed The changes directly address issue #3883 by adding CLUSTER SAVECONFIG before pause operations to ensure shard IDs persist to nodes.conf before process suspension.
Out of Scope Changes check ✅ Passed All changes are scoped to the cluster-shards.tcl test file and directly address the race condition identified in issue #3883, with no extraneous modifications.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@jjuleslasarte jjuleslasarte force-pushed the fix/cluster-shards-saveconfig-race branch from 21e75cf to db19803 Compare June 2, 2026 23:09

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
tests/unit/cluster/cluster-shards.tcl (1)

284-298: 💤 Low value

Add a brief comment explaining why cluster saveconfig is required here.

The save-before-pause is the entire point of this fix, but the reason is non-obvious: under valgrind the SIGSTOP from pause_process can arrive before clusterBeforeSleep() flushes the deferred CLUSTER_TODO_SAVE_CONFIG write, so the restarted node would otherwise read a stale/new random shard id. A one-line comment preserves that intent for future readers.

📝 Suggested comment
     for {set i 0} {$i < 8} {incr i 4} {
         dict set node_ids $i [R $i cluster myshardid]
+        # Force a synchronous nodes.conf flush before SIGSTOP; otherwise the
+        # deferred shard-id save may not reach disk before pause_process (esp. under valgrind).
         R $i cluster saveconfig
         pause_process [srv [expr -1*$i] pid]
     }
As per coding guidelines: "Comment complex test scenarios to explain intent (test documentation)".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/cluster/cluster-shards.tcl` around lines 284 - 298, Add a one-line
comment above the loop that calls "cluster saveconfig" explaining that the
explicit save is required because under valgrind a SIGSTOP from pause_process
can occur before clusterBeforeSleep() flushes the deferred
CLUSTER_TODO_SAVE_CONFIG write; without this forced save the restarted node
(after restart_server) may read a stale or random shard id. Mention the race
with SIGSTOP/valgrind and reference cluster saveconfig, pause_process,
clusterBeforeSleep() and CLUSTER_TODO_SAVE_CONFIG so future readers understand
the intent.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@tests/unit/cluster/cluster-shards.tcl`:
- Around line 284-298: Add a one-line comment above the loop that calls "cluster
saveconfig" explaining that the explicit save is required because under valgrind
a SIGSTOP from pause_process can occur before clusterBeforeSleep() flushes the
deferred CLUSTER_TODO_SAVE_CONFIG write; without this forced save the restarted
node (after restart_server) may read a stale or random shard id. Mention the
race with SIGSTOP/valgrind and reference cluster saveconfig, pause_process,
clusterBeforeSleep() and CLUSTER_TODO_SAVE_CONFIG so future readers understand
the intent.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 8f126a5e-2a63-41d8-a4e0-d3193e0634a1

📥 Commits

Reviewing files that changed from the base of the PR and between 3cfaf57 and db19803.

📒 Files selected for processing (1)
  • tests/unit/cluster/cluster-shards.tcl

@codecov

codecov Bot commented Jun 2, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.62%. Comparing base (c15ecf1) to head (db19803).
⚠️ Report is 2 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #3904      +/-   ##
============================================
- Coverage     76.66%   76.62%   -0.04%     
============================================
  Files           162      162              
  Lines         80717    80728      +11     
============================================
- Hits          61878    61860      -18     
- Misses        18839    18868      +29     

see 28 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@enjoy-binbin enjoy-binbin merged commit 3100f9d into valkey-io:unstable Jun 4, 2026
62 of 63 checks passed
hpatro pushed a commit to hpatro/valkey that referenced this pull request Jun 5, 2026
…#3904)

## Problem

The shard ID restart tests in `cluster-shards.tcl` fail intermittently
under valgrind because `pause_process` (SIGSTOP) can hit before the
deferred config save is flushed to `nodes.conf`.

When a replica learns its primary's shard ID via gossip, it updates
`myself->shard_id` and sets `CLUSTER_TODO_SAVE_CONFIG` but the actual
write only happens in `clusterBeforeSleep()` during the next event loop
iteration. Under valgrind's slower execution, the test's SIGSTOP can
arrive before that flush, so on restart the node gets a fresh random
shard ID instead of the persisted one.

## Fix

Add `CLUSTER SAVECONFIG` before pausing nodes in both restart tests,
ensuring `nodes.conf` contains the current shard IDs on disk before
SIGSTOP. This follows the same pattern used in `latency-monitor.tcl`.

Fixes valkey-io#3883.

Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>
Signed-off-by: Harkrishn Patro <h_patro@apple.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[TEST-FAILURE] CLUSTER MYSHARDID reports same shard id after cluster restart in tests/unit/cluster/cluster-shards.tcl

2 participants