Stabilize CLUSTERSCAN unassigned-slot test by retrying DELSLOTS by enjoy-binbin · Pull Request #3959 · valkey-io/valkey

enjoy-binbin · 2026-06-10T05:51:28Z

The Case 3 portion of the test was flaky: after a single round of
CLUSTER DELSLOTS 0 on R0/R1/R2, the cluster could stay in OK state
and wait_for_cluster_state fail would time out with
Cluster node 1 cluster_state:ok.

The race is between R0's local DELSLOTS and the gossip already in
flight from R0. After R1 locally clears slot 0, a stale pre-DELSLOTS
packet from R0 (whose myslots still claims slot 0) hits the
isSlotUnclaimed fast path in clusterUpdateSlotsConfigWith and rebinds
slot 0 back to R0 on R1. See:

    if (isSlotUnclaimed(j) ||
        server.cluster->slots[j]->configEpoch < senderConfigEpoch ||
        clusterSlotFailoverGranted(j)) {
        ...
        clusterDelSlot(j);
        clusterAddSlot(sender, j);
        ...
    }

R0's subsequent "no longer claiming" PINGs cannot undo this, because
that path only sets owner_not_claiming_slot and never clears slots[j]:

    if (server.cluster->slots[j] == sender) {
        /* The slot is currently bound to the sender but the sender is no longer
         * claiming it. We don't want to unbind the slot yet as it can cause the cluster
         * to move to FAIL state and also throw client error. Keeping the slot bound to
         * the previous owner will cause a few client side redirects, but won't throw
         * any errors. We will keep track of the uncertainty in ownership to avoid
         * propagating misinformation about this slot's ownership using UPDATE
         * messages. */
        bitmapSetBit(server.cluster->owner_not_claiming_slot, j);
    }

Combined with clusterUpdateState's full-coverage check looking only
at slots[j] == NULL, R1 stays at cluster OK forever.

    if (server.cluster->slots[j] == NULL || ...) {
        new_state = CLUSTER_FAIL;
        ...
    }

Rather than fighting the protocol's intentional asymmetry around
"soft delete" via gossip, just retry the DELSLOTS pass until all
three nodes converge to FAIL. This keeps the test focused on the
CLUSTERSCAN error semantics it actually wants to verify.

This closes #3891. The test was added in #3674.

The Case 3 portion of the test was flaky: after a single round of `CLUSTER DELSLOTS 0` on R0/R1/R2, the cluster could stay in OK state and `wait_for_cluster_state fail` would time out with `Cluster node 1 cluster_state:ok`. The race is between R0's local DELSLOTS and the gossip already in flight from R0. After R1 locally clears slot 0, a stale pre-DELSLOTS packet from R0 (whose myslots still claims slot 0) hits the isSlotUnclaimed fast path in clusterUpdateSlotsConfigWith and rebinds slot 0 back to R0 on R1. See: ``` if (isSlotUnclaimed(j) || server.cluster->slots[j]->configEpoch < senderConfigEpoch || clusterSlotFailoverGranted(j)) { ... clusterDelSlot(j); clusterAddSlot(sender, j); ... } ``` R0's subsequent "no longer claiming" PINGs cannot undo this, because that path only sets owner_not_claiming_slot and never clears slots[j]: ``` if (server.cluster->slots[j] == sender) { /* The slot is currently bound to the sender but the sender is no longer * claiming it. We don't want to unbind the slot yet as it can cause the cluster * to move to FAIL state and also throw client error. Keeping the slot bound to * the previous owner will cause a few client side redirects, but won't throw * any errors. We will keep track of the uncertainty in ownership to avoid * propagating misinformation about this slot's ownership using UPDATE * messages. */ bitmapSetBit(server.cluster->owner_not_claiming_slot, j); } ``` Combined with clusterUpdateState's full-coverage check looking only at slots[j] == NULL, R1 stays at cluster OK forever. ``` if (server.cluster->slots[j] == NULL || ...) { new_state = CLUSTER_FAIL; ... } ``` Rather than fighting the protocol's intentional asymmetry around "soft delete" via gossip, just retry the DELSLOTS pass until all three nodes converge to FAIL. This keeps the test focused on the CLUSTERSCAN error semantics it actually wants to verify. This closes valkey-io#3891. The test was added in valkey-io#3674. Signed-off-by: Binbin <binloveplay1314@qq.com>

coderabbitai · 2026-06-10T05:51:41Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 2e5469ee-5578-4734-aa4b-d480c5b8752e

📥 Commits

Reviewing files that changed from the base of the PR and between e1bc58c and 3ac0875.

📒 Files selected for processing (1)

tests/unit/cluster/clusterscan.tcl

📝 Walkthrough

Walkthrough

The PR stabilizes a cluster integration test by replacing a single DELSLOTS attempt with a retry loop that waits for all cluster nodes to converge to FAIL state before proceeding, reducing flakiness caused by timing-dependent race conditions.

Changes

Cluster scan test convergence

Layer / File(s)	Summary
Cluster DELSLOTS convergence retry loop `tests/unit/cluster/clusterscan.tcl`	The test case `CLUSTERSCAN returns correct errors on cluster down and unassigned slots` now repeatedly issues `CLUSTER DELSLOTS 0` to nodes R0, R1, and R2 within a `wait_for_condition` loop until all three nodes report `cluster_state` as `"fail"`, with a dedicated error message if the cluster fails to converge within the timeout window. This replaces the previous single-shot DELSLOTS approach followed by `wait_for_cluster_state fail`.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested reviewers

zuiderkwast
sarthakaggarwal97

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately describes the main change: stabilizing a flaky test by modifying how DELSLOTS is retried until cluster convergence.
Description check	✅ Passed	The description thoroughly explains the flakiness, root cause involving gossip race conditions and protocol behavior, and justifies the solution of retrying DELSLOTS.
Linked Issues check	✅ Passed	The code changes directly address the flaky test failure in `#3891` by making the DELSLOTS deletion more robust, ensuring all nodes converge to FAIL state before proceeding.
Out of Scope Changes check	✅ Passed	All changes are scoped to stabilizing the flaky test case in tests/unit/cluster/clusterscan.tcl with no unrelated modifications.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

codecov · 2026-06-10T06:16:39Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.57%. Comparing base (948aaf8) to head (3ac0875).
⚠️ Report is 2 commits behind head on unstable.

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable    #3959      +/-   ##
============================================
- Coverage     76.68%   76.57%   -0.11%     
============================================
  Files           162      162              
  Lines         80729    80753      +24     
============================================
- Hits          61903    61834      -69     
- Misses        18826    18919      +93

see 27 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

sarthakaggarwal97

This was an interesting race condition. Thanks for fixing this.

The Case 3 portion of the test was flaky: after a single round of `CLUSTER DELSLOTS 0` on R0/R1/R2, the cluster could stay in OK state and `wait_for_cluster_state fail` would time out with `Cluster node 1 cluster_state:ok`. The race is between R0's local DELSLOTS and the gossip already in flight from R0. After R1 locally clears slot 0, a stale pre-DELSLOTS packet from R0 (whose myslots still claims slot 0) hits the isSlotUnclaimed fast path in clusterUpdateSlotsConfigWith and rebinds slot 0 back to R0 on R1. See: ``` if (isSlotUnclaimed(j) || server.cluster->slots[j]->configEpoch < senderConfigEpoch || clusterSlotFailoverGranted(j)) { ... clusterDelSlot(j); clusterAddSlot(sender, j); ... } ``` R0's subsequent "no longer claiming" PINGs cannot undo this, because that path only sets owner_not_claiming_slot and never clears slots[j]: ``` if (server.cluster->slots[j] == sender) { /* The slot is currently bound to the sender but the sender is no longer * claiming it. We don't want to unbind the slot yet as it can cause the cluster * to move to FAIL state and also throw client error. Keeping the slot bound to * the previous owner will cause a few client side redirects, but won't throw * any errors. We will keep track of the uncertainty in ownership to avoid * propagating misinformation about this slot's ownership using UPDATE * messages. */ bitmapSetBit(server.cluster->owner_not_claiming_slot, j); } ``` Combined with clusterUpdateState's full-coverage check looking only at slots[j] == NULL, R1 stays at cluster OK forever. ``` if (server.cluster->slots[j] == NULL || ...) { new_state = CLUSTER_FAIL; ... } ``` Rather than fighting the protocol's intentional asymmetry around "soft delete" via gossip, just retry the DELSLOTS pass until all three nodes converge to FAIL. This keeps the test focused on the CLUSTERSCAN error semantics it actually wants to verify. This closes #3891. The test was added in #3674. Signed-off-by: Binbin <binloveplay1314@qq.com>

github-actions Bot assigned enjoy-binbin Jun 10, 2026

enjoy-binbin mentioned this pull request Jun 10, 2026

Deflake clusterscan "cluster down and unassigned slots" test #3948

Closed

nmvk approved these changes Jun 10, 2026

View reviewed changes

sarthakaggarwal97 approved these changes Jun 11, 2026

View reviewed changes

enjoy-binbin mentioned this pull request Jun 11, 2026

[backport] Backport sweep for 9.1 #3774

Merged

enjoy-binbin added this to Valkey 9.1 Jun 11, 2026

enjoy-binbin moved this to To be backported in Valkey 9.1 Jun 11, 2026

zuiderkwast approved these changes Jun 11, 2026

View reviewed changes

enjoy-binbin merged commit 86acf8d into valkey-io:unstable Jun 12, 2026
64 checks passed

enjoy-binbin deleted the fix_test2 branch June 12, 2026 02:18

valkeyrie-ops Bot mentioned this pull request Jun 12, 2026

[backport] Backport sweep for 9.1 #3970

Closed

valkeyrie-ops Bot mentioned this pull request Jun 17, 2026

[backport] Backport sweep for 9.1 #3987

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Stabilize CLUSTERSCAN unassigned-slot test by retrying DELSLOTS#3959

Stabilize CLUSTERSCAN unassigned-slot test by retrying DELSLOTS#3959
enjoy-binbin merged 1 commit into
valkey-io:unstablefrom
enjoy-binbin:fix_test2

enjoy-binbin commented Jun 10, 2026

Uh oh!

coderabbitai Bot commented Jun 10, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Suggested reviewers

Uh oh!

codecov Bot commented Jun 10, 2026

Uh oh!

sarthakaggarwal97 left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Conversation

enjoy-binbin commented Jun 10, 2026

Uh oh!

coderabbitai Bot commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Suggested reviewers

Uh oh!

codecov Bot commented Jun 10, 2026

Codecov Report

Uh oh!

sarthakaggarwal97 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

coderabbitai Bot commented Jun 10, 2026 •

edited

Loading