Improve CLUSTERSCAN error handling test with broader coverage by enjoy-binbin · Pull Request #3674 · valkey-io/valkey

enjoy-binbin · 2026-05-12T06:49:16Z

Rewrite the CLUSTERDOWN test to use a 3-node cluster and cover four
distinct degraded-cluster scenarios:

1. Node down with full-coverage enabled -> CLUSTERDOWN on all slots
2. Node down with full-coverage disabled -> MOVED for unreachable
   slots, local slots handled normally
3. Slot unassigned with full-coverage enabled -> "Hash slot not served"
   for the unassigned slot, CLUSTERDOWN for others
4. Slot unassigned with full-coverage disabled -> "Hash slot not served"
   for the unassigned slot, MOVED for remote slots
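
For quick reference, the four scenarios can be written down as a small lookup table. This is only a sketch in Python, not the Tcl test itself: the names `EXPECTED`, `expected_reply`, and the slot-kind labels are hypothetical, and "full-coverage" is presumably the `cluster-require-full-coverage` setting. Combinations the description does not mention return `None` rather than a guess.

```python
# Toy lookup table transcribed from the four scenarios in the PR description.
# Every name here is hypothetical; none of this is Valkey or test-suite code.
HANDLED_NORMALLY = "handled normally"

# Key: (failure kind, full-coverage enabled?) -> {slot kind: expected reply}
EXPECTED = {
    ("node_down", True): {"unreachable": "CLUSTERDOWN", "local": "CLUSTERDOWN"},
    ("node_down", False): {"unreachable": "MOVED", "local": HANDLED_NORMALLY},
    ("slot_unassigned", True): {"unassigned": "Hash slot not served", "other": "CLUSTERDOWN"},
    ("slot_unassigned", False): {"unassigned": "Hash slot not served", "remote": "MOVED"},
}


def expected_reply(failure, full_coverage, slot_kind):
    """Return the reply listed in the description, or None if it is not stated."""
    return EXPECTED.get((failure, full_coverage), {}).get(slot_kind)


if __name__ == "__main__":
    for (failure, full_coverage), replies in EXPECTED.items():
        print(failure, "full-coverage" if full_coverage else "no full-coverage", replies)
```

The table makes one thing visible at a glance: the unassigned slot gets the same "Hash slot not served" reply regardless of the full-coverage setting, while the setting decides between CLUSTERDOWN and MOVED for the remaining slots.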

Signed-off-by: Binbin <binloveplay1314@qq.com>

enjoy-binbin · 2026-05-12T06:50:27Z

@nmvk I noticed this test while working on other changes. Would you be able to take a look and review it?

enjoy-binbin · 2026-05-12T06:55:26Z

-        set slot0_owner -1
-        foreach n {0 1} {
-            if {[catch {R $n clusterscan 0-{06S}-0 SLOT 0} res] == 0} {
-                set cursor_slot_0 [lindex $res 0]


This actually returns {0 {}} in the old code, so cursor_slot_0 is 0, and the wait condition ends up checking R 0 clusterscan 0 instead of the slot 0 cursor.

I'm wondering if we should return "cluster down" for this. See the other PR (#3675) for more info.

127.0.0.1:30001> clusterscan 0
1) "0-{06S}-0"
2) (empty array)
127.0.0.1:30001> clusterscan 0-{06S}-0
(error) CLUSTERDOWN The cluster is down

codecov · 2026-05-12T08:02:30Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.54%. Comparing base (ca9dee3) to head (de63e74).
⚠️ Report is 1 commits behind head on unstable.

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable    #3674      +/-   ##
============================================
- Coverage     76.71%   76.54%   -0.18%     
============================================
  Files           162      162              
  Lines         80656    80654       -2     
============================================
- Hits          61872    61733     -139     
- Misses        18784    18921     +137

see 22 files with indirect coverage changes

nmvk

Thank you @enjoy-binbin for adding the test with broader coverage

madolson

More tests are always good :)

The Case 3 portion of the test was flaky: after a single round of `CLUSTER DELSLOTS 0` on R0/R1/R2, the cluster could stay in OK state and `wait_for_cluster_state fail` would time out with `Cluster node 1 cluster_state:ok`.

The race is between R0's local DELSLOTS and the gossip already in flight from R0. After R1 locally clears slot 0, a stale pre-DELSLOTS packet from R0 (whose myslots still claims slot 0) hits the isSlotUnclaimed fast path in clusterUpdateSlotsConfigWith and rebinds slot 0 back to R0 on R1. See:

```
if (isSlotUnclaimed(j) ||
    server.cluster->slots[j]->configEpoch < senderConfigEpoch ||
    clusterSlotFailoverGranted(j)) {
    ...
    clusterDelSlot(j);
    clusterAddSlot(sender, j);
    ...
}
```

R0's subsequent "no longer claiming" PINGs cannot undo this, because that path only sets owner_not_claiming_slot and never clears slots[j]:

```
if (server.cluster->slots[j] == sender) {
    /* The slot is currently bound to the sender but the sender is no longer
     * claiming it. We don't want to unbind the slot yet as it can cause the cluster
     * to move to FAIL state and also throw client error. Keeping the slot bound to
     * the previous owner will cause a few client side redirects, but won't throw
     * any errors. We will keep track of the uncertainty in ownership to avoid
     * propagating misinformation about this slot's ownership using UPDATE
     * messages. */
    bitmapSetBit(server.cluster->owner_not_claiming_slot, j);
}
```

Combined with clusterUpdateState's full-coverage check looking only at slots[j] == NULL, R1 stays at cluster OK forever.

```
if (server.cluster->slots[j] == NULL || ...) {
    new_state = CLUSTER_FAIL;
    ...
}
```

Rather than fighting the protocol's intentional asymmetry around "soft delete" via gossip, just retry the DELSLOTS pass until all three nodes converge to FAIL. This keeps the test focused on the CLUSTERSCAN error semantics it actually wants to verify.

This closes #3891. The test was added in #3674.

Signed-off-by: Binbin <binloveplay1314@qq.com>
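
The race and the retry fix described in this commit message can be illustrated with a toy model. The sketch below is Python, not Valkey's C code or the Tcl test: `ToyNode`, `receive_gossip`, and `retry_delslots_until_fail` are hypothetical names, only one slot is tracked, and the config-epoch and failover conditions from the real check are deliberately left out.

```python
class ToyNode:
    """One node's view of a single slot. A toy stand-in, not Valkey's structures."""

    def __init__(self, owner):
        self.slot_owner = owner          # plays the role of slots[j]
        self.owner_not_claiming = False  # plays the role of the owner_not_claiming_slot bit

    def delslots(self):
        # Local DELSLOTS: the slot becomes unassigned on this node.
        self.slot_owner = None

    def receive_gossip(self, sender, sender_claims_slot):
        if sender_claims_slot:
            # Simplified "unclaimed" fast path: an unassigned slot is rebound
            # to whoever claims it (epoch checks are omitted in this toy).
            if self.slot_owner is None:
                self.slot_owner = sender
        elif self.slot_owner == sender:
            # Soft delete: remember the doubt, but keep the slot bound.
            self.owner_not_claiming = True

    def state(self):
        # Simplified coverage check: only an unbound slot means FAIL.
        return "fail" if self.slot_owner is None else "ok"


def retry_delslots_until_fail(nodes, in_flight, max_rounds=10):
    """Repeat the DELSLOTS pass until every node reports fail; return the round count."""
    for round_no in range(1, max_rounds + 1):
        for node in nodes.values():
            node.delslots()
        # Deliver packets that were already on the wire before the DELSLOTS pass.
        while in_flight:
            target, sender, claims = in_flight.pop(0)
            nodes[target].receive_gossip(sender, claims)
        if all(node.state() == "fail" for node in nodes.values()):
            return round_no
    raise TimeoutError("nodes never converged to fail")


if __name__ == "__main__":
    nodes = {name: ToyNode(owner="R0") for name in ("R0", "R1", "R2")}
    stale = [("R1", "R0", True)]  # one stale pre-DELSLOTS packet from R0 to R1
    print(retry_delslots_until_fail(nodes, stale))
```

In this model a single DELSLOTS pass is not enough, because the stale packet rebinds the slot on R1 and a later "not claiming" message only sets the flag. A second pass, with no stale packets left, brings all three nodes to fail, which is the behaviour the retry in the test relies on.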

github-actions Bot assigned enjoy-binbin May 12, 2026

nmvk approved these changes May 12, 2026

madolson approved these changes May 12, 2026

enjoy-binbin merged commit a813df0 into valkey-io:unstable May 13, 2026
62 checks passed

enjoy-binbin deleted the cleanup_test branch May 13, 2026 03:39

enjoy-binbin mentioned this pull request Jun 10, 2026

Stabilize CLUSTERSCAN unassigned-slot test by retrying DELSLOTS #3959

Merged

ranshid mentioned this pull request Jun 11, 2026

[backport] Backport sweep for 9.1 #3774

Merged
