Improve CLUSTERSCAN error handling test with broader coverage #3674

Merged
enjoy-binbin merged 1 commit into valkey-io:unstable from enjoy-binbin:cleanup_test
May 13, 2026

@enjoy-binbin

Member

Rewrite the CLUSTERDOWN test to use a 3-node cluster and cover four
distinct degraded-cluster scenarios:

  1. Node down with full-coverage enabled -> CLUSTERDOWN on all slots
  2. Node down with full-coverage disabled -> MOVED for unreachable
    slots, local slots handled normally
  3. Slot unassigned with full-coverage enabled -> "Hash slot not served"
    for the unassigned slot, CLUSTERDOWN for others
  4. Slot unassigned with full-coverage disabled -> "Hash slot not served"
    for the unassigned slot, MOVED for remote slots
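
The four scenarios above can be read as a small decision table. The following is a minimal sketch of that table, not code from the test itself: every name in it (`fault_t`, `slot_kind_t`, `reply_t`, `expected_reply`) is hypothetical, and it assumes that the "unreachable slots" of case 2 are slots owned by another node. Combinations the description does not mention map to `REPLY_UNSPECIFIED` rather than a guess.

```c
#include <assert.h>
#include <stdbool.h>

/* What went wrong in the cluster (hypothetical names, for illustration). */
typedef enum { FAULT_NODE_DOWN, FAULT_SLOT_UNASSIGNED } fault_t;

/* How the queried slot relates to the node that receives CLUSTERSCAN. */
typedef enum { SLOT_LOCAL, SLOT_REMOTE, SLOT_UNASSIGNED } slot_kind_t;

typedef enum {
    REPLY_NORMAL,          /* request is handled normally */
    REPLY_CLUSTERDOWN,     /* CLUSTERDOWN error */
    REPLY_MOVED,           /* MOVED redirection */
    REPLY_SLOT_NOT_SERVED, /* "Hash slot not served" error */
    REPLY_UNSPECIFIED      /* the description does not state an outcome */
} reply_t;

/* Expected reply class for one slot, following cases 1-4 of the description. */
reply_t expected_reply(fault_t fault, bool full_coverage, slot_kind_t slot) {
    if (slot == SLOT_UNASSIGNED) {
        /* Cases 3 and 4: the unassigned slot itself is never served. */
        return fault == FAULT_SLOT_UNASSIGNED ? REPLY_SLOT_NOT_SERVED
                                              : REPLY_UNSPECIFIED;
    }
    if (full_coverage) {
        /* Cases 1 and 3: every assigned slot reports CLUSTERDOWN. */
        return REPLY_CLUSTERDOWN;
    }
    if (slot == SLOT_REMOTE) {
        /* Cases 2 and 4: slots owned elsewhere are redirected. */
        return REPLY_MOVED;
    }
    /* Local slot, full coverage disabled: only case 2 states the outcome. */
    return fault == FAULT_NODE_DOWN ? REPLY_NORMAL : REPLY_UNSPECIFIED;
}

/* Usage example: the outcomes listed in the description, as assertions. */
void check_described_cases(void) {
    assert(expected_reply(FAULT_NODE_DOWN, true, SLOT_LOCAL) == REPLY_CLUSTERDOWN);
    assert(expected_reply(FAULT_NODE_DOWN, true, SLOT_REMOTE) == REPLY_CLUSTERDOWN);
    assert(expected_reply(FAULT_NODE_DOWN, false, SLOT_REMOTE) == REPLY_MOVED);
    assert(expected_reply(FAULT_NODE_DOWN, false, SLOT_LOCAL) == REPLY_NORMAL);
    assert(expected_reply(FAULT_SLOT_UNASSIGNED, true, SLOT_UNASSIGNED) == REPLY_SLOT_NOT_SERVED);
    assert(expected_reply(FAULT_SLOT_UNASSIGNED, true, SLOT_LOCAL) == REPLY_CLUSTERDOWN);
    assert(expected_reply(FAULT_SLOT_UNASSIGNED, false, SLOT_UNASSIGNED) == REPLY_SLOT_NOT_SERVED);
    assert(expected_reply(FAULT_SLOT_UNASSIGNED, false, SLOT_REMOTE) == REPLY_MOVED);
}
```

The local-slot outcome for case 4 is left unspecified on purpose, since the description only covers the unassigned slot and remote slots there.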

Signed-off-by: Binbin <binloveplay1314@qq.com>
@enjoy-binbin

Member Author

@nmvk I noticed this test while working on other changes, would you be able to take a look and review it?

```
set slot0_owner -1
foreach n {0 1} {
    if {[catch {R $n clusterscan 0-{06S}-0 SLOT 0} res] == 0} {
        set cursor_slot_0 [lindex $res 0]
```

@enjoy-binbin enjoy-binbin May 12, 2026

Member Author

In the old code this actually returns {0 {}}, so cursor_slot_0 is 0 and the wait condition ends up checking R 0 clusterscan 0.

I'm wondering if we should return "cluster down" for this. See the other PR (#3675) for more info.

```
127.0.0.1:30001> clusterscan 0
1) "0-{06S}-0"
2) (empty array)
127.0.0.1:30001> clusterscan 0-{06S}-0
(error) CLUSTERDOWN The cluster is down
```

@codecov

codecov Bot commented May 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.54%. Comparing base (ca9dee3) to head (de63e74).
⚠️ Report is 1 commit behind head on unstable.

Additional details and impacted files
```
@@             Coverage Diff              @@
##           unstable    #3674      +/-   ##
============================================
- Coverage     76.71%   76.54%   -0.18%
============================================
  Files           162      162
  Lines         80656    80654       -2
============================================
- Hits          61872    61733     -139
- Misses        18784    18921     +137
```

see 22 files with indirect coverage changes

@nmvk nmvk left a comment

Contributor

Thank you @enjoy-binbin for adding the test with broader coverage

@madolson madolson left a comment

Member

More tests are always good :)

@enjoy-binbin enjoy-binbin merged commit a813df0 into valkey-io:unstable May 13, 2026
62 checks passed
@enjoy-binbin enjoy-binbin deleted the cleanup_test branch May 13, 2026 03:39
lucasyonge pushed a commit that referenced this pull request May 14, 2026
enjoy-binbin added a commit that referenced this pull request Jun 12, 2026
The Case 3 portion of the test was flaky: after a single round of
`CLUSTER DELSLOTS 0` on R0/R1/R2, the cluster could stay in OK state
and `wait_for_cluster_state fail` would time out with
`Cluster node 1 cluster_state:ok`.

The race is between R0's local DELSLOTS and the gossip already in
flight from R0. After R1 locally clears slot 0, a stale pre-DELSLOTS
packet from R0 (whose myslots still claims slot 0) hits the
isSlotUnclaimed fast path in clusterUpdateSlotsConfigWith and rebinds
slot 0 back to R0 on R1. See:
```
    if (isSlotUnclaimed(j) ||
        server.cluster->slots[j]->configEpoch < senderConfigEpoch ||
        clusterSlotFailoverGranted(j)) {
        ...
        clusterDelSlot(j);
        clusterAddSlot(sender, j);
        ...
    }
```

R0's subsequent "no longer claiming" PINGs cannot undo this, because
that path only sets owner_not_claiming_slot and never clears slots[j]:
```
    if (server.cluster->slots[j] == sender) {
        /* The slot is currently bound to the sender but the sender is no longer
         * claiming it. We don't want to unbind the slot yet as it can cause the cluster
         * to move to FAIL state and also throw client error. Keeping the slot bound to
         * the previous owner will cause a few client side redirects, but won't throw
         * any errors. We will keep track of the uncertainty in ownership to avoid
         * propagating misinformation about this slot's ownership using UPDATE
         * messages. */
        bitmapSetBit(server.cluster->owner_not_claiming_slot, j);
    }
```

Combined with clusterUpdateState's full-coverage check looking only
at slots[j] == NULL, R1 stays at cluster OK forever.
```
    if (server.cluster->slots[j] == NULL || ...) {
        new_state = CLUSTER_FAIL;
        ...
    }
```

Rather than fighting the protocol's intentional asymmetry around
"soft delete" via gossip, just retry the DELSLOTS pass until all
three nodes converge to FAIL. This keeps the test focused on the
CLUSTERSCAN error semantics it actually wants to verify.
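
To make the race and the retry concrete, here is a toy model of slot 0 only. It is a sketch under simplifying assumptions, not the server's or the test's code: `node_view_t`, `delslots`, `on_gossip`, and `retry_delslots` are hypothetical names, config epochs and failover grants are ignored, stale packets are modeled as a plain counter delivered from node 0 to node 1, and the toy `delslots` is idempotent, which the real command need not be.

```c
#include <assert.h>
#include <stdbool.h>

#define NODES 3
#define UNASSIGNED (-1)

/* One node's view of slot 0 (toy stand-in for slots[0] and the
 * owner_not_claiming_slot bit). */
typedef struct {
    int owner;               /* node id bound to slot 0, or UNASSIGNED */
    bool owner_not_claiming; /* owner stopped claiming but is still bound */
} node_view_t;

/* Toy CLUSTER DELSLOTS 0: clear the local binding. */
void delslots(node_view_t *v) {
    v->owner = UNASSIGNED;
    v->owner_not_claiming = false;
}

/* Toy handling of a gossip packet from `sender`; `claims` says whether
 * the packet's slot bitmap still contains slot 0. */
void on_gossip(node_view_t *v, int sender, bool claims) {
    if (claims) {
        /* Unclaimed-slot path: an unbound slot is rebound to the sender. */
        if (v->owner == UNASSIGNED) v->owner = sender;
    } else if (v->owner == sender) {
        /* "No longer claiming" path: only mark the doubt, never unbind. */
        v->owner_not_claiming = true;
    }
}

/* Full-coverage check reduced to slot 0: FAIL only if the slot is unbound. */
bool state_is_fail(const node_view_t *v) { return v->owner == UNASSIGNED; }

bool all_fail(const node_view_t c[NODES]) {
    for (int i = 0; i < NODES; i++) {
        if (!state_is_fail(&c[i])) return false;
    }
    return true;
}

/* Repeat the DELSLOTS pass until every node reports FAIL. After each pass,
 * at most one stale pre-DELSLOTS packet from node 0 reaches node 1, followed
 * by a fresh non-claiming ping. Returns the rounds used, or -1 on give-up. */
int retry_delslots(node_view_t c[NODES], int stale_in_flight, int max_rounds) {
    for (int round = 1; round <= max_rounds; round++) {
        for (int i = 0; i < NODES; i++) delslots(&c[i]);
        if (stale_in_flight > 0) {
            on_gossip(&c[1], 0, true); /* stale claim rebinds slot 0 */
            stale_in_flight--;
        }
        on_gossip(&c[1], 0, false); /* fresh ping cannot undo the rebind */
        if (all_fail(c)) return round;
    }
    return -1;
}

/* Usage example: one stale packet makes the first round insufficient. */
void demo_retry(void) {
    node_view_t c[NODES] = {{0, false}, {0, false}, {0, false}};
    assert(retry_delslots(c, 1, 5) == 2);
    assert(all_fail(c));
}
```

In this model a single round with one stale packet leaves node 1 bound to node 0 with only the doubt flag set, which mirrors the timeout described above; the second pass succeeds because no claiming packet is left in flight.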

This closes #3891. The test was added in #3674.

Signed-off-by: Binbin <binloveplay1314@qq.com>
valkeyrie-ops Bot pushed a commit that referenced this pull request Jun 13, 2026
valkeyrie-ops Bot pushed a commit that referenced this pull request Jun 17, 2026