Improve CLUSTERSCAN error handling test with broader coverage #3674
Merged
Rewrite the CLUSTERDOWN test to use a 3-node cluster and cover four distinct degraded-cluster scenarios:
1. Node down with full-coverage enabled -> CLUSTERDOWN on all slots
2. Node down with full-coverage disabled -> MOVED for unreachable slots, local slots handled normally
3. Slot unassigned with full-coverage enabled -> "Hash slot not served" for the unassigned slot, CLUSTERDOWN for others
4. Slot unassigned with full-coverage disabled -> "Hash slot not served" for the unassigned slot, MOVED for remote slots
Signed-off-by: Binbin <binloveplay1314@qq.com>
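The four scenarios above can be read as a small lookup from (fault, full-coverage setting, kind of slot) to a reply class. The sketch below is only an illustration of that reading: every type and function name in it is hypothetical, "OK" stands in for "handled normally", `SLOT_REMOTE` means a slot owned by another node (in case 2, the node that is down), and combinations the description does not mention are reported as "unspecified" rather than guessed.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* All names below are hypothetical and exist only for this sketch. */
typedef enum { FAULT_NODE_DOWN, FAULT_SLOT_UNASSIGNED } fault_t;
typedef enum { SLOT_LOCAL, SLOT_REMOTE, SLOT_UNASSIGNED } slot_kind_t;

/* Returns the reply class the PR description lists for a scenario, or
 * "unspecified" where the description does not state an outcome. */
const char *expected_reply(fault_t fault, bool full_coverage, slot_kind_t kind) {
    if (fault == FAULT_NODE_DOWN) {
        /* Cases 1 and 2 do not involve an unassigned slot. */
        if (kind == SLOT_UNASSIGNED) return "unspecified";
        if (full_coverage) return "CLUSTERDOWN";    /* case 1: all slots */
        return kind == SLOT_LOCAL ? "OK" : "MOVED"; /* case 2 */
    }
    /* fault == FAULT_SLOT_UNASSIGNED */
    if (kind == SLOT_UNASSIGNED) return "Hash slot not served"; /* cases 3, 4 */
    if (full_coverage) return "CLUSTERDOWN";                    /* case 3: others */
    return kind == SLOT_REMOTE ? "MOVED" : "unspecified";       /* case 4 */
}

/* True when a reply class matches the expected one. */
bool reply_is(const char *reply, const char *expected) {
    assert(reply != NULL && expected != NULL);
    return strcmp(reply, expected) == 0;
}
```

For example, `expected_reply(FAULT_SLOT_UNASSIGNED, true, SLOT_LOCAL)` yields "CLUSTERDOWN", matching case 3's "CLUSTERDOWN for others".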
@nmvk I noticed this test while working on other changes, would you be able to take a look and review it?
enjoy-binbin commented May 12, 2026
set slot0_owner -1
foreach n {0 1} {
    if {[catch {R $n clusterscan 0-{06S}-0 SLOT 0} res] == 0} {
        set cursor_slot_0 [lindex $res 0]
This actually returns {0 {}} in the old code, so cursor_slot_0 is 0, and in the wait condition, we are actually checking R 0 clusterscan 0.
I'm wondering if we should return "cluster down" for this. See the other PR (#3675) for more info
127.0.0.1:30001> clusterscan 0
1) "0-{06S}-0"
2) (empty array)
127.0.0.1:30001> clusterscan 0-{06S}-0
(error) CLUSTERDOWN The cluster is down
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@            Coverage Diff             @@
##           unstable    #3674      +/-   ##
============================================
- Coverage     76.71%   76.54%   -0.18%
============================================
  Files           162      162
  Lines         80656    80654       -2
============================================
- Hits          61872    61733     -139
- Misses        18784    18921     +137
nmvk approved these changes May 12, 2026
Thank you @enjoy-binbin for adding the test with broader coverage
madolson approved these changes May 12, 2026
More tests are always good :)
lucasyonge pushed a commit that referenced this pull request May 14, 2026
enjoy-binbin added a commit that referenced this pull request Jun 12, 2026
The Case 3 portion of the test was flaky: after a single round of
`CLUSTER DELSLOTS 0` on R0/R1/R2, the cluster could stay in OK state
and `wait_for_cluster_state fail` would time out with
`Cluster node 1 cluster_state:ok`.
The race is between R0's local DELSLOTS and the gossip already in
flight from R0. After R1 locally clears slot 0, a stale pre-DELSLOTS
packet from R0 (whose myslots still claims slot 0) hits the
isSlotUnclaimed fast path in clusterUpdateSlotsConfigWith and rebinds
slot 0 back to R0 on R1. See:
```
if (isSlotUnclaimed(j) ||
    server.cluster->slots[j]->configEpoch < senderConfigEpoch ||
    clusterSlotFailoverGranted(j)) {
    ...
    clusterDelSlot(j);
    clusterAddSlot(sender, j);
    ...
}
```
R0's subsequent "no longer claiming" PINGs cannot undo this, because
that path only sets owner_not_claiming_slot and never clears slots[j]:
```
if (server.cluster->slots[j] == sender) {
    /* The slot is currently bound to the sender but the sender is no longer
     * claiming it. We don't want to unbind the slot yet as it can cause the cluster
     * to move to FAIL state and also throw client error. Keeping the slot bound to
     * the previous owner will cause a few client side redirects, but won't throw
     * any errors. We will keep track of the uncertainty in ownership to avoid
     * propagating misinformation about this slot's ownership using UPDATE
     * messages. */
    bitmapSetBit(server.cluster->owner_not_claiming_slot, j);
}
```
Combined with clusterUpdateState's full-coverage check looking only
at slots[j] == NULL, R1 stays at cluster OK forever.
```
if (server.cluster->slots[j] == NULL || ...) {
    new_state = CLUSTER_FAIL;
    ...
}
```
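To make the interaction of the three snippets concrete, here is a toy model of R1's view, written as a sketch rather than as the server's code. All `toy_*` names are hypothetical. Assumptions: only the "slot is unassigned" branch of the rebind condition is modelled (config epochs and failover grants are ignored), "unclaimed" is approximated as `slots[j] == NULL`, and the coverage check looks only at NULL owners, as the commit message describes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_SLOTS 4 /* toy size; a real cluster has 16384 slots */

typedef struct toy_node { int id; } toy_node;

/* One node's local view of slot ownership (hypothetical type). */
typedef struct toy_view {
    toy_node *slots[TOY_SLOTS];         /* NULL means unassigned */
    bool owner_not_claiming[TOY_SLOTS]; /* "soft delete" marker */
} toy_view;

/* Local effect of CLUSTER DELSLOTS on one slot. */
void toy_delslot(toy_view *v, int j) {
    assert(j >= 0 && j < TOY_SLOTS);
    v->slots[j] = NULL;
}

/* Apply one gossip packet; claimed[j] is true when the sender claims slot j. */
void toy_on_gossip(toy_view *v, toy_node *sender, const bool claimed[TOY_SLOTS]) {
    for (int j = 0; j < TOY_SLOTS; j++) {
        if (claimed[j]) {
            /* Simplified rebind: an unassigned slot is handed to the sender. */
            if (v->slots[j] == NULL) v->slots[j] = sender;
        } else if (v->slots[j] == sender) {
            /* Sender stopped claiming: mark it, but keep the slot bound. */
            v->owner_not_claiming[j] = true;
        }
    }
}

/* Full-coverage check that only looks for unassigned slots. */
bool toy_full_coverage_ok(const toy_view *v) {
    for (int j = 0; j < TOY_SLOTS; j++) {
        if (v->slots[j] == NULL) return false;
    }
    return true;
}

/* Replays the race on R1's view; true when R1 ends up seeing full coverage. */
bool toy_race_keeps_ok(void) {
    toy_node r0 = {0}, r1 = {1};
    toy_view r1_view = {{&r0, &r1, &r1, &r1}, {false, false, false, false}};
    const bool stale[TOY_SLOTS] = {true, false, false, false};  /* pre-DELSLOTS */
    const bool fresh[TOY_SLOTS] = {false, false, false, false}; /* post-DELSLOTS */

    toy_delslot(&r1_view, 0);            /* R1 clears slot 0 locally */
    toy_on_gossip(&r1_view, &r0, stale); /* stale packet rebinds slot 0 to R0 */
    toy_on_gossip(&r1_view, &r0, fresh); /* later packet only sets the marker */
    return toy_full_coverage_ok(&r1_view) && r1_view.owner_not_claiming[0];
}
```

In this model `toy_race_keeps_ok()` returns true: once the stale packet has rebound slot 0, no later packet from R0 unbinds it, so the coverage check never sees a NULL owner.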
Rather than fighting the protocol's intentional asymmetry around
"soft delete" via gossip, just retry the DELSLOTS pass until all
three nodes converge to FAIL. This keeps the test focused on the
CLUSTERSCAN error semantics it actually wants to verify.
This closes #3891. The test was added in #3674.
Signed-off-by: Binbin <binloveplay1314@qq.com>
valkeyrie-ops Bot pushed a commit that referenced this pull request Jun 13, 2026
valkeyrie-ops Bot pushed a commit that referenced this pull request Jun 17, 2026