Stabilize CLUSTERSCAN unassigned-slot test by retrying DELSLOTS (#3959)
The Case 3 portion of the test was flaky: after a single round of
`CLUSTER DELSLOTS 0` on R0/R1/R2, the cluster could stay in OK state
and `wait_for_cluster_state fail` would time out with
`Cluster node 1 cluster_state:ok`.
The race is between R0's local DELSLOTS and the gossip already in
flight from R0. After R1 locally clears slot 0, a stale pre-DELSLOTS
packet from R0 (whose myslots still claims slot 0) hits the
isSlotUnclaimed fast path in clusterUpdateSlotsConfigWith and rebinds
slot 0 back to R0 on R1. See:
```
if (isSlotUnclaimed(j) ||
    server.cluster->slots[j]->configEpoch < senderConfigEpoch ||
    clusterSlotFailoverGranted(j)) {
    ...
    clusterDelSlot(j);
    clusterAddSlot(sender, j);
    ...
}
```
R0's subsequent "no longer claiming" PINGs cannot undo this, because
that path only sets owner_not_claiming_slot and never clears slots[j]:
```
if (server.cluster->slots[j] == sender) {
    /* The slot is currently bound to the sender but the sender is no longer
     * claiming it. We don't want to unbind the slot yet as it can cause the cluster
     * to move to FAIL state and also throw client error. Keeping the slot bound to
     * the previous owner will cause a few client side redirects, but won't throw
     * any errors. We will keep track of the uncertainty in ownership to avoid
     * propagating misinformation about this slot's ownership using UPDATE
     * messages. */
    bitmapSetBit(server.cluster->owner_not_claiming_slot, j);
}
```
Because clusterUpdateState's full-coverage check only tests for
slots[j] == NULL, and slot 0 is bound to R0 again, R1 stays in
cluster OK state indefinitely:
```
if (server.cluster->slots[j] == NULL || ...) {
    new_state = CLUSTER_FAIL;
    ...
}
```
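The three excerpts above can be replayed in a small standalone model. This is only a sketch: `toy_node`, `toy_view` and the `toy_*` functions are hypothetical names invented for illustration, and they reduce the real logic to a single slot as seen from R1. They are not the server's data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of one node's view of a single slot. All names are
 * illustrative and do not match the real server structures. */
typedef struct toy_node { const char *name; } toy_node;

typedef struct toy_view {
    const toy_node *slot_owner; /* models slots[j]; NULL means unassigned */
    bool owner_not_claiming;    /* models the owner_not_claiming_slot bit */
} toy_view;

/* Models a local CLUSTER DELSLOTS: the slot becomes unassigned. */
static void toy_delslots(toy_view *v) {
    v->slot_owner = NULL;
    v->owner_not_claiming = false;
}

/* Models receiving a packet from `sender`. `claims_slot` says whether the
 * sender's slot bitmap still includes the slot. */
static void toy_receive(toy_view *v, const toy_node *sender, bool claims_slot) {
    assert(v != NULL && sender != NULL);
    if (claims_slot) {
        if (v->slot_owner == NULL) {
            /* Unclaimed fast path: bind the slot to the sender. */
            v->slot_owner = sender;
            v->owner_not_claiming = false;
        }
    } else if (v->slot_owner == sender) {
        /* Soft delete: record the doubt but keep the binding. */
        v->owner_not_claiming = true;
    }
}

/* Models the coverage check: only an unbound slot fails the cluster. */
static bool toy_state_ok(const toy_view *v) {
    return v->slot_owner != NULL;
}

/* Replays the problematic ordering as seen from R1. Returns true when R1
 * ends up in OK state with the slot still bound to R0. */
static bool toy_replay_race(void) {
    toy_node r0 = {"R0"};
    toy_view r1 = {&r0, false};
    toy_delslots(&r1);            /* R1 runs DELSLOTS 0 locally */
    toy_receive(&r1, &r0, true);  /* stale pre-DELSLOTS packet from R0 */
    toy_receive(&r1, &r0, false); /* later PING: R0 no longer claims it */
    return toy_state_ok(&r1) && r1.owner_not_claiming;
}
```

In this model `toy_replay_race()` returns true: the later PING only flips the flag, so the coverage check never sees an unbound slot.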
Rather than fighting the protocol's intentional asymmetry around
"soft delete" via gossip, just retry the DELSLOTS pass until all
three nodes converge to FAIL. This keeps the test focused on the
CLUSTERSCAN error semantics it actually wants to verify.
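As a rough illustration of why retrying converges, here is a minimal sketch of the retry shape in a toy model. The actual test is not written in C; `toy_cluster` and the `toy_*` helpers are hypothetical. The model assumes that once every node has run DELSLOTS, no new packet claiming slot 0 is produced, so only packets already in flight can rebind it.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NODES 3

/* Toy cluster: per node, whether slot 0 is bound locally, plus a count of
 * stale packets still in flight that claim the slot. Hypothetical model. */
typedef struct toy_cluster {
    bool slot_bound[TOY_NODES];
    int stale_packets;
} toy_cluster;

/* One DELSLOTS pass: every node unbinds slot 0 locally. */
static void toy_delslots_pass(toy_cluster *c) {
    for (int i = 0; i < TOY_NODES; i++) c->slot_bound[i] = false;
}

/* Deliver pending stale packets: they rebind the slot on node 1. No new
 * stale packets appear (model assumption: nobody claims the slot now). */
static void toy_deliver_gossip(toy_cluster *c) {
    if (c->stale_packets > 0) {
        c->slot_bound[1] = true;
        c->stale_packets = 0;
    }
}

/* True when every node sees slot 0 as unbound, i.e. reports FAIL. */
static bool toy_all_fail(const toy_cluster *c) {
    for (int i = 0; i < TOY_NODES; i++) {
        if (c->slot_bound[i]) return false;
    }
    return true;
}

/* Retry the pass until all nodes report FAIL. Returns the number of
 * passes used, or -1 if max_attempts is exhausted first. */
static int toy_delslots_until_fail(toy_cluster *c, int max_attempts) {
    assert(c != 0);
    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        toy_delslots_pass(c);
        toy_deliver_gossip(c);
        if (toy_all_fail(c)) return attempt;
    }
    return -1;
}
```

With one stale packet in flight the model needs two passes; with none, one pass is enough. A single attempt, as in the original test, fails whenever a stale packet is pending.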
This closes valkey-io#3891. The test was added in valkey-io#3674.
Signed-off-by: Binbin <binloveplay1314@qq.com>
Walkthrough: The PR stabilizes a cluster integration test by replacing a single DELSLOTS attempt with a retry loop that waits for all cluster nodes to converge to FAIL state before proceeding, reducing flakiness caused by timing-dependent race conditions.
Codecov Report
All modified and coverable lines are covered by tests.
Coverage diff, unstable vs. #3959:
Coverage: 76.68% -> 76.57% (-0.11%)
Files: 162 -> 162
Lines: 80729 -> 80753 (+24)
Hits: 61903 -> 61834 (-69)
Misses: 18826 -> 18919 (+93)
sarthakaggarwal97
left a comment
This was an interesting race condition. Thanks for fixing this.