Fix timing issue in cluster manual-takeover test#3826
Conversation
Tests sometimes fail here. If we add `after 1000` inside the foreach
loop, it becomes very easy to reproduce.
```
test "Use takeover to bring slaves back" {
foreach id $replica_ids {
R $id cluster failover takeover
}
}
[exception]: Executing test client: ERR You should send CLUSTER FAILOVER to a replica.
```
The reason is that the transition from the PFAIL state to the FAIL
state which requires a quorum, may take a little extra time. From a
node's own perspective, it detects PFAIL and consequently views the
cluster as being in a FAIL state; and then, once the replica-no-failover
setting is disabled, auto failover may kick in, thereby causing the
test to fail.
Since we are verifying manual failover takeover, it is best to keep
`cluster-replica-no-failover` disabled at all times.
Signed-off-by: Binbin <binloveplay1314@qq.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository UI Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (1)
💤 Files with no reviewable changes (1)
📝 WalkthroughWalkthroughThe PR removes replica failover reset logic from the manual takeover cluster test. Lines 29–30, which re-enabled replica failover on replicas 5, 6, 7 after pausing/terminating the majority of masters, are deleted. The test now proceeds from the kill phase directly to cluster-down assertions with replica failover disabled throughout. ChangesTest Flow Modification
Estimated code review effort🎯 1 (Trivial) | ⏱️ ~3 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## unstable #3826 +/- ##
============================================
- Coverage 76.94% 76.71% -0.24%
============================================
Files 162 162
Lines 80704 80704
============================================
- Hits 62100 61911 -189
- Misses 18604 18793 +189 🚀 New features to boost your workflow:
|
|
Thanks! |
Tests sometimes fail here. If we add
after 1000inside the foreachloop, it becomes very easy to reproduce.
The reason is that the transition from the PFAIL state to the FAIL
state which requires a quorum, may take a little extra time. From a
node's own perspective, it detects PFAIL and consequently views the
cluster as being in a FAIL state; and then, once the replica-no-failover
setting is disabled, auto failover may kick in, thereby causing the
test to fail.
Since we are verifying manual failover takeover, it is best to keep
cluster-replica-no-failoverdisabled at all times.