Increase timeouts in faster-failover test for slow CI runners #3463
… runners

The test 'The best replica can initiate an election immediately in an automatic failover' has been flaky on sanitizer and slow CI runners.

Changes:
- Increase FAIL detection wait from 10s to 50s (wait_for_condition 1000 50) since FAIL propagation takes longer on slow runners.
- Double psync_max_retries from 1200 to 2400 (normal) and 6000 to 12000 (valgrind) to give more time for partial resync log messages.

Fixes flaky test introduced by commit 6822a67.

Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
8ce488e to 9d76223
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##           unstable    #3463    +/-   ##
============================================
- Coverage     76.52%   76.51%   -0.01%
============================================
  Files           157      157
  Lines         79035    79035
============================================
- Hits          60478    60475       -3
- Misses        18557    18560       +3
dvkashapov
left a comment
I suggest we modify timeouts similar to this PR #3462 where we multiply timeouts if we're under sanitizer or valgrind, WDYT?
@dvkashapov Should we also scale server configs such as cluster-node-timeout and failover timeouts?
Yes, that would make sense to me. What's your opinion on that?
It can be complex to implement. :)
Scaling the timeouts according to the environment seems like a good idea. I can see if I can implement it.
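
To make the scaling idea discussed in this thread concrete, here is a minimal Python sketch (the test suite itself is Tcl; this is only an illustration). The `TIMEOUT_SCALE` table and the `scale` helper are hypothetical names. Only the 5x valgrind factor is suggested by this PR's figures (6000 vs. 1200 retries); the sanitizer factor is an assumed placeholder.

```python
# Hypothetical per-environment multipliers. Only the valgrind factor (5x)
# is suggested by the PR's numbers (6000 vs. 1200 retries); the sanitizer
# value is an assumed placeholder, not a measured one.
TIMEOUT_SCALE = {"normal": 1, "sanitizer": 3, "valgrind": 5}


def scale(base: int, env: str) -> int:
    """Scale a retry count or a millisecond timeout for the given environment."""
    return base * TIMEOUT_SCALE[env]


# Test-side budget: psync retries after this PR (2400 on normal runners).
psync_retries = {env: scale(2400, env) for env in TIMEOUT_SCALE}

# Server-side config: cluster-node-timeout (5000 ms in this test) could be
# scaled the same way, which is the question raised in the thread.
node_timeout_ms = {env: scale(5000, env) for env in TIMEOUT_SCALE}

if __name__ == "__main__":
    for env in TIMEOUT_SCALE:
        print(env, psync_retries[env], node_timeout_ms[env])
```

Keeping the multipliers in one table means a base value is written once per test and the environment decides the final budget. Scaling server configs as well would also change the timing behavior the test exercises, which may be part of why it was called complex to implement.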
…-io#3463)

Two changes to tests/unit/cluster/faster-failover.tcl:
1. FAIL detection timeout: `wait_for_condition 1000 10` → `1000 50` (10s → 50s)
2. psync_max_retries: 1200 → 2400 normal (120s → 240s), 6000 → 12000 valgrind (600s → 1200s)

The test `The best replica can initiate an election immediately in an automatic failover` in `tests/unit/cluster/faster-failover.tcl` has been flaky since it was introduced on March 27, 2026 by valkey-io#2227.

**Frequency:** 8 out of 15 days (Mar 27 – Apr 8), across valgrind, sanitizer, and slow CI runners.

**Common errors:**
- `log message of "Successful partial resynchronization with primary" not found` (timeout waiting for psync)
- `expected pattern found in srv -N log file: *best ranked replica*` (timeout waiting for FAIL propagation)

The test spins up a 12-node cluster (5 primaries + 7 replicas), pauses nodes, and waits for FAIL detection to propagate across all nodes before failover + partial resync.

A previous fix attempt valkey-io#3424 increased the psync timeout from 50s to 120s (600s valgrind), which reduced frequency but did not eliminate it.

Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
Problem
The test `The best replica can initiate an election immediately in an automatic failover` in `tests/unit/cluster/faster-failover.tcl` has been flaky since it was introduced on March 27, 2026 by #2227.

**Frequency:** 8 out of 15 days (Mar 27 – Apr 8), across valgrind, sanitizer, and slow CI runners.

**Common errors:**
- `log message of "Successful partial resynchronization with primary" not found` (timeout waiting for psync)
- `expected pattern found in srv -N log file: *best ranked replica*` (timeout waiting for FAIL propagation)

Example failing runs:
A previous fix attempt (#3424) increased the psync timeout from 50s to 120s (600s valgrind), which reduced frequency but did not eliminate it.
Root Cause
The test spins up a 12-node cluster (5 primaries + 7 replicas), pauses nodes, and waits for FAIL detection to propagate across all nodes before failover + partial resync. The original timeouts were too tight for slow CI environments:
- `wait_for_condition 1000 10` = 10s max. With `cluster-node-timeout 5000` and 12 nodes exchanging gossip, slow runners need more time.

Fix
Two changes to `tests/unit/cluster/faster-failover.tcl`:
1. FAIL detection timeout: `wait_for_condition 1000 10` → `1000 50` (10s → 50s)
2. psync_max_retries: `1200` → `2400` normal (120s → 240s), `6000` → `12000` valgrind (600s → 1200s)

Testing
Ran `unit/cluster/faster-failover` for 100 loops on valgrind, sanitizer, and ubuntu runners, and all passed:
- `--loops 100 --single unit/cluster/faster-failover` on `sanitizer-address (clang/gcc)`, `sanitizer-undefined (clang/gcc)`, `sanitizer-force-defrag`, `ubuntu-jemalloc`, `ubuntu-arm`
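
The timeout figures quoted in this PR can be cross-checked with a short Python sketch. It treats `wait_for_condition <tries> <delay>` as a polling loop of up to `tries` attempts spaced `delay` milliseconds apart, which matches the "1000 10 = 10s" figure above. The 100 ms psync retry interval is an assumption inferred from "1200 retries = 120s", not read from the test file, and `max_wait_seconds` is a hypothetical helper name.

```python
def max_wait_seconds(max_tries: int, delay_ms: int) -> float:
    """Upper bound on a polling loop: max_tries attempts, delay_ms apart."""
    return max_tries * delay_ms / 1000.0


# Assumption: 100 ms between psync retries, inferred from the PR's own
# figures (1200 retries are described as 120s).
PSYNC_RETRY_DELAY_MS = 100

budgets = {
    "FAIL detection (old)": max_wait_seconds(1000, 10),
    "FAIL detection (new)": max_wait_seconds(1000, 50),
    "psync normal (old)": max_wait_seconds(1200, PSYNC_RETRY_DELAY_MS),
    "psync normal (new)": max_wait_seconds(2400, PSYNC_RETRY_DELAY_MS),
    "psync valgrind (old)": max_wait_seconds(6000, PSYNC_RETRY_DELAY_MS),
    "psync valgrind (new)": max_wait_seconds(12000, PSYNC_RETRY_DELAY_MS),
}

if __name__ == "__main__":
    for name, seconds in budgets.items():
        print(f"{name}: {seconds:.0f}s")
```

With `cluster-node-timeout 5000`, the old 10s FAIL-detection budget left about two node-timeout periods for failure reports to propagate across 12 nodes, while the new 50s budget leaves about ten.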