Increase timeouts in faster-failover test for slow CI runners by roshkhatri · Pull Request #3463 · valkey-io/valkey

roshkhatri · 2026-04-08T21:38:35Z

Problem

The test The best replica can initiate an election immediately in an automatic failover in tests/unit/cluster/faster-failover.tcl has been flaky since it was introduced on March 27, 2026 by #2227.

Frequency: 8 out of 15 days (Mar 27 – Apr 8), across valgrind, sanitizer, and slow CI runners.

Common errors:

log message of "Successful partial resynchronization with primary" not found (timeout waiting for psync)
expected pattern found in srv -N log file: *best ranked replica* (timeout waiting for FAIL propagation)

Example failing runs:

A previous fix attempt (#3424) increased the psync timeout from 50s to 120s (600s valgrind), which reduced frequency but did not eliminate it.

Root Cause

The test spins up a 12-node cluster (5 primaries + 7 replicas), pauses nodes, and waits for FAIL detection to propagate across all nodes before failover + partial resync. The original timeouts were too tight for slow CI environments:

FAIL propagation: wait_for_condition 1000 10 = 10s max. With cluster-node-timeout 5000 and 12 nodes exchanging gossip, slow runners need more time.
Partial resync: Even after Increase timeout in flaky "failover immediately" test case #3424 bumped to 120s/600s, sanitizer runners still occasionally exceed this.

Fix

Two changes to tests/unit/cluster/faster-failover.tcl:

FAIL detection timeout: wait_for_condition 1000 10 → 1000 50 (10s → 50s)
psync_max_retries: 1200 → 2400 normal (120s → 240s), 6000 → 12000 valgrind (600s → 1200s)

Testing

Ran unit/cluster/faster-failover 100 loops on valgrind, sanitizer, and ubuntu runners — all passed:

Workflow run: https://github.com/roshkhatri/valkey/actions/runs/24158848194
Config: --loops 100 --single unit/cluster/faster-failover on sanitizer-address (clang/gcc), sanitizer-undefined (clang/gcc), sanitizer-force-defrag, ubuntu-jemalloc, ubuntu-arm
Result: 7/7 jobs ✅, zero failures across 700 total test iterations

… runners The test 'The best replica can initiate an election immediately in an automatic failover' has been flaky on sanitizer and slow CI runners. Changes: - Increase FAIL detection wait from 10s to 50s (wait_for_condition 1000 50) since FAIL propagation takes longer on slow runners. - Double psync_max_retries from 1200 to 2400 (normal) and 6000 to 12000 (valgrind) to give more time for partial resync log messages. Fixes flaky test introduced by commit 6822a67. Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>

codecov · 2026-04-08T22:02:53Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.51%. Comparing base (dba128f) to head (9d76223).
⚠️ Report is 4 commits behind head on unstable.

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable    #3463      +/-   ##
============================================
- Coverage     76.52%   76.51%   -0.01%     
============================================
  Files           157      157              
  Lines         79035    79035              
============================================
- Hits          60478    60475       -3     
- Misses        18557    18560       +3

see 14 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

dvkashapov

I suggest we modify timeouts similar to this PR #3462 where we multiply timeouts if we're under sanitizer or valgrind, WDYT?

zuiderkwast · 2026-04-09T13:06:25Z

I suggest we modify timeouts similar to this PR #3462 where we multiply timeouts if we're under sanitizer or valgrind, WDYT?

@dvkashapov Should we scale also server configs such as cluster-node-timeout and failover timeouts?

dvkashapov · 2026-04-09T14:34:47Z

@dvkashapov Should we scale also server configs such as cluster-node-timeout and failover timeouts?

Yes, that would make sense for me, what's your opinion on that?

zuiderkwast · 2026-04-09T14:54:26Z

Should we scale also server configs such as cluster-node-timeout and failover timeouts?

Yes, that would make sense for me, what's your opinion on that?

It can be complex to implement it. :)

roshkhatri · 2026-04-09T17:15:17Z

This seems like a good idea to scale the timeouts according to the environments. I can see if I can implement it

…-io#3463) Two changes to tests/unit/cluster/faster-failover.tcl: 1. FAIL detection timeout: `wait_for_condition 1000 10` → `1000 50` (10s → 50s) 2. psync_max_retries: 1200 → 2400 normal (120s → 240s), 6000 → 12000 valgrind (600s → 1200s) The test `The best replica can initiate an election immediately in an automatic failover` in `tests/unit/cluster/faster-failover.tcl` has been flaky since it was introduced on March 27, 2026 by valkey-io#2227. **Frequency:** 8 out of 15 days (Mar 27 – Apr 8), across valgrind, sanitizer, and slow CI runners. **Common errors:** - `log message of "Successful partial resynchronization with primary" not found` (timeout waiting for psync) - `expected pattern found in srv -N log file: *best ranked replica*` (timeout waiting for FAIL propagation) The test spins up a 12-node cluster (5 primaries + 7 replicas), pauses nodes, and waits for FAIL detection to propagate across all nodes before failover + partial resync. A previous fix attempt valkey-io#3424 increased the psync timeout from 50s to 120s (600s valgrind), which reduced frequency but did not eliminate it. Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>

Two changes to tests/unit/cluster/faster-failover.tcl: 1. FAIL detection timeout: `wait_for_condition 1000 10` → `1000 50` (10s → 50s) 2. psync_max_retries: 1200 → 2400 normal (120s → 240s), 6000 → 12000 valgrind (600s → 1200s) The test `The best replica can initiate an election immediately in an automatic failover` in `tests/unit/cluster/faster-failover.tcl` has been flaky since it was introduced on March 27, 2026 by #2227. **Frequency:** 8 out of 15 days (Mar 27 – Apr 8), across valgrind, sanitizer, and slow CI runners. **Common errors:** - `log message of "Successful partial resynchronization with primary" not found` (timeout waiting for psync) - `expected pattern found in srv -N log file: *best ranked replica*` (timeout waiting for FAIL propagation) The test spins up a 12-node cluster (5 primaries + 7 replicas), pauses nodes, and waits for FAIL detection to propagate across all nodes before failover + partial resync. A previous fix attempt #3424 increased the psync timeout from 50s to 120s (600s valgrind), which reduced frequency but did not eliminate it. Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>

github-actions Bot assigned roshkhatri Apr 8, 2026

roshkhatri force-pushed the fix/faster-failover-flaky branch from 8ce488e to 9d76223 Compare April 8, 2026 21:40

roshkhatri requested a review from zuiderkwast April 8, 2026 21:44

roshkhatri marked this pull request as ready for review April 8, 2026 21:45

roshkhatri requested a review from sarthakaggarwal97 April 8, 2026 21:45

enjoy-binbin approved these changes Apr 9, 2026

View reviewed changes

sarthakaggarwal97 approved these changes Apr 9, 2026

View reviewed changes

dvkashapov reviewed Apr 9, 2026

View reviewed changes

zuiderkwast approved these changes Apr 9, 2026

View reviewed changes

zuiderkwast merged commit b020b03 into valkey-io:unstable Apr 9, 2026
58 checks passed

zuiderkwast added the test-failure An issue indicating a test failure label Apr 9, 2026

sarthakaggarwal97 mentioned this pull request Apr 14, 2026

Merge unstable into 9.1 #3507

Closed

sarthakaggarwal97 mentioned this pull request Apr 16, 2026

Backport Unstable to 9.1 for RC2 #3519

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Increase timeouts in faster-failover test for slow CI runners#3463

Increase timeouts in faster-failover test for slow CI runners#3463
zuiderkwast merged 1 commit into
valkey-io:unstablefrom
roshkhatri:fix/faster-failover-flaky

roshkhatri commented Apr 8, 2026 •

edited

Loading

Uh oh!

codecov Bot commented Apr 8, 2026 •

edited

Loading

Uh oh!

dvkashapov left a comment

Uh oh!

zuiderkwast commented Apr 9, 2026

Uh oh!

Uh oh!

dvkashapov commented Apr 9, 2026

Uh oh!

zuiderkwast commented Apr 9, 2026

Uh oh!

roshkhatri commented Apr 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Uh oh!

Conversation

roshkhatri commented Apr 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Root Cause

Fix

Testing

Uh oh!

codecov Bot commented Apr 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

dvkashapov left a comment

Choose a reason for hiding this comment

Uh oh!

zuiderkwast commented Apr 9, 2026

Uh oh!

Uh oh!

dvkashapov commented Apr 9, 2026

Uh oh!

zuiderkwast commented Apr 9, 2026

Uh oh!

roshkhatri commented Apr 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

roshkhatri commented Apr 8, 2026 •

edited

Loading

codecov Bot commented Apr 8, 2026 •

edited

Loading