Increase timeout in flaky "failover immediately" test case by zuiderkwast · Pull Request #3424 · valkey-io/valkey

zuiderkwast · 2026-03-31T20:52:05Z

The test case "The best replica can initiate an election immediately test" has been failing in CI jobs.

Increase the timeout to account for slow runners.

Old waiting time: 50 seconds. New waiting time: 120 seconds, with valgrind: 600 seconds.
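To illustrate the kind of wait being tuned here, the following is a minimal sketch of a poll-with-retries helper, not the actual Tcl test code. The names `wait_for_condition` (modelled loosely on the test suite's helper of the same name) and `psync_wait_seconds` are hypothetical stand-ins; only the 120-second and 600-second figures come from this PR.

```python
import time
from typing import Callable


def wait_for_condition(cond: Callable[[], bool], max_tries: int, delay_ms: int) -> None:
    """Poll cond up to max_tries times, sleeping delay_ms between attempts.

    Raises TimeoutError if cond never becomes true.
    """
    for _ in range(max_tries):
        if cond():
            return
        time.sleep(delay_ms / 1000)
    total_s = max_tries * delay_ms / 1000
    raise TimeoutError(f"condition not met after about {total_s:.0f}s")


def psync_wait_seconds(valgrind: bool) -> int:
    """Hypothetical helper: pick the waiting time stated in this PR."""
    # 120 seconds on normal runners, 600 seconds under valgrind.
    return 600 if valgrind else 120
```

A larger budget only changes how long a failing run takes to time out; a passing run still returns as soon as the condition holds.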

The test was introduced in #2227.

…tely test Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

codecov · 2026-03-31T21:17:27Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.52%. Comparing base (9586093) to head (6b464da).
⚠️ Report is 2 commits behind head on unstable.

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable    #3424      +/-   ##
============================================
- Coverage     76.55%   76.52%   -0.03%     
============================================
  Files           159      159              
  Lines         79703    79703              
============================================
- Hits          61017    60995      -22     
- Misses        18686    18708      +22

see 22 files with indirect coverage changes

rainsupreme

LGTM!

Nikhil-Manglore · 2026-03-31T22:31:49Z

Looks like the memory efficiency tests are still failing. Another option is to revert the commit. What are your thoughts on that?

enjoy-binbin

Thanks.

Two changes to tests/unit/cluster/faster-failover.tcl:

1. FAIL detection timeout: `wait_for_condition 1000 10` → `1000 50` (10s → 50s)
2. psync_max_retries: 1200 → 2400 normal (120s → 240s), 6000 → 12000 valgrind (600s → 1200s)

The test `The best replica can initiate an election immediately in an automatic failover` in `tests/unit/cluster/faster-failover.tcl` has been flaky since it was introduced on March 27, 2026 by #2227.

**Frequency:** 8 out of 15 days (Mar 27 – Apr 8), across valgrind, sanitizer, and slow CI runners.

**Common errors:**
- `log message of "Successful partial resynchronization with primary" not found` (timeout waiting for psync)
- `expected pattern found in srv -N log file: *best ranked replica*` (timeout waiting for FAIL propagation)

The test spins up a 12-node cluster (5 primaries + 7 replicas), pauses nodes, and waits for FAIL detection to propagate across all nodes before failover + partial resync. A previous fix attempt #3424 increased the psync timeout from 50s to 120s (600s valgrind), which reduced frequency but did not eliminate it.

Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
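The timeout figures quoted above follow from multiplying the retry count by the delay between retries. The sketch below checks that arithmetic. The 10 ms and 50 ms delays are taken from the `wait_for_condition` arguments; the 100 ms delay for the psync wait is an assumption inferred from "1200 → 120s" and is not stated in the source. The function and dictionary names are hypothetical.

```python
def timeout_seconds(max_tries: int, delay_ms: int) -> float:
    """Worst-case wait for a polling loop: tries multiplied by per-try delay."""
    return max_tries * delay_ms / 1000


# Assumption: inferred from the quoted figures (1200 tries -> 120 s).
PSYNC_DELAY_MS = 100

budgets = {
    "FAIL detection, old": timeout_seconds(1000, 10),
    "FAIL detection, new": timeout_seconds(1000, 50),
    "psync normal, old": timeout_seconds(1200, PSYNC_DELAY_MS),
    "psync normal, new": timeout_seconds(2400, PSYNC_DELAY_MS),
    "psync valgrind, old": timeout_seconds(6000, PSYNC_DELAY_MS),
    "psync valgrind, new": timeout_seconds(12000, PSYNC_DELAY_MS),
}

for name, seconds in budgets.items():
    print(f"{name}: {seconds:.0f}s")
```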

Increase timeout in The best replica can initiate an election immedia…

6b464da

…tely test Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>

github-actions Bot assigned zuiderkwast Mar 31, 2026

zuiderkwast requested a review from enjoy-binbin March 31, 2026 20:52

rainsupreme mentioned this pull request Mar 31, 2026

flaky immediate-failover test fix: match full sync as well as partial sync #3426

Closed

zuiderkwast marked this pull request as ready for review March 31, 2026 21:41

zuiderkwast added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Mar 31, 2026

rainsupreme approved these changes Mar 31, 2026

enjoy-binbin approved these changes Apr 1, 2026

zuiderkwast merged commit 2e12d27 into valkey-io:unstable Apr 1, 2026
80 of 101 checks passed

zuiderkwast deleted the flaky-failover-immediately branch April 1, 2026 07:55

rainsupreme mentioned this pull request Apr 1, 2026

Revert "Do the failover immediately if the replica is the best ranked replica" #3431

Closed

nmvk mentioned this pull request Apr 6, 2026

Skip faster-failover test under TLS #3444

Merged

roshkhatri mentioned this pull request Apr 8, 2026

Increase timeouts in faster-failover test for slow CI runners #3463

Merged

sarthakaggarwal97 mentioned this pull request Apr 14, 2026

Merge unstable into 9.1 #3507

Closed

sarthakaggarwal97 mentioned this pull request Apr 16, 2026

Backport Unstable to 9.1 for RC2 #3519

Merged

zuiderkwast merged 1 commit into valkey-io:unstable from zuiderkwast:flaky-failover-immediately