Skip to content

Increase timeouts in faster-failover test for slow CI runners#3463

Merged
zuiderkwast merged 1 commit into
valkey-io:unstablefrom
roshkhatri:fix/faster-failover-flaky
Apr 9, 2026
Merged

Increase timeouts in faster-failover test for slow CI runners#3463
zuiderkwast merged 1 commit into
valkey-io:unstablefrom
roshkhatri:fix/faster-failover-flaky

Conversation

@roshkhatri

@roshkhatri roshkhatri commented Apr 8, 2026

Copy link
Copy Markdown
Member

Problem

The test The best replica can initiate an election immediately in an automatic failover in tests/unit/cluster/faster-failover.tcl has been flaky since it was introduced on March 27, 2026 by #2227.

Frequency: 8 out of 15 days (Mar 27 – Apr 8), across valgrind, sanitizer, and slow CI runners.

Common errors:

  • log message of "Successful partial resynchronization with primary" not found (timeout waiting for psync)
  • expected pattern found in srv -N log file: *best ranked replica* (timeout waiting for FAIL propagation)

Example failing runs:

A previous fix attempt (#3424) increased the psync timeout from 50s to 120s (600s valgrind), which reduced frequency but did not eliminate it.

Root Cause

The test spins up a 12-node cluster (5 primaries + 7 replicas), pauses nodes, and waits for FAIL detection to propagate across all nodes before failover + partial resync. The original timeouts were too tight for slow CI environments:

  1. FAIL propagation: wait_for_condition 1000 10 = 10s max. With cluster-node-timeout 5000 and 12 nodes exchanging gossip, slow runners need more time.
  2. Partial resync: Even after Increase timeout in flaky "failover immediately" test case #3424 bumped to 120s/600s, sanitizer runners still occasionally exceed this.

Fix

Two changes to tests/unit/cluster/faster-failover.tcl:

  1. FAIL detection timeout: wait_for_condition 1000 101000 50 (10s → 50s)
  2. psync_max_retries: 12002400 normal (120s → 240s), 600012000 valgrind (600s → 1200s)

Testing

Ran unit/cluster/faster-failover 100 loops on valgrind, sanitizer, and ubuntu runners — all passed:

  • Workflow run: https://github.com/roshkhatri/valkey/actions/runs/24158848194
  • Config: --loops 100 --single unit/cluster/faster-failover on sanitizer-address (clang/gcc), sanitizer-undefined (clang/gcc), sanitizer-force-defrag, ubuntu-jemalloc, ubuntu-arm
  • Result: 7/7 jobs ✅, zero failures across 700 total test iterations

… runners

The test 'The best replica can initiate an election immediately in an
automatic failover' has been flaky on sanitizer and slow CI runners.

Changes:
- Increase FAIL detection wait from 10s to 50s (wait_for_condition
  1000 50) since FAIL propagation takes longer on slow runners.
- Double psync_max_retries from 1200 to 2400 (normal) and 6000 to
  12000 (valgrind) to give more time for partial resync log messages.

Fixes flaky test introduced by commit 6822a67.

Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
@roshkhatri roshkhatri force-pushed the fix/faster-failover-flaky branch from 8ce488e to 9d76223 Compare April 8, 2026 21:40
@roshkhatri roshkhatri requested a review from zuiderkwast April 8, 2026 21:44
@roshkhatri roshkhatri marked this pull request as ready for review April 8, 2026 21:45
@codecov

codecov Bot commented Apr 8, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.51%. Comparing base (dba128f) to head (9d76223).
⚠️ Report is 4 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #3463      +/-   ##
============================================
- Coverage     76.52%   76.51%   -0.01%     
============================================
  Files           157      157              
  Lines         79035    79035              
============================================
- Hits          60478    60475       -3     
- Misses        18557    18560       +3     

see 14 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@dvkashapov dvkashapov left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest we modify timeouts similar to this PR #3462 where we multiply timeouts if we're under sanitizer or valgrind, WDYT?

@zuiderkwast

Copy link
Copy Markdown
Contributor

I suggest we modify timeouts similar to this PR #3462 where we multiply timeouts if we're under sanitizer or valgrind, WDYT?

@dvkashapov Should we scale also server configs such as cluster-node-timeout and failover timeouts?

@zuiderkwast zuiderkwast merged commit b020b03 into valkey-io:unstable Apr 9, 2026
58 checks passed
@zuiderkwast zuiderkwast added the test-failure An issue indicating a test failure label Apr 9, 2026
@dvkashapov

Copy link
Copy Markdown
Member

@dvkashapov Should we scale also server configs such as cluster-node-timeout and failover timeouts?

Yes, that would make sense for me, what's your opinion on that?

@zuiderkwast

Copy link
Copy Markdown
Contributor

Should we scale also server configs such as cluster-node-timeout and failover timeouts?

Yes, that would make sense for me, what's your opinion on that?

It can be complex to implement it. :)

@roshkhatri

Copy link
Copy Markdown
Member Author

This seems like a good idea to scale the timeouts according to the environments. I can see if I can implement it

sarthakaggarwal97 pushed a commit to sarthakaggarwal97/valkey that referenced this pull request Apr 16, 2026
…-io#3463)

Two changes to tests/unit/cluster/faster-failover.tcl:

1. FAIL detection timeout: `wait_for_condition 1000 10` → `1000 50`
   (10s → 50s)
2. psync_max_retries: 1200 → 2400 normal (120s → 240s),
   6000 → 12000 valgrind (600s → 1200s)

The test `The best replica can initiate an election immediately in an
automatic failover` in `tests/unit/cluster/faster-failover.tcl` has been
flaky since it was introduced on March 27, 2026 by valkey-io#2227.

**Frequency:** 8 out of 15 days (Mar 27 – Apr 8), across valgrind,
sanitizer, and slow CI runners.

**Common errors:**
- `log message of "Successful partial resynchronization with primary"
  not found` (timeout waiting for psync)
- `expected pattern found in srv -N log file: *best ranked replica*`
  (timeout waiting for FAIL propagation)

The test spins up a 12-node cluster (5 primaries + 7 replicas), pauses
nodes, and waits for FAIL detection to propagate across all nodes before
failover + partial resync. 

A previous fix attempt valkey-io#3424 increased the psync timeout from 50s to
120s (600s valgrind), which reduced frequency but did not eliminate it.

Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
madolson pushed a commit that referenced this pull request Apr 27, 2026
Two changes to tests/unit/cluster/faster-failover.tcl:

1. FAIL detection timeout: `wait_for_condition 1000 10` → `1000 50`
   (10s → 50s)
2. psync_max_retries: 1200 → 2400 normal (120s → 240s),
   6000 → 12000 valgrind (600s → 1200s)

The test `The best replica can initiate an election immediately in an
automatic failover` in `tests/unit/cluster/faster-failover.tcl` has been
flaky since it was introduced on March 27, 2026 by #2227.

**Frequency:** 8 out of 15 days (Mar 27 – Apr 8), across valgrind,
sanitizer, and slow CI runners.

**Common errors:**
- `log message of "Successful partial resynchronization with primary"
  not found` (timeout waiting for psync)
- `expected pattern found in srv -N log file: *best ranked replica*`
  (timeout waiting for FAIL propagation)

The test spins up a 12-node cluster (5 primaries + 7 replicas), pauses
nodes, and waits for FAIL detection to propagate across all nodes before
failover + partial resync. 

A previous fix attempt #3424 increased the psync timeout from 50s to
120s (600s valgrind), which reduced frequency but did not eliminate it.

Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

test-failure An issue indicating a test failure

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants