Increase timeout in flaky "failover immediately" test case #3424

Merged
zuiderkwast merged 1 commit into valkey-io:unstable from zuiderkwast:flaky-failover-immediately
Apr 1, 2026

Conversation

@zuiderkwast

Contributor

The test case "The best replica can initiate an election immediately test" has been failing in CI jobs.

Increase the timeout to account for slow runners.

Old waiting time: 50 seconds. New waiting time: 120 seconds, with valgrind: 600 seconds.

Introduced in #2227.

…tely test

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
@codecov

codecov Bot commented Mar 31, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.52%. Comparing base (9586093) to head (6b464da).
⚠️ Report is 2 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #3424      +/-   ##
============================================
- Coverage     76.55%   76.52%   -0.03%     
============================================
  Files           159      159              
  Lines         79703    79703              
============================================
- Hits          61017    60995      -22     
- Misses        18686    18708      +22     

see 22 files with indirect coverage changes

@zuiderkwast zuiderkwast marked this pull request as ready for review March 31, 2026 21:41
@zuiderkwast zuiderkwast added the run-extra-tests Run extra tests on this PR (Runs all tests from daily except valgrind and RESP) label Mar 31, 2026

@rainsupreme rainsupreme left a comment

Contributor

LGTM!

@Nikhil-Manglore

Nikhil-Manglore commented Mar 31, 2026

Member

Looks like the memory efficiency tests are still failing. Another option is to revert the commit. What are your thoughts on that?

@enjoy-binbin enjoy-binbin left a comment

Member

Thanks.

@zuiderkwast zuiderkwast merged commit 2e12d27 into valkey-io:unstable Apr 1, 2026
80 of 101 checks passed
@zuiderkwast zuiderkwast deleted the flaky-failover-immediately branch April 1, 2026 07:55
Nikhil-Manglore pushed a commit to Nikhil-Manglore/valkey that referenced this pull request Apr 7, 2026
…#3424)

zuiderkwast pushed a commit that referenced this pull request Apr 9, 2026
Two changes to tests/unit/cluster/faster-failover.tcl:

1. FAIL detection timeout: `wait_for_condition 1000 10` → `1000 50`
   (10s → 50s)
2. psync_max_retries: 1200 → 2400 normal (120s → 240s),
   6000 → 12000 valgrind (600s → 1200s)
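
The durations in parentheses are retry count multiplied by polling interval. A minimal Python sketch of that arithmetic, assuming the second argument of `wait_for_condition` is a per-try delay in milliseconds and that each psync retry waits 100 ms (an interval inferred from "1200 → 120s", not read from the test file). The helper name `total_wait_seconds` is hypothetical and not part of the Valkey test suite:

```python
def total_wait_seconds(max_tries: int, delay_ms: int) -> float:
    """Upper bound on how long a poll loop waits: tries times per-try delay."""
    return max_tries * delay_ms / 1000


# FAIL detection: wait_for_condition 1000 10 -> 1000 50
fail_old = total_wait_seconds(1000, 10)
fail_new = total_wait_seconds(1000, 50)

# psync_max_retries, assuming 100 ms per retry (inferred, see lead-in)
PSYNC_DELAY_MS = 100
psync_normal_old = total_wait_seconds(1200, PSYNC_DELAY_MS)
psync_normal_new = total_wait_seconds(2400, PSYNC_DELAY_MS)
psync_valgrind_old = total_wait_seconds(6000, PSYNC_DELAY_MS)
psync_valgrind_new = total_wait_seconds(12000, PSYNC_DELAY_MS)

print(fail_old, fail_new)                      # 10.0 50.0
print(psync_normal_old, psync_normal_new)      # 120.0 240.0
print(psync_valgrind_old, psync_valgrind_new)  # 600.0 1200.0
```

These are worst-case bounds: a poll loop of this kind normally returns as soon as the condition holds, so a larger budget only costs time when the test is already failing.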

The test `The best replica can initiate an election immediately in an
automatic failover` in `tests/unit/cluster/faster-failover.tcl` has been
flaky since it was introduced on March 27, 2026 by #2227.

**Frequency:** 8 out of 15 days (Mar 27 – Apr 8), across valgrind,
sanitizer, and slow CI runners.

**Common errors:**
- `log message of "Successful partial resynchronization with primary"
  not found` (timeout waiting for psync)
- `expected pattern found in srv -N log file: *best ranked replica*`
  (timeout waiting for FAIL propagation)

The test spins up a 12-node cluster (5 primaries + 7 replicas), pauses
nodes, and waits for FAIL detection to propagate across all nodes before
failover + partial resync. 
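
The "wait for FAIL detection to propagate" step is a bounded poll over every node's view of the cluster. A rough Python sketch of that pattern follows; it is not the Tcl helper used by the test. The node views, the `paused-node` key, and the simulated gossip are all hypothetical stand-ins:

```python
import time
from typing import Callable


def wait_for_condition(max_tries: int, delay_ms: int,
                       predicate: Callable[[], bool]) -> bool:
    """Poll predicate up to max_tries times, sleeping delay_ms between tries."""
    for _ in range(max_tries):
        if predicate():
            return True
        time.sleep(delay_ms / 1000)
    return False


def all_nodes_see_fail(views: list[dict[str, bool]], paused: str) -> bool:
    """True once every observer's view marks the paused node as failed."""
    return all(view.get(paused, False) for view in views)


# Hypothetical: four observer nodes, none of which has seen the failure yet.
views = [{"paused-node": False} for _ in range(4)]
polls = {"n": 0}


def predicate() -> bool:
    # Simulated gossip: one more observer learns of the failure on each poll.
    if polls["n"] < len(views):
        views[polls["n"]]["paused-node"] = True
    polls["n"] += 1
    return all_nodes_see_fail(views, "paused-node")


ok = wait_for_condition(1000, 1, predicate)
print(ok, polls["n"])
```

The condition is only satisfied when the slowest observer catches up, which is why a slow runner can exhaust a small retry budget even though most nodes converge quickly.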

A previous fix attempt #3424 increased the psync timeout from 50s to
120s (600s valgrind), which reduced frequency but did not eliminate it.

Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
sarthakaggarwal97 pushed a commit to sarthakaggarwal97/valkey that referenced this pull request Apr 16, 2026
…#3424)

sarthakaggarwal97 pushed a commit to sarthakaggarwal97/valkey that referenced this pull request Apr 16, 2026
…-io#3463)

madolson pushed a commit that referenced this pull request Apr 27, 2026
madolson pushed a commit that referenced this pull request Apr 27, 2026
roshkhatri pushed a commit to roshkhatri/valkey that referenced this pull request May 26, 2026
…#3424)

Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
4 participants