Increase timeout in flaky "failover immediately" test case #3424
Merged
zuiderkwast merged 1 commit on Apr 1, 2026
…tely test Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## unstable #3424 +/- ##
============================================
- Coverage 76.55% 76.52% -0.03%
============================================
Files 159 159
Lines 79703 79703
============================================
- Hits 61017 60995 -22
- Misses 18686 18708 +22
Member
Looks like the memory efficiency tests are still failing. Another option is to revert the commit. What are your thoughts on that?
Nikhil-Manglore pushed a commit to Nikhil-Manglore/valkey that referenced this pull request on Apr 7, 2026
…#3424)
The test case "The best replica can initiate an election immediately test" has been failing in CI jobs. Increase the timeout to account for slow runners.
Old waiting time: 50 seconds. New waiting time: 120 seconds (600 seconds with valgrind).
The test was introduced in valkey-io#2227.
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
zuiderkwast pushed a commit that referenced this pull request on Apr 9, 2026
Two changes to tests/unit/cluster/faster-failover.tcl:
1. FAIL detection timeout: `wait_for_condition 1000 10` → `1000 50` (10s → 50s)
2. psync_max_retries: 1200 → 2400 normal (120s → 240s), 6000 → 12000 valgrind (600s → 1200s)

The test `The best replica can initiate an election immediately in an automatic failover` in `tests/unit/cluster/faster-failover.tcl` has been flaky since it was introduced on March 27, 2026 by #2227.

**Frequency:** 8 out of 15 days (Mar 27 – Apr 8), across valgrind, sanitizer, and slow CI runners.

**Common errors:**
- `log message of "Successful partial resynchronization with primary" not found` (timeout waiting for psync)
- `expected pattern found in srv -N log file: *best ranked replica*` (timeout waiting for FAIL propagation)

The test spins up a 12-node cluster (5 primaries + 7 replicas), pauses nodes, and waits for FAIL detection to propagate across all nodes before failover + partial resync.

A previous fix attempt #3424 increased the psync timeout from 50s to 120s (600s valgrind), which reduced frequency but did not eliminate it.

Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
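The timeout figures in the commit message above follow from "number of polls × delay between polls". The Python sketch below only illustrates that arithmetic and the polling pattern; it is not the Tcl helper from the Valkey test suite. The names `budget_seconds` and `wait_for_condition` are hypothetical stand-ins, and the 100 ms delay per psync retry is an assumption inferred from the message's own numbers (1200 retries ≈ 120s), not taken from the test source.

```python
import time
from typing import Callable


def budget_seconds(max_tries: int, delay_ms: int) -> float:
    """Worst-case time spent waiting: max_tries polls, delay_ms apart."""
    return max_tries * delay_ms / 1000


def wait_for_condition(max_tries: int, delay_ms: int,
                       predicate: Callable[[], bool]) -> bool:
    """Poll predicate up to max_tries times; return True once it holds.

    Hypothetical stand-in for a retry helper, not the real Tcl proc.
    """
    for _ in range(max_tries):
        if predicate():
            return True
        time.sleep(delay_ms / 1000)
    return False


# Assumed delay per psync retry, inferred from "1200 -> 120s" in the message.
PSYNC_RETRY_DELAY_MS = 100

# Budgets quoted in the commit message, recomputed from tries and delay.
fail_detection_old = budget_seconds(1000, 10)                   # 10s
fail_detection_new = budget_seconds(1000, 50)                   # 50s
psync_new = budget_seconds(2400, PSYNC_RETRY_DELAY_MS)          # 240s
psync_new_valgrind = budget_seconds(12000, PSYNC_RETRY_DELAY_MS)  # 1200s
```

A larger budget only raises the ceiling: the loop returns as soon as the condition holds, so fast runners are not slowed down by the longer timeout.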
sarthakaggarwal97 pushed a commit to sarthakaggarwal97/valkey that referenced this pull request on Apr 16, 2026
…#3424)
The test case "The best replica can initiate an election immediately test" has been failing in CI jobs. Increase the timeout to account for slow runners.
Old waiting time: 50 seconds. New waiting time: 120 seconds (600 seconds with valgrind).
The test was introduced in valkey-io#2227.
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
sarthakaggarwal97 pushed a commit to sarthakaggarwal97/valkey that referenced this pull request on Apr 16, 2026
…-io#3463)
Two changes to tests/unit/cluster/faster-failover.tcl:
1. FAIL detection timeout: `wait_for_condition 1000 10` → `1000 50` (10s → 50s)
2. psync_max_retries: 1200 → 2400 normal (120s → 240s), 6000 → 12000 valgrind (600s → 1200s)

The test `The best replica can initiate an election immediately in an automatic failover` in `tests/unit/cluster/faster-failover.tcl` has been flaky since it was introduced on March 27, 2026 by valkey-io#2227.

**Frequency:** 8 out of 15 days (Mar 27 – Apr 8), across valgrind, sanitizer, and slow CI runners.

**Common errors:**
- `log message of "Successful partial resynchronization with primary" not found` (timeout waiting for psync)
- `expected pattern found in srv -N log file: *best ranked replica*` (timeout waiting for FAIL propagation)

The test spins up a 12-node cluster (5 primaries + 7 replicas), pauses nodes, and waits for FAIL detection to propagate across all nodes before failover + partial resync.

A previous fix attempt valkey-io#3424 increased the psync timeout from 50s to 120s (600s valgrind), which reduced frequency but did not eliminate it.

Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
madolson pushed a commit that referenced this pull request on Apr 27, 2026
The test case "The best replica can initiate an election immediately test" has been failing in CI jobs. Increase the timeout to account for slow runners.
Old waiting time: 50 seconds. New waiting time: 120 seconds (600 seconds with valgrind).
The test was introduced in #2227.
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
madolson pushed a commit that referenced this pull request on Apr 27, 2026
Two changes to tests/unit/cluster/faster-failover.tcl:
1. FAIL detection timeout: `wait_for_condition 1000 10` → `1000 50` (10s → 50s)
2. psync_max_retries: 1200 → 2400 normal (120s → 240s), 6000 → 12000 valgrind (600s → 1200s)

The test `The best replica can initiate an election immediately in an automatic failover` in `tests/unit/cluster/faster-failover.tcl` has been flaky since it was introduced on March 27, 2026 by #2227.

**Frequency:** 8 out of 15 days (Mar 27 – Apr 8), across valgrind, sanitizer, and slow CI runners.

**Common errors:**
- `log message of "Successful partial resynchronization with primary" not found` (timeout waiting for psync)
- `expected pattern found in srv -N log file: *best ranked replica*` (timeout waiting for FAIL propagation)

The test spins up a 12-node cluster (5 primaries + 7 replicas), pauses nodes, and waits for FAIL detection to propagate across all nodes before failover + partial resync.

A previous fix attempt #3424 increased the psync timeout from 50s to 120s (600s valgrind), which reduced frequency but did not eliminate it.

Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
roshkhatri pushed a commit to roshkhatri/valkey that referenced this pull request on May 26, 2026
…#3424)
The test case "The best replica can initiate an election immediately test" has been failing in CI jobs. Increase the timeout to account for slow runners.
Old waiting time: 50 seconds. New waiting time: 120 seconds (600 seconds with valgrind).
The test was introduced in valkey-io#2227.
Signed-off-by: Viktor Söderqvist <viktor.soderqvist@est.tech>
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
The test case "The best replica can initiate an election immediately test" has been failing in CI jobs.
Increase the timeout to account for slow runners.
Old waiting time: 50 seconds. New waiting time: 120 seconds (600 seconds with valgrind).
The test was introduced in #2227.