Deflake replica selection test by relaxing cluster configurations #2672
Codecov Report
✅ All modified and coverable lines are covered by tests.

| | unstable | #2672 | +/- |
|---|---|---|---|
| Coverage | 72.18% | 72.62% | +0.44% |
| Files | 128 | 128 | |
| Lines | 70994 | 71273 | +279 |
| Hits | 51246 | 51762 | +516 |
| Misses | 19748 | 19511 | -237 |
zuiderkwast left a comment:
Do you think it will help or is it a wild guess?
3d1a7a3 to 0d3b28f
@zuiderkwast The test has failed only under valgrind in the past couple of weeks, so it should be related to the general slowness of valgrind. Also, I have a few passing valgrind runs in my local repo after this change, so it should work!
enjoy-binbin left a comment:
5000 200 is quite a huge timeout and looks odd to me. We have a lot of similar cluster tests (I believe) in the daily run; have you measured its testing time in the daily CI? Do you think adjusting cluster-ping-interval and cluster-node-timeout would help?
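To put the "5000 200" remark in numbers, here is a minimal sketch in Python. It assumes the two values are a retry count and a per-retry delay in milliseconds for a polling wait (the comment does not spell this out), and the function name `wait_budget_seconds` is hypothetical, not part of the test suite.

```python
def wait_budget_seconds(max_tries: int, delay_ms: int) -> float:
    """Worst-case time spent polling before the wait gives up."""
    return max_tries * delay_ms / 1000.0


# Assumption: "5000 200" means 5000 tries with a 200 ms delay between tries.
budget = wait_budget_seconds(5000, 200)
print(f"{budget:.0f} s (~{budget / 60:.1f} min)")
```

Under that reading, the wait could last up to 1000 seconds (roughly 16.7 minutes) before failing, which is why the reviewer suggests tuning the cluster settings instead.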
Valgrind tests take about 3 hours 50 minutes, which is about what we see in the daily tests too. Let me explore cluster-ping-interval and cluster-node-timeout.
0d3b28f to 65d857e
@enjoy-binbin I think your suggestion has worked. I somehow didn't notice that the values for ping interval and node timeout are lower by default. Just increasing them for this test has gotten me 2-3 successful runs in a row.
Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
65d857e to 2525015
enjoy-binbin left a comment:
> Just increasing for this test have gotten me 2-3 successful runs together.

Thanks, please try running it a few more times before we merge it.
@enjoy-binbin the test has been green for the last 6 runs in my local repo!
…lkey-io#2672)

We have relaxed the `cluster-ping-interval` and `cluster-node-timeout` so that the cluster has enough time to stabilize and propagate changes.

Fixes this test's occasional failure when running with valgrind:

[err]: Node #10 should eventually replicate node #5 in tests/unit/cluster/slave-selection.tcl
#10 didn't became slave of #5

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
)

We have relaxed the `cluster-ping-interval` and `cluster-node-timeout` so that the cluster has enough time to stabilize and propagate changes.

Fixes this test's occasional failure when running with valgrind:

[err]: Node #10 should eventually replicate node #5 in tests/unit/cluster/slave-selection.tcl
#10 didn't became slave of #5

Backported to the 9.0 branch in #2731.

Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
…ations (#3261)

The "New Master down consecutively" test was sometimes failing under Valgrind by timing out. The new overrides match those used for the cluster in the first part of the file - see #2672

Under Valgrind's 10-20x slowdown, a single failover requiring ~15 seconds of server time can exceed the test's 100-second wall-clock wait.

Error text:
```
*** [err]: New Master down consecutively in tests/unit/cluster/slave-selection.tcl
No failover detected when master 12 fails
```

Daily run failure: https://github.com/valkey-io/valkey/actions/runs/22421982161/job/64921545936#logs

---------

Signed-off-by: Rain Valentine <rsg000@gmail.com>
Signed-off-by: Rain Valentine <rainval@amazon.com>
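The timing argument in the #3261 commit message can be checked with simple arithmetic. The following is a minimal sketch in Python using only the figures quoted there (~15 seconds of server time, a 10-20x slowdown, a 100-second wait); the function names are hypothetical, and the linear scaling of server time by the slowdown factor is a simplifying assumption.

```python
def wall_clock_seconds(server_seconds: float, slowdown: float) -> float:
    """Assumed model: wall-clock time grows linearly with the slowdown factor."""
    return server_seconds * slowdown


def exceeds_wait(server_seconds: float, slowdown: float,
                 wait_seconds: float = 100.0) -> bool:
    """True when the slowed-down failover would outlast the test's wait."""
    return wall_clock_seconds(server_seconds, slowdown) > wait_seconds


# Figures from the commit message: ~15 s failover, 10-20x slowdown, 100 s wait.
for slowdown in (1, 10, 20):
    needed = wall_clock_seconds(15, slowdown)
    print(f"{slowdown:>2}x -> {needed:.0f} s, times out: {exceeds_wait(15, slowdown)}")
```

Under this model a 15-second failover needs 150 to 300 seconds of wall-clock time at 10-20x, so the 100-second wait is exceeded across the whole quoted slowdown range.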
…ations (valkey-io#3261)

The "New Master down consecutively" test was sometimes failing under Valgrind by timing out. The new overrides match those used for the cluster in the first part of the file - see valkey-io#2672

Under Valgrind's 10-20x slowdown, a single failover requiring ~15 seconds of server time can exceed the test's 100-second wall-clock wait.

Error text:
```
*** [err]: New Master down consecutively in tests/unit/cluster/slave-selection.tcl
No failover detected when master 12 fails
```

Daily run failure: https://github.com/valkey-io/valkey/actions/runs/22421982161/job/64921545936#logs

---------

Signed-off-by: Rain Valentine <rsg000@gmail.com>
Signed-off-by: Rain Valentine <rainval@amazon.com>
(cherry picked from commit 09a13ec)
We have relaxed the `cluster-ping-interval` and `cluster-node-timeout` so that the cluster has enough time to stabilize and propagate changes.

Today's failed test run: https://github.com/valkey-io/valkey/actions/runs/18179260254/job/51751751729#step:6:11262