Deflake many-slot-migration under valgrind#3462
Conversation
2dc2827 to
5039819
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## unstable #3462 +/- ##
============================================
- Coverage 76.48% 76.38% -0.10%
============================================
Files 159 159
Lines 79815 79815
============================================
- Hits 61043 60967 -76
- Misses 18772 18848 +76 🚀 New features to boost your workflow:
|
zuiderkwast
left a comment
There was a problem hiding this comment.
This is the [TIMEOUT] error, right?
It is 20 minutes by default, so we'll make it 200 minutes? It's a long time... It kicks in if a single test case didn't finish within this time.
I've experimented with a way for long-running test cases to seen keep-alive messages to the test runner server to avoid the timeout for slow test cases. It's doable, actually quite easy, but it requires modifying test cases an insert the keep-alive heartbeat at various places.
I'm OK with increasing it to 200 minutes for valgrind.
Can you explain why there are more heap allocations? AFAICT, it's the same number... Before: After: |
Yes, that makes sense, you are right, there should not more heap allocations. I am looking into why is this happening now I have also updated the top comment |
The dict→hashtable refactor (fb655db) introduced additional overhead that causes the global test timeout (20 minutes) to be exceeded under valgrind. This results in daily failures with: TIMEOUT: clients state report follows. Scale ::timeout by 10x for valgrind runs (1200s → 12000s), consistent with the existing valgrind scaling pattern used elsewhere in the test infrastructure (e.g., max_wait 120000 vs 10000, retrynum 1000 vs 100). The scaling is applied after option parsing so it also respects any custom --timeout value. Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
5039819 to
8797ac8
Compare
Yes, this is the I do agree that the keep-alive heartbeat approach is the better long-term solution. |
|
@zuiderkwast Also, the overhead for slowness what we see is coming from The per-operation overhead is small, but it compounds across 10 valgrind processes running cluster operations for 20+ minutes. The test was already at ~1,200s out of a 2,400s timeout before the refactor, the extra function call overhead is pushing the test to timeout |
Reduce test workload under valgrind (10K keys / 250 slots instead of 40K / 1000) to avoid timeout in the Fix cluster test. Normal runs are unchanged. Co-authored-by: sarthakaggarwal97 <sarthakaggarwal97@users.noreply.github.com> Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
sarthakaggarwal97
left a comment
There was a problem hiding this comment.
Looks good to me. We decided to port fix from #3497 as it targets the single test.
|
Reducing the migrations for the specific tests for valgrind seems to be a better option as @sarthakaggarwal97 mentioned in #3497 and backported those changes If this continues we can look into reducing the load for valgrind gloabally in some way |
|
@hpatro @zuiderkwast @enjoy-binbin it will be good if any one of you can take a look. This PR should deflake one of the frequently occuring flaky test. |
|
Look solid, thanks |
## Problem `Fix cluster` in `tests/unit/cluster/many-slot-migration.tcl` has been timing out daily on valgrind jobs since April 3, 2026. The test runs 10 cluster nodes under valgrind, migrating 40,000 keys across 1,000 slots — too much work for valgrind-instrumented builds. The slowdown is caused by valkey-io#3366 (dict→hashtable wrapper). Under `-O0` (valgrind builds), the `static inline` wrappers become real function calls that valgrind instruments, adding ~75% overhead to hot paths like `dictSize`. This compounds across 10 valgrind processes over a 20-minute migration test. No impact on production builds (`-O2` inlines everything). ## Fix Scale the test workload down under valgrind: 10,000 keys / 250 slots instead of 40,000 / 1,000. Normal runs are unchanged. Still exercises the same cluster repair path. Signed-off-by: Roshan Khatri <rvkhatri@amazon.com> Co-authored-by: sarthakaggarwal97 <sarthakaggarwal97@users.noreply.github.com>
## Problem `Fix cluster` in `tests/unit/cluster/many-slot-migration.tcl` has been timing out daily on valgrind jobs since April 3, 2026. The test runs 10 cluster nodes under valgrind, migrating 40,000 keys across 1,000 slots — too much work for valgrind-instrumented builds. The slowdown is caused by valkey-io#3366 (dict→hashtable wrapper). Under `-O0` (valgrind builds), the `static inline` wrappers become real function calls that valgrind instruments, adding ~75% overhead to hot paths like `dictSize`. This compounds across 10 valgrind processes over a 20-minute migration test. No impact on production builds (`-O2` inlines everything). ## Fix Scale the test workload down under valgrind: 10,000 keys / 250 slots instead of 40,000 / 1,000. Normal runs are unchanged. Still exercises the same cluster repair path. Signed-off-by: Roshan Khatri <rvkhatri@amazon.com> Co-authored-by: sarthakaggarwal97 <sarthakaggarwal97@users.noreply.github.com> (cherry picked from commit 66b50d8)
## Problem `Fix cluster` in `tests/unit/cluster/many-slot-migration.tcl` has been timing out daily on valgrind jobs since April 3, 2026. The test runs 10 cluster nodes under valgrind, migrating 40,000 keys across 1,000 slots — too much work for valgrind-instrumented builds. The slowdown is caused by valkey-io#3366 (dict→hashtable wrapper). Under `-O0` (valgrind builds), the `static inline` wrappers become real function calls that valgrind instruments, adding ~75% overhead to hot paths like `dictSize`. This compounds across 10 valgrind processes over a 20-minute migration test. No impact on production builds (`-O2` inlines everything). ## Fix Scale the test workload down under valgrind: 10,000 keys / 250 slots instead of 40,000 / 1,000. Normal runs are unchanged. Still exercises the same cluster repair path. Signed-off-by: Roshan Khatri <rvkhatri@amazon.com> Co-authored-by: sarthakaggarwal97 <sarthakaggarwal97@users.noreply.github.com> (cherry picked from commit 66b50d8) (cherry picked from commit a104a94)
## Problem `Fix cluster` in `tests/unit/cluster/many-slot-migration.tcl` has been timing out daily on valgrind jobs since April 3, 2026. The test runs 10 cluster nodes under valgrind, migrating 40,000 keys across 1,000 slots — too much work for valgrind-instrumented builds. The slowdown is caused by #3366 (dict→hashtable wrapper). Under `-O0` (valgrind builds), the `static inline` wrappers become real function calls that valgrind instruments, adding ~75% overhead to hot paths like `dictSize`. This compounds across 10 valgrind processes over a 20-minute migration test. No impact on production builds (`-O2` inlines everything). ## Fix Scale the test workload down under valgrind: 10,000 keys / 250 slots instead of 40,000 / 1,000. Normal runs are unchanged. Still exercises the same cluster repair path. Signed-off-by: Roshan Khatri <rvkhatri@amazon.com> Co-authored-by: sarthakaggarwal97 <sarthakaggarwal97@users.noreply.github.com>
Problem
Fix clusterintests/unit/cluster/many-slot-migration.tclhas been timing out daily on valgrind jobs since April 3, 2026. The test runs 10 cluster nodes under valgrind, migrating 40,000 keys across 1,000 slots — too much work for valgrind-instrumented builds.Fix clustertimeThe slowdown is caused by #3366 (dict→hashtable wrapper). Under
-O0(valgrind builds), thestatic inlinewrappers become real function calls that valgrind instruments, adding ~75% overhead to hot paths likedictSize. This compounds across 10 valgrind processes over a 20-minute migration test. No impact on production builds (-O2inlines everything).Fix
Scale the test workload down under valgrind: 10,000 keys / 250 slots instead of 40,000 / 1,000. Normal runs are unchanged. Still exercises the same cluster repair path.
Drafted by AI