Deflake many-slot-migration under valgrind by roshkhatri · Pull Request #3462 · valkey-io/valkey

roshkhatri · 2026-04-08T21:38:10Z

Problem

Fix cluster in tests/unit/cluster/many-slot-migration.tcl has been timing out daily on valgrind jobs since April 3, 2026. The test runs 10 cluster nodes under valgrind, migrating 40,000 keys across 1,000 slots — too much work for valgrind-instrumented builds.

Date	`Fix cluster` time	Result
Mar 30–Apr 2	~1,200s	✅ Pass
Apr 3+ (after #3366)	>2,400s	❌ TIMEOUT

The slowdown is caused by #3366 (dict→hashtable wrapper). Under -O0 (valgrind builds), the static inline wrappers become real function calls that valgrind instruments, adding ~75% overhead to hot paths like dictSize. This compounds across 10 valgrind processes over a 20-minute migration test. No impact on production builds (-O2 inlines everything).

Fix

Scale the test workload down under valgrind: 10,000 keys / 250 slots instead of 40,000 / 1,000. Normal runs are unchanged. Still exercises the same cluster repair path.

Drafted by AI

codecov · 2026-04-08T22:02:53Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.38%. Comparing base (baa5a48) to head (ee271a7).
⚠️ Report is 16 commits behind head on unstable.

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable    #3462      +/-   ##
============================================
- Coverage     76.48%   76.38%   -0.10%     
============================================
  Files           159      159              
  Lines         79815    79815              
============================================
- Hits          61043    60967      -76     
- Misses        18772    18848      +76

see 25 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

zuiderkwast

This is the [TIMEOUT] error, right?

It is 20 minutes by default, so we'll make it 200 minutes? It's a long time... It kicks in if a single test case didn't finish within this time.

I've experimented with a way for long-running test cases to seen keep-alive messages to the test runner server to avoid the timeout for slow test cases. It's doable, actually quite easy, but it requires modifying test cases an insert the keep-alive heartbeat at various places.

I'm OK with increasing it to 200 minutes for valgrind.

zuiderkwast · 2026-04-09T09:32:57Z

This only happens on valgrind runners, where the dict→hashtable refactor introduced additional overhead (more heap allocations for dictEntry wrappers, which valgrind instruments).

Can you explain why there are more heap allocations? AFAICT, it's the same number...

Before:

 dict ht[0]
 ----+-----+-----+----
 ... |  *  |     | ...
 ----+--|--+-----+----
        |
   +----V------+
   | dictEntry |
   +-----------+
   | key       |
   | value     |
   | next      |
   +-----------+

After:

 hashtable table[0]
 ----+---------+---------+----
 ... | x*xxxxx | xxxxxxx | ...
 ----+--|------+---------+----
        |
   +----V------+
   | dictEntry |
   +-----------+
   | key       |
   | value     |
   +-----------+

roshkhatri · 2026-04-10T20:12:57Z

Can you explain why there are more heap allocations? AFAICT, it's the same number...

Yes, that makes sense, you are right, there should not more heap allocations. I am looking into why is this happening now

I have also updated the top comment

The dict→hashtable refactor (fb655db) introduced additional overhead that causes the global test timeout (20 minutes) to be exceeded under valgrind. This results in daily failures with: TIMEOUT: clients state report follows. Scale ::timeout by 10x for valgrind runs (1200s → 12000s), consistent with the existing valgrind scaling pattern used elsewhere in the test infrastructure (e.g., max_wait 120000 vs 10000, retrynum 1000 vs 100). The scaling is applied after option parsing so it also respects any custom --timeout value. Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>

roshkhatri · 2026-04-13T21:20:23Z

This is the [TIMEOUT] error, right?
It is 20 minutes by default, so we'll make it 200 minutes? It's a long time... It kicks in if a single test case didn't finish within this time.

Yes, this is the [TIMEOUT] error. I've reduced the multiplier from 10x to 3x based on profiling. With the CI default of --timeout 2400, this gives 7,200s (120 min) which is about 3x headroom from the current Fix cluster time of ~2,400s.

I do agree that the keep-alive heartbeat approach is the better long-term solution.

roshkhatri · 2026-04-13T21:27:59Z

@zuiderkwast Also, the overhead for slowness what we see is coming from -O0 function call overhead under valgrind and will not be visible in production builds(-O2). For -O0 (valgrind builds) they become real function calls that valgrind instruments.

The per-operation overhead is small, but it compounds across 10 valgrind processes running cluster operations for 20+ minutes. The test was already at ~1,200s out of a 2,400s timeout before the refactor, the extra function call overhead is pushing the test to timeout

Reduce test workload under valgrind (10K keys / 250 slots instead of 40K / 1000) to avoid timeout in the Fix cluster test. Normal runs are unchanged. Co-authored-by: sarthakaggarwal97 <sarthakaggarwal97@users.noreply.github.com> Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>

sarthakaggarwal97

Looks good to me. We decided to port fix from #3497 as it targets the single test.

roshkhatri · 2026-04-14T22:40:35Z

Reducing the migrations for the specific tests for valgrind seems to be a better option as @sarthakaggarwal97 mentioned in #3497 and backported those changes

If this continues we can look into reducing the load for valgrind gloabally in some way

sarthakaggarwal97 · 2026-04-22T20:03:00Z

@hpatro @zuiderkwast @enjoy-binbin it will be good if any one of you can take a look. This PR should deflake one of the frequently occuring flaky test.

enjoy-binbin · 2026-04-23T04:31:03Z

Look solid, thanks

## Problem `Fix cluster` in `tests/unit/cluster/many-slot-migration.tcl` has been timing out daily on valgrind jobs since April 3, 2026. The test runs 10 cluster nodes under valgrind, migrating 40,000 keys across 1,000 slots — too much work for valgrind-instrumented builds. The slowdown is caused by valkey-io#3366 (dict→hashtable wrapper). Under `-O0` (valgrind builds), the `static inline` wrappers become real function calls that valgrind instruments, adding ~75% overhead to hot paths like `dictSize`. This compounds across 10 valgrind processes over a 20-minute migration test. No impact on production builds (`-O2` inlines everything). ## Fix Scale the test workload down under valgrind: 10,000 keys / 250 slots instead of 40,000 / 1,000. Normal runs are unchanged. Still exercises the same cluster repair path. Signed-off-by: Roshan Khatri <rvkhatri@amazon.com> Co-authored-by: sarthakaggarwal97 <sarthakaggarwal97@users.noreply.github.com>

## Problem `Fix cluster` in `tests/unit/cluster/many-slot-migration.tcl` has been timing out daily on valgrind jobs since April 3, 2026. The test runs 10 cluster nodes under valgrind, migrating 40,000 keys across 1,000 slots — too much work for valgrind-instrumented builds. The slowdown is caused by valkey-io#3366 (dict→hashtable wrapper). Under `-O0` (valgrind builds), the `static inline` wrappers become real function calls that valgrind instruments, adding ~75% overhead to hot paths like `dictSize`. This compounds across 10 valgrind processes over a 20-minute migration test. No impact on production builds (`-O2` inlines everything). ## Fix Scale the test workload down under valgrind: 10,000 keys / 250 slots instead of 40,000 / 1,000. Normal runs are unchanged. Still exercises the same cluster repair path. Signed-off-by: Roshan Khatri <rvkhatri@amazon.com> Co-authored-by: sarthakaggarwal97 <sarthakaggarwal97@users.noreply.github.com> (cherry picked from commit 66b50d8)

## Problem `Fix cluster` in `tests/unit/cluster/many-slot-migration.tcl` has been timing out daily on valgrind jobs since April 3, 2026. The test runs 10 cluster nodes under valgrind, migrating 40,000 keys across 1,000 slots — too much work for valgrind-instrumented builds. The slowdown is caused by valkey-io#3366 (dict→hashtable wrapper). Under `-O0` (valgrind builds), the `static inline` wrappers become real function calls that valgrind instruments, adding ~75% overhead to hot paths like `dictSize`. This compounds across 10 valgrind processes over a 20-minute migration test. No impact on production builds (`-O2` inlines everything). ## Fix Scale the test workload down under valgrind: 10,000 keys / 250 slots instead of 40,000 / 1,000. Normal runs are unchanged. Still exercises the same cluster repair path. Signed-off-by: Roshan Khatri <rvkhatri@amazon.com> Co-authored-by: sarthakaggarwal97 <sarthakaggarwal97@users.noreply.github.com> (cherry picked from commit 66b50d8) (cherry picked from commit a104a94)

## Problem `Fix cluster` in `tests/unit/cluster/many-slot-migration.tcl` has been timing out daily on valgrind jobs since April 3, 2026. The test runs 10 cluster nodes under valgrind, migrating 40,000 keys across 1,000 slots — too much work for valgrind-instrumented builds. The slowdown is caused by #3366 (dict→hashtable wrapper). Under `-O0` (valgrind builds), the `static inline` wrappers become real function calls that valgrind instruments, adding ~75% overhead to hot paths like `dictSize`. This compounds across 10 valgrind processes over a 20-minute migration test. No impact on production builds (`-O2` inlines everything). ## Fix Scale the test workload down under valgrind: 10,000 keys / 250 slots instead of 40,000 / 1,000. Normal runs are unchanged. Still exercises the same cluster repair path. Signed-off-by: Roshan Khatri <rvkhatri@amazon.com> Co-authored-by: sarthakaggarwal97 <sarthakaggarwal97@users.noreply.github.com>

github-actions Bot assigned roshkhatri Apr 8, 2026

roshkhatri force-pushed the fix/clients-state-timeout branch from 2dc2827 to 5039819 Compare April 8, 2026 21:40

roshkhatri requested review from sarthakaggarwal97 and zuiderkwast April 8, 2026 21:45

roshkhatri marked this pull request as ready for review April 8, 2026 21:45

dvkashapov approved these changes Apr 9, 2026

View reviewed changes

zuiderkwast approved these changes Apr 9, 2026

View reviewed changes

dvkashapov mentioned this pull request Apr 9, 2026

Increase timeouts in faster-failover test for slow CI runners #3463

Merged

roshkhatri requested a review from zuiderkwast April 13, 2026 21:10

roshkhatri force-pushed the fix/clients-state-timeout branch from 5039819 to 8797ac8 Compare April 13, 2026 21:16

dvkashapov mentioned this pull request Apr 14, 2026

[Flaky Tests] Deflake many-slot-migration under valgrind #3497

Closed

roshkhatri and others added 2 commits April 14, 2026 12:45

Merge branch 'unstable' into fix/clients-state-timeout

94c68e1

roshkhatri changed the title ~~Scale global test timeout for valgrind runs~~ Deflake many-slot-migration under valgrind Apr 14, 2026

sarthakaggarwal97 approved these changes Apr 14, 2026

View reviewed changes

sarthakaggarwal97 requested a review from hpatro April 22, 2026 20:01

sarthakaggarwal97 requested a review from enjoy-binbin April 22, 2026 20:03

enjoy-binbin approved these changes Apr 23, 2026

View reviewed changes

enjoy-binbin merged commit 04896c1 into valkey-io:unstable Apr 23, 2026
58 checks passed

sarthakaggarwal97 mentioned this pull request Apr 23, 2026

Backport Unstable to 9.1 for RC2 #3519

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Deflake many-slot-migration under valgrind#3462

Deflake many-slot-migration under valgrind#3462
enjoy-binbin merged 3 commits into
valkey-io:unstablefrom
roshkhatri:fix/clients-state-timeout

roshkhatri commented Apr 8, 2026 •

edited

Loading

Uh oh!

codecov Bot commented Apr 8, 2026 •

edited

Loading

Uh oh!

zuiderkwast left a comment

Uh oh!

zuiderkwast commented Apr 9, 2026 •

edited

Loading

Uh oh!

roshkhatri commented Apr 10, 2026 •

edited

Loading

Uh oh!

roshkhatri commented Apr 13, 2026

Uh oh!

roshkhatri commented Apr 13, 2026

Uh oh!

sarthakaggarwal97 left a comment

Uh oh!

roshkhatri commented Apr 14, 2026 •

edited

Loading

Uh oh!

sarthakaggarwal97 commented Apr 22, 2026

Uh oh!

enjoy-binbin commented Apr 23, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Uh oh!

Conversation

roshkhatri commented Apr 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Fix

Uh oh!

codecov Bot commented Apr 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

zuiderkwast left a comment

Choose a reason for hiding this comment

Uh oh!

zuiderkwast commented Apr 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

roshkhatri commented Apr 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

roshkhatri commented Apr 13, 2026

Uh oh!

roshkhatri commented Apr 13, 2026

Uh oh!

sarthakaggarwal97 left a comment

Choose a reason for hiding this comment

Uh oh!

roshkhatri commented Apr 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sarthakaggarwal97 commented Apr 22, 2026

Uh oh!

enjoy-binbin commented Apr 23, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

roshkhatri commented Apr 8, 2026 •

edited

Loading

codecov Bot commented Apr 8, 2026 •

edited

Loading

zuiderkwast commented Apr 9, 2026 •

edited

Loading

roshkhatri commented Apr 10, 2026 •

edited

Loading

roshkhatri commented Apr 14, 2026 •

edited

Loading