Skip to content

Deflake many-slot-migration under valgrind#3462

Merged
enjoy-binbin merged 3 commits into
valkey-io:unstablefrom
roshkhatri:fix/clients-state-timeout
Apr 23, 2026
Merged

Deflake many-slot-migration under valgrind#3462
enjoy-binbin merged 3 commits into
valkey-io:unstablefrom
roshkhatri:fix/clients-state-timeout

Conversation

@roshkhatri

@roshkhatri roshkhatri commented Apr 8, 2026

Copy link
Copy Markdown
Member

Problem

Fix cluster in tests/unit/cluster/many-slot-migration.tcl has been timing out daily on valgrind jobs since April 3, 2026. The test runs 10 cluster nodes under valgrind, migrating 40,000 keys across 1,000 slots — too much work for valgrind-instrumented builds.

Date Fix cluster time Result
Mar 30–Apr 2 ~1,200s ✅ Pass
Apr 3+ (after #3366) >2,400s ❌ TIMEOUT

The slowdown is caused by #3366 (dict→hashtable wrapper). Under -O0 (valgrind builds), the static inline wrappers become real function calls that valgrind instruments, adding ~75% overhead to hot paths like dictSize. This compounds across 10 valgrind processes over a 20-minute migration test. No impact on production builds (-O2 inlines everything).

Fix

Scale the test workload down under valgrind: 10,000 keys / 250 slots instead of 40,000 / 1,000. Normal runs are unchanged. Still exercises the same cluster repair path.


Drafted by AI

@roshkhatri roshkhatri force-pushed the fix/clients-state-timeout branch from 2dc2827 to 5039819 Compare April 8, 2026 21:40
@roshkhatri roshkhatri marked this pull request as ready for review April 8, 2026 21:45
@codecov

codecov Bot commented Apr 8, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.38%. Comparing base (baa5a48) to head (ee271a7).
⚠️ Report is 16 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #3462      +/-   ##
============================================
- Coverage     76.48%   76.38%   -0.10%     
============================================
  Files           159      159              
  Lines         79815    79815              
============================================
- Hits          61043    60967      -76     
- Misses        18772    18848      +76     

see 25 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@zuiderkwast zuiderkwast left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the [TIMEOUT] error, right?

It is 20 minutes by default, so we'll make it 200 minutes? It's a long time... It kicks in if a single test case didn't finish within this time.

I've experimented with a way for long-running test cases to seen keep-alive messages to the test runner server to avoid the timeout for slow test cases. It's doable, actually quite easy, but it requires modifying test cases an insert the keep-alive heartbeat at various places.

I'm OK with increasing it to 200 minutes for valgrind.

@zuiderkwast

zuiderkwast commented Apr 9, 2026

Copy link
Copy Markdown
Contributor

This only happens on valgrind runners, where the dict→hashtable refactor introduced additional overhead (more heap allocations for dictEntry wrappers, which valgrind instruments).

Can you explain why there are more heap allocations? AFAICT, it's the same number...

Before:

 dict ht[0]
 ----+-----+-----+----
 ... |  *  |     | ...
 ----+--|--+-----+----
        |
   +----V------+
   | dictEntry |
   +-----------+
   | key       |
   | value     |
   | next      |
   +-----------+

After:

 hashtable table[0]
 ----+---------+---------+----
 ... | x*xxxxx | xxxxxxx | ...
 ----+--|------+---------+----
        |
   +----V------+
   | dictEntry |
   +-----------+
   | key       |
   | value     |
   +-----------+

@roshkhatri

roshkhatri commented Apr 10, 2026

Copy link
Copy Markdown
Member Author

Can you explain why there are more heap allocations? AFAICT, it's the same number...

Yes, that makes sense, you are right, there should not more heap allocations. I am looking into why is this happening now

I have also updated the top comment

@roshkhatri roshkhatri requested a review from zuiderkwast April 13, 2026 21:10
The dict→hashtable refactor (fb655db) introduced additional overhead
that causes the global test timeout (20 minutes) to be exceeded under
valgrind. This results in daily failures with:
  TIMEOUT: clients state report follows.

Scale ::timeout by 10x for valgrind runs (1200s → 12000s), consistent
with the existing valgrind scaling pattern used elsewhere in the test
infrastructure (e.g., max_wait 120000 vs 10000, retrynum 1000 vs 100).

The scaling is applied after option parsing so it also respects any
custom --timeout value.

Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
@roshkhatri roshkhatri force-pushed the fix/clients-state-timeout branch from 5039819 to 8797ac8 Compare April 13, 2026 21:16
@roshkhatri

Copy link
Copy Markdown
Member Author

This is the [TIMEOUT] error, right?
It is 20 minutes by default, so we'll make it 200 minutes? It's a long time... It kicks in if a single test case didn't finish within this time.

Yes, this is the [TIMEOUT] error. I've reduced the multiplier from 10x to 3x based on profiling. With the CI default of --timeout 2400, this gives 7,200s (120 min) which is about 3x headroom from the current Fix cluster time of ~2,400s.

I do agree that the keep-alive heartbeat approach is the better long-term solution.

@roshkhatri

Copy link
Copy Markdown
Member Author

@zuiderkwast Also, the overhead for slowness what we see is coming from -O0 function call overhead under valgrind and will not be visible in production builds(-O2). For -O0 (valgrind builds) they become real function calls that valgrind instruments.

The per-operation overhead is small, but it compounds across 10 valgrind processes running cluster operations for 20+ minutes. The test was already at ~1,200s out of a 2,400s timeout before the refactor, the extra function call overhead is pushing the test to timeout

roshkhatri and others added 2 commits April 14, 2026 12:45
Reduce test workload under valgrind (10K keys / 250 slots instead of
40K / 1000) to avoid timeout in the Fix cluster test. Normal runs are
unchanged.

Co-authored-by: sarthakaggarwal97 <sarthakaggarwal97@users.noreply.github.com>
Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
@roshkhatri roshkhatri changed the title Scale global test timeout for valgrind runs Deflake many-slot-migration under valgrind Apr 14, 2026

@sarthakaggarwal97 sarthakaggarwal97 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. We decided to port fix from #3497 as it targets the single test.

@roshkhatri

roshkhatri commented Apr 14, 2026

Copy link
Copy Markdown
Member Author

Reducing the migrations for the specific tests for valgrind seems to be a better option as @sarthakaggarwal97 mentioned in #3497 and backported those changes

If this continues we can look into reducing the load for valgrind gloabally in some way

@sarthakaggarwal97

Copy link
Copy Markdown
Contributor

@hpatro @zuiderkwast @enjoy-binbin it will be good if any one of you can take a look. This PR should deflake one of the frequently occuring flaky test.

@enjoy-binbin

Copy link
Copy Markdown
Member

Look solid, thanks

@enjoy-binbin enjoy-binbin merged commit 04896c1 into valkey-io:unstable Apr 23, 2026
58 checks passed
sarthakaggarwal97 added a commit to sarthakaggarwal97/valkey that referenced this pull request Apr 23, 2026
## Problem
`Fix cluster` in `tests/unit/cluster/many-slot-migration.tcl` has been
timing out daily on valgrind jobs since April 3, 2026. The test runs 10
cluster nodes under valgrind, migrating 40,000 keys across 1,000 slots —
too much work for valgrind-instrumented builds.

The slowdown is caused by valkey-io#3366 (dict→hashtable wrapper). Under `-O0`
(valgrind builds), the `static inline` wrappers become real function
calls that valgrind instruments, adding ~75% overhead to hot paths like
`dictSize`. This compounds across 10 valgrind processes over a 20-minute
migration test. No impact on production builds (`-O2` inlines everything).

## Fix
Scale the test workload down under valgrind: 10,000 keys / 250 slots
instead of 40,000 / 1,000. Normal runs are unchanged. Still exercises
the same cluster repair path.

Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
Co-authored-by: sarthakaggarwal97 <sarthakaggarwal97@users.noreply.github.com>
sarthakaggarwal97 added a commit to sarthakaggarwal97/valkey that referenced this pull request Apr 24, 2026
## Problem
`Fix cluster` in `tests/unit/cluster/many-slot-migration.tcl` has been
timing out daily on valgrind jobs since April 3, 2026. The test runs 10
cluster nodes under valgrind, migrating 40,000 keys across 1,000 slots —
too much work for valgrind-instrumented builds.

The slowdown is caused by valkey-io#3366 (dict→hashtable wrapper). Under `-O0`
(valgrind builds), the `static inline` wrappers become real function
calls that valgrind instruments, adding ~75% overhead to hot paths like
`dictSize`. This compounds across 10 valgrind processes over a 20-minute
migration test. No impact on production builds (`-O2` inlines everything).

## Fix
Scale the test workload down under valgrind: 10,000 keys / 250 slots
instead of 40,000 / 1,000. Normal runs are unchanged. Still exercises
the same cluster repair path.

Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
Co-authored-by: sarthakaggarwal97 <sarthakaggarwal97@users.noreply.github.com>
(cherry picked from commit 66b50d8)
sarthakaggarwal97 added a commit to sarthakaggarwal97/valkey that referenced this pull request Apr 27, 2026
## Problem
`Fix cluster` in `tests/unit/cluster/many-slot-migration.tcl` has been
timing out daily on valgrind jobs since April 3, 2026. The test runs 10
cluster nodes under valgrind, migrating 40,000 keys across 1,000 slots —
too much work for valgrind-instrumented builds.

The slowdown is caused by valkey-io#3366 (dict→hashtable wrapper). Under `-O0`
(valgrind builds), the `static inline` wrappers become real function
calls that valgrind instruments, adding ~75% overhead to hot paths like
`dictSize`. This compounds across 10 valgrind processes over a 20-minute
migration test. No impact on production builds (`-O2` inlines everything).

## Fix
Scale the test workload down under valgrind: 10,000 keys / 250 slots
instead of 40,000 / 1,000. Normal runs are unchanged. Still exercises
the same cluster repair path.

Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
Co-authored-by: sarthakaggarwal97 <sarthakaggarwal97@users.noreply.github.com>
(cherry picked from commit 66b50d8)
(cherry picked from commit a104a94)
madolson pushed a commit that referenced this pull request Apr 27, 2026
## Problem
`Fix cluster` in `tests/unit/cluster/many-slot-migration.tcl` has been
timing out daily on valgrind jobs since April 3, 2026. The test runs 10
cluster nodes under valgrind, migrating 40,000 keys across 1,000 slots —
too much work for valgrind-instrumented builds.

The slowdown is caused by #3366 (dict→hashtable wrapper). Under `-O0`
(valgrind builds), the `static inline` wrappers become real function
calls that valgrind instruments, adding ~75% overhead to hot paths like
`dictSize`. This compounds across 10 valgrind processes over a 20-minute
migration test. No impact on production builds (`-O2` inlines everything).

## Fix
Scale the test workload down under valgrind: 10,000 keys / 250 slots
instead of 40,000 / 1,000. Normal runs are unchanged. Still exercises
the same cluster repair path.

Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>
Co-authored-by: sarthakaggarwal97 <sarthakaggarwal97@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants