Deflake io-threads test by anchoring active-time bound to a per-reactivation baseline by Taeknology · Pull Request #3764 · valkey-io/valkey

Taeknology · 2026-05-19T02:48:47Z

Why this fails

The test "Force the use of IO threads and assert active IO thread usage" in tests/unit/io-threads.tcl checked that each used_active_time_io_thread_* stayed below 1 second after a 1-second sleep. The counter at src/io_threads.c:27 is a _Atomic long long that is monotonic across the server lifetime and never reset — line 96 of the test (assert_morethan after the io-threads 1 -> 5 toggle round-trip) already relies on exactly that property.

By the time of the post-sleep assertion the counter includes:

The first activate_io_threads_and_wait (line 67) — 16 clients opened, client 0 silent, 15 senders × 32 INCRs = 480 ops.
The io-threads 1 -> 5 toggle.
The second activate_io_threads_and_wait (line 108) — another 480 ops.

On test-sanitizer-address-gcc and test-almalinux8-tls-module the cumulative total exceeds 1.0 s; that's the 1.619448 observed in the failure message. expr {$sleep_time_ms/1000} is also Tcl integer division (= 1), so the bound is tighter than the comment suggests.

Fix

Capture per-thread used_active_time immediately before after $sleep_time_ms and bound the post-sleep delta against that snapshot. Preserves the global "IO threads should idle when not activated" invariant that line 116 was protecting (introduced in #2463), but anchors it to this reactivation. Also flips /1000 to /1000.0 to avoid the integer-division surprise.

The new r info snapshot itself ticks workers by sub-millisecond because io-threads-always-active yes stays on for this test; the /1000.0 conversion preserves enough precision for the < 1.0 s tolerance to absorb it.

Follow-up commit (6537310) also drops the now-redundant delta_from_initial assertion: it measures cumulative active time across the whole test and is fragile by the same monotonic-counter mechanism — PR #3124 hit this exact form under valgrind (Expected (7.94 - 5.87) < (1000/1000)) and worked around it with valgrind:skip. The per-reactivation delta_from_pre_sleep bound is now the single source of truth for this flake class.

Alternatives considered

sanitizer:skip (like Skip active io time test if valgrind is used #3124) — erodes CI coverage on two important lanes; Skip active io time test if valgrind is used #3124 removed a flake without a causal model, but here one is available.
Bump the threshold constant — masks the cause, not the symptom.
Delete the absolute bound only (no snapshot) — would preserve the delta_from_initial assertion that is itself fragile.

Verification

Locally on darwin/arm64: ./runtest --single unit/io-threads → Force the use of IO threads and assert active IO thread usage (6164 ms) passes under both vanilla and SANITIZER=address builds, plus the existing memory-leak check.

Verified the monotonic-accumulation model with an instrumented test run (4 extra activations before the pre-sleep snapshot, puts stderr of cumulative / pre-sleep / post values); per-thread numbers in a follow-up comment below. I don't have an ASan/TLS-module Linux environment locally, so the absolute < 1.0 s threshold crossing itself is left to the CI matrix.

…ivation baseline The test "Force the use of IO threads and assert active IO thread usage" asserted that each io_thread's used_active_time_io_thread counter stayed below 1 second after a 1-second sleep. The counter is monotonic across the server lifetime and never reset (src/io_threads.c), so by the time of that assertion it already includes work from the earlier activate_io_threads_and_wait call plus the io-threads 1 -> 5 toggle. On test-sanitizer-address-gcc and test-almalinux8-tls-module CI lanes the cumulative total exceeds 1.0 s and the assertion fails (e.g. "Expected 1.619448 < (1000/1000)"). Capture per-thread used_active_time immediately before the sleep and bound the post-sleep delta against that snapshot. This preserves the original "IO threads should idle when not activated" invariant from valkey-io#2463 while anchoring it to this reactivation. Also flip /1000 to /1000.0 to avoid Tcl integer-division surprises. Fixes valkey-io#3727. Signed-off-by: Taeknology <20297177+Taeknology@users.noreply.github.com>

Taeknology · 2026-05-19T02:48:57Z

Local model verification — per-thread numbers from an instrumented run with 4 extra activate_io_threads_and_wait calls inserted before the pre-sleep snapshot. The instrumentation was reverted unconditionally after the run; HEAD remains at the single patch commit.

metric (thread 1, busiest)	seconds
after 1 activation (initial)	0.003069
after 5 activations (1 + 4 extras)	0.014908
post (= 6 activations, 1 + 4 + final)	0.018538
`delta_from_initial` (5 activations cumulative)	0.015469
`delta_from_pre_sleep` (1 final activation only)	0.003608

Three observations:

The counter grows linearly with the activation count (~5x across 4 extras), confirming the monotonic accumulation argued above.
delta_from_pre_sleep — what the patch asserts on — stays in 0.001-0.004 s on every thread regardless of the inflated cumulative.
delta_from_initial — close to what the pre-patch was asserting on — grows with prior work and is the quantity that crosses 1.0 s on instrumented CI.

Local arm64 cost per activation is ~0.003 s; the failing CI message (1.619448 across two activations) implies ~0.81 s per activation, i.e. ~270x slower than my local. That's why the absolute < 1.0 s bound trips on sanitizer/TLS lanes but not locally, and why local reproduction without a Linux ASan environment was not feasible. The patched assertion (delta-from-pre-sleep) is invariant to that scaling by construction.

coderabbitai · 2026-05-19T02:49:00Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 93a0ef7e-f9c1-454a-930c-ff0b6c37c844

📥 Commits

Reviewing files that changed from the base of the PR and between d1d97b4 and 6537310.

📒 Files selected for processing (1)

tests/unit/io-threads.tcl

📝 Walkthrough

Walkthrough

The test tests/unit/io-threads.tcl snapshots per-thread used_active_time immediately before sleeping and then asserts that each thread’s post-sleep increase is below the sleep duration in seconds using /1000.0, removing earlier comparisons to initial snapshots.

Changes

IO thread active time measurement fix

Layer / File(s)	Summary
Pre-sleep snapshot and post-reactivation assertion `tests/unit/io-threads.tcl`	Removes the earlier `initial_active_times` snapshot (lines 70–73), records `used_active_time_io_thread_$i` into `pre_sleep_active_times` immediately before `after $sleep_time_ms` (lines 103–110), and after wake asserts `(used_active_time - pre_sleep_active_times($i)) < (sleep_time_ms/1000.0)` for each non-final IO thread, replacing previous assertions (lines 120–122).

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title directly and accurately describes the main change: anchoring the active-time bound to a per-reactivation baseline to deflake the test.
Description check	✅ Passed	The description clearly relates to the changeset, explaining the root cause of the test flake and how the snapshot-before-sleep fix addresses it.
Linked Issues check	✅ Passed	The code changes directly address issue `#3727`: snapshot per-thread active_time before sleep and bound post-sleep delta against that baseline, while fixing Tcl integer-division by using /1000.0.
Out of Scope Changes check	✅ Passed	All changes in the test file are directly related to fixing the flaky test by snapshotting active-time and adjusting assertions, with no unrelated modifications.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/unit/io-threads.tcl`:
- Around line 122-125: Remove the stale baseline assertion that compares
used_active_time against initial_active_times (the monotonic counter makes this
flaky); in the test block containing used_active_time, initial_active_times,
pre_sleep_active_times and sleep_time_ms, delete the assertion line referencing
initial_active_times and keep only the assertion that checks (used_active_time -
pre_sleep_active_times($i)) < (sleep_time_ms/1000.0), and update or simplify the
surrounding comment so it only documents bounding activity attributable to this
reactivation.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 295c4bd3-df16-402e-bb9a-84d1a4ab079b

📥 Commits

Reviewing files that changed from the base of the PR and between 54fbe4a and d1d97b4.

📒 Files selected for processing (1)

tests/unit/io-threads.tcl

The delta-from-initial-baseline assertion measures cumulative active time across the entire test (toggle + sleep + final activation). As the monotonic counter (src/io_threads.c:27) accumulates, this bound is fragile: PR valkey-io#3124 had to add valgrind:skip after the same form failed under valgrind. The remaining delta-from-pre-sleep assertion covers the real invariant — io_threads should idle during sleep and only the final reactivation contributes to the post-sleep delta. Addresses review feedback on valkey-io#3764. Signed-off-by: Taeknology <20297177+Taeknology@users.noreply.github.com>

Taeknology · 2026-05-19T04:51:04Z

Good point. Dropped the delta_from_initial assertion in 6537310 — it shares the same monotonic-counter root cause (PR #3124 hit this exact form under valgrind: Expected (7.94 - 5.87) < (1000/1000)). The per-reactivation delta_from_pre_sleep bound is now the single source of truth for this flake class.

rainsupreme

Looks good! Fixes the measurement methodology.

nit: would like to see 100x github workflow run showing the test is no longer flaky (not a blocker for me though)

sarthakaggarwal97

This looks good to me. I would although like to see a 100x run of the flaky tests in that environment to be sure this solves it.

Please refer to this section on how to run daily tests multiple times: https://github.com/valkey-io/valkey/blob/unstable/CONTRIBUTING.md#running-the-daily-workflow-on-demand-for-your-branch

codecov · 2026-06-24T18:06:09Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.78%. Comparing base (79bca53) to head (5edebba).

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable    #3764      +/-   ##
============================================
+ Coverage     76.68%   76.78%   +0.10%     
============================================
  Files           162      162              
  Lines         81021    81021              
============================================
+ Hits          62129    62212      +83     
+ Misses        18892    18809      -83

see 14 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

sarthakaggarwal97 · 2026-06-29T18:03:42Z

I ran a targeted Daily workflow in my fork to stress-test this PR’s io-threads change with sanitizer jobs and repeated loops:

https://github.com/sarthakaggarwal97/valkey/actions/runs/28386791964/job/84103406744

The failure is different from the original issue this PR was trying to fix.

The original problem was that the test compared used_active_time_io_thread_* directly against a 1 second bound even though that counter is monotonic over the server lifetime. The PR fixes that part by snapshotting pre_sleep_active_times and comparing only the delta.

However, the targeted Daily run shows that the new delta-based assertion is still too strict under sanitizer stress runs. The failing assertion is:

($used_active_time - $pre_sleep_active_times($i)) < ($sleep_time_ms/1000.0)

With --single unit/io-threads --loops 100, this failed repeatedly across sanitizer jobs. The smallest failing delta was just over 1 second, and some deltas were several seconds.

So I think the PR is directionally fixing the right bug, but the replacement invariant still needs adjustment.

github-actions Bot assigned Taeknology May 19, 2026

Taeknology mentioned this pull request May 19, 2026

[TEST-FAILURE] Force the use of IO threads and assert active IO thread usage in tests/unit/io-threads.tcl #3727

Open

coderabbitai Bot reviewed May 19, 2026

View reviewed changes

Comment thread tests/unit/io-threads.tcl Outdated

rainsupreme approved these changes May 21, 2026

View reviewed changes

Taeknology and others added 3 commits June 11, 2026 23:12

Merge branch 'unstable' into fix/io-threads-3727

e8209b5

Merge branch 'unstable' into fix/io-threads-3727

11ca0b6

Merge branch 'unstable' into fix/io-threads-3727

5edebba

sarthakaggarwal97 approved these changes Jun 24, 2026

View reviewed changes

sarthakaggarwal97 mentioned this pull request Jun 24, 2026

Deflake diskless timeout rdb pipe test by tolerating either timeout-disconnect branch #3772

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Deflake io-threads test by anchoring active-time bound to a per-reactivation baseline#3764

Deflake io-threads test by anchoring active-time bound to a per-reactivation baseline#3764
Taeknology wants to merge 5 commits into
valkey-io:unstablefrom
Taeknology:fix/io-threads-3727

Taeknology commented May 19, 2026 •

edited

Loading

Uh oh!

Taeknology commented May 19, 2026

Uh oh!

coderabbitai Bot commented May 19, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Taeknology commented May 19, 2026

Uh oh!

rainsupreme left a comment

Uh oh!

sarthakaggarwal97 left a comment

Uh oh!

codecov Bot commented Jun 24, 2026

Uh oh!

sarthakaggarwal97 commented Jun 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

Taeknology commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Why this fails

Fix

Alternatives considered

Verification

Uh oh!

Taeknology commented May 19, 2026

Uh oh!

coderabbitai Bot commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Taeknology commented May 19, 2026

Uh oh!

rainsupreme left a comment

Choose a reason for hiding this comment

Uh oh!

sarthakaggarwal97 left a comment

Choose a reason for hiding this comment

Uh oh!

codecov Bot commented Jun 24, 2026

Codecov Report

Uh oh!

sarthakaggarwal97 commented Jun 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Taeknology commented May 19, 2026 •

edited

Loading

coderabbitai Bot commented May 19, 2026 •

edited

Loading