Skip to content

Deflake io-threads test by anchoring active-time bound to a per-reactivation baseline#3764

Open
Taeknology wants to merge 5 commits into
valkey-io:unstablefrom
Taeknology:fix/io-threads-3727
Open

Deflake io-threads test by anchoring active-time bound to a per-reactivation baseline#3764
Taeknology wants to merge 5 commits into
valkey-io:unstablefrom
Taeknology:fix/io-threads-3727

Conversation

@Taeknology

@Taeknology Taeknology commented May 19, 2026

Copy link
Copy Markdown
Contributor

Fixes #3727.

Why this fails

The test "Force the use of IO threads and assert active IO thread usage" in tests/unit/io-threads.tcl checked that each used_active_time_io_thread_* stayed below 1 second after a 1-second sleep. The counter at src/io_threads.c:27 is a _Atomic long long that is monotonic across the server lifetime and never reset — line 96 of the test (assert_morethan after the io-threads 1 -> 5 toggle round-trip) already relies on exactly that property.

By the time of the post-sleep assertion the counter includes:

  1. The first activate_io_threads_and_wait (line 67) — 16 clients opened, client 0 silent, 15 senders × 32 INCRs = 480 ops.
  2. The io-threads 1 -> 5 toggle.
  3. The second activate_io_threads_and_wait (line 108) — another 480 ops.

On test-sanitizer-address-gcc and test-almalinux8-tls-module the cumulative total exceeds 1.0 s; that's the 1.619448 observed in the failure message. expr {$sleep_time_ms/1000} is also Tcl integer division (= 1), so the bound is tighter than the comment suggests.

Fix

Capture per-thread used_active_time immediately before after $sleep_time_ms and bound the post-sleep delta against that snapshot. Preserves the global "IO threads should idle when not activated" invariant that line 116 was protecting (introduced in #2463), but anchors it to this reactivation. Also flips /1000 to /1000.0 to avoid the integer-division surprise.

The new r info snapshot itself ticks workers by sub-millisecond because io-threads-always-active yes stays on for this test; the /1000.0 conversion preserves enough precision for the < 1.0 s tolerance to absorb it.

Follow-up commit (6537310) also drops the now-redundant delta_from_initial assertion: it measures cumulative active time across the whole test and is fragile by the same monotonic-counter mechanism — PR #3124 hit this exact form under valgrind (Expected (7.94 - 5.87) < (1000/1000)) and worked around it with valgrind:skip. The per-reactivation delta_from_pre_sleep bound is now the single source of truth for this flake class.

Alternatives considered

Verification

Locally on darwin/arm64: ./runtest --single unit/io-threadsForce the use of IO threads and assert active IO thread usage (6164 ms) passes under both vanilla and SANITIZER=address builds, plus the existing memory-leak check.

Verified the monotonic-accumulation model with an instrumented test run (4 extra activations before the pre-sleep snapshot, puts stderr of cumulative / pre-sleep / post values); per-thread numbers in a follow-up comment below. I don't have an ASan/TLS-module Linux environment locally, so the absolute < 1.0 s threshold crossing itself is left to the CI matrix.

…ivation baseline

The test "Force the use of IO threads and assert active IO thread usage"
asserted that each io_thread's used_active_time_io_thread counter stayed
below 1 second after a 1-second sleep. The counter is monotonic across
the server lifetime and never reset (src/io_threads.c), so by the time
of that assertion it already includes work from the earlier
activate_io_threads_and_wait call plus the io-threads 1 -> 5 toggle.

On test-sanitizer-address-gcc and test-almalinux8-tls-module CI lanes the
cumulative total exceeds 1.0 s and the assertion fails
(e.g. "Expected 1.619448 < (1000/1000)").

Capture per-thread used_active_time immediately before the sleep and
bound the post-sleep delta against that snapshot. This preserves the
original "IO threads should idle when not activated" invariant from
valkey-io#2463 while anchoring it to this reactivation. Also flip /1000 to
/1000.0 to avoid Tcl integer-division surprises.

Fixes valkey-io#3727.

Signed-off-by: Taeknology <20297177+Taeknology@users.noreply.github.com>
@Taeknology

Copy link
Copy Markdown
Contributor Author

Local model verification — per-thread numbers from an instrumented run with 4 extra activate_io_threads_and_wait calls inserted before the pre-sleep snapshot. The instrumentation was reverted unconditionally after the run; HEAD remains at the single patch commit.

metric (thread 1, busiest) seconds
after 1 activation (initial) 0.003069
after 5 activations (1 + 4 extras) 0.014908
post (= 6 activations, 1 + 4 + final) 0.018538
delta_from_initial (5 activations cumulative) 0.015469
delta_from_pre_sleep (1 final activation only) 0.003608

Three observations:

  1. The counter grows linearly with the activation count (~5x across 4 extras), confirming the monotonic accumulation argued above.
  2. delta_from_pre_sleep — what the patch asserts on — stays in 0.001-0.004 s on every thread regardless of the inflated cumulative.
  3. delta_from_initial — close to what the pre-patch was asserting on — grows with prior work and is the quantity that crosses 1.0 s on instrumented CI.

Local arm64 cost per activation is ~0.003 s; the failing CI message (1.619448 across two activations) implies ~0.81 s per activation, i.e. ~270x slower than my local. That's why the absolute < 1.0 s bound trips on sanitizer/TLS lanes but not locally, and why local reproduction without a Linux ASan environment was not feasible. The patched assertion (delta-from-pre-sleep) is invariant to that scaling by construction.

@coderabbitai

coderabbitai Bot commented May 19, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 93a0ef7e-f9c1-454a-930c-ff0b6c37c844

📥 Commits

Reviewing files that changed from the base of the PR and between d1d97b4 and 6537310.

📒 Files selected for processing (1)
  • tests/unit/io-threads.tcl

📝 Walkthrough

Walkthrough

The test tests/unit/io-threads.tcl snapshots per-thread used_active_time immediately before sleeping and then asserts that each thread’s post-sleep increase is below the sleep duration in seconds using /1000.0, removing earlier comparisons to initial snapshots.

Changes

IO thread active time measurement fix

Layer / File(s) Summary
Pre-sleep snapshot and post-reactivation assertion
tests/unit/io-threads.tcl
Removes the earlier initial_active_times snapshot (lines 70–73), records used_active_time_io_thread_$i into pre_sleep_active_times immediately before after $sleep_time_ms (lines 103–110), and after wake asserts (used_active_time - pre_sleep_active_times($i)) < (sleep_time_ms/1000.0) for each non-final IO thread, replacing previous assertions (lines 120–122).

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title directly and accurately describes the main change: anchoring the active-time bound to a per-reactivation baseline to deflake the test.
Description check ✅ Passed The description clearly relates to the changeset, explaining the root cause of the test flake and how the snapshot-before-sleep fix addresses it.
Linked Issues check ✅ Passed The code changes directly address issue #3727: snapshot per-thread active_time before sleep and bound post-sleep delta against that baseline, while fixing Tcl integer-division by using /1000.0.
Out of Scope Changes check ✅ Passed All changes in the test file are directly related to fixing the flaky test by snapshotting active-time and adjusting assertions, with no unrelated modifications.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/unit/io-threads.tcl`:
- Around line 122-125: Remove the stale baseline assertion that compares
used_active_time against initial_active_times (the monotonic counter makes this
flaky); in the test block containing used_active_time, initial_active_times,
pre_sleep_active_times and sleep_time_ms, delete the assertion line referencing
initial_active_times and keep only the assertion that checks (used_active_time -
pre_sleep_active_times($i)) < (sleep_time_ms/1000.0), and update or simplify the
surrounding comment so it only documents bounding activity attributable to this
reactivation.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 295c4bd3-df16-402e-bb9a-84d1a4ab079b

📥 Commits

Reviewing files that changed from the base of the PR and between 54fbe4a and d1d97b4.

📒 Files selected for processing (1)
  • tests/unit/io-threads.tcl

Comment thread tests/unit/io-threads.tcl Outdated
The delta-from-initial-baseline assertion measures cumulative active
time across the entire test (toggle + sleep + final activation). As
the monotonic counter (src/io_threads.c:27) accumulates, this bound
is fragile: PR valkey-io#3124 had to add valgrind:skip after the same form
failed under valgrind. The remaining delta-from-pre-sleep assertion
covers the real invariant — io_threads should idle during sleep and
only the final reactivation contributes to the post-sleep delta.

Addresses review feedback on valkey-io#3764.

Signed-off-by: Taeknology <20297177+Taeknology@users.noreply.github.com>
@Taeknology

Copy link
Copy Markdown
Contributor Author

Good point. Dropped the delta_from_initial assertion in 6537310 — it shares the same monotonic-counter root cause (PR #3124 hit this exact form under valgrind: Expected (7.94 - 5.87) < (1000/1000)). The per-reactivation delta_from_pre_sleep bound is now the single source of truth for this flake class.

@rainsupreme rainsupreme left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Fixes the measurement methodology.

nit: would like to see 100x github workflow run showing the test is no longer flaky (not a blocker for me though)

@sarthakaggarwal97 sarthakaggarwal97 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me. I would although like to see a 100x run of the flaky tests in that environment to be sure this solves it.

Please refer to this section on how to run daily tests multiple times: https://github.com/valkey-io/valkey/blob/unstable/CONTRIBUTING.md#running-the-daily-workflow-on-demand-for-your-branch

@codecov

codecov Bot commented Jun 24, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.78%. Comparing base (79bca53) to head (5edebba).

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #3764      +/-   ##
============================================
+ Coverage     76.68%   76.78%   +0.10%     
============================================
  Files           162      162              
  Lines         81021    81021              
============================================
+ Hits          62129    62212      +83     
+ Misses        18892    18809      -83     

see 14 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@sarthakaggarwal97

Copy link
Copy Markdown
Contributor

I ran a targeted Daily workflow in my fork to stress-test this PR’s io-threads change with sanitizer jobs and repeated loops:

https://github.com/sarthakaggarwal97/valkey/actions/runs/28386791964/job/84103406744

The failure is different from the original issue this PR was trying to fix.

The original problem was that the test compared used_active_time_io_thread_* directly against a 1 second bound even though that counter is monotonic over the server lifetime. The PR fixes that part by snapshotting pre_sleep_active_times and comparing only the delta.

However, the targeted Daily run shows that the new delta-based assertion is still too strict under sanitizer stress runs. The failing assertion is:

($used_active_time - $pre_sleep_active_times($i)) < ($sleep_time_ms/1000.0)

With --single unit/io-threads --loops 100, this failed repeatedly across sanitizer jobs. The smallest failing delta was just over 1 second, and some deltas were several seconds.

So I think the PR is directionally fixing the right bug, but the replacement invariant still needs adjustment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[TEST-FAILURE] Force the use of IO threads and assert active IO thread usage in tests/unit/io-threads.tcl

3 participants