Skip to content

test(dev): debug instrumentation for node-dev-server-test CI flake#9735

Closed
h-a-n-a wants to merge 3 commits into
mainfrom
test/dev-server-flake-debug-logs
Closed

test(dev): debug instrumentation for node-dev-server-test CI flake#9735
h-a-n-a wants to merge 3 commits into
mainfrom
test/dev-server-flake-debug-logs

Conversation

@h-a-n-a

@h-a-n-a h-a-n-a commented Jun 13, 2026

Copy link
Copy Markdown
Member

What

Temporary, greppable instrumentation to diagnose the intermittent node-dev-server-test timeout (the job exceeds the 15-min ceiling and is canceled). Seen on node 20 while node 22/24 pass the same commit in <2 min, so it's a nondeterministic hang. Can't reproduce locally, hence pushing instrumentation to CI.

Working theory

The browser playground specs mutate a shared on-disk fixture (playground-temp/<name>/) via anchored string replacement, with no per-test/per-retry reset (the playground is copied once in global setup). A flaky first-attempt failure leaves the fixture off-baseline; after that, retries and sibling specs can never recover because editFile silently no-ops when its anchor string is gone — so the expected rebuild/HMR never fires and the test burns its full testTimeout × 3 retries. With specs running sequentially and sharing the dir, one flaky failure snowballs into the 15-min cancellation.

Evidence from the failing run: rebuild-error.spec.ts is the first to fail, then hmr-full-bundle-mode.spec.ts fails all 5 tests (~11 min), with repeated [editFile] No changes detected … warnings throughout.

Instrumentation (grep prefixes)

  • [envdbg] — node version, cpus/mem/loadavg, resolved vitest config, spec files, copied playgrounds (compare node 20 vs 22/24).
  • [fixtdbg] ── BEFORE — full fixture-tree snapshot at the start of every attempt, with attempt#/retryCount. Confirms whether retries/sibling specs inherit a corrupted fixture.
  • [fixtdbg] ── FAILED (onTestFailed) — error + fixture snapshot + server/browser log tails, printed inline so it survives the 15-min cancellation (Vitest's batched summary doesn't).
  • [editFile] — logs APPLIED before -> after; the NOOP case now prints the current file content.
  • hanging-process reporter on both suites — surfaces the close timed out after 10000ms / orphaned-process symptom.
  • RD_TEST_RETRY / RD_TEST_TIMEOUT env overrides — set RD_TEST_RETRY=0 to fail fast and let the whole suite finish inside the job budget on a repro run.

Note

This is diagnostic-only and meant to be reverted (or trimmed to whatever turns out to be load-bearing) once the root cause is confirmed. No production code touched.

🤖 Generated with Claude Code

The node-dev-server-test job intermittently times out at the 15-min ceiling
(seen on node 20 while node 22/24 pass the same commit in <2min). The
playground specs mutate a shared on-disk fixture (playground-temp/<name>/) via
anchored string replacement with no per-test/per-retry reset; the working
theory is that a flaky first-attempt failure leaves the fixture off-baseline,
after which retries and sibling specs can never recover because editFile
silently no-ops on a missing anchor.

Can't reproduce locally, so add temporary, greppable instrumentation to confirm
it on CI:

- [envdbg]: node version, cpus/mem/loadavg, resolved vitest config, spec files
- [fixtdbg] BEFORE: full fixture-tree snapshot at the start of every attempt,
  with attempt#/retryCount
- [fixtdbg] FAILED (onTestFailed): error + fixture snapshot + server/browser
  log tails, printed inline so it survives the 15-min cancellation
- [editFile]: log APPLIED before->after, and NOOP now prints current content
- hanging-process reporter on both suites (the "close timed out" symptom)
- RD_TEST_RETRY / RD_TEST_TIMEOUT env overrides for faster repro runs

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@netlify

netlify Bot commented Jun 13, 2026

Copy link
Copy Markdown

Deploy Preview for rolldown-rs canceled.

Name Link
🔨 Latest commit f1e583b
🔍 Latest deploy log https://app.netlify.com/projects/rolldown-rs/deploys/6a2d28ce309e0500080b7d4b

h-a-n-a and others added 2 commits June 13, 2026 17:26
…poll trace

The broad [envdbg]/[fixtdbg]/[editFile] snapshot logging confirmed the cascade
(rebuild-error.spec leaks flag.txt=broken-1 into the shared playground-temp,
poisoning every sibling build). What's still unexplained is the *trigger*: why
attempt#1's recovery poll never sees 'rebuild-error: recovered'
(rebuild-error.spec.ts:77).

Strip the noisy instrumentation and replace it with a focused probe on that one
poll:
- bump the poll 15s -> 30s to learn whether it recovers late or is truly stuck
- each tick print `.rebuild-error` text + overlay count + any new server logs,
  so server-side rebuild activity lines up against what the browser renders

This disambiguates: page never reloaded (overlay clears, text stuck at 'ok') vs
recovery build never succeeded (overlay stays, errors keep logging) vs simply
too slow (flips to 'recovered' past 15s).

Reverts the [envdbg] env dump, [fixtdbg] fixture snapshots, [editFile]
APPLIED/NOOP verbosity, and the hanging-process reporters. Keeps the
RD_TEST_RETRY / RD_TEST_TIMEOUT knobs (run with RD_TEST_RETRY=0 to keep
attempt#1's trace out from under retries + the cascade).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The [recover] probe showed the failing case has zero server activity for 30s
(overlay stuck at 1, value stuck at 'ok', no server logs) while retries prove
the recovery write reached disk — i.e. the watched module.js change never
triggered a rebuild. Add two more per-tick signals to pin the gap between disk
and server:

- read rebuild-error/module.js (its value field) and flag.txt from disk each
  tick, so the trace shows the recovery write landed and stays even while
  nothing rebuilds
- read buildSeq from /_dev/status each tick; a flat buildSeq across the whole
  poll confirms no build was ever triggered (a dropped file-watch event), vs
  the working case where it increments when the rebuild lands

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@h-a-n-a

h-a-n-a commented Jun 13, 2026

Copy link
Copy Markdown
Member Author

Failed: https://github.com/rolldown/rolldown/actions/runs/27463457201/job/81182612946#step:9:604

Successful: https://github.com/rolldown/rolldown/actions/runs/27463457201/job/81182294575#step:9:607

Something in the watcher seems not to be working. File edits sometimes do not reflect in watcher updates even in poll mode. Will investigate. cc @sapphi-red @hyf0

@h-a-n-a h-a-n-a closed this Jun 13, 2026
@h-a-n-a h-a-n-a deleted the test/dev-server-flake-debug-logs branch June 13, 2026 15:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant