test(dev): debug instrumentation for node-dev-server-test CI flake#9735
Closed
h-a-n-a wants to merge 3 commits into
Closed
test(dev): debug instrumentation for node-dev-server-test CI flake#9735h-a-n-a wants to merge 3 commits into
h-a-n-a wants to merge 3 commits into
Conversation
The node-dev-server-test job intermittently times out at the 15-min ceiling (seen on node 20 while node 22/24 pass the same commit in <2min). The playground specs mutate a shared on-disk fixture (playground-temp/<name>/) via anchored string replacement with no per-test/per-retry reset; the working theory is that a flaky first-attempt failure leaves the fixture off-baseline, after which retries and sibling specs can never recover because editFile silently no-ops on a missing anchor. Can't reproduce locally, so add temporary, greppable instrumentation to confirm it on CI: - [envdbg]: node version, cpus/mem/loadavg, resolved vitest config, spec files - [fixtdbg] BEFORE: full fixture-tree snapshot at the start of every attempt, with attempt#/retryCount - [fixtdbg] FAILED (onTestFailed): error + fixture snapshot + server/browser log tails, printed inline so it survives the 15-min cancellation - [editFile]: log APPLIED before->after, and NOOP now prints current content - hanging-process reporter on both suites (the "close timed out" symptom) - RD_TEST_RETRY / RD_TEST_TIMEOUT env overrides for faster repro runs Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
✅ Deploy Preview for rolldown-rs canceled.
|
…poll trace The broad [envdbg]/[fixtdbg]/[editFile] snapshot logging confirmed the cascade (rebuild-error.spec leaks flag.txt=broken-1 into the shared playground-temp, poisoning every sibling build). What's still unexplained is the *trigger*: why attempt#1's recovery poll never sees 'rebuild-error: recovered' (rebuild-error.spec.ts:77). Strip the noisy instrumentation and replace it with a focused probe on that one poll: - bump the poll 15s -> 30s to learn whether it recovers late or is truly stuck - each tick print `.rebuild-error` text + overlay count + any new server logs, so server-side rebuild activity lines up against what the browser renders This disambiguates: page never reloaded (overlay clears, text stuck at 'ok') vs recovery build never succeeded (overlay stays, errors keep logging) vs simply too slow (flips to 'recovered' past 15s). Reverts the [envdbg] env dump, [fixtdbg] fixture snapshots, [editFile] APPLIED/NOOP verbosity, and the hanging-process reporters. Keeps the RD_TEST_RETRY / RD_TEST_TIMEOUT knobs (run with RD_TEST_RETRY=0 to keep attempt#1's trace out from under retries + the cascade). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The [recover] probe showed the failing case has zero server activity for 30s (overlay stuck at 1, value stuck at 'ok', no server logs) while retries prove the recovery write reached disk — i.e. the watched module.js change never triggered a rebuild. Add two more per-tick signals to pin the gap between disk and server: - read rebuild-error/module.js (its value field) and flag.txt from disk each tick, so the trace shows the recovery write landed and stays even while nothing rebuilds - read buildSeq from /_dev/status each tick; a flat buildSeq across the whole poll confirms no build was ever triggered (a dropped file-watch event), vs the working case where it increments when the rebuild lands Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Member
Author
|
Failed: https://github.com/rolldown/rolldown/actions/runs/27463457201/job/81182612946#step:9:604 Successful: https://github.com/rolldown/rolldown/actions/runs/27463457201/job/81182294575#step:9:607 Something in the watcher seems not to be working. File edits sometimes do not reflect in watcher updates even in poll mode. Will investigate. cc @sapphi-red @hyf0 |
This was referenced Jun 13, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Temporary, greppable instrumentation to diagnose the intermittent
node-dev-server-testtimeout (the job exceeds the 15-min ceiling and is canceled). Seen on node 20 while node 22/24 pass the same commit in <2 min, so it's a nondeterministic hang. Can't reproduce locally, hence pushing instrumentation to CI.Working theory
The browser playground specs mutate a shared on-disk fixture (
playground-temp/<name>/) via anchored string replacement, with no per-test/per-retry reset (the playground is copied once in global setup). A flaky first-attempt failure leaves the fixture off-baseline; after that, retries and sibling specs can never recover becauseeditFilesilently no-ops when its anchor string is gone — so the expected rebuild/HMR never fires and the test burns its fulltestTimeout × 3 retries. With specs running sequentially and sharing the dir, one flaky failure snowballs into the 15-min cancellation.Evidence from the failing run:
rebuild-error.spec.tsis the first to fail, thenhmr-full-bundle-mode.spec.tsfails all 5 tests (~11 min), with repeated[editFile] No changes detected …warnings throughout.Instrumentation (grep prefixes)
[envdbg]— node version, cpus/mem/loadavg, resolved vitest config, spec files, copied playgrounds (compare node 20 vs 22/24).[fixtdbg] ── BEFORE— full fixture-tree snapshot at the start of every attempt, withattempt#/retryCount. Confirms whether retries/sibling specs inherit a corrupted fixture.[fixtdbg] ── FAILED(onTestFailed) — error + fixture snapshot + server/browser log tails, printed inline so it survives the 15-min cancellation (Vitest's batched summary doesn't).[editFile]— logsAPPLIED before -> after; theNOOPcase now prints the current file content.hanging-processreporter on both suites — surfaces theclose timed out after 10000ms/ orphaned-process symptom.RD_TEST_RETRY/RD_TEST_TIMEOUTenv overrides — setRD_TEST_RETRY=0to fail fast and let the whole suite finish inside the job budget on a repro run.Note
This is diagnostic-only and meant to be reverted (or trimmed to whatever turns out to be load-bearing) once the root cause is confirmed. No production code touched.
🤖 Generated with Claude Code