fix(pool): prevent test run hang on worker crash#10543
Conversation
6b85143 to
9d4f59c
Compare
There was a problem hiding this comment.
Looks good, thanks @jaxalo 💯
Noticed that the test case was still hanging in pool.close() after worker crash was properly handled, so added couple of more fixes to address that. Now test case runs quickly without any promise hangs. All these issues are strictly limited to isolate: false runs.
| runner.on('error', (error) => { | ||
| resolver.reject( | ||
| new Error(`[vitest-pool]: Worker ${task.worker} emitted error.`, { cause: error }), | ||
| ) | ||
| }) |
There was a problem hiding this comment.
This was the cause of task hanging. resolver here references the previous test file, while resolver outside this scope is the current test file. Emitted error "cancelled" previous test file and left current one hanging.
|
|
||
| if ( | ||
| !task.isolate | ||
| && !runner.isTerminated |
There was a problem hiding this comment.
Now recycling crashed workers instead of re-using, so that next test files can start properly instead of failing.
|
|
||
| private emitUnexpectedExit = (): void => { | ||
| const error = new Error(`Worker exited unexpectedly during ${this._state} state`) | ||
| this._state = RunnerState.STOPPED |
There was a problem hiding this comment.
This will prevent Timeout terminating worker crashed-worker.test.ts cases. No need to attempt terminating crashed workers.
Also makes pool close quickly without teardown timeout activating.
|
Thanks a lot for the thorough review and for taking the time to push the follow-up fixes directly on the branch, @AriPerkkio learned a ton reading through your commits. Really appreciate it 🙏 |
`Pool.schedule()` only registers a `runner.on('error', ...)` listener
inside `if (!runner.isStarted)`, so the listener exists for the FIRST
task on a runner but not for subsequent tasks reusing the same runner
under `isolate: false`. When a fork worker exits unexpectedly mid-test
(e.g. OOM), `PoolRunner.emitUnexpectedExit` emits an 'error' event with
no listener, the throw escapes into a libuv-callback context, and the
active task's resolver stays pending forever.
The visible symptoms vary:
- `pool.runTests` hangs waiting on `Promise.allSettled` of the pending
resolver
- `runFiles.finally` never runs, so coverage and reporter teardown are
skipped
- depending on the surrounding code path, main exits naturally with
code 0 (silent failure on CI) or hits a teardown-timeout fallback
Add a permanent per-task `onTaskError` listener that rejects the
resolver when an error event fires after startup. `onFinished` and
`onTaskError` clean each other up on resolution so listener count
doesn't grow.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
5313d12 to
e608eb9
Compare
…o v4] (#10564) Co-authored-by: Jattioui Ismail <44307645+jaxalo@users.noreply.github.com>
Description
Pool.schedule()registers the per-taskrunner.on('error', ...)listener only inside theif (!runner.isStarted)branch. The listener closes over the currentresolver, but it's added once when the runner first transitions out of IDLE. For any subsequent task that picks up the same runner fromsharedRunners(the common case underisolate: false), the listener already exists, closed over the first task's resolver and the active task's resolver has no error path.When a fork worker exits unexpectedly mid-test (the original report's case was V8 OOM in a 50-file batch),
PoolRunner.emitUnexpectedExitemits an'error'event. The stale listener fires and callsrejecton a long-settled resolver (no-op). The active task's resolver stays pending forever. Downstream:pool.runhangs onawait testFinish.promisepool.runTests'sPromise.allSettledhangsrunFiles.finallynever runs, sogenerateCoverage/reportCoverageare skippedprocess.on('exit')cleanup, often with code 0 making the failure invisible in CIThe fix is a permanent per-task
onTaskErrorlistener that rejects the resolver on error and pairs withonFinishedfor cleanup so the listener count stays bounded.Reproduction
The new e2e test in
test/e2e/test/pool-worker-exit.test.tsbuilds the exact path:pool: 'forks',isolate: false,maxWorkers: 2(so files become separate tasks instead of batched)crash.test.js) runs on a runner already in STARTED state because the 1st's runner was released intosharedRunnerscrash.test.jsSIGKILLs its own worker mid-run viaqueueMicrotask(() => process.kill(process.pid, 'SIGKILL'))coverage-final.jsonexists and contains src.js coverage5/5 runs WITHOUT the fix → test FAILS (no
coverage-final.jsonproduced; vitest exits in ~1s).5/5 runs WITH the fix → test PASSES (
coverage-final.jsonwritten; vitest exits ~10s after the worker dies, triggered by the now-properly-rejected resolver flowing through torunFiles.finally).Real-world context
I hit this on a ~3000-test backend repo running vitest 4.0.15 with
pool: 'forks',isolate: false, and a 1.1 GB heap limit. One fork worker per CI run would V8-OOM on a long-running test file. Pre-fix that manifested as silent coverage loss. The failing batch'scoverage-final.jsonwas simply never written becauserunFiles.finallydidn't fire and the Test job was reported green by CircleCI because main exited with code 0. The downstream Coverage step then failed with per-file 0% thresholds on a seemingly random subset of files, decoupled from the actual cause.Please don't delete this checklist! Before submitting the PR, please make sure you do the following:
Tests
Documentation
Changesets
AI disclosure
This PR was prepared with Claude (Anthropic) as a coding assistant. I hit the root cause on a real codebase, instrumented vitest's pool to confirm the mechanism, and iterated on the reproducer until it was deterministic.