Skip to content

fix(pool): prevent test run hang on worker crash#10543

Merged
AriPerkkio merged 6 commits into
vitest-dev:mainfrom
jaxalo:fix/pool-runner-reject-on-worker-exit
Jun 10, 2026
Merged

fix(pool): prevent test run hang on worker crash#10543
AriPerkkio merged 6 commits into
vitest-dev:mainfrom
jaxalo:fix/pool-runner-reject-on-worker-exit

Conversation

@jaxalo

@jaxalo jaxalo commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Description

Pool.schedule() registers the per-task runner.on('error', ...) listener only inside the if (!runner.isStarted) branch. The listener closes over the current resolver, but it's added once when the runner first transitions out of IDLE. For any subsequent task that picks up the same runner from sharedRunners (the common case under isolate: false), the listener already exists, closed over the first task's resolver and the active task's resolver has no error path.

When a fork worker exits unexpectedly mid-test (the original report's case was V8 OOM in a 50-file batch), PoolRunner.emitUnexpectedExit emits an 'error' event. The stale listener fires and calls reject on a long-settled resolver (no-op). The active task's resolver stays pending forever. Downstream:

  • pool.run hangs on await testFinish.promise
  • pool.runTests's Promise.allSettled hangs
  • runFiles.finally never runs, so generateCoverage / reportCoverage are skipped
  • coverage report for the affected batch is silently never written, even though the worker's earlier-completed test files had their per-suite coverage chunks on disk
  • main process eventually exits via libuv drain + process.on('exit') cleanup, often with code 0 making the failure invisible in CI

The fix is a permanent per-task onTaskError listener that rejects the resolver on error and pairs with onFinished for cleanup so the listener count stays bounded.

Reproduction

The new e2e test in test/e2e/test/pool-worker-exit.test.ts builds the exact path:

  • pool: 'forks', isolate: false, maxWorkers: 2 (so files become separate tasks instead of batched)
  • 3 test files — the 3rd (crash.test.js) runs on a runner already in STARTED state because the 1st's runner was released into sharedRunners
  • crash.test.js SIGKILLs its own worker mid-run via queueMicrotask(() => process.kill(process.pid, 'SIGKILL'))
  • assert: coverage-final.json exists and contains src.js coverage

5/5 runs WITHOUT the fix → test FAILS (no coverage-final.json produced; vitest exits in ~1s).
5/5 runs WITH the fix → test PASSES (coverage-final.json written; vitest exits ~10s after the worker dies, triggered by the now-properly-rejected resolver flowing through to runFiles.finally).

Real-world context

I hit this on a ~3000-test backend repo running vitest 4.0.15 with pool: 'forks', isolate: false, and a 1.1 GB heap limit. One fork worker per CI run would V8-OOM on a long-running test file. Pre-fix that manifested as silent coverage loss. The failing batch's coverage-final.json was simply never written because runFiles.finally didn't fire and the Test job was reported green by CircleCI because main exited with code 0. The downstream Coverage step then failed with per-file 0% thresholds on a seemingly random subset of files, decoupled from the actual cause.

Please don't delete this checklist! Before submitting the PR, please make sure you do the following:

  • It's really useful if your PR references an issue where it is discussed ahead of time. If the feature is substantial or introduces breaking changes without a discussion, PR might be closed.
  • Ideally, include a test that fails without this PR but passes with it.
  • Please, don't make changes to `pnpm-lock.yaml` unless you introduce a new test example.
  • Please check Allow edits by maintainers to make review process faster.

Tests

  • e2e test passes 5/5 with the fix and fails 5/5 without (no flakiness).

Documentation

  • No public API change.

Changesets

  • PR title prefixed with `fix:`.

AI disclosure

This PR was prepared with Claude (Anthropic) as a coding assistant. I hit the root cause on a real codebase, instrumented vitest's pool to confirm the mechanism, and iterated on the reproducer until it was deterministic.

@jaxalo jaxalo force-pushed the fix/pool-runner-reject-on-worker-exit branch 2 times, most recently from 6b85143 to 9d4f59c Compare June 8, 2026 15:23
@sheremet-va sheremet-va requested a review from AriPerkkio June 9, 2026 07:46
@AriPerkkio AriPerkkio changed the title fix(pool): reject pending task resolver when worker exits unexpectedly fix(pool): prevent test run hang on worker crash Jun 9, 2026

@AriPerkkio AriPerkkio left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks @jaxalo 💯

Noticed that the test case was still hanging in pool.close() after worker crash was properly handled, so added couple of more fixes to address that. Now test case runs quickly without any promise hangs. All these issues are strictly limited to isolate: false runs.

Comment on lines -111 to -115
runner.on('error', (error) => {
resolver.reject(
new Error(`[vitest-pool]: Worker ${task.worker} emitted error.`, { cause: error }),
)
})

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was the cause of task hanging. resolver here references the previous test file, while resolver outside this scope is the current test file. Emitted error "cancelled" previous test file and left current one hanging.


if (
!task.isolate
&& !runner.isTerminated

@AriPerkkio AriPerkkio Jun 9, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now recycling crashed workers instead of re-using, so that next test files can start properly instead of failing.


private emitUnexpectedExit = (): void => {
const error = new Error(`Worker exited unexpectedly during ${this._state} state`)
this._state = RunnerState.STOPPED

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will prevent Timeout terminating worker crashed-worker.test.ts cases. No need to attempt terminating crashed workers.

Also makes pool close quickly without teardown timeout activating.

@jaxalo

jaxalo commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

Thanks a lot for the thorough review and for taking the time to push the follow-up fixes directly on the branch, @AriPerkkio learned a ton reading through your commits. Really appreciate it 🙏

jaxalo and others added 6 commits June 10, 2026 10:14
`Pool.schedule()` only registers a `runner.on('error', ...)` listener
inside `if (!runner.isStarted)`, so the listener exists for the FIRST
task on a runner but not for subsequent tasks reusing the same runner
under `isolate: false`. When a fork worker exits unexpectedly mid-test
(e.g. OOM), `PoolRunner.emitUnexpectedExit` emits an 'error' event with
no listener, the throw escapes into a libuv-callback context, and the
active task's resolver stays pending forever.

The visible symptoms vary:
- `pool.runTests` hangs waiting on `Promise.allSettled` of the pending
  resolver
- `runFiles.finally` never runs, so coverage and reporter teardown are
  skipped
- depending on the surrounding code path, main exits naturally with
  code 0 (silent failure on CI) or hits a teardown-timeout fallback

Add a permanent per-task `onTaskError` listener that rejects the
resolver when an error event fires after startup. `onFinished` and
`onTaskError` clean each other up on resolution so listener count
doesn't grow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jaxalo jaxalo force-pushed the fix/pool-runner-reject-on-worker-exit branch from 5313d12 to e608eb9 Compare June 10, 2026 08:14
@AriPerkkio AriPerkkio merged commit 4087802 into vitest-dev:main Jun 10, 2026
17 of 18 checks passed
sheremet-va pushed a commit that referenced this pull request Jun 11, 2026
…o v4] (#10564)

Co-authored-by: Jattioui Ismail <44307645+jaxalo@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants