fix(pool): prevent test run hang on worker crash by jaxalo · Pull Request #10543 · vitest-dev/vitest

jaxalo · 2026-06-08T15:13:39Z

Description

Pool.schedule() registers the per-task runner.on('error', ...) listener only inside the if (!runner.isStarted) branch. The listener closes over the current resolver, but it's added once when the runner first transitions out of IDLE. For any subsequent task that picks up the same runner from sharedRunners (the common case under isolate: false), the listener already exists, closed over the first task's resolver and the active task's resolver has no error path.

When a fork worker exits unexpectedly mid-test (the original report's case was V8 OOM in a 50-file batch), PoolRunner.emitUnexpectedExit emits an 'error' event. The stale listener fires and calls reject on a long-settled resolver (no-op). The active task's resolver stays pending forever. Downstream:

pool.run hangs on await testFinish.promise
pool.runTests's Promise.allSettled hangs
runFiles.finally never runs, so generateCoverage / reportCoverage are skipped
coverage report for the affected batch is silently never written, even though the worker's earlier-completed test files had their per-suite coverage chunks on disk
main process eventually exits via libuv drain + process.on('exit') cleanup, often with code 0 making the failure invisible in CI

The fix is a permanent per-task onTaskError listener that rejects the resolver on error and pairs with onFinished for cleanup so the listener count stays bounded.

Reproduction

The new e2e test in test/e2e/test/pool-worker-exit.test.ts builds the exact path:

pool: 'forks', isolate: false, maxWorkers: 2 (so files become separate tasks instead of batched)
3 test files — the 3rd (crash.test.js) runs on a runner already in STARTED state because the 1st's runner was released into sharedRunners
crash.test.js SIGKILLs its own worker mid-run via queueMicrotask(() => process.kill(process.pid, 'SIGKILL'))
assert: coverage-final.json exists and contains src.js coverage

5/5 runs WITHOUT the fix → test FAILS (no coverage-final.json produced; vitest exits in ~1s).
5/5 runs WITH the fix → test PASSES (coverage-final.json written; vitest exits ~10s after the worker dies, triggered by the now-properly-rejected resolver flowing through to runFiles.finally).

Real-world context

I hit this on a ~3000-test backend repo running vitest 4.0.15 with pool: 'forks', isolate: false, and a 1.1 GB heap limit. One fork worker per CI run would V8-OOM on a long-running test file. Pre-fix that manifested as silent coverage loss. The failing batch's coverage-final.json was simply never written because runFiles.finally didn't fire and the Test job was reported green by CircleCI because main exited with code 0. The downstream Coverage step then failed with per-file 0% thresholds on a seemingly random subset of files, decoupled from the actual cause.

Please don't delete this checklist! Before submitting the PR, please make sure you do the following:

It's really useful if your PR references an issue where it is discussed ahead of time. If the feature is substantial or introduces breaking changes without a discussion, PR might be closed.
Ideally, include a test that fails without this PR but passes with it.
Please, don't make changes to `pnpm-lock.yaml` unless you introduce a new test example.
Please check Allow edits by maintainers to make review process faster.

Tests

e2e test passes 5/5 with the fix and fails 5/5 without (no flakiness).

Documentation

No public API change.

Changesets

PR title prefixed with `fix:`.

AI disclosure

This PR was prepared with Claude (Anthropic) as a coding assistant. I hit the root cause on a real codebase, instrumented vitest's pool to confirm the mechanism, and iterated on the reproducer until it was deterministic.

AriPerkkio

Looks good, thanks @jaxalo 💯

Noticed that the test case was still hanging in pool.close() after worker crash was properly handled, so added couple of more fixes to address that. Now test case runs quickly without any promise hangs. All these issues are strictly limited to isolate: false runs.

AriPerkkio · 2026-06-09T13:01:11Z

-        runner.on('error', (error) => {
-          resolver.reject(
-            new Error(`[vitest-pool]: Worker ${task.worker} emitted error.`, { cause: error }),
-          )
-        })


This was the cause of task hanging. resolver here references the previous test file, while resolver outside this scope is the current test file. Emitted error "cancelled" previous test file and left current one hanging.

AriPerkkio · 2026-06-09T13:01:52Z


      if (
        !task.isolate
+        && !runner.isTerminated


Now recycling crashed workers instead of re-using, so that next test files can start properly instead of failing.

AriPerkkio · 2026-06-09T13:02:28Z


  private emitUnexpectedExit = (): void => {
    const error = new Error(`Worker exited unexpectedly during ${this._state} state`)
+    this._state = RunnerState.STOPPED


This will prevent Timeout terminating worker crashed-worker.test.ts cases. No need to attempt terminating crashed workers.

Also makes pool close quickly without teardown timeout activating.

jaxalo · 2026-06-09T21:51:52Z

Thanks a lot for the thorough review and for taking the time to push the follow-up fixes directly on the branch, @AriPerkkio learned a ton reading through your commits. Really appreciate it 🙏

`Pool.schedule()` only registers a `runner.on('error', ...)` listener inside `if (!runner.isStarted)`, so the listener exists for the FIRST task on a runner but not for subsequent tasks reusing the same runner under `isolate: false`. When a fork worker exits unexpectedly mid-test (e.g. OOM), `PoolRunner.emitUnexpectedExit` emits an 'error' event with no listener, the throw escapes into a libuv-callback context, and the active task's resolver stays pending forever. The visible symptoms vary: - `pool.runTests` hangs waiting on `Promise.allSettled` of the pending resolver - `runFiles.finally` never runs, so coverage and reporter teardown are skipped - depending on the surrounding code path, main exits naturally with code 0 (silent failure on CI) or hits a teardown-timeout fallback Add a permanent per-task `onTaskError` listener that rejects the resolver when an error event fires after startup. `onFinished` and `onTaskError` clean each other up on resolution so listener count doesn't grow. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…o v4] (#10564) Co-authored-by: Jattioui Ismail <44307645+jaxalo@users.noreply.github.com>

jaxalo force-pushed the fix/pool-runner-reject-on-worker-exit branch 2 times, most recently from 6b85143 to 9d4f59c Compare June 8, 2026 15:23

sheremet-va requested a review from AriPerkkio June 9, 2026 07:46

AriPerkkio changed the title ~~fix(pool): reject pending task resolver when worker exits unexpectedly~~ fix(pool): prevent test run hang on worker crash Jun 9, 2026

AriPerkkio approved these changes Jun 9, 2026

View reviewed changes

jaxalo and others added 6 commits June 10, 2026 10:14

fix(pool): don't hang on crashed worker termination

aec3a0a

test: refactor to use test-utils

9e2e84b

fix(pool): don't recycle crashed workers

f413e84

refactor: revert unrelated changes

1035936

fix(pool): non-isolated pool error listeners reference previous tests

e608eb9

jaxalo force-pushed the fix/pool-runner-reject-on-worker-exit branch from 5313d12 to e608eb9 Compare June 10, 2026 08:14

AriPerkkio merged commit 4087802 into vitest-dev:main Jun 10, 2026
17 of 18 checks passed

AriPerkkio mentioned this pull request Jun 10, 2026

fix(pool): prevent test run hang on worker crash (#10543) [backport to v4] #10564

Merged

sheremet-va pushed a commit that referenced this pull request Jun 11, 2026

fix(pool): prevent test run hang on worker crash (#10543) [backport t…

934b0f5

…o v4] (#10564) Co-authored-by: Jattioui Ismail <44307645+jaxalo@users.noreply.github.com>

cneller mentioned this pull request Jun 13, 2026

Worker "error" listener only registered for the first task on a runner — with isolate: false, a worker dying mid-task strands the run (silent hang / false exit 0 under Bun) #10590

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

fix(pool): prevent test run hang on worker crash#10543

fix(pool): prevent test run hang on worker crash#10543
AriPerkkio merged 6 commits into
vitest-dev:mainfrom
jaxalo:fix/pool-runner-reject-on-worker-exit

jaxalo commented Jun 8, 2026 •

edited

Loading

Uh oh!

AriPerkkio left a comment •

edited

Loading

Uh oh!

AriPerkkio Jun 9, 2026

Uh oh!

AriPerkkio Jun 9, 2026 •

edited

Loading

Uh oh!

AriPerkkio Jun 9, 2026

Uh oh!

jaxalo commented Jun 9, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Uh oh!

Conversation

jaxalo commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Reproduction

Real-world context

Please don't delete this checklist! Before submitting the PR, please make sure you do the following:

Tests

Documentation

Changesets

AI disclosure

Uh oh!

AriPerkkio left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

AriPerkkio Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

AriPerkkio Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

AriPerkkio Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

jaxalo commented Jun 9, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jaxalo commented Jun 8, 2026 •

edited

Loading

AriPerkkio left a comment •

edited

Loading

AriPerkkio Jun 9, 2026 •

edited

Loading