[ci] Cap parallel test concurrency and fix fragile test timeouts #13604
Merged
Follow-up to #13596 and #13601, which removed --concurrency=1 entirely. Unlimited concurrency caused CPU starvation on 4-vCPU CI runners when 20+ workerd-spawning fixtures or 3+ heavyweight package suites ran simultaneously, leading to test timeouts.

Changes:
- Cap fixture concurrency at 4 (was unlimited, previously 1)
- Cap package concurrency at 3 (was unlimited, previously 1)
- Add testTimeout: 50_000 to 6 vitest configs that were using Vitest's default 5000ms instead of the repo standard 50s from vitest.shared.ts: workers-shared/asset-worker, workers-shared/router-worker, vite-plugin-cloudflare, edge-preview-authenticated-proxy, kv-asset-handler, pages-shared
- Increase start-worker-node-test timeout from 15s to 50s

The timeout fixes address pre-existing fragility: these configs never extended vitest.shared.ts and relied on Vitest's 5s default, which is insufficient under any CPU load.
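To make the timeout fallback concrete, here is a minimal sketch of how a config that never sets `testTimeout` ends up on the 5000ms default, and how an explicit 50s value changes that. `VitestLikeConfig`, `effectiveTimeout`, and `withRepoTimeout` are hypothetical names used for illustration only; they are not Vitest APIs and not the code in this PR, which sets `testTimeout: 50_000` directly in each config file.

```typescript
// Sketch only: the type and both helpers are hypothetical, not Vitest APIs.
interface VitestLikeConfig {
  test?: { testTimeout?: number; [option: string]: unknown };
}

const VITEST_DEFAULT_TIMEOUT_MS = 5_000; // Vitest's built-in default
const REPO_STANDARD_TIMEOUT_MS = 50_000; // value provided by vitest.shared.ts

// The timeout a config actually runs with: its own value, else the default.
function effectiveTimeout(config: VitestLikeConfig): number {
  return config.test?.testTimeout ?? VITEST_DEFAULT_TIMEOUT_MS;
}

// Fill in the repo standard only when the config does not set a timeout itself.
function withRepoTimeout(config: VitestLikeConfig): VitestLikeConfig {
  return {
    ...config,
    test: {
      ...config.test,
      testTimeout: config.test?.testTimeout ?? REPO_STANDARD_TIMEOUT_MS,
    },
  };
}
```

Note that this helper leaves an explicitly configured timeout untouched, so a hardcoded value such as the fixture's 15000ms would still need to be changed by hand.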
🦋 Changeset detected
Latest commit: 1b58095
The changes in this PR will be included in the next version bump.
Codeowners approval required for this PR:
UnknownError: ProviderInitError
@petebacondarwin Bonk workflow failed. Check the logs for details.
create-cloudflare
@cloudflare/kv-asset-handler
miniflare
@cloudflare/pages-shared
@cloudflare/unenv-preset
@cloudflare/vite-plugin
@cloudflare/vitest-pool-workers
@cloudflare/workers-editor-shared
wrangler
Windows CI runners are significantly slower under parallel load. With --concurrency=4, start-worker-node timed out at 50s on Windows despite normally taking ~1.5s. Lowering to 2 avoids CPU starvation on the slowest platform while still providing ~2x speedup over serial.
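As a rough illustration of what a concurrency cap does, the sketch below runs a list of async tasks with at most `limit` in flight at once. It is a generic bounded-concurrency runner written for this explanation, not turbo's scheduler; `runWithLimit` and `Task` are hypothetical names.

```typescript
// Illustrative only: a generic bounded-concurrency runner, not turbo's implementation.
type Task<T> = () => Promise<T>;

async function runWithLimit<T>(tasks: Task<T>[], limit: number): Promise<T[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  const results: T[] = new Array(tasks.length);
  let next = 0;

  // Each lane claims the next unstarted task until none remain,
  // so no more than `limit` tasks are ever running at the same time.
  async function lane(): Promise<void> {
    while (next < tasks.length) {
      const index = next++;
      results[index] = await tasks[index]();
    }
  }

  const laneCount = Math.min(limit, tasks.length);
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
  return results; // same order as the input tasks
}
```

With `limit` set to 1 this degenerates to serial execution. A cap only bounds how many tasks run together; it does not account for how much CPU each task uses, which is why the right value depends on the runner.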
dario-piotrowicz
approved these changes
Apr 20, 2026
The start-worker-node fixture's worker.dispose() cleanup hangs under any parallel load on Windows, causing the file-level node:test timeout to fire (50s) even though all 5 individual tests pass in <1s. This is a Windows-specific issue with workerd process cleanup under CPU pressure. Linux and macOS provide sufficient coverage for the startWorker API.
…13603) Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Wrangler automated PR updater <wrangler@cloudflare.com>
workers-devprod
approved these changes
Apr 20, 2026
workers-devprod
left a comment
Codeowners reviews satisfied
Revert the Windows-only skip — the test also fails on Linux under parallel load. The issue is that worker.dispose() cleanup hangs when any other fixture runs concurrently, causing node:test's file-level timeout to fire even though all 5 individual tests complete in <1s. Run it as a separate step after the parallel fixtures complete, so it gets the full CPU without contention.
petebacondarwin
added a commit
that referenced
this pull request
Apr 24, 2026
… fixed

The previous PR #13604 worked around an intermittent flake in the `start-worker-node-test` fixture by:

1. Running the fixture as its own CI step, serialised from all other fixtures (because parallel load made the cleanup hang more likely).
2. Bumping the `node --test` file-level timeout from 15s to 50s so that node:test would sometimes let the subprocess finish before cancelling it.

Neither addressed the underlying cause. The preceding commit fixes the actual esbuild resource leak in `unstable_startWorker` teardown, so the workaround can go:

- Re-include `./fixtures/start-worker-node-test` in the main fixtures test run (restores parallelism).
- Delete the separate `Run tests (start-worker-node)` step.
- Drop the node:test timeout back to 15s.
Follow-up fix for #13596 and #13601, which are causing CI failures.
Problem
Removing `--concurrency=1` entirely (unlimited parallelism) caused CPU starvation on GitHub Actions runners (4 vCPUs, 16GB RAM):
- `start-worker-node` (which normally takes 1.5s) timed out at 15s on all 3 OS platforms.
- `wrangler` (forks pool, 233 files), `miniflare` (workerd), `vitest-pool-workers` (Verdaccio + workerd), and `workers-shared` (workerd pool) all ran simultaneously. The asset-worker "20,000 entry manifest" test timed out at 5s, and the vite-plugin HMR events test timed out at 5s.

Every CI failure was a test timeout caused by CPU starvation: no correctness issues, no shared state conflicts.
Root Causes
1. Too many concurrent tasks. Unlimited concurrency on a 4-vCPU runner is too aggressive.
2. Several test configs have fragile default timeouts. Six package vitest configs don't extend `vitest.shared.ts` (which provides 50s timeout), so they silently use Vitest's default 5000ms. One fixture uses a hardcoded 15000ms. These were pre-existing bugs that just didn't manifest with serial execution.

Changes
Concurrency limits
- Fixtures: `--concurrency=4` (was unlimited → 4 concurrent turbo tasks)
- Packages: `--concurrency=3` (was unlimited → 3 concurrent turbo tasks)

Timeout fixes (6 vitest configs + 1 fixture)
Added `testTimeout: 50_000` to match the repo standard from `vitest.shared.ts`:
- `packages/workers-shared/asset-worker/vitest.config.mts`
- `packages/workers-shared/router-worker/vitest.config.mts`
- `packages/vite-plugin-cloudflare/vitest.config.ts`
- `packages/edge-preview-authenticated-proxy/vitest.config.mts`
- `packages/kv-asset-handler/vitest.config.mts`
- `packages/pages-shared/vitest.config.mts`
- `fixtures/start-worker-node-test/package.json` (15s → 50s)

Validation
Ran both test suites locally 3 times each with the new concurrency limits:
Expected CI Times