Skip to content

[ci] Cap parallel test concurrency and fix fragile test timeouts#13604

Merged
petebacondarwin merged 5 commits into
mainfrom
pbacondarwin/fix-parallel-ci-timeouts
Apr 20, 2026
Merged

[ci] Cap parallel test concurrency and fix fragile test timeouts#13604
petebacondarwin merged 5 commits into
mainfrom
pbacondarwin/fix-parallel-ci-timeouts

Conversation

@petebacondarwin

@petebacondarwin petebacondarwin commented Apr 20, 2026

Copy link
Copy Markdown
Contributor

Follow-up fix for #13596 and #13601, which are causing CI failures.

Problem

Removing --concurrency=1 entirely (unlimited parallelism) caused CPU starvation on GitHub Actions runners (4 vCPUs, 16GB RAM):

  • Fixtures: Turbo launched 23+ fixture tests simultaneously, each spawning workerd processes. start-worker-node (which normally takes 1.5s) timed out at 15s on all 3 OS platforms.
  • Packages: wrangler (forks pool, 233 files), miniflare (workerd), vitest-pool-workers (Verdaccio + workerd), and workers-shared (workerd pool) all ran simultaneously. The asset-worker "20,000 entry manifest" test timed out at 5s, and the vite-plugin HMR events test timed out at 5s.

Every CI failure was a test timeout caused by CPU starvation — no correctness issues, no shared state conflicts.

Root Causes

1. Too many concurrent tasks. Unlimited concurrency on a 4-vCPU runner is too aggressive.

2. Several test configs have fragile default timeouts. Six package vitest configs don't extend vitest.shared.ts (which provides 50s timeout), so they silently use Vitest's default 5000ms. One fixture uses a hardcoded 15000ms. These were pre-existing bugs that just didn't manifest with serial execution.

Changes

Concurrency limits

  • Fixtures: --concurrency=4 (was unlimited → 4 concurrent turbo tasks)
  • Packages: --concurrency=3 (was unlimited → 3 concurrent turbo tasks)

Timeout fixes (6 vitest configs + 1 fixture)

Added testTimeout: 50_000 to match the repo standard from vitest.shared.ts:

  • packages/workers-shared/asset-worker/vitest.config.mts
  • packages/workers-shared/router-worker/vitest.config.mts
  • packages/vite-plugin-cloudflare/vitest.config.ts
  • packages/edge-preview-authenticated-proxy/vitest.config.mts
  • packages/kv-asset-handler/vitest.config.mts
  • packages/pages-shared/vitest.config.mts
  • fixtures/start-worker-node-test/package.json (15s → 50s)

Validation

Ran both test suites locally 3 times each with the new concurrency limits:

Suite Run 1 Run 2 Run 3
Fixtures (--concurrency=4) 76/76, 1m50s 76/76, 1m55s 76/76, 1m42s
Packages (--concurrency=3) 34/34, 2m08s 34/34, 3m29s 34/34, 2m06s

Expected CI Times

Job Before (serial) After (capped parallel)
Linux fixtures ~10 min ~3-4 min
Linux packages ~9.5 min ~5-6 min
Windows packages ~20 min ~10-12 min

  • Tests
    • Tests included/updated
    • Automated tests not possible - manual testing has been completed as follows:
    • Additional testing not necessary because:
  • Public documentation
    • Cloudflare docs PR(s):
    • Documentation not necessary because: CI/test config changes, no user-facing impact

Open in Devin Review

Follow-up to #13596 and #13601 which removed --concurrency=1 entirely.
Unlimited concurrency caused CPU starvation on 4-vCPU CI runners when
20+ workerd-spawning fixtures or 3+ heavyweight package suites ran
simultaneously, leading to test timeouts.

Changes:
- Cap fixture concurrency at 4 (was unlimited, previously 1)
- Cap package concurrency at 3 (was unlimited, previously 1)
- Add testTimeout: 50_000 to 6 vitest configs that were using Vitest's
  default 5000ms instead of the repo standard 50s from vitest.shared.ts:
  workers-shared/asset-worker, workers-shared/router-worker,
  vite-plugin-cloudflare, edge-preview-authenticated-proxy,
  kv-asset-handler, pages-shared
- Increase start-worker-node-test timeout from 15s to 50s

The timeout fixes address pre-existing fragility - these configs never
extended vitest.shared.ts and relied on Vitest's 5s default, which is
insufficient under any CPU load.
@petebacondarwin petebacondarwin added the ci:no-changeset-required Skip pull request checks for a valid changeset label Apr 20, 2026
@changeset-bot

changeset-bot Bot commented Apr 20, 2026

Copy link
Copy Markdown

🦋 Changeset detected

Latest commit: 1b58095

The changes in this PR will be included in the next version bump.

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@github-project-automation github-project-automation Bot moved this to Untriaged in workers-sdk Apr 20, 2026
@workers-devprod workers-devprod requested review from a team and dario-piotrowicz and removed request for a team April 20, 2026 07:51
@workers-devprod

workers-devprod commented Apr 20, 2026

Copy link
Copy Markdown
Contributor

Codeowners approval required for this PR:

  • @cloudflare/deploy-config
  • ✅ @cloudflare/wrangler
Show detailed file reviewers
  • packages/workers-shared/asset-worker/vitest.config.mts: [@cloudflare/deploy-config]
  • packages/workers-shared/router-worker/vitest.config.mts: [@cloudflare/deploy-config]

@ask-bonk

ask-bonk Bot commented Apr 20, 2026

Copy link
Copy Markdown
Contributor

UnknownError: ProviderInitError

github run

@ask-bonk

ask-bonk Bot commented Apr 20, 2026

Copy link
Copy Markdown
Contributor

@petebacondarwin Bonk workflow failed. Check the logs for details.

View workflow run · To retry, trigger Bonk again.

@devin-ai-integration devin-ai-integration Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 2 additional findings.

Open in Devin Review

@pkg-pr-new

pkg-pr-new Bot commented Apr 20, 2026

Copy link
Copy Markdown
create-cloudflare

npm i https://pkg.pr.new/create-cloudflare@13604

@cloudflare/kv-asset-handler

npm i https://pkg.pr.new/@cloudflare/kv-asset-handler@13604

miniflare

npm i https://pkg.pr.new/miniflare@13604

@cloudflare/pages-shared

npm i https://pkg.pr.new/@cloudflare/pages-shared@13604

@cloudflare/unenv-preset

npm i https://pkg.pr.new/@cloudflare/unenv-preset@13604

@cloudflare/vite-plugin

npm i https://pkg.pr.new/@cloudflare/vite-plugin@13604

@cloudflare/vitest-pool-workers

npm i https://pkg.pr.new/@cloudflare/vitest-pool-workers@13604

@cloudflare/workers-editor-shared

npm i https://pkg.pr.new/@cloudflare/workers-editor-shared@13604

wrangler

npm i https://pkg.pr.new/wrangler@13604

commit: 1b58095

Windows CI runners are significantly slower under parallel load.
With --concurrency=4, start-worker-node timed out at 50s on Windows
despite normally taking ~1.5s. Lowering to 2 avoids CPU starvation
on the slowest platform while still providing ~2x speedup over serial.
petebacondarwin and others added 2 commits April 20, 2026 09:33
The start-worker-node fixture's worker.dispose() cleanup hangs under
any parallel load on Windows, causing the file-level node:test timeout
to fire (50s) even though all 5 individual tests pass in <1s. This is
a Windows-specific issue with workerd process cleanup under CPU pressure.

Linux and macOS provide sufficient coverage for the startWorker API.
…13603)

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Wrangler automated PR updater <wrangler@cloudflare.com>

@dario-piotrowicz dario-piotrowicz left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Codeowners Bypass

@workers-devprod workers-devprod left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Codeowners reviews satisfied

@github-project-automation github-project-automation Bot moved this from Untriaged to Approved in workers-sdk Apr 20, 2026
Revert the Windows-only skip — the test also fails on Linux under
parallel load. The issue is that worker.dispose() cleanup hangs when
any other fixture runs concurrently, causing node:test's file-level
timeout to fire even though all 5 individual tests complete in <1s.

Run it as a separate step after the parallel fixtures complete, so it
gets the full CPU without contention.
@petebacondarwin petebacondarwin merged commit d8314c6 into main Apr 20, 2026
59 of 61 checks passed
@petebacondarwin petebacondarwin deleted the pbacondarwin/fix-parallel-ci-timeouts branch April 20, 2026 09:38
@github-project-automation github-project-automation Bot moved this from Approved to Done in workers-sdk Apr 20, 2026
petebacondarwin added a commit that referenced this pull request Apr 24, 2026
… fixed

The previous PR #13604 worked around an intermittent flake in the
`start-worker-node-test` fixture by:

1. Running the fixture as its own CI step, serialised from all other
   fixtures (because parallel load made the cleanup hang more likely).
2. Bumping the `node --test` file-level timeout from 15s to 50s so
   that node:test would sometimes let the subprocess finish before
   cancelling it.

Neither addressed the underlying cause. The preceding commit fixes the
actual esbuild resource leak in `unstable_startWorker` teardown, so
the workaround can go:

- Re-include `./fixtures/start-worker-node-test` in the main
  fixtures test run (restores parallelism).
- Delete the separate `Run tests (start-worker-node)` step.
- Drop the node:test timeout back to 15s.
petebacondarwin added a commit that referenced this pull request Apr 24, 2026
… fixed

The previous PR #13604 worked around an intermittent flake in the
`start-worker-node-test` fixture by:

1. Running the fixture as its own CI step, serialised from all other
   fixtures (because parallel load made the cleanup hang more likely).
2. Bumping the `node --test` file-level timeout from 15s to 50s so
   that node:test would sometimes let the subprocess finish before
   cancelling it.

Neither addressed the underlying cause. The preceding commit fixes the
actual esbuild resource leak in `unstable_startWorker` teardown, so
the workaround can go:

- Re-include `./fixtures/start-worker-node-test` in the main
  fixtures test run (restores parallelism).
- Delete the separate `Run tests (start-worker-node)` step.
- Drop the node:test timeout back to 15s.
petebacondarwin added a commit that referenced this pull request Apr 24, 2026
… fixed

The previous PR #13604 worked around an intermittent flake in the
`start-worker-node-test` fixture by:

1. Running the fixture as its own CI step, serialised from all other
   fixtures (because parallel load made the cleanup hang more likely).
2. Bumping the `node --test` file-level timeout from 15s to 50s so
   that node:test would sometimes let the subprocess finish before
   cancelling it.

Neither addressed the underlying cause. The preceding commit fixes the
actual esbuild resource leak in `unstable_startWorker` teardown, so
the workaround can go:

- Re-include `./fixtures/start-worker-node-test` in the main
  fixtures test run (restores parallelism).
- Delete the separate `Run tests (start-worker-node)` step.
- Drop the node:test timeout back to 15s.
petebacondarwin added a commit that referenced this pull request Apr 24, 2026
… fixed

The previous PR #13604 worked around an intermittent flake in the
`start-worker-node-test` fixture by:

1. Running the fixture as its own CI step, serialised from all other
   fixtures (because parallel load made the cleanup hang more likely).
2. Bumping the `node --test` file-level timeout from 15s to 50s so
   that node:test would sometimes let the subprocess finish before
   cancelling it.

Neither addressed the underlying cause. The preceding commit fixes the
actual esbuild resource leak in `unstable_startWorker` teardown, so
the workaround can go:

- Re-include `./fixtures/start-worker-node-test` in the main
  fixtures test run (restores parallelism).
- Delete the separate `Run tests (start-worker-node)` step.
- Drop the node:test timeout back to 15s.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci:no-changeset-required Skip pull request checks for a valid changeset

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

3 participants