[wrangler] Attempts to reduce remote e2e test flakiness #12896

Merged
petebacondarwin merged 18 commits into main from pbd/fix-remote-binding-flake on Mar 20, 2026

@petebacondarwin petebacondarwin commented Mar 14, 2026

Several changes to reduce flakiness in the wrangler remote-binding and dev e2e tests:

Consistent waitFor/waitForLong helpers

  • Extract shared waitFor() and waitForLong() helpers that wrap vi.waitFor() with tuned defaults
    • waitFor(): 100ms interval, 5s timeout — for polling synchronous state (e.g. console output)
    • waitForLong(): 500ms interval, 10s timeout — for polling HTTP endpoints
  • Add ESLint rule to enforce using these helpers instead of bare vi.waitFor() in e2e tests
  • Migrate all e2e tests to use the shared helpers
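
The helpers described above wrap vi.waitFor(). As a rough illustration that runs outside Vitest, the sketch below uses a hypothetical `poll()` function as a stand-in for vi.waitFor(); it is not the code from this PR. Only the interval and timeout values come from the description, and the option names are assumptions.

```typescript
type PollOptions = { interval: number; timeout: number };

// Hypothetical stand-in for vi.waitFor(): re-run `callback` until it stops
// throwing, and rethrow its last error once the timeout would be exceeded.
async function poll<T>(
  callback: () => T | Promise<T>,
  { interval, timeout }: PollOptions
): Promise<T> {
  const deadline = Date.now() + timeout;
  for (;;) {
    try {
      return await callback();
    } catch (error) {
      if (Date.now() + interval > deadline) {
        throw error;
      }
      await new Promise((resolve) => setTimeout(resolve, interval));
    }
  }
}

// Defaults taken from the PR description.
const WAIT_FOR_DEFAULTS: PollOptions = { interval: 100, timeout: 5_000 };
const WAIT_FOR_LONG_DEFAULTS: PollOptions = { interval: 500, timeout: 10_000 };

// For polling synchronous state such as captured console output.
function waitFor<T>(
  callback: () => T | Promise<T>,
  options: Partial<PollOptions> = {}
): Promise<T> {
  return poll(callback, { ...WAIT_FOR_DEFAULTS, ...options });
}

// For polling HTTP endpoints, where each attempt is slower.
function waitForLong<T>(
  callback: () => T | Promise<T>,
  options: Partial<PollOptions> = {}
): Promise<T> {
  return poll(callback, { ...WAIT_FOR_LONG_DEFAULTS, ...options });
}
```

Keeping the defaults in one place means a test states only what it is waiting for, and a timeout change for slow CI happens in a single file.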

Retry transient API failures in remote preview

  • Wrap createPreviewSession() and createWorkerPreview() in retryOnAPIFailure so transient 5xx errors are retried automatically (up to 3 attempts with linear backoff)
  • Add an optional abortSignal parameter to retryOnAPIFailure so backoff delays can be cancelled immediately when a new bundle arrives
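
A minimal sketch of the retry behaviour described above, assuming a linear schedule in which the first retry is immediate and each later retry waits one more step. The names `retryOnTransientFailure`, `abortableDelay`, `isTransient` and the `status` property are hypothetical and do not reflect wrangler's actual retryOnAPIFailure signature. The classification follows a later commit note in this PR: a timed-out request (TimeoutError) is treated as transient, while a user-initiated abort (AbortError) is propagated.

```typescript
// Hypothetical check for an error object carrying a numeric HTTP status.
function hasStatus(error: unknown): error is { status: number } {
  return (
    typeof error === "object" &&
    error !== null &&
    typeof (error as { status?: unknown }).status === "number"
  );
}

// Transient: 5xx responses and timed-out requests (DOMException "TimeoutError").
// Not transient: everything else, including DOMException "AbortError".
function isTransient(error: unknown): boolean {
  if (error instanceof DOMException) {
    return error.name === "TimeoutError";
  }
  return hasStatus(error) && error.status >= 500 && error.status <= 599;
}

// Sleep for `ms`, but reject straight away if `signal` aborts first.
function abortableDelay(ms: number, signal?: AbortSignal): Promise<void> {
  return new Promise((resolve, reject) => {
    if (signal?.aborted) {
      reject(signal.reason);
      return;
    }
    const onAbort = () => {
      clearTimeout(timer);
      reject(signal?.reason);
    };
    const timer = setTimeout(() => {
      signal?.removeEventListener("abort", onAbort);
      resolve();
    }, ms);
    signal?.addEventListener("abort", onAbort, { once: true });
  });
}

async function retryOnTransientFailure<T>(
  action: () => Promise<T>,
  options: { maxAttempts?: number; stepMs?: number; signal?: AbortSignal } = {}
): Promise<T> {
  const { maxAttempts = 3, stepMs = 1000, signal } = options;
  for (let attempt = 1; ; attempt++) {
    try {
      return await action();
    } catch (error) {
      if (attempt >= maxAttempts || !isTransient(error)) {
        throw error;
      }
      // Linear backoff: 0ms before attempt 2, stepMs before attempt 3, ...
      await abortableDelay((attempt - 1) * stepMs, signal);
    }
  }
}
```

Passing the signal into the delay, rather than only into the request, is what lets a newly arrived bundle cancel a pending backoff instead of waiting it out.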

Increase e2e timeouts for remote preview and add per-request API timeout

  • startWorker: use waitForLong (10s) instead of waitFor (5s) for remote reload polling, matching the convention for HTTP endpoint polling
  • start-worker-remote-bindings: increase beforeAll deploy timeout from 35s to 60s for slow Windows CI
  • dev.test Workers+Assets: increase waitForReady/waitForReload from 15s to 30s since remote mode involves session creation + asset upload + bundle upload to edge-preview
  • create-worker-preview: add 30s per-request timeout via AbortSignal.any so a hung Cloudflare API response doesn't block the reload indefinitely
  • dev-remote-bindings multi-worker test: increase waitForReady from 15s to 30s since two serialised remote proxy sessions must complete before "Ready on" appears
  • dev-remote-bindings error-log tests: use waitForLong instead of waitFor since error messages depend on Cloudflare API validation round-trips
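
The per-request timeout mentioned above can be sketched with the standard AbortSignal.timeout() and AbortSignal.any() APIs. The function name `withRequestTimeout` is hypothetical and this is not the code in create-worker-preview; only the 30s value comes from the description.

```typescript
const REQUEST_TIMEOUT_MS = 30_000; // 30s, as stated in the PR description

// Combine an optional caller-supplied signal with a fresh timeout signal.
// The result aborts as soon as either one does, and carries that signal's
// reason (a "TimeoutError" DOMException when the timeout fires first).
function withRequestTimeout(
  signal?: AbortSignal,
  timeoutMs: number = REQUEST_TIMEOUT_MS
): AbortSignal {
  const timeout = AbortSignal.timeout(timeoutMs);
  return signal ? AbortSignal.any([signal, timeout]) : timeout;
}

// Usage sketch (hypothetical URL variable and caller signal):
//   const response = await fetch(apiUrl, { signal: withRequestTimeout(callerSignal) });
```

Creating a new timeout signal for every request matters here: a single shared timeout would start counting from session start rather than from each individual API call.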

Use shared preserve-e2e-* workers for remote-binding tests

  • Add ensureWorkerDeployed() helper to WranglerE2ETestHelper that checks whether a worker is already live (by fetching its workers.dev URL) and only deploys if it returns 404
  • Migrate dev-remote-bindings.test.ts, start-worker-remote-bindings.test.ts, and remote-bindings-api.test.ts to use fixed preserve-e2e-wrangler-remote-worker / preserve-e2e-wrangler-remote-worker-alt names instead of deploying fresh workers with random names on every run
  • These workers persist across test runs and are excluded from the periodic e2e cleanup job by their preserve-e2e- prefix
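
The check-then-deploy logic described above might look roughly like the following. This is a sketch under assumptions: the real ensureWorkerDeployed() lives on WranglerE2ETestHelper and its signature is not shown in this PR description, so the function name, the injected `fetchStatus` and `deploy` callbacks, and the return values are all hypothetical.

```typescript
type FetchStatus = (url: string) => Promise<number>;
type Deploy = (workerName: string) => Promise<void>;

// Deploy only when the worker's workers.dev URL answers 404.
// Any other status is taken to mean the worker is already live.
async function ensureDeployedSketch(
  workerName: string,
  workersDevDomain: string,
  fetchStatus: FetchStatus,
  deploy: Deploy
): Promise<"already-live" | "deployed"> {
  const url = `https://${workerName}.${workersDevDomain}`;
  if ((await fetchStatus(url)) !== 404) {
    return "already-live";
  }
  await deploy(workerName);
  return "deployed";
}

// Usage sketch with the global fetch (the deploy callback is hypothetical):
//   await ensureDeployedSketch(
//     "preserve-e2e-wrangler-remote-worker",
//     "devprod-testing7928.workers.dev",
//     async (url) => (await fetch(url)).status,
//     deployWithWrangler
//   );
```

Injecting the two callbacks keeps the decision logic testable without network access or a Cloudflare account.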

Fix EADDRINUSE port races (TOCTOU elimination)

  • stop() now awaits process exit — previously only waited for the kill signal to be sent, not for the process to actually terminate. Ports held by dying processes could cause EADDRINUSE in the next test.
  • Remove get-port TOCTOU pattern from e2e tests: spawnLocalWorker and dev-registry pages dev tests now rely on --port 0 (OS-assigned at bind time) instead of get-port (check-then-use race). Extended getWranglerCommand to also auto-add --port 0 --inspector-port 0 for wrangler pages dev commands.
  • Fix getPort() TOCTOU in production code: start-remote-proxy-session.ts used get-port to allocate a port for the nested remote proxy DevEnv. Replaced with port: 0 so the OS assigns the port atomically at bind time, eliminating EADDRINUSE during wrangler dev with remote bindings.
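
Both ideas in this list can be illustrated with plain Node.js APIs. The sketch below is not the wrangler code: `listenOnFreePort`, `stopAndWaitForExit` and `stopIdleChild` are hypothetical names, and the child process is just an idle Node script standing in for a wrangler process.

```typescript
import { spawn, type ChildProcess } from "node:child_process";
import { createServer, type AddressInfo } from "node:net";

// Bind to port 0 so the OS chooses a free port at bind time. There is no gap
// between "find a free port" and "use it" for another process to slip into.
function listenOnFreePort(): Promise<{ port: number; close: () => Promise<void> }> {
  return new Promise((resolve, reject) => {
    const server = createServer();
    server.once("error", reject);
    server.listen(0, "127.0.0.1", () => {
      const { port } = server.address() as AddressInfo;
      const close = () => new Promise<void>((done) => server.close(() => done()));
      resolve({ port, close });
    });
  });
}

// Send the kill signal, then wait for the "exit" event so that any port the
// child held has been released before the caller continues.
function stopAndWaitForExit(child: ChildProcess): Promise<void> {
  return new Promise((resolve) => {
    if (child.exitCode !== null || child.signalCode !== null) {
      resolve(); // already exited
      return;
    }
    child.once("exit", () => resolve());
    child.kill();
  });
}

// Usage sketch: start an idle Node child process, stop it, and report
// whether it has really exited by the time stopAndWaitForExit() resolves.
async function stopIdleChild(): Promise<boolean> {
  const child = spawn(process.execPath, ["-e", "setInterval(() => {}, 1000)"], {
    stdio: "ignore",
  });
  await stopAndWaitForExit(child);
  return child.exitCode !== null || child.signalCode !== null;
}
```

With a check-then-use helper, the port is free when checked but may be taken by the time the server binds. Asking for port 0 and reading the assigned port back from the listening server removes that window.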

Make workers.dev domain configurable for e2e tests

  • Add E2E_ACCOUNT_WORKERS_DEV_DOMAIN env var (defaults to devprod-testing7928.workers.dev) so anyone can run the e2e suite against their own Cloudflare account
  • Replace all hardcoded devprod-testing7928.workers.dev references across wrangler, vite-plugin-cloudflare, create-cloudflare, and get-platform-proxy-remote-bindings e2e tests
  • Add the env var to turbo.json so Turbo invalidates cache when it changes
  • Document the env var in packages/wrangler/e2e/README.md with instructions for running against a personal account
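
A small sketch of how a test helper could resolve the domain, assuming the variable is read from the environment with the documented default. The function names are hypothetical, and treating an empty string as unset is an assumption rather than something the PR states.

```typescript
const DEFAULT_WORKERS_DEV_DOMAIN = "devprod-testing7928.workers.dev";

// Resolve the workers.dev domain, falling back to the documented default
// when the variable is unset (or empty, which is an assumption here).
function resolveWorkersDevDomain(env: Record<string, string | undefined>): string {
  return env.E2E_ACCOUNT_WORKERS_DEV_DOMAIN || DEFAULT_WORKERS_DEV_DOMAIN;
}

// Build the public URL of a deployed worker on that domain.
function workerUrl(workerName: string, env: Record<string, string | undefined>): string {
  return `https://${workerName}.${resolveWorkersDevDomain(env)}`;
}

// Usage sketch: workerUrl("preserve-e2e-wrangler-remote-worker", process.env)
```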

Follow-up items

  • pages-dev.test.ts still uses module-level getPort() shared across sequential tests — lower risk (single allocation at module load), but could be migrated to --port 0 in a follow-up
  • startWorker.test.ts keeps getPort() for its port-switch test — the TOCTOU window is small (in-process programmatic API) and the test genuinely needs distinct known ports to verify port changes

  • Tests
    • Tests included/updated
    • Automated tests not possible - manual testing has been completed as follows:
    • Additional testing not necessary because:
  • Public documentation
    • Cloudflare docs PR(s):
    • Documentation not necessary because: test infrastructure and internal retry logic only

changeset-bot bot commented Mar 14, 2026

🦋 Changeset detected

Latest commit: 91deef0

The changes in this PR will be included in the next version bump.

workers-devprod commented Mar 14, 2026

Codeowners approval required for this PR:

  • ✅ @cloudflare/wrangler

devin-ai-integration[bot]

This comment was marked as resolved.

pkg-pr-new bot commented Mar 14, 2026

create-cloudflare

npm i https://pkg.pr.new/create-cloudflare@12896

@cloudflare/kv-asset-handler

npm i https://pkg.pr.new/@cloudflare/kv-asset-handler@12896

miniflare

npm i https://pkg.pr.new/miniflare@12896

@cloudflare/pages-shared

npm i https://pkg.pr.new/@cloudflare/pages-shared@12896

@cloudflare/unenv-preset

npm i https://pkg.pr.new/@cloudflare/unenv-preset@12896

@cloudflare/vite-plugin

npm i https://pkg.pr.new/@cloudflare/vite-plugin@12896

@cloudflare/vitest-pool-workers

npm i https://pkg.pr.new/@cloudflare/vitest-pool-workers@12896

@cloudflare/workers-editor-shared

npm i https://pkg.pr.new/@cloudflare/workers-editor-shared@12896

wrangler

npm i https://pkg.pr.new/wrangler@12896

commit: 91deef0

@petebacondarwin petebacondarwin force-pushed the pbd/fix-remote-binding-flake branch from bef408f to 3de2809 Compare March 14, 2026 11:12
@petebacondarwin petebacondarwin changed the title from "test: use vi.waitFor/vi.waitUntil to wait for logger output in e2e tests" to "consistent use of vi.waitFor/vi.waitUntil to wait for things in e2e tests" Mar 14, 2026
@petebacondarwin petebacondarwin force-pushed the pbd/fix-remote-binding-flake branch from 1e6a215 to ea1dfd6 Compare March 16, 2026 10:20
@petebacondarwin petebacondarwin force-pushed the pbd/fix-remote-binding-flake branch from ea1dfd6 to 0471f8c Compare March 16, 2026 10:37
@petebacondarwin petebacondarwin changed the title from "consistent use of vi.waitFor/vi.waitUntil to wait for things in e2e tests" to "[wrangler] use consistent waitFor/waitForLong helpers in e2e tests" Mar 16, 2026
@petebacondarwin petebacondarwin force-pushed the pbd/fix-remote-binding-flake branch from 0471f8c to e1a9b31 Compare March 16, 2026 13:00
@petebacondarwin petebacondarwin changed the title from "[wrangler] use consistent waitFor/waitForLong helpers in e2e tests" to "[wrangler] Attempts to reduce remote e2e test flakiness" Mar 17, 2026
@petebacondarwin petebacondarwin requested review from a team as code owners March 17, 2026 14:06
github-actions bot commented Mar 17, 2026

✅ All changesets look good

@petebacondarwin petebacondarwin force-pushed the pbd/fix-remote-binding-flake branch 2 times, most recently from a6c37fc to 961354c Compare March 18, 2026 13:00
@petebacondarwin petebacondarwin force-pushed the pbd/fix-remote-binding-flake branch from dfb8330 to 215ecda Compare March 18, 2026 16:33
@github-project-automation github-project-automation bot moved this from Untriaged to Approved in workers-sdk Mar 19, 2026
- Extract shared waitFor() and waitForLong() helpers wrapping vi.waitFor() with tuned defaults
- waitFor(): 100ms interval, 5s timeout, for polling synchronous state (e.g. console output)
- waitForLong(): 500ms interval, 10s timeout, for polling HTTP endpoints
- Add ESLint rule to enforce using these helpers instead of bare vi.waitFor() in e2e tests
- Migrate all e2e tests to use the shared helpers

Wrap createPreviewSession() and createWorkerPreview() in
retryOnAPIFailure so transient 5xx errors are retried automatically
(up to 3 attempts with linear backoff). Also add an optional
abortSignal parameter to retryOnAPIFailure so backoff delays can be
cancelled immediately when a new bundle arrives.

- startWorker: use waitForLong (10s) instead of waitFor (5s) for remote
  reload polling, matching the convention for HTTP endpoint polling
- start-worker-remote-bindings: increase beforeAll deploy timeout from
  35s to 60s for slow Windows CI
- dev.test Workers+Assets: increase waitForReady/waitForReload from 15s
  to 30s since remote mode involves session creation + asset upload +
  bundle upload to edge-preview
- create-worker-preview: add 30s per-request timeout via AbortSignal.any
  so a hung Cloudflare API response doesn't block the reload indefinitely

Reuse pre-deployed workers instead of deploying fresh ones on every test
run. The new ensureWorkerDeployed() helper checks whether a worker is
already live before deploying, and the preserve-e2e- prefix keeps the
workers excluded from periodic cleanup.

Thread the abort signal through to the fetchResult call inside
getWorkersDevSubdomain so it gets the same withTimeout protection
as the other API calls in createPreviewSession.

The backoff computation backoff + (MAX_ATTEMPTS - attempts) * 1000
was off-by-one: the computed delay was passed to the next recursive
call but that call would throw before ever sleeping on it. Replace
with a simple backoff + 1000 so the first retry is immediate (0ms)
and the second waits 1000ms, matching the documented intent.

A timed-out API request (DOMException with name TimeoutError) is a
transient failure and should be retried, just like 5xx errors. User-
initiated aborts (AbortError) are still propagated immediately.

- stop() now awaits process exit (not just signal send) so ports are
  fully released before the next test starts
- Remove get-port TOCTOU pattern from spawnLocalWorker and dev-registry
  pages dev tests; rely on --port 0 (OS-assigned) instead
- Extend getWranglerCommand auto --port 0 to wrangler pages dev commands
- Increase multi-worker test waitForReady to 30s since two serialised
  remote proxy sessions must complete before "Ready on" appears

The 'shows helpful error logs' tests wait for error messages that only
appear after the remote proxy session validates bindings against the
Cloudflare API. waitFor (5s) is too short; waitForLong (10s) matches
the convention for anything involving HTTP round-trips.

…API-dependent error tests

- Replace getPort() with port: 0 in start-remote-proxy-session.ts so
  the OS assigns the port atomically at bind time, eliminating the
  TOCTOU window that could cause EADDRINUSE during wrangler dev
- Switch error-log assertion tests from waitFor (5s) to waitForLong
  (10s) since the error messages depend on Cloudflare API validation

This is needed because we (already) use the abort signal API.

This is needed because we (already) use the abort signal API.
@petebacondarwin petebacondarwin force-pushed the pbd/fix-remote-binding-flake branch from b716651 to f38b783 Compare March 19, 2026 15:59
The ensureWorkerDeployed helper and the remote dev URL assertion now
use E2E_ACCOUNT_WORKERS_DEV_DOMAIN (defaults to devprod-testing7928.workers.dev).
Set this env var alongside CLOUDFLARE_ACCOUNT_ID and CLOUDFLARE_API_TOKEN
to run the e2e tests against a different account.

The buildAndPreview flow does a full build before starting the preview
server, so it needs more headroom than 20s, especially in CI.
@petebacondarwin petebacondarwin merged commit 451dae3 into main Mar 20, 2026
38 of 39 checks passed
@petebacondarwin petebacondarwin deleted the pbd/fix-remote-binding-flake branch March 20, 2026 07:23
@github-project-automation github-project-automation bot moved this from Approved to Done in workers-sdk Mar 20, 2026