Websocket broker #360

Merged: breardon2011 merged 5 commits into main from websocket-broker on Jun 10, 2026

@breardon2011 breardon2011 commented Jun 8, 2026

Contributor

Summary

Adds a per-cell WebSocket broker — internal/wsgateway — that sits between
clients (SDK, dashboard) and workers for all PTY, exec, and agent
WebSocket traffic. Replaces the legacy SandboxAPIProxy.doWebSocket
hijack-and-io.Copy path with multi-session, redial on upstream close,
keepalive, exec-exit suppression, migration-aware backoff, and a flap
circuit breaker.

This is the cell-only design we chose over a Cloudflare Durable Object:
the broker layer structurally belongs where the worker is (the cell), not
at the edge. Same broker code now handles hosted and self-hosted cells
with no divergence.

Architecture

Client ──WS── CF Worker ──WS── Cell-CP broker ──WS── Worker ──gRPC── Agent (in-VM)
                  │                    │                  │
              auth, cell           redial,            handlers.go,
              lookup,              keepalive,         RebindFromAgent,
              cap-token            exec-exit,         (unchanged)
              mint,                multi-session,
              transparent          flap breaker,
              forward              migration-aware
                                   backoff

CF Worker becomes a thin auth + cell-routing layer that transparently
forwards the WS upgrade. Both proxyToCellSDK (SDK path) and proxyWebSocket
(dashboard path) now forward — the dashboard's previous 120-LOC edge-side
WebSocketPair bridge is gone, which was the only thing keeping the
dashboard from getting broker benefits.

What the broker does

Per-sandbox actor goroutine (created on first session, released when
idle) holding:

  • Set of sessions — multiple concurrent WSes per sandbox (dashboard +
    SDK, two dashboard tabs) each get their own Session instance.
  • Per-session redial loop with exponential backoff (250 ms → 4 s ×
    10). On upstream close, re-resolves the worker URL and re-dials. Client
    WS stays open across the redial window.
  • Migration-aware cadence — when the cell-CP returns 503 migrating
    during resolve, switch to a steadier 2 s × 30 (~60 s budget) instead of
    burning the fast ladder.
  • Terminal-aware close — when the cell-CP returns 404 not found or
    410 stopped, close client with 1000 / "sandbox stopped" (or
    equivalent) instead of cycling redials.
  • Exec-exit marker tracking: the worker emits a 5-byte 0x03 + exitCode
    frame before closing a completed exec session; the broker recognizes it
    and closes the client with 1000 / "exec completed" instead of redialing
    into a now-done session.
  • 30 s keepalive — empty binary frame on both client and upstream every
    alarm tick. Defends the cell ↔ worker hop against any long-idle
    middlebox drops; empty frames are no-ops on every receiver.
  • Flap circuit breaker — > 3 redial cycles within 60 s closes the
    client with 1011 / "upstream flapping". Prevents a buggy upstream from
    burning resources indefinitely.
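
The redial cadence and the flap breaker described above can be sketched in a few lines of Go. This is an illustrative sketch, not the code in internal/wsgateway: the names redialSchedule, flapBreaker, and simulate are hypothetical, the doubling factor of the fast ladder is an assumption (the description only gives the 250 ms start, the 4 s cap, and the 10 attempts), and the sliding-window bookkeeping is one plausible way to count "> 3 redial cycles within 60 s".

```go
package main

import (
	"fmt"
	"time"
)

// redialSchedule returns the wait before each redial attempt.
// Fast ladder: 250 ms, doubling (assumed factor) up to a 4 s cap, 10 attempts.
// Migrating cadence: a steady 2 s for 30 attempts (~60 s budget).
func redialSchedule(migrating bool) []time.Duration {
	if migrating {
		waits := make([]time.Duration, 30)
		for i := range waits {
			waits[i] = 2 * time.Second
		}
		return waits
	}
	waits := make([]time.Duration, 10)
	wait := 250 * time.Millisecond
	for i := range waits {
		waits[i] = wait
		wait *= 2
		if wait > 4*time.Second {
			wait = 4 * time.Second
		}
	}
	return waits
}

// flapBreaker trips when more than limit redial cycles land inside window.
type flapBreaker struct {
	limit  int
	window time.Duration
	cycles []time.Time // timestamps of recent redial cycles
}

// record notes a redial cycle at now and returns an error once the breaker trips.
func (b *flapBreaker) record(now time.Time) error {
	cutoff := now.Add(-b.window)
	kept := b.cycles[:0]
	for _, t := range b.cycles {
		if t.After(cutoff) { // drop cycles that fell out of the window
			kept = append(kept, t)
		}
	}
	b.cycles = append(kept, now)
	if len(b.cycles) > b.limit {
		return fmt.Errorf("upstream flapping: %d redial cycles within %s", len(b.cycles), b.window)
	}
	return nil
}

// simulate feeds evenly spaced cycles into a fresh breaker (limit 3, window 60 s)
// and returns the first trip error, or nil if the breaker never trips.
func simulate(gapSeconds, cycles int) error {
	b := &flapBreaker{limit: 3, window: 60 * time.Second}
	start := time.Unix(0, 0)
	for i := 0; i < cycles; i++ {
		now := start.Add(time.Duration(i*gapSeconds) * time.Second)
		if err := b.record(now); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	fmt.Println("fast ladder:", redialSchedule(false))
	fmt.Println("4 cycles, 10 s apart:", simulate(10, 4))
	fmt.Println("10 cycles, 30 s apart:", simulate(30, 10))
}
```

Passing the timestamp into record, rather than calling time.Now inside it, keeps the breaker deterministic under test. On a trip, the caller would close the client with 1011 / "upstream flapping" as described above.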

Worker-side foundation (cherry-picked from PR #350)

The Go broker depends on worker-side changes that make session state
survive worker swaps. These are cherry-picked into this branch:

  • PTYManager.RebindFromAgent + ExecSessionManager.RebindFromAgent:
    lazy reattach to the in-VM agent when the worker's local cache misses
    (live migration or worker process restart).
  • PTYManager.ReleaseForSandbox + ExecSessionManager.ReleaseForSandbox:
    source-side handle release on outgoing migration, without killing the
    in-VM session.
  • SetMigrationOutgoingCallback on the QEMU manager + matching wiring in
    cmd/worker/main.go.
  • Exec WS handler skips the fake 0x03 exit marker when ExitCode is
    nil (i.e. cancel-during-migration vs real process exit).

No agent changes, no proto changes. Lazy rebind uses the existing
PTYAttach / ExecSessionAttach RPCs.
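
To make the exit-marker handling concrete, here is a minimal sketch of how a broker might recognize the 5-byte 0x03 + exitCode frame. The function name parseExitMarker is hypothetical, and the byte order and signedness of the exit code are assumptions (big-endian int32); the PR description does not specify the encoding.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// exitMarker is the first byte of the worker's exec-exit frame.
const exitMarker = 0x03

// parseExitMarker reports whether frame is a 5-byte exit marker and, if so,
// returns the exit code. Assumption: the 4 bytes after 0x03 hold a
// big-endian signed 32-bit integer.
func parseExitMarker(frame []byte) (code int32, ok bool) {
	if len(frame) != 5 || frame[0] != exitMarker {
		return 0, false
	}
	return int32(binary.BigEndian.Uint32(frame[1:])), true
}

func main() {
	frames := [][]byte{
		{0x03, 0, 0, 0, 0}, // marker, exit code 0
		{0x03, 0, 0, 0, 2}, // marker, exit code 2
		[]byte("hello"),    // ordinary 5-byte output frame
		{},                 // empty keepalive frame
	}
	for _, frame := range frames {
		if code, ok := parseExitMarker(frame); ok {
			fmt.Printf("exit marker, code %d: close client with 1000 \"exec completed\"\n", code)
		} else {
			fmt.Println("ordinary frame: forward to client")
		}
	}
}
```

A check on length and first byte alone cannot tell a marker from a 5-byte output chunk that happens to start with 0x03; how the real broker resolves that case is not covered by the description.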

What this means for users

  • Worker restarts become invisible if the VM survives them (same
    session ID rebinds against the in-VM agent on whichever worker now
    hosts the VM).
  • Live migrations are transparent end-to-end: a client typing through the
    migration window sees a pause for the redial duration (sub-second to a
    few seconds), then input resumes. Same shell, same bash PID, same /tmp
    contents.
  • Multiple dashboards or dashboard + SDK on the same sandbox stop
    clobbering each other.
  • Long-idle PTYs don't get dropped by middlebox idle timeouts.
  • Exec completion closes cleanly with the right exit code instead of
    spinning into a redial loop.
  • DELETE /api/sandboxes/{id} while a client is connected closes the
    WS cleanly with code=1000 reason='sandbox stopped'.
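
The clean close on DELETE follows from how the broker reacts to the resolve result on each redial. The sketch below shows one way that decision could be written down; the type and function names (action, decision, decide) are hypothetical, treating every 503 as "migrating" is a simplification, and falling back to the fast ladder for any other status is an assumption.

```go
package main

import (
	"fmt"
	"net/http"
)

// action is what the redial loop does next after a resolve attempt.
type action int

const (
	dialUpstream   action = iota // resolve succeeded: dial the worker
	retryFast                    // transient failure: fast backoff ladder
	retryMigrating               // sandbox is migrating: steady 2 s cadence
	closeClient                  // sandbox is gone: stop redialing
)

func (a action) String() string {
	switch a {
	case dialUpstream:
		return "dial upstream"
	case retryFast:
		return "retry (fast ladder)"
	case retryMigrating:
		return "retry (migrating cadence)"
	case closeClient:
		return "close client"
	default:
		return fmt.Sprintf("action(%d)", int(a))
	}
}

// decision carries the next action plus the WebSocket close code and reason
// when the action is closeClient.
type decision struct {
	act       action
	closeCode int
	reason    string
}

// decide maps the cell-CP resolve status onto the next step.
func decide(status int) decision {
	switch status {
	case http.StatusOK:
		return decision{act: dialUpstream}
	case http.StatusNotFound, http.StatusGone:
		return decision{act: closeClient, closeCode: 1000, reason: "sandbox stopped"}
	case http.StatusServiceUnavailable:
		return decision{act: retryMigrating}
	default:
		return decision{act: retryFast} // assumption: anything else is transient
	}
}

func main() {
	for _, status := range []int{200, 404, 410, 503, 502} {
		d := decide(status)
		fmt.Printf("%d -> %s %d %q\n", status, d.act, d.closeCode, d.reason)
	}
}
```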

Tested

In-repo integration tests in scripts/integration-tests/05-08*.py. All
four pass end-to-end against dev:

#   Test                            Verifies
05  ws-multi-client.py              Two concurrent PTYs on the same sandbox stay independent
06  ws-exec-exit-clean.py           Exec exit closes WS once (one scrollback_end, one exit marker, 1000 'exec completed'); pre-fix logged 4000+ scrollback ends in a redial loop
07  ws-pty-survives-migration.py    PTY survives POST /migrate with WS held open; file written pre-mig readable post-mig
08  ws-exec-survives-migration.py   Exec output stream continues through migration; final close is the real exec completed

Plus manual scenarios verified:

  • 90 s idle PTY held cleanly with broker keepalive frames at t=30 s and
    t=60 s
  • DELETE mid-session closes with 1000 'sandbox stopped' (verified
    directly via /tmp/verify_terminal_close.py)
  • Long-running exec (90 s) with no spurious redial / scrollback re-send
    loop
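
The keepalive behaviour exercised in the idle-PTY scenario can be sketched as a loop that writes an empty binary frame to both sides on every tick. This is a sketch under stated assumptions: frameWriter is a hypothetical stand-in for whatever WebSocket connection type the broker uses, and the tick channel is injected so the example runs instantly; a real loop would pass the channel of a 30 s time.Ticker.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// frameWriter is a hypothetical minimal view of a WebSocket connection.
type frameWriter interface {
	WriteBinary(payload []byte) error
}

// keepalive writes an empty binary frame to every conn on each tick.
// It returns when ctx is cancelled or a write fails.
func keepalive(ctx context.Context, ticks <-chan time.Time, conns ...frameWriter) error {
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticks:
			for _, c := range conns {
				if err := c.WriteBinary(nil); err != nil {
					return err
				}
			}
		}
	}
}

// countingConn records how many frames it was asked to write.
type countingConn struct{ frames int }

func (c *countingConn) WriteBinary(payload []byte) error {
	c.frames++
	return nil
}

// demoKeepalive drives two ticks through keepalive and returns the frame
// counts seen by the client side and the upstream side.
func demoKeepalive() (clientFrames, upstreamFrames int) {
	ctx, cancel := context.WithCancel(context.Background())
	ticks := make(chan time.Time) // unbuffered: each send is a handoff
	client, upstream := &countingConn{}, &countingConn{}
	done := make(chan error, 1)
	go func() { done <- keepalive(ctx, ticks, client, upstream) }()
	ticks <- time.Now()
	ticks <- time.Now()
	cancel()
	<-done // keepalive has returned, so the counters are safe to read
	return client.frames, upstream.frames
}

func main() {
	c, u := demoKeepalive()
	fmt.Println("client frames:", c, "upstream frames:", u)
}
```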

Files

internal/wsgateway/         880 LOC  — the broker (4 files, new)
internal/api/ws_broker.go   159 LOC  — route handler
internal/proxy/sandbox_api_proxy.go +85  — public ResolveWorker
internal/api/router.go      +27 -2   — wire wsGateway, dynamic dispatch
cmd/server/main.go          +9       — always initialize the broker

internal/sandbox/{pty,exec_session}.go  +143 -8  — worker rebind + release + cancel
internal/qemu/manager.go              +14 -1     — migration callback
internal/qemu/migration.go            +8         — fire callback
internal/worker/handlers.go           +30 -9     — cache-miss → rebind, exit-marker fix
cmd/worker/main.go                    +137 -5    — factories + plumbing

cloudflare-workers/api-edge/src/dashboard.ts  +22 -124  — drop edge-side bridge

scripts/integration-tests/05-08*.py + _ws_common.py + README  +435 LOC  — tests

Total: +1957 / -159.

Out of scope

  • Cross-cell migration — today's /migrate is within-cell. Code path
    exists (broker re-resolves cell on every redial) but no end-to-end test.
  • Worker process restart end-to-end test — architecturally identical
    to migration (empty local map → lazy rebind picks up).
  • Deleting SandboxAPIProxy.doWebSocket — the legacy hijack path is
    unreachable now but still in the tree (~100 LOC). Can be removed in a
    follow-up after some prod burn-in.

Rollback

git revert + redeploy. There's no env-var gate — the broker is the
canonical WS path. Same operational story as for any other commit.

Brian Reardon added 5 commits June 4, 2026 17:21
@breardon2011 breardon2011 marked this pull request as ready for review June 8, 2026 22:18
@breardon2011 breardon2011 merged commit d3b3692 into main Jun 10, 2026
4 checks passed