Websocket edge #350

Closed
breardon2011 wants to merge 6 commits into main from websocket-edge

breardon2011 commented Jun 3, 2026

Contributor

Summary

Adds a per-sandbox Durable Object (SandboxWsGateway) that brokers all SDK
and dashboard WebSocket traffic — exec, PTY, and agent — between the client
and the sandbox's owning cell. The whole DO path is gated on WS_VIA_DO=1;
when unset, the legacy transparent CF-Tunnel forward stays in place, so the
rollback is a one-line var change.
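
A minimal sketch of the gate, assuming the edge worker reads WS_VIA_DO as a plain string var. EdgeEnv, wsViaGateway, and pickWsRoute are hypothetical names for illustration; only WS_VIA_DO itself comes from this PR.

```typescript
// Hypothetical env shape: only WS_VIA_DO is named in this PR.
interface EdgeEnv {
  WS_VIA_DO?: string;
}

type WsRoute = "gateway" | "legacy-forward";

// True only for the exact string "1". Unset, "0", "true", or anything else
// keeps the legacy transparent forward.
function wsViaGateway(env: EdgeEnv): boolean {
  return env.WS_VIA_DO === "1";
}

// Decide which path a WS upgrade takes for this deployment.
function pickWsRoute(env: EdgeEnv): WsRoute {
  return wsViaGateway(env) ? "gateway" : "legacy-forward";
}
```

Comparing against the exact string keeps rollback unambiguous: any typo or partial edit of the var lands on the legacy path rather than the new one.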

What the DO does

Before: each WS upgrade was a transparent fetch passthrough. Any disruption
along the chain (worker restart, cell-CP blip, live migration, idle drop)
killed the user's terminal with an abrupt transport close. No state, no
recovery.

After: the DO terminates both ends with a WebSocketPair, bridges frames,
and stays up when either side dies. Layered by version:

  • v2 alarm tick (30s) for liveness + idle-storage cleanup.
  • v3 redial on upstream close with exponential backoff. Closes the client
    cleanly when D1 says the sandbox is stopped/failed or the cell returns
    404/410.
  • v5 per-sandbox cap-token cache shared across sessions in the same DO.
  • v6 exec-exit suppression (worker's 5-byte 0x03+exitCode marker
    triggers a clean 1000 / "exec completed" close instead of an
    infinite-redial loop) + empty-frame keepalive on each alarm tick to defend
    the DO↔cell hop against the CF Workers fetch-WS idle drop (~100s).
  • v7 per-session refactor: each fetch() creates a Session and the
    gateway holds Set<Session>, so two clients on the same sandbox don't
    clobber each other. Migration-aware backoff (2s × 30 ≈ 60s on
    503 /migrating/). Per-session flap circuit breaker (>3 redial cycles in
    60s closes with 1011 "upstream flapping"). Dropped the v4 output ring
    buffer — cell-side scrollback (1 MB ring, resent on every reattach) is a
    strict superset and was duplicating output.
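
The v3/v7 redial rules above can be sketched as a pure decision function plus a sliding-window breaker. This is an illustration, not the code in sandbox_ws_gateway.ts: nextRedial, FlapBreaker, BASE_DELAY_MS, MAX_DELAY_MS, and the close reasons marked "assumed" are hypothetical. Only the 2s × 30 migration schedule, the 404/410 clean close, and the ">3 cycles in 60s" threshold come from the description.

```typescript
const BASE_DELAY_MS = 250; // assumed starting delay for generic backoff
const MAX_DELAY_MS = 8_000; // assumed cap for generic backoff
const MIGRATION_DELAY_MS = 2_000; // from the PR: 2s per attempt
const MIGRATION_MAX_ATTEMPTS = 30; // from the PR: 30 attempts, about 60s

type RedialDecision =
  | { kind: "retry"; delayMs: number }
  | { kind: "close"; code: number; reason: string };

// attempt is zero-based: 0 for the first redial after an upstream close.
function nextRedial(attempt: number, status: number, body: string): RedialDecision {
  if (status === 404 || status === 410) {
    // The cell no longer knows the sandbox: stop redialing and close cleanly.
    return { kind: "close", code: 1000, reason: "sandbox stopped" }; // reason assumed
  }
  if (status === 503 && /migrating/.test(body)) {
    if (attempt >= MIGRATION_MAX_ATTEMPTS) {
      return { kind: "close", code: 1011, reason: "migration timed out" }; // assumed
    }
    return { kind: "retry", delayMs: MIGRATION_DELAY_MS }; // flat, not exponential
  }
  // Generic upstream failure: exponential backoff with a cap.
  return { kind: "retry", delayMs: Math.min(MAX_DELAY_MS, BASE_DELAY_MS * 2 ** attempt) };
}

// Per-session breaker: trips when more than maxCycles redial cycles fall
// inside the trailing window.
class FlapBreaker {
  private cycles: number[] = [];
  private readonly maxCycles: number;
  private readonly windowMs: number;

  constructor(maxCycles = 3, windowMs = 60_000) {
    this.maxCycles = maxCycles;
    this.windowMs = windowMs;
  }

  // Record one redial cycle at nowMs; returns true if the session should be
  // closed with 1011 "upstream flapping".
  recordCycle(nowMs: number): boolean {
    this.cycles = this.cycles.filter((t) => nowMs - t < this.windowMs);
    this.cycles.push(nowMs);
    return this.cycles.length > this.maxCycles;
  }
}
```

Keeping the policy free of timers and sockets means the alarm handler only has to apply the returned decision, and the schedule can be unit-tested without a DO runtime.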

Worker-side: sessions survive live migration

Sessions stay alive end-to-end across a worker migration. The in-VM agent
owns the PTY/exec session map; what dies is the host-side bookkeeping on
the source worker. The fix:

  1. Source release: the callback set via SetMigrationOutgoingCallback calls
    PTYManager.ReleaseForSandbox and ExecSessionManager.ReleaseForSandbox
    after LiveMigrate completes. PTY closes its gRPC stream; exec cancels
    its stream context via a new Cancel field on the handle. Without this,
    the source worker's WS handler stayed blocked on a dead vsock and the DO
    never saw an upstream close.
  2. Lazy rebind on destination — when the destination worker's local
    session map misses, handlers.go falls back to
    RebindFromAgent(sandboxID, sessionID). The manager opens a fresh
    agent.PTYAttach / agent.ExecSessionAttach against the existing
    session_id; the agent's first message is the existence check (returns
    "not found" if the in-VM session is genuinely gone). On success a fresh
    local handle is registered and the WS upgrade returns 101.
  3. Don't fake exits on cancel — the exec WS handler only emits the
    0x03 exit-code marker when ExitCode != nil (a real process exit). A
    stream cancel during migration release no longer looks like
    "exec completed" to the edge DO.

No agent changes, no proto changes, no edge DO logic changes — pure
worker plumbing.
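
For reference, the exit-marker contract from step 3 could look roughly like the sketch below, written in TypeScript to match the DO side (the real emitter lives in the Go worker). It assumes the 5 bytes are a 0x03 tag followed by a big-endian signed 32-bit exit code; the byte order and the function names are assumptions, not taken from the code.

```typescript
const EXIT_TAG = 0x03;
const EXIT_FRAME_LEN = 5; // 1 tag byte + 4 exit-code bytes

// Worker side: build a marker only for a real process exit. A null exit code
// (for example a stream cancelled during migration release) yields no marker.
function encodeExitMarker(exitCode: number | null): Uint8Array | null {
  if (exitCode === null) return null;
  const buf = new Uint8Array(EXIT_FRAME_LEN);
  buf[0] = EXIT_TAG;
  new DataView(buf.buffer).setInt32(1, exitCode, false); // big-endian: assumed
  return buf;
}

// Edge side: returns the exit code if the frame is an exit marker, else null.
// A non-null result is what would trigger the 1000 "exec completed" close
// instead of a redial.
function parseExitMarker(frame: Uint8Array): number | null {
  if (frame.byteLength !== EXIT_FRAME_LEN || frame[0] !== EXIT_TAG) return null;
  const view = new DataView(frame.buffer, frame.byteOffset, frame.byteLength);
  return view.getInt32(1, false);
}
```

Because a cancelled stream produces no frame at all, the DO sees only an upstream close and takes the redial path, which is the behavior tests 07 and 08 rely on.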

User experience: typing through a migration window pauses for ~500ms–2s
(the DO redial), then resumes on the destination worker. No reconnect, no
new session_id, no lost in-VM process.

Tested

End-to-end against dev (scripts/integration-tests/05-08*.py, exit 0/1):

| #  | Test                          | What it catches |
|----|-------------------------------|-----------------|
| 05 | ws-multi-client.py            | Two concurrent PTYs on the same sandbox stay independent (would have clobbered each other pre-v7) |
| 06 | ws-exec-exit-clean.py         | Exec exits cleanly → exactly one scrollback_end, exactly one exit marker, close 1000 'exec completed' (pre-v6 logged 4000+ scrollback_ends in a loop) |
| 07 | ws-pty-survives-migration.py  | PTY WS held open through POST /migrate; file written pre-mig is readable post-mig; client never closes |
| 08 | ws-exec-survives-migration.py | Exec output stream continues through migration; final close is the real exec completed, not a fake one from cancel |

Plus the manual scenarios from earlier rounds: 75 s idle PTY survives the
keepalive, sandbox DELETE closes with 1000 'sandbox stopped', worker
process restart triggers the redial path, etc.

Files

cloudflare-workers/shared/sandbox_ws_gateway.ts   # the DO (new, ~700 LOC)
cloudflare-workers/api-edge/src/index.ts          # SDK pty/exec WS route → routeWsViaGateway
cloudflare-workers/api-edge/src/dashboard.ts      # dashboard pty WS route → same helper
cloudflare-workers/api-edge/wrangler.toml         # SANDBOX_WS binding, v1-v4 migrations, WS_VIA_DO=1
cloudflare-workers/api-edge/wrangler.prod.toml    # mirror on prod (WS_VIA_DO commented = off by default)
internal/sandbox/pty.go                           # + RebindFromAgent, ReleaseForSandbox
internal/sandbox/exec_session.go                  # + RebindFromAgent, ReleaseForSandbox, Cancel field
internal/qemu/manager.go                          # + SetMigrationOutgoingCallback
internal/qemu/migration.go                        # call the callback after LiveMigrate cleanup
internal/worker/handlers.go                       # cache-miss → Rebind fallback; skip 0x03 on non-exit
cmd/worker/main.go                                # rebind funcs + cancel plumbing + callback wiring
scripts/integration-tests/05-08*.py + _ws_common  # tests

Rollback / cutover

  • Edge: set WS_VIA_DO to anything other than "1" (or unset) and
    redeploy. proxyToCellSDK and proxyWebSocket fall back to the
    transparent forward immediately. No DO instances are torn down — they
    just stop being routed to and the alarm eventually releases their
    storage.
  • Worker: the Rebind path is a fallthrough on cache miss. With
    WS_VIA_DO=0 on the edge, the destination worker still rebinds
    correctly if a stale session_id happens to land on it after worker
    restart. Strict net-positive over current behavior.
  • Prod: wrangler.prod.toml ships the binding + migration but leaves
    WS_VIA_DO commented out, so promoting this PR is infrastructure-only.
    Flip the flag in a follow-up after dev burn-in.

Brian Reardon added 6 commits June 1, 2026 10:10
…ial backoff, flap circuit breaker, drop redundant scrollback buffer
…arker when handler exits without a real process exit; ws-edge integration tests 05-08
# Conflicts:
#	cloudflare-workers/api-edge/wrangler.toml
@breardon2011 breardon2011 marked this pull request as ready for review June 3, 2026 22:27
@breardon2011 breardon2011 marked this pull request as draft June 3, 2026 23:23
@breardon2011

Contributor Author

Going to make a separate PR for the Cell gateway.

@breardon2011 breardon2011 mentioned this pull request Jun 8, 2026