Websocket broker #360
Merged
added 5 commits
June 4, 2026 17:21
…arker when handler exits without a real process exit; ws-edge integration tests 05-08
…503/migrating, clean 1000 close on 404/410 terminal
motatoes
approved these changes
Jun 9, 2026
Summary
Adds a per-cell WebSocket broker, internal/wsgateway, that sits between
clients (SDK, dashboard) and workers for all PTY, exec, and agent
WebSocket traffic. Replaces the legacy SandboxAPIProxy.doWebSocket
hijack-and-io.Copy path with multi-session, redial on upstream close,
keepalive, exec-exit suppression, migration-aware backoff, and a flap
circuit breaker.
This is the cell-only design we chose over a Cloudflare Durable Object:
the broker layer structurally belongs where the worker is (the cell), not
at the edge. Same broker code now handles hosted and self-hosted cells
with no divergence.
Architecture
CF Worker becomes a thin auth + cell-routing layer that transparently
forwards the WS upgrade. Both proxyToCellSDK (SDK path) and
proxyWebSocket (dashboard path) now forward; the dashboard's previous
120-LOC edge-side WebSocketPair bridge is gone, which was the only thing
keeping the dashboard from getting broker benefits.
What the broker does
Per-sandbox actor goroutine (created on first session, released when
idle) holding:
… SDK, two dashboard tabs) each get their own Session instance.
… 10). On upstream close, re-resolves the worker URL and re-dials. Client
WS stays open across the redial window.
… 503 migrating during resolve, switch to a steadier 2 s × 30 (~60 s
budget) instead of burning the fast ladder.
… 404 not found or 410 stopped, close client with 1000 / "sandbox stopped"
(or equivalent) instead of cycling redials.
… 0x03 + exitCode frame before closing a completed exec session; broker
recognizes it and closes the client with 1000 / "exec completed" instead
of redialing into a now-done session.
… alarm tick. Defends the cell ↔ worker hop against any long-idle
middlebox drops; empty frames are no-ops on every receiver.
… client with 1011 / "upstream flapping". Prevents a buggy upstream from
burning resources indefinitely.
Worker-side foundation (cherry-picked from PR #350)
The Go broker depends on worker-side changes that make session state
survive worker swaps. These are cherry-picked into this branch:
PTYManager.RebindFromAgent + ExecSessionManager.RebindFromAgent: lazy
reattach to the in-VM agent when the worker's local cache misses
(live migration or worker process restart).
PTYManager.ReleaseForSandbox + ExecSessionManager.ReleaseForSandbox:
source-side handle release on outgoing migration, without killing the
in-VM session.
SetMigrationOutgoingCallback on the QEMU manager + matching wiring in
cmd/worker/main.go.
… 0x03 exit marker when ExitCode is nil (i.e. cancel-during-migration vs
real process exit).
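The exit-marker handling can be illustrated with a small Go sketch. Only the 0x03 marker byte and the "no marker when ExitCode is nil" rule come from this PR; the 4-byte big-endian exit-code payload and the EncodeExit and ParseExit names are assumptions made for illustration, not the worker's actual wire format.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// exitMarker is the frame prefix named in the PR description.
const exitMarker byte = 0x03

// EncodeExit builds an exit frame for a finished exec session.
// It returns nil when exitCode is nil, i.e. the handler exited without a
// real process exit (for example a cancel during migration), so the broker
// sees a plain upstream close and redials instead of ending the session.
// The 4-byte big-endian payload is an assumed encoding.
func EncodeExit(exitCode *int32) []byte {
	if exitCode == nil {
		return nil
	}
	frame := make([]byte, 5)
	frame[0] = exitMarker
	binary.BigEndian.PutUint32(frame[1:], uint32(*exitCode))
	return frame
}

// ParseExit reports whether frame is an exit marker and, if so, the code.
func ParseExit(frame []byte) (int32, bool) {
	if len(frame) != 5 || frame[0] != exitMarker {
		return 0, false
	}
	return int32(binary.BigEndian.Uint32(frame[1:])), true
}

func main() {
	code := int32(0)
	if c, ok := ParseExit(EncodeExit(&code)); ok {
		fmt.Println("exec completed with exit code", c)
	}
	if EncodeExit(nil) == nil {
		fmt.Println("no marker sent: broker should redial")
	}
}
```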
No agent changes, no proto changes. Lazy rebind uses the existing
PTYAttach / ExecSessionAttach RPCs.
What this means for users
… session ID rebinds against the in-VM agent on whichever worker now
hosts the VM).
… migration window pauses for the redial duration (sub-second to a few
seconds) then resumes. Same shell, same bash PID, same /tmp contents.
… clobbering each other.
… spinning into a redial loop.
DELETE /api/sandboxes/{id} while a client is connected closes the WS
cleanly with code=1000 reason='sandbox stopped'.
Tested
In-repo integration tests in scripts/integration-tests/05-08*.py. All
four pass end-to-end against dev:
ws-multi-client.py
ws-exec-exit-clean.py: … 1000 'exec completed'); pre-fix logged 4000+
ws-pty-survives-migration.py: POST /migrate with WS held open; file written pre-mig readable post-mig
ws-exec-survives-migration.py: … exec completed
Plus manual scenarios verified:
… t=60 s
DELETE mid-session closes with 1000 'sandbox stopped' (verified
directly via /tmp/verify_terminal_close.py)
… loop
Files
Total: +1957 / -159.
Out of scope
… /migrate is within-cell. Code path exists (broker re-resolves cell on
every redial) but no end-to-end test.
… to migration (empty local map → lazy rebind picks up).
SandboxAPIProxy.doWebSocket: the legacy hijack path is unreachable now
but still in the tree (~100 LOC). Can be removed in a follow-up after
some prod burn-in.
Rollback
git revert + redeploy. There's no env-var gate; the broker is the
canonical WS path. Same operational story as for any other commit.