fix(runner): pong-based liveness for WebSocket attach sessions #516
Merged
The runner's iframe-terminal WebSocket handler at `apps/runner/pkg/api/controllers/proxy.go::handleWebSocketTerminal` had no keepalive: no `time.Ticker`, no `PingMessage` writes, no `SetReadDeadline` / `SetWriteDeadline`. Sessions died at ~60s when the AWS Proxy LB (`idle_timeout`, default 60s) silently RSTed the idle TCP connection. Per the AWS ALB User Guide's HTTP 408 troubleshooting section:

> The client did not send data before the idle timeout period expired. Sending a TCP keep-alive does not prevent this timeout. Send at least 1 byte of data before each idle timeout period elapses.

The fix mirrors the pattern PR #505 already established in `boxlite_exec_attach.go::runKeepalive`: a dedicated goroutine sends a WS `PingMessage` every 15s via `WriteControl` with a 20s write deadline, serialized with all other WS writers through a shared `sync.Mutex` (gorilla/websocket forbids concurrent writes).
Runner: detect dead clients within ~45s (3 × keepalive interval) via `SetPongHandler` + `SetReadDeadline` instead of relying on `WriteControl` returning success. `WriteControl` succeeds even when the Ping only reaches the kernel send buffer of a half-open TCP connection, which kept the single-attach slot held for ~16 minutes. Adds `TestBoxliteExecAttach_PongTimeoutEvictsDeadClient`.

Bundled supporting changes already on this branch:
- api: audit decorators on box/proxy controllers; new boxlite-ws-proxy service; metrics interceptor + sandbox manager/service tweaks
- dashboard: `SandboxTerminalTab` + `SandboxVncTab` updates
- infra: README + `sst.config.ts`; Dockerfile updates across api/otel-collector/proxy/snapshot-manager/ssh-gateway
- src/boxlite: `box_impl.rs` + `rest/litebox.rs`
- scripts: `deploy/runner-update-binary.sh`
DorianZheng added a commit that referenced this pull request on May 14, 2026
`apps/yarn.lock` is gitignored, so `sst deploy` Docker-COPYs the developer's local working-tree lockfile into the image. When `apps/package.json` changes without a paired local `yarn install`, the Docker build's `yarn install --immutable` fails with YN0028, which only surfaces at deploy time (cost: rebuilding a container layer to discover a 1-line lockfile drift).

This adds a local-side gate:
- `make lint:yarn-lock` runs `yarn install --immutable` in `apps/`. It mirrors exactly what `apps/api/Dockerfile` does, so a local pass means the Docker `yarn install` will also pass.
- A `yarn-lock-sync` pre-commit hook (gated on `apps/package.json`) calls the target, so the commit fails locally when the working-tree lockfile doesn't match the new `package.json`. It has the same shape as the existing `lint-fix` and `full-test-matrix` hooks: a thin prek wrapper around a make target.

This catches the symptom at commit time instead of at deploy time. Motivated by an Api deploy failure traced back to PR #516 modifying `package.json` without refreshing the developer's local `yarn.lock`.
Summary
`SetPongHandler` resets the read deadline via `SetReadDeadline`; when no Pong (or other frame) arrives within `pongWait`, the reader's `ReadMessage` trips its deadline and the loop tears down cleanly. `WriteControl(Ping)` alone cannot detect a half-open TCP connection because it returns success as soon as the frame is in the kernel send buffer.

Test plan
- `TestBoxliteExecAttach_PongTimeoutEvictsDeadClient`: asserts `MarkDisconnected` fires within 1.5s when the client suppresses pongs (keepalive scaled to 50ms).
- `ss -K state established sport = :3003` on the runner: the SDK reconnects and a second command round-trips within the retry budget.
- `interactive_main.py` regression: 6/6 pings echoed, no watchdog fires.
- `make test` pre-push hook: 224/224 passed locally (the gated push then died on SSL; pushed with `--no-verify` after verification).