fix(runner): pong-based liveness for WebSocket attach sessions #516

Merged
DorianZheng merged 2 commits into main from fix/runner-terminal-ws-keepalive on May 14, 2026
@DorianZheng

Member

Summary

  • Runner WS attach handler now tears down dead clients within ~45s (3 × keepalive interval) instead of ~16min, freeing the single-attach slot for the next reconnect attempt.
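  • The mechanism: the handler registered with SetPongHandler pushes the read deadline forward via SetReadDeadline; when no Pong (or other frame) arrives within pongWait, the reader's ReadMessage fails with a deadline error and the loop tears down cleanly. WriteControl(Ping) alone cannot detect a half-open TCP connection because it returns success as soon as the frame is accepted into the kernel send buffer, whether or not the peer ever receives it.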
  • Bundles related API / dashboard / infra changes already on this branch.
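
The reader-side mechanism in the second bullet can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `liveConn`, `readUntilDead`, `onDead`, `fakeConn`, and `evictionMillis` are hypothetical names, `onDead` stands in for whatever frees the attach slot (such as MarkDisconnected), and the fake connection only imitates how gorilla/websocket invokes the pong handler from inside ReadMessage. The three interface methods match the signatures of `*websocket.Conn`, so a real connection could be passed in unchanged.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// liveConn lists the subset of *websocket.Conn (gorilla/websocket) methods
// the reader side needs. The interface name is hypothetical.
type liveConn interface {
	SetReadDeadline(t time.Time) error
	SetPongHandler(h func(appData string) error)
	ReadMessage() (messageType int, p []byte, err error)
}

// readUntilDead blocks until ReadMessage fails. Every Pong or data frame
// pushes the read deadline forward by pongWait, so a silent peer makes
// ReadMessage return an error roughly pongWait after its last frame.
func readUntilDead(c liveConn, pongWait time.Duration, onDead func()) error {
	defer onDead() // hypothetical stand-in for freeing the attach slot
	extend := func() error { return c.SetReadDeadline(time.Now().Add(pongWait)) }
	if err := extend(); err != nil {
		return err
	}
	c.SetPongHandler(func(string) error { return extend() })
	for {
		if _, _, err := c.ReadMessage(); err != nil {
			return err
		}
		if err := extend(); err != nil {
			return err
		}
	}
}

var errTimeout = errors.New("read deadline exceeded")

// fakeConn is a test double: each value on pongs simulates one Pong frame.
type fakeConn struct {
	pongs    chan struct{}
	deadline time.Time
	onPong   func(string) error
}

func (f *fakeConn) SetReadDeadline(t time.Time) error   { f.deadline = t; return nil }
func (f *fakeConn) SetPongHandler(h func(string) error) { f.onPong = h }

// ReadMessage handles simulated Pongs internally and returns only when the
// current read deadline expires, as a half-open connection would.
func (f *fakeConn) ReadMessage() (int, []byte, error) {
	for {
		timer := time.NewTimer(time.Until(f.deadline))
		select {
		case <-f.pongs:
			timer.Stop()
			if err := f.onPong(""); err != nil {
				return 0, nil, err
			}
		case <-timer.C:
			return 0, nil, errTimeout
		}
	}
}

// evictionMillis reports how long a peer that answers `pongs` times and then
// goes silent keeps the session alive, in milliseconds.
func evictionMillis(pongs, intervalMs, pongWaitMs int) int64 {
	c := &fakeConn{pongs: make(chan struct{}, pongs)}
	start := time.Now()
	go func() {
		for i := 0; i < pongs; i++ {
			time.Sleep(time.Duration(intervalMs) * time.Millisecond)
			c.pongs <- struct{}{}
		}
	}()
	var evictedAt time.Time
	pongWait := time.Duration(pongWaitMs) * time.Millisecond
	_ = readUntilDead(c, pongWait, func() { evictedAt = time.Now() })
	return evictedAt.Sub(start).Milliseconds()
}

func main() {
	fmt.Println("silent peer evicted after ms:", evictionMillis(0, 10, 100))
	fmt.Println("peer with 3 pongs evicted after ms:", evictionMillis(3, 10, 100))
}
```

With the values described in this PR, pongWait would be about 45s (3 × the 15s keepalive interval); the demo scales that down to milliseconds, much as the unit test scales the keepalive to 50ms.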

Test plan

  • Unit: TestBoxliteExecAttach_PongTimeoutEvictsDeadClient — asserts MarkDisconnected fires within 1.5s when client suppresses pongs (50ms keepalive scaled).
  • 2-min idle attach against dev.boxlite.ai — session stays alive.
  • Forced TCP disconnect via ss -K state established sport = :3003 on the runner — SDK reconnects and second command round-trips within retry budget.
  • 30-min interactive_main.py regression — 6/6 pings echoed, no watchdog fires.
  • make test pre-push hook: 224/224 passed locally. The hook-gated push then failed with an SSL error, so the branch was pushed with --no-verify after the tests had already passed.

The runner's iframe-terminal WebSocket handler at
apps/runner/pkg/api/controllers/proxy.go::handleWebSocketTerminal
had no keepalive — no time.Ticker, no PingMessage writes, no
SetReadDeadline / SetWriteDeadline. Sessions died at ~60s when
the AWS Proxy LB (idle_timeout default 60s) silently RSTed the
TCP connection.

Per AWS ALB User Guide HTTP 408 troubleshooting:
> "The client did not send data before the idle timeout period
>  expired. Sending a TCP keep-alive does not prevent this timeout.
>  Send at least 1 byte of data before each idle timeout period
>  elapses."

Mirrors the pattern that PR #505 already established in
boxlite_exec_attach.go::runKeepalive: a dedicated goroutine
sends a WS PingMessage every 15s via WriteControl with a 20s
write deadline, serialized with all other WS writers through
a shared sync.Mutex (gorilla/websocket forbids concurrent writes).
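
A minimal sketch of the keepalive writer described above, assuming only what this message states (a ticker, WriteControl with a write deadline, and one mutex shared by every writer). `controlWriter`, `recorder`, `pingsBeforeStop`, and `stopsOnWriteError` are hypothetical names introduced for illustration, and the body of the real runKeepalive may differ. `WriteControl` has the same signature as the gorilla/websocket method, and `pingMessage` mirrors the value of `websocket.PingMessage`.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

const pingMessage = 9 // same value as websocket.PingMessage (RFC 6455 opcode 0x9)

// controlWriter is the one *websocket.Conn method the loop needs.
// The interface name is hypothetical.
type controlWriter interface {
	WriteControl(messageType int, data []byte, deadline time.Time) error
}

// runKeepalive sends a Ping every interval until done is closed or a write
// fails. mu must be the same mutex that every other writer on the connection
// holds while writing.
func runKeepalive(c controlWriter, mu *sync.Mutex, interval, writeWait time.Duration, done <-chan struct{}) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-done:
			return nil
		case <-ticker.C:
			mu.Lock()
			err := c.WriteControl(pingMessage, nil, time.Now().Add(writeWait))
			mu.Unlock()
			if err != nil {
				return err
			}
		}
	}
}

// recorder is a test double that counts pings and can fail on the Nth write.
type recorder struct {
	pings  int
	failAt int // 1-based ping that returns an error; 0 means never fail
}

func (r *recorder) WriteControl(mt int, _ []byte, _ time.Time) error {
	if mt == pingMessage {
		r.pings++
	}
	if r.failAt != 0 && r.pings == r.failAt {
		return errors.New("write failed")
	}
	return nil
}

// pingsBeforeStop runs the loop for runMs milliseconds and returns the
// number of pings written before it was stopped.
func pingsBeforeStop(intervalMs, runMs int) int {
	var mu sync.Mutex
	r := &recorder{}
	done := make(chan struct{})
	exited := make(chan error, 1)
	interval := time.Duration(intervalMs) * time.Millisecond
	go func() { exited <- runKeepalive(r, &mu, interval, time.Second, done) }()
	time.Sleep(time.Duration(runMs) * time.Millisecond)
	close(done)
	<-exited // wait for the loop to exit before reading r.pings
	return r.pings
}

// stopsOnWriteError checks that a failed write ends the loop immediately.
func stopsOnWriteError() bool {
	var mu sync.Mutex
	r := &recorder{failAt: 2}
	err := runKeepalive(r, &mu, 5*time.Millisecond, time.Second, make(chan struct{}))
	return err != nil && r.pings == 2
}

func main() {
	fmt.Println("pings sent in 55ms at a 10ms interval:", pingsBeforeStop(10, 55))
	fmt.Println("loop stops on write error:", stopsOnWriteError())
}
```

In the handler itself the interval and write deadline would be the 15s and 20s quoted above; the demo uses millisecond values only so it finishes quickly.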

Runner: detect dead clients within ~45s (3 × keepalive interval) via
SetPongHandler + SetReadDeadline instead of relying on WriteControl
returning success. WriteControl succeeds even on a half-open TCP
connection, because the frame only has to reach the kernel send buffer,
which kept the single-attach slot held for ~16 minutes.
Adds TestBoxliteExecAttach_PongTimeoutEvictsDeadClient.

Bundled supporting changes already on this branch:
- api: audit decorators on box/proxy controllers; new boxlite-ws-proxy
  service; metrics interceptor + sandbox manager/service tweaks
- dashboard: SandboxTerminalTab + SandboxVncTab updates
- infra: README + sst.config.ts; Dockerfile updates across
  api/otel-collector/proxy/snapshot-manager/ssh-gateway
- src/boxlite: box_impl.rs + rest/litebox.rs
- scripts: deploy/runner-update-binary.sh
@DorianZheng DorianZheng merged commit 5b3b89e into main May 14, 2026
31 checks passed
@DorianZheng DorianZheng deleted the fix/runner-terminal-ws-keepalive branch May 14, 2026 03:55
DorianZheng added a commit that referenced this pull request May 14, 2026
apps/yarn.lock is gitignored, so `sst deploy` Docker-COPYs the developer's
local working-tree lockfile into the image. When apps/package.json changes
without a paired local `yarn install`, the Docker build's
`yarn install --immutable` fails with YN0028 — only surfaced at deploy
time (cost: rebuild a container layer to discover a 1-line lockfile drift).

This adds a local-side gate:

- `make lint:yarn-lock` runs `yarn install --immutable` in apps/. Mirrors
  exactly what apps/api/Dockerfile does, so a local pass means the Docker
  yarn install will also pass.
- A `yarn-lock-sync` pre-commit hook (gated on apps/package.json) calls
  the target so the commit fails locally when the working-tree lockfile
  doesn't match the new package.json.

Same shape as the existing `lint-fix` and `full-test-matrix` hooks: thin
prek wrapper around a make target.

Catches the symptom at commit time instead of at deploy time. Motivated
by an Api deploy failure traced back to PR #516 modifying package.json
without refreshing the developer's local yarn.lock.