fix(runner): SSH gateway uses BoxLite exec (ssh -p 2222 back online) #524

Merged: DorianZheng merged 4 commits into main from fix/runner-sshgateway-use-startexecution on May 14, 2026
@DorianZheng
Member

Summary

Commit d7717290 ("replace Docker with BoxLite VM backend") removed Docker without replacing the userland DNS the SSH gateway depended on. apps/runner/pkg/sshgateway/service.go::connectToSandbox was still doing ssh.Dial("tcp", "<sandbox-uuid>:22220"), relying on Docker-era container-name DNS. Post-upgrade every ssh -p 2222 <token>@ssh.dev.boxlite.ai failed with lookup <uuid> on 127.0.0.53:53: server misbehaving; the user saw Connection closed.

Root cause

The original SSH-to-SSH bridge assumed a daemon SSH server listening inside each sandbox at port 22220, reachable by container-name DNS. With Docker gone there's no DNS, no per-container IP, and no in-VM SSH daemon — only libkrun's vsock. The dashboard WebSocket terminal at controllers/proxy.go:114 already routes through libkrun via boxlite.Client.StartExecution; the SSH gateway was the only caller still on the dead path.

Fix

Four commits, smallest unit each:

  • baca7b62 — replace dial-by-UUID with StartExecution bridge. SSH channel requests (pty-req/env/shell/exec/window-change/signal) map onto SDK calls. Drops connectToSandbox, getSandboxDetails, ssh.Password("sandbox-ssh"), InsecureIgnoreHostKey.
  • c42f2e7d — new apps/runner/pkg/shellutil/launcher.go with DefaultInteractiveShell() returning /bin/sh -c 'exec $(command -v bash || command -v ash || command -v sh) -l'. Shared by both sshgateway/service.go and controllers/proxy.go::handleWebSocketTerminal so dashboard terminal + iframe terminal + SSH all land on the same shell. Follows kubectl exec convention.
  • b2e722be — accept subsystem sftp requests (RFC 4254 §6.5; OpenSSH 9.0+ scp defaults to SFTP).
  • bada4280 — cd "${HOME:-/root}" before exec (matches OpenSSH chdir(pw_dir) so users land at ~, not /); explicit error when sftp-server binary is missing in the sandbox image.

Test plan

  • ssh -p 2222 <token>@ssh.dev.boxlite.ai lands at boxlite:~# prompt (verified post-deploy).
  • Dashboard sandbox-detail Terminal tab still works (unified launcher).
  • ssh user@host "ls /" one-shot exec returns content and exits cleanly.
  • scp -P 2222 file <token>@ssh.dev.boxlite.ai:/root/ (known limitation, see below).

Known limitations / follow-ups

  • scp against snapshots without sftp-server: shows scp: Connection closed. OpenSSH scp discards SSH channel stderr (extended-data type 1) in SFTP subsystem mode, so the fail-loud stderr from bada4280 does run but the user never sees it. Two workarounds: (a) install openssh-sftp-server in the sandbox image (apk add openssh-sftp-server on alpine), (b) scp -O -P 2222 … falls back to legacy SCP protocol over exec. Cleaner server-side fix is a pre-flight probe + Reply(false, nil) so scp prints subsystem request failed — deferred to a follow-up PR.
  • No port forwarding (-L/-R/Unix sockets). Daytona supports this via gliderlabs/ssh callbacks; we deliberately ship a minimal subset here.

apps/runner/pkg/sshgateway/service.go connectToSandbox did
ssh.Dial("tcp", "<sandbox-uuid>:22220"), relying on Docker's userland
DNS to resolve sandbox UUIDs to per-container IPs. Commit d771729
("feat(runner): replace Docker with BoxLite VM backend") removed
Docker without replacing that resolver, so post-upgrade every public
SSH session to a sandbox dies at the dial with
  "lookup <uuid> on 127.0.0.53:53: server misbehaving"
from systemd-resolved (no UUID->IP host DNS exists on the Runner EC2).

Replace the SSH-to-SSH bridge with an SSH-to-StartExecution bridge,
exactly the way apps/runner/pkg/api/controllers/proxy.go:114 already
does for the dashboard WebSocket terminal. Both code paths now use:
  s.boxlite.StartExecution(ctx, sandboxId, cmd, args, stdout, stderr, tty)
which routes through libkrun vsock - no IP, no DNS, no Docker.

Channel-request handling:
- pty-req       -> tty=true; remember rows/cols; ResizeTTY on start
- env           -> accept silently
- shell         -> StartExecution("/bin/bash")
- exec          -> StartExecution("/bin/sh", "-c", payload)
- window-change -> exec.ResizeTTY
- signal        -> best-effort no-op (Ctrl-C still works via pty)

Lifecycle:
- exec.Wait blocks teardown so exit-status SSH request lands before
  channel close
- exec.Stdin closes on SSH peer write-side close so commands see EOF

Drops: connectToSandbox, getSandboxDetails, SandboxDetails struct,
hardcoded ssh.Password("sandbox-ssh"), InsecureIgnoreHostKey.

May-13 logs show this same shape ("Starting exec in sandbox" /
"Exec completed" with component=ssh_gateway_service). May-14 main
regressed to the dial path; this restores the correct architecture.

All three entry points that drop the user into a shell inside a sandbox
VM now share one launcher. Previously they each hardcoded a shell, with
two failure modes:

  - apps/runner/pkg/api/controllers/proxy.go:114 (dashboard / iframe
    terminal) hardcoded /bin/sh -- correct for the default alpine
    snapshot but no bash for users who'd prefer it.
  - apps/runner/pkg/sshgateway/service.go (public SSH gateway, fixed
    in baca7b6 to use StartExecution) defaulted /bin/bash -- broke
    on the default snapshot because bash is not installed.

Introduce apps/runner/pkg/shellutil/launcher.go with one helper:

    func DefaultInteractiveShell() (cmd string, args []string)

Returns: /bin/sh, ["-c", "exec $(command -v bash || command -v ash || command -v sh) -l"]

Rationale (kubectl exec convention, per Kubernetes docs):
  - /bin/sh is POSIX-required -> launcher process itself always starts.
  - command -v is POSIX, works on busybox/alpine and full distros.
  - exec replaces the launcher sh -> no extra PID; chosen shell is the
    session's pid 1.
  - -l makes it a login shell -> ~/.profile is sourced, PWD=$HOME,
    PATH/HOME exported. Matches what `ssh user@host` users expect.
  - Tries bash first (preferred), falls back to ash (alpine), then sh
    (POSIX guarantee).
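A minimal sketch of the helper as described above, using the signature and return value quoted in this commit message. The launcherScript constant name is an assumption, and the body of the real launcher.go is not shown in this PR page, so treat this as an illustration rather than the shipped code.

```go
package main

import "fmt"

// launcherScript picks the first available shell and replaces the launcher
// process with it as a login shell. The constant name is hypothetical.
const launcherScript = `exec $(command -v bash || command -v ash || command -v sh) -l`

// DefaultInteractiveShell returns the command and arguments used to start an
// interactive session: /bin/sh runs the script, which execs the chosen shell.
func DefaultInteractiveShell() (cmd string, args []string) {
	return "/bin/sh", []string{"-c", launcherScript}
}

func main() {
	cmd, args := DefaultInteractiveShell()
	fmt.Printf("%s %q\n", cmd, args)
}
```

Both callers would pass the returned pair straight to StartExecution, which is what keeps the dashboard terminal and the SSH gateway on the same shell.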

Wired into:
  - apps/runner/pkg/api/controllers/proxy.go:handleWebSocketTerminal
  - apps/runner/pkg/sshgateway/service.go (default for `shell` requests;
    `exec` requests still take user-supplied commands as before)

The exec request path keeps /bin/sh -c <payload> unchanged: when the
user explicitly types `ssh host "cmd args"`, OpenSSH-canonical behaviour
is to run it under sh -c, and there is no shell-preference ambiguity to
resolve in that case.

OpenSSH 9.0+ scp defaults to the SFTP subsystem (RFC 4254 §6.5) instead
of the legacy `exec scp -t` protocol. Without "subsystem" handling, the
runner replies false and the client sees:

  subsystem request failed on channel 0
  scp: Connection closed

Add a case for "subsystem":
- only "sftp" is supported (matches the OpenSSH default)
- spawn sftp-server inside the VM via the same StartExecution path
- probe install locations in order (alpine /usr/lib/ssh, debian-ish
  /usr/lib/openssh, RHEL-ish /usr/libexec, plus PATH) so unusual
  layouts still work without per-image hardcoding
- no TTY for binary-protocol subsystems
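The accept/reject decision in the list above can be sketched as follows. The payload of a "subsystem" request is a single length-prefixed name (RFC 4254 §6.5); the function names parseSubsystem, acceptSubsystem, and encodeName are hypothetical, and the sketch only decides whether to accept and whether a TTY is allocated. It does not spawn sftp-server.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// parseSubsystem extracts the subsystem name from the request payload.
// It returns false when the payload is truncated.
func parseSubsystem(payload []byte) (string, bool) {
	if len(payload) < 4 {
		return "", false
	}
	n := uint64(binary.BigEndian.Uint32(payload))
	if uint64(len(payload)-4) < n {
		return "", false
	}
	return string(payload[4 : 4+n]), true
}

// acceptSubsystem accepts only "sftp" and never asks for a TTY, because
// the SFTP protocol is binary and must not pass through a terminal.
func acceptSubsystem(payload []byte) (accept, tty bool) {
	name, ok := parseSubsystem(payload)
	return ok && name == "sftp", false
}

// encodeName builds a subsystem payload for the demo below.
func encodeName(name string) []byte {
	return append(binary.BigEndian.AppendUint32(nil, uint32(len(name))), name...)
}

func main() {
	for _, name := range []string{"sftp", "netconf"} {
		accept, tty := acceptSubsystem(encodeName(name))
		fmt.Println(name, accept, tty)
	}
}
```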

Workaround if the VM ships no sftp-server: `scp -O -P 2222 ...` falls
back to the legacy protocol over `exec`, which has worked since the
baca7b6 rewrite.

Verified shell path still lands at boxlite:/# via the same SSH host.

Two launcher bugs surfaced by interactive testing:

1. Shell session landed at / instead of $HOME. OpenSSH-canonical
   behaviour is chdir(pw_dir) before exec'ing the user's shell -
   that's what makes `ssh user@host` drop you at ~. The -l flag
   sources profile but doesn't itself cd. Add `cd "${HOME:-/root}"`
   to the launcher so the SSH and dashboard-iframe terminal sessions
   both land at the user's home (mirrors what sshd does internally).

2. SFTP subsystem silently exited 0 when no sftp-server binary was
   found in the VM. `exec $(empty)` is a POSIX no-op that returns 0,
   so the scp client saw a clean EOF and reported only "Connection
   closed" with no error message. Move the launcher logic into
   shellutil.SftpSubsystem and explicitly check the resolved path:
   if empty, write a clear stderr message ("sftp-server not found in
   sandbox VM; install openssh-sftp-server, or fall back to 'scp -O'")
   and exit 127. The message is written to the SSH stderr stream, so
   clients that surface channel stderr can show it to users.
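The fail-loud behaviour from item 2 might look roughly like the sketch below: a shell script probes known locations, falls back to PATH, and exits 127 with the quoted message when nothing is found. The shellutil.SftpSubsystem name comes from the commit message, but its signature here is an assumption, as are sftpServerCandidates, sftpScript, and the full file paths (the commit message names only the directories).

```go
package main

import (
	"fmt"
	"strings"
)

// sftpServerCandidates lists probe locations in order. Directories follow the
// commit message; the sftp-server file names under them are assumptions.
var sftpServerCandidates = []string{
	"/usr/lib/ssh/sftp-server",         // alpine
	"/usr/lib/openssh/sftp-server",     // debian-ish
	"/usr/libexec/openssh/sftp-server", // RHEL-ish
}

const sftpScriptTemplate = `p=
for c in %s; do
  if [ -x "$c" ]; then p=$c; break; fi
done
[ -n "$p" ] || p=$(command -v sftp-server)
if [ -z "$p" ]; then
  echo "sftp-server not found in sandbox VM; install openssh-sftp-server, or fall back to 'scp -O'" >&2
  exit 127
fi
exec "$p"
`

// sftpScript renders the probe script. The explicit empty check avoids the
// silent exit 0 that a bare exec with no command would produce.
func sftpScript() string {
	return fmt.Sprintf(sftpScriptTemplate, strings.Join(sftpServerCandidates, " "))
}

// SftpSubsystem returns the command and arguments to run inside the VM.
func SftpSubsystem() (cmd string, args []string) {
	return "/bin/sh", []string{"-c", sftpScript()}
}

func main() {
	cmd, args := SftpSubsystem()
	fmt.Println(cmd, args[0])
	fmt.Print(args[1])
}
```

Keeping the candidate list in one slice means a new image layout only needs one added path rather than a change to the script itself.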

Same shellutil helper covers both code paths (sshgateway + the
dashboard's WebSocket terminal), so behaviour stays consistent.
@DorianZheng DorianZheng merged commit f052b69 into main May 14, 2026
22 checks passed
@DorianZheng DorianZheng deleted the fix/runner-sshgateway-use-startexecution branch May 14, 2026 13:36