fix(runner): SSH gateway uses BoxLite exec (ssh -p 2222 back online) #524
Merged
apps/runner/pkg/sshgateway/service.go connectToSandbox did
ssh.Dial("tcp", "<sandbox-uuid>:22220"), relying on Docker's userland
DNS to resolve sandbox UUIDs to per-container IPs. Commit d771729
("feat(runner): replace Docker with BoxLite VM backend") removed
Docker without replacing that resolver, so post-upgrade every public
SSH session to a sandbox dies at the dial with
"lookup <uuid> on 127.0.0.53:53: server misbehaving"
from systemd-resolved (no UUID->IP host DNS exists on the Runner EC2).
Replace the SSH-to-SSH bridge with an SSH-to-StartExecution bridge,
exactly the way apps/runner/pkg/api/controllers/proxy.go:114 already
does for the dashboard WebSocket terminal. Both code paths now use:
s.boxlite.StartExecution(ctx, sandboxId, cmd, args, stdout, stderr, tty)
which routes through libkrun vsock - no IP, no DNS, no Docker.
Channel-request handling:
- pty-req -> tty=true; remember rows/cols; ResizeTTY on start
- env -> accept silently
- shell -> StartExecution("/bin/bash")
- exec -> StartExecution("/bin/sh", "-c", payload)
- window-change -> exec.ResizeTTY
- signal -> best-effort no-op (Ctrl-C still works via pty)
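The request table above can be read as a small state machine: `pty-req` and `window-change` only update session state, while `shell` and `exec` decide what to start. The following is a minimal sketch of that mapping using only the standard library. The `session` and `execPlan` types, the `handle` method, and the refusal of unknown request types are hypothetical illustrations, not the runner's actual code; a real gateway would call `StartExecution` and `ResizeTTY` where the comments indicate.

```go
package main

import "fmt"

// execPlan is a hypothetical description of the execution the gateway
// would start in the sandbox; it is not a type from the runner.
type execPlan struct {
	Cmd  string
	Args []string
	TTY  bool
}

// session holds per-channel state accumulated across requests.
type session struct {
	tty        bool
	rows, cols uint32
}

// handle maps one SSH channel request onto the table in the commit message.
// It returns whether the request is accepted and, for shell/exec, what to
// start. command matters only for "exec"; rows/cols only for "pty-req" and
// "window-change".
func (s *session) handle(reqType, command string, rows, cols uint32) (bool, *execPlan) {
	switch reqType {
	case "pty-req":
		s.tty, s.rows, s.cols = true, rows, cols // remembered until start
		return true, nil
	case "env":
		return true, nil // accepted silently, value ignored
	case "shell":
		return true, &execPlan{Cmd: "/bin/bash", TTY: s.tty}
	case "exec":
		return true, &execPlan{Cmd: "/bin/sh", Args: []string{"-c", command}, TTY: s.tty}
	case "window-change":
		s.rows, s.cols = rows, cols // a real gateway would call ResizeTTY here
		return true, nil
	case "signal":
		return true, nil // best-effort no-op
	default:
		return false, nil // assumption: unknown requests are refused
	}
}

func main() {
	s := &session{}
	s.handle("pty-req", "", 24, 80)
	_, plan := s.handle("exec", "ls /", 0, 0)
	fmt.Printf("%s %q tty=%v\n", plan.Cmd, plan.Args, plan.TTY)
}
```

Keeping the decision separate from the side effects makes the mapping easy to unit-test without a sandbox.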
Lifecycle:
- exec.Wait blocks teardown so exit-status SSH request lands before
channel close
- exec.Stdin closes on SSH peer write-side close so commands see EOF
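The ordering constraint in the lifecycle notes (wait, then report exit status, then close) is sketched below. The `execution` and `peer` interfaces and the fakes are hypothetical stand-ins for the BoxLite exec handle and the SSH channel; with `golang.org/x/crypto/ssh`, the status would typically be sent through the channel's `SendRequest`. The payload layout, a single big-endian uint32, follows RFC 4254 §6.10.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// execution and peer are hypothetical stand-ins for the BoxLite exec
// handle and the SSH channel; the real types are not shown in this PR.
type execution interface {
	Wait() uint32 // blocks until the process exits, returns its exit code
}

type peer interface {
	SendRequest(name string, payload []byte)
	Close()
}

// exitStatusPayload encodes the exit code as the single big-endian
// uint32 that RFC 4254 §6.10 defines for "exit-status".
func exitStatusPayload(code uint32) []byte {
	b := make([]byte, 4)
	binary.BigEndian.PutUint32(b, code)
	return b
}

// finish waits for the process, reports its status, and only then
// closes the channel, so the client never sees a close without a status.
func finish(e execution, p peer) {
	code := e.Wait()
	p.SendRequest("exit-status", exitStatusPayload(code))
	p.Close()
}

// fakeExec and recorder exist only to show the event order.
type fakeExec struct{ code uint32 }

func (f fakeExec) Wait() uint32 { return f.code }

type recorder struct{ events []string }

func (r *recorder) SendRequest(name string, payload []byte) {
	r.events = append(r.events, fmt.Sprintf("%s=%d", name, binary.BigEndian.Uint32(payload)))
}

func (r *recorder) Close() { r.events = append(r.events, "close") }

func main() {
	r := &recorder{}
	finish(fakeExec{code: 127}, r)
	fmt.Println(r.events) // prints "[exit-status=127 close]"
}
```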
Drops: connectToSandbox, getSandboxDetails, SandboxDetails struct,
hardcoded ssh.Password("sandbox-ssh"), InsecureIgnoreHostKey.
May-13 logs show this same shape ("Starting exec in sandbox" /
"Exec completed" with component=ssh_gateway_service). May-14 main
regressed to the dial path; this restores the correct architecture.
All three entry points that drop the user into a shell inside a sandbox
VM now share one launcher. Previously they each hardcoded a shell, with
two failure modes:
- apps/runner/pkg/api/controllers/proxy.go:114 (dashboard / iframe
terminal) hardcoded /bin/sh -- correct for the default alpine
snapshot but no bash for users who'd prefer it.
- apps/runner/pkg/sshgateway/service.go (public SSH gateway, fixed
in baca7b6 to use StartExecution) defaulted to /bin/bash, which
broke on the default snapshot because bash is not installed.
Introduce apps/runner/pkg/shellutil/launcher.go with one helper:
func DefaultInteractiveShell() (cmd string, args []string)
Returns: /bin/sh, ["-c", "exec $(command -v bash || command -v ash || command -v sh) -l"]
Rationale (kubectl exec convention, per Kubernetes docs):
- /bin/sh is POSIX-required -> launcher process itself always starts.
- command -v is POSIX, works on busybox/alpine and full distros.
- exec replaces the launcher sh -> no extra PID; chosen shell is the
session's pid 1.
- -l makes it a login shell -> ~/.profile is sourced, PWD=$HOME,
PATH/HOME exported. Matches what `ssh user@host` users expect.
- Tries bash first (preferred), falls back to ash (alpine), then sh
(POSIX guarantee).
Wired into:
- apps/runner/pkg/api/controllers/proxy.go:handleWebSocketTerminal
- apps/runner/pkg/sshgateway/service.go (default for `shell` requests;
`exec` requests still take user-supplied commands as before)
The exec request path keeps /bin/sh -c <payload> unchanged: when the
user explicitly types `ssh host "cmd args"`, OpenSSH-canonical behaviour
is to run it under sh -c, and there is no shell-preference ambiguity to
resolve in that case.
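For reference, the helper described in this commit message could look roughly like the sketch below. The function name, return shape, and script string are taken from the message; the package layout and the `main` function are only there to make the example runnable and are not part of the PR.

```go
package main

import (
	"fmt"
	"strings"
)

// DefaultInteractiveShell returns a POSIX sh launcher that replaces
// itself (via exec) with the first shell found among bash, ash, and sh,
// started as a login shell (-l).
func DefaultInteractiveShell() (cmd string, args []string) {
	const script = "exec $(command -v bash || command -v ash || command -v sh) -l"
	return "/bin/sh", []string{"-c", script}
}

func main() {
	cmd, args := DefaultInteractiveShell()
	fmt.Println(cmd, strings.Join(args, " "))
}
```

Because the script is passed as a single `-c` argument, no extra quoting is needed on the Go side; the `||` chain is evaluated by `/bin/sh` inside the sandbox.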
OpenSSH 9.0+ scp defaults to the SFTP subsystem (RFC 4254 §6.5) instead
of the legacy `exec scp -t` protocol. Without "subsystem" handling, the
runner replies false and the client sees:
subsystem request failed on channel 0
scp: Connection closed
Add a case for "subsystem":
- only "sftp" is supported (matches the OpenSSH default)
- spawn sftp-server inside the VM via the same StartExecution path
- probe install locations in order (alpine /usr/lib/ssh, debian-ish
/usr/lib/openssh, RHEL-ish /usr/libexec, plus PATH) so unusual
layouts still work without per-image hardcoding
- no TTY for binary-protocol subsystems
Workaround if the VM ships no sftp-server: `scp -O -P 2222 ...` falls
back to the legacy protocol over `exec`, which has worked since the
baca7b6 rewrite.
Verified shell path still lands at boxlite:/# via the same SSH host.
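A "subsystem" channel request carries the subsystem name as one SSH string (a big-endian uint32 length followed by the bytes), per RFC 4254 §6.5. The sketch below shows one way to decode that payload and accept only "sftp". The function names are hypothetical; code built on `golang.org/x/crypto/ssh` would more likely decode the payload with that package's `Unmarshal`.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// parseSubsystemName decodes the payload of a "subsystem" channel
// request: a big-endian uint32 length followed by that many bytes.
func parseSubsystemName(payload []byte) (string, error) {
	if len(payload) < 4 {
		return "", errors.New("subsystem payload too short")
	}
	n := binary.BigEndian.Uint32(payload[:4])
	if uint64(n) != uint64(len(payload)-4) {
		return "", errors.New("subsystem payload length mismatch")
	}
	return string(payload[4:]), nil
}

// acceptSubsystem reports whether the gateway should reply true.
// Only "sftp" is supported, matching the commit message.
func acceptSubsystem(payload []byte) bool {
	name, err := parseSubsystemName(payload)
	return err == nil && name == "sftp"
}

// sshString builds the wire form of an SSH string, for the demo below.
func sshString(s string) []byte {
	b := make([]byte, 4, 4+len(s))
	binary.BigEndian.PutUint32(b, uint32(len(s)))
	return append(b, s...)
}

func main() {
	for _, name := range []string{"sftp", "netconf"} {
		fmt.Printf("%s accepted=%v\n", name, acceptSubsystem(sshString(name)))
	}
}
```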
Two launcher bugs surfaced by interactive testing:
1. Shell session landed at / instead of $HOME. OpenSSH-canonical
behaviour is chdir(pw_dir) before exec'ing the user's shell -
that's what makes `ssh user@host` drop you at ~. The -l flag
sources profile but doesn't itself cd. Add `cd "${HOME:-/root}"`
to the launcher so the SSH and dashboard-iframe terminal sessions
both land at the user's home (mirrors what sshd does internally).
2. SFTP subsystem silently exited 0 when no sftp-server binary was
found in the VM. `exec $(empty)` is a POSIX no-op that returns 0,
so the scp client saw a clean EOF and reported only "Connection
closed" with no error message. Move the launcher logic into
shellutil.SftpSubsystem and explicitly check the resolved path:
if empty, write a clear stderr message ("sftp-server not found in
sandbox VM; install openssh-sftp-server, or fall back to 'scp -O'")
and exit 127. The client now sees that message via the SSH stderr
stream, and users can act on it.
Same shellutil helper covers both code paths (sshgateway + the
dashboard's WebSocket terminal), so behaviour stays consistent.
Summary
Commit `d7717290` ("replace Docker with BoxLite VM backend") removed Docker without replacing the userland DNS the SSH gateway depended on. `apps/runner/pkg/sshgateway/service.go::connectToSandbox` was still doing `ssh.Dial("tcp", "<sandbox-uuid>:22220")`, relying on Docker-era container-name DNS. Post-upgrade every `ssh -p 2222 <token>@ssh.dev.boxlite.ai` failed with `lookup <uuid> on 127.0.0.53:53: server misbehaving`; the user saw `Connection closed`.

Root cause
The original SSH-to-SSH bridge assumed a daemon SSH server listening inside each sandbox at port 22220, reachable by container-name DNS. With Docker gone there's no DNS, no per-container IP, and no in-VM SSH daemon, only libkrun's vsock. The dashboard WebSocket terminal at `controllers/proxy.go:114` already routes through libkrun via `boxlite.Client.StartExecution`; the SSH gateway was the only caller still on the dead path.

Fix
Four commits, smallest unit each:
- `baca7b62`: replace dial-by-UUID with `StartExecution` bridge. SSH channel requests (`pty-req` / `env` / `shell` / `exec` / `window-change` / `signal`) map onto SDK calls. Drops `connectToSandbox`, `getSandboxDetails`, `ssh.Password("sandbox-ssh")`, `InsecureIgnoreHostKey`.
- `c42f2e7d`: new `apps/runner/pkg/shellutil/launcher.go` with `DefaultInteractiveShell()` returning `/bin/sh -c 'exec $(command -v bash || command -v ash || command -v sh) -l'`. Shared by both `sshgateway/service.go` and `controllers/proxy.go::handleWebSocketTerminal` so dashboard terminal + iframe terminal + SSH all land on the same shell. Follows kubectl exec convention.
- `b2e722be`: accept `subsystem sftp` requests (RFC 4254 §6.5; OpenSSH 9.0+ scp defaults to SFTP).
- `bada4280`: `cd "${HOME:-/root}"` before exec (matches OpenSSH `chdir(pw_dir)` so users land at `~`, not `/`); explicit error when `sftp-server` binary is missing in the sandbox image.

Test plan
- `ssh -p 2222 <token>@ssh.dev.boxlite.ai` lands at `boxlite:~#` prompt (verified post-deploy).
- `ssh user@host "ls /"` one-shot exec returns content and exits cleanly.
- `scp -P 2222 file <token>@ssh.dev.boxlite.ai:/root/`: known limitation, see below.

Known limitations / follow-ups
- sftp-server: shows `scp: Connection closed`. OpenSSH scp discards SSH channel stderr (extended-data type 1) in SFTP subsystem mode, so the fail-loud stderr from `bada4280` does run but the user never sees it. Two workarounds: (a) install `openssh-sftp-server` in the sandbox image (`apk add openssh-sftp-server` on alpine), (b) `scp -O -P 2222 …` falls back to legacy SCP protocol over `exec`. Cleaner server-side fix is a pre-flight probe + `Reply(false, nil)` so scp prints `subsystem request failed`; deferred to a follow-up PR.
- -L/-R/Unix sockets). Daytona supports this via gliderlabs/ssh callbacks; we deliberately ship a minimal subset here.