Summary
Add an SSH-based transport to all-smi view so operators can monitor a set of remote hosts without first installing and starting all-smi api on each target. Primary path: exec all-smi snapshot --format json remotely if the binary is present. Fallback path: a small shim that parses nvidia-smi --query-gpu=... --format=csv,noheader,nounits (or rocm-smi) into the same SnapshotFrame schema, so brand-new NVIDIA/AMD boxes are visible out of the box.
Motivation
The existing view mode assumes every remote target is already running all-smi api. This is great for a provisioned fleet but awful for ad-hoc use — the "I want to check 3 DGX boxes right now" flow. A zero-install agentless path dramatically lowers the barrier:
SRE uses SSH keys they already have to ssh user@dgx-01.
all-smi view --ssh user@dgx-01,user@dgx-02 just works.
If those boxes have all-smi installed, great — we use the native JSON. If not, we fall back to parsing nvidia-smi / rocm-smi output and still render the same TUI.
This matches the UX of tools like top, htop, ctop that ride SSH without infrastructure.
Current state
view supports --hosts and --hostfile for HTTP targets only (each must run all-smi api).
No SSH transport.
src/view/data_collection/strategy.rs has LocalStrategy + RemoteStrategy — a clean place to add SshStrategy.
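Proposed design
CLI
--ssh-hostfile file format: one user@host[:port] per line, # comments allowed.
Transport selection
Per host, on initial connection:
Run all-smi --version (1-second timeout).
If it is present, use all-smi snapshot --format json --include gpu,cpu,memory,chassis.
Otherwise, try the --ssh-fallback list in order:
nvidia-smi shim: nvidia-smi --query-gpu=index,uuid,name,driver_version,utilization.gpu,memory.used,memory.total,temperature.gpu,clocks.current.graphics,power.draw --format=csv,noheader,nounits
rocm-smi shim: rocm-smi --showuse --showmemuse --showtemp --showpower --json (use the JSON output that rocm-smi supports).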
If none available, mark the host as status: unsupported in the TUI.
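The selection order above could be captured in a small pure function. This is a minimal sketch, not the project's implementation: the ProbeResult struct, the Transport enum, and select_transport are hypothetical names, and the command strings are copied from the list above.

```rust
/// Transport path chosen for one host (variants mirror the TUI chip labels).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Transport {
    Native,
    NvidiaSmi,
    RocmSmi,
    Unsupported,
}

/// What the initial probe learned about a host (hypothetical struct).
#[derive(Debug, Clone, Copy, Default)]
pub struct ProbeResult {
    pub has_all_smi: bool,
    pub has_nvidia_smi: bool,
    pub has_rocm_smi: bool,
}

/// Native first, then the fallbacks in the order the user listed them.
pub fn select_transport(probe: ProbeResult, fallbacks: &[&str]) -> Transport {
    if probe.has_all_smi {
        return Transport::Native;
    }
    for name in fallbacks {
        match *name {
            "nvidia-smi" if probe.has_nvidia_smi => return Transport::NvidiaSmi,
            "rocm-smi" if probe.has_rocm_smi => return Transport::RocmSmi,
            _ => {} // unknown or unavailable fallback: keep looking
        }
    }
    Transport::Unsupported
}

impl Transport {
    /// Remote command to exec on each tick; None when the host is unsupported.
    pub fn command(self) -> Option<&'static str> {
        match self {
            Transport::Native => {
                Some("all-smi snapshot --format json --include gpu,cpu,memory,chassis")
            }
            Transport::NvidiaSmi => Some(
                "nvidia-smi --query-gpu=index,uuid,name,driver_version,utilization.gpu,\
                 memory.used,memory.total,temperature.gpu,clocks.current.graphics,power.draw \
                 --format=csv,noheader,nounits",
            ),
            Transport::RocmSmi => {
                Some("rocm-smi --showuse --showmemuse --showtemp --showpower --json")
            }
            Transport::Unsupported => None,
        }
    }
}
```

Keeping the decision separate from the SSH probing makes the "strategy selection" test case a plain unit test with no network involved.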
Execution
Per tick, exec the chosen command on the kept-alive session; parse stdout into SnapshotFrame. Use the same SnapshotFrame schema as the snapshot subcommand and the /events endpoint.
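On the nvidia-smi shim path, turning stdout into per-GPU records might look like the sketch below. It assumes the ten-field query from the transport list, GPU names without commas, and that unavailable values (for example "[N/A]") should become None. GpuRow is a hypothetical stand-in; the real GpuInfo fields are not shown in this issue.

```rust
/// One parsed CSV row (hypothetical stand-in for the project's GpuInfo).
#[derive(Debug, Clone, PartialEq)]
pub struct GpuRow {
    pub index: u32,
    pub uuid: String,
    pub name: String,
    pub driver_version: String,
    pub utilization_pct: Option<f64>,
    pub memory_used_mib: Option<f64>,
    pub memory_total_mib: Option<f64>,
    pub temperature_c: Option<f64>,
    pub graphics_clock_mhz: Option<f64>,
    pub power_draw_w: Option<f64>,
}

/// Numeric field, or None when nvidia-smi prints a placeholder such as "[N/A]".
fn num(field: &str) -> Option<f64> {
    field.trim().parse().ok()
}

/// Parse `--format=csv,noheader,nounits` output, one GPU per line.
pub fn parse_nvidia_smi_csv(stdout: &str) -> Result<Vec<GpuRow>, String> {
    let mut rows = Vec::new();
    for (i, line) in stdout.lines().enumerate() {
        let line = line.trim();
        if line.is_empty() {
            continue;
        }
        let f: Vec<&str> = line.split(',').map(str::trim).collect();
        if f.len() != 10 {
            return Err(format!("line {}: expected 10 fields, got {}", i + 1, f.len()));
        }
        let index = f[0]
            .parse::<u32>()
            .map_err(|e| format!("line {}: bad index: {}", i + 1, e))?;
        rows.push(GpuRow {
            index,
            uuid: f[1].to_string(),
            name: f[2].to_string(),
            driver_version: f[3].to_string(),
            utilization_pct: num(f[4]),
            memory_used_mib: num(f[5]),
            memory_total_mib: num(f[6]),
            temperature_c: num(f[7]),
            graphics_clock_mhz: num(f[8]),
            power_draw_w: num(f[9]),
        });
    }
    Ok(rows)
}
```

Returning an error on a malformed line, rather than skipping it, lets the caller surface the problem as a per-host state instead of silently showing fewer GPUs.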
Display:
Host tabs show ssh://user@dgx-01 prefix.
A small chip indicates the transport path per host: native (all-smi installed), nvidia-smi (shim), rocm-smi (shim), unsupported.
Connection state: connecting, connected, auth-failed, timeout, disconnected — with a human-readable last-error tooltip.
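The display items above map naturally onto a small per-host status record. The sketch below is illustrative only: HostStatus, tab_label, and tooltip are hypothetical names, and putting the transport chip in square brackets is an assumption about layout, not something this issue specifies.

```rust
/// Per-host connection state, mirroring the states listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConnState {
    Connecting,
    Connected,
    AuthFailed,
    Timeout,
    Disconnected,
}

impl ConnState {
    pub fn label(self) -> &'static str {
        match self {
            ConnState::Connecting => "connecting",
            ConnState::Connected => "connected",
            ConnState::AuthFailed => "auth-failed",
            ConnState::Timeout => "timeout",
            ConnState::Disconnected => "disconnected",
        }
    }
}

/// Hypothetical per-host status record feeding the tab and tooltip.
pub struct HostStatus {
    pub user: String,
    pub host: String,
    /// Chip text: "native", "nvidia-smi", "rocm-smi", or "unsupported".
    pub transport: &'static str,
    pub state: ConnState,
    pub last_error: Option<String>,
}

impl HostStatus {
    /// Tab label with the ssh:// prefix and the transport chip.
    pub fn tab_label(&self) -> String {
        format!("ssh://{}@{} [{}]", self.user, self.host, self.transport)
    }

    /// Tooltip text: the state, plus the last error when there is one.
    pub fn tooltip(&self) -> String {
        match &self.last_error {
            Some(err) => format!("{}: {}", self.state.label(), err),
            None => self.state.label().to_string(),
        }
    }
}
```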
Authentication
Precedence: explicit --ssh-key → SSH agent (SSH_AUTH_SOCK) → ~/.ssh/config identity file → ~/.ssh/id_ed25519 / ~/.ssh/id_rsa. Password auth is not supported in v1 (agent/key only) — document explicitly.
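One way to make the precedence testable is to compute an ordered candidate list up front and let the session manager try each entry in turn. This is a sketch under assumptions: AuthSource, AuthInputs, and auth_candidates are hypothetical names, and the exists callback stands in for a filesystem check so the ordering can be unit-tested without touching ~/.ssh.

```rust
use std::path::{Path, PathBuf};

/// Where a credential comes from, in the order it should be tried.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum AuthSource {
    ExplicitKey(PathBuf),    // --ssh-key
    Agent(PathBuf),          // socket path from SSH_AUTH_SOCK
    ConfigIdentity(PathBuf), // identity file from ~/.ssh/config
    DefaultKey(PathBuf),     // ~/.ssh/id_ed25519, then ~/.ssh/id_rsa
}

/// Inputs gathered by the caller (hypothetical struct).
pub struct AuthInputs<'a> {
    pub ssh_key: Option<&'a Path>,
    pub agent_sock: Option<&'a str>,
    pub config_identity: Option<&'a Path>,
    pub home: &'a Path,
}

/// Build the ordered list of credentials to try. No password entry is ever produced.
pub fn auth_candidates(inp: &AuthInputs, exists: impl Fn(&Path) -> bool) -> Vec<AuthSource> {
    let mut out = Vec::new();
    if let Some(key) = inp.ssh_key {
        out.push(AuthSource::ExplicitKey(key.to_path_buf()));
    }
    // An empty SSH_AUTH_SOCK is treated as "no agent".
    if let Some(sock) = inp.agent_sock.filter(|s| !s.is_empty()) {
        out.push(AuthSource::Agent(PathBuf::from(sock)));
    }
    if let Some(id) = inp.config_identity {
        out.push(AuthSource::ConfigIdentity(id.to_path_buf()));
    }
    for name in ["id_ed25519", "id_rsa"] {
        let path = inp.home.join(".ssh").join(name);
        if exists(&path) {
            out.push(AuthSource::DefaultKey(path));
        }
    }
    out
}
```

In production the callback would be something like `|p| p.exists()`, and `agent_sock` would come from reading the SSH_AUTH_SOCK environment variable.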
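--ssh-strict-host-key:
yes (default): refuse unknown hosts (matches OpenSSH StrictHostKeyChecking=yes).
accept-new: accept on first connect, reject if the key changes (matches OpenSSH accept-new).
no: accept any (emit a prominent warning in the TUI).
Library choice
Use russh (pure Rust, maintained, supports async, proxy_jump) rather than ssh2 (libssh2 bindings; C dependency, painful musl builds).
Implementation plan
Files to add / modify:
Cargo.toml: russh = "0.54" (or the latest compatible; validate musl compatibility), russh-keys.
src/cli.rs: ViewArgs.ssh, ViewArgs.ssh_hostfile, ssh_key, ssh_config, ssh_strict_host_key, ssh_fallback, ssh_known_hosts, ssh_timeout_secs.
New src/network/ssh_client.rs: russh-backed session manager (connect, keep-alive, exec, close).
New src/network/nvidia_smi_shim.rs: parse nvidia-smi CSV into Vec<GpuInfo>; fill chassis_info/cpu_info as best-effort (probably leave empty on the shim path; document that CPU/chassis may be absent).
New src/network/rocm_smi_shim.rs: parse rocm-smi JSON similarly.
src/view/data_collection/ssh_strategy.rs: new strategy parallel to LocalStrategy/RemoteStrategy. Implements the same trait that feeds snapshots into the existing render pipeline.
src/view/data_collection/strategy.rs: register SshStrategy.
src/view/runner.rs: route --ssh targets through SshStrategy.
src/ui/renderers/dashboard.rs: render transport chip + connection state.
src/ui/help.rs: mention SSH usage.
examples/hosts-ssh.txt: example SSH hostfile.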
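For the hostfile, a parser along these lines would cover the user@host[:port] format with # comments. It is a sketch with assumptions called out: SshTarget and parse_hostfile are hypothetical names, trailing inline comments are stripped as well as whole-line ones, the port defaults to 22, and bracketed IPv6 literals are not handled.

```rust
/// One SSH target from the hostfile (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SshTarget {
    pub user: String,
    pub host: String,
    pub port: u16,
}

/// Parse a hostfile: one `user@host[:port]` per line, `#` starts a comment.
pub fn parse_hostfile(text: &str) -> Result<Vec<SshTarget>, String> {
    let mut targets = Vec::new();
    for (i, raw) in text.lines().enumerate() {
        // Drop everything after the first '#', then surrounding whitespace.
        let line = raw.split('#').next().unwrap_or("").trim();
        if line.is_empty() {
            continue;
        }
        let (user, rest) = line
            .split_once('@')
            .ok_or_else(|| format!("line {}: expected user@host[:port]", i + 1))?;
        let (host, port) = match rest.rsplit_once(':') {
            Some((h, p)) => {
                let port = p
                    .parse::<u16>()
                    .map_err(|e| format!("line {}: bad port {:?}: {}", i + 1, p, e))?;
                (h, port)
            }
            None => (rest, 22), // assumption: default SSH port
        };
        if user.is_empty() || host.is_empty() {
            return Err(format!("line {}: empty user or host", i + 1));
        }
        targets.push(SshTarget {
            user: user.to_string(),
            host: host.to_string(),
            port,
        });
    }
    Ok(targets)
}
```

A matching examples/hosts-ssh.txt could then be as small as a comment line followed by `user@dgx-01` and `user@dgx-02:2222`.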
Acceptance criteria
all-smi view --ssh user@localhost connects via SSH (when an sshd is running locally) and renders the local hardware through the SSH transport, using all-smi snapshot when installed.
With --ssh-fallback nvidia-smi, connecting to a host without all-smi but with nvidia-smi installed renders GPU data via the shim.
Tab labels clearly show the ssh:// prefix; transport chip shows native or nvidia-smi/rocm-smi accordingly.
Connection errors (timeout, auth fail, host key mismatch) surface in the TUI as a per-host state, not a crash.
--ssh-strict-host-key=accept-new behaves as documented (accepts on first connect, persists fingerprint, rejects changes).
--ssh-hostfile parses correctly, including comments and port suffixes.
cargo test covers: hostfile parsing, nvidia-smi CSV parsing (golden file), rocm-smi JSON parsing, strategy selection.
Passwords are never logged, never stored. Only key-based auth supported.
Rate-limit: opening > 50 SSH sessions staggers connection establishment (similar to the existing connection-staggering pattern for HTTP).
Works on musl static builds (validate with the x86_64-unknown-linux-musl target in CI).
README gains an "Agentless SSH mode" section with a one-liner example.
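The host-key criterion above is easiest to verify if the decision is a pure function of the policy, the recorded fingerprint, and the presented one. The sketch below assumes fingerprints are compared as strings and that the no mode always warns. StrictHostKey, HostKeyDecision, and decide are hypothetical names.

```rust
/// Value of --ssh-strict-host-key.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StrictHostKey {
    Yes,
    AcceptNew,
    No,
}

/// What the session manager should do with a presented host key.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HostKeyDecision {
    Accept,
    /// First contact under accept-new: accept and write the fingerprint to known_hosts.
    AcceptAndPersist,
    /// Mode `no`: accept, but surface a prominent warning in the TUI.
    AcceptWithWarning,
    Reject,
}

/// `known` is the fingerprint recorded for this host, if any;
/// `presented` is the fingerprint the server just offered.
pub fn decide(policy: StrictHostKey, known: Option<&str>, presented: &str) -> HostKeyDecision {
    match (policy, known) {
        (StrictHostKey::No, _) => HostKeyDecision::AcceptWithWarning,
        (_, Some(k)) if k == presented => HostKeyDecision::Accept,
        (_, Some(_)) => HostKeyDecision::Reject, // recorded key differs: key changed
        (StrictHostKey::AcceptNew, None) => HostKeyDecision::AcceptAndPersist,
        (StrictHostKey::Yes, None) => HostKeyDecision::Reject, // unknown host
    }
}
```

A Reject result would map to a per-host error state in the TUI rather than aborting the whole view.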
Edge cases & non-goals
macOS hosts: all-smi local requires sudo on macOS. SSH cannot supply a password non-interactively — document that macOS targets via SSH require passwordless sudo configured for the nvidia/iop readers, or they'll return limited data. Do not attempt to escalate.
Hosts behind a bastion: document ~/.ssh/config ProxyJump support (russh supports it).
Known hosts: default to ~/.ssh/known_hosts, with --ssh-known-hosts for a custom file. Do not isolate from the system OpenSSH state; follow the standard OpenSSH known_hosts location so users don't re-accept fingerprints.
SIGINT during connection: clean up in-flight sessions; don't leak file descriptors.
Large cluster (100+ hosts): use a bounded semaphore to limit concurrent SSH connects; expose --ssh-concurrency 32 flag (default 32).
Non-goal: password auth. Too many footguns; too hard to do safely in a TUI. Key + agent only.
Non-goal: Windows SSH targets as a v1 priority. Should work in theory (russh is cross-platform) but add a caveat in docs.
Non-goal: automatic installation of all-smi on remote (e.g., scp + execute). Out of scope.
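The bounded-semaphore idea from the large-cluster item can be illustrated without any async runtime. The sketch below deliberately swaps in OS threads and a Condvar-based semaphore from the standard library. A russh-based implementation would more likely use the async runtime's own semaphore. Semaphore, Permit, and connect_all are hypothetical names, and connect is a placeholder for the real SSH handshake.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Minimal counting semaphore built on Mutex + Condvar.
pub struct Semaphore {
    permits: Mutex<usize>,
    cv: Condvar,
}

/// RAII guard: the permit goes back to the pool when this is dropped.
pub struct Permit<'a>(&'a Semaphore);

impl Semaphore {
    pub fn new(n: usize) -> Self {
        Semaphore { permits: Mutex::new(n), cv: Condvar::new() }
    }

    /// Block until a permit is free, then take it.
    pub fn acquire(&self) -> Permit<'_> {
        let mut free = self.permits.lock().unwrap();
        while *free == 0 {
            free = self.cv.wait(free).unwrap();
        }
        *free -= 1;
        Permit(self)
    }
}

impl<'a> Drop for Permit<'a> {
    fn drop(&mut self) {
        *self.0.permits.lock().unwrap() += 1;
        self.0.cv.notify_one();
    }
}

/// Run `connect` for every host with at most `limit` in flight at once.
/// Results come back in the same order as `hosts`.
pub fn connect_all<T, F>(hosts: Vec<String>, limit: usize, connect: F) -> Vec<T>
where
    T: Send + 'static,
    F: Fn(&str) -> T + Send + Sync + 'static,
{
    let sem = Arc::new(Semaphore::new(limit.max(1)));
    let connect = Arc::new(connect);
    let handles: Vec<_> = hosts
        .into_iter()
        .map(|host| {
            let sem = Arc::clone(&sem);
            let connect = Arc::clone(&connect);
            thread::spawn(move || {
                let _permit = sem.acquire(); // released when the thread finishes
                (*connect)(&host)
            })
        })
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().expect("connect thread panicked"))
        .collect()
}
```

With the proposed default, the call would be `connect_all(hosts, 32, ...)`. Staggering connection start times, as the rate-limit criterion asks, would be a separate delay on top of this cap.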
Soft dependency
The snapshot subcommand is the primary path; this feature reaches full value once that subcommand is present. Until then, fall back to the nvidia-smi / rocm-smi shims only.