feat(view): agentless SSH transport ('view --ssh user@host') with nvidia-smi fallback shim #194

Description

@inureyes

Summary

Add an SSH-based transport to all-smi view so operators can monitor a set of remote hosts without first installing and starting all-smi api on each target. Primary path: exec all-smi snapshot --format json remotely if the binary is present. Fallback path: a small shim that parses nvidia-smi --query-gpu=... --format=csv,noheader,nounits (or rocm-smi) into the same SnapshotFrame schema, so brand-new NVIDIA/AMD boxes are visible out of the box.

Motivation

The existing view mode assumes every remote target is already running all-smi api. This is great for a provisioned fleet but awful for ad-hoc use — the "I want to check 3 DGX boxes right now" flow. A zero-install agentless path dramatically lowers the barrier:

  • SRE uses SSH keys they already have to ssh user@dgx-01.
  • all-smi view --ssh user@dgx-01,user@dgx-02 just works.
  • If those boxes have all-smi installed, great — we use the native JSON. If not, we fall back to parsing nvidia-smi / rocm-smi output and still render the same TUI.

This matches the UX of tools like top, htop, ctop that ride SSH without infrastructure.

Current state

  • view supports --hosts and --hostfile for HTTP targets only (each must run all-smi api).
  • No SSH transport.
  • src/view/data_collection/strategy.rs has LocalStrategy + RemoteStrategy — a clean place to add SshStrategy.

Proposed design

CLI

all-smi view --ssh user@host[:port][,user@host2[:port2],...]
             [--ssh-hostfile path.txt]
             [--ssh-key <path>]
             [--ssh-config <path>]
             [--ssh-strict-host-key=<yes|accept-new|no>]
             [--ssh-timeout-secs 10]
             [--ssh-fallback nvidia-smi,rocm-smi,none]
             [--ssh-known-hosts <path>]
             [--interval <secs>]

--ssh-hostfile file format: one user@host[:port] per line, # comments allowed.
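The hostfile format above could be parsed roughly as follows. This is a minimal sketch, not the project's implementation: `SshTarget`, `parse_target`, and `parse_hostfile` are hypothetical names, the default port of 22 is an assumption, and treating a trailing `# ...` on a target line as a comment is also an assumption (the format only says comments are allowed). Bracketed IPv6 literals are not handled.

```rust
/// One SSH target parsed from `user@host[:port]` (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SshTarget {
    pub user: String,
    pub host: String,
    pub port: u16,
}

/// Parse a single `user@host[:port]` entry. The port defaults to 22 (assumption).
pub fn parse_target(entry: &str) -> Result<SshTarget, String> {
    let (user, rest) = entry
        .split_once('@')
        .ok_or_else(|| format!("missing '@' in {entry:?}"))?;
    if user.is_empty() {
        return Err(format!("empty user in {entry:?}"));
    }
    // Split on the last ':' so the host part keeps any earlier characters intact.
    let (host, port) = match rest.rsplit_once(':') {
        Some((h, p)) => {
            let port = p
                .parse::<u16>()
                .map_err(|e| format!("bad port in {entry:?}: {e}"))?;
            (h, port)
        }
        None => (rest, 22),
    };
    if host.is_empty() {
        return Err(format!("empty host in {entry:?}"));
    }
    Ok(SshTarget {
        user: user.to_string(),
        host: host.to_string(),
        port,
    })
}

/// Parse hostfile contents: one target per line, blank lines skipped,
/// everything after '#' ignored.
pub fn parse_hostfile(contents: &str) -> Result<Vec<SshTarget>, String> {
    let mut targets = Vec::new();
    for (idx, raw) in contents.lines().enumerate() {
        let line = raw.split('#').next().unwrap_or("").trim();
        if line.is_empty() {
            continue;
        }
        let target = parse_target(line).map_err(|e| format!("line {}: {e}", idx + 1))?;
        targets.push(target);
    }
    Ok(targets)
}
```

The same `parse_target` helper could serve the comma-separated `--ssh` value by splitting on `,` first.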

Transport selection

Per host, on initial connection:

  1. SSH-connect, keep the session alive for the lifetime of the view.
  2. Probe: all-smi --version (1-second timeout).
  3. If present and ≥ a minimum version (say 0.22 once snapshot lands): primary path is all-smi snapshot --format json --include gpu,cpu,memory,chassis.
  4. Else, check --ssh-fallback list in order:
    • nvidia-smi shim: nvidia-smi --query-gpu=index,uuid,name,driver_version,utilization.gpu,memory.used,memory.total,temperature.gpu,clocks.current.graphics,power.draw --format=csv,noheader,nounits
    • rocm-smi shim: rocm-smi --showuse --showmemuse --showtemp --showpower --json (use the JSON output that rocm-smi supports).
  5. If none available, mark the host as status: unsupported in the TUI.
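The selection steps above reduce to a small pure function once the probe results are known. The sketch below assumes a hypothetical `Probe` struct filled in by the SSH layer and a `(major, minor)` version tuple; how `all-smi --version` output is parsed is left out because its format is not specified here. The `0.22` minimum is the placeholder value from step 3.

```rust
/// Transport path chosen for a host (mirrors the chip labels in the TUI).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Transport {
    Native,
    NvidiaSmi,
    RocmSmi,
    Unsupported,
}

/// One entry of the `--ssh-fallback` list, in user-specified order.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Fallback {
    NvidiaSmi,
    RocmSmi,
}

/// Hypothetical result of probing a host over the open session.
pub struct Probe {
    /// (major, minor) from `all-smi --version`, or None if the probe failed or timed out.
    pub all_smi_version: Option<(u32, u32)>,
    pub has_nvidia_smi: bool,
    pub has_rocm_smi: bool,
}

/// Placeholder minimum version for the native path (assumption from the text).
pub const MIN_NATIVE_VERSION: (u32, u32) = (0, 22);

/// Pick the native path when all-smi is new enough, else the first
/// available fallback in list order, else Unsupported.
pub fn select_transport(probe: &Probe, fallbacks: &[Fallback]) -> Transport {
    if let Some(version) = probe.all_smi_version {
        // Tuples compare lexicographically: major first, then minor.
        if version >= MIN_NATIVE_VERSION {
            return Transport::Native;
        }
    }
    for fallback in fallbacks {
        match fallback {
            Fallback::NvidiaSmi if probe.has_nvidia_smi => return Transport::NvidiaSmi,
            Fallback::RocmSmi if probe.has_rocm_smi => return Transport::RocmSmi,
            _ => {}
        }
    }
    Transport::Unsupported
}
```

Passing an empty slice models `--ssh-fallback none`: a host without a recent all-smi is then reported as unsupported.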

Execution

Per tick, exec the chosen command on the kept-alive session; parse stdout into SnapshotFrame. Use the same SnapshotFrame schema as the snapshot subcommand and the /events endpoint.
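For the nvidia-smi shim path, parsing one tick of stdout might look like the sketch below. `ShimGpu` is a hypothetical intermediate struct, not the real `GpuInfo` or `SnapshotFrame`; the field order follows the `--query-gpu` list in the transport-selection step. The sketch assumes GPU names contain no commas and maps any non-numeric value (for example `[N/A]`) to `None`.

```rust
/// One GPU row from the nvidia-smi shim (hypothetical intermediate type).
#[derive(Debug, Clone, PartialEq)]
pub struct ShimGpu {
    pub index: u32,
    pub uuid: String,
    pub name: String,
    pub driver_version: String,
    pub utilization_pct: Option<f64>,
    pub memory_used_mib: Option<f64>,
    pub memory_total_mib: Option<f64>,
    pub temperature_c: Option<f64>,
    pub graphics_clock_mhz: Option<f64>,
    pub power_draw_w: Option<f64>,
}

/// Numeric field, or None when the driver reports a placeholder such as "[N/A]".
fn num(field: &str) -> Option<f64> {
    field.parse::<f64>().ok()
}

/// Parse `--format=csv,noheader,nounits` output with the ten queried fields.
pub fn parse_nvidia_smi_csv(stdout: &str) -> Result<Vec<ShimGpu>, String> {
    let mut gpus = Vec::new();
    for (n, line) in stdout.lines().enumerate() {
        if line.trim().is_empty() {
            continue;
        }
        // Assumption: no field (including the GPU name) contains a comma.
        let f: Vec<&str> = line.split(',').map(str::trim).collect();
        if f.len() != 10 {
            return Err(format!("line {}: expected 10 fields, got {}", n + 1, f.len()));
        }
        let index = f[0]
            .parse::<u32>()
            .map_err(|e| format!("line {}: bad index: {e}", n + 1))?;
        gpus.push(ShimGpu {
            index,
            uuid: f[1].to_string(),
            name: f[2].to_string(),
            driver_version: f[3].to_string(),
            utilization_pct: num(f[4]),
            memory_used_mib: num(f[5]),
            memory_total_mib: num(f[6]),
            temperature_c: num(f[7]),
            graphics_clock_mhz: num(f[8]),
            power_draw_w: num(f[9]),
        });
    }
    Ok(gpus)
}
```

Keeping the parser a pure function of stdout makes it easy to cover with the golden-file test listed in the acceptance criteria.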

Display:

  • Host tabs show ssh://user@dgx-01 prefix.
  • A small chip indicates the transport path per host: native (all-smi installed), nvidia-smi (shim), rocm-smi (shim), unsupported.
  • Connection state: connecting, connected, auth-failed, timeout, disconnected — with a human-readable last-error tooltip.

Authentication

Precedence: explicit --ssh-key → SSH agent (SSH_AUTH_SOCK) → ~/.ssh/config identity file → ~/.ssh/id_ed25519 / ~/.ssh/id_rsa. Password auth is not supported in v1 (agent/key only) — document explicitly.
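The precedence above can be expressed as an ordered candidate list that the SSH layer walks until one method succeeds. This sketch makes two assumptions: candidates are tried in order rather than only the first being used, and the default key paths are listed without checking that the files exist. `AuthCandidate` and `auth_candidates` are hypothetical names; the caller would pass in `SSH_AUTH_SOCK` and any identity file resolved from `~/.ssh/config`.

```rust
use std::path::{Path, PathBuf};

/// One authentication attempt, in the order it should be tried (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum AuthCandidate {
    KeyFile(PathBuf),
    Agent { socket: String },
}

/// Build the ordered list: explicit key, agent, config identity, default keys.
pub fn auth_candidates(
    explicit_key: Option<&Path>,
    ssh_auth_sock: Option<&str>,
    config_identity: Option<&Path>,
    home: &Path,
) -> Vec<AuthCandidate> {
    let mut out = Vec::new();
    if let Some(key) = explicit_key {
        out.push(AuthCandidate::KeyFile(key.to_path_buf()));
    }
    // An empty SSH_AUTH_SOCK is treated the same as an unset one.
    if let Some(sock) = ssh_auth_sock.filter(|s| !s.is_empty()) {
        out.push(AuthCandidate::Agent { socket: sock.to_string() });
    }
    if let Some(key) = config_identity {
        out.push(AuthCandidate::KeyFile(key.to_path_buf()));
    }
    // Default keys last; existence is not checked here (assumption).
    for name in ["id_ed25519", "id_rsa"] {
        out.push(AuthCandidate::KeyFile(home.join(".ssh").join(name)));
    }
    out
}
```

No password variant exists in the enum, which keeps the "agent/key only" rule enforced by the type rather than by convention.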

--ssh-strict-host-key:

  • yes (default): refuse unknown hosts (matches OpenSSH StrictHostKeyChecking=yes).
  • accept-new: accept on first connect, reject if the key changes (matches OpenSSH StrictHostKeyChecking=accept-new).
  • no: accept any (emit a prominent warning in the TUI).
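The three modes above can be captured as a decision table over the policy and the known_hosts lookup result. The names below (`StrictHostKey`, `KnownHostStatus`, `HostKeyDecision`, `decide`) are hypothetical, and wiring the decision into the SSH library's host-key callback is not shown.

```rust
use std::str::FromStr;

/// Value of `--ssh-strict-host-key`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StrictHostKey {
    Yes,
    AcceptNew,
    No,
}

impl FromStr for StrictHostKey {
    type Err = String;
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "yes" => Ok(Self::Yes),
            "accept-new" => Ok(Self::AcceptNew),
            "no" => Ok(Self::No),
            other => Err(format!("expected yes|accept-new|no, got {other:?}")),
        }
    }
}

/// Result of looking the presented host key up in known_hosts.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KnownHostStatus {
    Match,
    Unknown,
    Changed,
}

/// What the transport should do with the presented key.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HostKeyDecision {
    Accept,
    AcceptAndPersist,
    AcceptWithWarning,
    Reject,
}

pub fn decide(policy: StrictHostKey, status: KnownHostStatus) -> HostKeyDecision {
    use HostKeyDecision::*;
    match (policy, status) {
        // `no` accepts anything, always with a visible warning.
        (StrictHostKey::No, _) => AcceptWithWarning,
        (_, KnownHostStatus::Match) => Accept,
        // `accept-new` trusts on first use and records the fingerprint.
        (StrictHostKey::AcceptNew, KnownHostStatus::Unknown) => AcceptAndPersist,
        (StrictHostKey::AcceptNew, KnownHostStatus::Changed) => Reject,
        // `yes` refuses both unknown and changed keys.
        (StrictHostKey::Yes, _) => Reject,
    }
}
```

A `Reject` would surface as a per-host connection state in the TUI rather than aborting the whole view.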

Library choice

Use russh (pure Rust, maintained, supports async, proxy_jump) rather than ssh2 (libssh2 bindings — C dep; musl builds painful).

Implementation plan

Files to add / modify:

  • Cargo.toml: russh = "0.54" (or the latest compatible; validate musl compatibility), russh-keys.
  • src/cli.rs: ViewArgs.ssh, ViewArgs.ssh_hostfile, ssh_key, ssh_config, ssh_strict_host_key, ssh_fallback, ssh_known_hosts, ssh_timeout_secs.
  • New src/network/ssh_client.rs — russh-backed session manager: connect, keep-alive, exec, close.
  • New src/network/nvidia_smi_shim.rs — parse nvidia-smi CSV into Vec<GpuInfo>; fill chassis_info/cpu_info as best-effort (probably leave empty on shim path; document that CPU/chassis may be absent).
  • New src/network/rocm_smi_shim.rs — parse rocm-smi JSON similarly.
  • src/view/data_collection/ssh_strategy.rs — new strategy parallel to LocalStrategy/RemoteStrategy. Implements the same trait that feeds snapshots into the existing render pipeline.
  • src/view/data_collection/strategy.rs — register SshStrategy.
  • src/view/runner.rs — route --ssh targets through SshStrategy.
  • src/ui/renderers/dashboard.rs — render transport chip + connection state.
  • src/ui/help.rs — mention SSH usage.
  • examples/hosts-ssh.txt — example SSH hostfile.

Acceptance criteria

  • all-smi view --ssh user@localhost connects via SSH (when an sshd is running locally) and renders the local hardware through the SSH transport, using all-smi snapshot when installed.
  • With --ssh-fallback nvidia-smi, connecting to a host without all-smi but with nvidia-smi installed renders GPU data via the shim.
  • Tab labels clearly show the ssh:// prefix; transport chip shows native or nvidia-smi/rocm-smi accordingly.
  • Connection errors (timeout, auth fail, host key mismatch) surface in the TUI as a per-host state, not a crash.
  • --ssh-strict-host-key=accept-new behaves as documented (accepts on first connect, persists fingerprint, rejects changes).
  • --ssh-hostfile parses correctly, including comments and port suffixes.
  • Passwords are never logged and never stored. Only key and agent auth are supported.
  • Rate-limit: opening > 50 SSH sessions staggers connection establishment (similar to the existing connection-staggering pattern for HTTP).
  • cargo test covers: hostfile parsing, nvidia-smi CSV parsing (golden file), rocm-smi JSON parsing, strategy selection.
  • Works on musl static builds (validate with the x86_64-unknown-linux-musl target in CI).
  • README gains an "Agentless SSH mode" section with a one-liner example.

Edge cases & non-goals

  • macOS hosts: all-smi local requires sudo on macOS. SSH cannot supply a password non-interactively — document that macOS targets via SSH require passwordless sudo configured for the nvidia/iop readers, or they'll return limited data. Do not attempt to escalate.
  • Hosts behind a bastion: document ~/.ssh/config ProxyJump support (russh supports it).
  • Known hosts: default to ~/.ssh/known_hosts, with --ssh-known-hosts for a custom file. Do not isolate from the system OpenSSH state by default; follow the standard OpenSSH known_hosts location so users don't have to re-accept fingerprints.
  • SIGINT during connection: clean up in-flight sessions; don't leak file descriptors.
  • Large cluster (100+ hosts): use a bounded semaphore to limit concurrent SSH connects; expose a --ssh-concurrency flag (default 32).
  • Non-goal: password auth. Too many footguns; too hard to do safely in a TUI. Key + agent only.
  • Non-goal: Windows SSH targets as a v1 priority. Should work in theory (russh is cross-platform) but add a caveat in docs.
  • Non-goal: automatic installation of all-smi on remote (e.g., scp + execute). Out of scope.
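The bounded-semaphore idea from the large-cluster item can be illustrated with the standard library alone. This is a stand-in, not the proposed implementation: russh is async, so the real code would more likely use an async semaphore from its runtime. Here a `Mutex` plus `Condvar` semaphore and scoped threads show the same shape. `Semaphore`, `Permit`, and `connect_all` are hypothetical names, and `connect` is whatever closure opens one session.

```rust
use std::sync::{Condvar, Mutex};

/// Minimal counting semaphore built from std primitives.
pub struct Semaphore {
    permits: Mutex<usize>,
    cv: Condvar,
}

/// RAII guard: the permit is returned when this value is dropped.
pub struct Permit<'a> {
    sem: &'a Semaphore,
}

impl Semaphore {
    pub fn new(permits: usize) -> Self {
        Self { permits: Mutex::new(permits), cv: Condvar::new() }
    }

    /// Block until a permit is free, then take it.
    pub fn acquire(&self) -> Permit<'_> {
        let mut free = self.permits.lock().unwrap();
        while *free == 0 {
            free = self.cv.wait(free).unwrap();
        }
        *free -= 1;
        Permit { sem: self }
    }
}

impl Drop for Permit<'_> {
    fn drop(&mut self) {
        *self.sem.permits.lock().unwrap() += 1;
        self.sem.cv.notify_one();
    }
}

/// Run `connect` once per host with at most `limit` calls in flight.
/// Results come back in the same order as `hosts`.
pub fn connect_all<F, T>(hosts: &[String], limit: usize, connect: F) -> Vec<T>
where
    F: Fn(&str) -> T + Sync,
    T: Send,
{
    let sem = Semaphore::new(limit.max(1));
    std::thread::scope(|scope| {
        let handles: Vec<_> = hosts
            .iter()
            .map(|host| {
                let sem = &sem;
                let connect = &connect;
                scope.spawn(move || {
                    // Hold the permit for the duration of the connect attempt.
                    let _permit = sem.acquire();
                    connect(host.as_str())
                })
            })
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("connect worker panicked"))
            .collect()
    })
}
```

One thread per host is acceptable for a sketch but would not scale to very large fleets; an async task per host behind the same permit discipline avoids that cost.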

Soft dependency

  • The snapshot subcommand is the primary path; this feature reaches full value once that subcommand is present. Until then, fall back to the nvidia-smi / rocm-smi shims only.
