feat(view): agentless SSH transport#204
Conversation
Add an SSH-based transport to `all-smi view` so operators can monitor remote hosts without first installing `all-smi api`. Targets are probed once on first connect: native `all-smi snapshot --format json` is preferred, with CSV-parsing `nvidia-smi` and JSON-parsing `rocm-smi` shims as fallbacks. Unsupported hosts remain tab-visible with a status chip so operators see what changed. New CLI surface on `view`: - --ssh user@host[:port][,...] - --ssh-hostfile <path> (one target per line, `#` comments allowed) - --ssh-key / --ssh-config / --ssh-known-hosts - --ssh-strict-host-key yes|accept-new|no - --ssh-timeout-secs (default 10) - --ssh-fallback nvidia-smi,rocm-smi,none (default: both) - --ssh-concurrency N (default 32, semaphore-limited) Library: - russh 0.60 (pure Rust, musl-friendly) for the SSH client - New modules: ssh_client / ssh_host_key / ssh_target / ssh_decision / ssh_transport / nvidia_smi_shim / rocm_smi_shim under src/network/ - New SshStrategy under src/view/data_collection/ parallel to the existing Local/Remote strategies - ConnectionStatus gains transport_chip + connection_state fields so the TUI can render per-host status chips (`native`, `nvidia-smi`, `rocm-smi`, `unsupported`; `connecting`, `connected`, `auth-failed`, `timeout`, `host-key-rejected`, `disconnected`) Security: - Password auth is never attempted; key / agent auth only. No passwords flow through logs or the CLI. - `--ssh-strict-host-key=no` emits a prominent warning log. - `accept-new` persists new keys to the known_hosts file, rejects subsequent mismatches. Tests: - Unit: hostfile parsing (comments, port suffixes, IPv6 brackets), nvidia-smi CSV golden file, rocm-smi JSON golden file, transport selection decision tree, strict-host-key policy parsing, ConnectionStatus chip classification. - Integration: tests/ssh_transport_integration.rs covers the parser -> decision tree -> connection-status pipeline. Config file: - `[view]` section learns ssh / ssh_hostfile / ssh_key / ssh_config / ssh_strict_host_key / ssh_timeout_secs / ssh_fallback / ssh_known_hosts / ssh_concurrency keys with the documented CLI>env> file>default precedence chain (issue #192). Closes #194
…TF-8 The SSH strategy was caching the russh session across collection ticks for reuse, but never cleared the cache on exec-level transport failures. When the remote sshd dropped the TCP connection between ticks, subsequent ticks would keep reusing the dead session and permanently report the host as failed. Now exec errors (I/O, protocol, timeout) invalidate the cached session so the next tick re-opens a fresh connection. Non-zero exit codes are a local issue (missing binary, wrong sudo) and do NOT invalidate the session. Also fixed a latent panic in the stderr truncation helper: a remote command that emitted multi-byte UTF-8 into stderr could land a byte- index cut in the middle of a codepoint and panic the view loop. The fix walks char_indices so the cut always falls on a valid boundary.
Implementation Review SummaryIntent
Findings Addressed
Remaining Items
Verification
Fix commit: |
C1 (CRITICAL) — known_hosts append no longer follows symlinks. Refuse the write when the path is already a symlink, and pass O_NOFOLLOW on Unix so the kernel itself closes the TOCTOU window between the metadata check and open(). Create with mode 0o600, matching OpenSSH. H1 (HIGH) — SshSession drops the Mutex wrapping russh's client::Handle. The handle is already internally Arc-shared and safe to call concurrently; the external lock was serialising channel_open_session() calls and defeating russh's channel multiplexing. Concurrent probes on the same host now multiplex as intended. H3 (HIGH) — avoid copying stdout/stderr unnecessarily. When the bytes are already valid UTF-8 (the common case for nvidia-smi / rocm-smi output), String::from_utf8 consumes the Vec in place. Only pre-invalid bytes trigger the lossy path. H4 (HIGH) — nvidia-smi CSV parser requires EXACTLY 10 columns, not >= 7. A truncated or augmented row now surfaces a typed error rather than silently misaligning fields (e.g. aliasing temperature values onto power.draw). M4 (MEDIUM) — accept-new keeps a per-process in-memory cache of (host, port, fingerprint). If persisting to known_hosts fails (read-only fs, full disk, symlink refusal), the cache still detects a key change on the next connection in the same process. Persistence failures are logged at error level. M5 (MEDIUM) — known_hosts_lookup parses comma-separated multi-host entries (host1,host2 alg key) and skips hashed-hostname lines (|1|…) rather than misreading them. M6 (MEDIUM) — hostname parsing validates charset: only ASCII alphanumerics plus -, ., :, _ are accepted. Keeps shell-metacharacter- looking inputs (user@;whoami, user@\$(cat /etc/passwd)) out of logs and tab labels. Bracketed IPv6 body goes through the same check. M8 (MEDIUM) — every bare matches!() call in the test suite is now wrapped in assert!(...). Previous forms compiled to a no-op and did not actually fail on mis-matched error variants. Regression tests cover each fix: * symlink-planted known_hosts refuses the append; decoy file unchanged. * 0o600 mode check on fresh known_hosts files. * accept-new with a read-only parent dir still rejects a key change. * multi-host entries match any token; hashed entries are skipped. * 7-, 9-, 11-column CSV rows surface ColumnCount errors. * shell-metacharacter hostnames and invalid IPv6 bodies are rejected.
The README security notes for --ssh mode were missing coverage of the known_hosts write hardening: O_NOFOLLOW refusal of a pre-planted symlink and the in-process memory-cache fallback when persistence fails.
PR FinalizationVerification summaryREADME "Agentless SSH mode" section (lines 559-614): present and complete.
One gap fixed: The security notes section was missing the Help overlay (
Test coverage for hardening (all four requested scenarios):
Lint and format: Test counts:
HEAD: |
Summary
Add an SSH-based transport to
all-smi viewso operators can monitor remote hosts without first installingall-smi apion each target. On first connect the strategy probes each host: nativeall-smi snapshot --format jsonis preferred when the binary is present and at least v0.22; otherwise it falls back to parsingnvidia-smi --query-gpu=... --format=csv,noheader,nounitsorrocm-smi ... --json. Unsupported hosts remain tab-visible with a status chip so operators can immediately see what changed.New
viewCLI flags:--ssh user@host[:port][,...]--ssh-hostfile <path>user@host[:port]per line;#comments allowed.--ssh-key <path>~/.ssh/id_*probe order.--ssh-strict-host-key yes|accept-new|noyes--ssh-timeout-secs <n>10--ssh-fallback nvidia-smi,rocm-smi,noneall-smiis absent.--ssh-known-hosts <path>~/.ssh/known_hosts--ssh-concurrency <n>32Implementation
russh = 0.60(pure Rust, musl-friendly) for the SSH client.src/network/:ssh_client,ssh_host_key,ssh_target,ssh_decision,ssh_transport,nvidia_smi_shim,rocm_smi_shim.SshStrategy(src/view/data_collection/ssh_strategy.rs) is a newDataCollectionStrategyparallel to the existing Local/Remotestrategies. Per tick it snapshots the kept-alive session, runs the
pinned command, parses the output, and publishes a
CollectionDataframe intoAppState.ConnectionStatusgainstransport_chip+connection_statefieldsso the TUI tabs render
native/nvidia-smi/rocm-smi/unsupportedchips andconnecting/connected/auth-failed/timeout/host-key-rejected/disconnectedstates.main.rsroutes--ssh*flags to the newrun_ssh_mode, ahead ofthe existing HTTP
viewpath.[view]TOML section learnsssh,ssh_hostfile,ssh_key,ssh_config,ssh_strict_host_key,ssh_timeout_secs,ssh_fallback,ssh_known_hosts,ssh_concurrencykeys with thedocumented CLI > env > file > default precedence chain.
H) andREADME.mdgain an Agentless SSH section.Security posture:
supported. No password bytes flow through the CLI or logs.
--ssh-strict-host-key=noemits a prominent warning log on everyaccept.
accept-newpersists new keys to the known_hosts file and rejectssubsequent mismatches.
Testing
cargo test --lib— 988 passedcargo test --bin all-smi— 1123 passedcargo test --tests— 11 passed (8 new SSH integration tests)cargo clippy --all-targets -- -D warnings— cleancargo fmt --all -- --check— cleanUnit coverage includes: hostfile parsing (comments, port suffixes,
IPv6 brackets), nvidia-smi CSV golden file, rocm-smi JSON golden
file, transport selection decision tree, strict-host-key policy
parsing, ConnectionStatus chip classification for every error kind.
Integration:
tests/ssh_transport_integration.rsasserts the fullparser -> decision tree -> connection status pipeline.
Test plan
all-smi view --ssh user@localhostagainst a local sshd to validate thenativepath.all-smibut withnvidia-smito validate the CSV shim.rocm-smionly to validate the JSON shim.--ssh-strict-host-key=accept-newwrites to the known hosts file on first connect and refuses a modified key.x86_64-unknown-linux-muslbuild in CI.Closes #194