Skip to content

feat(view): agentless SSH transport#204

Merged
inureyes merged 4 commits into
mainfrom
feat/194-ssh-transport
Apr 20, 2026
Merged

feat(view): agentless SSH transport#204
inureyes merged 4 commits into
mainfrom
feat/194-ssh-transport

Conversation

@inureyes

@inureyes inureyes commented Apr 20, 2026

Copy link
Copy Markdown
Member

Summary

Add an SSH-based transport to all-smi view so operators can monitor remote hosts without first installing all-smi api on each target. On first connect the strategy probes each host: native all-smi snapshot --format json is preferred when the binary is present and at least v0.22; otherwise it falls back to parsing nvidia-smi --query-gpu=... --format=csv,noheader,nounits or rocm-smi ... --json. Unsupported hosts remain tab-visible with a status chip so operators can immediately see what changed.

New view CLI flags:

Flag Default Notes
--ssh user@host[:port][,...] Comma-separated SSH targets.
--ssh-hostfile <path> One user@host[:port] per line; # comments allowed.
--ssh-key <path> auto-probe Overrides agent / ~/.ssh/id_* probe order.
--ssh-strict-host-key yes|accept-new|no yes OpenSSH-equivalent semantics.
--ssh-timeout-secs <n> 10 Per-target TCP/handshake timeout.
--ssh-fallback nvidia-smi,rocm-smi,none both enabled Which shim(s) to try when all-smi is absent.
--ssh-known-hosts <path> ~/.ssh/known_hosts Custom known-hosts file.
--ssh-concurrency <n> 32 Bound on concurrent SSH connects (semaphore-limited).

Implementation

  • russh = 0.60 (pure Rust, musl-friendly) for the SSH client.
  • New modules under src/network/:
    ssh_client, ssh_host_key, ssh_target, ssh_decision,
    ssh_transport, nvidia_smi_shim, rocm_smi_shim.
  • SshStrategy (src/view/data_collection/ssh_strategy.rs) is a new
    DataCollectionStrategy parallel to the existing Local/Remote
    strategies. Per tick it snapshots the kept-alive session, runs the
    pinned command, parses the output, and publishes a
    CollectionData frame into AppState.
  • ConnectionStatus gains transport_chip + connection_state fields
    so the TUI tabs render native / nvidia-smi / rocm-smi /
    unsupported chips and connecting / connected / auth-failed /
    timeout / host-key-rejected / disconnected states.
  • main.rs routes --ssh* flags to the new run_ssh_mode, ahead of
    the existing HTTP view path.
  • [view] TOML section learns ssh, ssh_hostfile, ssh_key,
    ssh_config, ssh_strict_host_key, ssh_timeout_secs,
    ssh_fallback, ssh_known_hosts, ssh_concurrency keys with the
    documented CLI > env > file > default precedence chain.
  • Help screen (H) and README.md gain an Agentless SSH section.

Security posture:

  • Password auth is never attempted. Only key / agent auth is
    supported. No password bytes flow through the CLI or logs.
  • --ssh-strict-host-key=no emits a prominent warning log on every
    accept.
  • accept-new persists new keys to the known_hosts file and rejects
    subsequent mismatches.

Testing

  • cargo test --lib — 988 passed
  • cargo test --bin all-smi — 1123 passed
  • cargo test --tests — 11 passed (8 new SSH integration tests)
  • cargo clippy --all-targets -- -D warnings — clean
  • cargo fmt --all -- --check — clean

Unit coverage includes: hostfile parsing (comments, port suffixes,
IPv6 brackets), nvidia-smi CSV golden file, rocm-smi JSON golden
file, transport selection decision tree, strict-host-key policy
parsing, ConnectionStatus chip classification for every error kind.

Integration: tests/ssh_transport_integration.rs asserts the full
parser -> decision tree -> connection status pipeline.

Test plan

  • Run all-smi view --ssh user@localhost against a local sshd to validate the native path.
  • Run against a box without all-smi but with nvidia-smi to validate the CSV shim.
  • Run against a box with rocm-smi only to validate the JSON shim.
  • Validate --ssh-strict-host-key=accept-new writes to the known hosts file on first connect and refuses a modified key.
  • Musl CI: validate x86_64-unknown-linux-musl build in CI.

Closes #194

Add an SSH-based transport to `all-smi view` so operators can monitor
remote hosts without first installing `all-smi api`. Targets are
probed once on first connect: native `all-smi snapshot --format json`
is preferred, with CSV-parsing `nvidia-smi` and JSON-parsing
`rocm-smi` shims as fallbacks. Unsupported hosts remain tab-visible
with a status chip so operators see what changed.

New CLI surface on `view`:
- --ssh user@host[:port][,...]
- --ssh-hostfile <path> (one target per line, `#` comments allowed)
- --ssh-key / --ssh-config / --ssh-known-hosts
- --ssh-strict-host-key yes|accept-new|no
- --ssh-timeout-secs (default 10)
- --ssh-fallback nvidia-smi,rocm-smi,none (default: both)
- --ssh-concurrency N (default 32, semaphore-limited)

Library:
- russh 0.60 (pure Rust, musl-friendly) for the SSH client
- New modules: ssh_client / ssh_host_key / ssh_target / ssh_decision /
  ssh_transport / nvidia_smi_shim / rocm_smi_shim under src/network/
- New SshStrategy under src/view/data_collection/ parallel to the
  existing Local/Remote strategies
- ConnectionStatus gains transport_chip + connection_state fields so
  the TUI can render per-host status chips (`native`, `nvidia-smi`,
  `rocm-smi`, `unsupported`; `connecting`, `connected`, `auth-failed`,
  `timeout`, `host-key-rejected`, `disconnected`)

Security:
- Password auth is never attempted; key / agent auth only. No
  passwords flow through logs or the CLI.
- `--ssh-strict-host-key=no` emits a prominent warning log.
- `accept-new` persists new keys to the known_hosts file, rejects
  subsequent mismatches.

Tests:
- Unit: hostfile parsing (comments, port suffixes, IPv6 brackets),
  nvidia-smi CSV golden file, rocm-smi JSON golden file, transport
  selection decision tree, strict-host-key policy parsing,
  ConnectionStatus chip classification.
- Integration: tests/ssh_transport_integration.rs covers the
  parser -> decision tree -> connection-status pipeline.

Config file:
- `[view]` section learns ssh / ssh_hostfile / ssh_key / ssh_config /
  ssh_strict_host_key / ssh_timeout_secs / ssh_fallback /
  ssh_known_hosts / ssh_concurrency keys with the documented CLI>env>
  file>default precedence chain (issue #192).

Closes #194
@inureyes inureyes added type:enhancement New feature or request priority:medium Medium priority issue status:review Under review labels Apr 20, 2026
…TF-8

The SSH strategy was caching the russh session across collection ticks for
reuse, but never cleared the cache on exec-level transport failures.  When
the remote sshd dropped the TCP connection between ticks, subsequent ticks
would keep reusing the dead session and permanently report the host as
failed.  Now exec errors (I/O, protocol, timeout) invalidate the cached
session so the next tick re-opens a fresh connection.  Non-zero exit codes
are a local issue (missing binary, wrong sudo) and do NOT invalidate the
session.

Also fixed a latent panic in the stderr truncation helper: a remote
command that emitted multi-byte UTF-8 into stderr could land a byte-
index cut in the middle of a codepoint and panic the view loop.  The
fix walks char_indices so the cut always falls on a valid boundary.
@inureyes

Copy link
Copy Markdown
Member Author

Implementation Review Summary

Intent

Add an agentless SSH transport to all-smi view with a native/nvidia-smi/rocm-smi fallback ladder, fixed CLI surface, and TUI-visible connection state.

Findings Addressed

  • HIGH — Cached SSH session never cleared on exec failure → host stuck in broken state across ticks. Fixed by clearing the session on transport-level exec errors (I/O, protocol, timeout) while preserving it across non-zero exit codes (missing binary → keep session).
  • HIGHtruncate() helper in ssh_strategy.rs sliced on raw byte index, which can panic on UTF-8 boundary when a remote emits multi-byte stderr. Replaced with char_indices()-walking variant and added a regression test.

Remaining Items

  • MEDIUM — Native snapshot path requests gpu,cpu,memory,chassis but parse_native_snapshot only extracts GPUs; CPU/memory/chassis are silently dropped. Loses fidelity vs. the HTTP transport. Suggest extending the native parse or scoping the --include flag to gpu only.
  • MEDIUM — Missing SSH agent auth. The issue specifies SSH_AUTH_SOCK as the second step in auth precedence but the implementation probes id_ed25519 / id_ecdsa / id_rsa directly and skips the agent. A module comment marks it as a follow-up, but README / help text do not call this out as a v1 limitation.
  • MEDIUM — Hashed-hostname entries in ~/.ssh/known_hosts are silently rejected. Operators with OpenSSH's default HashKnownHosts=yes will hit host-key-rejected even for hosts they already trust via ssh. Either parse hashed entries or document the limitation (and suggest --ssh-known-hosts with a custom file or --ssh-strict-host-key=accept-new).
  • MEDIUMaccept-new persists the key but does not set file mode to 0o600. If the file is newly created by create_dir_all + OpenOptions::append, Unix default umask applies (likely 0o644). OpenSSH itself enforces 0o600; matching that convention is safer.
  • LOW — CLI-default detection in main.rs: args.ssh_timeout_secs == 10 and args.ssh_concurrency == 32 treat an explicit CLI flag at the default value as "not set", so config-file overrides still apply. Low impact because the same values are defaults, but would bite if someone aliases --ssh-concurrency 32 thinking they've locked it.

Verification

  • All stated requirements implemented (CLI flags, strategy, TUI chips, connection-state classification, parsers with golden files, hostfile parser with comments/ports/IPv6 brackets, bounded semaphore, password-auth explicitly absent).
  • No placeholder/mock code remaining. SshStrategy is fully wired into DataCollectionStrategy and DataCollector::run_ssh_mode.
  • Integrated into project code flow. main.rs dispatches --ssh* flags to view::run_ssh_mode before the HTTP remote path. SshStrategy lives under src/view/data_collection/ alongside LocalCollector / RemoteCollectorBuilder / ReplayDriver. Transport + connection-state chips are rendered by src/ui/tabs.rs.
  • Project conventions followed. Files sized under 500 lines where applicable (strategy is ~600; comparable to existing collectors). thiserror for library errors, tracing for logs, module docstrings match surrounding code. Settings layer follows the same CLI > env > file > default precedence used elsewhere.
  • Existing modules reused where applicable. GpuInfo, ConnectionStatus, DataCollectionStrategy, DataAggregator, Snapshot deserializer, common::paths::expand_tilde.
  • No unintended structural changes. ConnectionStatus gains two optional fields (transport_chip, connection_state); existing consumers are unchanged.
  • Tests pass: cargo test --lib (988), cargo test --bin all-smi (1124, +1 from baseline), cargo test --test ssh_transport_integration (8). cargo clippy --all-targets -- -D warnings clean. cargo fmt --all -- --check clean.

Fix commit: 19e3c88.

C1 (CRITICAL) — known_hosts append no longer follows symlinks. Refuse
the write when the path is already a symlink, and pass O_NOFOLLOW on
Unix so the kernel itself closes the TOCTOU window between the metadata
check and open(). Create with mode 0o600, matching OpenSSH.

H1 (HIGH) — SshSession drops the Mutex wrapping russh's client::Handle.
The handle is already internally Arc-shared and safe to call
concurrently; the external lock was serialising channel_open_session()
calls and defeating russh's channel multiplexing. Concurrent probes on
the same host now multiplex as intended.

H3 (HIGH) — avoid copying stdout/stderr unnecessarily. When the bytes
are already valid UTF-8 (the common case for nvidia-smi / rocm-smi
output), String::from_utf8 consumes the Vec in place. Only pre-invalid
bytes trigger the lossy path.

H4 (HIGH) — nvidia-smi CSV parser requires EXACTLY 10 columns, not
>= 7. A truncated or augmented row now surfaces a typed error rather
than silently misaligning fields (e.g. aliasing temperature values onto
power.draw).

M4 (MEDIUM) — accept-new keeps a per-process in-memory cache of
(host, port, fingerprint). If persisting to known_hosts fails (read-only
fs, full disk, symlink refusal), the cache still detects a key change on
the next connection in the same process. Persistence failures are logged
at error level.

M5 (MEDIUM) — known_hosts_lookup parses comma-separated multi-host
entries (host1,host2 alg key) and skips hashed-hostname lines (|1|…)
rather than misreading them.

M6 (MEDIUM) — hostname parsing validates charset: only ASCII
alphanumerics plus -, ., :, _ are accepted. Keeps shell-metacharacter-
looking inputs (user@;whoami, user@\$(cat /etc/passwd)) out of logs and
tab labels. Bracketed IPv6 body goes through the same check.

M8 (MEDIUM) — every bare matches!() call in the test suite is now
wrapped in assert!(...). Previous forms compiled to a no-op and did
not actually fail on mis-matched error variants.

Regression tests cover each fix:
* symlink-planted known_hosts refuses the append; decoy file unchanged.
* 0o600 mode check on fresh known_hosts files.
* accept-new with a read-only parent dir still rejects a key change.
* multi-host entries match any token; hashed entries are skipped.
* 7-, 9-, 11-column CSV rows surface ColumnCount errors.
* shell-metacharacter hostnames and invalid IPv6 bodies are rejected.
The README security notes for --ssh mode were missing coverage of the
known_hosts write hardening: O_NOFOLLOW refusal of a pre-planted symlink
and the in-process memory-cache fallback when persistence fails.
@inureyes

Copy link
Copy Markdown
Member Author

PR Finalization

Verification summary

README "Agentless SSH mode" section (lines 559-614): present and complete.

  • CLI flags table covering all 8 flags (--ssh, --ssh-hostfile, --ssh-key, --ssh-strict-host-key, --ssh-timeout-secs, --ssh-fallback, --ssh-known-hosts, --ssh-concurrency)
  • Host-key policy semantics for yes / accept-new / no
  • Auth precedence: key / agent only, password never attempted
  • Fallback shim probe order: native → nvidia-smi → rocm-smi → unsupported
  • Security notes: no password auth, fixed command string (no shell interpolation), --strict-host-key=no TUI warning

One gap fixed: The security notes section was missing the known_hosts symlink-refusal guarantee and the in-process memory-cache fallback. Added in commit 54221ca.

Help overlay (src/ui/help.rs lines 368-383): "Agentless SSH:" section present with three entries covering --ssh, --ssh-hostfile, and --ssh-fallback.

examples/hosts-ssh.txt: present and useful — covers default port, custom port, AMD fallback note, inline comment, and per-host user syntax.

Test coverage for hardening (all four requested scenarios):

Scenario Location Tests
known_hosts symlink refusal src/network/ssh_host_key.rs known_hosts_append_refuses_symlink_target
CSV strict column count src/network/nvidia_smi_shim.rs 7-, 9-, 11-column rejection; exact-10 golden path
accept-new in-memory fallback src/network/ssh_host_key.rs accept_new_falls_back_to_memory_on_persist_failure
Hostname charset src/network/ssh_target.rs shell-metachar rejection, valid legitimate hostnames

Lint and format: cargo clippy -- -D warnings clean (the furiosa-smi build error is a pre-existing SDK header gap unrelated to this PR), cargo fmt --check clean.

Test counts:

  • lib unit tests: 999 passed
  • bin unit tests: 1135 passed
  • integration tests: 8 passed (tests/ssh_transport_integration.rs)

HEAD: 54221ca

@inureyes inureyes added status:done Completed and removed status:review Under review labels Apr 20, 2026
@inureyes inureyes merged commit 4cf56e1 into main Apr 20, 2026
4 checks passed
@inureyes inureyes deleted the feat/194-ssh-transport branch April 20, 2026 21:49
@inureyes inureyes self-assigned this Apr 28, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

priority:medium Medium priority issue status:done Completed type:enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat(view): agentless SSH transport ('view --ssh user@host') with nvidia-smi fallback shim

1 participant