feat: 'all-smi doctor' subcommand for self-diagnosis and support bundle#198
Conversation
Introduce a read-only diagnostic subcommand that runs 50 checks across 14 categories (platform, privileges, container, nvidia, amd, apple, gaudi, tpu, tenstorrent, rebellions, furiosa, windows, env, network) and emits a PASS/WARN/FAIL/SKIP report. Every check runs in parallel under a hard 3-second timeout and uses bounded-timeout subprocess helpers so no call can hang the command. Adds --json machine-readable output, --bundle for a redacted tar.gz support archive, --only/--skip prefix filters, --remote-check opt-in endpoint probes, and --include-identifiers to override the default hostname/IP/MAC/username/kptr scrubbing. Extracts a shared introspection snapshot from platform_detection.rs so the doctor and reader_factory agree on detected accelerators. Closes #188.
Implementation Review SummaryIntent
Findings Addressed
Remaining Items
Verification
Constraint-by-constraint audit
VerdictShip it. Implementation matches the issue spec end-to-end; remaining findings are cosmetic / LOW severity and can be addressed in a follow-up. No auto-fixes applied. |
Security review of the PR #198 diff surfaced five concrete attack paths against the new `all-smi doctor --bundle` flow. Each is addressed in this commit: 1. Secret leakage via env.txt. The bundle allow-listed `ALL_SMI_*`, `BACKENDAI_*`, and similar prefixes without masking credential values. Add a substring-matched secret name filter (TOKEN / SECRET / PASSWORD / API_KEY / ACCESS_KEY / CREDENTIAL / AUTH / SESSION / COOKIE / BEARER / SIGNATURE / PRIVATE_KEY / CLIENT_SECRET) that replaces the value with `<redacted:secret>` regardless of `--include-identifiers`. Drop `PATH` and `LD_LIBRARY_PATH` from the verbatim dump and emit a length-only summary instead so `/home/<user>` segments never reach the archive. 2. Unbounded subprocess output. `run_command_with_timeout` read stdout/stderr into `Vec<u8>` with no cap. A hostile or runaway `dmesg`/`lspci`/`nvidia-smi` could OOM the process because the bundle-side filtering happens only after the full read. Cap each stream at 16 MiB (`COMMAND_OUTPUT_CAP_BYTES`) with an explicit `[truncated by all-smi: output exceeded cap]` marker, and keep draining the pipe so the child cannot block on a full pipe buffer. 3. Bundle TOCTOU / permissions. The writer used `File::create(path)` which happily follows a pre-existing symlink and leaves the file world-readable. Route the write through `open_bundle_file` which uses `O_NOFOLLOW | mode(0o600)` on Unix and `share_mode(0)` on Windows, matching the snapshot (#185) and record (#187) hardening. Also `fsync` the archive on finalisation, and set tar entry mode to 0o600 so extracted files inherit the same restrictive bits. 4. Network-check SSRF via redirects. The HTTP probe built a default `reqwest::Client` which follows up to 10 redirects — a user-supplied URL pointing at a benign host could bounce the probe at `169.254.169.254` or another internal endpoint. Set `redirect(Policy::none())` so the probe hits only the URL the user actually supplied. 5. Weak redaction. Broaden the kernel-pointer regex from `ffff[0-9a-f]{12}` to cover `ffffffff...` canonical addresses plus any `0x`-anchored 8-16 hex-digit address (32-bit kernels, module load addresses, KASAN shadow offsets). Make hostname and username matchers case-insensitive so mixed-case tool output (Windows `uname`, macOS `scutil`, BIOS strings) is still scrubbed. Tests: add 10 new unit tests covering secret redaction, PATH summary, 0o600 bundle mode, symlink refusal, case-insensitive host/user matching, broader kptr coverage, and the 16 MiB output cap. `cargo test --lib` passes 726 (was 715); `cargo test --test doctor_test` passes 9; `cargo clippy -- -D warnings` and `cargo fmt --check` are clean. Closes #188 review notes.
Security + performance reviewReviewed the Resolved in
|
| # | Severity | Finding | Fix |
|---|---|---|---|
| 1 | CRITICAL | env.txt in bundle included ALL_SMI_*/BACKENDAI_*/NVIDIA_* prefixes verbatim. Credential-bearing names (BACKENDAI_SECRET_KEY, *_API_TOKEN, HUGGINGFACE_HUB_TOKEN, AWS_SESSION_TOKEN, ...) would be leaked. scrub() only redacts IPs / hostnames / MACs / usernames, not secret values. |
Added is_secret_env_name with substring match against TOKEN / SECRET / PASSWORD / API_KEY / ACCESS_KEY / CREDENTIAL / AUTH / SESSION / COOKIE / BEARER / SIGNATURE / PRIVATE_KEY / CLIENT_SECRET. Value replaced with <redacted:secret> regardless of --include-identifiers. |
| 2 | CRITICAL | run_command_with_timeout read subprocess stdout/stderr into Vec<u8> without a cap. A hostile / runaway dmesg / lspci / nvidia-smi could OOM the process (bundle-side filtering happens only after full read). |
Introduced COMMAND_OUTPUT_CAP_BYTES = 16 MiB with read_capped helper that truncates the buffer and keeps draining the pipe so the child does not block on a full pipe buffer. Appends [truncated by all-smi: output exceeded cap] marker. |
| 3 | HIGH | Bundle writer called File::create(path) directly — no O_NOFOLLOW, no 0o600. A pre-existing symlink at the target path would be followed. Diverged from the snapshot (#185) and record (#187) hardening pattern. |
Added open_bundle_file using O_NOFOLLOW | mode(0o600) on Unix and share_mode(0) on Windows. Added sync_all on finalisation. Tar entry mode lowered from 0o644 to 0o600 so extracted files inherit restrictive bits. |
| 4 | HIGH | reqwest::Client::builder() in network.http check used the default redirect policy (up to 10 follows). A benign user-supplied URL could bounce the probe at 169.254.169.254 / internal endpoints. |
Set redirect(reqwest::redirect::Policy::none()). |
| 5 | MEDIUM | Kernel pointer regex only matched ffff[0-9a-f]{12} — missed ffffffff... canonical addresses, 0x-prefixed kernel module load addresses, KASAN shadow offsets. Issue #188 spec mandates kptr redaction. |
Broadened to (?i)\bf{4,}[0-9a-f]{8,12}\b|\b0x[0-9a-fA-F]{8,16}\b. |
| 6 | MEDIUM | Hostname / username regex was case-sensitive. Windows uname, macOS scutil, BIOS strings, and log lines that upper-case do not trigger redaction. |
Added (?i) case-insensitive flag to both matchers. |
| 7 | MEDIUM | PATH and LD_LIBRARY_PATH included verbatim in env.txt — values frequently contain /home/<username>/... and private build directories. |
Replaced with length-only summary: PATH=<redacted:path-list N entries>. |
Not exploitable (verified)
- Argument injection via
--remote-check: URLs are parsed withurl::Url::parse(). Non-shellCommand::new()is used everywhere in the doctor — noshell=True, no/bin/sh -c, and notry_execcall in any check passes user-controlled data as command or arg. Alltry_exec(cmd, args, ...)invocations use static string literals for bothcmdandargs. - Tar path traversal: entry names are hard-coded string literals (
"all-smi-doctor/report.txt", ...). No attacker-controlled path component reachestar::Builder::append_data. whichspawn: bounded by a 500 ms timeout via the sametry_execwrapper; no untrustedcmdargument path.- Subprocess timeouts: every call goes through
try_exec→run_command_with_timeoutwith an explicitDuration. Thespawn_blocking+ 3-second orchestrator ceiling provides a second layer.
Test coverage added
10 new unit tests:
is_secret_env_name_matches_common_patternsenv_dump_redacts_secret_values_and_summarises_pathbundle_unix_mode_is_0o600bundle_refuses_preexisting_symlink(verifies decoy target is NOT overwritten through symlink)scrub_hostname_is_case_insensitivescrub_username_is_case_insensitivescrub_kernel_pointer_canonicalscrub_kernel_pointer_0x_prefixedscrub_kernel_pointer_leaves_version_strings_aloneoutput_cap_truncates_hostile_stdout+output_cap_leaves_small_outputs_alone
Verification
cargo test --lib— 726 passed (was 715 in the PR)cargo test --test doctor_test— 9 passedcargo clippy --all-targets --lib --bin all-smi -- -D warnings— cleancargo fmt --all -- --check— clean
Remaining observations (non-blocking, not auto-fixed)
- The ephemeral current-thread Tokio runtime built inside
spawn_blockingfor the HTTP probe (network.rs) works but is unusual. Worth revisiting if another check needs outbound HTTP — a shared single-threaded runtime for network checks would be cleaner. - IPv6 matcher still misses some compressed forms (e.g.
::ffff:192.0.2.1mixed form past the embedded v4 case). Low impact since the v4 regex scrubs the trailing192.0.2.1independently, but the<ipv6>marker will not be emitted for those inputs.
Add a "Support Bundle Security" subsection under Diagnostics covering O_NOFOLLOW symlink refusal, 0o600 owner-only file mode, secret-name redaction in env.txt (always applied), and the exact scope of --include-identifiers (network identifiers only, not credentials). Add the doctor subcommand entry to the v0.21.0 upcoming changelog.
PR Finalization CompleteChecks run
Stable check-ID table51 check IDs registered and verified across 14 categories. README table lists all 51; DocumentationAdded a "Support Bundle Security" subsection to the README Diagnostics section covering:
Also added a Commit
|
Summary
Introduce a read-only
all-smi doctorsubcommand that runs a battery ofenvironment checks and prints a PASS/WARN/FAIL/SKIP report. Closes #188.
container, nvidia, amd, apple, gaudi, tpu, tenstorrent, rebellions,
furiosa, windows, env, network) with stable dotted IDs.
wrappers — no
Command::output()without a wrapper.Semaphore-limited pool.0clean,1warnings only,2any failures.Implementation
src/cli.rsgains aDoctor(DoctorArgs)variant with--json,--verbose,--bundle <PATH>,--include-identifiers,--remote-check <URL>...,--skip <ID>...,--only <ID>....src/main.rsdispatches todoctor::run_cliwhich returns the exitcode to the CLI driver.
src/doctor/subtree:types.rs—Check,CheckCtx,CheckResult,Severity.exec.rs— bounded-timeout child-process helpers (try_exec,which).redact.rs— regex scrubbing of hostnames, IPv4, IPv6, MAC, kernelpointers, and the current username.
report.rs— human and JSON renderers; ANSI colour controlled byNO_COLOR+ TTY detection.bundle.rs— tar.gz archive packer (addstar = "0.4"+ existingflate2). Includesreport.txt,report.json,env.txt,version.txt,uname.txt,lspci.txt,lsmod.txt,dmesg-gpu.txt, and (macOS--verboseonly)system_profiler_display.txt. Leaves a TODO for the effectivemerged-config file pending feat: TOML config file support ('~/.config/all-smi/config.toml') #192.
checks/— one module per category, each exposingpub fn checks() -> &'static [&'static Check]. Every non-applicable path is a cleanSkip; Windows-only probes compile and run on every platform.mod.rs— orchestrator with stable-sort rendering, filterpredicates,
DoctorOptionsnormalisation,Summary::exit_code().src/device/platform_detection.rsgains apub mod introspectionthat bundles every detector into a single
PlatformSnapshotstruct,shared between the doctor and
reader_factory.Cargo.toml: newtar = "0.4"dep and a top-levelfuriosafeaturealias for the existing
furiosa-smi-rsoptional dep so the featurename is discoverable via
--features furiosa.tests/doctor_test.rs: 9 integration tests covering report shape,JSON parseability,
--only/--skipfilters, required stable checkIDs, redaction behaviour, exit-code mapping, and serde shape.
README.md: new "Diagnostics" section documenting every stablecheck ID and the exit-code contract.
Testing
cargo fmt --all -- --checkcargo clippy --all-targets -- -D warningscargo test --lib— 715 passed, 0 failedcargo test --bin all-smi— 812 passed, 0 failedcargo test --test doctor_test— 9 passed, 0 faileddoctor,--json,--bundle,--only,--skip,--remote-check,--include-identifiers,NO_COLOR, andthe generated tar.gz unpacks to the expected layout.
Closes #188