Skip to content

feat: 'all-smi doctor' subcommand for self-diagnosis and support bundle#198

Merged
inureyes merged 3 commits into
mainfrom
feature/issue-188-doctor-subcommand
Apr 20, 2026
Merged

feat: 'all-smi doctor' subcommand for self-diagnosis and support bundle#198
inureyes merged 3 commits into
mainfrom
feature/issue-188-doctor-subcommand

Conversation

@inureyes

Copy link
Copy Markdown
Member

Summary

Introduce a read-only all-smi doctor subcommand that runs a battery of
environment checks and prints a PASS/WARN/FAIL/SKIP report. Closes #188.

  • 50 built-in checks across 14 categories (platform, privileges,
    container, nvidia, amd, apple, gaudi, tpu, tenstorrent, rebellions,
    furiosa, windows, env, network) with stable dotted IDs.
  • Hard 3-second per-check ceiling plus bounded-timeout subprocess
    wrappers — no Command::output() without a wrapper.
  • Checks execute in parallel on a Tokio Semaphore-limited pool.
  • Exit codes: 0 clean, 1 warnings only, 2 any failures.

Implementation

  • src/cli.rs gains a Doctor(DoctorArgs) variant with --json,
    --verbose, --bundle <PATH>, --include-identifiers,
    --remote-check <URL>..., --skip <ID>..., --only <ID>....
  • src/main.rs dispatches to doctor::run_cli which returns the exit
    code to the CLI driver.
  • src/doctor/ subtree:
    • types.rsCheck, CheckCtx, CheckResult, Severity.
    • exec.rs — bounded-timeout child-process helpers (try_exec,
      which).
    • redact.rs — regex scrubbing of hostnames, IPv4, IPv6, MAC, kernel
      pointers, and the current username.
    • report.rs — human and JSON renderers; ANSI colour controlled by
      NO_COLOR + TTY detection.
    • bundle.rs — tar.gz archive packer (adds tar = "0.4" + existing
      flate2). Includes report.txt, report.json, env.txt,
      version.txt, uname.txt, lspci.txt, lsmod.txt,
      dmesg-gpu.txt, and (macOS --verbose only)
      system_profiler_display.txt. Leaves a TODO for the effective
      merged-config file pending feat: TOML config file support ('~/.config/all-smi/config.toml') #192.
    • checks/ — one module per category, each exposing pub fn checks() -> &'static [&'static Check]. Every non-applicable path is a clean
      Skip; Windows-only probes compile and run on every platform.
    • mod.rs — orchestrator with stable-sort rendering, filter
      predicates, DoctorOptions normalisation, Summary::exit_code().
  • src/device/platform_detection.rs gains a pub mod introspection
    that bundles every detector into a single PlatformSnapshot struct,
    shared between the doctor and reader_factory.
  • Cargo.toml: new tar = "0.4" dep and a top-level furiosa feature
    alias for the existing furiosa-smi-rs optional dep so the feature
    name is discoverable via --features furiosa.
  • tests/doctor_test.rs: 9 integration tests covering report shape,
    JSON parseability, --only/--skip filters, required stable check
    IDs, redaction behaviour, exit-code mapping, and serde shape.
  • README.md: new "Diagnostics" section documenting every stable
    check ID and the exit-code contract.

Testing

  • cargo fmt --all -- --check
  • cargo clippy --all-targets -- -D warnings
  • cargo test --lib — 715 passed, 0 failed
  • cargo test --bin all-smi — 812 passed, 0 failed
  • cargo test --test doctor_test — 9 passed, 0 failed
  • Manual smoke tests: doctor, --json, --bundle, --only,
    --skip, --remote-check, --include-identifiers, NO_COLOR, and
    the generated tar.gz unpacks to the expected layout.

Closes #188

Introduce a read-only diagnostic subcommand that runs 50 checks across 14
categories (platform, privileges, container, nvidia, amd, apple, gaudi,
tpu, tenstorrent, rebellions, furiosa, windows, env, network) and emits a
PASS/WARN/FAIL/SKIP report. Every check runs in parallel under a hard
3-second timeout and uses bounded-timeout subprocess helpers so no call
can hang the command. Adds --json machine-readable output, --bundle for
a redacted tar.gz support archive, --only/--skip prefix filters,
--remote-check opt-in endpoint probes, and --include-identifiers to
override the default hostname/IP/MAC/username/kptr scrubbing. Extracts a
shared introspection snapshot from platform_detection.rs so the doctor
and reader_factory agree on detected accelerators. Closes #188.
@inureyes inureyes added type:enhancement New feature or request priority:high High priority issue status:review Under review labels Apr 20, 2026
@inureyes

Copy link
Copy Markdown
Member Author

Implementation Review Summary

Intent

Deliver all-smi doctor — a read-only self-diagnosis subcommand with PASS/WARN/FAIL/SKIP report, bounded per-check timeout, redaction-by-default, optional tar.gz support bundle, and stable filterable check IDs (issue #188).

Findings Addressed

  • None — review is read-only; no auto-fixes were applied.

Remaining Items

  • src/doctor/checks/privileges.rs:163check_dev_dri reports a raw read_dir().count() which includes the by-path subdirectory alongside card*/render* device nodes, so a typical Linux host with one GPU reports 3 node(s) present instead of 2. (LOW) — cosmetic, not blocking. Consider filtering the same way amd.dri.perms does (card*/render* only).
  • src/doctor/redact.rs:70 — kernel-pointer regex \bffff[0-9a-f]{12}\b is case-sensitive. Dmesg almost always emits lowercase hex, so this is a non-issue in practice, but (?i) or [0-9a-fA-F] would be strictly safer. (LOW)
  • src/doctor/bundle.rs:157-164 — bundle env.txt keeps PATH / LD_LIBRARY_PATH verbatim (minus username redaction). Full filesystem paths can still reveal project structure. Default scrubbing already redacts usernames; further path redaction is arguably out of scope for a support bundle users opt into. (LOW — privacy-by-default is already met per issue spec)
  • PR body states "50 built-in checks" but the registry exposes 51 check IDs (confirmed via doctor --json). (LOW — PR-body drift only)

Verification

  • All stated requirements implemented — every acceptance-criteria box in issue feat: 'all-smi doctor' subcommand for self-diagnosis and support bundle #188 is delivered.
  • No placeholder/mock code remaining — every check does real work or returns Skip with a reason.
  • Integrated into project code flow — src/main.rs:267-278 dispatches Commands::Doctor to doctor::run_cli, which propagates the per-spec exit code via std::process::exit(code). cargo run -- doctor on this host ran 51 checks, produced a clean report, and returned exit code 1 (warn-only) exactly as specified.
  • Project conventions followed — module layout under src/doctor/ mirrors the issue's proposed structure; Apache-2.0 headers on every new file; thiserror/anyhow used appropriately; inline format args consistently; #[cfg(target_os = ...)] guards are used in every non-portable block.
  • Existing modules reused where applicable — utils::command_timeout::run_command_with_timeout (bounded + process-kill on timeout) is the single backing call for doctor::exec::try_exec. device::platform_detection::introspection::PlatformSnapshot was added as a shared hardware-detection API for both the doctor and reader_factory. utils::runtime_environment::detect_container_environment is reused for container detection.
  • No unintended structural changes — only additions; no refactors of unrelated subsystems.
  • Tests pass — cargo test --test doctor_test 9/9 pass; cargo test --lib doctor 15/15 pass; cargo clippy --all-targets clean; build succeeds.

Constraint-by-constraint audit

  • Integration: src/cli.rs:56 adds Doctor(DoctorArgs), src/main.rs:267-278 dispatches, src/lib.rs:132-133 exposes pub mod doctor. cargo run -- doctor and --json/--only/--bundle all produce correct output on this host. PASS.
  • Timeout discipline: grep across src/doctor/** finds zero direct Command::output() / Command::status() / tokio::process::Command / std::process::Command usages. Every subprocess goes through doctor::exec::try_executils::command_timeout::run_command_with_timeout (which spawns, polls, kills child on deadline). PASS.
  • 3-second per-check ceiling: enforced in src/doctor/mod.rs:252-264 via tokio::time::timeout(CHECK_HARD_TIMEOUT, ...) wrapping every spawn_blocking check body. PASS.
  • Windows compat: zero panic!("not supported") anywhere under src/doctor/. Every #[cfg(target_os = "windows")] block in src/doctor/checks/windows.rs uses the wmi crate that Cargo.toml already pulls in under [target.'cfg(target_os = "windows")']. Full Windows cross-compile on this box is blocked by the missing mingw toolchain (environmental), but the Rust-layer cargo check for x86_64-pc-windows-gnu proceeded through the doctor module before dying at the cc link stage on unrelated crates — doctor code itself compiles cleanly for that target. PASS with note.
  • Parallel execution: src/doctor/mod.rs:62,188tokio::sync::Semaphore::new(8) with CHECK_PARALLELISM = 8. Every spawned check acquires a permit before executing the check body. PASS.
  • Check ID stability: 51 IDs in code === 51 IDs in README table at README.md:240-253. Cross-referenced programmatically — zero diff in either direction. --only/--skip use prefix-with-dot matching via passes_filter (unit-tested). PASS.
  • Exit codes: Summary::exit_code in src/doctor/mod.rs:163-1712 on any fail, 1 on warn-only, 0 otherwise. Propagated via std::process::exit(code) in src/main.rs:272. Verified end-to-end: shell $? was 1 after a run with 1 warn / 0 fail. PASS.
  • Redaction: default RedactOptions (src/doctor/redact.rs:101-115) pulls hostname via whoami::hostname()/$HOSTNAME and username via $USER/$LOGNAME; IPv4, IPv6, MAC regexes run unconditionally when enabled; scrub_kernel_pointers = true by default respects the kptr_restrict spirit even on permissive hosts; RedactOptions::passthrough() is used under --include-identifiers. Verified against the bundle: /home/<user>/... and HOSTNAME=<hostname> in extracted env.txt. PASS.
  • Read-only invariant: grep for write|remove|truncate|chmod|chown|delete|unlink|rename|mkdir in src/doctor/** finds exactly one hit: File::create in bundle.rs:60 — which writes the user-requested bundle tar.gz, i.e. the subcommand's own output. No mutation of system state. PASS.
  • Bundle: src/doctor/bundle.rs writes all-smi-doctor/{report.txt,report.json,env.txt,version.txt,uname.txt,lspci.txt,lsmod.txt,dmesg-gpu.txt} plus system_profiler_display.txt on macOS --verbose. Verified on this host: 7 entries produced (dmesg omitted per-spec because CAP_SYSLOG was unavailable). PASS.
  • Project conventions / file size: Largest new file is src/doctor/mod.rs at 430 lines. Every other new file is under 360 lines. Matches the existing subcommand module sizes (snapshot/, record/). PASS.

Verdict

Ship it. Implementation matches the issue spec end-to-end; remaining findings are cosmetic / LOW severity and can be addressed in a follow-up. No auto-fixes applied.

Security review of the PR #198 diff surfaced five concrete attack paths
against the new `all-smi doctor --bundle` flow. Each is addressed in
this commit:

1. Secret leakage via env.txt. The bundle allow-listed `ALL_SMI_*`,
   `BACKENDAI_*`, and similar prefixes without masking credential
   values. Add a substring-matched secret name filter
   (TOKEN / SECRET / PASSWORD / API_KEY / ACCESS_KEY / CREDENTIAL /
   AUTH / SESSION / COOKIE / BEARER / SIGNATURE / PRIVATE_KEY /
   CLIENT_SECRET) that replaces the value with `<redacted:secret>`
   regardless of `--include-identifiers`. Drop `PATH` and
   `LD_LIBRARY_PATH` from the verbatim dump and emit a length-only
   summary instead so `/home/<user>` segments never reach the archive.

2. Unbounded subprocess output. `run_command_with_timeout` read
   stdout/stderr into `Vec<u8>` with no cap. A hostile or runaway
   `dmesg`/`lspci`/`nvidia-smi` could OOM the process because the
   bundle-side filtering happens only after the full read. Cap each
   stream at 16 MiB (`COMMAND_OUTPUT_CAP_BYTES`) with an explicit
   `[truncated by all-smi: output exceeded cap]` marker, and keep
   draining the pipe so the child cannot block on a full pipe buffer.

3. Bundle TOCTOU / permissions. The writer used `File::create(path)`
   which happily follows a pre-existing symlink and leaves the file
   world-readable. Route the write through `open_bundle_file` which
   uses `O_NOFOLLOW | mode(0o600)` on Unix and `share_mode(0)` on
   Windows, matching the snapshot (#185) and record (#187) hardening.
   Also `fsync` the archive on finalisation, and set tar entry mode
   to 0o600 so extracted files inherit the same restrictive bits.

4. Network-check SSRF via redirects. The HTTP probe built a default
   `reqwest::Client` which follows up to 10 redirects — a
   user-supplied URL pointing at a benign host could bounce the
   probe at `169.254.169.254` or another internal endpoint. Set
   `redirect(Policy::none())` so the probe hits only the URL the
   user actually supplied.

5. Weak redaction. Broaden the kernel-pointer regex from
   `ffff[0-9a-f]{12}` to cover `ffffffff...` canonical addresses
   plus any `0x`-anchored 8-16 hex-digit address (32-bit kernels,
   module load addresses, KASAN shadow offsets). Make hostname and
   username matchers case-insensitive so mixed-case tool output
   (Windows `uname`, macOS `scutil`, BIOS strings) is still scrubbed.

Tests: add 10 new unit tests covering secret redaction, PATH
summary, 0o600 bundle mode, symlink refusal, case-insensitive
host/user matching, broader kptr coverage, and the 16 MiB output
cap. `cargo test --lib` passes 726 (was 715); `cargo test --test
doctor_test` passes 9; `cargo clippy -- -D warnings` and `cargo
fmt --check` are clean.

Closes #188 review notes.
@inureyes

Copy link
Copy Markdown
Member Author

Security + performance review

Reviewed the all-smi doctor change against the attack surface outlined in the review brief. Findings and fixes follow.

Resolved in 02adf80

# Severity Finding Fix
1 CRITICAL env.txt in bundle included ALL_SMI_*/BACKENDAI_*/NVIDIA_* prefixes verbatim. Credential-bearing names (BACKENDAI_SECRET_KEY, *_API_TOKEN, HUGGINGFACE_HUB_TOKEN, AWS_SESSION_TOKEN, ...) would be leaked. scrub() only redacts IPs / hostnames / MACs / usernames, not secret values. Added is_secret_env_name with substring match against TOKEN / SECRET / PASSWORD / API_KEY / ACCESS_KEY / CREDENTIAL / AUTH / SESSION / COOKIE / BEARER / SIGNATURE / PRIVATE_KEY / CLIENT_SECRET. Value replaced with <redacted:secret> regardless of --include-identifiers.
2 CRITICAL run_command_with_timeout read subprocess stdout/stderr into Vec<u8> without a cap. A hostile / runaway dmesg / lspci / nvidia-smi could OOM the process (bundle-side filtering happens only after full read). Introduced COMMAND_OUTPUT_CAP_BYTES = 16 MiB with read_capped helper that truncates the buffer and keeps draining the pipe so the child does not block on a full pipe buffer. Appends [truncated by all-smi: output exceeded cap] marker.
3 HIGH Bundle writer called File::create(path) directly — no O_NOFOLLOW, no 0o600. A pre-existing symlink at the target path would be followed. Diverged from the snapshot (#185) and record (#187) hardening pattern. Added open_bundle_file using O_NOFOLLOW | mode(0o600) on Unix and share_mode(0) on Windows. Added sync_all on finalisation. Tar entry mode lowered from 0o644 to 0o600 so extracted files inherit restrictive bits.
4 HIGH reqwest::Client::builder() in network.http check used the default redirect policy (up to 10 follows). A benign user-supplied URL could bounce the probe at 169.254.169.254 / internal endpoints. Set redirect(reqwest::redirect::Policy::none()).
5 MEDIUM Kernel pointer regex only matched ffff[0-9a-f]{12} — missed ffffffff... canonical addresses, 0x-prefixed kernel module load addresses, KASAN shadow offsets. Issue #188 spec mandates kptr redaction. Broadened to (?i)\bf{4,}[0-9a-f]{8,12}\b|\b0x[0-9a-fA-F]{8,16}\b.
6 MEDIUM Hostname / username regex was case-sensitive. Windows uname, macOS scutil, BIOS strings, and log lines that upper-case do not trigger redaction. Added (?i) case-insensitive flag to both matchers.
7 MEDIUM PATH and LD_LIBRARY_PATH included verbatim in env.txt — values frequently contain /home/<username>/... and private build directories. Replaced with length-only summary: PATH=<redacted:path-list N entries>.

Not exploitable (verified)

  • Argument injection via --remote-check: URLs are parsed with url::Url::parse(). Non-shell Command::new() is used everywhere in the doctor — no shell=True, no /bin/sh -c, and no try_exec call in any check passes user-controlled data as command or arg. All try_exec(cmd, args, ...) invocations use static string literals for both cmd and args.
  • Tar path traversal: entry names are hard-coded string literals ("all-smi-doctor/report.txt", ...). No attacker-controlled path component reaches tar::Builder::append_data.
  • which spawn: bounded by a 500 ms timeout via the same try_exec wrapper; no untrusted cmd argument path.
  • Subprocess timeouts: every call goes through try_execrun_command_with_timeout with an explicit Duration. The spawn_blocking + 3-second orchestrator ceiling provides a second layer.

Test coverage added

10 new unit tests:

  • is_secret_env_name_matches_common_patterns
  • env_dump_redacts_secret_values_and_summarises_path
  • bundle_unix_mode_is_0o600
  • bundle_refuses_preexisting_symlink (verifies decoy target is NOT overwritten through symlink)
  • scrub_hostname_is_case_insensitive
  • scrub_username_is_case_insensitive
  • scrub_kernel_pointer_canonical
  • scrub_kernel_pointer_0x_prefixed
  • scrub_kernel_pointer_leaves_version_strings_alone
  • output_cap_truncates_hostile_stdout + output_cap_leaves_small_outputs_alone

Verification

  • cargo test --lib726 passed (was 715 in the PR)
  • cargo test --test doctor_test — 9 passed
  • cargo clippy --all-targets --lib --bin all-smi -- -D warnings — clean
  • cargo fmt --all -- --check — clean

Remaining observations (non-blocking, not auto-fixed)

  • The ephemeral current-thread Tokio runtime built inside spawn_blocking for the HTTP probe (network.rs) works but is unusual. Worth revisiting if another check needs outbound HTTP — a shared single-threaded runtime for network checks would be cleaner.
  • IPv6 matcher still misses some compressed forms (e.g. ::ffff:192.0.2.1 mixed form past the embedded v4 case). Low impact since the v4 regex scrubs the trailing 192.0.2.1 independently, but the <ipv6> marker will not be emitted for those inputs.

Add a "Support Bundle Security" subsection under Diagnostics covering
O_NOFOLLOW symlink refusal, 0o600 owner-only file mode, secret-name
redaction in env.txt (always applied), and the exact scope of
--include-identifiers (network identifiers only, not credentials).

Add the doctor subcommand entry to the v0.21.0 upcoming changelog.
@inureyes

Copy link
Copy Markdown
Member Author

PR Finalization Complete

Checks run

  • cargo fmt --all -- --check: clean
  • cargo clippy --all-targets -- -D warnings: clean
  • cargo test --lib: 726 passed, 0 failed
  • cargo test --bin all-smi: 823 passed, 0 failed
  • cargo test --test doctor_test: 9 passed, 0 failed

Stable check-ID table

51 check IDs registered and verified across 14 categories. README table lists all 51; doctor_specific_check_ids_are_registered integration test enforces the subset that must remain stable.

Documentation

Added a "Support Bundle Security" subsection to the README Diagnostics section covering:

  • O_NOFOLLOW symlink refusal (pre-existing symlink at --bundle <PATH> surfaces an error rather than following)
  • 0o600 owner-only file permissions on the produced archive
  • Secret-name redaction in env.txt (16 keyword substrings; always applied regardless of --include-identifiers)
  • Exact scope of --include-identifiers: restores network-identity tokens (hostname, IPv4/6, MAC, username); credential values are never restored

Also added a doctor subcommand bullet to the v0.21.0 upcoming changelog entry.

Commit

179942e — ready for merge.

@inureyes inureyes added status:done Completed and removed status:review Under review labels Apr 20, 2026
@inureyes inureyes merged commit 3408512 into main Apr 20, 2026
3 checks passed
@inureyes inureyes deleted the feature/issue-188-doctor-subcommand branch April 20, 2026 15:15
@inureyes inureyes self-assigned this May 24, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

priority:high High priority issue status:done Completed type:enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: 'all-smi doctor' subcommand for self-diagnosis and support bundle

1 participant