Skip to content

feat: 'all-smi doctor' subcommand for self-diagnosis and support bundle #188

Description

@inureyes

Summary

Add a new all-smi doctor subcommand that runs a battery of environment checks (drivers, permissions, libraries, env vars, container state, optional network) and prints a readable PASS/WARN/FAIL report. A --bundle flag packs the report plus relevant system context into a tar.gz suitable for attaching to a bug report; a --json flag emits the same data for CI/script consumption.

Motivation

Recurring support questions follow the pattern "why doesn't all-smi detect my GPUs?" — the answers are almost always one of: driver not loaded, wrong user group (video/render), ROCm/libamdgpu_top missing, NVML not loadable, sudo not used on macOS, compiled without the Furiosa feature flag, wrong Rust target. Today users and maintainers must guess from absent output. A standard doctor command gives an opinionated, fast, deterministic answer and dramatically reduces support round-trips.

Current state

  • src/device/platform_detection.rs has some detection logic, but it is not exposed as an introspection API.
  • src/device/reader_factory.rs decides which reader to instantiate, silently falling back.
  • No doctor subcommand exists.

Proposed design

Subcommand shape

all-smi doctor [--json] [--verbose]
               [--bundle <path.tar.gz>]
               [--include-identifiers]
               [--remote-check <host-or-url>]...
               [--skip <check-id>]...
               [--only <check-id>]...

Check categories

Each check has a stable ID (nvidia.nvml.loadable) so users and scripts can opt in/out and so output is greppable across versions.

  • platform.*: OS, kernel, arch, CPU model, socket count, memory total, uptime, compilation target (glibc vs musl), Rust target triple.
  • privileges.*: uid, effective uid, sudo/root status, macOS — running as root? video/render group membership, CAP_SYS_ADMIN, /dev/dri access, /dev/tenstorrent access.
  • container.*: is this running inside a container? cgroup v1/v2, CPU/memory limits, cpuset pinning, Kubernetes ServiceAccount token presence.
  • nvidia.*: NVML loadable, driver version, nvidia-smi binary path + version, NVIDIA_VISIBLE_DEVICES/CUDA_VISIBLE_DEVICES env, MIG mode, vGPU host detection.
  • amd.*: ROCm detected, libamdgpu_top ABI version, /dev/dri permissions, compiled with not(target_env = "musl") (AMD not available in musl builds).
  • apple.*: macOS version, IOReport API accessible, SMC accessible, running as root (required on macOS).
  • gaudi.*: hl-smi binary path, HL device nodes, driver version.
  • tpu.*: libtpu path, TPU_NAME env, JAX/Python interop reachable.
  • tenstorrent.*: luwen library, /dev/tenstorrent, tt-kmd module loaded.
  • rebellions.*: rbln-stat path, rbln driver version.
  • furiosa.*: compiled with furiosa feature flag, furiosa-smi binary.
  • windows.*: WMI thermal zones, AMD Ryzen Master SDK, Intel WMI, LibreHardwareMonitor running (for CPU temp fallback chain).
  • env.*: Every ALL_SMI_* env var, CUDA_*, ROCR_*, TPU_*, HL_* env vars redacted where appropriate.
  • network.* (only when --remote-check is given): DNS resolution, TCP reachability, latency, HTTP GET on /metrics with version check.

Each check is a unit with:

pub struct Check {
    pub id: &'static str,
    pub title: &'static str,
    pub severity_on_fail: Severity, // Info/Warn/Error
    pub run: fn(&Ctx) -> CheckResult,
}

pub enum CheckResult {
    Pass(String),             // one-line explanation
    Warn(String, Option<String>), // short + optional "how to fix" hint
    Fail(String, Option<String>),
    Skip(String),             // not applicable on this platform
}

Each check must complete within a hard 3-second timeout; timeouts become Warn with a "check timed out" message.

Output

Human-readable (default):

all-smi doctor — 0.21.0
Running 47 checks…

PASS platform.os              Linux 6.17.0 x86_64
PASS platform.runtime         glibc target
PASS privileges.user          uid=1000 in group render
FAIL nvidia.nvml.loadable     libnvidia-ml.so.1 not found
     → Fix: install NVIDIA driver (nvidia-driver-580+) or set LD_LIBRARY_PATH
WARN amd.rocm.version         ROCm 6.3.0 detected; all-smi tested against 7.0.x+
SKIP apple.smc                Not on macOS

Summary: 43 pass, 1 warn, 1 fail, 2 skipped
Exit code: 2

JSON (with --json):

{
  "schema": 1,
  "version": "0.21.0",
  "summary": {"pass": 43, "warn": 1, "fail": 1, "skip": 2},
  "checks": [
    {"id": "platform.os", "status": "pass", "message": "Linux 6.17.0 x86_64"},
    {"id": "nvidia.nvml.loadable", "status": "fail", "message": "libnvidia-ml.so.1 not found", "fix": "install NVIDIA driver..."}
  ]
}

Exit codes: 0 all pass (warnings allowed if --strict not set), 1 warnings only, 2 any failures.

Support bundle

--bundle report.tar.gz packs:

  • report.txt (human output)
  • report.json (json output)
  • env.txt (filtered env vars)
  • uname.txt (uname -a)
  • lspci.txt (lspci -vv | grep -iE 'vga|3d|display|accel') — Linux only
  • lsmod.txt — Linux only
  • dmesg-gpu.txt — last 200 lines matching GPU keywords (may need elevated privileges; skip if unavailable)
  • version.txt (all-smi --version, build features, target triple)
  • config.toml effective merged config (if the config-file issue has landed) with sensitive fields redacted
  • On macOS: system_profiler SPDisplaysDataType output.

Privacy:

  • Default bundle scrubs hostnames, IP addresses, usernames, MAC addresses.
  • --include-identifiers keeps them (opt-in).

Implementation plan

Files to add / modify:

  • src/cli.rsDoctor(DoctorArgs).
  • New module tree:
    • src/doctor/mod.rs — orchestrator, parallel execution with a small tokio pool, timeouts.
    • src/doctor/checks/platform.rs, privileges.rs, container.rs, nvidia.rs, amd.rs, apple.rs, gaudi.rs, tpu.rs, tenstorrent.rs, rebellions.rs, furiosa.rs, windows.rs, env.rs, network.rs
    • src/doctor/report.rs — human + JSON renderers.
    • src/doctor/bundle.rs — tar.gz packer (uses tar = "0.4" + flate2).
    • src/doctor/redact.rs — regex-based hostname/IP/MAC/user redaction.
  • src/device/platform_detection.rs — extract detection routines into a public introspection API so doctor and reader_factory share them.
  • tests/doctor_test.rs — smoke test covering a few check IDs on the host.
  • CI: a job runs cargo run -- doctor --json on the CI image and asserts exit 0/1 (no hard failures on the CI runner).

Acceptance criteria

  • all-smi doctor completes in under 3 seconds on a machine with no GPUs.
  • Every check outputs PASS/WARN/FAIL/SKIP with a one-line explanation and a fix hint on WARN/FAIL.
  • Every check has a stable ID; IDs are documented in the README.
  • --json emits a valid JSON document matching a fixed schema (with schema: 1).
  • --bundle report.tar.gz produces a tar.gz containing report.txt, report.json, and the applicable system files.
  • By default, bundles do not contain hostnames, IPs, MAC addresses, or local usernames.
  • --include-identifiers keeps them; --verbose prints extra context per check.
  • --skip nvidia.nvml.loadable omits that check; --only privileges.* runs only that group.
  • Respects NO_COLOR env var.
  • On Windows, the subcommand runs to completion (no panic!("not supported")).
  • Exit codes as specified.
  • README gains a "Diagnostics" section; issue templates ask for all-smi doctor --bundle output.

Edge cases & non-goals

  • Each check must be robust to missing binaries (which foo returns None → Skip).
  • Spawning child processes (lspci, lsmod) must time out — do not use Command::output() without a bounded timeout; use tokio with a deadline.
  • On macOS, system_profiler is slow; offer it behind --verbose only.
  • Checks must not mutate state (read-only tools like lsmod, lspci, uname, ldd).
  • Bundle must NOT include dmesg output that contains kernel addresses if kptr_restrict is on — follow kernel redaction.
  • Non-goal: fixing detected problems. This is diagnosis only.
  • Non-goal: uploading bundles. Users attach to GitHub issues themselves.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions