Summary
Add a new all-smi doctor subcommand that runs a battery of environment checks (drivers, permissions, libraries, env vars, container state, optional network) and prints a readable PASS/WARN/FAIL report. A --bundle flag packs the report plus relevant system context into a tar.gz suitable for attaching to a bug report; a --json flag emits the same data for CI/script consumption.
Motivation
Recurring support questions follow the pattern "why doesn't all-smi detect my GPUs?" — the answers are almost always one of: driver not loaded, wrong user group (video/render), ROCm/libamdgpu_top missing, NVML not loadable, sudo not used on macOS, compiled without the Furiosa feature flag, wrong Rust target. Today users and maintainers must guess from absent output. A standard doctor command gives an opinionated, fast, deterministic answer and dramatically reduces support round-trips.
Current state
src/device/platform_detection.rs has some detection logic, but it is not exposed as an introspection API.
src/device/reader_factory.rs decides which reader to instantiate, silently falling back.
- No
doctor subcommand exists.
Proposed design
Subcommand shape
all-smi doctor [--json] [--verbose]
[--bundle <path.tar.gz>]
[--include-identifiers]
[--remote-check <host-or-url>]...
[--skip <check-id>]...
[--only <check-id>]...
Check categories
Each check has a stable ID (nvidia.nvml.loadable) so users and scripts can opt in/out and so output is greppable across versions.
- platform.*: OS, kernel, arch, CPU model, socket count, memory total, uptime, compilation target (glibc vs musl), Rust target triple.
- privileges.*: uid, effective uid, sudo/root status, macOS — running as root? video/render group membership, CAP_SYS_ADMIN, /dev/dri access, /dev/tenstorrent access.
- container.*: is this running inside a container? cgroup v1/v2, CPU/memory limits, cpuset pinning, Kubernetes ServiceAccount token presence.
- nvidia.*: NVML loadable, driver version,
nvidia-smi binary path + version, NVIDIA_VISIBLE_DEVICES/CUDA_VISIBLE_DEVICES env, MIG mode, vGPU host detection.
- amd.*: ROCm detected, libamdgpu_top ABI version, /dev/dri permissions, compiled with
not(target_env = "musl") (AMD not available in musl builds).
- apple.*: macOS version, IOReport API accessible, SMC accessible, running as root (required on macOS).
- gaudi.*: hl-smi binary path, HL device nodes, driver version.
- tpu.*: libtpu path, TPU_NAME env, JAX/Python interop reachable.
- tenstorrent.*: luwen library, /dev/tenstorrent, tt-kmd module loaded.
- rebellions.*: rbln-stat path, rbln driver version.
- furiosa.*: compiled with
furiosa feature flag, furiosa-smi binary.
- windows.*: WMI thermal zones, AMD Ryzen Master SDK, Intel WMI, LibreHardwareMonitor running (for CPU temp fallback chain).
- env.*: Every
ALL_SMI_* env var, CUDA_*, ROCR_*, TPU_*, HL_* env vars redacted where appropriate.
- network.* (only when
--remote-check is given): DNS resolution, TCP reachability, latency, HTTP GET on /metrics with version check.
Each check is a unit with:
pub struct Check {
pub id: &'static str,
pub title: &'static str,
pub severity_on_fail: Severity, // Info/Warn/Error
pub run: fn(&Ctx) -> CheckResult,
}
pub enum CheckResult {
Pass(String), // one-line explanation
Warn(String, Option<String>), // short + optional "how to fix" hint
Fail(String, Option<String>),
Skip(String), // not applicable on this platform
}
Each check must complete within a hard 3-second timeout; timeouts become Warn with a "check timed out" message.
Output
Human-readable (default):
all-smi doctor — 0.21.0
Running 47 checks…
PASS platform.os Linux 6.17.0 x86_64
PASS platform.runtime glibc target
PASS privileges.user uid=1000 in group render
FAIL nvidia.nvml.loadable libnvidia-ml.so.1 not found
→ Fix: install NVIDIA driver (nvidia-driver-580+) or set LD_LIBRARY_PATH
WARN amd.rocm.version ROCm 6.3.0 detected; all-smi tested against 7.0.x+
SKIP apple.smc Not on macOS
Summary: 43 pass, 1 warn, 1 fail, 2 skipped
Exit code: 2
JSON (with --json):
{
"schema": 1,
"version": "0.21.0",
"summary": {"pass": 43, "warn": 1, "fail": 1, "skip": 2},
"checks": [
{"id": "platform.os", "status": "pass", "message": "Linux 6.17.0 x86_64"},
{"id": "nvidia.nvml.loadable", "status": "fail", "message": "libnvidia-ml.so.1 not found", "fix": "install NVIDIA driver..."}
]
}
Exit codes: 0 all pass (warnings allowed if --strict not set), 1 warnings only, 2 any failures.
Support bundle
--bundle report.tar.gz packs:
report.txt (human output)
report.json (json output)
env.txt (filtered env vars)
uname.txt (uname -a)
lspci.txt (lspci -vv | grep -iE 'vga|3d|display|accel') — Linux only
lsmod.txt — Linux only
dmesg-gpu.txt — last 200 lines matching GPU keywords (may need elevated privileges; skip if unavailable)
version.txt (all-smi --version, build features, target triple)
config.toml effective merged config (if the config-file issue has landed) with sensitive fields redacted
- On macOS:
system_profiler SPDisplaysDataType output.
Privacy:
- Default bundle scrubs hostnames, IP addresses, usernames, MAC addresses.
--include-identifiers keeps them (opt-in).
Implementation plan
Files to add / modify:
src/cli.rs — Doctor(DoctorArgs).
- New module tree:
src/doctor/mod.rs — orchestrator, parallel execution with a small tokio pool, timeouts.
src/doctor/checks/platform.rs, privileges.rs, container.rs, nvidia.rs, amd.rs, apple.rs, gaudi.rs, tpu.rs, tenstorrent.rs, rebellions.rs, furiosa.rs, windows.rs, env.rs, network.rs
src/doctor/report.rs — human + JSON renderers.
src/doctor/bundle.rs — tar.gz packer (uses tar = "0.4" + flate2).
src/doctor/redact.rs — regex-based hostname/IP/MAC/user redaction.
src/device/platform_detection.rs — extract detection routines into a public introspection API so doctor and reader_factory share them.
tests/doctor_test.rs — smoke test covering a few check IDs on the host.
- CI: a job runs
cargo run -- doctor --json on the CI image and asserts exit 0/1 (no hard failures on the CI runner).
Acceptance criteria
Edge cases & non-goals
- Each check must be robust to missing binaries (
which foo returns None → Skip).
- Spawning child processes (
lspci, lsmod) must time out — do not use Command::output() without a bounded timeout; use tokio with a deadline.
- On macOS,
system_profiler is slow; offer it behind --verbose only.
- Checks must not mutate state (read-only tools like
lsmod, lspci, uname, ldd).
- Bundle must NOT include
dmesg output that contains kernel addresses if kptr_restrict is on — follow kernel redaction.
- Non-goal: fixing detected problems. This is diagnosis only.
- Non-goal: uploading bundles. Users attach to GitHub issues themselves.
Summary
Add a new
all-smi doctorsubcommand that runs a battery of environment checks (drivers, permissions, libraries, env vars, container state, optional network) and prints a readable PASS/WARN/FAIL report. A--bundleflag packs the report plus relevant system context into a tar.gz suitable for attaching to a bug report; a--jsonflag emits the same data for CI/script consumption.Motivation
Recurring support questions follow the pattern "why doesn't all-smi detect my GPUs?" — the answers are almost always one of: driver not loaded, wrong user group (video/render), ROCm/libamdgpu_top missing, NVML not loadable, sudo not used on macOS, compiled without the Furiosa feature flag, wrong Rust target. Today users and maintainers must guess from absent output. A standard
doctorcommand gives an opinionated, fast, deterministic answer and dramatically reduces support round-trips.Current state
src/device/platform_detection.rshas some detection logic, but it is not exposed as an introspection API.src/device/reader_factory.rsdecides which reader to instantiate, silently falling back.doctorsubcommand exists.Proposed design
Subcommand shape
Check categories
Each check has a stable ID (
nvidia.nvml.loadable) so users and scripts can opt in/out and so output is greppable across versions.nvidia-smibinary path + version, NVIDIA_VISIBLE_DEVICES/CUDA_VISIBLE_DEVICES env, MIG mode, vGPU host detection.not(target_env = "musl")(AMD not available in musl builds).furiosafeature flag, furiosa-smi binary.ALL_SMI_*env var,CUDA_*,ROCR_*,TPU_*,HL_*env vars redacted where appropriate.--remote-checkis given): DNS resolution, TCP reachability, latency, HTTP GET on/metricswith version check.Each check is a unit with:
Each check must complete within a hard 3-second timeout; timeouts become
Warnwith a "check timed out" message.Output
Human-readable (default):
JSON (with
--json):{ "schema": 1, "version": "0.21.0", "summary": {"pass": 43, "warn": 1, "fail": 1, "skip": 2}, "checks": [ {"id": "platform.os", "status": "pass", "message": "Linux 6.17.0 x86_64"}, {"id": "nvidia.nvml.loadable", "status": "fail", "message": "libnvidia-ml.so.1 not found", "fix": "install NVIDIA driver..."} ] }Exit codes:
0all pass (warnings allowed if--strictnot set),1warnings only,2any failures.Support bundle
--bundle report.tar.gzpacks:report.txt(human output)report.json(json output)env.txt(filtered env vars)uname.txt(uname -a)lspci.txt(lspci -vv | grep -iE 'vga|3d|display|accel') — Linux onlylsmod.txt— Linux onlydmesg-gpu.txt— last 200 lines matching GPU keywords (may need elevated privileges; skip if unavailable)version.txt(all-smi --version, build features, target triple)config.tomleffective merged config (if the config-file issue has landed) with sensitive fields redactedsystem_profiler SPDisplaysDataTypeoutput.Privacy:
--include-identifierskeeps them (opt-in).Implementation plan
Files to add / modify:
src/cli.rs—Doctor(DoctorArgs).src/doctor/mod.rs— orchestrator, parallel execution with a small tokio pool, timeouts.src/doctor/checks/platform.rs,privileges.rs,container.rs,nvidia.rs,amd.rs,apple.rs,gaudi.rs,tpu.rs,tenstorrent.rs,rebellions.rs,furiosa.rs,windows.rs,env.rs,network.rssrc/doctor/report.rs— human + JSON renderers.src/doctor/bundle.rs— tar.gz packer (usestar = "0.4"+flate2).src/doctor/redact.rs— regex-based hostname/IP/MAC/user redaction.src/device/platform_detection.rs— extract detection routines into a publicintrospectionAPI so doctor and reader_factory share them.tests/doctor_test.rs— smoke test covering a few check IDs on the host.cargo run -- doctor --jsonon the CI image and asserts exit 0/1 (no hard failures on the CI runner).Acceptance criteria
all-smi doctorcompletes in under 3 seconds on a machine with no GPUs.--jsonemits a valid JSON document matching a fixed schema (withschema: 1).--bundle report.tar.gzproduces a tar.gz containingreport.txt,report.json, and the applicable system files.--include-identifierskeeps them;--verboseprints extra context per check.--skip nvidia.nvml.loadableomits that check;--only privileges.*runs only that group.NO_COLORenv var.panic!("not supported")).all-smi doctor --bundleoutput.Edge cases & non-goals
which fooreturns None →Skip).lspci,lsmod) must time out — do not useCommand::output()without a bounded timeout; use tokio with a deadline.system_profileris slow; offer it behind--verboseonly.lsmod,lspci,uname,ldd).dmesgoutput that contains kernel addresses if kptr_restrict is on — follow kernel redaction.