feat: 'all-smi doctor' subcommand for self-diagnosis and support bundle

## Summary

Add a new `all-smi doctor` subcommand that runs a battery of environment checks (drivers, permissions, libraries, env vars, container state, optional network) and prints a readable PASS/WARN/FAIL report. A `--bundle` flag packs the report plus relevant system context into a tar.gz suitable for attaching to a bug report; a `--json` flag emits the same data for CI/script consumption.

## Motivation

Recurring support questions follow the pattern "why doesn't all-smi detect my GPUs?" — the answers are almost always one of: driver not loaded, wrong user group (video/render), ROCm/libamdgpu_top missing, NVML not loadable, sudo not used on macOS, compiled without the Furiosa feature flag, wrong Rust target. Today users and maintainers must guess from absent output. A standard `doctor` command gives an opinionated, fast, deterministic answer and dramatically reduces support round-trips.

## Current state

- `src/device/platform_detection.rs` has some detection logic, but it is not exposed as an introspection API.
- `src/device/reader_factory.rs` decides which reader to instantiate, silently falling back.
- No `doctor` subcommand exists.

## Proposed design

### Subcommand shape

```
all-smi doctor [--json] [--verbose]
               [--bundle <path.tar.gz>]
               [--include-identifiers]
               [--remote-check <host-or-url>]...
               [--skip <check-id>]...
               [--only <check-id>]...
```

### Check categories

Each check has a stable ID (`nvidia.nvml.loadable`) so users and scripts can opt in/out and so output is greppable across versions.

- **platform.***: OS, kernel, arch, CPU model, socket count, memory total, uptime, compilation target (glibc vs musl), Rust target triple.
- **privileges.***: uid, effective uid, sudo/root status, macOS — running as root? video/render group membership, CAP_SYS_ADMIN, /dev/dri access, /dev/tenstorrent access.
- **container.***: is this running inside a container? cgroup v1/v2, CPU/memory limits, cpuset pinning, Kubernetes ServiceAccount token presence.
- **nvidia.***: NVML loadable, driver version, `nvidia-smi` binary path + version, NVIDIA_VISIBLE_DEVICES/CUDA_VISIBLE_DEVICES env, MIG mode, vGPU host detection.
- **amd.***: ROCm detected, libamdgpu_top ABI version, /dev/dri permissions, compiled with `not(target_env = "musl")` (AMD not available in musl builds).
- **apple.***: macOS version, IOReport API accessible, SMC accessible, running as root (required on macOS).
- **gaudi.***: hl-smi binary path, HL device nodes, driver version.
- **tpu.***: libtpu path, TPU_NAME env, JAX/Python interop reachable.
- **tenstorrent.***: luwen library, /dev/tenstorrent, tt-kmd module loaded.
- **rebellions.***: rbln-stat path, rbln driver version.
- **furiosa.***: compiled with `furiosa` feature flag, furiosa-smi binary.
- **windows.***: WMI thermal zones, AMD Ryzen Master SDK, Intel WMI, LibreHardwareMonitor running (for CPU temp fallback chain).
- **env.***: Every `ALL_SMI_*` env var, `CUDA_*`, `ROCR_*`, `TPU_*`, `HL_*` env vars redacted where appropriate.
- **network.*** (only when `--remote-check` is given): DNS resolution, TCP reachability, latency, HTTP GET on `/metrics` with version check.

Each check is a unit with:

```rust
pub struct Check {
    pub id: &'static str,
    pub title: &'static str,
    pub severity_on_fail: Severity, // Info/Warn/Error
    pub run: fn(&Ctx) -> CheckResult,
}

pub enum CheckResult {
    Pass(String),             // one-line explanation
    Warn(String, Option<String>), // short + optional "how to fix" hint
    Fail(String, Option<String>),
    Skip(String),             // not applicable on this platform
}
```

Each check must complete within a hard 3-second timeout; timeouts become `Warn` with a "check timed out" message.

### Output

Human-readable (default):

```
all-smi doctor — 0.21.0
Running 47 checks…

PASS platform.os              Linux 6.17.0 x86_64
PASS platform.runtime         glibc target
PASS privileges.user          uid=1000 in group render
FAIL nvidia.nvml.loadable     libnvidia-ml.so.1 not found
     → Fix: install NVIDIA driver (nvidia-driver-580+) or set LD_LIBRARY_PATH
WARN amd.rocm.version         ROCm 6.3.0 detected; all-smi tested against 7.0.x+
SKIP apple.smc                Not on macOS

Summary: 43 pass, 1 warn, 1 fail, 2 skipped
Exit code: 2
```

JSON (with `--json`):
```json
{
  "schema": 1,
  "version": "0.21.0",
  "summary": {"pass": 43, "warn": 1, "fail": 1, "skip": 2},
  "checks": [
    {"id": "platform.os", "status": "pass", "message": "Linux 6.17.0 x86_64"},
    {"id": "nvidia.nvml.loadable", "status": "fail", "message": "libnvidia-ml.so.1 not found", "fix": "install NVIDIA driver..."}
  ]
}
```

Exit codes: `0` all pass (warnings allowed if `--strict` not set), `1` warnings only, `2` any failures.

### Support bundle

`--bundle report.tar.gz` packs:

- `report.txt` (human output)
- `report.json` (json output)
- `env.txt` (filtered env vars)
- `uname.txt` (`uname -a`)
- `lspci.txt` (`lspci -vv | grep -iE 'vga|3d|display|accel'`) — Linux only
- `lsmod.txt` — Linux only
- `dmesg-gpu.txt` — last 200 lines matching GPU keywords (may need elevated privileges; skip if unavailable)
- `version.txt` (`all-smi --version`, build features, target triple)
- `config.toml` effective merged config (if the config-file issue has landed) with sensitive fields redacted
- On macOS: `system_profiler SPDisplaysDataType` output.

Privacy:
- Default bundle scrubs hostnames, IP addresses, usernames, MAC addresses.
- `--include-identifiers` keeps them (opt-in).

## Implementation plan

Files to add / modify:

- `src/cli.rs` — `Doctor(DoctorArgs)`.
- New module tree:
  - `src/doctor/mod.rs` — orchestrator, parallel execution with a small tokio pool, timeouts.
  - `src/doctor/checks/platform.rs`, `privileges.rs`, `container.rs`, `nvidia.rs`, `amd.rs`, `apple.rs`, `gaudi.rs`, `tpu.rs`, `tenstorrent.rs`, `rebellions.rs`, `furiosa.rs`, `windows.rs`, `env.rs`, `network.rs`
  - `src/doctor/report.rs` — human + JSON renderers.
  - `src/doctor/bundle.rs` — tar.gz packer (uses `tar = "0.4"` + `flate2`).
  - `src/doctor/redact.rs` — regex-based hostname/IP/MAC/user redaction.
- `src/device/platform_detection.rs` — extract detection routines into a public `introspection` API so doctor and reader_factory share them.
- `tests/doctor_test.rs` — smoke test covering a few check IDs on the host.
- CI: a job runs `cargo run -- doctor --json` on the CI image and asserts exit 0/1 (no hard failures on the CI runner).

## Acceptance criteria

- [x] `all-smi doctor` completes in under 3 seconds on a machine with no GPUs.
- [x] Every check outputs PASS/WARN/FAIL/SKIP with a one-line explanation and a fix hint on WARN/FAIL.
- [x] Every check has a stable ID; IDs are documented in the README.
- [x] `--json` emits a valid JSON document matching a fixed schema (with `schema: 1`).
- [x] `--bundle report.tar.gz` produces a tar.gz containing `report.txt`, `report.json`, and the applicable system files.
- [x] By default, bundles do not contain hostnames, IPs, MAC addresses, or local usernames.
- [x] `--include-identifiers` keeps them; `--verbose` prints extra context per check.
- [x] `--skip nvidia.nvml.loadable` omits that check; `--only privileges.*` runs only that group.
- [x] Respects `NO_COLOR` env var.
- [x] On Windows, the subcommand runs to completion (no `panic!("not supported")`).
- [x] Exit codes as specified.
- [x] README gains a "Diagnostics" section; issue templates ask for `all-smi doctor --bundle` output.

## Edge cases & non-goals

- Each check must be robust to missing binaries (`which foo` returns None → `Skip`).
- Spawning child processes (`lspci`, `lsmod`) must time out — do not use `Command::output()` without a bounded timeout; use tokio with a deadline.
- On macOS, `system_profiler` is slow; offer it behind `--verbose` only.
- Checks must not mutate state (read-only tools like `lsmod`, `lspci`, `uname`, `ldd`).
- Bundle must NOT include `dmesg` output that contains kernel addresses if kptr_restrict is on — follow kernel redaction.
- Non-goal: fixing detected problems. This is diagnosis only.
- Non-goal: uploading bundles. Users attach to GitHub issues themselves.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: 'all-smi doctor' subcommand for self-diagnosis and support bundle #188

Summary

Motivation

Current state

Proposed design

Subcommand shape

Check categories

Output

Support bundle

Implementation plan

Acceptance criteria

Edge cases & non-goals

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

feat: 'all-smi doctor' subcommand for self-diagnosis and support bundle #188

Description

Summary

Motivation

Current state

Proposed design

Subcommand shape

Check categories

Output

Support bundle

Implementation plan

Acceptance criteria

Edge cases & non-goals

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions