feat: add 'snapshot' subcommand for one-shot JSON/CSV/Prometheus output #185

Description

@inureyes

Summary

Add a new all-smi snapshot subcommand that produces a one-shot machine-readable dump of the current hardware state (GPU/CPU/memory/chassis/process/storage) to stdout. This unlocks scripting, CI probes, Slurm prolog/epilog hooks, and quick jq/yq piping without having to start the long-running api server.

Motivation

Today the only machine-readable output path is the api subcommand's Prometheus /metrics endpoint, which requires running a long-lived HTTP server. The library API returns typed values but is only accessible from Rust code. Operators and CI users regularly want the nvidia-smi --query-gpu=... --format=csv ergonomics — one process invocation, machine-readable stdout, non-zero exit on failure. all-smi has all the data already (every device type already implements Serialize), but no CLI path to emit it for a single collection cycle.

Current state

  • AllSmi::new() library API returns Vec<GpuInfo>, Vec<CpuInfo>, etc., all Serialize + Deserialize.
  • api mode uses these to build Prometheus text — long-lived server only.
  • local and view subcommands enter a TUI loop — not scriptable.
  • No CLI path for "collect once, print JSON/CSV, exit".

Proposed design

New subcommand all-smi snapshot:

all-smi snapshot [--format json|csv|prometheus] [--pretty]
                 [--include gpu,cpu,memory,chassis,process,storage]
                 [--query index,uuid,name,utilization,memory.used,memory.total,temperature,power]
                 [--interval <secs>] [--samples <n>]
                 [--timeout-ms <n>] [--output <path>]

Semantics:

  • Default: --format json --include gpu,cpu,memory,chassis (no process/storage by default to keep it fast). --pretty defaults to on when stdout is a TTY and off otherwise.
  • --format json prints a single top-level object { "schema": 1, "timestamp": "2026-04-18T...Z", "gpus": [...], "cpus": [...], ... } matching the library types' serde schema.
  • --format csv flattens to one row per device. Default columns per --include type; --query overrides with a comma-separated field list using dot paths (e.g. memory.used, detail.cuda_version).
  • --format prometheus emits the exact Prometheus exposition the /metrics endpoint would emit for this single collection — i.e. reuses the exporter.
  • --samples N --interval T collects N samples T seconds apart and emits a JSON array of N snapshot objects (or, in CSV mode, repeated rows where all rows from the same sample share a value in a timestamp column).
  • --output path writes to a file instead of stdout; - means stdout (default).
  • --timeout-ms applies per-reader (TPU/Gaudi can be slow); reader failures are reported under a top-level errors array in JSON mode and as an errors column or stderr line in CSV/prometheus modes.
  • Process and storage sections are opt-in because they are expensive.
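
The per-reader --timeout-ms behaviour could be sketched as below, using only the standard library. read_with_timeout and the reader closures are hypothetical names for illustration, not all-smi's actual reader API. A reader that overruns its budget yields an error string destined for the errors array instead of blocking the whole snapshot.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Run one reader on its own thread and wait at most `timeout` for its result.
/// Hypothetical helper: the real readers and their return types will differ.
fn read_with_timeout<T, F>(name: &str, timeout: Duration, reader: F) -> Result<T, String>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // If the caller already gave up, the send fails; that is fine.
        let _ = tx.send(reader());
    });
    match rx.recv_timeout(timeout) {
        Ok(value) => Ok(value),
        Err(RecvTimeoutError::Timeout) => Err(format!(
            "{}: reader timed out after {} ms",
            name,
            timeout.as_millis()
        )),
        Err(RecvTimeoutError::Disconnected) => {
            Err(format!("{}: reader exited without a result", name))
        }
    }
}

fn main() {
    let mut errors: Vec<String> = Vec::new();

    // A fast reader finishes well inside the budget.
    let cpu = read_with_timeout("cpu", Duration::from_millis(2000), || vec!["cpu0"]);
    // A slow reader (simulated with sleep) is reported, not awaited.
    let tpu = read_with_timeout("tpu", Duration::from_millis(20), || {
        thread::sleep(Duration::from_millis(300));
        vec!["tpu0"]
    });

    for result in [cpu, tpu] {
        match result {
            Ok(devices) => println!("collected {:?}", devices),
            Err(e) => errors.push(e),
        }
    }
    eprintln!("errors: {:?}", errors);
}
```

Note that the timed-out thread is detached rather than cancelled; for a one-shot process that exits right after printing, that is probably acceptable, but it is a design choice worth confirming.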

Exit codes:

  • 0 — success (possibly with partial errors array).
  • 1 — hard failure (no devices collected at all).
  • 2 — CLI / flag parse error.
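
A minimal sketch of how the three exit codes might be derived from a run. RunResult and exit_status are hypothetical names introduced here; the assumption is that "hard failure" means zero devices were collected, as stated in the list above, and that partial reader errors alone still map to 0.

```rust
use std::process::ExitCode;

/// Hypothetical summary of one snapshot run.
#[allow(dead_code)]
enum RunResult {
    /// Flags could not be parsed; no collection was attempted.
    FlagError,
    /// Collection ran; some readers may have failed.
    Collected { devices: usize, reader_errors: usize },
}

/// Map a run summary to the exit codes listed in the issue.
fn exit_status(result: &RunResult) -> u8 {
    match result {
        RunResult::FlagError => 2,
        RunResult::Collected { devices: 0, .. } => 1,
        RunResult::Collected { .. } => 0,
    }
}

fn main() -> ExitCode {
    // Two devices collected, one reader failed: still a success (exit 0).
    let result = RunResult::Collected { devices: 2, reader_errors: 1 };
    if let RunResult::Collected { reader_errors, .. } = &result {
        if *reader_errors > 0 {
            eprintln!("{} reader(s) failed; see the errors array", reader_errors);
        }
    }
    ExitCode::from(exit_status(&result))
}
```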

Implementation plan

Files to add / modify:

  • src/cli.rs — add Snapshot(SnapshotArgs) variant with the flags above.
  • src/main.rs — dispatch.
  • New module src/snapshot/mod.rs containing:
    • SnapshotOptions struct parsed from SnapshotArgs.
    • fn run(opts: SnapshotOptions) -> anyhow::Result<()> that drives a single collection using the existing AllSmi library client and reuses src/api/metrics/* exporters for the prometheus format path.
    • serializers/json.rs, serializers/csv.rs, serializers/prometheus.rs, one per format. The CSV serializer writes rows manually rather than pulling in the csv crate, to keep dependencies light.
    • query.rs — dot-path field selector evaluated against serde_json::Value so every device type is supported uniformly.
  • src/lib.rs — re-export SnapshotOptions and snapshot::run for programmatic use.
  • src/traits/exporter.rs — extend ExporterError variants if needed; reuse existing SerializationError, FormatError, UnsupportedFormat.
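
The dot-path selector planned for query.rs could look roughly like this. To keep the sketch std-only, a small Value enum stands in for serde_json::Value; the enum, select, cell, and the sample field layout (memory.used, detail.cuda_version as nested maps) are all assumptions for illustration and may not match the real serde schema.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for serde_json::Value, kept std-only for the sketch.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Null,
    Num(f64),
    Str(String),
    Map(BTreeMap<String, Value>),
}

/// Walk a dot path such as "memory.used"; any missing segment yields None.
fn select<'a>(root: &'a Value, path: &str) -> Option<&'a Value> {
    path.split('.').try_fold(root, |cur, seg| match cur {
        Value::Map(m) => m.get(seg),
        _ => None,
    })
}

/// Render a selected value as a CSV cell; missing, null, or nested is empty.
fn cell(v: Option<&Value>) -> String {
    match v {
        Some(Value::Num(n)) => n.to_string(),
        Some(Value::Str(s)) => s.clone(),
        Some(Value::Null) | Some(Value::Map(_)) | None => String::new(),
    }
}

/// Build a made-up device record to query against.
fn sample_gpu() -> Value {
    let mut memory = BTreeMap::new();
    memory.insert("used".to_string(), Value::Num(1024.0));
    let mut detail = BTreeMap::new();
    detail.insert("cuda_version".to_string(), Value::Str("12.4".to_string()));
    let mut gpu = BTreeMap::new();
    gpu.insert("name".to_string(), Value::Str("GPU 0".to_string()));
    gpu.insert("memory".to_string(), Value::Map(memory));
    gpu.insert("detail".to_string(), Value::Map(detail));
    Value::Map(gpu)
}

fn main() {
    let gpu = sample_gpu();
    for path in ["name", "memory.used", "detail.cuda_version", "detail.missing"] {
        println!("{} => {:?}", path, cell(select(&gpu, path)));
    }
}
```

Returning Option from select keeps the "must not panic on missing keys" requirement in the type: the caller decides whether a miss becomes an empty CSV cell or a JSON null.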

Reuse rules:

  • The JSON schema MUST be the same as the SSE endpoint's when that lands (see companion issue). The schema version field "schema": 1 pins this.
  • The Prometheus serializer MUST be the same codepath as src/api/metrics/*; do not re-implement.

Acceptance criteria

  • all-smi snapshot prints valid JSON with at least schema, timestamp, and one device array (gpus on supported systems; cpus + memory always).
  • all-smi snapshot --format csv --query index,name,utilization,temperature prints a CSV with a header row matching the query and one row per GPU.
  • all-smi snapshot --format prometheus byte-for-byte matches a single scrape of api mode's /metrics for the same data.
  • all-smi snapshot --include cpu,memory omits gpus entirely (not an empty array — absent key).
  • all-smi snapshot --samples 3 --interval 1 emits a JSON array of 3 objects taken ~1s apart.
  • all-smi snapshot --output /tmp/x.json writes to that file and prints nothing to stdout.
  • Hard failure (e.g., all readers error) exits 1; flag parse error exits 2.
  • Works on all supported platforms — macOS, Linux (NVIDIA/AMD/Jetson/Gaudi/TPU/Tenstorrent/Rebellions/Furiosa where applicable), Windows.
  • Integration test under tests/snapshot_test.rs covering JSON and CSV against mock readers.
  • README gains a "Scripting / CI" section with examples (jq, Slurm epilog).
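
For the CSV criterion above (a header row matching the query, then one row per GPU), a hand-written serializer without the csv crate might be sketched as follows. csv_field, csv_row, and render_csv are hypothetical names; the quoting follows the RFC 4180 convention, and the choice of "\n" as the row terminator is an assumption.

```rust
/// Quote a field when it contains a comma, double quote, CR, or LF,
/// doubling embedded quotes (the RFC 4180 convention).
fn csv_field(raw: &str) -> String {
    if raw.chars().any(|c| matches!(c, ',' | '"' | '\n' | '\r')) {
        format!("\"{}\"", raw.replace('"', "\"\""))
    } else {
        raw.to_string()
    }
}

/// Join already-stringified cells into one CSV line (no trailing newline).
fn csv_row(fields: &[&str]) -> String {
    fields
        .iter()
        .map(|f| csv_field(f))
        .collect::<Vec<_>>()
        .join(",")
}

/// Header row mirrors the --query list; one row follows per device.
fn render_csv(query: &[&str], rows: &[Vec<&str>]) -> String {
    let mut out = csv_row(query);
    out.push('\n');
    for row in rows {
        out.push_str(&csv_row(row));
        out.push('\n');
    }
    out
}

fn main() {
    let query = ["index", "name", "utilization", "temperature"];
    // Made-up sample values, not real device output.
    let rows = vec![
        vec!["0", "Example GPU, rev A", "71", "64"],
        vec!["1", "Example GPU, rev A", "3", "41"],
    ];
    print!("{}", render_csv(&query, &rows));
}
```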

Edge cases & non-goals

  • Piping to jq must work (no ANSI colors, no TTY probes on --format json). --pretty default is on when stdout is a TTY and off otherwise.
  • --query dot paths into nested detail HashMap entries must not panic on missing keys (emit empty cell / null).
  • Slow readers (TPU, Gaudi) must respect --timeout-ms and surface the failure in errors rather than hanging.
  • Running on macOS without sudo must still emit what it can (CPU/memory/chassis) and note missing GPU access in errors.
  • Non-goal: long-term metric scraping — that is what api mode is for.
  • Non-goal: streaming — see the SSE companion issue.

Soft dependencies

  • Issue "SSE streaming endpoint" reuses this JSON schema.
  • Issue "Agentless SSH mode" remotely executes all-smi snapshot --format json as its primary path (with an nvidia-smi CSV shim as fallback).
