Summary
Add a new all-smi snapshot subcommand that produces a one-shot machine-readable dump of the current hardware state (GPU/CPU/memory/chassis/process/storage) to stdout. This unlocks scripting, CI probes, Slurm prolog/epilog hooks, and quick jq/yq piping without having to start the long-running api server.
Motivation
Today the only machine-readable output path is the api subcommand's Prometheus /metrics endpoint, which requires running a long-lived HTTP server. The library API returns typed values but is only accessible from Rust code. Operators and CI users regularly want the nvidia-smi --query-gpu=... --format=csv ergonomics — one process invocation, machine-readable stdout, non-zero exit on failure. all-smi has all the data already (every device type already implements Serialize), but no CLI path to emit it for a single collection cycle.
Current state
AllSmi::new() library API returns Vec<GpuInfo>, Vec<CpuInfo>, etc., all Serialize + Deserialize.
api mode uses these to build Prometheus text — long-lived server only.
local and view subcommands enter a TUI loop — not scriptable.
No CLI path for "collect once, print JSON/CSV, exit".
Proposed design
New subcommand: all-smi snapshot
Semantics:
Default: --format json --pretty --include gpu,cpu,memory,chassis (no process/storage by default to keep it fast).
--format json prints a single top-level object { "schema": 1, "timestamp": "2026-04-18T...Z", "gpus": [...], "cpus": [...], ... } matching the library types' serde schema.
--format csv flattens to one row per device. Default columns per --include type; --query overrides with a comma-separated field list using dot paths (e.g. memory.used, detail.cuda_version).
--format prometheus emits the exact Prometheus exposition the /metrics endpoint would emit for this single collection — i.e. reuses the exporter.
--samples N --interval T collects N samples T seconds apart and emits a JSON array (or CSV with repeated rows sharing a timestamp column).
--output path writes to a file instead of stdout; - means stdout (default).
--timeout-ms applies per-reader (TPU/Gaudi can be slow); reader failures are reported under a top-level errors array in JSON mode and as an errors column or stderr line in CSV/prometheus modes.
Process and storage sections are opt-in because they are expensive.
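The --query dot paths described above can be illustrated with a small selector. This is a minimal sketch, not the proposed implementation: the Value enum is a hypothetical stand-in for serde_json::Value (so the example needs only the standard library), and select, query_row, and sample_gpu are illustrative names with made-up fields. Missing keys produce an empty cell instead of panicking.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for `serde_json::Value`, reduced to what this sketch needs.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Null,
    Number(f64),
    Text(String),
    Object(BTreeMap<String, Value>),
}

/// Follow a dot path such as "memory.used" through nested objects.
/// Returns `None` as soon as a segment is missing or a non-object is hit.
fn select<'a>(root: &'a Value, path: &str) -> Option<&'a Value> {
    let mut current = root;
    for segment in path.split('.') {
        match current {
            Value::Object(map) => current = map.get(segment)?,
            _ => return None,
        }
    }
    Some(current)
}

/// Build one row of unescaped cells for a comma-separated `--query` list.
/// Missing, null, and nested-object values all become empty cells.
fn query_row(device: &Value, query: &str) -> Vec<String> {
    query
        .split(',')
        .map(|path| match select(device, path.trim()) {
            Some(Value::Number(n)) => n.to_string(),
            Some(Value::Text(s)) => s.clone(),
            _ => String::new(),
        })
        .collect()
}

/// Small fixture: a device with a nested `memory` object (illustrative fields only).
fn sample_gpu() -> Value {
    let mut memory = BTreeMap::new();
    memory.insert("used".to_string(), Value::Number(1024.0));
    let mut gpu = BTreeMap::new();
    gpu.insert("name".to_string(), Value::Text("gpu0".to_string()));
    gpu.insert("memory".to_string(), Value::Object(memory));
    gpu.insert("temperature".to_string(), Value::Null);
    Value::Object(gpu)
}
```

With this fixture, query_row(&sample_gpu(), "name,memory.used,detail.cuda_version") yields the cells gpu0 and 1024, followed by an empty string for the missing detail.cuda_version path.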
Exit codes:
0 — success (possibly with partial errors array).
1 — hard failure (no devices collected at all).
2 — CLI / flag parse error.
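The exit-code rules above can be sketched as a single mapping. SnapshotOutcome and its fields are hypothetical names introduced for illustration, not existing all-smi types. Flag parse errors are modeled as their own variant because they occur before any reader runs.

```rust
/// Hypothetical result of one `snapshot` invocation; names are illustrative only.
#[derive(Debug)]
enum SnapshotOutcome {
    /// Argument parsing failed before any reader ran.
    BadArgs,
    /// A collection cycle ran; some readers may have failed.
    Collected { devices: usize, reader_errors: Vec<String> },
}

/// Map an outcome to the proposed process exit code.
fn exit_code(outcome: &SnapshotOutcome) -> u8 {
    match outcome {
        SnapshotOutcome::BadArgs => 2,
        // Nothing collected at all is a hard failure, whatever the reader errors were.
        SnapshotOutcome::Collected { devices: 0, .. } => 1,
        // Partial failures still exit 0; they are reported in the `errors` array.
        SnapshotOutcome::Collected { .. } => 0,
    }
}

/// True when the run succeeded but should still carry a non-empty `errors` array.
fn is_partial(outcome: &SnapshotOutcome) -> bool {
    matches!(
        outcome,
        SnapshotOutcome::Collected { devices, reader_errors }
            if *devices > 0 && !reader_errors.is_empty()
    )
}
```

The returned u8 can be handed to std::process::ExitCode::from in main, so scripts can tell a hard failure (1) from a successful run that still reports partial reader errors (0).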
Implementation plan
Files to add / modify:
src/cli.rs — add Snapshot(SnapshotArgs) variant with the flags above.
src/main.rs — dispatch.
New module src/snapshot/mod.rs containing:
SnapshotOptions struct parsed from SnapshotArgs.
fn run(opts: SnapshotOptions) -> anyhow::Result<()> that drives a single collection using the existing AllSmi library client and reuses src/api/metrics/* exporters for the prometheus format path.
serializers/json.rs, serializers/csv.rs, serializers/prometheus.rs for each format. CSV serialization lives here rather than pulling in the csv crate; rows are written manually to keep dependencies light.
query.rs — dot-path field selector evaluated against serde_json::Value so every device type is supported uniformly.
src/lib.rs — re-export SnapshotOptions and snapshot::run for programmatic use.
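Writing CSV rows by hand, as planned for serializers/csv.rs, mostly comes down to quoting fields correctly. The following is a minimal sketch assuming RFC 4180-style quoting. escape_field and write_row are hypothetical names, and the real serializer may well differ.

```rust
use std::io::{self, Write};

/// Quote a field only when it contains a comma, quote, or line break,
/// doubling any embedded quotes (RFC 4180-style).
fn escape_field(field: &str) -> String {
    let needs_quotes = field.contains(|c: char| matches!(c, ',' | '"' | '\n' | '\r'));
    if needs_quotes {
        format!("\"{}\"", field.replace('"', "\"\""))
    } else {
        field.to_string()
    }
}

/// Write one CSV row, terminated by a newline, to any writer
/// (stdout, a file, or an in-memory buffer).
fn write_row<W: Write>(out: &mut W, cells: &[&str]) -> io::Result<()> {
    let line = cells
        .iter()
        .map(|cell| escape_field(cell))
        .collect::<Vec<_>>()
        .join(",");
    writeln!(out, "{}", line)
}
```

Because write_row accepts any Write implementor, the same function can serve both the stdout default and the --output file path, and it can be tested against a Vec<u8> buffer.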
src/traits/exporter.rs: extend ExporterError variants if needed; reuse existing SerializationError, FormatError, UnsupportedFormat.
Reuse rules:
JSON output follows the library types' serde schema; "schema": 1 pins this.
Prometheus output reuses the exporters in src/api/metrics/*; do not re-implement.
Acceptance criteria
all-smi snapshot prints valid JSON with at least schema, timestamp, and one device array (gpus on supported systems; cpus + memory always).
all-smi snapshot --format csv --query index,name,utilization,temperature prints a CSV with a header row matching the query and one row per GPU.
all-smi snapshot --format prometheus byte-for-byte matches a single scrape of api mode's /metrics for the same data.
all-smi snapshot --include cpu,memory omits gpus entirely (the key is absent, not an empty array).
all-smi snapshot --samples 3 --interval 1 emits a JSON array of 3 objects taken ~1s apart.
all-smi snapshot --output /tmp/x.json writes to that file and prints nothing to stdout.
tests/snapshot_test.rs covering JSON and CSV against mock readers.
[…] jq, Slurm epilog).
Edge cases & non-goals
jq must work (no ANSI colors, no TTY probes on --format json).
--pretty default is on when stdout is a TTY and off otherwise.
--query dot paths into nested detail HashMap entries must not panic on missing keys (emit empty cell / null).
[…] --timeout-ms and surface the failure in errors rather than hanging.
[…] errors.
[…] api mode is for.
Soft dependencies
[…] all-smi snapshot --format json as its primary path (with an nvidia-smi CSV shim as fallback).