Summary
Add the ability to persistently capture a metric stream to a file (all-smi record) and replay it later in the TUI (all-smi view --replay file.ndjson). Replay supports play/pause, step, seek, and speed controls so operators can investigate past incidents without a Prometheus instance or full observability stack.
Motivation
When an incident happens on a cluster, the current all-smi modes cannot reproduce state post-hoc: local/view are live only, and api requires a running Prometheus to retain history. Operators and researchers regularly want to say "rewind to 14:32, that's when throughput cratered" and see the exact same TUI they would have seen live. A compact self-contained recording format unlocks this without the operational overhead of a full monitoring stack.
Current state
- src/view/render_snapshot.rs already materializes a full snapshot per frame, a natural structural fit for a per-frame record.
- Data types (GpuInfo, CpuInfo, ChassisInfo, MemoryInfo, process list, …) all implement Serialize + Deserialize.
- No persistence today.
- src/view/data_collection/strategy.rs already abstracts local vs remote collection, a clean place to add a ReplayStrategy.
Proposed design
Record subcommand
all-smi record [--output path.ndjson | path.ndjson.zst | path.ndjson.gz]
[--interval <secs>]
[--duration 1h | 0]
[--source local|remote]
[--hosts ...] [--hostfile ...]
[--include gpu,cpu,memory,chassis,process]
[--max-size 100M] [--max-files 10]
[--compress zstd|gzip|none]
Behavior:
- Source selection mirrors view (local reader vs remote HTTP scrape), so recording a remote cluster works too.
- Writes one NDJSON line per collection cycle. Each line is a self-contained JSON object with { "schema": 1, "t": "2026-04-18T...Z", "gpus": [...], "cpus": [...], ... }, the same shape as the snapshot subcommand's JSON output, sharing serialization code.
- Rotation on size: when output exceeds --max-size, close, rename to path.0001.ndjson[.zst], open a new file. Keep at most --max-files segments.
- Compression: auto-detect from extension, or explicit --compress. Prefer zstd as the default.
- --duration 0 means record until SIGTERM; on signal, flush and close cleanly.
- On error reading a device, emit { "t": ..., "errors": ["nvidia: NotSupported", ...] } lines so the replay shows gaps instead of silently skipping.
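The compression and rotation rules above can be sketched with the standard library alone. All names here (Compression, detect_compression, segment_path, segments_to_evict) are hypothetical and not part of all-smi. The sketch assumes that a plain .ndjson path with no --compress flag means uncompressed output, and that lower segment numbers are older; the proposal does not pin down either point.

```rust
use std::path::{Path, PathBuf};

/// Compression choices matching the proposed --compress values.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Compression {
    None,
    Zstd,
    Gzip,
}

/// An explicit --compress value wins; otherwise fall back to the
/// output path's final extension. Unknown extensions mean no compression
/// (an assumption of this sketch).
pub fn detect_compression(path: &Path, explicit: Option<Compression>) -> Compression {
    if let Some(c) = explicit {
        return c;
    }
    match path.extension().and_then(|e| e.to_str()) {
        Some("zst") => Compression::Zstd,
        Some("gz") => Compression::Gzip,
        _ => Compression::None,
    }
}

/// Name of rotated segment `n`, e.g. trace.ndjson.zst -> trace.0001.ndjson.zst.
/// If the file name has no ".ndjson" part, the number is appended instead.
pub fn segment_path(base: &Path, n: u32) -> PathBuf {
    let name = base.file_name().and_then(|f| f.to_str()).unwrap_or("");
    let rotated = match name.find(".ndjson") {
        Some(i) => format!("{}.{:04}{}", &name[..i], n, &name[i..]),
        None => format!("{}.{:04}", name, n),
    };
    base.with_file_name(rotated)
}

/// Segment numbers to delete so that at most `max_files` remain.
/// Assumes lower numbers are older segments.
pub fn segments_to_evict(mut existing: Vec<u32>, max_files: usize) -> Vec<u32> {
    existing.sort_unstable();
    let excess = existing.len().saturating_sub(max_files);
    existing.truncate(excess);
    existing
}
```

Keeping these as pure functions means the rotation policy can be unit-tested without touching the filesystem; the recorder would call them when the current file crosses --max-size.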
Replay
Extend all-smi view with:
all-smi view --replay path.ndjson[.zst|.gz]
[--speed 1.0] [--start 00:30] [--loop]
- On entering replay mode, the TUI status bar shows REPLAY | 00:12:34 / 01:00:00 | 2.0x | paused.
- New keybindings (active only in replay mode):
  - SPACE: play/pause
  - ]: step one frame forward; [: step one frame back
  - + / -: speed up/down (0.25x, 0.5x, 1x, 2x, 4x, 8x)
  - j / k: seek backward/forward by 10s
  - g: open timecode entry, type HH:MM:SS + Enter to jump
  - L: toggle looping
- The replay strategy feeds the same RenderSnapshot pipeline as live data. No renderer code should change behavior based on mode; only the status bar shows the replay chip.
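As a rough illustration of the replay controls listed above, the key table, speed ladder, timecode entry, and status chip could be modeled as below. Every name (ReplayAction, action_for_key, step_speed, parse_timecode, status_chip) is hypothetical, and the "playing" label is an assumption, since the proposal only shows the paused state.

```rust
use std::time::Duration;

/// Speed steps from the keybinding list.
pub const SPEEDS: [f32; 6] = [0.25, 0.5, 1.0, 2.0, 4.0, 8.0];

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ReplayAction {
    TogglePause,
    StepForward,
    StepBack,
    SpeedUp,
    SpeedDown,
    SeekBack(Duration),
    SeekForward(Duration),
    OpenTimecode,
    ToggleLoop,
}

/// Map a key to a replay action; None means the key is not a replay key.
pub fn action_for_key(key: char) -> Option<ReplayAction> {
    let ten = Duration::from_secs(10);
    Some(match key {
        ' ' => ReplayAction::TogglePause,
        ']' => ReplayAction::StepForward,
        '[' => ReplayAction::StepBack,
        '+' => ReplayAction::SpeedUp,
        '-' => ReplayAction::SpeedDown,
        'j' => ReplayAction::SeekBack(ten),
        'k' => ReplayAction::SeekForward(ten),
        'g' => ReplayAction::OpenTimecode,
        'L' => ReplayAction::ToggleLoop,
        _ => return None,
    })
}

/// Move one step along SPEEDS, clamping at both ends. `idx` indexes SPEEDS.
pub fn step_speed(idx: usize, up: bool) -> usize {
    if up { (idx + 1).min(SPEEDS.len() - 1) } else { idx.saturating_sub(1) }
}

/// Parse the HH:MM:SS text typed after `g`.
pub fn parse_timecode(s: &str) -> Option<Duration> {
    let parts: Vec<&str> = s.trim().split(':').collect();
    if parts.len() != 3 {
        return None;
    }
    let h: u64 = parts[0].parse().ok()?;
    let m: u64 = parts[1].parse().ok()?;
    let sec: u64 = parts[2].parse().ok()?;
    if m >= 60 || sec >= 60 {
        return None;
    }
    Some(Duration::from_secs(h * 3600 + m * 60 + sec))
}

fn hms(d: Duration) -> String {
    let s = d.as_secs();
    format!("{:02}:{:02}:{:02}", s / 3600, (s / 60) % 60, s % 60)
}

/// Render the status chip, e.g. "REPLAY | 00:12:34 / 01:00:00 | 2.0x | paused".
pub fn status_chip(pos: Duration, total: Duration, speed: f32, paused: bool) -> String {
    // Whole speeds print with one decimal (2.0x); fractional ones print as-is (0.25x).
    let speed_txt = if speed.fract() == 0.0 { format!("{:.1}", speed) } else { format!("{}", speed) };
    let state = if paused { "paused" } else { "playing" };
    format!("REPLAY | {} / {} | {}x | {}", hms(pos), hms(total), speed_txt, state)
}
```

Routing keys through a small action enum keeps the event handler mode-aware without letting replay concerns leak into the renderer.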
File format
- NDJSON (one JSON object per line) so the file can be truncated, appended, or tail -f'd safely.
- Frame shape (v1 schema): {"schema":1,"t":"2026-04-18T12:00:00.000Z","seq":42,"source":"local","gpus":[...],"cpus":[...],"memory":{...},"chassis":{...},"processes":[...],"errors":[]}
- Header frame (optional first line): {"schema":1,"header":true,"interval_ms":3000,"hosts":["dgx-01","dgx-02"],"all_smi_version":"0.21.0"}.
- A sparse index frame every 1000 data frames: {"schema":1,"index":true,"seq":42000,"byte_offset":1234567}, to enable efficient seeking into compressed files.
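To make the role of the sparse index frames concrete, here is a minimal lookup sketch. IndexEntry, nearest_index, and seq_for_offset are hypothetical names. The sketch assumes index entries are collected in file order (so they are sorted by seq), that seq starts at 0, and that frames are evenly spaced at the header's interval_ms. The proposal does not say whether byte_offset refers to the compressed or the decompressed stream, so the lookup treats it as opaque.

```rust
/// One sparse index entry, mirroring the seq and byte_offset fields of an index frame.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct IndexEntry {
    pub seq: u64,
    pub byte_offset: u64,
}

/// Return the last index entry whose seq is <= target_seq, or None if the
/// target lies before the first entry (the reader then scans from the start).
/// `index` must be sorted by seq in ascending order.
pub fn nearest_index(index: &[IndexEntry], target_seq: u64) -> Option<IndexEntry> {
    // partition_point returns the count of leading entries with seq <= target_seq.
    let n = index.partition_point(|e| e.seq <= target_seq);
    if n == 0 { None } else { Some(index[n - 1]) }
}

/// Approximate the frame sequence number for a time offset, assuming evenly
/// spaced frames starting at seq 0. An interval of 0 maps everything to seq 0.
pub fn seq_for_offset(offset_ms: u64, interval_ms: u64) -> u64 {
    if interval_ms == 0 { 0 } else { offset_ms / interval_ms }
}
```

A seek would convert the target time to a sequence number, jump to the returned byte_offset, and scan forward at most one index interval (1000 frames in the proposal) to reach the exact frame.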
Implementation plan
Files to add / modify:
- src/cli.rs: Record(RecordArgs); extend ViewArgs with replay: Option<PathBuf>, speed: f32, start: Option<String>, replay_loop: bool.
- New src/record/mod.rs:
  - Recorder struct with start(source, options) -> Result<()>.
  - Uses tokio::io::BufWriter atop an adapter that picks zstd::stream::write::Encoder, flate2::write::GzEncoder, or raw File based on extension.
  - Signal handler: on SIGTERM/SIGINT, flush, close, exit 0.
  - Rotation helper.
- New src/record/replay.rs:
  - Replayer struct: streams frames, maintains cursor, exposes current(), next(), prev(), seek(Duration).
  - Uses the index frames for fast seek; if absent, scans from nearest known position with a bounded-memory sliding cache.
- src/view/data_collection/strategy.rs: add ReplayStrategy variant alongside local/remote.
- src/view/runner.rs: when replay.is_some(), wire ReplayStrategy.
- src/view/event_handler.rs: mode-aware key handling; in replay mode, add the SPACE/]/[/+/-/j/k/g/L keys.
- src/ui/chrome.rs or src/ui/local_header.rs: add the REPLAY | ts / total | speed | state indicator.
- Re-export a shared pub fn write_frame_json(w, snapshot) -> io::Result<()> so record and snapshot cannot drift.
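The cursor semantics of the proposed Replayer (current, next, prev, seek) might look roughly like the sketch below. It is a simplified in-memory stand-in meant only to show the intended behavior at the boundaries. A real implementation would stream frames from disk rather than hold them in a Vec. The generic payload T, the position() helper, and the choice to clamp seeks to the first and last frame are all assumptions.

```rust
use std::time::Duration;

/// In-memory stand-in for the proposed Replayer. Each frame carries its
/// offset from the start of the recording plus an opaque payload.
pub struct Replayer<T> {
    frames: Vec<(Duration, T)>, // must be sorted by offset, ascending
    cursor: usize,
}

impl<T> Replayer<T> {
    /// Returns None for an empty recording so `current` never has to fail.
    pub fn new(frames: Vec<(Duration, T)>) -> Option<Self> {
        if frames.is_empty() { None } else { Some(Self { frames, cursor: 0 }) }
    }

    /// Payload of the frame under the cursor.
    pub fn current(&self) -> &T {
        &self.frames[self.cursor].1
    }

    /// Offset of the frame under the cursor.
    pub fn position(&self) -> Duration {
        self.frames[self.cursor].0
    }

    /// Step one frame forward; returns false when already at the last frame.
    pub fn next(&mut self) -> bool {
        if self.cursor + 1 < self.frames.len() {
            self.cursor += 1;
            true
        } else {
            false
        }
    }

    /// Step one frame back; returns false when already at the first frame.
    pub fn prev(&mut self) -> bool {
        if self.cursor > 0 {
            self.cursor -= 1;
            true
        } else {
            false
        }
    }

    /// Jump to the last frame at or before `target`, clamping to the first frame.
    pub fn seek(&mut self, target: Duration) {
        let n = self.frames.partition_point(|(t, _)| *t <= target);
        self.cursor = n.saturating_sub(1);
    }
}
```

Returning a bool from next() gives the caller a natural hook for --loop: when it returns false, a looping replay can seek back to Duration::ZERO instead of stopping.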
Dependencies to add:
- zstd = "0.13" (feature-gated behind record if binary size is a concern, else default on).
- flate2 = "1" (gzip for interoperability).
Acceptance criteria
- all-smi record -o /tmp/trace.ndjson --duration 30s produces a valid NDJSON file with at least 10 frames (at 3s default interval).
- all-smi record -o /tmp/trace.ndjson.zst --duration 30s produces a valid zstd-compressed file (zstd -d decompresses to NDJSON).
- --max-size 1K --max-files 3 yields up to 3 segments, oldest evicted.
- SIGTERM during recording closes cleanly with a complete final JSON line.
- all-smi view --replay /tmp/trace.ndjson opens the TUI and shows the first frame; the status bar shows REPLAY and ts/total.
- SPACE toggles play; frames advance at 1x by default; +/- scale speed.
- ]/[ step one frame.
- g 00:00:15 <Enter> jumps to 15 seconds in.
- --loop replays indefinitely.
- view --replay against a view --hostfile recording renders the same tab structure (multiple hosts), not just a single local view.
- cargo test includes: round-trip (record → replay), schema version mismatch error path, compressed file handling, seek correctness with and without index frames.
Edge cases & non-goals
- Schema version mismatch must produce a clear error (replay: unsupported schema version 2, this all-smi supports schema 1), not a panic.
- Corrupted tail line (incomplete frame written on crash): skip with a warning, continue.
- Large files (>1 GiB) must stream; do not load the whole file. Benchmark with a 1M-frame file.
- The replay pipeline MUST NOT call any device readers. This ensures replay works on a totally different machine than the one that recorded.
- Remote recordings of a 100-node cluster produce large frames; the writer must buffer and flush periodically.
- Non-goal: editing / trimming recordings. Users can filter externally with jq -c.
- Non-goal: converting to PromQL history. That's a different tool.
Soft dependency
- Shares JSON schema with the snapshot subcommand and the SSE streaming endpoint. Land the schema first; these three features should use the identical serializer.