feat: metric recording ('all-smi record') and TUI replay ('view --replay') #187

@inureyes

Summary

Add the ability to persistently capture a metric stream to a file (all-smi record) and replay it later in the TUI (all-smi view --replay file.ndjson). Replay supports play/pause, step, seek, and speed controls so operators can investigate past incidents without a Prometheus instance or full observability stack.

Motivation

When an incident happens on a cluster, the current all-smi modes cannot reproduce state post-hoc: local/view are live only, and api requires a running Prometheus to retain history. Operators and researchers regularly want to say "rewind to 14:32, that's when throughput cratered" and see the exact same TUI they would have seen live. A compact self-contained recording format unlocks this without the operational overhead of a full monitoring stack.

Current state

  • src/view/render_snapshot.rs already materializes a full snapshot per frame — a natural structural fit for a per-frame record.
  • Data types (GpuInfo, CpuInfo, ChassisInfo, MemoryInfo, process list, …) all implement Serialize + Deserialize.
  • No persistence today.
  • src/view/data_collection/strategy.rs already abstracts local vs remote collection — a clean place to add a ReplayStrategy.

Proposed design

Record subcommand

all-smi record [--output path.ndjson | path.ndjson.zst | path.ndjson.gz]
               [--interval <secs>]
               [--duration 1h | 0]
               [--source local|remote]
               [--hosts ...] [--hostfile ...]
               [--include gpu,cpu,memory,chassis,process]
               [--max-size 100M] [--max-files 10]
               [--compress zstd|gzip|none]
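The `--duration` and `--max-size` values in the synopsis need small parsers. The following is a minimal sketch, not the project's actual code: `parse_duration` and `parse_size` are hypothetical helper names, the `s`/`m`/`h` suffixes and the treatment of a bare number as seconds are assumptions, and `K`/`M`/`G` are assumed to be binary multiples (1024-based), which the proposal does not specify.

```rust
use std::time::Duration;

/// Splits "100M" into ("100", "M"). Fails if there is no leading number.
fn split_suffix(arg: &str) -> Result<(&str, &str), String> {
    let idx = arg.find(|c: char| !c.is_ascii_digit()).unwrap_or(arg.len());
    if idx == 0 {
        return Err(format!("expected a leading number in {arg:?}"));
    }
    Ok(arg.split_at(idx))
}

/// Parses a `--duration` value. `Ok(None)` means "no limit", which is how the
/// proposal defines `--duration 0` (record until signalled).
pub fn parse_duration(arg: &str) -> Result<Option<Duration>, String> {
    let arg = arg.trim();
    if arg == "0" {
        return Ok(None);
    }
    let (digits, unit) = split_suffix(arg)?;
    let n = digits
        .parse::<u64>()
        .map_err(|e| format!("invalid number {digits:?}: {e}"))?;
    // Assumption: a bare number is interpreted as seconds.
    let per_unit: u64 = match unit {
        "" | "s" => 1,
        "m" => 60,
        "h" => 3600,
        other => return Err(format!("unknown duration unit {other:?}")),
    };
    let secs = n
        .checked_mul(per_unit)
        .ok_or_else(|| format!("duration {arg:?} overflows"))?;
    Ok(Some(Duration::from_secs(secs)))
}

/// Parses a `--max-size` value into bytes.
/// Assumption: K/M/G are powers of 1024; the proposal does not say.
pub fn parse_size(arg: &str) -> Result<u64, String> {
    let arg = arg.trim();
    let (digits, unit) = split_suffix(arg)?;
    let n = digits
        .parse::<u64>()
        .map_err(|e| format!("invalid number {digits:?}: {e}"))?;
    let per_unit: u64 = match unit {
        "" => 1,
        "K" => 1 << 10,
        "M" => 1 << 20,
        "G" => 1 << 30,
        other => return Err(format!("unknown size unit {other:?}")),
    };
    n.checked_mul(per_unit)
        .ok_or_else(|| format!("size {arg:?} overflows"))
}
```

Returning `Option<Duration>` keeps the "record forever" case explicit at the type level instead of overloading a zero-length duration.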

Behavior:

  • Source selection mirrors view (local reader vs remote HTTP scrape), so recording a remote cluster works too.
  • Writes one NDJSON line per collection cycle. Each line is a self-contained JSON object with { "schema": 1, "t": "2026-04-18T...Z", "gpus": [...], "cpus": [...], ... } — same shape as the snapshot subcommand's JSON output, sharing serialization code.
  • Rotation on size: when output exceeds --max-size, close, rename to path.0001.ndjson[.zst], open a new file. Keep at most --max-files segments.
  • Compression: auto-detect from the output file extension, or set explicitly with --compress. zstd is the preferred default.
  • --duration 0 means record until SIGTERM; on signal, flush and close cleanly.
  • On error reading a device, emit { "t": ..., "errors": ["nvidia: NotSupported", ...] } lines so the replay shows gaps instead of silently skipping.
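The extension-based compression detection and the rotation naming described above can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions: `detect_compression`, `segment_path`, and `segments_to_evict` are hypothetical names, and "oldest" is assumed to mean the lowest segment number, which the proposal leaves open.

```rust
use std::path::{Path, PathBuf};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Compression {
    Zstd,
    Gzip,
    None,
}

/// Picks a codec from the final extension: `.zst` -> zstd, `.gz` -> gzip,
/// anything else -> uncompressed.
pub fn detect_compression(path: &Path) -> Compression {
    match path.extension().and_then(|e| e.to_str()) {
        Some("zst") => Compression::Zstd,
        Some("gz") => Compression::Gzip,
        _ => Compression::None,
    }
}

/// Builds the rotated name for a closed segment, e.g.
/// `trace.ndjson.zst` + 1 -> `trace.0001.ndjson.zst`.
/// If the name has no `.ndjson` part, the counter is appended instead.
pub fn segment_path(base: &Path, index: u32) -> PathBuf {
    let name = base.file_name().and_then(|n| n.to_str()).unwrap_or("");
    let rotated = match name.rfind(".ndjson") {
        Some(pos) => format!("{}.{:04}{}", &name[..pos], index, &name[pos..]),
        None => format!("{name}.{index:04}"),
    };
    base.with_file_name(rotated)
}

/// Given the segment numbers currently on disk, returns the ones to delete so
/// that at most `max_files` remain.
/// Assumption: lower numbers are older segments.
pub fn segments_to_evict(mut indices: Vec<u32>, max_files: usize) -> Vec<u32> {
    indices.sort_unstable();
    let excess = indices.len().saturating_sub(max_files);
    indices.truncate(excess);
    indices
}
```

An explicit `--compress` value would simply override the result of `detect_compression`.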

Replay

Extend all-smi view with:

all-smi view --replay path.ndjson[.zst|.gz]
             [--speed 1.0] [--start 00:30] [--loop]
  • On entering replay mode, the TUI status bar shows REPLAY | 00:12:34 / 01:00:00 | 2.0x | paused.
  • New keybindings (active only in replay mode):
    • SPACE — play/pause
    • ] — step one frame forward, [ — step one frame back
    • + / - — speed up/down (0.25x, 0.5x, 1x, 2x, 4x, 8x)
    • j / k — seek backward/forward by 10s
    • g — open timecode entry, type HH:MM:SS + Enter to jump
    • L — toggle looping
  • The replay strategy feeds the same RenderSnapshot pipeline as live data. No renderer code should change behavior based on mode; only the status bar shows the replay chip.
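The speed ladder and the status-bar chip can be modeled as a small state struct that the key handler mutates. The sketch below is hypothetical: `ReplayControls` is an invented name, the "playing" label for the unpaused state is an assumption (the proposal only shows "paused"), and starting unpaused at 1x is a guess at the default.

```rust
use std::time::Duration;

/// Allowed playback speeds with their display labels, per the keybinding list.
const SPEEDS: [(f32, &str); 6] = [
    (0.25, "0.25x"),
    (0.5, "0.5x"),
    (1.0, "1.0x"),
    (2.0, "2.0x"),
    (4.0, "4.0x"),
    (8.0, "8.0x"),
];

pub struct ReplayControls {
    speed_idx: usize,
    pub paused: bool,
    pub looping: bool,
}

impl ReplayControls {
    /// Assumption: replay starts at 1x, unpaused, without looping.
    pub fn new() -> Self {
        Self { speed_idx: 2, paused: false, looping: false }
    }

    pub fn speed(&self) -> f32 {
        SPEEDS[self.speed_idx].0
    }

    /// `+` key: move one step up the ladder, clamping at 8x.
    pub fn speed_up(&mut self) {
        self.speed_idx = (self.speed_idx + 1).min(SPEEDS.len() - 1);
    }

    /// `-` key: move one step down the ladder, clamping at 0.25x.
    pub fn speed_down(&mut self) {
        self.speed_idx = self.speed_idx.saturating_sub(1);
    }

    /// SPACE key.
    pub fn toggle_pause(&mut self) {
        self.paused = !self.paused;
    }

    /// Formats the chip as `REPLAY | pos / total | speed | state`.
    pub fn status_line(&self, pos: Duration, total: Duration) -> String {
        let state = if self.paused { "paused" } else { "playing" };
        format!(
            "REPLAY | {} / {} | {} | {}",
            hms(pos),
            hms(total),
            SPEEDS[self.speed_idx].1,
            state
        )
    }
}

/// Renders a duration as HH:MM:SS.
fn hms(d: Duration) -> String {
    let s = d.as_secs();
    format!("{:02}:{:02}:{:02}", s / 3600, (s / 60) % 60, s % 60)
}
```

Keeping the labels in the table avoids float-formatting surprises such as 0.25 rendering as "0.2x" with one decimal place.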

File format

  • NDJSON (one JSON object per line) so the file can be truncated, appended, or tail -f'd safely.
  • Frame shape (v1 schema):
    {"schema":1,"t":"2026-04-18T12:00:00.000Z","seq":42,"source":"local","gpus":[...],"cpus":[...],"memory":{...},"chassis":{...},"processes":[...],"errors":[]}
  • Header frame (optional first line): {"schema":1,"header":true,"interval_ms":3000,"hosts":["dgx-01","dgx-02"],"all_smi_version":"0.21.0"}.
  • A sparse index frame every 1000 data frames: {"schema":1,"index":true,"seq":42000,"byte_offset":1234567} to enable efficient seeking into compressed files.
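Seeking with the sparse index amounts to finding the last index entry at or before the target frame and scanning forward from there. A minimal sketch of that lookup follows; `IndexEntry` and `nearest_index` are hypothetical names, and `byte_offset` is treated as an opaque position because the proposal does not say whether it refers to the compressed or the decompressed stream.

```rust
/// One parsed index frame: `{"schema":1,"index":true,"seq":...,"byte_offset":...}`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct IndexEntry {
    pub seq: u64,
    pub byte_offset: u64,
}

/// Returns the last entry whose `seq` is <= `target_seq`, i.e. the closest
/// known position from which a forward scan reaches the target.
/// `entries` must be sorted by `seq` in ascending order.
/// Returns `None` when the target lies before the first index frame, in which
/// case the caller scans from the start of the file.
pub fn nearest_index(entries: &[IndexEntry], target_seq: u64) -> Option<IndexEntry> {
    // Binary search: count of entries with seq <= target_seq.
    let n = entries.partition_point(|e| e.seq <= target_seq);
    n.checked_sub(1).map(|i| entries[i])
}
```

With one index frame per 1000 data frames, a seek scans at most about 1000 frames after the lookup.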

Implementation plan

Files to add / modify:

  • src/cli.rs: add Record(RecordArgs); extend ViewArgs with replay: Option<PathBuf>, speed: f32, start: Option<String>, replay_loop: bool.
  • New src/record/mod.rs:
    • Recorder struct with start(source, options) -> Result<()>.
    • Uses tokio::io::BufWriter atop an adapter that picks zstd::stream::write::Encoder, flate2::write::GzEncoder, or raw File based on extension.
    • Signal handler: on SIGTERM/SIGINT, flush, close, exit 0.
    • Rotation helper.
  • New src/record/replay.rs:
    • Replayer struct: streams frames, maintains cursor, exposes current(), next(), prev(), seek(Duration).
    • Uses the index frames for fast seek; if absent, scans from nearest known position with a bounded-memory sliding cache.
  • src/view/data_collection/strategy.rs — add ReplayStrategy variant alongside local/remote.
  • src/view/runner.rs — when replay.is_some(), wire ReplayStrategy.
  • src/view/event_handler.rs — mode-aware key handling; in replay mode, add the SPACE/]/[/+/-/j/k/g/L keys.
  • src/ui/chrome.rs or src/ui/local_header.rs — add the REPLAY | ts / total | speed | state indicator.
  • Re-export a shared pub fn write_frame_json(w, snapshot) -> io::Result<()> so record and snapshot cannot drift.
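The cursor API listed for the Replayer (current(), next(), prev(), seek(Duration)) can be sketched over an in-memory frame list. This is a simplification for illustration only: the proposal calls for streaming with a bounded cache, whereas this sketch holds every frame in a Vec, and the generic frame type `F` and the constructor signature are assumptions.

```rust
use std::time::Duration;

/// Cursor over recorded frames. Each frame carries its offset from the start
/// of the recording; frames must be sorted by that offset.
pub struct Replayer<F> {
    frames: Vec<(Duration, F)>,
    cursor: usize,
}

impl<F> Replayer<F> {
    /// Returns `None` for an empty recording so that `current()` can never fail.
    pub fn new(frames: Vec<(Duration, F)>) -> Option<Self> {
        if frames.is_empty() {
            None
        } else {
            Some(Self { frames, cursor: 0 })
        }
    }

    pub fn current(&self) -> &F {
        &self.frames[self.cursor].1
    }

    /// Steps one frame forward; `None` at the end of the recording.
    pub fn next(&mut self) -> Option<&F> {
        if self.cursor + 1 < self.frames.len() {
            self.cursor += 1;
            Some(self.current())
        } else {
            None
        }
    }

    /// Steps one frame back; `None` at the first frame.
    pub fn prev(&mut self) -> Option<&F> {
        if self.cursor > 0 {
            self.cursor -= 1;
            Some(self.current())
        } else {
            None
        }
    }

    /// Jumps to the last frame at or before `target`, clamping to the first
    /// frame when `target` precedes it.
    pub fn seek(&mut self, target: Duration) -> &F {
        let n = self.frames.partition_point(|(t, _)| *t <= target);
        self.cursor = n.saturating_sub(1);
        self.current()
    }
}
```

When `next()` returns `None` and looping is enabled, the caller can wrap around with `seek(Duration::ZERO)`.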

Dependencies to add:

  • zstd = "0.13" (feature-gated behind record if binary size is a concern, else default on).
  • flate2 = "1" (gzip for interoperability).

Acceptance criteria

  • all-smi record -o /tmp/trace.ndjson --duration 30s produces a valid NDJSON file with at least 10 frames (at 3s default interval).
  • all-smi record -o /tmp/trace.ndjson.zst --duration 30s produces a valid zstd-compressed file (zstd -d decompresses to NDJSON).
  • Rotation kicks in: --max-size 1K --max-files 3 yields up to 3 segments, oldest evicted.
  • SIGTERM during recording closes cleanly with a complete final JSON line.
  • all-smi view --replay /tmp/trace.ndjson opens the TUI and shows the first frame; the status bar shows REPLAY and ts/total.
  • SPACE toggles play; frames advance at 1x by default; +/- scale speed.
  • ]/[ step one frame.
  • g 00:00:15 <Enter> jumps to 15 seconds in.
  • --loop replays indefinitely.
  • Running view --replay against a recording made with record --hostfile renders the same tab structure (multiple hosts), not just a single local view.
  • cargo test includes: round-trip (record → replay), schema version mismatch error path, compressed file handling, seek correctness with and without index frames.
  • README gains "Recording & Replay" section with usage.
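The timecode jump criterion (g 00:00:15) and the --start 00:30 flag both need a timecode parser. A possible sketch is shown below; `parse_timecode` is a hypothetical name, and reading a two-field value such as 00:30 as MM:SS (rather than HH:MM) is an assumption the proposal does not settle.

```rust
use std::time::Duration;

/// Parses "SS", "MM:SS", or "HH:MM:SS" into an offset from the start of the
/// recording. Every field after the first must be below 60.
/// Returns `None` for malformed input instead of panicking.
pub fn parse_timecode(s: &str) -> Option<Duration> {
    let parts: Vec<&str> = s.trim().split(':').collect();
    if parts.len() > 3 {
        return None;
    }
    let mut secs: u64 = 0;
    for (i, part) in parts.iter().enumerate() {
        // Reject empty fields, signs, and whitespace inside fields.
        if part.is_empty() || !part.bytes().all(|b| b.is_ascii_digit()) {
            return None;
        }
        let value: u64 = part.parse().ok()?;
        if i > 0 && value >= 60 {
            return None;
        }
        // Shift the accumulated value up one base-60 position, then add.
        secs = secs.checked_mul(60)?.checked_add(value)?;
    }
    Some(Duration::from_secs(secs))
}
```

The resulting Duration can be passed straight to the Replayer's seek(Duration).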

Edge cases & non-goals

  • Schema version mismatch must produce a clear error (replay: unsupported schema version 2, this all-smi supports schema 1), not a panic.
  • Corrupted tail line (incomplete frame written on crash): skip with a warning, continue.
  • Large files (>1 GiB) must stream — do not load the whole file. Benchmark with a 1M-frame file.
  • The replay pipeline MUST NOT call any device readers — ensures replay works on a totally different machine than the one that recorded.
  • Remote recordings of a 100-node cluster produce large frames; the writer must buffer and flush periodically.
  • Non-goal: editing / trimming recordings. Users can filter externally with jq -c.
  • Non-goal: converting to PromQL history — that's a different tool.
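Two of the edge cases above, the schema mismatch error and the corrupted tail line, can be sketched with the standard library. This is a hedged illustration: `UnsupportedSchema`, `check_schema`, and `for_each_frame` are hypothetical names, and the brace check is only a stand-in for real JSON parsing, which an actual implementation would do with a JSON library.

```rust
use std::fmt;
use std::io::{self, BufRead};

pub const SUPPORTED_SCHEMA: u32 = 1;

/// Error for a frame whose `schema` field this build cannot read.
#[derive(Debug, PartialEq, Eq)]
pub struct UnsupportedSchema(pub u32);

impl fmt::Display for UnsupportedSchema {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "replay: unsupported schema version {}, this all-smi supports schema {}",
            self.0, SUPPORTED_SCHEMA
        )
    }
}

/// Turns a version mismatch into an error value rather than a panic.
pub fn check_schema(found: u32) -> Result<(), UnsupportedSchema> {
    if found == SUPPORTED_SCHEMA {
        Ok(())
    } else {
        Err(UnsupportedSchema(found))
    }
}

/// Streams NDJSON lines to `on_frame` one at a time and returns how many
/// lines were skipped as incomplete or invalid.
/// Stand-in check: a line counts as complete if it is valid UTF-8 and is
/// wrapped in `{` ... `}`; real code would attempt a full JSON parse.
pub fn for_each_frame<R: BufRead>(
    mut reader: R,
    mut on_frame: impl FnMut(&str),
) -> io::Result<usize> {
    let mut buf = Vec::new();
    let mut skipped = 0;
    loop {
        buf.clear();
        // Read raw bytes so a line cut mid-character does not abort the scan.
        if reader.read_until(b'\n', &mut buf)? == 0 {
            break;
        }
        match std::str::from_utf8(&buf).map(str::trim) {
            Ok("") => {}
            Ok(line) if line.starts_with('{') && line.ends_with('}') => on_frame(line),
            _ => skipped += 1, // a real implementation would log a warning here
        }
    }
    Ok(skipped)
}
```

Because only one line is buffered at a time, memory use stays bounded by the largest frame, which matches the streaming requirement for files over 1 GiB.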

Soft dependency

  • Shares its JSON schema with the snapshot subcommand and the SSE streaming endpoint. Land the schema first; all three features should use the identical serializer.
