Skip to content

feat: metric recording ('all-smi record') and TUI replay ('view --replay')#197

Merged
inureyes merged 4 commits into
mainfrom
feat/187-record-replay
Apr 20, 2026
Merged

feat: metric recording ('all-smi record') and TUI replay ('view --replay')#197
inureyes merged 4 commits into
mainfrom
feat/187-record-replay

Conversation

@inureyes

Copy link
Copy Markdown
Member

Summary

Adds the all-smi record subcommand and extends all-smi view with
--replay <file> so operators can capture a metric stream to disk and
scrub/step/seek through it later in the exact same TUI they would have
seen live. Intended for post-hoc incident investigation without a
Prometheus retention store.

Closes #187.

Implementation

  • Shared serializer: new write_frame_json<W: Write>(&mut W, &Snapshot)
    in src/snapshot/serializers/json.rs, re-exported through
    src/snapshot/serializers/mod.rs. Both the snapshot subcommand
    and record use it, preventing drift. Snapshot gains Deserialize
    with #[serde(default)] on optional section fields so the replay
    side reconstructs frames even when --include omitted a section
    during recording.

  • Record pipeline (src/record/):

    • mod.rsRecorder::start orchestrator: signal handlers (SIGINT,
      SIGTERM), interval ticker, header frame, sparse index frame every
      1000 data frames. Local and remote sources supported; remote
      reuses RemoteCollectorBuilder from the view scrape path.
    • writer.rsRotatingWriter handles .zst/.gz encoders and
      rolls over at --max-size, keeping at most --max-files segments.
    • replay.rsReplayer streaming cursor with bounded frame
      cache (FRAME_CACHE_MAX = 1024), sparse-index fast-path for
      seeks, malformed-line recovery, and exact-match error message for
      unsupported schemas.
  • Replay in the TUI:

    • src/view/data_collection/replay_collector.rsReplayDriver
      feeds frames into AppState honoring play/pause/speed/step/seek.
      No renderer branches on replay mode; only the status bar differs.
    • src/ui/chrome.rs — new print_replay_bar chip with
      REPLAY | ts | frame N / M | Xx | state; falls back to the
      normal hotkey strip when replay is None on AppState.
      Filter bar still wins over the replay chip so operators can
      filter visible GPUs mid-playback.
    • src/view/event_handler.rs — strict mode precedence: filter-edit
      mode > replay-timecode editor > normal keys > replay keys. The
      g key opens the timecode editor only when replay is active; the
      existing g → sort by GPU memory binding is preserved otherwise.
  • CLI: RecordArgs, RecordSource, RecordCompression added to
    src/cli.rs; ViewArgs gains replay, speed, start,
    replay_loop. main.rs dispatches Record alongside Snapshot,
    and the View branch short-circuits to run_replay_mode when
    --replay is set.

  • Dependencies: adds zstd = "0.13" and flate2 = "1" at the
    crate root (both are needed at all times because the writer and
    reader select by file extension at runtime, not by feature flag).

  • Docs: README gains a "Recording & Replay" section with the
    record CLI table, replay keybindings table, and usage examples
    (local, zstd, remote hostfile, timecode start, loop).

Testing

cargo test surface:

  • src/record/writer.rs — codec detection, rotation eviction, plain
    / gzip / zstd round-trips.
  • src/record/replay.rs — prime-first-frame, schema-v2 exact error
    message, step forward/back, seek to offset, gzip decoder, zstd
    decoder, corrupted-tail line skip.
  • src/record/mod.rsparse_duration / parse_byte_size
    suffix parsing (30s, 5m, 1h, 1d, 1K, 10M, 2G).
  • src/view/event_handler.rs — 8 new replay-mode key tests: SPACE
    toggles pause, [/] step+pause, +/- cycle speed, j/k
    ±10s seek, g opens timecode editor and Enter commits, L
    toggles loop, replay keys inert when replay = None, filter
    mode wins over replay mode.
  • tests/record_replay_test.rs — 8 integration tests: plain/gzip/zstd
    round-trip, corrupted tail skip, schema-v2 rejection, seek with
    and without index frames, 50k-frame streaming check.

Runs:

  • cargo test --lib — 695 passed
  • cargo test --bin all-smi — 781 passed (↑ from 766; +15 new tests)
  • cargo test --test record_replay_test — 8 passed
  • cargo clippy --all-targets -- -D warnings — clean

Manual smoke: all-smi record --output /tmp/smoke.ndjson --duration 5s --interval 1 --compress none produced a header + 5 data frames;
--output /tmp/smoke.ndjson.zst produced a zstd file that zstd -dc
decompresses correctly; --max-size 1024 --max-files 3 capped the
total on-disk segments at 3 (active + 2 rotated).

Test plan

  • Build all-smi record and all-smi view --replay binaries.
  • cargo test --lib, cargo test --bin all-smi, and integration
    test all green.
  • cargo clippy --all-targets -- -D warnings clean.
  • Smoke-recorded local stream round-trips through replay parser.
  • Compressed output (.zst, .gz) round-trips.
  • Rotation cap (--max-size 1K --max-files 3) evicts oldest.
  • Manual TUI replay walkthrough on a multi-host recording
    (reviewer to validate keybinding feel + status-bar layout).

Add the `record` subcommand that captures live metrics to an NDJSON
stream and extends `view` with `--replay <file>` so operators can
step, seek, and scrub through a past incident in the exact same TUI
they would have seen live.

Both sides share the snapshot subcommand's JSON serializer through a
new `write_frame_json` helper; compression (`.zst`, `.gz`) is
auto-detected by extension and rotation is bounded by
`--max-size`/`--max-files`. Replay-mode keybindings (SPACE, `]`/`[`,
`+`/`-`, `j`/`k`, `g`, `L`) are routed with strict mode precedence so
the filter-edit mode from #186 keeps its keys.

Closes #187
@inureyes inureyes added type:enhancement New feature or request priority:high High priority issue status:review Under review labels Apr 20, 2026
…y-one

Three issues caught during review of issue #187 (record + replay):

1. SIGTERM/SIGINT during `all-smi record` was calling the unconditional
   `std::process::exit(0)` handlers from `run_command`, which raced the
   recorder's cooperative shutdown and left the output file at zero
   bytes. Skip the global signal handlers when the subcommand is
   `Record` so `record::install_signal_handlers` can flush the encoder
   and call `RotatingWriter::finish()` cleanly. Directly covers the
   acceptance criterion "SIGTERM during recording closes cleanly with
   a complete final JSON line."

2. `handle_up_arrow` / `handle_down_arrow` / `handle_page_up` /
   `handle_page_down` decided local-vs-remote mode from
   `args.hosts.is_some() || args.hostfile.is_some()`. In `view --replay`
   neither flag is set, so arrow keys fell into the local-process-list
   scroll branch even though the UI renders tabs + GPU columns from the
   recorded data. Include `args.replay.is_some()` in the remote
   classification so scrolling matches what the user sees.

3. `Replayer::rewind_and_seek_to` set `next_disk_seq = nearest_seq`
   after skipping past a sparse index frame. Index frames are written
   AFTER their matching data frame (`record::write_data_frame`), so
   the first data frame read post-skip has absolute sequence
   `nearest_seq + 1`. The old code off-by-one-ed every subsequent
   frame's reported sequence number, corrupting the REPLAY status-bar
   "frame N / M" display after an index fast-path seek. Added a
   regression test and a `#[cfg(test)] index_points_seen()` accessor
   so the test can assert the fast-path actually fired.
@inureyes

Copy link
Copy Markdown
Member Author

Implementation Review Summary

Intent

all-smi record captures a live metric stream to NDJSON (with optional zstd/gzip compression and rotation); all-smi view --replay <file> plays it back through the same RenderSnapshot pipeline as the live TUI.

Findings Addressed (all fixed in 770f829)

  • [CRITICAL] SIGTERM/SIGINT during record truncated output to zero bytes. The unconditional std::process::exit(0) handlers installed in run_command raced the recorder's cooperative stop flag and killed the process before RotatingWriter::finish() could flush the zstd/gzip trailer. Directly violated the "SIGTERM during recording closes cleanly with a complete final JSON line" acceptance criterion. Fix: skip the global signal handlers when the subcommand is Record; the record path's own handlers then cleanly flush and exit. Empirically verified: before fix, a SIGTERM'd recording was 0 bytes. After fix, the same SIGTERM produces a proper 1.6 KB zstd file with 5 decompressible NDJSON lines.

  • [MEDIUM] Replay-mode arrow/PageUp/PageDown keys took the local-process-list branch. handle_up_arrow etc. classified remote vs local via args.hosts.is_some() || args.hostfile.is_some(), but --replay sets neither, so in replay mode the arrow keys tried to scroll a non-existent process list even though the UI rendered tabs + GPU columns. Fix: include args.replay.is_some() in the remote classification.

  • [MEDIUM] Index-frame fast-path off-by-one in Replayer::rewind_and_seek_to. The record writer places index frames AFTER the matching data frame (if seq % 1000 == 0 { write(index { seq }) } runs after the data frame has already been written). On the read side, skipping past the index line leaves the reader at data frame nearest_seq + 1, but the old code set next_disk_seq = nearest_seq, so every subsequent frame was labeled with a seq that was off by one. The REPLAY status-bar "frame N / M" display would drift after a fast-path seek. Added a regression test.

Verification

  • All stated requirements implemented — record subcommand, replay mode, shared serializer, schema version error message, corrupted-tail handling, streaming discipline (no read_to_string/read_to_end in production code), rotation+eviction, SIGTERM clean shutdown (after fix).
  • No placeholder/mock code remaining.
  • Integrated into project code flow — Commands::Record dispatches from main.rs; View short-circuits to run_replay_mode when --replay is set; the replay driver feeds AppState through the same path as live collectors.
  • Project conventions followed — thiserror for library errors, anyhow for the CLI entry point, inline format args, per-module tests alongside the code.
  • Existing modules reused — shared write_frame_json serializer, shared RemoteCollectorBuilder for the remote record source, shared RenderSnapshot pipeline for the replay UI.
  • No unintended structural changes — all additions, existing APIs untouched.
  • Tests pass — 782 unit tests + 8 integration tests + clippy clean. One regression test added for the seek off-by-one.

Other Observations (non-blocking)

  • The replay driver calls replayer.next()? + replayer.prev()? on every 50 ms tick to decide whether a frame is due. For a 3-s record interval that's 60 next/prev pairs per frame — wasteful but not a correctness issue. Worth revisiting if replay idle CPU shows up in profiling.
  • Replayer::next() returns Ok(self.current()) at EOF (not Ok(None)), so callers that use while r.next().is_some() to drain the stream hang forever. The internal apply_frame_to_state path is not affected because it uses at_eof() indirectly, but it's an API sharp edge. Worth documenting on the next() signature in a follow-up.

Address security/performance findings on the record+replay pipeline
(PR #197 follow-up, issue #187):

Writer (src/record/writer.rs):
- Open new segments with `O_NOFOLLOW` + `create_new(true)` + mode 0o600
  on Unix and `share_mode(0)` on Windows via new `open_secure` helper,
  mirroring the atomic-snapshot hardening from #185. Blocks the
  co-tenant symlink preplant attack (`/tmp/all-smi-record.ndjson ->
  /etc/shadow`).
- During rollover, refuse to unlink or rename onto a pre-planted
  symlink at the rotated path (symlink_metadata check before
  remove_file).

Replay (src/record/replay.rs):
- Cap per-line length at 16 MiB (MAX_LINE_BYTES) so a 1 KiB `.zst`
  cannot decompress to a 10 GiB line; oversized lines are drained,
  logged, and skipped while the rest of the file stays playable.
- Apply the cap to the index-fast-path skip-ahead loop too via
  `skip_one_line_bounded`.
- Call `window_log_max(27)` on the zstd decoder to cap declared
  frame-window memory at 128 MiB.
- Validate index frames at ingest: reject implausible `seq` (≥
  u64::MAX/2 or 0), non-monotonic `seq`, index frames that point
  past observed data (seq ≥ next_disk_seq), and index frames before
  any data frame (line == 0). Drop offending entries without
  aborting playback.
- Replace `self.next_disk_seq += 1` with checked_add that surfaces
  a new `ReplayError::SeqOverflow` on wrap, eliminating the debug
  panic / release silent-wrap on u64 overflow.
- Cap header `hosts[]` at MAX_HEADER_HOSTS (1024) at ingest; truncate
  and warn on overflow.
- Introduce a per-tick scan budget (SCAN_BUDGET_PER_TICK = 1024) so
  a file with millions of malformed lines cannot monopolise the
  async driver tick; the budget-exhaust path yields via Ok(false)
  and preserves pause/seek state for the next call.
- Switch back-cache from `Vec` with `remove(0)` (O(N)) to `VecDeque`
  with `pop_front()` (O(1)).

Driver (src/view/data_collection/replay_collector.rs):
- Tighten speed-multiplier guard: clamp to [0.05, 16.0] and fall
  back to 1.0 on NaN/+inf/-inf before `Duration::from_secs_f32`,
  which panics on those inputs.
- Replace `debug_assert_eq!` on cursor retreat with a soft tracing
  warning so an unexpected cursor position cannot abort the replay
  session.

Runner (src/view/runner.rs):
- Defensive belt-and-suspenders cap on `initial_state.tabs` length
  using the replayer's public MAX_HEADER_HOSTS constant.

Regression tests added:
- Record writer symlink refusal (Unix-only), 0o600 mode, existing-file
  refusal, rollover symlink refusal.
- Oversized-line skip without OOM.
- Zstd decoder window cap (file with window_log=28 rejected).
- Index frame with seq=u64::MAX rejected at ingest.
- Header hosts list truncated at 1024.
- next_disk_seq overflow → clean `SeqOverflow` error, no panic.
- README "Recording & Replay" section now documents all four security
  invariants: O_NOFOLLOW symlink refusal on the writer, 16 MiB per-line
  cap and 128 MiB zstd window ceiling on the reader, and 1024-entry
  hosts-list cap.
- Changelog entry for v0.21.0 updated to include record + view --replay,
  interactive filter bar, and alert history panel.
- Help pane (h key) conditionally shows replay keybindings when replay
  mode is active (state.replay.is_some()), matching the status-bar hint.
- New unit tests:
    replay.rs: replayer_scan_budget_prevents_runaway_loop (F6) — verifies
      SCAN_BUDGET_PER_TICK stops the read loop from spinning on an all-
      garbage file and still reaches the next valid frame.
    app_state.rs: cycle_speed_up_advances_ladder,
      cycle_speed_down_retreats_ladder, cycle_speed_nan_input_does_not_panic
      — cover the speed-ladder stepping and the NaN-safe fallback.
@inureyes

Copy link
Copy Markdown
Member Author

PR Finalization Complete

Summary

Tests: Added 4 new tests; all suites green.

  • src/app_state.rs: cycle_speed_up_advances_ladder, cycle_speed_down_retreats_ladder, cycle_speed_nan_input_does_not_panic — exercises the speed ladder stepping and the NaN-safe fallback in cycle_speed.
  • src/record/replay.rs: replayer_scan_budget_prevents_runaway_loop (F6) — verifies SCAN_BUDGET_PER_TICK stops the frame-scan loop from spinning indefinitely on an all-garbage file and that the next valid frame is still reachable on a subsequent call.

Documentation:

  • README.md "Recording & Replay" section now includes a "Security invariants" block covering: writer O_NOFOLLOW symlink refusal, 16 MiB per-line decompression cap, 128 MiB zstd window ceiling, and 1024-entry hosts-list cap.
  • Changelog v0.21.0 (upcoming) entry updated to mention record, view --replay, interactive filter bar, and alert history panel.
  • Help pane (h key) now conditionally renders a "Replay Controls" section listing all 8 replay keybindings when state.replay.is_some(), matching the status-bar hint.

Lint/Format: cargo fmt --check and cargo clippy --all-targets -- -D warnings both clean.

Final test counts (HEAD 9c9a089)

Suite Count
cargo test --lib 698 passed
cargo test --bin all-smi 795 passed
cargo test --test record_replay_test 8 passed

@inureyes inureyes added status:done Completed and removed status:review Under review labels Apr 20, 2026
@inureyes inureyes merged commit 36ee46a into main Apr 20, 2026
3 checks passed
@inureyes inureyes deleted the feat/187-record-replay branch April 20, 2026 14:24
@inureyes inureyes self-assigned this May 24, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

priority:high High priority issue status:done Completed type:enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: metric recording ('all-smi record') and TUI replay ('view --replay')

1 participant