feat: metric recording ('all-smi record') and TUI replay ('view --replay')#197
Conversation
Add the `record` subcommand that captures live metrics to an NDJSON stream and extends `view` with `--replay <file>` so operators can step, seek, and scrub through a past incident in the exact same TUI they would have seen live. Both sides share the snapshot subcommand's JSON serializer through a new `write_frame_json` helper; compression (`.zst`, `.gz`) is auto-detected by extension and rotation is bounded by `--max-size`/`--max-files`. Replay-mode keybindings (SPACE, `]`/`[`, `+`/`-`, `j`/`k`, `g`, `L`) are routed with strict mode precedence so the filter-edit mode from #186 keeps its keys. Closes #187
…y-one Three issues caught during review of issue #187 (record + replay): 1. SIGTERM/SIGINT during `all-smi record` was calling the unconditional `std::process::exit(0)` handlers from `run_command`, which raced the recorder's cooperative shutdown and left the output file at zero bytes. Skip the global signal handlers when the subcommand is `Record` so `record::install_signal_handlers` can flush the encoder and call `RotatingWriter::finish()` cleanly. Directly covers the acceptance criterion "SIGTERM during recording closes cleanly with a complete final JSON line." 2. `handle_up_arrow` / `handle_down_arrow` / `handle_page_up` / `handle_page_down` decided local-vs-remote mode from `args.hosts.is_some() || args.hostfile.is_some()`. In `view --replay` neither flag is set, so arrow keys fell into the local-process-list scroll branch even though the UI renders tabs + GPU columns from the recorded data. Include `args.replay.is_some()` in the remote classification so scrolling matches what the user sees. 3. `Replayer::rewind_and_seek_to` set `next_disk_seq = nearest_seq` after skipping past a sparse index frame. Index frames are written AFTER their matching data frame (`record::write_data_frame`), so the first data frame read post-skip has absolute sequence `nearest_seq + 1`. The old code off-by-one-ed every subsequent frame's reported sequence number, corrupting the REPLAY status-bar "frame N / M" display after an index fast-path seek. Added a regression test and a `#[cfg(test)] index_points_seen()` accessor so the test can assert the fast-path actually fired.
Implementation Review SummaryIntent
Findings Addressed (all fixed in 770f829)
Verification
Other Observations (non-blocking)
|
Address security/performance findings on the record+replay pipeline (PR #197 follow-up, issue #187): Writer (src/record/writer.rs): - Open new segments with `O_NOFOLLOW` + `create_new(true)` + mode 0o600 on Unix and `share_mode(0)` on Windows via new `open_secure` helper, mirroring the atomic-snapshot hardening from #185. Blocks the co-tenant symlink preplant attack (`/tmp/all-smi-record.ndjson -> /etc/shadow`). - During rollover, refuse to unlink or rename onto a pre-planted symlink at the rotated path (symlink_metadata check before remove_file). Replay (src/record/replay.rs): - Cap per-line length at 16 MiB (MAX_LINE_BYTES) so a 1 KiB `.zst` cannot decompress to a 10 GiB line; oversized lines are drained, logged, and skipped while the rest of the file stays playable. - Apply the cap to the index-fast-path skip-ahead loop too via `skip_one_line_bounded`. - Call `window_log_max(27)` on the zstd decoder to cap declared frame-window memory at 128 MiB. - Validate index frames at ingest: reject implausible `seq` (≥ u64::MAX/2 or 0), non-monotonic `seq`, index frames that point past observed data (seq ≥ next_disk_seq), and index frames before any data frame (line == 0). Drop offending entries without aborting playback. - Replace `self.next_disk_seq += 1` with checked_add that surfaces a new `ReplayError::SeqOverflow` on wrap, eliminating the debug panic / release silent-wrap on u64 overflow. - Cap header `hosts[]` at MAX_HEADER_HOSTS (1024) at ingest; truncate and warn on overflow. - Introduce a per-tick scan budget (SCAN_BUDGET_PER_TICK = 1024) so a file with millions of malformed lines cannot monopolise the async driver tick; the budget-exhaust path yields via Ok(false) and preserves pause/seek state for the next call. - Switch back-cache from `Vec` with `remove(0)` (O(N)) to `VecDeque` with `pop_front()` (O(1)). Driver (src/view/data_collection/replay_collector.rs): - Tighten speed-multiplier guard: clamp to [0.05, 16.0] and fall back to 1.0 on NaN/+inf/-inf before `Duration::from_secs_f32`, which panics on those inputs. - Replace `debug_assert_eq!` on cursor retreat with a soft tracing warning so an unexpected cursor position cannot abort the replay session. Runner (src/view/runner.rs): - Defensive belt-and-suspenders cap on `initial_state.tabs` length using the replayer's public MAX_HEADER_HOSTS constant. Regression tests added: - Record writer symlink refusal (Unix-only), 0o600 mode, existing-file refusal, rollover symlink refusal. - Oversized-line skip without OOM. - Zstd decoder window cap (file with window_log=28 rejected). - Index frame with seq=u64::MAX rejected at ingest. - Header hosts list truncated at 1024. - next_disk_seq overflow → clean `SeqOverflow` error, no panic.
- README "Recording & Replay" section now documents all four security
invariants: O_NOFOLLOW symlink refusal on the writer, 16 MiB per-line
cap and 128 MiB zstd window ceiling on the reader, and 1024-entry
hosts-list cap.
- Changelog entry for v0.21.0 updated to include record + view --replay,
interactive filter bar, and alert history panel.
- Help pane (h key) conditionally shows replay keybindings when replay
mode is active (state.replay.is_some()), matching the status-bar hint.
- New unit tests:
replay.rs: replayer_scan_budget_prevents_runaway_loop (F6) — verifies
SCAN_BUDGET_PER_TICK stops the read loop from spinning on an all-
garbage file and still reaches the next valid frame.
app_state.rs: cycle_speed_up_advances_ladder,
cycle_speed_down_retreats_ladder, cycle_speed_nan_input_does_not_panic
— cover the speed-ladder stepping and the NaN-safe fallback.
PR Finalization CompleteSummaryTests: Added 4 new tests; all suites green.
Documentation:
Lint/Format: Final test counts (HEAD 9c9a089)
|
Summary
Adds the
all-smi recordsubcommand and extendsall-smi viewwith--replay <file>so operators can capture a metric stream to disk andscrub/step/seek through it later in the exact same TUI they would have
seen live. Intended for post-hoc incident investigation without a
Prometheus retention store.
Closes #187.
Implementation
Shared serializer: new
write_frame_json<W: Write>(&mut W, &Snapshot)in
src/snapshot/serializers/json.rs, re-exported throughsrc/snapshot/serializers/mod.rs. Both thesnapshotsubcommandand
recorduse it, preventing drift.SnapshotgainsDeserializewith
#[serde(default)]on optional section fields so the replayside reconstructs frames even when
--includeomitted a sectionduring recording.
Record pipeline (
src/record/):mod.rs—Recorder::startorchestrator: signal handlers (SIGINT,SIGTERM), interval ticker, header frame, sparse index frame every
1000 data frames. Local and remote sources supported; remote
reuses
RemoteCollectorBuilderfrom theviewscrape path.writer.rs—RotatingWriterhandles.zst/.gzencoders androlls over at
--max-size, keeping at most--max-filessegments.replay.rs—Replayerstreaming cursor with bounded framecache (
FRAME_CACHE_MAX = 1024), sparse-index fast-path forseeks, malformed-line recovery, and exact-match error message for
unsupported schemas.
Replay in the TUI:
src/view/data_collection/replay_collector.rs—ReplayDriverfeeds frames into
AppStatehonoring play/pause/speed/step/seek.No renderer branches on replay mode; only the status bar differs.
src/ui/chrome.rs— newprint_replay_barchip withREPLAY | ts | frame N / M | Xx | state; falls back to thenormal hotkey strip when
replayisNoneonAppState.Filter bar still wins over the replay chip so operators can
filter visible GPUs mid-playback.
src/view/event_handler.rs— strict mode precedence: filter-editmode > replay-timecode editor > normal keys > replay keys. The
gkey opens the timecode editor only when replay is active; theexisting
g → sort by GPU memorybinding is preserved otherwise.CLI:
RecordArgs,RecordSource,RecordCompressionadded tosrc/cli.rs;ViewArgsgainsreplay,speed,start,replay_loop.main.rsdispatchesRecordalongsideSnapshot,and the
Viewbranch short-circuits torun_replay_modewhen--replayis set.Dependencies: adds
zstd = "0.13"andflate2 = "1"at thecrate root (both are needed at all times because the writer and
reader select by file extension at runtime, not by feature flag).
Docs: README gains a "Recording & Replay" section with the
record CLI table, replay keybindings table, and usage examples
(local, zstd, remote hostfile, timecode start, loop).
Testing
cargo testsurface:src/record/writer.rs— codec detection, rotation eviction, plain/ gzip / zstd round-trips.
src/record/replay.rs— prime-first-frame, schema-v2 exact errormessage, step forward/back, seek to offset, gzip decoder, zstd
decoder, corrupted-tail line skip.
src/record/mod.rs—parse_duration/parse_byte_sizesuffix parsing (
30s,5m,1h,1d,1K,10M,2G).src/view/event_handler.rs— 8 new replay-mode key tests: SPACEtoggles pause,
[/]step+pause,+/-cycle speed,j/k±10s seek,
gopens timecode editor and Enter commits,Ltoggles loop, replay keys inert when
replay = None, filtermode wins over replay mode.
tests/record_replay_test.rs— 8 integration tests: plain/gzip/zstdround-trip, corrupted tail skip, schema-v2 rejection, seek with
and without index frames, 50k-frame streaming check.
Runs:
cargo test --lib— 695 passedcargo test --bin all-smi— 781 passed (↑ from 766; +15 new tests)cargo test --test record_replay_test— 8 passedcargo clippy --all-targets -- -D warnings— cleanManual smoke:
all-smi record --output /tmp/smoke.ndjson --duration 5s --interval 1 --compress noneproduced a header + 5 data frames;--output /tmp/smoke.ndjson.zstproduced a zstd file thatzstd -dcdecompresses correctly;
--max-size 1024 --max-files 3capped thetotal on-disk segments at 3 (active + 2 rotated).
Test plan
all-smi recordandall-smi view --replaybinaries.cargo test --lib,cargo test --bin all-smi, and integrationtest all green.
cargo clippy --all-targets -- -D warningsclean..zst,.gz) round-trips.--max-size 1K --max-files 3) evicts oldest.(reviewer to validate keybinding feel + status-bar layout).