feat(view): cluster-wide user/process aggregation tab ('V' key) #199
Add a new Users tab to remote `view` mode that groups per-process metrics across every scraped host so operators can answer "who is using the cluster and how much?" at a glance. Accessed via `V` or tab cycling.

Columns: USER, NODES, GPUs, PROCS, VRAM, POWER*, LONGEST, CMD. POWER* is a VRAM-weighted approximation of the per-user power share, clamped to >= 0 and marked in the header.

In-tab keys: `u/m/p/n/t` sort, `f` hide system accounts, `e` export to `~/.cache/all-smi/users-<timestamp>.csv`, `Enter` drill-down per-host (second Enter drills into processes on that host), `ESC` backs out.

Mode precedence: filter-edit (#186) > replay timecode (#187) > Users-tab keys > global navigation > replay-mode keys. Users-tab keys intercept before the outer match so `m/u/p/f` on the tab switch the users-sort key rather than falling through to the global GPU-sort bindings.

Partial-coverage chip surfaces when only a subset of hosts emit the `all_smi_process_*` families; hosts with no process rows still count toward the denominator.

Replay frames feed through `ParsedProcessRow::from_local_process` so recorded sessions render the tab identically to live scrapes.

Mock hosts emit synthetic processes when `ALL_SMI_MOCK_PROCESSES=1`, following the same env-var convention as `ALL_SMI_MOCK_VGPU`.

Closes #189
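The POWER* column can be illustrated with a small sketch. The exact formula lives in issue #189 and is not reproduced in this message, so the weighting below (each GPU's power draw split in proportion to the VRAM each user holds on that GPU) is an assumption. The names `GpuSample` and `power_share_by_user` are hypothetical and do not come from the all-smi source.

```rust
use std::collections::HashMap;

/// Hypothetical per-GPU sample; not the real all-smi types.
struct GpuSample {
    /// Board power draw in watts as reported for this GPU.
    power_watts: f64,
    /// (user, VRAM bytes used on this GPU), one entry per process.
    procs: Vec<(String, u64)>,
}

/// Assumed weighting: a user's share of a GPU's power is proportional to
/// the VRAM that user's processes hold on that GPU. Each share is clamped
/// to >= 0 so a bogus negative reading cannot produce a negative total.
fn power_share_by_user(gpus: &[GpuSample]) -> HashMap<String, f64> {
    let mut shares: HashMap<String, f64> = HashMap::new();
    for gpu in gpus {
        let total_vram: u64 = gpu.procs.iter().map(|(_, vram)| *vram).sum();
        if total_vram == 0 {
            continue; // nothing to weight by on this GPU
        }
        for (user, vram) in &gpu.procs {
            let share = gpu.power_watts * (*vram as f64) / (total_vram as f64);
            *shares.entry(user.clone()).or_insert(0.0) += share.max(0.0);
        }
    }
    shares
}
```

Because the weights come from VRAM rather than measured per-process power, the result is only an approximation, which is why the column header carries the asterisk.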
Address findings from the security review of the Users tab introduction:

CSV export (src/view/event_handler.rs):
- Replace std::fs::write with OpenOptions + O_NOFOLLOW + 0o600 (Unix) / share_mode(0) + create_new (Windows) so a pre-planted symlink at ~/.cache/all-smi/ or the final CSV path cannot redirect the write to an attacker-chosen location. Mirrors the hardening already in snapshot/record/doctor.
- Refuse to proceed when the cache dir itself is a symlink, closing the gap where create_dir_all no-ops on an existing symlinked directory.
- Extend csv_escape with an OWASP CSV-injection guard: fields whose first character is =, +, -, @, TAB, or CR are quoted and prefixed with a single quote so Excel / LibreOffice / Google Sheets do not evaluate them as formulas.
- Add regression tests for symlink refusal, 0o600 file mode, RFC-4180 quoting, and leading-character injection vectors.

Process metric exporter (src/api/metrics/process.rs):
- Cap command, name, and user label values at 256 / 128 / 128 bytes with a UTF-8-safe truncation helper that appends "...(N bytes truncated)" so the truncation is visible to operators. Mitigates both scrape-response amplification (long argv × 3 metric families × N processes × M hosts) and the privacy risk of API tokens / DB URLs on command lines being broadcast to every Prometheus consumer.

Remote metrics parser (src/network/metrics_parser.rs):
- Apply matching 256 / 128-byte caps to the command / name / user / gpu_uuid fields inside process_process_metrics so a malicious remote host that bypasses our exporter cannot inflate the per-host process accumulator to ~150 MB (1024-byte generic cap × 50 000-row cap × 3 families). Truncation honours UTF-8 boundaries.

Users aggregator (src/ui/aggregation/user.rs):
- Collapse the double-HashMap-lookup pattern in aggregate_users and UserScratch::absorb into a single entry(...).or_insert(0) + &mut accumulation, eliminating ~5 000 redundant hashes + string clones per scrape tick on a 100-node cluster.
- Pre-size total_vram_by_gpu from the combined GPU count across all snapshots to remove rehash allocations on the hot path.
- Add an ignored adversarial stress test (10 hosts × 50 000 rows = 500 000 rows) to catch future quadratic regressions; release-mode completion measured at ~0.5 s.
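The CSV-injection guard described in this commit can be sketched as follows. This is a minimal illustration built only from the description above (RFC-4180 quoting plus a single-quote prefix for fields starting with `=`, `+`, `-`, `@`, TAB, or CR). It is not the actual body of `csv_escape` in `src/view/event_handler.rs`, and the real function may differ in detail.

```rust
/// Sketch of CSV field escaping with a formula-injection guard.
/// Assumption: the single-quote prefix is placed inside the quoted field.
fn csv_escape(field: &str) -> String {
    // Leading characters that spreadsheet apps may treat as a formula start.
    let risky = matches!(
        field.chars().next(),
        Some('=' | '+' | '-' | '@' | '\t' | '\r')
    );
    // RFC 4180: quote fields containing a comma, a double quote, or a line break.
    let needs_quotes =
        risky || field.contains(|c: char| matches!(c, ',' | '"' | '\n' | '\r'));
    if !needs_quotes {
        return field.to_string();
    }
    let mut out = String::with_capacity(field.len() + 3);
    out.push('"');
    if risky {
        out.push('\''); // neutralise the formula trigger
    }
    for c in field.chars() {
        if c == '"' {
            out.push('"'); // RFC 4180: double any embedded quote
        }
        out.push(c);
    }
    out.push('"');
    out
}
```

With this sketch, `csv_escape("=1+1")` yields `"'=1+1"` (including the surrounding double quotes), while a plain value such as `alice` passes through unchanged.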
Security + performance review
Addressed 4 findings with fixes in commit c5dc1b7.
Findings and fixes
CRITICAL: CSV export path lacked symlink/TOCTOU hardening
HIGH: CSV injection (Excel/LibreOffice formula execution)
HIGH:
MEDIUM: Aggregation hot-path double-lookups + redundant clones
Confirmed non-findings
Test results
…artial chip

Addresses five pr-reviewer findings on the Users tab (issue #189):
- F1 (CRITICAL): A second `Enter` on the per-host table now renders the full process list for `(drill_user, drill_host)`, matching the issue-#189 spec. Before this, `drill_host` was set but nothing was drawn, leaving the operator stuck on the per-host view. ESC peels back one level at a time: first clears `drill_host` (return to per-host table), then clears `drill_user` (return to top table).
- F3: Split the single `data_version` counter into two. `data_version` now covers "any UI dirtiness" so the render loop wakes on sort / filter / drill key presses, while the new `collector_data_version` only advances when collectors push fresh data. The Users-tab aggregation cache keys on the collector counter, so typing a sort hotkey no longer re-groups the 5 000-row cluster on every keystroke.
- F4: Partial-coverage chip false-triggered on idle-but-configured hosts (connected, `--processes` enabled, zero processes running). `HostSnapshot` now carries an `is_connected` flag; a host counts as "reporting" when it has processes OR is connected. Only genuinely disconnected hosts drag the chip denominator down.
- F5: Replays of local recordings collapsed every GPU onto `gpu_index = 0` because local readers don't populate the `index` detail key. Aggregation now falls back to the GPU's positional order within its host-group when the detail key is absent.
- F6: Banner height above the Users-tab body is now computed dynamically from the number of visible chips (summary + optional partial chip + optional export toast + header), instead of the hardcoded approximation that sometimes ate a row from the body.

Tests added:
- `users_tab_enter_twice_drills_into_per_host_processes` (F1)
- `users_tab_sort_keypress_does_not_invalidate_aggregation_cache` (F3)
- `idle_connected_host_is_not_flagged_as_partial` (F4)
- `users_aggregation_assigns_positional_gpu_index_when_detail_missing` (F5)

All existing tests still pass: 764 lib + 879 bin + 3 integration. Clippy + fmt clean.
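The F3 counter split can be pictured with a small memoisation sketch. The counter names `data_version` and `collector_data_version` come from the commit message; the types `Versions` and `AggregationCache` and their methods are hypothetical, and the choice to bump both counters on a collector push is an assumption about how "any UI dirtiness" is tracked.

```rust
/// Hypothetical pair of dirtiness counters mirroring the F3 description.
struct Versions {
    data_version: u64,           // any UI dirtiness (keys or fresh data)
    collector_data_version: u64, // only fresh collector data
}

impl Versions {
    fn on_key_press(&mut self) {
        self.data_version += 1; // wake the render loop only
    }
    fn on_collector_push(&mut self) {
        // Assumption: fresh data also counts as UI dirtiness.
        self.data_version += 1;
        self.collector_data_version += 1;
    }
}

/// Hypothetical memo keyed on the collector counter alone.
struct AggregationCache<T> {
    version: Option<u64>,
    value: Option<T>,
}

impl<T> AggregationCache<T> {
    fn new() -> Self {
        Self { version: None, value: None }
    }

    /// Recompute only when `collector_version` differs from the cached one.
    fn get_or_compute(&mut self, collector_version: u64, compute: impl FnOnce() -> T) -> &T {
        if self.version != Some(collector_version) || self.value.is_none() {
            self.value = Some(compute());
            self.version = Some(collector_version);
        }
        self.value.as_ref().expect("value was just populated")
    }
}
```

In this sketch a sort keypress advances only `data_version`, so the next `get_or_compute` call returns the cached aggregation without re-grouping.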
Add three security notes to the README Security notes block documenting the hardening that shipped with the Users tab:
- CSV export O_NOFOLLOW + mode 0600 symlink refusal (Unix) / share_mode(0) (Windows)
- Formula injection mitigation for user/command fields via RFC 4180 quoting + single-quote prefix
- Per-field label caps in the remote Prometheus parser (user: 128 B, command: 256 B, global: 100 labels / 1024 B each)

Add a Users tab entry at the top of the v0.21.0 (upcoming) changelog line covering the V hotkey, all in-tab keys (u/m/p/n/t/Enter/ESC/e/f), power approximation, CSV export path, partial-coverage chip, and the security hardening.
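The per-field label caps noted above depend on truncating at a UTF-8 character boundary. The helper below is a sketch of that idea, assuming the "...(N bytes truncated)" marker format quoted in the security-review commit. The name `truncate_label` is hypothetical, and whether the real cap includes the marker's own length is not stated, so here the marker is appended after the capped prefix.

```rust
/// Truncate `s` to at most `max_bytes` bytes of original content, cutting
/// only on a UTF-8 character boundary, and append a visible marker.
fn truncate_label(s: &str, max_bytes: usize) -> String {
    if s.len() <= max_bytes {
        return s.to_string();
    }
    // Walk back until the cut lands on a char boundary (index 0 always is).
    let mut cut = max_bytes;
    while !s.is_char_boundary(cut) {
        cut -= 1;
    }
    format!("{}...({} bytes truncated)", &s[..cut], s.len() - cut)
}
```

For example, `truncate_label("héllo", 2)` cannot split the two-byte `é`, so it keeps only `h` and reports 5 bytes truncated.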
PR Finalization Complete
Summary
Tests: verified passing (764 lib + 879 bin + 3 integration); all in-scope unit tests (user aggregation, sort stability, power weighting, partial-coverage chip, drill-down invariants, CSV injection guard) already present.
Documentation: updated
The Users tab section itself (V hotkey, in-tab keys, power approximation formula with
Lint/Format: cargo fmt clean, cargo clippy -D warnings clean.
Final HEAD: f42a18c
Summary
Implements the cluster-wide Users tab (issue #189) accessible via the `V` hotkey in remote `view` mode. Aggregates per-process metrics across every scraped host so operators can answer "who is using the cluster and how much?" at a glance, with drill-down into per-host and per-process breakdowns.
Implementation
- Aggregator (`src/ui/aggregation/user.rs`): pure function `aggregate_users(&[HostSnapshot]) -> UserAggregationResult`. Keys by `(host, pid)` so the same PID on two hosts counts as two processes. Computes the VRAM-weighted power approximation with the formula documented in the issue and clamps the result to >= 0.
- Renderer (`src/ui/renderers/user_renderer.rs`): top-level table with selectable rows plus a drill-down view listing per-host VRAM / power / PID counts, run-length encoded GPU ranges, and the top command per host.
- Metrics exporter (`src/api/metrics/process.rs`): extended to emit `all_smi_process_memory_used_bytes`, `all_smi_process_start_time_seconds` (new, drives the `LONGEST` column), and `all_smi_process_cpu_percent` with `pid`, `user`, `gpu_index`, `device_uuid`, `command` labels.
- Remote parser (`src/network/metrics_parser.rs`): now parses the `process_*` metric family into a `Vec<ParsedProcessRow>` keyed by `(pid, gpu_index)`, with a defensive 50 000-row cap.
- App state: added `UsersTabState` (sort, drill path, filter flag, export toast), memoised `UsersAggregationCache` keyed on `data_version`. `RenderSnapshot::capture` now takes `&mut AppState` so the aggregation is materialised under the lock exactly once per snapshot version.
- Tab plumbing: remote + replay collectors insert `"Users"` as tab index 1 (right after `All`) so it participates in cycling without breaking the existing `All + hosts` layout. LED grid skips the synthetic tab when iterating hostnames.
- Event handler: new mode precedence ladder documented at the top of `handle_key_event`. Users-tab keys (`u/m/p/n/t/f/e/Enter/ESC`) intercept before the global GPU-sort bindings so pressing `m` on the tab switches the users-sort key rather than falling through to `MemoryPercent`.
- CSV export: `e` writes `~/.cache/all-smi/users-<timestamp>.csv` (or the XDG / Windows equivalent) with RFC-4180 quoting.
- Replay: `ParsedProcessRow::from_local_process` lifts recorded `ProcessInfo` into the remote-side representation so replays render the tab identically to live scrapes.
- Mock: `ALL_SMI_MOCK_PROCESSES=1` makes every synthetic NVIDIA node emit 1-4 processes per GPU across a rotating user/command pool (same env-var convention as `ALL_SMI_MOCK_VGPU`).
- Help + README: help overlay adds a dedicated Users-tab section and lists `V`; README gets a full Users-tab section covering the columns, the power-approximation formula, and the in-tab keybindings.
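The "run-length encoded GPU ranges" mentioned for the renderer's drill-down view can be sketched as below. The function name `format_gpu_ranges` and the `0-3,5` output style are assumptions for illustration; the actual renderer may format ranges differently.

```rust
/// Collapse a set of GPU indices into compact ranges, e.g. "0-3,5,7-8".
/// Input may be unsorted and may contain duplicates.
fn format_gpu_ranges(indices: &[u32]) -> String {
    let mut sorted = indices.to_vec();
    sorted.sort_unstable();
    sorted.dedup();

    let mut parts: Vec<String> = Vec::new();
    let mut i = 0;
    while i < sorted.len() {
        let start = sorted[i];
        let mut end = start;
        // Extend the run while the next index is exactly one higher.
        while i + 1 < sorted.len() && sorted[i + 1] == end + 1 {
            i += 1;
            end = sorted[i];
        }
        parts.push(if start == end {
            start.to_string()
        } else {
            format!("{start}-{end}")
        });
        i += 1;
    }
    parts.join(",")
}
```

A user holding GPUs 0 through 3 plus GPU 5 on one host would then display as `0-3,5` instead of five separate indices, which keeps the per-host row narrow.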
Testing
- `cargo test --lib`: 759 passed
- `cargo test --bin all-smi`: 864 passed (8 new Users-tab event-handler tests)
- `cargo test --test users_tab_integration_test`: 3 passed. End-to-end exporter -> parser -> aggregator round-trip on a 5x3 fixture, sort-stability regression, replay-pipeline round-trip.
- `cargo clippy --all-targets -- -D warnings`: clean
- `cargo fmt --all -- --check`: clean
- root/system-UID filter, multi-GPU multi-host fan-out, oldest TIME+, partial-coverage detection, negative-power clamp, per-host breakdown invariant, and a 100x50 stress run under the <50 ms budget.
- `V` jump, users-tab `m`/`f` intercepts, `Enter`/ESC drill-down, filter-mode precedence over tab keys, and Up/Down row navigation bounds.
Test plan
- Run with `ALL_SMI_MOCK_PROCESSES=1` and verify the Users tab renders correctly with multiple users / hosts.
- Press `V` to jump to the tab, confirm sort keys re-rank the table, and that `Enter` drills into a user / `ESC` returns.
- Press `e` and verify the CSV file lands at `~/.cache/all-smi/users-<ts>.csv` with the expected header and rows.
- Enable `--processes` on only a subset of hosts and confirm the partial-coverage chip surfaces.
Closes #189