feat(view): cluster-wide user/process aggregation tab ('V' key) by inureyes · Pull Request #199 · lablup/all-smi

inureyes · 2026-04-20T15:47:49Z

Summary

Implements the cluster-wide Users tab (issue #189) accessible via the V
hotkey in remote view mode. Aggregates per-process metrics across every
scraped host so operators can answer "who is using the cluster and how
much?" at a glance, with drill-down into per-host and per-process
breakdowns.

Implementation

Aggregator (src/ui/aggregation/user.rs) — pure function
aggregate_users(&[HostSnapshot]) -> UserAggregationResult. Keys by
(host, pid) so the same PID on two hosts counts as two processes.
Computes the VRAM-weighted power approximation with the formula
documented in the issue and clamps the result to >= 0.
Renderer (src/ui/renderers/user_renderer.rs) — top-level table
with selectable rows plus a drill-down view listing per-host VRAM /
power / PID counts, run-length encoded GPU ranges, and the top
command per host.
Metrics exporter (src/api/metrics/process.rs) — extended to
emit all_smi_process_memory_used_bytes,
all_smi_process_start_time_seconds (new, drives the LONGEST
column), and all_smi_process_cpu_percent with pid, user,
gpu_index, device_uuid, command labels.
Remote parser (src/network/metrics_parser.rs) — now parses the
process_* metric family into a Vec<ParsedProcessRow> keyed by
(pid, gpu_index), with a defensive 50 000-row cap.
App state — added UsersTabState (sort, drill path, filter flag,
export toast), memoised UsersAggregationCache keyed on
data_version. RenderSnapshot::capture now takes &mut AppState
so the aggregation is materialised under the lock exactly once per
snapshot version.
Tab plumbing — remote + replay collectors insert "Users" as
tab index 1 (right after All) so it participates in cycling
without breaking the existing All + hosts layout. LED grid skips
the synthetic tab when iterating hostnames.
Event handler — new mode precedence ladder documented at the
top of handle_key_event:
1. Filter-edit (feat(tui): interactive filter query ('/') and threshold alerts for local/remote modes #186)
2. Replay timecode (feat: metric recording ('all-smi record') and TUI replay ('view --replay') #187)
3. Users-tab keys when the tab is active
4. Normal global keys
5. Replay-mode keys
Users-tab keys (u/m/p/n/t/f/e/Enter/ESC) intercept before the
global GPU-sort bindings so pressing m on the tab switches the
users-sort key rather than falling through to MemoryPercent.
CSV export — e writes
~/.cache/all-smi/users-<timestamp>.csv (or the XDG / Windows
equivalent) with RFC-4180 quoting.
Replay — ParsedProcessRow::from_local_process lifts recorded
ProcessInfo into the remote-side representation so replays render
the tab identically to live scrapes.
Mock — ALL_SMI_MOCK_PROCESSES=1 makes every synthetic NVIDIA
node emit 1-4 processes per GPU across a rotating user/command pool
(same env-var convention as ALL_SMI_MOCK_VGPU).
Help + README — help overlay adds a dedicated Users-tab section
and lists V; README gets a full Users-tab section covering the
columns, the power-approximation formula, and the in-tab keybindings.

Testing

cargo test --lib: 759 passed
cargo test --bin all-smi: 864 passed (8 new Users-tab event-handler
tests)
cargo test --test users_tab_integration_test: 3 passed — end-to-end
exporter -> parser -> aggregator round-trip on a 5x3 fixture,
sort-stability regression, replay-pipeline round-trip.
cargo clippy --all-targets -- -D warnings: clean
cargo fmt --all -- --check: clean
Aggregator unit tests cover empty input, same-PID-on-two-hosts,
root/system-UID filter, multi-GPU multi-host fan-out, oldest TIME+,
partial-coverage detection, negative-power clamp, per-host
breakdown invariant, and a 100x50 stress run under the <50 ms
budget.
Event-handler tests cover V jump, users-tab m/f intercepts,
Enter/ESC drill-down, filter-mode precedence over tab keys, and
Up/Down row navigation bounds.

Test plan

Start a mock cluster with ALL_SMI_MOCK_PROCESSES=1 and
verify the Users tab renders correctly with multiple users / hosts.
Press V to jump to the tab, confirm sort keys re-rank the
table, and that Enter drills into a user / ESC returns.
Hit e and verify the CSV file lands at
~/.cache/all-smi/users-<ts>.csv with the expected header and rows.
Stop --processes on a subset of hosts and confirm the
partial-coverage chip surfaces.
Replay a recording and confirm the Users tab is populated.

Closes #189

Add a new Users tab to remote `view` mode that groups per-process metrics across every scraped host so operators can answer "who is using the cluster and how much?" at a glance. Accessed via `V` or tab cycling. Columns: USER, NODES, GPUs, PROCS, VRAM, POWER*, LONGEST, CMD. POWER* is a VRAM-weighted approximation of the per-user power share, clamped to >= 0 and marked in the header. In-tab keys: `u/m/p/n/t` sort, `f` hide system accounts, `e` export to `~/.cache/all-smi/users-<timestamp>.csv`, `Enter` drill-down per-host (second Enter drills into processes on that host), `ESC` backs out. Mode precedence: filter-edit (#186) > replay timecode (#187) > Users-tab keys > global navigation > replay-mode keys. Users-tab keys intercept before the outer match so `m/u/p/f` on the tab switch the users-sort key rather than falling through to the global GPU-sort bindings. Partial-coverage chip surfaces when only a subset of hosts emit the `all_smi_process_*` families; hosts with no process rows still count toward the denominator. Replay frames feed through `ParsedProcessRow::from_local_process` so recorded sessions render the tab identically to live scrapes. Mock hosts emit synthetic processes when `ALL_SMI_MOCK_PROCESSES=1`, following the same env-var convention as `ALL_SMI_MOCK_VGPU`. Closes #189

Address findings from the security review of the Users tab introduction: CSV export (src/view/event_handler.rs): - Replace std::fs::write with OpenOptions + O_NOFOLLOW + 0o600 (Unix) / share_mode(0) + create_new (Windows) so a pre-planted symlink at ~/.cache/all-smi/ or the final CSV path cannot redirect the write to an attacker-chosen location. Mirrors the hardening already in snapshot/record/doctor. - Refuse to proceed when the cache dir itself is a symlink, closing the gap where create_dir_all no-ops on an existing symlinked directory. - Extend csv_escape with an OWASP CSV-injection guard: fields whose first character is =, +, -, @, TAB, or CR are quoted and prefixed with a single quote so Excel / LibreOffice / Google Sheets do not evaluate them as formulas. - Add regression tests for symlink refusal, 0o600 file mode, RFC-4180 quoting, and leading-character injection vectors. Process metric exporter (src/api/metrics/process.rs): - Cap command, name, and user label values at 256 / 128 / 128 bytes with a UTF-8-safe truncation helper that appends "...(N bytes truncated)" so the truncation is visible to operators. Mitigates both scrape-response amplification (long argv × 3 metric families × N processes × M hosts) and the privacy risk of API tokens / DB URLs on command lines being broadcast to every Prometheus consumer. Remote metrics parser (src/network/metrics_parser.rs): - Apply matching 256 / 128-byte caps to the command / name / user / gpu_uuid fields inside process_process_metrics so a malicious remote host that bypasses our exporter cannot inflate the per-host process accumulator to ~150 MB (1024-byte generic cap × 50 000-row cap × 3 families). Truncation honours UTF-8 boundaries. Users aggregator (src/ui/aggregation/user.rs): - Collapse the double-HashMap-lookup pattern in aggregate_users and UserScratch::absorb into a single entry(...).or_insert(0) + &mut accumulation, eliminating ~5 000 redundant hashes + string clones per scrape tick on a 100-node cluster. - Pre-size total_vram_by_gpu from the combined GPU count across all snapshots to remove rehash allocations on the hot path. - Add an ignored adversarial stress test (10 hosts × 50 000 rows = 500 000 rows) to catch future quadratic regressions; release-mode completion measured at ~0.5 s.

inureyes · 2026-04-20T16:02:01Z

Security + performance review

Addressed 4 findings with fixes in commit c5dc1b7.

Findings and fixes

CRITICAL — CSV export path lacked symlink/TOCTOU hardening
src/view/event_handler.rs:export_users_csv used std::fs::write, which follows symlinks. A co-tenant on a multi-user host could plant ~/.cache/all-smi (or the final users-<ts>.csv file) as a symlink to any location the user can write. Snapshot (#185), record (#197), and doctor (#198) already use O_NOFOLLOW + 0o600 for exactly this reason; the new CSV export regressed that posture.

Fix: OpenOptions::new().create_new(true).custom_flags(O_NOFOLLOW).mode(0o600) on Unix, share_mode(0) + create_new(true) on Windows, plus a defense-in-depth check that refuses to proceed if the cache dir itself is a symlink (since create_dir_all no-ops on an existing symlinked directory).
Regression tests: open_export_secure_refuses_symlinks, open_export_secure_sets_owner_only_mode.

HIGH — CSV injection (Excel/LibreOffice formula execution)
csv_escape did RFC-4180 quoting but did not guard against the OWASP CSV-injection pattern. A username/command like =cmd|"/c calc"!A1, +SUM(A1), or @DDE(...) would be written verbatim and executed as a formula when an operator opens the CSV in Excel or LibreOffice Calc. User names and commands both flow in from untrusted remote hosts.

Fix: when the field's first character is =, +, -, @, TAB, or CR, csv_escape now wraps the value in quotes and prefixes a single quote (') so spreadsheets treat it as text.
Regression tests: 5 tests covering =, +, -, @, TAB, CR prefixes plus RFC-4180 behaviour on benign inputs.

HIGH — command Prometheus label not length-capped (privacy + amplification)
src/api/metrics/process.rs emitted command, user, and name labels at their full runtime length. Scrape-response amplification: a 4 KiB argv × 3 metric families × N processes × M hosts rapidly bloats the HTTP response. Privacy: command lines regularly contain API tokens and DB URLs which were broadcast to every Prometheus consumer.

Fix: cap command at 256 B, user at 128 B, name at 128 B with a UTF-8-safe truncation helper that appends "...(N bytes truncated)" so operators see that the string was clipped. The remote parser (src/network/metrics_parser.rs) mirrors these caps so a malicious remote host bypassing our exporter cannot inflate the per-host accumulator (50 000-row cap × 1024 B × 3 families ≈ 150 MB was reduced to ~45 MB).
Regression tests: 4 tests covering pass-through, truncation marker, UTF-8 boundary safety, and end-to-end exporter output.

MEDIUM — Aggregation hot-path double-lookups + redundant clones
aggregate_users Pass 1 and UserScratch::absorb both used the pattern *map.entry(k).or_insert(0) = map.get(&k).copied().unwrap_or(0).saturating_add(...) which does two hashmap lookups and two host.clone() calls per row. For the spec-max cluster (100 hosts × 50 procs = 5 000 rows) that is ~10 000 redundant operations per scrape tick on the single-threaded UI path.

Fix: collapse to a single let entry = map.entry(k).or_insert(0); *entry = entry.saturating_add(...); and pre-size total_vram_by_gpu from the combined GPU count.
Regression tests: existing aggregation_handles_large_cluster_quickly covers correctness; added an #[ignore]d adversarial stress test aggregation_survives_adversarial_50k_per_host (10 × 50 000 rows = 500 000 rows) for future quadratic-regression detection. Release-mode timing: ~0.5 s for 500 000 rows.

Confirmed non-findings

ALL_SMI_MOCK_PROCESSES env gate: used only in src/mock/templates/nvidia.rs (invoked from the mock-server template path). No prod exposure.
50 000-row cap is per-host (not per-tick): confirmed. MetricsParser::parse_metrics is called per host, and MAX_TEXT_SIZE = 10 MB per host bounds the cluster-wide memory regardless of cap.
MetricBuilder label escaping: unchanged. Already strips control chars + escapes \\, ", \n, \r.

Test results

cargo test --lib: 763 passed, 1 ignored
cargo test --bin all-smi: 875 passed, 1 ignored
cargo test --test users_tab_integration_test: 3 passed
cargo clippy --all-targets -- -D warnings: clean
cargo fmt --all -- --check: clean
Adversarial stress (release): 500 000 rows aggregated in ~0.46 s

…artial chip Addresses five pr-reviewer findings on the Users tab (issue #189): - F1 (CRITICAL): A second `Enter` on the per-host table now renders the full process list for `(drill_user, drill_host)`, matching the issue-#189 spec. Before this, `drill_host` was set but nothing was drawn, leaving the operator stuck on the per-host view. ESC peels back one level at a time: first clears `drill_host` (return to per-host table), then clears `drill_user` (return to top table). - F3: Split the single `data_version` counter into two. `data_version` now covers "any UI dirtiness" so the render loop wakes on sort / filter / drill key presses, while the new `collector_data_version` only advances when collectors push fresh data. The Users-tab aggregation cache keys on the collector counter, so typing a sort hotkey no longer re-groups the 5 000-row cluster on every keystroke. - F4: Partial-coverage chip false-triggered on idle-but-configured hosts (connected, `--processes` enabled, zero processes running). `HostSnapshot` now carries an `is_connected` flag; a host counts as "reporting" when it has processes OR is connected. Only genuinely disconnected hosts drag the chip denominator down. - F5: Replays of local recordings collapsed every GPU onto `gpu_index = 0` because local readers don't populate the `index` detail key. Aggregation now falls back to the GPU's positional order within its host-group when the detail key is absent. - F6: Banner height above the Users-tab body is now computed dynamically from the number of visible chips (summary + optional partial chip + optional export toast + header), instead of the hardcoded approximation that sometimes ate a row from the body. Tests added: - `users_tab_enter_twice_drills_into_per_host_processes` (F1) - `users_tab_sort_keypress_does_not_invalidate_aggregation_cache` (F3) - `idle_connected_host_is_not_flagged_as_partial` (F4) - `users_aggregation_assigns_positional_gpu_index_when_detail_missing` (F5) All existing tests still pass: 764 lib + 879 bin + 3 integration. Clippy + fmt clean.

Add three security notes to the README Security notes block documenting the hardening that shipped with the Users tab: - CSV export O_NOFOLLOW + mode 0600 symlink refusal (Unix) / share_mode(0) (Windows) - Formula injection mitigation for user/command fields via RFC 4180 quoting + single-quote prefix - Per-field label caps in the remote Prometheus parser (user: 128 B, command: 256 B, global: 100 labels / 1024 B each) Add a Users tab entry at the top of the v0.21.0 (upcoming) changelog line covering the V hotkey, all in-tab keys (u/m/p/n/t/Enter/ESC/e/f), power approximation, CSV export path, partial-coverage chip, and the security hardening.

inureyes · 2026-04-20T16:22:25Z

PR Finalization Complete

Summary

Tests: verified passing — 764 lib + 879 bin + 3 integration; all in-scope unit tests (user aggregation, sort stability, power weighting, partial-coverage chip, drill-down invariants, CSV injection guard) already present.

Documentation: updated README.md with:

Three new items in the Security notes block (under Filtering & Alerts) covering:
- Users tab CSV export (e) O_NOFOLLOW + mode 0600 symlink refusal on Unix / share_mode(0) on Windows
- Formula injection mitigation in CSV cells (user/command fields via RFC 4180 quoting + single-quote prefix)
- Per-field label caps in the remote Prometheus parser (user: 128 B, command: 256 B; global: 100 labels / 1024 B per key-value)
Users tab entry prepended to the v0.21.0 (upcoming) changelog line, covering V hotkey, all in-tab keys (u/m/p/n/t/Enter/ESC/e/f), power approximation, CSV export path, partial-coverage chip, and the security hardening

The Users tab section itself (V hotkey, in-tab keys, power approximation formula with * caveat, CSV export path ~/.cache/all-smi/users-\<timestamp\>.csv, partial-coverage chip) and the help overlay (src/ui/help.rs) listings for V and all in-tab keys were already complete and required no changes.

Lint/Format: cargo fmt clean, cargo clippy -D warnings clean.

Final HEAD: f42a18c

inureyes added type:enhancement New feature or request priority:medium Medium priority issue status:review Under review labels Apr 20, 2026

inureyes added 2 commits April 21, 2026 01:17

inureyes added status:done Completed and removed status:review Under review labels Apr 20, 2026

inureyes merged commit cede28e into main Apr 20, 2026
4 checks passed

inureyes deleted the feat/189-users-tab branch April 20, 2026 16:30

inureyes self-assigned this May 24, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(view): cluster-wide user/process aggregation tab ('V' key)#199

feat(view): cluster-wide user/process aggregation tab ('V' key)#199
inureyes merged 4 commits into
mainfrom
feat/189-users-tab

inureyes commented Apr 20, 2026

Uh oh!

inureyes commented Apr 20, 2026

Uh oh!

inureyes commented Apr 20, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

inureyes commented Apr 20, 2026

Summary

Implementation

Testing

Test plan

Uh oh!

inureyes commented Apr 20, 2026

Security + performance review

Findings and fixes

Confirmed non-findings

Test results

Uh oh!

inureyes commented Apr 20, 2026

PR Finalization Complete

Summary

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant