Skip to content

feat(view): cluster-wide user/process aggregation tab ('V' key)#199

Merged
inureyes merged 4 commits into
mainfrom
feat/189-users-tab
Apr 20, 2026
Merged

feat(view): cluster-wide user/process aggregation tab ('V' key)#199
inureyes merged 4 commits into
mainfrom
feat/189-users-tab

Conversation

@inureyes

Copy link
Copy Markdown
Member

Summary

Implements the cluster-wide Users tab (issue #189) accessible via the V
hotkey in remote view mode. Aggregates per-process metrics across every
scraped host so operators can answer "who is using the cluster and how
much?" at a glance, with drill-down into per-host and per-process
breakdowns.

Implementation

  • Aggregator (src/ui/aggregation/user.rs) — pure function
    aggregate_users(&[HostSnapshot]) -> UserAggregationResult. Keys by
    (host, pid) so the same PID on two hosts counts as two processes.
    Computes the VRAM-weighted power approximation with the formula
    documented in the issue and clamps the result to >= 0.

  • Renderer (src/ui/renderers/user_renderer.rs) — top-level table
    with selectable rows plus a drill-down view listing per-host VRAM /
    power / PID counts, run-length encoded GPU ranges, and the top
    command per host.

  • Metrics exporter (src/api/metrics/process.rs) — extended to
    emit all_smi_process_memory_used_bytes,
    all_smi_process_start_time_seconds (new, drives the LONGEST
    column), and all_smi_process_cpu_percent with pid, user,
    gpu_index, device_uuid, command labels.

  • Remote parser (src/network/metrics_parser.rs) — now parses the
    process_* metric family into a Vec<ParsedProcessRow> keyed by
    (pid, gpu_index), with a defensive 50 000-row cap.

  • App state — added UsersTabState (sort, drill path, filter flag,
    export toast), memoised UsersAggregationCache keyed on
    data_version. RenderSnapshot::capture now takes &mut AppState
    so the aggregation is materialised under the lock exactly once per
    snapshot version.

  • Tab plumbing — remote + replay collectors insert "Users" as
    tab index 1 (right after All) so it participates in cycling
    without breaking the existing All + hosts layout. LED grid skips
    the synthetic tab when iterating hostnames.

  • Event handler — new mode precedence ladder documented at the
    top of handle_key_event:

    1. Filter-edit (feat(tui): interactive filter query ('/') and threshold alerts for local/remote modes #186)
    2. Replay timecode (feat: metric recording ('all-smi record') and TUI replay ('view --replay') #187)
    3. Users-tab keys when the tab is active
    4. Normal global keys
    5. Replay-mode keys

    Users-tab keys (u/m/p/n/t/f/e/Enter/ESC) intercept before the
    global GPU-sort bindings so pressing m on the tab switches the
    users-sort key rather than falling through to MemoryPercent.

  • CSV exporte writes
    ~/.cache/all-smi/users-<timestamp>.csv (or the XDG / Windows
    equivalent) with RFC-4180 quoting.

  • ReplayParsedProcessRow::from_local_process lifts recorded
    ProcessInfo into the remote-side representation so replays render
    the tab identically to live scrapes.

  • MockALL_SMI_MOCK_PROCESSES=1 makes every synthetic NVIDIA
    node emit 1-4 processes per GPU across a rotating user/command pool
    (same env-var convention as ALL_SMI_MOCK_VGPU).

  • Help + README — help overlay adds a dedicated Users-tab section
    and lists V; README gets a full Users-tab section covering the
    columns, the power-approximation formula, and the in-tab keybindings.

Testing

  • cargo test --lib: 759 passed
  • cargo test --bin all-smi: 864 passed (8 new Users-tab event-handler
    tests)
  • cargo test --test users_tab_integration_test: 3 passed — end-to-end
    exporter -> parser -> aggregator round-trip on a 5x3 fixture,
    sort-stability regression, replay-pipeline round-trip.
  • cargo clippy --all-targets -- -D warnings: clean
  • cargo fmt --all -- --check: clean
  • Aggregator unit tests cover empty input, same-PID-on-two-hosts,
    root/system-UID filter, multi-GPU multi-host fan-out, oldest TIME+,
    partial-coverage detection, negative-power clamp, per-host
    breakdown invariant, and a 100x50 stress run under the <50 ms
    budget.
  • Event-handler tests cover V jump, users-tab m/f intercepts,
    Enter/ESC drill-down, filter-mode precedence over tab keys, and
    Up/Down row navigation bounds.

Test plan

  • Start a mock cluster with ALL_SMI_MOCK_PROCESSES=1 and
    verify the Users tab renders correctly with multiple users / hosts.
  • Press V to jump to the tab, confirm sort keys re-rank the
    table, and that Enter drills into a user / ESC returns.
  • Hit e and verify the CSV file lands at
    ~/.cache/all-smi/users-<ts>.csv with the expected header and rows.
  • Stop --processes on a subset of hosts and confirm the
    partial-coverage chip surfaces.
  • Replay a recording and confirm the Users tab is populated.

Closes #189

Add a new Users tab to remote `view` mode that groups per-process metrics
across every scraped host so operators can answer "who is using the
cluster and how much?" at a glance. Accessed via `V` or tab cycling.

Columns: USER, NODES, GPUs, PROCS, VRAM, POWER*, LONGEST, CMD.
POWER* is a VRAM-weighted approximation of the per-user power share,
clamped to >= 0 and marked in the header.

In-tab keys: `u/m/p/n/t` sort, `f` hide system accounts, `e` export to
`~/.cache/all-smi/users-<timestamp>.csv`, `Enter` drill-down per-host
(second Enter drills into processes on that host), `ESC` backs out.

Mode precedence: filter-edit (#186) > replay timecode (#187) > Users-tab
keys > global navigation > replay-mode keys. Users-tab keys intercept
before the outer match so `m/u/p/f` on the tab switch the users-sort key
rather than falling through to the global GPU-sort bindings.

Partial-coverage chip surfaces when only a subset of hosts emit the
`all_smi_process_*` families; hosts with no process rows still count
toward the denominator. Replay frames feed through
`ParsedProcessRow::from_local_process` so recorded sessions render the
tab identically to live scrapes.

Mock hosts emit synthetic processes when `ALL_SMI_MOCK_PROCESSES=1`,
following the same env-var convention as `ALL_SMI_MOCK_VGPU`.

Closes #189
@inureyes inureyes added type:enhancement New feature or request priority:medium Medium priority issue status:review Under review labels Apr 20, 2026
Address findings from the security review of the Users tab introduction:

CSV export (src/view/event_handler.rs):
- Replace std::fs::write with OpenOptions + O_NOFOLLOW + 0o600 (Unix) /
  share_mode(0) + create_new (Windows) so a pre-planted symlink at
  ~/.cache/all-smi/ or the final CSV path cannot redirect the write to
  an attacker-chosen location. Mirrors the hardening already in
  snapshot/record/doctor.
- Refuse to proceed when the cache dir itself is a symlink, closing the
  gap where create_dir_all no-ops on an existing symlinked directory.
- Extend csv_escape with an OWASP CSV-injection guard: fields whose
  first character is =, +, -, @, TAB, or CR are quoted and prefixed
  with a single quote so Excel / LibreOffice / Google Sheets do not
  evaluate them as formulas.
- Add regression tests for symlink refusal, 0o600 file mode, RFC-4180
  quoting, and leading-character injection vectors.

Process metric exporter (src/api/metrics/process.rs):
- Cap command, name, and user label values at 256 / 128 / 128 bytes
  with a UTF-8-safe truncation helper that appends
  "...(N bytes truncated)" so the truncation is visible to operators.
  Mitigates both scrape-response amplification (long argv × 3 metric
  families × N processes × M hosts) and the privacy risk of API
  tokens / DB URLs on command lines being broadcast to every
  Prometheus consumer.

Remote metrics parser (src/network/metrics_parser.rs):
- Apply matching 256 / 128-byte caps to the command / name / user /
  gpu_uuid fields inside process_process_metrics so a malicious remote
  host that bypasses our exporter cannot inflate the per-host process
  accumulator to ~150 MB (1024-byte generic cap × 50 000-row cap × 3
  families). Truncation honours UTF-8 boundaries.

Users aggregator (src/ui/aggregation/user.rs):
- Collapse the double-HashMap-lookup pattern in aggregate_users and
  UserScratch::absorb into a single entry(...).or_insert(0) + &mut
  accumulation, eliminating ~5 000 redundant hashes + string clones
  per scrape tick on a 100-node cluster.
- Pre-size total_vram_by_gpu from the combined GPU count across all
  snapshots to remove rehash allocations on the hot path.
- Add an ignored adversarial stress test (10 hosts × 50 000 rows =
  500 000 rows) to catch future quadratic regressions; release-mode
  completion measured at ~0.5 s.
@inureyes

Copy link
Copy Markdown
Member Author

Security + performance review

Addressed 4 findings with fixes in commit c5dc1b7.

Findings and fixes

CRITICAL — CSV export path lacked symlink/TOCTOU hardening
src/view/event_handler.rs:export_users_csv used std::fs::write, which follows symlinks. A co-tenant on a multi-user host could plant ~/.cache/all-smi (or the final users-<ts>.csv file) as a symlink to any location the user can write. Snapshot (#185), record (#197), and doctor (#198) already use O_NOFOLLOW + 0o600 for exactly this reason; the new CSV export regressed that posture.

  • Fix: OpenOptions::new().create_new(true).custom_flags(O_NOFOLLOW).mode(0o600) on Unix, share_mode(0) + create_new(true) on Windows, plus a defense-in-depth check that refuses to proceed if the cache dir itself is a symlink (since create_dir_all no-ops on an existing symlinked directory).
  • Regression tests: open_export_secure_refuses_symlinks, open_export_secure_sets_owner_only_mode.

HIGH — CSV injection (Excel/LibreOffice formula execution)
csv_escape did RFC-4180 quoting but did not guard against the OWASP CSV-injection pattern. A username/command like =cmd|"/c calc"!A1, +SUM(A1), or @DDE(...) would be written verbatim and executed as a formula when an operator opens the CSV in Excel or LibreOffice Calc. User names and commands both flow in from untrusted remote hosts.

  • Fix: when the field's first character is =, +, -, @, TAB, or CR, csv_escape now wraps the value in quotes and prefixes a single quote (') so spreadsheets treat it as text.
  • Regression tests: 5 tests covering =, +, -, @, TAB, CR prefixes plus RFC-4180 behaviour on benign inputs.

HIGH — command Prometheus label not length-capped (privacy + amplification)
src/api/metrics/process.rs emitted command, user, and name labels at their full runtime length. Scrape-response amplification: a 4 KiB argv × 3 metric families × N processes × M hosts rapidly bloats the HTTP response. Privacy: command lines regularly contain API tokens and DB URLs which were broadcast to every Prometheus consumer.

  • Fix: cap command at 256 B, user at 128 B, name at 128 B with a UTF-8-safe truncation helper that appends "...(N bytes truncated)" so operators see that the string was clipped. The remote parser (src/network/metrics_parser.rs) mirrors these caps so a malicious remote host bypassing our exporter cannot inflate the per-host accumulator (50 000-row cap × 1024 B × 3 families ≈ 150 MB was reduced to ~45 MB).
  • Regression tests: 4 tests covering pass-through, truncation marker, UTF-8 boundary safety, and end-to-end exporter output.

MEDIUM — Aggregation hot-path double-lookups + redundant clones
aggregate_users Pass 1 and UserScratch::absorb both used the pattern *map.entry(k).or_insert(0) = map.get(&k).copied().unwrap_or(0).saturating_add(...) which does two hashmap lookups and two host.clone() calls per row. For the spec-max cluster (100 hosts × 50 procs = 5 000 rows) that is ~10 000 redundant operations per scrape tick on the single-threaded UI path.

  • Fix: collapse to a single let entry = map.entry(k).or_insert(0); *entry = entry.saturating_add(...); and pre-size total_vram_by_gpu from the combined GPU count.
  • Regression tests: existing aggregation_handles_large_cluster_quickly covers correctness; added an #[ignore]d adversarial stress test aggregation_survives_adversarial_50k_per_host (10 × 50 000 rows = 500 000 rows) for future quadratic-regression detection. Release-mode timing: ~0.5 s for 500 000 rows.

Confirmed non-findings

  • ALL_SMI_MOCK_PROCESSES env gate: used only in src/mock/templates/nvidia.rs (invoked from the mock-server template path). No prod exposure.
  • 50 000-row cap is per-host (not per-tick): confirmed. MetricsParser::parse_metrics is called per host, and MAX_TEXT_SIZE = 10 MB per host bounds the cluster-wide memory regardless of cap.
  • MetricBuilder label escaping: unchanged. Already strips control chars + escapes \\, ", \n, \r.

Test results

  • cargo test --lib: 763 passed, 1 ignored
  • cargo test --bin all-smi: 875 passed, 1 ignored
  • cargo test --test users_tab_integration_test: 3 passed
  • cargo clippy --all-targets -- -D warnings: clean
  • cargo fmt --all -- --check: clean
  • Adversarial stress (release): 500 000 rows aggregated in ~0.46 s

…artial chip

Addresses five pr-reviewer findings on the Users tab (issue #189):

- F1 (CRITICAL): A second `Enter` on the per-host table now renders the
  full process list for `(drill_user, drill_host)`, matching the
  issue-#189 spec. Before this, `drill_host` was set but nothing was
  drawn, leaving the operator stuck on the per-host view. ESC peels
  back one level at a time: first clears `drill_host` (return to
  per-host table), then clears `drill_user` (return to top table).

- F3: Split the single `data_version` counter into two. `data_version`
  now covers "any UI dirtiness" so the render loop wakes on sort /
  filter / drill key presses, while the new `collector_data_version`
  only advances when collectors push fresh data. The Users-tab
  aggregation cache keys on the collector counter, so typing a sort
  hotkey no longer re-groups the 5 000-row cluster on every keystroke.

- F4: Partial-coverage chip false-triggered on idle-but-configured
  hosts (connected, `--processes` enabled, zero processes running).
  `HostSnapshot` now carries an `is_connected` flag; a host counts as
  "reporting" when it has processes OR is connected. Only genuinely
  disconnected hosts drag the chip denominator down.

- F5: Replays of local recordings collapsed every GPU onto
  `gpu_index = 0` because local readers don't populate the `index`
  detail key. Aggregation now falls back to the GPU's positional
  order within its host-group when the detail key is absent.

- F6: Banner height above the Users-tab body is now computed
  dynamically from the number of visible chips (summary + optional
  partial chip + optional export toast + header), instead of the
  hardcoded approximation that sometimes ate a row from the body.

Tests added:
- `users_tab_enter_twice_drills_into_per_host_processes` (F1)
- `users_tab_sort_keypress_does_not_invalidate_aggregation_cache` (F3)
- `idle_connected_host_is_not_flagged_as_partial` (F4)
- `users_aggregation_assigns_positional_gpu_index_when_detail_missing` (F5)

All existing tests still pass: 764 lib + 879 bin + 3 integration.
Clippy + fmt clean.
Add three security notes to the README Security notes block documenting
the hardening that shipped with the Users tab:
- CSV export O_NOFOLLOW + mode 0600 symlink refusal (Unix) / share_mode(0) (Windows)
- Formula injection mitigation for user/command fields via RFC 4180 quoting + single-quote prefix
- Per-field label caps in the remote Prometheus parser (user: 128 B, command: 256 B, global: 100 labels / 1024 B each)

Add a Users tab entry at the top of the v0.21.0 (upcoming) changelog
line covering the V hotkey, all in-tab keys (u/m/p/n/t/Enter/ESC/e/f),
power approximation, CSV export path, partial-coverage chip, and the
security hardening.
@inureyes

Copy link
Copy Markdown
Member Author

PR Finalization Complete

Summary

Tests: verified passing — 764 lib + 879 bin + 3 integration; all in-scope unit tests (user aggregation, sort stability, power weighting, partial-coverage chip, drill-down invariants, CSV injection guard) already present.

Documentation: updated README.md with:

  • Three new items in the Security notes block (under Filtering & Alerts) covering:
    • Users tab CSV export (e) O_NOFOLLOW + mode 0600 symlink refusal on Unix / share_mode(0) on Windows
    • Formula injection mitigation in CSV cells (user/command fields via RFC 4180 quoting + single-quote prefix)
    • Per-field label caps in the remote Prometheus parser (user: 128 B, command: 256 B; global: 100 labels / 1024 B per key-value)
  • Users tab entry prepended to the v0.21.0 (upcoming) changelog line, covering V hotkey, all in-tab keys (u/m/p/n/t/Enter/ESC/e/f), power approximation, CSV export path, partial-coverage chip, and the security hardening

The Users tab section itself (V hotkey, in-tab keys, power approximation formula with * caveat, CSV export path ~/.cache/all-smi/users-\<timestamp\>.csv, partial-coverage chip) and the help overlay (src/ui/help.rs) listings for V and all in-tab keys were already complete and required no changes.

Lint/Format: cargo fmt clean, cargo clippy -D warnings clean.

Final HEAD: f42a18c

@inureyes inureyes added status:done Completed and removed status:review Under review labels Apr 20, 2026
@inureyes inureyes merged commit cede28e into main Apr 20, 2026
4 checks passed
@inureyes inureyes deleted the feat/189-users-tab branch April 20, 2026 16:30
@inureyes inureyes self-assigned this May 24, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

priority:medium Medium priority issue status:done Completed type:enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat(view): cluster-wide user/process aggregation tab ('V' key)

1 participant