feat(view): cluster-wide user/process aggregation tab ('V' key) #189

@inureyes

Summary

Add a new "Users" tab to remote view mode that aggregates process information across all scraped hosts, grouping by user so operators can answer "who is using the cluster right now and how much?" at a glance. Drill-down to a selected user shows per-node, per-GPU, per-process detail.

Motivation

all-smi already supports --processes in api mode, which emits per-process metrics with a username label. But the remote view has no tab that consumes those metrics cluster-wide: process data remains a local concept. For platform operators (Slurm, Backend.AI, Kubernetes GPU pools) the most valuable operational question is cluster-level: who is using what, where, for how long, at what power cost? An aggregation tab turns per-node process data into a first-class operator signal.

Current state

  • all-smi api --processes emits per-process Prometheus metrics with labels including pid, user, command, and the host tag.
  • The remote view currently renders an "All" tab (summary across hosts) and per-host tabs, but no user-grouped view.
  • src/network/metrics_parser.rs parses the GPU/CPU/memory metric families. Whether process metrics are already parsed into a remote-side structure is unconfirmed; this must be verified and the parser extended as needed.
  • src/api/metrics/ emits the process metric family when --processes is passed.

Proposed design

Tab and keybindings

  • New tab accessible via V (mnemonic: "Users").
  • Within the tab:
    • u sort by username (default)
    • m sort by total GPU memory
    • p sort by total power (derived)
    • n sort by node count
    • t sort by oldest process start / longest TIME+
    • Enter drill down on the highlighted user
    • ESC exits drill-down
    • e exports the current view to CSV at ~/.cache/all-smi/users-<timestamp>.csv
    • f toggles a system-process filter (hide uid < 1000 / root by default)
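
The sort hotkeys above could be modeled roughly as follows. This is a minimal sketch, not the project's code: `UserRow`, `SortKey`, `sort_key_for`, and `sort_rows` are hypothetical names, and the choice to sort numeric columns in descending order with a username tie-break is an assumption the issue does not specify.

```rust
use std::cmp::Ordering;

/// Hypothetical per-user summary row; field names are illustrative.
#[derive(Debug, Clone, PartialEq)]
pub struct UserRow {
    pub user: String,
    pub vram_bytes: u64,
    pub power_watts: f64,
    pub node_count: usize,
    pub longest_secs: u64,
}

/// One variant per in-tab sort hotkey.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SortKey {
    User,    // u
    Memory,  // m
    Power,   // p
    Nodes,   // n
    Longest, // t
}

/// Map a pressed key to a sort mode; returns None for unrelated keys.
pub fn sort_key_for(c: char) -> Option<SortKey> {
    match c {
        'u' => Some(SortKey::User),
        'm' => Some(SortKey::Memory),
        'p' => Some(SortKey::Power),
        'n' => Some(SortKey::Nodes),
        't' => Some(SortKey::Longest),
        _ => None,
    }
}

/// Sort rows in place. Username sorts ascending; numeric keys sort
/// descending (assumed) so the heaviest user lands on top. Ties fall
/// back to username so the order does not jitter between refreshes.
pub fn sort_rows(rows: &mut [UserRow], key: SortKey) {
    rows.sort_by(|a, b| {
        let primary = match key {
            SortKey::User => Ordering::Equal,
            SortKey::Memory => b.vram_bytes.cmp(&a.vram_bytes),
            SortKey::Power => b.power_watts.total_cmp(&a.power_watts),
            SortKey::Nodes => b.node_count.cmp(&a.node_count),
            SortKey::Longest => b.longest_secs.cmp(&a.longest_secs),
        };
        primary.then_with(|| a.user.cmp(&b.user))
    });
}
```

Using `f64::total_cmp` for the power column avoids a panic or an inconsistent order if a NaN ever slips into the derived value.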

Top-level aggregation table

Columns (all visible only if width permits; collapse low-priority columns on narrow terminals):

USER            NODES   GPUs   PROCS   VRAM        POWER*   LONGEST    CMD (top-1 by GPU mem)
inureyes        3       12     12      384 GiB     2.3 kW   6d 03:12   python train.py --bs=...
yeonji          1       4      1       48 GiB      0.9 kW   2:15:02    /opt/llm/infer -m ...
root            5       0      7       0           —        —          containerd-shim

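POWER* is an approximation. For each GPU a user touches, the GPU's power draw is multiplied by the user's share of that GPU's used VRAM (gpu.power_consumption * user_vram_on_gpu / gpu_total_vram_used_across_all_users), and the per-GPU results are summed. Mark the column with * and a header tooltip, and document the methodology.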
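
A minimal sketch of that apportioning, under stated assumptions: `GpuSample` and `approx_user_power` are hypothetical names, VRAM usage is assumed to be available per user per GPU, and a GPU with no attributed VRAM is assumed to contribute nothing (the issue does not say how idle power should be handled).

```rust
/// Hypothetical view of one GPU for the apportioning step.
pub struct GpuSample {
    /// Power draw in watts as reported for the whole GPU.
    pub power_watts: f64,
    /// VRAM bytes held by each user on this GPU: (user, bytes).
    pub vram_by_user: Vec<(String, u64)>,
}

/// Apportion each GPU's power by the user's share of attributed VRAM,
/// then sum over all GPUs.
pub fn approx_user_power(gpus: &[GpuSample], user: &str) -> f64 {
    let mut total = 0.0;
    for gpu in gpus {
        let used: u64 = gpu.vram_by_user.iter().map(|(_, bytes)| *bytes).sum();
        if used == 0 {
            // No attributed VRAM on this GPU: skip it rather than divide by zero.
            continue;
        }
        let mine: u64 = gpu
            .vram_by_user
            .iter()
            .filter(|(name, _)| name.as_str() == user)
            .map(|(_, bytes)| *bytes)
            .sum();
        let share = mine as f64 / used as f64;
        // Clamp a bogus negative reading so the result is never below zero.
        total += gpu.power_watts.max(0.0) * share;
    }
    total
}
```

For example, a user holding 30 of 40 attributed bytes on a 400 W GPU and all attributed VRAM on a 200 W GPU is charged 300 W + 200 W = 500 W.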

LONGEST uses TIME+ fields already tracked per process (wall clock since start).
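
One way the LONGEST cell could be produced from a process start timestamp is sketched below. The helper names are hypothetical, and the two display formats ("Nd HH:MM" once a day has elapsed, "H:MM:SS" otherwise) are inferred from the sample rows above rather than taken from an existing formatter.

```rust
/// Elapsed seconds since process start; saturates at zero if the
/// scraped host's clock is ahead of the viewer's.
pub fn elapsed_secs(now_unix: u64, start_unix: u64) -> u64 {
    now_unix.saturating_sub(start_unix)
}

/// Format a duration the way the sample table shows it (assumed):
/// "6d 03:12" for a day or more, "2:15:02" for shorter runs.
pub fn format_longest(secs: u64) -> String {
    let days = secs / 86_400;
    let hours = (secs % 86_400) / 3_600;
    let mins = (secs % 3_600) / 60;
    let s = secs % 60;
    if days > 0 {
        format!("{}d {:02}:{:02}", days, hours, mins)
    } else {
        format!("{}:{:02}:{:02}", hours, mins, s)
    }
}
```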

Drill-down view

On Enter, show per-node rows for that user:

inureyes / 3 nodes, 12 GPUs
─ dgx-01   GPU 0-3     VRAM 128 GiB   Power 760 W   4 PIDs    python train.py --bs=128 ...
─ dgx-02   GPU 4-7     VRAM 128 GiB   Power 780 W   4 PIDs    python train.py --bs=128 ...
─ dgx-03   GPU 0-3     VRAM 128 GiB   Power 760 W   4 PIDs    python eval.py --split=val

Enter again drills to the full process list for that user on the selected node (reuse the existing process renderer).

Partial coverage indicator

If some hosts don't have --processes enabled in their API, show a chip ⚠ partial coverage: 3 of 5 nodes reporting process data at the top of the tab so operators don't misread the numbers.
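
The coverage decision could look roughly like this. `Coverage`, `classify`, and `banner` are hypothetical names, and how a host is judged to "have process data" (here a plain boolean per host) is an assumption; the zero-host message follows the hint text in the acceptance criteria.

```rust
/// Coverage of process data across scraped hosts.
#[derive(Debug, PartialEq, Eq)]
pub enum Coverage {
    Full,
    Partial { reporting: usize, total: usize },
    NoData,
}

/// Classify coverage from one flag per host (true = host reported process metrics).
pub fn classify(has_process_data: &[bool]) -> Coverage {
    let total = has_process_data.len();
    let reporting = has_process_data.iter().filter(|&&b| b).count();
    if reporting == 0 {
        Coverage::NoData
    } else if reporting == total {
        Coverage::Full
    } else {
        Coverage::Partial { reporting, total }
    }
}

/// Text to show at the top of the tab, or None when coverage is complete.
pub fn banner(coverage: &Coverage) -> Option<String> {
    match coverage {
        Coverage::Full => None,
        Coverage::Partial { reporting, total } => Some(format!(
            "⚠ partial coverage: {} of {} nodes reporting process data",
            reporting, total
        )),
        Coverage::NoData => {
            Some("no process data; enable --processes on API mode".to_string())
        }
    }
}
```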

Implementation plan

Files to add / modify:

  • src/network/metrics_parser.rs — confirm process-family parsing; extend to build a Vec<ParsedProcessRow> with host, pid, user, command, gpu_index, gpu_memory_bytes, cpu_pct, start_time.
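  • src/api/metrics/ (or equivalent submodule): verify the per-process metric family includes the labels host, user, pid, gpu_index. Add any missing labels, and add an all_smi_process_start_time_seconds gauge for TIME+ derivation (a start timestamp is a gauge, not a counter, matching the Prometheus process_start_time_seconds convention).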
  • New src/ui/aggregation/user.rs:
    • aggregate_users(&[HostSnapshot]) -> Vec<UserAggregate> pure function (heavily unit-testable).
    • Handles (host, pid) identity keying so the same PID on different hosts isn't collapsed.
    • Computes the power approximation with documented formula.
  • src/ui/tabs.rs — add the Users variant; integrate with tab cycling and V hotkey.
  • New src/ui/renderers/user_renderer.rs — table + drill-down rendering using the existing widgets for styling.
  • src/app_state.rs — add users_tab_state { sort, selected_user, drill_host }, CSV export path.
  • src/view/render_snapshot.rs — include the aggregated user view so it flows through the snapshot/replay pipeline (so replays also include the Users tab).
  • src/mock/generator.rs: generate synthetic process entries per mock node so the mock server exercises this tab. Either gate this behind ALL_SMI_MOCK_PROCESSES=1 (default on) or, since --processes is already a flag, always emit them in mock mode; follow the existing convention for flagged mock data if there is one.
  • src/ui/help.rs — add the V shortcut and the in-tab keys.
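
The core grouping step of the aggregation could be sketched as below. This is not the proposed `aggregate_users(&[HostSnapshot])` signature: it works on a flat slice of a hypothetical `ProcRow`, assumes one row per (process, GPU) pair so a multi-GPU process appears several times, and leaves out power and TIME+. It does show the (host, pid) identity keying and the "?" bucket for missing usernames.

```rust
use std::collections::{BTreeMap, BTreeSet, HashSet};

/// Hypothetical parsed process row: one per (process, GPU) pair.
#[derive(Debug, Clone)]
pub struct ProcRow {
    pub host: String,
    pub pid: u32,
    pub user: Option<String>,
    pub gpu_index: Option<u32>,
    pub gpu_memory_bytes: u64,
}

/// Per-user totals for the top-level table (subset of columns).
#[derive(Debug, Default, PartialEq, Eq)]
pub struct UserTotals {
    pub nodes: usize,
    pub gpus: usize,
    pub procs: usize,
    pub vram_bytes: u64,
}

/// Single pass over all rows; a pure function of its input.
pub fn aggregate_rows(rows: &[ProcRow]) -> BTreeMap<String, UserTotals> {
    #[derive(Default)]
    struct Acc<'a> {
        hosts: BTreeSet<&'a str>,
        gpus: BTreeSet<(&'a str, u32)>,
        procs: HashSet<(&'a str, u32)>,
        vram: u64,
    }

    let mut accs: BTreeMap<&str, Acc<'_>> = BTreeMap::new();
    for r in rows {
        // Missing usernames are grouped under "?".
        let user = r.user.as_deref().unwrap_or("?");
        let acc = accs.entry(user).or_default();
        acc.hosts.insert(r.host.as_str());
        // Key on (host, pid) so the same PID on two hosts stays distinct.
        acc.procs.insert((r.host.as_str(), r.pid));
        if let Some(gpu) = r.gpu_index {
            acc.gpus.insert((r.host.as_str(), gpu));
        }
        acc.vram += r.gpu_memory_bytes;
    }

    accs.into_iter()
        .map(|(user, a)| {
            let totals = UserTotals {
                nodes: a.hosts.len(),
                gpus: a.gpus.len(),
                procs: a.procs.len(),
                vram_bytes: a.vram,
            };
            (user.to_string(), totals)
        })
        .collect()
}
```

Keeping the function free of I/O and UI state is what makes the unit tests listed in the acceptance criteria cheap to write.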

Acceptance criteria

  • all-smi view --hostfile hosts.csv shows a Users tab when the listed hosts run in API mode with --processes.
  • Sorting hotkeys (u, m, p, n, t) re-rank the table correctly.
  • Enter drills down; ESC returns.
  • CSV export produces a well-formed file (header + one row per user).
  • If zero hosts report process data, the tab shows a clear "no process data; enable --processes on API mode" hint — not an empty table.
  • Partial-coverage chip is visible when some hosts report and others don't.
  • Unit tests on aggregate_users covering: empty input, same PID on two hosts, root filter, user with processes on multiple GPUs across multiple hosts, oldest TIME+ computation.
  • Mock cluster with 5 simulated nodes × 3 simulated users renders correctly.
  • Works over replayed snapshots (see companion record/replay issue).
  • Documentation updated: README Users tab section; help overlay lists the new shortcuts.
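
For the CSV criterion, a minimal sketch of a well-formed export (header plus one row per user) follows. The column set and the names `ExportRow`, `csv_field`, and `to_csv` are hypothetical; quoting follows the usual RFC 4180 rule of wrapping fields that contain commas, quotes, or line breaks and doubling embedded quotes, which matters because command lines can contain commas.

```rust
/// Hypothetical export row; the column set is illustrative, not a fixed schema.
pub struct ExportRow {
    pub user: String,
    pub nodes: usize,
    pub gpus: usize,
    pub procs: usize,
    pub vram_bytes: u64,
    pub top_cmd: String,
}

/// Quote a field only when it needs it; embedded quotes are doubled.
pub fn csv_field(s: &str) -> String {
    if s.contains(|c: char| matches!(c, ',' | '"' | '\n' | '\r')) {
        format!("\"{}\"", s.replace('"', "\"\""))
    } else {
        s.to_string()
    }
}

/// Render a header line followed by one line per user.
pub fn to_csv(rows: &[ExportRow]) -> String {
    let mut out = String::from("user,nodes,gpus,procs,vram_bytes,top_cmd\n");
    for r in rows {
        out.push_str(&format!(
            "{},{},{},{},{},{}\n",
            csv_field(&r.user),
            r.nodes,
            r.gpus,
            r.procs,
            r.vram_bytes,
            csv_field(&r.top_cmd)
        ));
    }
    out
}
```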

Edge cases & non-goals

  • Very large clusters (100+ nodes, 50+ processes each): aggregation must complete within one render budget (< 50 ms). Use a single pass; do not re-group on every keypress — cache the aggregation keyed on snapshot version.
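  • Power approximation: document the formula; ensure it never produces negative or non-finite values (clamp to zero if numerical issues arise, and treat a GPU with no attributed VRAM as contributing zero rather than dividing by zero). Mark as "approximation" in the UI.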
  • Username may be missing on some hosts (Windows API mode). Render as ? and include such rows in a separate "unattributed" group.
  • Non-goal: querying external user directories (LDAP) — user is whatever the API reports.
  • Non-goal: alerts on user-level usage — follow-up.
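
The "cache the aggregation keyed on snapshot version" point could be sketched as a small memo wrapper. `AggCache` and `get_or_compute` are hypothetical names, and it assumes each snapshot carries a monotonically changing `u64` version, which the issue implies but does not define.

```rust
/// Holds the last aggregation result together with the snapshot version
/// it was computed from.
pub struct AggCache<T> {
    entry: Option<(u64, T)>,
}

impl<T> AggCache<T> {
    pub fn new() -> Self {
        Self { entry: None }
    }

    /// Return the cached value if it was built for `version`;
    /// otherwise run `compute` once and store the result.
    pub fn get_or_compute(&mut self, version: u64, compute: impl FnOnce() -> T) -> &T {
        let fresh = matches!(&self.entry, Some((v, _)) if *v == version);
        if !fresh {
            self.entry = Some((version, compute()));
        }
        &self.entry.as_ref().expect("entry was just populated").1
    }
}
```

With this shape, a sort keypress only re-orders the cached rows; the grouping pass reruns only when a new snapshot arrives.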
