Summary
Add a new "Users" tab to remote view mode that aggregates process information across all scraped hosts, grouping by user so operators can answer "who is using the cluster right now and how much?" at a glance. Drill-down to a selected user shows per-node, per-GPU, per-process detail.
Motivation
all-smi already supports --processes in api mode, which emits per-process metrics labeled with the owning user. But the remote view has no tab that consumes those metrics cluster-wide, so process data remains a local concept. For platform operators (Slurm, Backend.AI, Kubernetes GPU pools) the most valuable operational question is cluster-level: who is using what, where, for how long, and at what power cost? An aggregation tab turns per-node process data into a first-class operator signal.
Current state
- all-smi api --processes emits per-process Prometheus metrics with labels including pid, user, command, and the host tag.
- The remote view currently renders an "All" tab (summary across hosts) and per-host tabs, but no user-grouped view.
- src/network/metrics_parser.rs parses the GPU/CPU/memory metric families. Whether process metrics are already parsed into a remote-side structure is unverified; this must be checked and extended as needed.
- src/api/metrics/ emits the process metric family when --processes is passed.
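To make the parsing gap concrete, here is a minimal sketch of reading one labeled sample line in the Prometheus text format. It is not the existing parser in src/network/metrics_parser.rs: the Sample type and parse_sample function are hypothetical, the metric name used in the usage note is made up, and only the simplified name{labels} value shape is handled.

```rust
use std::collections::HashMap;

/// One parsed sample: metric name, label map, and value (hypothetical type).
#[derive(Debug, PartialEq)]
pub struct Sample {
    pub name: String,
    pub labels: HashMap<String, String>,
    pub value: f64,
}

/// Parses a single `name{k="v",...} value` line. Returns None for comments,
/// blank lines, and anything that does not match this simplified shape.
pub fn parse_sample(line: &str) -> Option<Sample> {
    let line = line.trim();
    if line.is_empty() || line.starts_with('#') {
        return None;
    }
    let open = line.find('{')?;
    let close = line.rfind('}')?;
    if close < open {
        return None;
    }
    let name = line[..open].trim().to_string();
    let value: f64 = line[close + 1..].split_whitespace().next()?.parse().ok()?;

    let mut labels = HashMap::new();
    let mut chars = line[open + 1..close].chars().peekable();
    loop {
        // Skip separators between label pairs.
        while matches!(chars.peek().copied(), Some(',') | Some(' ')) {
            chars.next();
        }
        if chars.peek().is_none() {
            break;
        }
        // Read the label name up to '='.
        let mut key = String::new();
        for c in chars.by_ref() {
            if c == '=' {
                break;
            }
            key.push(c);
        }
        if chars.next()? != '"' {
            return None;
        }
        // Read the quoted value, honoring \\, \" and \n escapes, so commas
        // inside a command line do not split the label.
        let mut val = String::new();
        loop {
            match chars.next()? {
                '\\' => match chars.next()? {
                    'n' => val.push('\n'),
                    c => val.push(c),
                },
                '"' => break,
                c => val.push(c),
            }
        }
        labels.insert(key.trim().to_string(), val);
    }
    Some(Sample { name, labels, value })
}
```

For example, a line such as all_smi_process_memory_used_bytes{host="dgx-01",pid="4242",user="inureyes"} 1024 (metric name assumed for illustration) would yield a Sample whose labels map contains host, pid, and user.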
Proposed design
Tab and keybindings
- New tab accessible via V (mnemonic: "Users").
- Within the tab:
  - u: sort by username (default)
  - m: sort by total GPU memory
  - p: sort by total power (derived)
  - n: sort by node count
  - t: sort by oldest process start / longest TIME+
  - Enter: drill down on the highlighted user
  - ESC: exit drill-down
  - e: export the current view to CSV at ~/.cache/all-smi/users-<timestamp>.csv
  - f: toggle a system-process filter (hide uid < 1000 / root by default)
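The sort keys above could map onto a small enum, sketched below. UserRow, SortKey, sort_key_for, and sort_rows are hypothetical names, and the choice to sort numeric columns in descending order (largest consumer first) is an assumption that the proposal does not state.

```rust
use std::cmp::Reverse;

/// One row of the top-level table (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub struct UserRow {
    pub user: String,
    pub nodes: usize,
    pub vram_bytes: u64,
    pub power_watts: f64,
    pub longest_secs: u64,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SortKey {
    Username,
    Memory,
    Power,
    Nodes,
    Longest,
}

/// Maps an in-tab key press to a sort key; other keys return None.
pub fn sort_key_for(c: char) -> Option<SortKey> {
    match c {
        'u' => Some(SortKey::Username),
        'm' => Some(SortKey::Memory),
        'p' => Some(SortKey::Power),
        'n' => Some(SortKey::Nodes),
        't' => Some(SortKey::Longest),
        _ => None,
    }
}

/// Sorts rows in place: username ascending, numeric columns descending.
pub fn sort_rows(rows: &mut [UserRow], key: SortKey) {
    match key {
        SortKey::Username => rows.sort_by(|a, b| a.user.cmp(&b.user)),
        SortKey::Memory => rows.sort_by_key(|r| Reverse(r.vram_bytes)),
        // total_cmp gives f64 a total order, so NaN cannot panic the sort.
        SortKey::Power => rows.sort_by(|a, b| b.power_watts.total_cmp(&a.power_watts)),
        SortKey::Nodes => rows.sort_by_key(|r| Reverse(r.nodes)),
        SortKey::Longest => rows.sort_by_key(|r| Reverse(r.longest_secs)),
    }
}
```

Because the standard slice sorts used here are stable, pressing a second key keeps the previous ordering among rows that tie on the new key.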
Top-level aggregation table
Columns (all visible only if width permits; collapse low-priority columns on narrow terminals):
    USER      NODES  GPUs  PROCS  VRAM     POWER*  LONGEST   CMD (top-1 by GPU mem)
    inureyes  3      12    18     384 GiB  2.3 kW  6d 03:12  python train.py --bs=...
    yeonji    1      4     1      48 GiB   0.9 kW  2:15:02   /opt/llm/infer -m ...
    root      5      0     7      0        -       -         containerd-shim
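POWER* is an approximation. On each GPU a user touches, the GPU's power draw is attributed in proportion to the user's share of the VRAM in use on that GPU, and the per-GPU amounts are summed: sum over GPUs of gpu.power_consumption * (user_vram_on_gpu / gpu_total_vram_used_across_all_users). Mark the column with * in the header, explain it in the tooltip, and document the methodology.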
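A minimal sketch of the POWER* attribution follows. GpuUsage and approx_power_by_user are hypothetical names, not existing all-smi types. GPUs with no attributed VRAM are skipped to avoid dividing by zero, which is an assumption about how idle GPUs should be treated.

```rust
use std::collections::HashMap;

/// Per-GPU input: board power draw and the VRAM each user holds on that GPU.
pub struct GpuUsage {
    pub power_watts: f64,
    pub vram_by_user: HashMap<String, u64>,
}

/// Splits each GPU's power across users in proportion to their share of the
/// VRAM in use on that GPU, then sums the shares per user across GPUs.
pub fn approx_power_by_user(gpus: &[GpuUsage]) -> HashMap<String, f64> {
    let mut out: HashMap<String, f64> = HashMap::new();
    for gpu in gpus {
        let total: u64 = gpu.vram_by_user.values().sum();
        if total == 0 {
            // No process VRAM on this GPU: nothing to attribute.
            continue;
        }
        for (user, &bytes) in &gpu.vram_by_user {
            let share = bytes as f64 / total as f64;
            // Clamp at zero so a bad power reading never yields a negative value.
            let watts = (gpu.power_watts * share).max(0.0);
            *out.entry(user.clone()).or_insert(0.0) += watts;
        }
    }
    out
}
```

With one 300 W GPU where one user holds 30 units of VRAM and another holds 10, the split is 225 W and 75 W. The approximation ignores compute utilization, so a memory-heavy but idle process is charged the same as a busy one.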
LONGEST uses TIME+ fields already tracked per process (wall clock since start).
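The LONGEST values in the example rows ("6d 03:12" and "2:15:02") suggest two display forms, one with days and one without. The sketch below reproduces those two forms from a duration in seconds; format_longest is a hypothetical helper, and the exact format rules are inferred from the sample rows rather than specified.

```rust
/// Formats a wall-clock duration for the LONGEST column.
/// One day or more: "<d>d HH:MM". Under a day: "H:MM:SS".
pub fn format_longest(secs: u64) -> String {
    let days = secs / 86_400;
    let hours = (secs % 86_400) / 3_600;
    let minutes = (secs % 3_600) / 60;
    let seconds = secs % 60;
    if days > 0 {
        format!("{}d {:02}:{:02}", days, hours, minutes)
    } else {
        format!("{}:{:02}:{:02}", hours, minutes, seconds)
    }
}
```

Dropping seconds once a process is older than a day keeps the column narrow, which matters given the note about collapsing columns on narrow terminals.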
Drill-down view
On Enter, show per-node rows for that user:
inureyes / 3 nodes, 12 GPUs
─ dgx-01 GPU 0-3 VRAM 128 GiB Power 760 W 4 PIDs python train.py --bs=128 ...
─ dgx-02 GPU 4-7 VRAM 128 GiB Power 780 W 4 PIDs python train.py --bs=128 ...
─ dgx-03 GPU 0-3 VRAM 128 GiB Power 760 W 4 PIDs python eval.py --split=val
Enter again drills to the full process list for that user on the selected node (reuse the existing process renderer).
Partial coverage indicator
If some hosts don't have --processes enabled in their API, show a chip such as "⚠ partial coverage: 3 of 5 nodes reporting process data" at the top of the tab so operators don't misread the numbers.
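The chip text can be derived from two counts, as in this sketch. coverage_chip is a hypothetical helper; how a host is detected as "reporting process data" is left open here.

```rust
/// Returns the partial-coverage chip text, or None when every scraped host
/// reports process data (or when there are no hosts at all).
pub fn coverage_chip(reporting: usize, total: usize) -> Option<String> {
    if total == 0 || reporting >= total {
        return None;
    }
    Some(format!(
        "⚠ partial coverage: {} of {} nodes reporting process data",
        reporting, total
    ))
}
```

Returning None for full coverage lets the renderer skip the chip row entirely instead of drawing an empty line.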
Implementation plan
Files to add / modify:
- src/network/metrics_parser.rs: confirm process-family parsing; extend it to build a Vec<ParsedProcessRow> with host, pid, user, command, gpu_index, gpu_memory_bytes, cpu_pct, start_time.
- src/api/metrics/ (or equivalent submodule): verify that the per-process metric family includes the labels host, user, pid, gpu_index, and add any that are missing. Add an all_smi_process_start_time_seconds gauge for TIME+ derivation (a start timestamp is a fixed value, not a monotonically increasing count, so in Prometheus terms it is a gauge rather than a counter).
- New src/ui/aggregation/user.rs:
  - aggregate_users(&[HostSnapshot]) -> Vec<UserAggregate>, a pure function (heavily unit-testable).
  - Handles (host, pid) identity keying so the same PID on different hosts isn't collapsed.
  - Computes the power approximation with the documented formula.
- src/ui/tabs.rs: add the Users variant; integrate with tab cycling and the V hotkey.
- New src/ui/renderers/user_renderer.rs: table and drill-down rendering, using the existing widgets for styling.
- src/app_state.rs: add users_tab_state { sort, selected_user, drill_host } and the CSV export path.
- src/view/render_snapshot.rs: include the aggregated user view so it flows through the snapshot/replay pipeline and replays also include the Users tab.
- src/mock/generator.rs: generate synthetic process entries per mock node so the mock server exercises this tab. Either gate this behind ALL_SMI_MOCK_PROCESSES=1 (default on) or, since --processes is already a flag, always emit them in mock mode; follow the existing convention for flagged mock data if there is one.
- src/ui/help.rs: add the V shortcut and the in-tab keys.
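The aggregation function named in the plan could look roughly like the sketch below. Only the names aggregate_users, HostSnapshot, and UserAggregate come from the plan; every field, the ProcessRow type, and the one-row-per-(process, GPU) input shape are assumptions made for illustration. Power and LONGEST are left out to keep the sketch short.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// One process entry as seen on a host (assumed shape; one row per GPU used).
pub struct ProcessRow {
    pub pid: u32,
    pub user: String,
    pub gpu_index: Option<u32>, // None for processes that hold no GPU
    pub gpu_memory_bytes: u64,
}

/// Per-host input to the aggregation (assumed shape).
pub struct HostSnapshot {
    pub host: String,
    pub processes: Vec<ProcessRow>,
}

#[derive(Debug, PartialEq, Eq)]
pub struct UserAggregate {
    pub user: String,
    pub nodes: usize,
    pub gpus: usize,
    pub procs: usize,
    pub vram_bytes: u64,
}

/// Per-user accumulator borrowing names from the input snapshots.
#[derive(Default)]
struct Acc<'a> {
    nodes: BTreeSet<&'a str>,
    gpus: BTreeSet<(&'a str, u32)>,
    procs: BTreeSet<(&'a str, u32)>,
    vram_bytes: u64,
}

/// Single pass over all hosts. Processes and GPUs are keyed by (host, id),
/// so PID 100 on dgx-01 and PID 100 on dgx-02 count as two processes.
pub fn aggregate_users(hosts: &[HostSnapshot]) -> Vec<UserAggregate> {
    let mut by_user: BTreeMap<&str, Acc<'_>> = BTreeMap::new();
    for host in hosts {
        let name = host.host.as_str();
        for p in &host.processes {
            let acc = by_user.entry(p.user.as_str()).or_default();
            acc.nodes.insert(name);
            if let Some(idx) = p.gpu_index {
                acc.gpus.insert((name, idx));
            }
            acc.procs.insert((name, p.pid));
            acc.vram_bytes += p.gpu_memory_bytes;
        }
    }
    // BTreeMap iteration yields users in ascending name order, which matches
    // the default "sort by username".
    by_user
        .into_iter()
        .map(|(user, acc)| UserAggregate {
            user: user.to_string(),
            nodes: acc.nodes.len(),
            gpus: acc.gpus.len(),
            procs: acc.procs.len(),
            vram_bytes: acc.vram_bytes,
        })
        .collect()
}
```

Keeping the function free of I/O and UI state is what makes the unit tests listed in the acceptance criteria cheap to write: each case is just a literal slice of snapshots and an expected vector.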
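Acceptance criteria
- all-smi view --hostfile hosts.csv, with hosts running in API mode with --processes, shows a Users tab.
- Sort keys (u, m, p, n, t) re-rank the table correctly.
- Enter drills down; ESC returns.
- Unit tests for aggregate_users covering: empty input, same PID on two hosts, root filter, user with processes on multiple GPUs across multiple hosts, oldest TIME+ computation.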
Edge cases & non-goals
- Very large clusters (100+ nodes, 50+ processes each): aggregation must complete within one render budget (< 50 ms). Use a single pass; do not re-group on every keypress — cache the aggregation keyed on snapshot version.
- Power approximation: document the formula; ensure it never produces negative values (cap at zero if numerical issues arise). Mark as "approximation" in the UI.
- Username may be missing on some hosts (Windows API mode). Render it as ? and include such rows in a separate "unattributed" group.
- Non-goal: querying external user directories (LDAP); user is whatever the API reports.
- Non-goal: alerts on user-level usage — follow-up.
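The caching note in the large-cluster edge case above (recompute only when the snapshot changes, not on every keypress) can be sketched as a one-slot cache. AggregationCache is a hypothetical type, and representing the snapshot version as a u64 counter is an assumption.

```rust
/// One-slot cache keyed on a snapshot version number.
pub struct AggregationCache<T> {
    cached: Option<(u64, T)>,
}

impl<T> AggregationCache<T> {
    pub fn new() -> Self {
        Self { cached: None }
    }

    /// Returns the cached value if it was computed for `version`; otherwise
    /// runs `compute` once and stores the result under the new version.
    pub fn get_or_compute(&mut self, version: u64, compute: impl FnOnce() -> T) -> &T {
        let stale = !matches!(&self.cached, Some((v, _)) if *v == version);
        if stale {
            self.cached = Some((version, compute()));
        }
        &self.cached.as_ref().expect("populated above").1
    }
}
```

Sorting can then operate on a copy or an index over the cached aggregate, so a key press costs a sort of the user rows rather than a pass over every process in the cluster.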