feat(tui): topology view tab ('T') — NvLink/NUMA/PCIe graph and matrix by inureyes · Pull Request #200 · lablup/all-smi

inureyes · 2026-04-20T16:57:29Z

Summary

Adds a dedicated Topology tab (T) that visualises intra-node GPU
interconnect structure — NvLink connections (GPU↔GPU, GPU↔NvSwitch),
NUMA affinity, and PCIe lanes. Includes both a graph-style ASCII layout
and a matrix fallback equivalent to nvidia-smi topo -m. Works on systems
without NvLink by dropping to a PCIe-only rendering.

Implementation

New module: src/ui/topology/ — pure-logic core split across
five files, each under the 500-line soft limit:
- mod.rs — TopologyModel + TopologyViewMode assembled from
  GpuInfo.nvlink_remote_devices + numa_node_id + detail.
- classify_edge.rs — NVn bandwidth-hint classifier with a
  dominant-generation picker; falls back to generic NV when
  bandwidth is unknown.
- layout.rs — NUMA-aware grid layout; picks horizontal vs
  vertical box stacking based on terminal width.
- graph_render.rs — ASCII NUMA boxes with GPUs and edges.
- matrix_render.rs — nvidia-smi topo -m table, fits the column
  widths to the GPU count.
Orchestrator: src/ui/renderers/topology_renderer.rs — draws
the panel for the selected host; falls back to matrix on terminals
narrower than 100 columns so the content never overflows on
80-column sessions.
Event routing: T jumps to the tab; M toggles graph/matrix
while the tab is active. Mode precedence ladder updated:
filter-edit > replay-timecode > Users-tab keys > Topology-tab keys >
global > replay.
Data model: extends NvLinkRemoteDevice with
bandwidth_mb_s: Option<u32> so the NVn generation classifier can
derive labels like NV5 from the hint. Existing construction sites
(NVIDIA reader, mock templates, test fixtures) updated.
Prometheus round-trip: adds bandwidth_mb_s as an optional
label on all_smi_nvlink_remote_device_type. Parser accepts the
label when present, rejects absurd upstream values, and remains
backward-compatible with pre-#190 exporters that omit it.
Mock template: ALL_SMI_MOCK_TOPOLOGY=1 emits a DGX-style
8-GPU, 2-NUMA, 64-link (7 GPU + 1 switch per GPU) topology so the
tab can be exercised without real hardware.
Help overlay + README updated to document T, M, and
graceful-degradation behaviour.
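
As a rough illustration of the classify_edge.rs behaviour described above, the sketch below maps a per-link bandwidth hint to an NVn label, picks the dominant generation across a GPU pair's links, and falls back to a generic NV label when no hint is available. The threshold constants, the tie-breaking rule, and the helper names dominant_generation and edge_label are assumptions made for this example, not the PR's actual code. The only behaviour taken from the PR is that Gen 2/3/4 collapse into one bucket.

```rust
// Hypothetical per-link thresholds in MB/s; the PR does not state the real values.
const GEN1_CEILING_MB_S: u32 = 22_000; // assumed: anything up to ~20 GB/s is Gen 1
const GEN4_CEILING_MB_S: u32 = 30_000; // assumed: the ~25 GB/s class (Gen 2/3/4)

/// Map a bandwidth hint to a generation bucket.
/// Gen 2/3/4 share the ~25 GB/s ceiling, so they all land in `Some(4)`;
/// `Some(2)` and `Some(3)` are never returned.
pub fn bandwidth_to_generation(bandwidth_mb_s: Option<u32>) -> Option<u8> {
    let bw = bandwidth_mb_s?;
    if bw == 0 {
        return None; // assumed: a zero hint is treated as "unknown"
    }
    Some(if bw <= GEN1_CEILING_MB_S {
        1
    } else if bw <= GEN4_CEILING_MB_S {
        4
    } else {
        5
    })
}

/// Pick the most common generation among the links between two endpoints.
/// Ties go to the higher generation (an assumption for this sketch).
pub fn dominant_generation(hints: &[Option<u32>]) -> Option<u8> {
    let mut counts = [0usize; 6];
    for g in hints.iter().filter_map(|h| bandwidth_to_generation(*h)) {
        counts[g as usize] += 1;
    }
    (1..=5u8)
        .filter(|g| counts[*g as usize] > 0)
        .max_by_key(|g| (counts[*g as usize], *g))
}

/// Render the edge label: "NV<n>" when a generation is known, generic "NV" otherwise.
pub fn edge_label(hints: &[Option<u32>]) -> String {
    match dominant_generation(hints) {
        Some(g) => format!("NV{g}"),
        None => "NV".to_string(),
    }
}
```

With these assumed thresholds, a pair whose links mostly report about 50 GB/s would be labelled NV5, while a pair with no hints at all keeps the generic NV label.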

Graceful degradation

No NvLink present → graph shows NUMA boxes with PCIe-only GPUs.
Non-NVIDIA hosts → NUMA + PCIe groupings only; no SYS/NVn
vocabulary.
nvlink_remote_devices empty → dim "no active NvLinks" placeholder.
No NUMA topology → single synthetic NUMA ? box.
Terminals < 100 columns → automatic matrix fallback with a hint.
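
Two of the degradation paths above lend themselves to a short sketch: the width-based matrix fallback and the synthetic NUMA box. The function names effective_mode and numa_box_label, the hint text, and the use of Option<i32> for the NUMA id are assumptions for illustration. Only the 100-column threshold, the "NUMA ?" label, and the TopologyViewMode name come from the PR.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TopologyViewMode {
    Graph,
    Matrix,
}

/// Graph rendering needs at least this many columns (threshold from the PR).
const GRAPH_MIN_COLS: u16 = 100;

/// Resolve the mode actually rendered, plus an optional hint for the user.
/// A requested graph silently becomes a matrix on narrow terminals.
pub fn effective_mode(
    requested: TopologyViewMode,
    cols: u16,
) -> (TopologyViewMode, Option<&'static str>) {
    match requested {
        TopologyViewMode::Graph if cols < GRAPH_MIN_COLS => (
            TopologyViewMode::Matrix,
            Some("terminal narrower than 100 columns: matrix fallback"),
        ),
        other => (other, None),
    }
}

/// Label for a NUMA box; hosts without NUMA data get one synthetic "NUMA ?" box.
pub fn numa_box_label(numa_node_id: Option<i32>) -> String {
    match numa_node_id {
        Some(id) => format!("NUMA {id}"),
        None => "NUMA ?".to_string(),
    }
}
```

Keeping the fallback decision in a pure function like this makes it easy to unit-test without a real terminal.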

Testing

Unit tests (library + binaries): layout with 2/4/8 GPUs on 1 and 2
NUMAs; edge classification covering full mesh / switch mesh / no
NvLink; matrix formatting (header + legend + cell sizing); graph
rendering (horizontal and vertical stacking); view mode toggle;
NvLink bandwidth round-trip + backward-compat with old exporters.
Integration: hardware_details_integration_test confirms the
Prometheus exporter + network parser handle the new label without
breaking older scrapes.
Mock template unit tests (bypass env-var to avoid test-thread
races): NUMA split, 64-link count, instance labelling, empty-input
no-op.
cargo test --lib (817 pass), cargo test --bin all-smi (932
pass), cargo test --bin all-smi-mock-server --features=mock (52
pass), cargo clippy --all-targets -- -D warnings, and
cargo fmt --all -- --check all succeed.

Test plan

Verify the Topology tab is reachable via T in remote mode
against a real or mocked cluster.
Confirm M toggles between graph and matrix modes without
data loss.
On a terminal resized below 100 columns, confirm the graph
mode shows the "matrix fallback" hint and switches to matrix
rendering.
With ALL_SMI_MOCK_TOPOLOGY=1, confirm the mock server emits
the DGX-style topology and the TUI renders it correctly.
Scrape a pre-#190 exporter (no bandwidth_mb_s label) from
the new TUI and confirm NvLink rows still appear in the matrix.
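
The last check relies on the parser treating bandwidth_mb_s as optional. A minimal sketch of that idea follows, assuming the labels of a metric line have already been split into a map. The function name parse_bandwidth_label and the plausibility cap are hypothetical; the PR only says that absurd upstream values are rejected.

```rust
use std::collections::HashMap;

/// Hypothetical upper bound for a believable per-link bandwidth hint, in MB/s.
const MAX_PLAUSIBLE_MB_S: u32 = 1_000_000;

/// Read the optional `bandwidth_mb_s` label.
/// Returns `None` when the label is absent (older exporters), unparsable,
/// zero, or above the plausibility cap, so the NvLink row itself still parses.
pub fn parse_bandwidth_label(labels: &HashMap<String, String>) -> Option<u32> {
    let raw = labels.get("bandwidth_mb_s")?;
    let value: u32 = raw.trim().parse().ok()?;
    (value > 0 && value <= MAX_PLAUSIBLE_MB_S).then_some(value)
}
```

Because every failure mode collapses to None, a scrape from an older exporter yields the same link with an unknown bandwidth rather than a parse error.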

Closes #190

…tion

Implements the per-host Topology tab requested in issue #190. Adds a new reserved tab that ships with remote and replay modes, accessible via 'T' and toggled between graph and matrix modes with 'M'.

Graph mode renders NUMA zones as ASCII boxes with GPUs inside and NvLink/NvSwitch edges between them; NUMA boxes stack side-by-side on wide terminals and fall back to vertical stacking on narrower ones. Matrix mode mirrors `nvidia-smi topo -m` with CPU affinity + NUMA columns.

Graceful-degradation paths cover hosts without NvLink (PCIe only), non-NVIDIA hosts (NUMA groups only), hosts without NUMA (single synthetic "NUMA ?" box), and terminals narrower than 100 columns (automatic matrix fallback so nothing overflows 80-col sessions).

Extends `NvLinkRemoteDevice` with `bandwidth_mb_s: Option<u32>` so the NVn generation classifier can derive labels like "NV5" from the hint. The new label serialises through the Prometheus exporter and round-trips via the network parser with backward compatibility: pre-#190 exporters that omit `bandwidth_mb_s` continue to parse cleanly.

Mock clusters can exercise the tab via `ALL_SMI_MOCK_TOPOLOGY=1`, which emits a DGX-style 8-GPU, 2-NUMA, 64-link topology from every synthetic NVIDIA node.

Closes #190

C1: Topology tab now tracks the operator's host selection. Previously pressing `T` unconditionally overwrote `state.current_tab`, so the renderer always fell through to the first host-shaped tab. Stash the previously-selected host tab in `topology_last_host_tab` (both on `T` and on Left/Right arrow navigation), propagate it through the render snapshot, and have `topology_target_host` honour it when still present in the tab strip. Remote and replay tab updaters clear the cached name when the stashed host disappears (disconnect, switched recording) so the renderer falls back to the first host instead of displaying stale data.

H1: README claimed `Tab`/`Shift-Tab` cycles Topology hosts, but only the arrow keys are wired up. Correct the wording and note that the tab now remembers the last-selected host.

M1: Hide the matrix `CPU Affinity` column until NVML `nvmlDeviceGetCpuAffinity` plumbing lands; shipping a column that always says `-` is just noise. Drop the dead `cpu_affinity` helper and shrink the `pick_cell_width` overhead accordingly (27 → 13 cells) so narrower terminals can now render the matrix.

M2: Clarify `bandwidth_to_generation` doc. NvLink Gen 2/3/4 all share the ~25 GB/s per-link ceiling and are collapsed into the `Some(4)` bucket; `Some(2)` and `Some(3)` are never returned by design.

M3: Simplify `pick_cell_width` tail. The trailing `if MIN_CELL * gpu_count <= usable` branch is dead because the preceding `for cw in (MIN_CELL..=MAX_CELL).rev()` already covers the MIN_CELL case. Return 0 directly after the loop.

Tests: add three `topology_target_host_*` cases covering the remembered host, the empty fallback, and the stale-host fallback paths. Refit the `falls_back_to_summary_under_80_col` and matrix legend tests to the new overhead / column layout.
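
To make the M1 and M3 notes concrete, here is a sketch of what a simplified `pick_cell_width` could look like. The 13-cell overhead and the reversed `MIN_CELL..=MAX_CELL` loop come from the commit message. The concrete values of `MIN_CELL` and `MAX_CELL`, the function signature, and the reading of a 0 return as "the matrix does not fit" are assumptions.

```rust
const MIN_CELL: usize = 4; // assumed value, not stated in the PR
const MAX_CELL: usize = 8; // assumed value, not stated in the PR
const OVERHEAD: usize = 13; // fixed columns outside the GPU cells (27 before M1)

/// Pick the widest cell that lets `gpu_count` columns fit in `term_cols`.
/// Returns 0 when even `MIN_CELL` does not fit.
pub fn pick_cell_width(term_cols: usize, gpu_count: usize) -> usize {
    if gpu_count == 0 {
        return 0;
    }
    let usable = term_cols.saturating_sub(OVERHEAD);
    // The range includes MIN_CELL, so no separate MIN_CELL check is needed
    // after the loop; that is the dead branch M3 removes.
    for cw in (MIN_CELL..=MAX_CELL).rev() {
        if cw * gpu_count <= usable {
            return cw;
        }
    }
    0
}
```

Under these assumed constants an 8-GPU matrix gets the widest cells on an 80-column terminal, the narrowest at 50 columns, and does not fit at 40.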

Add 9 tests covering the Topology tab hardening from issue #190:
- t_key_jumps_to_topology_tab_and_remembers_host: T hotkey jumps to the Topology tab and stashes the previously-selected host tab in topology_last_host_tab so the renderer honours the operator's host selection on return.
- t_key_is_noop_when_topology_tab_absent: silent no-op in local mode where the Topology tab is never inserted.
- remember_current_host_tab_skips_reserved_tabs: All / Users / Topology reserved tabs are never stashed.
- remember_current_host_tab_stashes_host_tab: host tabs are stashed correctly.
- m_key_toggles_topology_view_mode_when_topology_active: uppercase M cycles Graph → Matrix → Graph.
- lowercase_m_also_toggles_topology_view_mode: lowercase m accepted to reduce muscle-memory friction.
- m_key_does_not_toggle_topology_mode_outside_topology_tab: M outside the Topology tab hits the global GPU-sort binding, not the mode toggle.
- test_snapshot_capture_preserves_topology_state: topology_view_mode and topology_last_host_tab survive RenderSnapshot::capture.
- test_topology_state_roundtrips_through_as_app_state: both fields survive the full capture → as_app_state round-trip.
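
The key-handling behaviour these tests pin down can be sketched with a stripped-down state type. The `AppState` layout, the string tab names, and the `on_key` helper are simplifications invented for this example; the field names `topology_view_mode` and `topology_last_host_tab` and the `remember_current_host_tab` helper are the ones named in the commit message. A `false` return stands for "not handled here", so the key can fall through to the global bindings.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TopologyViewMode {
    Graph,
    Matrix,
}

/// Simplified stand-in for the real application state.
pub struct AppState {
    pub tabs: Vec<String>,
    pub current_tab: usize,
    pub topology_view_mode: TopologyViewMode,
    pub topology_last_host_tab: Option<String>,
}

/// Reserved tabs are never stashed as a host selection (names assumed).
const RESERVED: [&str; 3] = ["All", "Users", "Topology"];

impl AppState {
    /// Stash the current tab name if it is a host tab.
    pub fn remember_current_host_tab(&mut self) {
        if let Some(name) = self.tabs.get(self.current_tab) {
            if !RESERVED.iter().any(|r| *r == name.as_str()) {
                self.topology_last_host_tab = Some(name.clone());
            }
        }
    }

    /// Handle a key press; returns true when the Topology logic consumed it.
    pub fn on_key(&mut self, key: char) -> bool {
        let on_topology =
            self.tabs.get(self.current_tab).map(String::as_str) == Some("Topology");
        match key {
            // T jumps to the tab, remembering the host the operator was on.
            'T' => match self.tabs.iter().position(|t| t == "Topology") {
                Some(idx) => {
                    self.remember_current_host_tab();
                    self.current_tab = idx;
                    true
                }
                None => false, // local mode: the tab is never inserted
            },
            // M/m toggles the view only while the Topology tab is active.
            'M' | 'm' if on_topology => {
                self.topology_view_mode = match self.topology_view_mode {
                    TopologyViewMode::Graph => TopologyViewMode::Matrix,
                    TopologyViewMode::Matrix => TopologyViewMode::Graph,
                };
                true
            }
            _ => false,
        }
    }
}
```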

inureyes · 2026-04-20T17:23:31Z

PR finalization complete.

Tests: Added 9 tests covering topology tab hardening in src/view/event_handler.rs and src/view/render_snapshot.rs:

T key navigation (jumps to tab, remembers host, no-op in local mode)
M/m key toggle (cycles Graph/Matrix when Topology tab is active; does not fire outside it)
remember_current_host_tab (skips reserved tabs, stashes host tabs)
Snapshot round-trip for topology_view_mode and topology_last_host_tab

Documentation: README "Topology View" section already present with Graph/Matrix mode explanation, T/M keybindings, graceful degradation notes, and ALL_SMI_MOCK_TOPOLOGY=1 mock flag. Help overlay already lists T (line 190) and M (lines 233–239).

Lint/Format: cargo fmt and cargo clippy -- -D warnings both clean.

Final test counts: 817 lib + 944 bin (up from 935 bin). All passing.

inureyes added the type:enhancement, priority:medium, device:nvidia-gpu, and status:review labels Apr 20, 2026

inureyes added 2 commits April 21, 2026 02:16

inureyes added the status:done label and removed the status:review label Apr 20, 2026

inureyes merged commit 93d1e5e into main Apr 20, 2026
4 checks passed

inureyes deleted the feat/190-topology-tab branch April 20, 2026 17:31

inureyes self-assigned this May 24, 2026

inureyes merged 3 commits into main from feat/190-topology-tab