feat(tui): topology view tab ('T') — NvLink/NUMA/PCIe graph and matrix#200

Merged
inureyes merged 3 commits into main from feat/190-topology-tab
Apr 20, 2026

@inureyes

Member

Summary

Adds a dedicated Topology tab (T) that visualises intra-node GPU
interconnect structure — NvLink connections (GPU↔GPU, GPU↔NvSwitch),
NUMA affinity, and PCIe lanes. Includes both a graph-style ASCII layout
and a matrix fallback equivalent to nvidia-smi topo -m. Works on systems
without NvLink by dropping to a PCIe-only rendering.

Implementation

  • New module: src/ui/topology/, a pure-logic core split across
    five files, each under the 500-line soft limit:
    • mod.rs: TopologyModel + TopologyViewMode, assembled from
      GpuInfo.nvlink_remote_devices + numa_node_id + detail.
    • classify_edge.rs: NVn bandwidth-hint classifier with a
      dominant-generation picker; falls back to generic NV when
      bandwidth is unknown.
    • layout.rs: NUMA-aware grid layout; picks horizontal vs
      vertical box stacking based on terminal width.
    • graph_render.rs: ASCII NUMA boxes with GPUs and edges.
    • matrix_render.rs: nvidia-smi topo -m style table; fits the
      column widths to the GPU count.
  • Orchestrator: src/ui/renderers/topology_renderer.rs draws
    the panel for the selected host; falls back to matrix on terminals
    narrower than 100 columns so the content never overflows on
    80-column sessions.
  • Event routing: T jumps to the tab; M toggles graph/matrix
    while the tab is active. Mode precedence ladder updated:
    filter-edit > replay-timecode > Users-tab keys > Topology-tab keys >
    global > replay.
  • Data model: extends NvLinkRemoteDevice with
    bandwidth_mb_s: Option<u32> so the NVn generation classifier can
    derive labels like NV5 from the hint. Existing construction sites
    (NVIDIA reader, mock templates, test fixtures) updated.
  • Prometheus round-trip: adds bandwidth_mb_s as an optional
    label on all_smi_nvlink_remote_device_type. Parser accepts the
    label when present, rejects absurd upstream values, and remains
    backward-compatible with pre-#190 exporters that omit it.
  • Mock template: ALL_SMI_MOCK_TOPOLOGY=1 emits a DGX-style
    8-GPU, 2-NUMA, 64-link (7 GPU + 1 switch per GPU) topology so the
    tab can be exercised without real hardware.
  • Help overlay + README updated to document T, M, and
    graceful-degradation behaviour.
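
The backward-compatible handling of the optional bandwidth_mb_s label described above could look roughly like the following. This is a minimal sketch, not the parser from this PR: the function name `parse_bandwidth_label`, the pre-split label map, and the `MAX_PLAUSIBLE_BANDWIDTH_MB_S` cutoff are all assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical sanity cap for rejecting absurd upstream values.
/// The actual cutoff used by the parser is not stated in this PR.
const MAX_PLAUSIBLE_BANDWIDTH_MB_S: u32 = 1_000_000;

/// Reads the optional `bandwidth_mb_s` label from an already-split label map.
///
/// Returns `None` when the label is missing (older exporters), is not a valid
/// unsigned integer, or exceeds the sanity cap. In every such case the caller
/// can keep the NvLink row and simply leave the bandwidth hint unset.
fn parse_bandwidth_label(labels: &HashMap<String, String>) -> Option<u32> {
    labels
        .get("bandwidth_mb_s")
        .and_then(|raw| raw.trim().parse::<u32>().ok())
        .filter(|&value| value <= MAX_PLAUSIBLE_BANDWIDTH_MB_S)
}
```

Mapping every failure mode to `None` keeps the hint advisory: a scrape from an exporter that omits the label parses exactly as it did before, only without an NVn generation.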

Graceful degradation

  • No NvLink present → graph shows NUMA boxes with PCIe-only GPUs.
  • Non-NVIDIA hosts → NUMA + PCIe groupings only; no SYS/NVn
    vocabulary.
  • nvlink_remote_devices empty → dim "no active NvLinks" placeholder.
  • No NUMA topology → single synthetic NUMA ? box.
  • Terminals < 100 columns → automatic matrix fallback with a hint.
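
The width-based fallback in the last bullet can be sketched as a small pure function. This is an illustrative sketch only: `effective_view_mode` and `GRAPH_MIN_COLUMNS` are hypothetical names, and the `Graph`/`Matrix` variants of TopologyViewMode are inferred from the mode names used in this PR.

```rust
/// View modes named in this PR; the variant names are assumed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TopologyViewMode {
    Graph,
    Matrix,
}

/// Minimum terminal width for graph mode (100 columns, per the PR text).
const GRAPH_MIN_COLUMNS: u16 = 100;

/// Returns the mode that should actually be rendered, plus a flag telling
/// the renderer whether to show the "matrix fallback" hint.
fn effective_view_mode(
    requested: TopologyViewMode,
    terminal_cols: u16,
) -> (TopologyViewMode, bool) {
    match requested {
        // Graph was requested but the terminal is too narrow: fall back.
        TopologyViewMode::Graph if terminal_cols < GRAPH_MIN_COLUMNS => {
            (TopologyViewMode::Matrix, true)
        }
        // Either graph fits, or matrix was explicitly requested.
        other => (other, false),
    }
}
```

Keeping the requested mode separate from the effective one means widening the terminal again can restore graph mode without the operator pressing M.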

Testing

  • Unit tests (library + binaries): layout with 2/4/8 GPUs on 1 and 2
    NUMAs; edge classification covering full mesh / switch mesh / no
    NvLink; matrix formatting (header + legend + cell sizing); graph
    rendering (horizontal and vertical stacking); view mode toggle;
    NvLink bandwidth round-trip + backward-compat with old exporters.
  • Integration: hardware_details_integration_test confirms the
    Prometheus exporter + network parser handle the new label without
    breaking older scrapes.
  • Mock template unit tests (bypass env-var to avoid test-thread
    races): NUMA split, 64-link count, instance labelling, empty-input
    no-op.
  • cargo test --lib (817 pass), cargo test --bin all-smi (932
    pass), cargo test --bin all-smi-mock-server --features=mock (52
    pass), cargo clippy --all-targets -- -D warnings, and
    cargo fmt --all -- --check all succeed.

Test plan

  • Verify the Topology tab is reachable via T in remote mode
    against a real or mocked cluster.
  • Confirm M toggles between graph and matrix modes without
    data loss.
  • On a terminal resized below 100 columns, confirm the graph
    mode shows the "matrix fallback" hint and switches to matrix
    rendering.
  • With ALL_SMI_MOCK_TOPOLOGY=1, confirm the mock server emits
    the DGX-style topology and the TUI renders it correctly.
  • Scrape a pre-#190 exporter (no bandwidth_mb_s label) from
    the new TUI and confirm NvLink rows still appear in the matrix.

Closes #190

…tion

Implements the per-host Topology tab requested in issue #190. Adds a
new reserved tab that ships with remote and replay modes, accessible
via 'T' and toggled between graph and matrix modes with 'M'.

Graph mode renders NUMA zones as ASCII boxes with GPUs inside and
NvLink/NvSwitch edges between them; NUMA boxes stack side-by-side on
wide terminals and fall back to vertical stacking on narrower ones.
Matrix mode mirrors `nvidia-smi topo -m` with CPU affinity + NUMA
columns.

Graceful-degradation paths cover hosts without NvLink (PCIe only),
non-NVIDIA hosts (NUMA groups only), hosts without NUMA (single
synthetic "NUMA ?" box), and terminals narrower than 100 columns
(automatic matrix fallback so nothing overflows 80-col sessions).

Extends `NvLinkRemoteDevice` with `bandwidth_mb_s: Option<u32>` so the
NVn generation classifier can derive labels like "NV5" from the hint.
The new label serialises through the Prometheus exporter and round-
trips via the network parser with backward compatibility: pre-#190
exporters that omit `bandwidth_mb_s` continue to parse cleanly.

Mock clusters can exercise the tab via `ALL_SMI_MOCK_TOPOLOGY=1`,
which emits a DGX-style 8-GPU, 2-NUMA, 64-link topology from every
synthetic NVIDIA node.

Closes #190
@inureyes added the type:enhancement, priority:medium, device:nvidia-gpu, and status:review labels on Apr 20, 2026
C1: Topology tab now tracks the operator's host selection. Previously
pressing `T` unconditionally overwrote `state.current_tab`, so the
renderer always fell through to the first host-shaped tab. Stash the
previously-selected host tab in `topology_last_host_tab` (both on `T`
and on Left/Right arrow navigation), propagate it through the render
snapshot, and have `topology_target_host` honour it when still present
in the tab strip. Remote and replay tab updaters clear the cached name
when the stashed host disappears (disconnect, switched recording) so
the renderer falls back to the first host instead of displaying stale
data.
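
The host-selection rule in C1 can be sketched as follows. The signature is an assumption for illustration (the real `topology_target_host` reads from the render snapshot, whose shape is not shown here); only the behaviour, remembered host first and first host otherwise, is taken from the description above.

```rust
/// Picks the host the Topology panel should render.
///
/// `host_tabs` is assumed to hold only host-shaped tabs, in tab-strip order
/// (reserved tabs such as All / Users / Topology already filtered out).
/// `last_host_tab` is the name stashed when the operator pressed `T`.
fn topology_target_host<'a>(
    host_tabs: &'a [String],
    last_host_tab: Option<&str>,
) -> Option<&'a str> {
    last_host_tab
        // Honour the remembered host only if it is still in the tab strip.
        .and_then(|name| host_tabs.iter().find(|tab| tab.as_str() == name))
        // Otherwise (nothing stashed, or the host went away) use the first host.
        .or_else(|| host_tabs.first())
        .map(String::as_str)
}
```

With tabs `["node-a", "node-b"]`, a stashed `"node-b"` selects `node-b`, while a stale or missing stash selects `node-a`; an empty tab strip yields `None`.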

H1: README claimed `Tab`/`Shift-Tab` cycles Topology hosts, but only the
arrow keys are wired up. Correct the wording and note that the tab now
remembers the last-selected host.

M1: Hide the matrix `CPU Affinity` column until NVML
`nvmlDeviceGetCpuAffinity` plumbing lands — shipping a column that
always says `-` is just noise. Drop the dead `cpu_affinity` helper and
shrink the `pick_cell_width` overhead accordingly (27 → 13 cells) so
narrower terminals can now render the matrix.

M2: Clarify `bandwidth_to_generation` doc. NvLink Gen 2/3/4 all share
the ~25 GB/s per-link ceiling and are collapsed into the `Some(4)`
bucket; `Some(2)` and `Some(3)` are never returned by design.
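
A rough sketch of the bucketing that M2 documents is given here. The signature, the exact cutoffs, and the `edge_label` helper are assumptions for illustration, and the hint is assumed to be a per-link figure in MB/s; only the collapsing of Gen 2/3/4 into `Some(4)` and the generic NV fallback come from this PR.

```rust
// Illustrative cutoffs in MB/s. These are assumptions, not the values
// used by the real classifier.
const GEN1_CEILING_MB_S: u32 = 20_000;
const GEN4_CEILING_MB_S: u32 = 25_000;

/// Maps a per-link bandwidth hint to an NvLink generation bucket.
/// Gen 2/3/4 share the same per-link ceiling, so they all land in
/// `Some(4)`; `Some(2)` and `Some(3)` are never returned.
fn bandwidth_to_generation(bandwidth_mb_s: Option<u32>) -> Option<u8> {
    match bandwidth_mb_s? {
        0 => None, // treat a zero hint as unknown (assumption)
        bw if bw <= GEN1_CEILING_MB_S => Some(1),
        bw if bw <= GEN4_CEILING_MB_S => Some(4),
        _ => Some(5),
    }
}

/// Hypothetical helper: builds the edge label, falling back to the
/// generic "NV" when the generation cannot be derived.
fn edge_label(bandwidth_mb_s: Option<u32>) -> String {
    match bandwidth_to_generation(bandwidth_mb_s) {
        Some(generation) => format!("NV{}", generation),
        None => "NV".to_string(),
    }
}
```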

M3: Simplify `pick_cell_width` tail. The trailing
`if MIN_CELL * gpu_count <= usable` branch is dead because the
preceding `for cw in (MIN_CELL..=MAX_CELL).rev()` already covers the
MIN_CELL case. Return 0 directly after the loop.
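
The simplified shape described in M3 might look like the sketch below. The `MIN_CELL` and `MAX_CELL` values and the parameter list are assumptions; the 13-cell overhead is the figure quoted in M1.

```rust
const MIN_CELL: usize = 4; // assumed value
const MAX_CELL: usize = 8; // assumed value
/// Fixed per-row overhead in cells (13 after dropping the CPU Affinity column).
const ROW_OVERHEAD: usize = 13;

/// Returns the widest cell width that lets `gpu_count` columns fit in the
/// terminal, or 0 when even `MIN_CELL` does not fit.
fn pick_cell_width(terminal_cols: usize, gpu_count: usize) -> usize {
    let usable = terminal_cols.saturating_sub(ROW_OVERHEAD);
    // Try the widest cell first; the range is inclusive of MIN_CELL, so no
    // separate MIN_CELL check is needed after the loop.
    for cw in (MIN_CELL..=MAX_CELL).rev() {
        if cw * gpu_count <= usable {
            return cw;
        }
    }
    0
}
```

Under these assumed constants, 8 GPUs on a 50-column terminal leave 37 usable cells, so the loop settles on the minimum width of 4; at 40 columns nothing fits and the function returns 0.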

Tests: add three `topology_target_host_*` cases covering the remembered
host, the empty fallback, and the stale-host fallback paths. Refit the
`falls_back_to_summary_under_80_col` and matrix legend tests to the
new overhead / column layout.
Add 9 tests covering the Topology tab hardening from issue #190:

- t_key_jumps_to_topology_tab_and_remembers_host: T hotkey jumps to
  the Topology tab and stashes the previously-selected host tab in
  topology_last_host_tab so the renderer honours the operator's host
  selection on return.
- t_key_is_noop_when_topology_tab_absent: silent no-op in local mode
  where the Topology tab is never inserted.
- remember_current_host_tab_skips_reserved_tabs: All / Users / Topology
  reserved tabs are never stashed.
- remember_current_host_tab_stashes_host_tab: host tabs are stashed
  correctly.
- m_key_toggles_topology_view_mode_when_topology_active: uppercase M
  cycles Graph → Matrix → Graph.
- lowercase_m_also_toggles_topology_view_mode: lowercase m accepted
  to reduce muscle-memory friction.
- m_key_does_not_toggle_topology_mode_outside_topology_tab: M outside
  the Topology tab hits the global GPU-sort binding, not the mode
  toggle.
- test_snapshot_capture_preserves_topology_state: topology_view_mode
  and topology_last_host_tab survive RenderSnapshot::capture.
- test_topology_state_roundtrips_through_as_app_state: both fields
  survive the full capture → as_app_state round-trip.
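
The M/m routing that these tests pin down can be sketched as below. `ViewState`, `KeyOutcome`, and `handle_m_key` are hypothetical stand-ins for the real event-handler types, and lowercase m falling through outside the Topology tab is an assumption; only the behaviour listed in the test descriptions is taken from this PR.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TopologyViewMode {
    Graph,
    Matrix,
}

impl TopologyViewMode {
    /// Graph → Matrix → Graph.
    fn toggled(self) -> Self {
        match self {
            TopologyViewMode::Graph => TopologyViewMode::Matrix,
            TopologyViewMode::Matrix => TopologyViewMode::Graph,
        }
    }
}

/// Hypothetical result type describing where the key press ended up.
#[derive(Debug, PartialEq, Eq)]
enum KeyOutcome {
    ToggledTopologyMode,
    FallThrough, // left for the global bindings (e.g. GPU sort)
    Ignored,     // not an M/m key at all
}

/// Hypothetical slice of the view state relevant to this key.
struct ViewState {
    topology_tab_active: bool,
    topology_view_mode: TopologyViewMode,
}

fn handle_m_key(state: &mut ViewState, key: char) -> KeyOutcome {
    if !matches!(key, 'm' | 'M') {
        return KeyOutcome::Ignored;
    }
    if state.topology_tab_active {
        // Topology-tab keys take precedence over global bindings.
        state.topology_view_mode = state.topology_view_mode.toggled();
        KeyOutcome::ToggledTopologyMode
    } else {
        KeyOutcome::FallThrough
    }
}
```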
@inureyes

Member Author

PR finalization complete.

Tests: Added 9 tests covering topology tab hardening in src/view/event_handler.rs and src/view/render_snapshot.rs:

  • T key navigation (jumps to tab, remembers host, no-op in local mode)
  • M/m key toggle (cycles Graph/Matrix when Topology tab is active; does not fire outside it)
  • remember_current_host_tab (skips reserved tabs, stashes host tabs)
  • Snapshot round-trip for topology_view_mode and topology_last_host_tab

Documentation: README "Topology View" section already present with Graph/Matrix mode explanation, T/M keybindings, graceful degradation notes, and ALL_SMI_MOCK_TOPOLOGY=1 mock flag. Help overlay already lists T (line 190) and M (lines 233–239).

Lint/Format: cargo fmt and cargo clippy -- -D warnings both clean.

Final test counts: 817 lib + 944 bin (up from 935 bin). All passing.

@inureyes added the status:done label and removed the status:review label on Apr 20, 2026
@inureyes merged commit 93d1e5e into main on Apr 20, 2026
4 checks passed
@inureyes deleted the feat/190-topology-tab branch on April 20, 2026 17:31
@inureyes self-assigned this on May 24, 2026