
feat(tui): topology view tab ('T') — NvLink/NUMA/PCIe graph and matrix #190

Description

@inureyes

Summary

Add a dedicated topology tab (T) that visualizes intra-node GPU interconnect structure: NvLink connections (GPU↔GPU, GPU↔NvSwitch), NUMA affinity, and PCIe lanes. Includes a graph-style ASCII layout and an nvidia-smi topo -m-style matrix mode. Works on systems without NvLink by falling back to a PCIe-only rendering. This surfaces data that the project already collects but that currently has no home in the UI.

Motivation

Modern GPU work depends heavily on topology: collective communication (NCCL all-reduce) performance is dominated by NvLink availability, NUMA affinity between CPU and GPU changes throughput meaningfully, and operators frequently need to answer "are these 8 GPUs fully-meshed or switch-connected?" without a second tool. all-smi already reads nvlink_remote_devices, numa_node_id, and PCIe info per GPU (from the recent PR #175 and the hardware-details series), and the data round-trips through the Prometheus exporter too. But nothing renders this topology, so it stays invisible to TUI users.

Current state

  • GpuInfo.nvlink_remote_devices: Vec<NvLinkRemoteDevice> is populated by the NVIDIA reader.
  • GpuInfo.numa_node_id: Option<i32> is populated on Linux hosts with NUMA topology.
  • PCIe info lives in GpuInfo.detail map.
  • None of the above is rendered anywhere in the TUI — only visible via Prometheus metrics.

Proposed design

Tab and modes

  • New tab T (mnemonic: "Topology"). In remote mode, default to showing the currently selected host's topology (same host as the per-host tab); Tab/Shift-Tab cycles nodes. In local mode, always the local node.
  • Two render modes, toggled with M:
    • Graph mode (default): ASCII graph showing NUMA zones as boxes, GPUs inside, edges for NvLink and NvSwitch. Active links green, degraded yellow, inactive gray.
    • Matrix mode: nvidia-smi topo -m-equivalent table with CPU affinity and NUMA columns.
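
As a rough illustration of the two-mode toggle described above, the state could be modeled as a small enum. This is only a sketch: `TopologyMode` and `handle_key` are hypothetical names, not existing all-smi APIs, and the real key handling would live wherever the TUI already dispatches input.

```rust
/// Render modes for the topology tab (hypothetical type, for illustration).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum TopologyMode {
    /// ASCII graph; the default per the proposal.
    #[default]
    Graph,
    /// `nvidia-smi topo -m`-style table.
    Matrix,
}

impl TopologyMode {
    /// Return the other mode.
    pub fn toggled(self) -> Self {
        match self {
            TopologyMode::Graph => TopologyMode::Matrix,
            TopologyMode::Matrix => TopologyMode::Graph,
        }
    }
}

/// Hypothetical key handler: `M` flips the mode, any other key leaves it alone.
pub fn handle_key(mode: TopologyMode, key: char) -> TopologyMode {
    match key {
        'M' => mode.toggled(),
        _ => mode,
    }
}
```

Keeping the mode as a `Copy` value makes "M again returns to graph view" a property of `toggled` itself rather than of the event loop.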

Graph rendering example

NODE: dgx-01   |   8× H200 (SXM)   |   PCIe Gen5 x16
┌─────── NUMA 0 ───────────┐   ┌─────── NUMA 1 ───────────┐
│ [GPU 0] ── NV8 ── [GPU 1]│   │ [GPU 4] ── NV8 ── [GPU 5]│
│    │  \            /   │ │   │    │  \            /   │ │
│   NV4   nvsw ── nvsw  NV4│   │   NV4   nvsw ── nvsw  NV4│
│    │  /            \   │ │   │    │  /            \   │ │
│ [GPU 2] ── NV8 ── [GPU 3]│   │ [GPU 6] ── NV8 ── [GPU 7]│
└──────────────────────────┘   └──────────────────────────┘

Active NvLinks: 16 / 16    Remote classes: gpu=8, switch=8
CPU Affinity:   GPU0-3 → CPU 0-55, 112-167   GPU4-7 → CPU 56-111, 168-223

Matrix rendering example

            GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7   CPU Affinity   NUMA
GPU0         X    NV8   NV8   NV8   SYS   SYS   SYS   SYS    0-55,112-167   0
GPU1        NV8    X    NV8   NV8   SYS   SYS   SYS   SYS    0-55,112-167   0
...
Legend:  X=self   NVn=n bonded NvLinks   PXB=PCIe bridge   SYS=PCIe across NUMA   NODE=PCIe same NUMA
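
One possible shape for the matrix formatter, mirroring the legend above, is sketched here. `Link`, `label`, and `format_row` are hypothetical names, and the column widths (8 for the GPU name, 6 per cell) are arbitrary choices for illustration; the CPU affinity and NUMA columns are left out to keep the sketch short.

```rust
/// Relationship between two GPUs as shown in a matrix cell (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Link {
    SelfDevice,
    /// The `n` in an `NVn` cell.
    NvLink(u8),
    Pxb,
    Node,
    Sys,
}

impl Link {
    /// Text shown in the matrix cell.
    pub fn label(self) -> String {
        match self {
            Link::SelfDevice => "X".to_string(),
            Link::NvLink(n) => format!("NV{}", n),
            Link::Pxb => "PXB".to_string(),
            Link::Node => "NODE".to_string(),
            Link::Sys => "SYS".to_string(),
        }
    }
}

/// Format one row: a left-aligned GPU name followed by one centered cell per peer.
pub fn format_row(gpu: usize, links: &[Link]) -> String {
    let mut row = format!("{:<8}", format!("GPU{}", gpu));
    for link in links {
        row.push_str(&format!("{:^6}", link.label()));
    }
    // Drop trailing padding so rows do not end in spaces.
    row.trim_end().to_string()
}
```

A real implementation would size the columns from the GPU count, as the implementation plan suggests, instead of using fixed widths.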

Graceful degradation

  • No NvLink present → graph shows NUMA boxes with PCIe-annotated GPUs only.
  • Not NVIDIA → topology mode still renders NUMA + PCIe groupings; no SYS/NVn columns.
  • nvlink_remote_devices empty → dim "no active NvLinks" placeholder.

Implementation plan

Files to add / modify:

  • src/ui/tabs.rs — add Topology tab variant, integrate with tab cycle, T hotkey.
  • New src/ui/topology/mod.rs:
    • TopologyModel built from GpuInfo + chassis/NUMA info.
    • layout.rs — groups GPUs by NUMA, decides horizontal vs vertical stacking based on terminal width; places NvSwitch nodes deterministically.
    • graph_render.rs — ASCII edges; uses box-drawing chars already used in src/ui/widgets.rs.
    • matrix_render.rs — tabular view; column widths sized to fit GPU count.
    • classify_edge() — derives the NVn bandwidth hint from the active link count when NVML provides per-link bandwidth, and otherwise falls back to a plain "NV" label. If per-link bandwidth isn't currently captured, extend NvLinkRemoteDevice with bandwidth_mb_s: Option<u32> and populate it from NVML. This extension must serialize cleanly through Prometheus and the network parser.
  • New src/ui/renderers/topology_renderer.rs — draws a Topology panel for the selected host.
  • src/device/readers/nvidia_hardware.rs / nvidia.rs — ensure all needed data (per-link state, remote type, bandwidth) is exposed; extend if needed.
  • src/device/types.rs — optional new fields as needed; bump a mod-internal schema version for recorded snapshots (see Record/Replay issue) if the shape changes.
  • src/api/metrics/ — export any new fields added above as Prometheus labels/metrics so the remote view sees them. Follow the gpu_index/gpu_uuid label convention established in PR #181 (feat: standardize NPU Prometheus labels to npu_index/npu_uuid).
  • src/network/metrics_parser.rs — parse those new labels/metrics back into GpuInfo for the remote path.
  • src/mock/ — mock templates for DGX-like topology (8 GPUs, 2 NUMA nodes, full NvLink mesh) so the topology tab is exercised without real hardware. Gate with ALL_SMI_MOCK_TOPOLOGY=1 following the existing mock env-var convention.
  • src/ui/help.rs — document T and the in-tab M mode toggle.
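
To make the classify_edge() item more concrete, here is a minimal sketch of one reading of it: count the active links between a GPU pair and emit an `NVn` label when every active link reports a bandwidth, or a plain `NV` label otherwise. `LinkSample` and `EdgeClass` are hypothetical stand-ins; the real `NvLinkRemoteDevice` has its own shape, and `bandwidth_mb_s` is only the extension proposed in the plan.

```rust
/// Simplified stand-in for per-link data (hypothetical, not the real
/// `NvLinkRemoteDevice`).
pub struct LinkSample {
    pub active: bool,
    /// Proposed extension field; `None` when NVML gives no per-link bandwidth.
    pub bandwidth_mb_s: Option<u32>,
}

/// Label plus optional aggregate bandwidth for one GPU pair.
#[derive(Debug, PartialEq, Eq)]
pub struct EdgeClass {
    pub label: String,
    pub total_bandwidth_mb_s: Option<u64>,
}

/// Classify the edge formed by all links between two endpoints.
/// Returns `None` when no link is active.
pub fn classify_edge(links: &[LinkSample]) -> Option<EdgeClass> {
    let active: Vec<&LinkSample> = links.iter().filter(|l| l.active).collect();
    if active.is_empty() {
        return None;
    }
    // Summing `Option`s yields `None` as soon as one active link lacks a value.
    let total: Option<u64> = active
        .iter()
        .map(|l| l.bandwidth_mb_s.map(u64::from))
        .sum();
    let label = match total {
        Some(_) => format!("NV{}", active.len()),
        None => "NV".to_string(),
    };
    Some(EdgeClass { label, total_bandwidth_mb_s: total })
}
```

Whether the `NV` fallback should apply when only some links lack bandwidth data is a design choice this sketch settles arbitrarily; the issue text does not specify it.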

Acceptance criteria

  • On a system with NvLink, T opens a topology tab showing GPUs grouped by NUMA with NvLink edges.
  • M within the tab switches to matrix view; M again returns to graph view.
  • Non-NVIDIA systems show NUMA + PCIe layout without errors.
  • Systems without NUMA (e.g., single-socket workstation) show a single NUMA box.
  • Remote view: selecting different nodes via tabs updates the topology accordingly.
  • With ALL_SMI_MOCK_TOPOLOGY=1 the mock server produces a DGX-like topology the tab renders correctly.
  • Any new fields added to GpuInfo/NvLinkRemoteDevice serialize through Prometheus and round-trip via src/network/metrics_parser.rs with backward compatibility — old exporters continue to be parseable.
  • Layout must not overflow on 80-column terminals (fall back to matrix-only under a minimum width).
  • cargo test covers: layout with 2 GPUs, 4 GPUs, 8 GPUs on 1 / 2 NUMAs; edge classification; matrix formatting.
  • README gains a "Topology View" section.
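
Two of the criteria above (a single NUMA box on systems without NUMA, and the narrow-terminal fallback) could be covered by helpers along these lines. This is a sketch under assumptions: `group_by_numa`, `graph_fits`, and the `MIN_GRAPH_WIDTH` value are hypothetical, and mapping a missing NUMA id to node 0 is one possible policy rather than something the issue prescribes.

```rust
use std::collections::BTreeMap;

/// Group GPU indices by NUMA node, in ascending node order.
/// GPUs with no NUMA id (for example on a single-socket workstation) are
/// assigned to node 0 so that one box is still drawn. This is an assumption.
pub fn group_by_numa(numa_ids: &[Option<i32>]) -> BTreeMap<i32, Vec<usize>> {
    let mut groups: BTreeMap<i32, Vec<usize>> = BTreeMap::new();
    for (gpu_index, numa) in numa_ids.iter().enumerate() {
        groups.entry(numa.unwrap_or(0)).or_default().push(gpu_index);
    }
    groups
}

/// Hypothetical minimum width for graph mode; the real threshold would be
/// derived from the layout code.
pub const MIN_GRAPH_WIDTH: u16 = 60;

/// Whether graph mode can be drawn; below the threshold the tab would fall
/// back to matrix-only rendering.
pub fn graph_fits(terminal_width: u16) -> bool {
    terminal_width >= MIN_GRAPH_WIDTH
}
```

Using a `BTreeMap` keeps the NUMA boxes in a stable order between redraws, which also makes the layout tests listed above deterministic.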

Edge cases & non-goals

  • 16+ GPU systems (HGX / Grace): layout must flow into multiple rows; matrix mode remains the reliable fallback.
  • Active link counts differing between endpoints of a link: treat as degraded, color yellow.
  • Non-goal: inter-node topology (NVLink Fabric across chassis) — v1 is per-node only.
  • Non-goal: animating traffic — static structural view only.
  • NvSwitch classification relies on NvLinkRemoteType::Switch. If it isn't exposed on a particular system, show a neutral ? node.
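
The mismatched-endpoint rule above, combined with the green/yellow/gray scheme from the graph mode description, might be expressed as follows. `LinkStatus`, `link_status`, and `status_color` are hypothetical names, and colors are returned as plain strings here to avoid assuming a particular TUI crate.

```rust
/// Health of the connection between two endpoints (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LinkStatus {
    Active,
    Degraded,
    Inactive,
}

/// Compare the active link counts reported by each endpoint.
/// A mismatch is treated as degraded, as the edge-case list specifies.
pub fn link_status(a_to_b: u32, b_to_a: u32) -> LinkStatus {
    if a_to_b == 0 && b_to_a == 0 {
        LinkStatus::Inactive
    } else if a_to_b != b_to_a {
        LinkStatus::Degraded
    } else {
        LinkStatus::Active
    }
}

/// Color name for a status, following the scheme in the proposal.
pub fn status_color(status: LinkStatus) -> &'static str {
    match status {
        LinkStatus::Active => "green",
        LinkStatus::Degraded => "yellow",
        LinkStatus::Inactive => "gray",
    }
}
```

For example, `link_status(8, 6)` yields `Degraded`, which would be drawn in yellow.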
