Summary
Add a dedicated topology tab (T) that visualizes intra-node GPU interconnect structure: NvLink connections (GPU↔GPU, GPU↔NvSwitch), NUMA affinity, and PCIe lanes. It includes a graph-style ASCII layout and an nvidia-smi topo -m style matrix mode, and works on systems without NvLink by falling back to a PCIe-only rendering. This surfaces data the project already collects but that currently has no home.
Motivation
Modern GPU work depends heavily on topology: collective communication (NCCL all-reduce) performance is dominated by NvLink availability, NUMA affinity between CPU and GPU changes throughput meaningfully, and operators frequently need to answer "are these 8 GPUs fully-meshed or switch-connected?" without a second tool.
all-smi already reads nvlink_remote_devices, numa_node_id, and PCIe info per GPU (from the recent PR #175 and the hardware-details series), and the data round-trips through the Prometheus exporter too. But nothing in the TUI renders this topology, so it stays invisible to the user.
Current state
GpuInfo.nvlink_remote_devices: Vec<NvLinkRemoteDevice> is populated by the NVIDIA reader.
GpuInfo.numa_node_id: Option<i32> is populated on Linux hosts with NUMA topology.
PCIe info lives in the GpuInfo.detail map.
None of the above is rendered anywhere in the TUI; it is only visible via Prometheus metrics.
Proposed design
Tab and modes
New tab T (mnemonic: "Topology"). In remote mode, default to showing the currently selected host's topology (same host as the per-host tab); Tab/Shift-Tab cycles nodes. In local mode, always the local node.
Two render modes, toggled with M:
Graph mode (default): ASCII graph showing NUMA zones as boxes, GPUs inside, edges for NvLink and NvSwitch. Active links green, degraded yellow, inactive gray.
Matrix mode: nvidia-smi topo -m-equivalent table with CPU affinity and NUMA columns.
Matrix rendering example
        GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  CPU Affinity  NUMA
GPU0    X     NV8   NV8   NV8   SYS   SYS   SYS   SYS   0-55,112-167  0
GPU1    NV8   X     NV8   NV8   SYS   SYS   SYS   SYS   0-55,112-167  0
...
Legend: X=self, NVn=n bonded NvLinks (as in nvidia-smi topo -m), PXB=PCIe bridge, SYS=PCIe across NUMA, NODE=PCIe same NUMA
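As a rough illustration of the matrix mode, the sketch below pads a square grid of already-classified edge labels into aligned columns. The function name render_matrix and its input shape are assumptions made for this example, not existing all-smi code, and the CPU Affinity and NUMA columns are left out for brevity.

```rust
/// Hypothetical helper: renders a square grid of edge labels as an aligned
/// text table. `labels[i][j]` is the label between GPU i and GPU j.
fn render_matrix(labels: &[Vec<&str>]) -> String {
    let n = labels.len();
    // Column width: the widest cell or header, plus two spaces of padding.
    let widest = labels
        .iter()
        .flatten()
        .map(|s| s.len())
        .chain((0..n).map(|i| format!("GPU{i}").len()))
        .max()
        .unwrap_or(0);
    let width = widest + 2;

    let mut out = String::new();
    // Header row: a blank corner cell, then one column per GPU.
    out.push_str(&format!("{:<width$}", ""));
    for j in 0..n {
        out.push_str(&format!("{:<width$}", format!("GPU{j}")));
    }
    out.push('\n');
    // One row per GPU: row header followed by its labels.
    for (i, row) in labels.iter().enumerate() {
        out.push_str(&format!("{:<width$}", format!("GPU{i}")));
        for cell in row {
            out.push_str(&format!("{:<width$}", cell));
        }
        out.push('\n');
    }
    out
}
```

Sizing every column from the widest label keeps the table aligned whether cells hold X, NV8, or SYS. A real renderer would probably also truncate or scroll when the GPU count exceeds the terminal width.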
Graceful degradation
No NvLink present → graph shows NUMA boxes with PCIe-annotated GPUs only.
Not NVIDIA → topology mode still renders NUMA + PCIe groupings; no SYS/NVn columns.
nvlink_remote_devices empty → dim "no active NvLinks" placeholder.
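One possible way to encode the fallback rules above is a small decision function. Everything here is a sketch under assumptions: GraphContent, GpuFacts, and choose_graph_content are hypothetical names. The split between "no NvLink present" (hardware without NvLink) and "nvlink_remote_devices empty" (NvLink-capable but with no active links) is one reading of the list, not something the project defines.

```rust
/// What the graph view should draw for one host (illustrative names).
#[derive(Debug, PartialEq, Eq)]
enum GraphContent {
    /// NUMA boxes with NvLink / NvSwitch edges.
    NvLinkGraph,
    /// NUMA boxes with PCIe-annotated GPUs only.
    PcieOnly,
    /// Dim "no active NvLinks" placeholder.
    NoActiveLinksPlaceholder,
    /// Non-NVIDIA hardware: NUMA + PCIe groupings.
    NumaPcieGroupings,
}

/// Minimal per-GPU facts, assumed to be derivable from GpuInfo.
struct GpuFacts {
    is_nvidia: bool,
    nvlink_capable: bool,
    active_nvlinks: usize,
}

fn choose_graph_content(gpus: &[GpuFacts]) -> GraphContent {
    // No NVIDIA GPU at all: only NUMA and PCIe data can be shown.
    if !gpus.iter().any(|g| g.is_nvidia) {
        return GraphContent::NumaPcieGroupings;
    }
    // NVIDIA GPUs, but none of them has NvLink hardware.
    if !gpus.iter().any(|g| g.nvlink_capable) {
        return GraphContent::PcieOnly;
    }
    // NvLink hardware exists, but every link list is empty.
    if gpus.iter().all(|g| g.active_nvlinks == 0) {
        return GraphContent::NoActiveLinksPlaceholder;
    }
    GraphContent::NvLinkGraph
}
```

Keeping the decision in one pure function makes each degradation path easy to unit-test without a terminal.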
Implementation plan
Files to add / modify:
src/ui/tabs.rs — add Topology tab variant, integrate with tab cycle, T hotkey.
New src/ui/topology/mod.rs:
TopologyModel built from GpuInfo + chassis/NUMA info.
layout.rs — groups GPUs by NUMA, decides horizontal vs vertical stacking based on terminal width; places NvSwitch nodes deterministically.
graph_render.rs — ASCII edges; uses box-drawing chars already used in src/ui/widgets.rs.
matrix_render.rs — tabular view; column widths sized to fit GPU count.
classify_edge() — derives the NVn bandwidth hint from the active link count when NVML provides per-link bandwidth, and falls back to a plain "NV" label otherwise. If per-link bandwidth isn't currently captured, extend NvLinkRemoteDevice with bandwidth_mb_s: Option<u32> and populate it from NVML. This extension must serialize cleanly through Prometheus and the network parser.
New src/ui/renderers/topology_renderer.rs — draws a Topology panel for the selected host.
src/device/readers/nvidia_hardware.rs / nvidia.rs — ensure all needed data (per-link state, remote type, bandwidth) is exposed; extend if needed.
src/device/types.rs — optional new fields as needed; bump a mod-internal schema version for recorded snapshots (see Record/Replay issue) if the shape changes.
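src/api/metrics/ — export any new fields added above as Prometheus labels/metrics so the remote view sees them. Follow the gpu_index/gpu_uuid label convention established in PR #181 (feat: standardize NPU Prometheus labels to npu_index/npu_uuid).
src/network/metrics_parser.rs — parse those new labels/metrics back into GpuInfo for the remote path.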
src/mock/ — mock templates for DGX-like topology (8 GPUs, 2 NUMA nodes, full NvLink mesh) so the topology tab is exercised without real hardware. Gate with ALL_SMI_MOCK_TOPOLOGY=1 following the existing mock env-var convention.
src/ui/help.rs — document T and the in-tab M mode toggle.
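To make the classify_edge() item in the plan more concrete, here is a minimal sketch. The NvLinkRemoteDevice below is a trimmed stand-in: only bandwidth_mb_s: Option<u32> comes from this proposal, while remote_gpu_index, active, and the EdgeLabel enum are assumptions for illustration, and the real struct in src/device/types.rs may look different.

```rust
/// Trimmed stand-in for the project's NvLinkRemoteDevice.
#[derive(Debug, Clone)]
struct NvLinkRemoteDevice {
    remote_gpu_index: usize,     // assumed field
    active: bool,                // assumed field
    bandwidth_mb_s: Option<u32>, // field proposed in this issue
}

/// Label for the edge between two GPUs (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
enum EdgeLabel {
    /// "NVn": n active links, aggregate bandwidth known.
    NvLinkCount { links: usize, total_mb_s: u64 },
    /// Plain "NV": links are active but bandwidth is unavailable.
    NvLink,
    /// No active NvLink to the peer.
    None,
}

/// Classifies the edge from one GPU (whose links are given) to `peer`.
fn classify_edge(links: &[NvLinkRemoteDevice], peer: usize) -> EdgeLabel {
    let active: Vec<&NvLinkRemoteDevice> = links
        .iter()
        .filter(|l| l.active && l.remote_gpu_index == peer)
        .collect();
    if active.is_empty() {
        return EdgeLabel::None;
    }
    // Summing Option values yields None as soon as any link lacks a
    // bandwidth reading, which triggers the plain "NV" fallback.
    let total: Option<u64> = active
        .iter()
        .map(|l| l.bandwidth_mb_s.map(u64::from))
        .sum();
    match total {
        Some(total_mb_s) => EdgeLabel::NvLinkCount {
            links: active.len(),
            total_mb_s,
        },
        None => EdgeLabel::NvLink,
    }
}
```

Because bandwidth_mb_s is an Option, data from older exporters that never emit the field would simply take the "NV" branch, which lines up with the backward-compatibility requirement.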
Acceptance criteria
On a system with NvLink, T opens a topology tab showing GPUs grouped by NUMA with NvLink edges.
M within the tab switches to matrix view; M again returns to graph view.
Non-NVIDIA systems show NUMA + PCIe layout without errors.
Systems without NUMA (e.g., single-socket workstation) show a single NUMA box.
Remote view: selecting different nodes via tabs updates the topology accordingly.
With ALL_SMI_MOCK_TOPOLOGY=1 the mock server produces a DGX-like topology the tab renders correctly.
Any new fields added to GpuInfo/NvLinkRemoteDevice serialize through Prometheus and round-trip via src/network/metrics_parser.rs with backward compatibility — old exporters continue to be parseable.
Layout must not overflow on 80-column terminals (fall back to matrix-only under a minimum width).
cargo test covers: layout with 2 GPUs, 4 GPUs, 8 GPUs on 1 / 2 NUMAs; edge classification; matrix formatting.
README gains a "Topology View" section.
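Two of the criteria above, NUMA grouping and the narrow-terminal fallback, lend themselves to small pure functions that cargo test can cover directly. The sketch below is only a suggestion: group_by_numa, RenderMode, effective_mode, and the 80-column threshold constant are all assumed names and values.

```rust
use std::collections::BTreeMap;

/// Groups GPU indices by NUMA node. GPUs without a NUMA id share the `None`
/// key, which corresponds to the "single NUMA box" case.
fn group_by_numa(numa_ids: &[Option<i32>]) -> BTreeMap<Option<i32>, Vec<usize>> {
    let mut groups: BTreeMap<Option<i32>, Vec<usize>> = BTreeMap::new();
    for (gpu_index, numa) in numa_ids.iter().enumerate() {
        groups.entry(*numa).or_default().push(gpu_index);
    }
    groups
}

#[derive(Debug, PartialEq, Eq)]
enum RenderMode {
    Graph,
    Matrix,
}

/// Assumed threshold: graph mode needs at least this many columns.
const MIN_GRAPH_WIDTH: u16 = 80;

/// Falls back to matrix mode when the terminal is too narrow for the graph.
fn effective_mode(requested: RenderMode, terminal_width: u16) -> RenderMode {
    if terminal_width < MIN_GRAPH_WIDTH {
        RenderMode::Matrix
    } else {
        requested
    }
}
```

A BTreeMap keeps the NUMA boxes in a stable order across refreshes, which helps with the deterministic placement the layout step asks for.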
Edge cases & non-goals
16+ GPU systems (HGX / Grace): layout must flow into multiple rows; matrix mode remains the reliable fallback.
Active link counts differing between endpoints of a link: treat as degraded, color yellow.
Non-goal: inter-node topology (NVLink Fabric across chassis) — v1 is per-node only.
If NvLinkRemoteType::Switch isn't exposed on a particular system, show a neutral "?" node.
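The endpoint-mismatch edge case can be expressed as a comparison of the active link counts each side reports. The following is a hedged sketch: LinkHealth, link_health, and link_color are hypothetical names, and the color strings only mirror the green/yellow/gray scheme described for graph mode.

```rust
#[derive(Debug, PartialEq, Eq)]
enum LinkHealth {
    Active,
    Degraded,
    Inactive,
}

/// Compares the active link counts reported by both endpoints of an edge.
fn link_health(links_a_to_b: usize, links_b_to_a: usize) -> LinkHealth {
    match (links_a_to_b, links_b_to_a) {
        // Neither side reports an active link.
        (0, 0) => LinkHealth::Inactive,
        // Both sides agree on a non-zero count.
        (a, b) if a == b => LinkHealth::Active,
        // The counts differ, so the edge is treated as degraded.
        _ => LinkHealth::Degraded,
    }
}

/// Maps health to the color scheme used in graph mode.
fn link_color(health: &LinkHealth) -> &'static str {
    match health {
        LinkHealth::Active => "green",
        LinkHealth::Degraded => "yellow",
        LinkHealth::Inactive => "gray",
    }
}
```

For example, link_health(4, 3) yields Degraded and would be drawn in yellow, while link_health(0, 0) yields Inactive and gray.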