feat(tui): topology view tab ('T') - NvLink/NUMA/PCIe graph and matrix #200
Merged
…tion

Implements the per-host Topology tab requested in issue #190. Adds a new reserved tab that ships with remote and replay modes, accessible via 'T' and toggled between graph and matrix modes with 'M'.

Graph mode renders NUMA zones as ASCII boxes with GPUs inside and NvLink/NvSwitch edges between them; NUMA boxes stack side-by-side on wide terminals and fall back to vertical stacking on narrower ones. Matrix mode mirrors `nvidia-smi topo -m` with CPU affinity + NUMA columns.

Graceful-degradation paths cover hosts without NvLink (PCIe only), non-NVIDIA hosts (NUMA groups only), hosts without NUMA (single synthetic "NUMA ?" box), and terminals narrower than 100 columns (automatic matrix fallback so nothing overflows 80-col sessions).

Extends `NvLinkRemoteDevice` with `bandwidth_mb_s: Option<u32>` so the NVn generation classifier can derive labels like "NV5" from the hint. The new label serialises through the Prometheus exporter and round-trips via the network parser with backward compatibility: pre-#190 exporters that omit `bandwidth_mb_s` continue to parse cleanly.

Mock clusters can exercise the tab via `ALL_SMI_MOCK_TOPOLOGY=1`, which emits a DGX-style 8-GPU, 2-NUMA, 64-link topology from every synthetic NVIDIA node.

Closes #190
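The narrow-terminal fallback described above can be pictured as a small pure function. This is only a sketch under stated assumptions: `TopologyViewMode` and the 100-column threshold come from the PR text, while `effective_mode` and `GRAPH_MIN_COLS` are hypothetical names chosen for illustration, not the renderer's actual API.

```rust
/// View mode requested by the operator (toggled with `M`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TopologyViewMode {
    Graph,
    Matrix,
}

/// Width below which the graph is replaced by the matrix (threshold from the PR text).
/// The constant name is an assumption.
pub const GRAPH_MIN_COLS: u16 = 100;

/// Hypothetical helper: resolve the mode that is actually drawn for a given terminal width.
pub fn effective_mode(requested: TopologyViewMode, term_cols: u16) -> TopologyViewMode {
    match requested {
        // Graph mode needs room for the NUMA boxes; narrow terminals get the matrix.
        TopologyViewMode::Graph if term_cols < GRAPH_MIN_COLS => TopologyViewMode::Matrix,
        other => other,
    }
}
```

Keeping the decision separate from drawing makes the fallback easy to unit-test without a terminal.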
C1: Topology tab now tracks the operator's host selection. Previously pressing `T` unconditionally overwrote `state.current_tab`, so the renderer always fell through to the first host-shaped tab. Stash the previously-selected host tab in `topology_last_host_tab` (both on `T` and on Left/Right arrow navigation), propagate it through the render snapshot, and have `topology_target_host` honour it when still present in the tab strip. Remote and replay tab updaters clear the cached name when the stashed host disappears (disconnect, switched recording) so the renderer falls back to the first host instead of displaying stale data.

H1: README claimed `Tab`/`Shift-Tab` cycles Topology hosts, but only the arrow keys are wired up. Correct the wording and note that the tab now remembers the last-selected host.

M1: Hide the matrix `CPU Affinity` column until NVML `nvmlDeviceGetCpuAffinity` plumbing lands; shipping a column that always says `-` is just noise. Drop the dead `cpu_affinity` helper and shrink the `pick_cell_width` overhead accordingly (27 → 13 cells) so narrower terminals can now render the matrix.

M2: Clarify `bandwidth_to_generation` doc. NvLink Gen 2/3/4 all share the ~25 GB/s per-link ceiling and are collapsed into the `Some(4)` bucket; `Some(2)` and `Some(3)` are never returned by design.

M3: Simplify `pick_cell_width` tail. The trailing `if MIN_CELL * gpu_count <= usable` branch is dead because the preceding `for cw in (MIN_CELL..=MAX_CELL).rev()` already covers the MIN_CELL case. Return 0 directly after the loop.

Tests: add three `topology_target_host_*` cases covering the remembered host, the empty fallback, and the stale-host fallback paths. Refit the `falls_back_to_summary_under_80_col` and matrix legend tests to the new overhead / column layout.
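To make the M3 reasoning concrete, here is a minimal sketch of a cell-width picker with the simplified tail. The 13-cell overhead and the descending loop are taken from the commit message; the `MIN_CELL` / `MAX_CELL` values, the `OVERHEAD` name, and the function signature are assumptions for illustration only.

```rust
const MIN_CELL: usize = 4; // assumed value, not from the PR
const MAX_CELL: usize = 6; // assumed value, not from the PR
const OVERHEAD: usize = 13; // fixed non-GPU columns, per the commit message

/// Pick the widest per-GPU cell that fits, or 0 when even MIN_CELL does not fit.
pub fn pick_cell_width(term_cols: usize, gpu_count: usize) -> usize {
    let usable = term_cols.saturating_sub(OVERHEAD);
    // Try the widest cell first; the last iteration is cw == MIN_CELL,
    // so no separate MIN_CELL check is needed after the loop.
    for cw in (MIN_CELL..=MAX_CELL).rev() {
        if cw * gpu_count <= usable {
            return cw;
        }
    }
    0
}
```

Because the range is inclusive, a post-loop `if MIN_CELL * gpu_count <= usable` could never be true, which is why returning 0 directly is equivalent.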
Add 9 tests covering the Topology tab hardening from issue #190:

- t_key_jumps_to_topology_tab_and_remembers_host: T hotkey jumps to the Topology tab and stashes the previously-selected host tab in topology_last_host_tab so the renderer honours the operator's host selection on return.
- t_key_is_noop_when_topology_tab_absent: silent no-op in local mode where the Topology tab is never inserted.
- remember_current_host_tab_skips_reserved_tabs: All / Users / Topology reserved tabs are never stashed.
- remember_current_host_tab_stashes_host_tab: host tabs are stashed correctly.
- m_key_toggles_topology_view_mode_when_topology_active: uppercase M cycles Graph → Matrix → Graph.
- lowercase_m_also_toggles_topology_view_mode: lowercase m accepted to reduce muscle-memory friction.
- m_key_does_not_toggle_topology_mode_outside_topology_tab: M outside the Topology tab hits the global GPU-sort binding, not the mode toggle.
- test_snapshot_capture_preserves_topology_state: topology_view_mode and topology_last_host_tab survive RenderSnapshot::capture.
- test_topology_state_roundtrips_through_as_app_state: both fields survive the full capture → as_app_state round-trip.
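The host-tracking behaviour these tests exercise might look roughly like the sketch below. The names `topology_last_host_tab`, `remember_current_host_tab`, `topology_target_host`, and `current_tab` appear in the PR; the `AppState` layout, the `jump_to_topology` method, the tab-name spellings, and the use of an index for `current_tab` are assumptions. In the PR the tab updaters clear a stale name, whereas this sketch simply ignores it at lookup time.

```rust
/// Reserved tab names (assumed spellings) that never count as a host.
const RESERVED_TABS: [&str; 3] = ["All", "Users", "Topology"];

/// Simplified, hypothetical stand-in for the real application state.
pub struct AppState {
    pub tabs: Vec<String>,
    pub current_tab: usize,
    pub topology_last_host_tab: Option<String>,
}

impl AppState {
    /// Stash the current tab only when it is a host tab.
    pub fn remember_current_host_tab(&mut self) {
        if let Some(name) = self.tabs.get(self.current_tab) {
            if !RESERVED_TABS.contains(&name.as_str()) {
                self.topology_last_host_tab = Some(name.clone());
            }
        }
    }

    /// `T` key: jump to the Topology tab, or do nothing when it is absent.
    pub fn jump_to_topology(&mut self) {
        if let Some(idx) = self.tabs.iter().position(|t| t.as_str() == "Topology") {
            self.remember_current_host_tab();
            self.current_tab = idx;
        }
    }

    /// Host to draw: the remembered one if still present, else the first host tab.
    pub fn topology_target_host(&self) -> Option<&str> {
        if let Some(name) = self.topology_last_host_tab.as_deref() {
            if self.tabs.iter().any(|t| t.as_str() == name) {
                return Some(name);
            }
        }
        self.tabs
            .iter()
            .map(String::as_str)
            .find(|t| !RESERVED_TABS.contains(t))
    }
}
```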
PR finalization complete.

- Tests: Added 9 tests covering topology tab hardening in …
- Documentation: README "Topology View" section already present with Graph/Matrix mode explanation.
- Lint/Format: …
- Final test counts: 817 lib + 944 bin (up from 935 bin). All passing.
Summary
Adds a dedicated Topology tab (`T`) that visualises intra-node GPU interconnect structure: NvLink connections (GPU↔GPU, GPU↔NvSwitch), NUMA affinity, and PCIe lanes. Includes both a graph-style ASCII layout and an `nvidia-smi topo -m`-equivalent matrix fallback. Works on systems without NvLink by dropping to a PCIe-only rendering.
Implementation
- `src/ui/topology/`: pure-logic core split across five files, each under the 500-line soft limit:
  - `mod.rs`: `TopologyModel` + `TopologyViewMode` assembled from `GpuInfo.nvlink_remote_devices` + `numa_node_id` + `detail`.
  - `classify_edge.rs`: `NVn` bandwidth-hint classifier with a dominant-generation picker; falls back to generic `NV` when bandwidth is unknown.
  - `layout.rs`: NUMA-aware grid layout; picks horizontal vs vertical box stacking based on terminal width.
  - `graph_render.rs`: ASCII NUMA boxes with GPUs and edges.
  - `matrix_render.rs`: `nvidia-smi topo -m` table, fits the column widths to the GPU count.
- `src/ui/renderers/topology_renderer.rs`: draws the panel for the selected host; falls back to matrix on terminals narrower than 100 columns so the content never overflows on 80-column sessions.
- `T` jumps to the tab; `M` toggles graph/matrix while the tab is active. Mode precedence ladder updated: filter-edit > replay-timecode > Users-tab keys > Topology-tab keys > global > replay.
- Extends `NvLinkRemoteDevice` with `bandwidth_mb_s: Option<u32>` so the `NVn` generation classifier can derive labels like `NV5` from the hint. Existing construction sites (NVIDIA reader, mock templates, test fixtures) updated.
- The Prometheus exporter emits `bandwidth_mb_s` as an optional label on `all_smi_nvlink_remote_device_type`. Parser accepts the label when present, rejects absurd upstream values, and remains backward-compatible with pre-#190 exporters that omit it.
- `ALL_SMI_MOCK_TOPOLOGY=1` emits a DGX-style 8-GPU, 2-NUMA, 64-link (7 GPU + 1 switch per GPU) topology so the tab can be exercised without real hardware.
- README documents `T`, `M`, and graceful-degradation behaviour.
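The 64-link figure for the mock topology follows from 8 GPUs × (7 GPU peers + 1 switch). A minimal sketch of such a generator is shown below; the `Remote` and `MockLink` types and the `mock_dgx_links` function are hypothetical illustrations, not the mock server's actual code, and the even GPU-to-NUMA split is an assumption.

```rust
/// Hypothetical remote endpoint of a mock NvLink.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Remote {
    Gpu(usize),
    Switch,
}

/// Hypothetical mock link record.
#[derive(Debug, Clone)]
pub struct MockLink {
    pub local_gpu: usize,
    pub numa_node: usize,
    pub remote: Remote,
}

/// Build a full GPU mesh plus one switch link per GPU.
/// Returns an empty list for empty input (the "empty-input no-op" case).
pub fn mock_dgx_links(gpu_count: usize, numa_nodes: usize) -> Vec<MockLink> {
    let mut links = Vec::new();
    if gpu_count == 0 || numa_nodes == 0 {
        return links;
    }
    // Assumed split: GPUs are assigned to NUMA nodes in contiguous, equal-sized groups.
    let per_numa = (gpu_count + numa_nodes - 1) / numa_nodes;
    for gpu in 0..gpu_count {
        let numa_node = gpu / per_numa;
        // One link to every other GPU.
        for peer in (0..gpu_count).filter(|&p| p != gpu) {
            links.push(MockLink { local_gpu: gpu, numa_node, remote: Remote::Gpu(peer) });
        }
        // Plus one link to the switch.
        links.push(MockLink { local_gpu: gpu, numa_node, remote: Remote::Switch });
    }
    links
}
```

With `mock_dgx_links(8, 2)`, GPUs 0 to 3 land on NUMA node 0 and GPUs 4 to 7 on node 1, and each GPU contributes 8 links.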
Graceful degradation
- … `SYS`/`NVn` vocabulary.
- `nvlink_remote_devices` empty → dim "no active NvLinks" placeholder.
- … `NUMA ?` box.

Testing
- … NUMAs; edge classification covering full mesh / switch mesh / no NvLink; matrix formatting (header + legend + cell sizing); graph rendering (horizontal and vertical stacking); view mode toggle; NvLink bandwidth round-trip + backward-compat with old exporters.
- `hardware_details_integration_test` confirms the Prometheus exporter + network parser handle the new label without breaking older scrapes.
- … races): NUMA split, 64-link count, instance labelling, empty-input no-op.
- `cargo test --lib` (817 pass), `cargo test --bin all-smi` (932 pass), `cargo test --bin all-smi-mock-server --features=mock` (52 pass), `cargo clippy --all-targets -- -D warnings`, and `cargo fmt --all -- --check` all succeed.

Test plan
- … `T` in remote mode against a real or mocked cluster.
- … `M` toggles between graph and matrix modes without data loss.
- … mode shows the "matrix fallback" hint and switches to matrix rendering.
- … `ALL_SMI_MOCK_TOPOLOGY=1`, confirm the mock server emits the DGX-style topology and the TUI renders it correctly.
- … `bandwidth_mb_s` label) from the new TUI and confirm NvLink rows still appear in the matrix.
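The old-exporter check in the last item relies on the label being optional on the parsing side. A minimal sketch of that tolerance follows, assuming labels arrive as a string map; `parse_bandwidth_mb_s` and the `MAX_PLAUSIBLE_MB_S` cap are hypothetical, since the PR does not state the actual function name or the threshold used to reject absurd values.

```rust
use std::collections::HashMap;

/// Hypothetical sanity cap in MB/s; the real threshold is not given in the PR.
const MAX_PLAUSIBLE_MB_S: u32 = 10_000_000;

/// Read the optional `bandwidth_mb_s` label.
/// A missing label (older exporter), an unparsable value, or an implausibly
/// large value all yield `None`, so the row is kept and only the hint is dropped.
pub fn parse_bandwidth_mb_s(labels: &HashMap<String, String>) -> Option<u32> {
    labels
        .get("bandwidth_mb_s")
        .and_then(|raw| raw.trim().parse::<u32>().ok())
        .filter(|&v| v <= MAX_PLAUSIBLE_MB_S)
}
```

Returning `Option<u32>` rather than an error matches the `bandwidth_mb_s: Option<u32>` field, so a scrape from an older exporter still produces NvLink rows, just without a generation hint.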
Closes #190