Fix Fabric Manager detection failure on HGX GPU systems (A100/H100/H200)#2313
cb-github-robot merged 4 commits into cloud-barista:main
Conversation
…tput

When grep -c finds 0 matches, it outputs '0' and exits with code 1, so '|| echo 0' also outputs '0'. The variable becomes '0\n0', which fails the integer test and loops forever. Fix: move the fallback '|| VAR=0' outside the command substitution so only grep's stdout count is captured.
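The pitfall can be reproduced without any GPU hardware; the simulated lspci line below stands in for the real query and is purely illustrative:

```shell
# Simulated lspci output containing no NVIDIA lines
LSPCI_OUT="00:1f.0 ISA bridge: Intel Corporation Device"

# Buggy form: on zero matches, grep -c prints '0' AND exits 1,
# so the fallback fires too and the substitution captures "0\n0"
GPU_COUNT=$(echo "$LSPCI_OUT" | grep -ic "nvidia.*3d controller" || echo 0)
printf 'buggy: %s\n' "$GPU_COUNT"   # prints '0' on two lines

# Fixed form: the fallback assignment sits outside the substitution,
# so only grep's stdout count is captured
GPU_COUNT=$(echo "$LSPCI_OUT" | grep -ic "nvidia.*3d controller") || GPU_COUNT=0
printf 'fixed: %s\n' "$GPU_COUNT"   # prints a single clean '0'
```

With the buggy value, a later `[ "$GPU_COUNT" -eq 0 ]` fails with "integer expression expected", which is what kept the wait loop spinning.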
…fig reset

'containerd config default' overwrites any nvidia-ctk runtime settings that installCudaDriver.sh previously configured. This caused 'CUDA system not yet initialized' errors in GPU Operator validator pods because containerd had no nvidia runtime registered. Fix: after generating the default config, re-run 'nvidia-ctk runtime configure' if a GPU is detected and nvidia-ctk is available.
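The re-apply step looks roughly like the sketch below (paths and guards assumed from the scripts' context; this is a provisioning fragment that needs containerd and the NVIDIA Container Toolkit installed, not something runnable standalone):

```shell
# Regenerating defaults wipes any nvidia runtime entries added earlier
sudo containerd config default | sudo tee /etc/containerd/config.toml >/dev/null

# Re-register the NVIDIA runtime only when a GPU and nvidia-ctk are present
if lspci | grep -qi nvidia && command -v nvidia-ctk &>/dev/null; then
    sudo nvidia-ctk runtime configure --runtime=containerd --set-as-default
fi
sudo systemctl restart containerd
```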
1. Add --set-as-default to nvidia-ctk runtime configure (all 3 scripts). Without this flag, containerd's default_runtime_name stays 'runc', causing GPU Operator validator pods to fail with 'CUDA system not yet initialized' because they don't explicitly request the nvidia runtime.
2. Add the nvidia-ctk re-apply step in k8s-control-plane-setup.sh. Same fix already in k8s-worker-setup.sh: 'containerd config default' overwrites nvidia-ctk settings, so they need to be re-applied after the config reset.
3. Fix the grep -c fallback in the installCudaDriver.sh GPU_COUNT detection. grep -c with no match outputs '0' and exits 1, causing || echo '0' to also output '0', resulting in GPU_COUNT='0\n0', which breaks the integer comparison. Move the fallback outside the $() substitution.

Affected scripts:
- scripts/usecases/k8s/k8s-worker-setup.sh
- scripts/usecases/k8s/k8s-control-plane-setup.sh
- scripts/usecases/llm/installCudaDriver.sh
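For point 1, the effect of --set-as-default can be checked against the key it writes into containerd's config; the TOML fragment below is illustrative of that one key, not the full generated file:

```shell
# Illustrative fragment of /etc/containerd/config.toml after
# 'nvidia-ctk runtime configure --runtime=containerd --set-as-default'
CONFIG_SAMPLE='[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"'

# Without the flag this key stays "runc", and pods that do not set a
# runtimeClassName (like the GPU Operator validator) never see the GPU
DEFAULT_RT=$(echo "$CONFIG_SAMPLE" | grep default_runtime_name | sed 's/.*"\(.*\)".*/\1/')
echo "$DEFAULT_RT"   # nvidia
```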
Root cause: all 3 NVSwitch/multi-GPU detection methods failed:
1. GPU_COUNT regex was inverted: pattern 'nvidia.*3d controller' expects
'nvidia' before '3d controller', but lspci format is:
'00:1e.0 3D controller: NVIDIA Corporation A100-SXM4-40GB'
Class comes BEFORE vendor, so the pattern never matched datacenter
GPUs (3D controller class). GPU_COUNT was always 0.
Fix: two-stage grep (grep nvidia | grep -c '3d controller|vga')
2. NVSwitch PCI detection: plain lspci shows NVSwitch as
'Bridge: NVIDIA Corporation Device 2200' without 'nvswitch' text.
Fix: fallback to lspci -n with known NVSwitch PCI device IDs.
3. /dev/nvidia-nvswitch* files don't exist before first driver load
(typically requires reboot after package install).
Fix: add nvidia-smi topology as additional detection method.
Also improved DRIVER_MAJOR detection for Fabric Manager version matching:
dpkg pattern may miss drivers installed via cuda-drivers metapackage,
now falls back to nvidia-smi --query-gpu=driver_version.
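The inverted-pattern bug from point 1 can be reproduced with a simulated lspci line (the A100 string below mirrors the format quoted above; no hardware needed):

```shell
# Class comes BEFORE vendor in lspci output
LINE="00:1e.0 3D controller: NVIDIA Corporation A100-SXM4-40GB"

# Old pattern expects 'nvidia' before '3d controller' -> never matches
OLD=$(echo "$LINE" | grep -ic "nvidia.*3d controller" || true)

# Two-stage fix: filter by vendor first, then count by device class
NEW=$(echo "$LINE" | grep -i "nvidia" | grep -icE "3d controller|vga" || true)

echo "old=$OLD new=$NEW"   # old=0 new=1
```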
/approve
Pull request overview
This PR improves automation scripts for HGX-class GPU systems by making NVSwitch/Fabric Manager detection more robust, and adjusts NVIDIA container runtime configuration to better support GPU Operator and llm-d deployments.
Changes:
- Enhance the NVSwitch detection logic and add a fallback method to derive the Fabric Manager package version from nvidia-smi.
- Configure containerd's NVIDIA runtime with --set-as-default and re-apply the runtime config after regenerating containerd defaults in the K8s setup scripts.
- Adjust the llm-d deployment readiness counting logic in the wait loop.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| scripts/usecases/llm/installCudaDriver.sh | Adds additional NVSwitch detection paths and improves Fabric Manager version detection; updates containerd runtime configuration. |
| scripts/usecases/llm/deploy-llm-d.sh | Modifies the pod readiness loop logic for llm-d deployments. |
| scripts/usecases/k8s/k8s-worker-setup.sh | Re-applies NVIDIA runtime configuration after containerd default config regeneration on workers. |
| scripts/usecases/k8s/k8s-control-plane-setup.sh | Re-applies NVIDIA runtime configuration after containerd default config regeneration on control planes. |
```shell
# They appear as "Bridge" class, not "3D controller" or "VGA"
NVSWITCH_PCI=$(sudo lspci -n 2>/dev/null | grep -i "10de:2[23]" || true)
```

The fallback NVSwitch PCI detection `lspci -n | grep "10de:2[23]"` is too broad and can match regular NVIDIA GPU device IDs as well (since it doesn't check the PCI class code or exact NVSwitch device IDs). This can cause false NVSwitch detection and unnecessary Fabric Manager installation. Consider filtering by bridge class (e.g., class code 0604) and/or matching a tighter allowlist of known NVSwitch device IDs rather than a prefix regex.

Suggested change:

```shell
# They appear as "Bridge" class (PCI class code 0604), not "3D controller" or "VGA"
NVSWITCH_PCI=$(sudo lspci -n 2>/dev/null | grep -i -E "0604: 10de:(2200|22a0|22b0|22c0|2320)" || true)
```
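The false-positive risk the reviewer describes can be shown on simulated `lspci -n` output; the H100 device ID 10de:2330 below is an assumption for illustration, and the NVSwitch ID comes from the allowlist above:

```shell
# Simulated 'lspci -n' lines: an H100 GPU (class 0302) and an
# NVSwitch (bridge class 0604); device IDs assumed for illustration
LSPCI_N="00:1e.0 0302: 10de:2330
00:05.0 0604: 10de:22a0"

# Broad prefix match hits the GPU line too (10de:23.. matches 10de:2[23])
BROAD=$(echo "$LSPCI_N" | grep -ic "10de:2[23]" || true)

# Class code plus allowlist counts only the NVSwitch
TIGHT=$(echo "$LSPCI_N" | grep -icE "0604: 10de:(2200|22a0|22b0|22c0|2320)" || true)

echo "broad=$BROAD tight=$TIGHT"   # broad=2 tight=1
```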
```shell
# NVSwitch detection via nvidia-smi topology (works if driver is loaded)
NVSWITCH_TOPO=""
if [ -z "$NVSWITCH_PCI" ] && [ -z "$NVSWITCH_DEV" ] && command -v nvidia-smi &>/dev/null; then
    NVSWITCH_TOPO=$(nvidia-smi topo -m 2>/dev/null | grep -i "nvswitch\|NV[0-9]" || true)
```

`nvidia-smi topo -m | grep "...|NV[0-9]"` will also match plain NVLink connections on non-NVSwitch systems (e.g., 2-GPU NVLink-bridged PCIe servers), which can incorrectly trigger Fabric Manager installation. The topology-based check should look for an NVSwitch-specific marker (e.g., NVSwitch/NVSW node names) rather than any NV# link type.

Suggested change:

```shell
    # Look specifically for NVSwitch nodes (e.g., "NVSwitch", "NVSW") rather than generic NVLink labels like "NV1"
    NVSWITCH_TOPO=$(nvidia-smi topo -m 2>/dev/null | grep -i "nvswitch\|nvsw" || true)
```
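The difference between the two patterns can be demonstrated on simulated topology cells (the rows below are illustrative stand-ins; real `nvidia-smi topo -m` output has more columns):

```shell
# Simulated topo row from a 2-GPU NVLink-bridge system (no NVSwitch)
NVLINK_ONLY="GPU0   X    NV4"
# Simulated row from an HGX system whose GPUs connect through NVSwitch
HGX_ROW="GPU0   X    NVSwitch"

# Broad pattern: any NV<number> link label counts -> false positive
echo "$NVLINK_ONLY" | grep -qi "nvswitch\|NV[0-9]" && echo "broad: match"

# Tighter pattern: only explicit NVSwitch markers count
echo "$NVLINK_ONLY" | grep -qi "nvswitch\|nvsw" || echo "tight: no match"
echo "$HGX_ROW"     | grep -qi "nvswitch\|nvsw" && echo "tight: match on HGX"
```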
```shell
READY_PODS=$(kubectl get pods -n "$LLM_D_NAMESPACE" -l app.kubernetes.io/name=llm-d -o jsonpath='{.items[*].status.containerStatuses[*].ready}' 2>/dev/null | tr ' ' '\n' | grep -c true 2>/dev/null) || READY_PODS=0
TOTAL_PODS=$(kubectl get pods -n "$LLM_D_NAMESPACE" -l app.kubernetes.io/name=llm-d --no-headers 2>/dev/null | wc -l | tr -d ' ') || TOTAL_PODS=0

if [ "$READY_PODS" -gt 0 ] && [ "$READY_PODS" -eq "$TOTAL_PODS" ]; then
    echo " All pods ready ($READY_PODS/$TOTAL_PODS)"
```

The readiness loop compares READY_PODS, computed from `.status.containerStatuses[*].ready` (which counts containers), against TOTAL_PODS, computed from `kubectl get pods` (which counts pods). If llm-d pods run multiple containers (e.g., sidecars), this condition may never become true even when all pods are Ready. Prefer counting pod Ready conditions (e.g., `.status.conditions[?(@.type=="Ready")].status`) or use `kubectl wait --for=condition=Ready pod -l ...` with a timeout.
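The mismatch can be reproduced without a cluster by simulating the jsonpath output for two 2-container pods (values illustrative):

```shell
# Simulated '.status.containerStatuses[*].ready' jsonpath output:
# 2 pods x 2 containers, all ready
READY_RAW="true true true true"

READY_PODS=$(echo "$READY_RAW" | tr ' ' '\n' | grep -c true) || READY_PODS=0
TOTAL_PODS=2   # pod count, as from 'kubectl get pods ... | wc -l'

# 4 container flags vs 2 pods: the equality never holds, the loop never exits
if [ "$READY_PODS" -eq "$TOTAL_PODS" ]; then
    echo "ready"
else
    echo "stuck: $READY_PODS containers vs $TOTAL_PODS pods"
fi
```

As the review suggests, `kubectl wait --for=condition=Ready pod -l app.kubernetes.io/name=llm-d -n "$LLM_D_NAMESPACE" --timeout=600s` sidesteps the counting entirely by checking each pod's Ready condition.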
Also, support llm-d partially.