Fix Fabric Manager detection failure on HGX GPU systems (A100/H100/H200)#2313
cb-github-robot merged 4 commits into cloud-barista:main
Conversation
…tput

When grep -c finds 0 matches, it outputs '0' and exits with code 1, so '|| echo 0' also outputs '0'. The variable becomes '0\n0', which fails the integer test and loops forever. Fix: move the fallback '|| VAR=0' outside the command substitution so only grep's stdout count is captured.
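The pitfall can be reproduced without any GPU hardware; the simulated lspci line below stands in for the real query and is purely illustrative:

```shell
# Simulated lspci output containing no NVIDIA lines
LSPCI_OUT="00:1f.0 ISA bridge: Intel Corporation Device"

# Buggy form: on zero matches, grep -c prints '0' AND exits 1,
# so the fallback fires too and the substitution captures "0\n0"
GPU_COUNT=$(echo "$LSPCI_OUT" | grep -ic "nvidia.*3d controller" || echo 0)
printf 'buggy: %s\n' "$GPU_COUNT"   # prints '0' on two lines

# Fixed form: the fallback assignment sits outside the substitution,
# so only grep's stdout count is captured
GPU_COUNT=$(echo "$LSPCI_OUT" | grep -ic "nvidia.*3d controller") || GPU_COUNT=0
printf 'fixed: %s\n' "$GPU_COUNT"   # prints a single clean '0'
```

With the buggy value, a later `[ "$GPU_COUNT" -eq 0 ]` fails with "integer expression expected", which is what kept the wait loop spinning.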
…fig reset

'containerd config default' overwrites any nvidia-ctk runtime settings that installCudaDriver.sh previously configured. This caused 'CUDA system not yet initialized' errors in GPU Operator validator pods because containerd had no nvidia runtime registered. Fix: after generating the default config, re-run 'nvidia-ctk runtime configure' if a GPU is detected and nvidia-ctk is available.
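The re-apply step looks roughly like the sketch below (paths and guards assumed from the scripts' context; this is a provisioning fragment that needs containerd and the NVIDIA Container Toolkit installed, not something runnable standalone):

```shell
# Regenerating defaults wipes any nvidia runtime entries added earlier
sudo containerd config default | sudo tee /etc/containerd/config.toml >/dev/null

# Re-register the NVIDIA runtime only when a GPU and nvidia-ctk are present
if lspci | grep -qi nvidia && command -v nvidia-ctk &>/dev/null; then
    sudo nvidia-ctk runtime configure --runtime=containerd --set-as-default
fi
sudo systemctl restart containerd
```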
1. Add --set-as-default to nvidia-ctk runtime configure (all 3 scripts). Without this flag, containerd's default_runtime_name stays 'runc', causing GPU Operator validator pods to fail with 'CUDA system not yet initialized' because they don't explicitly request the nvidia runtime.
2. Add the nvidia-ctk re-apply step in k8s-control-plane-setup.sh. Same fix already in k8s-worker-setup.sh: 'containerd config default' overwrites nvidia-ctk settings, so they need to be re-applied after the config reset.
3. Fix the grep -c fallback in the installCudaDriver.sh GPU_COUNT detection. grep -c with no match outputs '0' and exits 1, causing || echo '0' to also output '0', resulting in GPU_COUNT='0\n0', which breaks the integer comparison. Move the fallback outside the $() substitution.

Affected scripts:
- scripts/usecases/k8s/k8s-worker-setup.sh
- scripts/usecases/k8s/k8s-control-plane-setup.sh
- scripts/usecases/llm/installCudaDriver.sh
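For point 1, the effect of --set-as-default can be checked against the key it writes into containerd's config; the TOML fragment below is illustrative of that one key, not the full generated file:

```shell
# Illustrative fragment of /etc/containerd/config.toml after
# 'nvidia-ctk runtime configure --runtime=containerd --set-as-default'
CONFIG_SAMPLE='[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"'

# Without the flag this key stays "runc", and pods that do not set a
# runtimeClassName (like the GPU Operator validator) never see the GPU
DEFAULT_RT=$(echo "$CONFIG_SAMPLE" | grep default_runtime_name | sed 's/.*"\(.*\)".*/\1/')
echo "$DEFAULT_RT"   # nvidia
```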
Root cause: all 3 NVSwitch/multi-GPU detection methods failed:
1. GPU_COUNT regex was inverted: pattern 'nvidia.*3d controller' expects
'nvidia' before '3d controller', but lspci format is:
'00:1e.0 3D controller: NVIDIA Corporation A100-SXM4-40GB'
Class comes BEFORE vendor, so the pattern never matched datacenter
GPUs (3D controller class). GPU_COUNT was always 0.
Fix: two-stage grep (grep nvidia | grep -c '3d controller|vga')
2. NVSwitch PCI detection: plain lspci shows NVSwitch as
'Bridge: NVIDIA Corporation Device 2200' without 'nvswitch' text.
Fix: fallback to lspci -n with known NVSwitch PCI device IDs.
3. /dev/nvidia-nvswitch* files don't exist before first driver load
(typically requires reboot after package install).
Fix: add nvidia-smi topology as additional detection method.
Also improved DRIVER_MAJOR detection for Fabric Manager version matching:
dpkg pattern may miss drivers installed via cuda-drivers metapackage,
now falls back to nvidia-smi --query-gpu=driver_version.
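The inverted-pattern bug from point 1 can be reproduced with a simulated lspci line (the A100 string below mirrors the format quoted above; no hardware needed):

```shell
# Class comes BEFORE vendor in lspci output
LINE="00:1e.0 3D controller: NVIDIA Corporation A100-SXM4-40GB"

# Old pattern expects 'nvidia' before '3d controller' -> never matches
OLD=$(echo "$LINE" | grep -ic "nvidia.*3d controller" || true)

# Two-stage fix: filter by vendor first, then count by device class
NEW=$(echo "$LINE" | grep -i "nvidia" | grep -icE "3d controller|vga" || true)

echo "old=$OLD new=$NEW"   # old=0 new=1
```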
/approve
Pull request overview
This PR improves automation scripts for HGX-class GPU systems by making NVSwitch/Fabric Manager detection more robust, and adjusts NVIDIA container runtime configuration to better support GPU Operator and llm-d deployments.
Changes:
- Enhance the NVSwitch detection logic and add a fallback method to derive the Fabric Manager package version from nvidia-smi.
- Configure containerd's NVIDIA runtime with --set-as-default and re-apply the runtime config after regenerating containerd defaults in the K8s setup scripts.
- Adjust the llm-d deployment readiness counting logic in the wait loop.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| scripts/usecases/llm/installCudaDriver.sh | Adds additional NVSwitch detection paths and improves Fabric Manager version detection; updates containerd runtime configuration. |
| scripts/usecases/llm/deploy-llm-d.sh | Modifies the pod readiness loop logic for llm-d deployments. |
| scripts/usecases/k8s/k8s-worker-setup.sh | Re-applies NVIDIA runtime configuration after containerd default config regeneration on workers. |
| scripts/usecases/k8s/k8s-control-plane-setup.sh | Re-applies NVIDIA runtime configuration after containerd default config regeneration on control planes. |
```shell
# They appear as "Bridge" class, not "3D controller" or "VGA"
NVSWITCH_PCI=$(sudo lspci -n 2>/dev/null | grep -i "10de:2[23]" || true)
```

The fallback NVSwitch PCI detection `lspci -n | grep "10de:2[23]"` is too broad and can match regular NVIDIA GPU device IDs as well (since it doesn't check the PCI class code or exact NVSwitch device IDs). This can cause false NVSwitch detection and unnecessary Fabric Manager installation. Consider filtering by bridge class (e.g., class code 0604) and/or matching a tighter allowlist of known NVSwitch device IDs rather than a prefix regex.

Suggested change:

```shell
# They appear as "Bridge" class (PCI class code 0604), not "3D controller" or "VGA"
NVSWITCH_PCI=$(sudo lspci -n 2>/dev/null | grep -i -E "0604: 10de:(2200|22a0|22b0|22c0|2320)" || true)
```
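The false-positive risk the reviewer describes can be shown on simulated `lspci -n` output; the H100 device ID 10de:2330 below is an assumption for illustration, and the NVSwitch ID comes from the allowlist above:

```shell
# Simulated 'lspci -n' lines: an H100 GPU (class 0302) and an
# NVSwitch (bridge class 0604); device IDs assumed for illustration
LSPCI_N="00:1e.0 0302: 10de:2330
00:05.0 0604: 10de:22a0"

# Broad prefix match hits the GPU line too (10de:23.. matches 10de:2[23])
BROAD=$(echo "$LSPCI_N" | grep -ic "10de:2[23]" || true)

# Class code plus allowlist counts only the NVSwitch
TIGHT=$(echo "$LSPCI_N" | grep -icE "0604: 10de:(2200|22a0|22b0|22c0|2320)" || true)

echo "broad=$BROAD tight=$TIGHT"   # broad=2 tight=1
```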
```shell
# NVSwitch detection via nvidia-smi topology (works if driver is loaded)
NVSWITCH_TOPO=""
if [ -z "$NVSWITCH_PCI" ] && [ -z "$NVSWITCH_DEV" ] && command -v nvidia-smi &>/dev/null; then
    NVSWITCH_TOPO=$(nvidia-smi topo -m 2>/dev/null | grep -i "nvswitch\|NV[0-9]" || true)
```

`nvidia-smi topo -m | grep "...|NV[0-9]"` will also match plain NVLink connections on non-NVSwitch systems (e.g., 2-GPU NVLink-bridged PCIe servers), which can incorrectly trigger Fabric Manager installation. The topology-based check should look for an NVSwitch-specific marker (e.g., NVSwitch/NVSW node names) rather than any NV# link type.

Suggested change:

```shell
    # Look specifically for NVSwitch nodes (e.g., "NVSwitch", "NVSW") rather than generic NVLink labels like "NV1"
    NVSWITCH_TOPO=$(nvidia-smi topo -m 2>/dev/null | grep -i "nvswitch\|nvsw" || true)
```
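The difference between the two patterns can be demonstrated on simulated topology cells (the rows below are illustrative stand-ins; real `nvidia-smi topo -m` output has more columns):

```shell
# Simulated topo row from a 2-GPU NVLink-bridge system (no NVSwitch)
NVLINK_ONLY="GPU0   X    NV4"
# Simulated row from an HGX system whose GPUs connect through NVSwitch
HGX_ROW="GPU0   X    NVSwitch"

# Broad pattern: any NV<number> link label counts -> false positive
echo "$NVLINK_ONLY" | grep -qi "nvswitch\|NV[0-9]" && echo "broad: match"

# Tighter pattern: only explicit NVSwitch markers count
echo "$NVLINK_ONLY" | grep -qi "nvswitch\|nvsw" || echo "tight: no match"
echo "$HGX_ROW"     | grep -qi "nvswitch\|nvsw" && echo "tight: match on HGX"
```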
```shell
READY_PODS=$(kubectl get pods -n "$LLM_D_NAMESPACE" -l app.kubernetes.io/name=llm-d -o jsonpath='{.items[*].status.containerStatuses[*].ready}' 2>/dev/null | tr ' ' '\n' | grep -c true 2>/dev/null) || READY_PODS=0
TOTAL_PODS=$(kubectl get pods -n "$LLM_D_NAMESPACE" -l app.kubernetes.io/name=llm-d --no-headers 2>/dev/null | wc -l | tr -d ' ') || TOTAL_PODS=0

if [ "$READY_PODS" -gt 0 ] && [ "$READY_PODS" -eq "$TOTAL_PODS" ]; then
    echo " All pods ready ($READY_PODS/$TOTAL_PODS)"
```

The readiness loop compares READY_PODS, computed from `.status.containerStatuses[*].ready` (which counts containers), against TOTAL_PODS, computed from `kubectl get pods` (which counts pods). If llm-d pods run multiple containers (e.g., sidecars), this condition may never become true even when all pods are Ready. Prefer counting pod Ready conditions (e.g., `.status.conditions[?(@.type=="Ready")].status`) or use `kubectl wait --for=condition=Ready pod -l ...` with a timeout.
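The mismatch can be reproduced without a cluster by simulating the jsonpath output for two 2-container pods (values illustrative):

```shell
# Simulated '.status.containerStatuses[*].ready' jsonpath output:
# 2 pods x 2 containers, all ready
READY_RAW="true true true true"

READY_PODS=$(echo "$READY_RAW" | tr ' ' '\n' | grep -c true) || READY_PODS=0
TOTAL_PODS=2   # pod count, as from 'kubectl get pods ... | wc -l'

# 4 container flags vs 2 pods: the equality never holds, the loop never exits
if [ "$READY_PODS" -eq "$TOTAL_PODS" ]; then
    echo "ready"
else
    echo "stuck: $READY_PODS containers vs $TOTAL_PODS pods"
fi
```

As the review suggests, `kubectl wait --for=condition=Ready pod -l app.kubernetes.io/name=llm-d -n "$LLM_D_NAMESPACE" --timeout=600s` sidesteps the counting entirely by checking each pod's Ready condition.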
Also, support llm-d partially.