
Fix Fabric Manager detection failure on HGX GPU systems (A100/H100/H200)#2313

Merged
cb-github-robot merged 4 commits into cloud-barista:main from seokho-son:main
Feb 6, 2026

Conversation

@seokho-son
Member

Fix Fabric Manager detection failure on HGX GPU systems (A100/H100/H200)
also, support llm-d partially

…tput

When grep -c finds 0 matches, it outputs '0' AND exits with code 1,
so '|| echo 0' emits a second '0'. The variable becomes '0\n0',
which fails the integer test and loops forever.

Fix: move fallback '|| VAR=0' outside the command substitution
so grep's stdout count is captured cleanly.
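The failure mode can be reproduced in isolation. A minimal sketch (the variable names and input are illustrative, not taken from installCudaDriver.sh):

```shell
# grep -c prints a count of 0 AND exits with status 1 when nothing
# matches, so the old in-substitution fallback emitted a second '0':
BROKEN=$(printf 'no gpus here\n' | grep -ci 'nvidia' || echo 0)
# BROKEN now holds '0<newline>0', which is not a valid integer:
[ "$BROKEN" -gt 0 ] 2>/dev/null || echo "integer test fails"

# Fixed form: the substitution captures grep's own '0' cleanly, and the
# fallback assignment sits OUTSIDE the command substitution:
FIXED=$(printf 'no gpus here\n' | grep -ci 'nvidia') || FIXED=0
echo "FIXED=$FIXED"
```

Here the `||` still fires in the fixed form (grep exits 1), but it only reassigns the same clean '0' instead of appending to captured stdout.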
…fig reset

containerd config default overwrites any nvidia-ctk runtime settings
that installCudaDriver.sh previously configured. This caused
'CUDA system not yet initialized' errors in GPU Operator validator
pods because containerd had no nvidia runtime registered.

Fix: after generating default config, re-run nvidia-ctk runtime
configure if GPU is detected and nvidia-ctk is available.
1. Add --set-as-default to nvidia-ctk runtime configure (all 3 scripts)
   Without this flag, containerd default_runtime_name stays 'runc', causing
   GPU Operator validator pods to fail with 'CUDA system not yet initialized'
   because they don't explicitly request nvidia runtime.

2. Add nvidia-ctk re-apply in k8s-control-plane-setup.sh
   Same fix as already applied in k8s-worker-setup.sh: containerd config
   default overwrites the nvidia-ctk settings, so they must be re-applied
   after the config reset.

3. Fix grep -c fallback in installCudaDriver.sh GPU_COUNT detection
   grep -c with no match outputs '0' and exits 1, causing || echo '0' to
   also output '0', resulting in GPU_COUNT='0\n0' which breaks integer
   comparison. Move fallback outside $() substitution.
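The ordering that fixes 1 and 2 rely on can be sketched as follows (the paths and the GPU-presence check are assumptions for illustration; the actual scripts may differ):

```shell
# containerd config default must run FIRST, since it regenerates the
# config from scratch and discards any previously registered runtimes.
sudo mkdir -p /etc/containerd
containerd config default | sudo tee /etc/containerd/config.toml >/dev/null

# Re-register the nvidia runtime afterwards, and make it the default so
# that pods (e.g. GPU Operator validators) that don't explicitly request
# the nvidia runtime still get it.
if lspci 2>/dev/null | grep -qi nvidia && command -v nvidia-ctk >/dev/null 2>&1; then
    sudo nvidia-ctk runtime configure --runtime=containerd --set-as-default
fi

sudo systemctl restart containerd
```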

Affected scripts:
- scripts/usecases/k8s/k8s-worker-setup.sh
- scripts/usecases/k8s/k8s-control-plane-setup.sh
- scripts/usecases/llm/installCudaDriver.sh
Root cause: all 3 NVSwitch/multi-GPU detection methods failed:

1. GPU_COUNT regex was inverted: pattern 'nvidia.*3d controller' expects
   'nvidia' before '3d controller', but lspci format is:
     '00:1e.0 3D controller: NVIDIA Corporation A100-SXM4-40GB'
   Class comes BEFORE vendor, so the pattern never matched datacenter
   GPUs (3D controller class). GPU_COUNT was always 0.
   Fix: two-stage grep (grep nvidia | grep -c '3d controller|vga')

2. NVSwitch PCI detection: plain lspci shows NVSwitch as
   'Bridge: NVIDIA Corporation Device 2200' without 'nvswitch' text.
   Fix: fallback to lspci -n with known NVSwitch PCI device IDs.

3. /dev/nvidia-nvswitch* files don't exist before first driver load
   (typically requires reboot after package install).
   Fix: add nvidia-smi topology as additional detection method.

Also improved DRIVER_MAJOR detection for Fabric Manager version matching:
dpkg pattern may miss drivers installed via cuda-drivers metapackage,
now falls back to nvidia-smi --query-gpu=driver_version.
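Put together, the layered detection might look like this sketch (the PCI device IDs and exact patterns are illustrative, not copied verbatim from the script):

```shell
# 1. Two-stage GPU count: lspci prints the class BEFORE the vendor
#    ('00:1e.0 3D controller: NVIDIA Corporation A100-SXM4-40GB'),
#    so filter by vendor first, then count by class.
GPU_COUNT=$(lspci 2>/dev/null | grep -i nvidia | grep -ciE '3d controller|vga') || GPU_COUNT=0

# 2. NVSwitch by numeric PCI ID: plain lspci shows only
#    'Bridge: NVIDIA Corporation Device 2200', with no 'nvswitch' text.
NVSWITCH_PCI=$(lspci -n 2>/dev/null | grep -iE '10de:(2200|22a0)' || true)

# 3. /dev/nvidia-nvswitch* only appears once the driver has loaded, so
#    fall back to the driver's own topology view when it is available.
NVSWITCH_TOPO=""
if [ -z "$NVSWITCH_PCI" ] && command -v nvidia-smi >/dev/null 2>&1; then
    NVSWITCH_TOPO=$(nvidia-smi topo -m 2>/dev/null | grep -i nvswitch || true)
fi

# 4. Driver major version: dpkg queries can miss drivers pulled in via
#    the cuda-drivers metapackage, so ask the running driver directly.
DRIVER_MAJOR=""
if command -v nvidia-smi >/dev/null 2>&1; then
    DRIVER_MAJOR=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null | head -1 | cut -d. -f1)
fi
```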
Copilot AI review requested due to automatic review settings February 6, 2026 13:46
@seokho-son
Member Author

/approve

@github-actions github-actions bot added the approved This PR is approved and will be merged soon. label Feb 6, 2026
@cb-github-robot cb-github-robot merged commit c568e75 into cloud-barista:main Feb 6, 2026
5 of 6 checks passed
Contributor

Copilot AI left a comment


Pull request overview

This PR improves automation scripts for HGX-class GPU systems by making NVSwitch/Fabric Manager detection more robust, and adjusts NVIDIA container runtime configuration to better support GPU Operator and llm-d deployments.

Changes:

  • Enhance NVSwitch detection logic and add a fallback method to derive Fabric Manager package version from nvidia-smi.
  • Configure containerd’s NVIDIA runtime with --set-as-default and re-apply runtime config after regenerating containerd defaults in K8s setup scripts.
  • Adjust llm-d deployment readiness counting logic in the wait loop.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File / Description:

  • scripts/usecases/llm/installCudaDriver.sh: Adds additional NVSwitch detection paths and improves Fabric Manager version detection; updates containerd runtime configuration.
  • scripts/usecases/llm/deploy-llm-d.sh: Modifies the pod readiness loop logic for llm-d deployments.
  • scripts/usecases/k8s/k8s-worker-setup.sh: Re-applies NVIDIA runtime configuration after containerd default config regeneration on workers.
  • scripts/usecases/k8s/k8s-control-plane-setup.sh: Re-applies NVIDIA runtime configuration after containerd default config regeneration on control planes.


Comment on lines +456 to +457
# They appear as "Bridge" class, not "3D controller" or "VGA"
NVSWITCH_PCI=$(sudo lspci -n 2>/dev/null | grep -i "10de:2[23]" || true)

Copilot AI Feb 6, 2026


The fallback NVSwitch PCI detection lspci -n | grep "10de:2[23]" is too broad and can match regular NVIDIA GPU device IDs as well (since it doesn’t check PCI class code or exact NVSwitch device IDs). This can cause false NVSwitch detection and unnecessary Fabric Manager installation. Consider filtering by bridge class (e.g., class code 0604) and/or matching a tighter allowlist of known NVSwitch device IDs rather than a prefix regex.

Suggested change
# They appear as "Bridge" class, not "3D controller" or "VGA"
NVSWITCH_PCI=$(sudo lspci -n 2>/dev/null | grep -i "10de:2[23]" || true)
# They appear as "Bridge" class (PCI class code 0604), not "3D controller" or "VGA"
NVSWITCH_PCI=$(sudo lspci -n 2>/dev/null | grep -i -E "0604: 10de:(2200|22a0|22b0|22c0|2320)" || true)

# NVSwitch detection via nvidia-smi topology (works if driver is loaded)
NVSWITCH_TOPO=""
if [ -z "$NVSWITCH_PCI" ] && [ -z "$NVSWITCH_DEV" ] && command -v nvidia-smi &>/dev/null; then
NVSWITCH_TOPO=$(nvidia-smi topo -m 2>/dev/null | grep -i "nvswitch\|NV[0-9]" || true)

Copilot AI Feb 6, 2026


nvidia-smi topo -m | grep "...|NV[0-9]" will also match plain NVLink connections on non-NVSwitch systems (e.g., 2-GPU NVLink bridged PCIe servers), which can incorrectly trigger Fabric Manager installation. The topology-based check should look for an NVSwitch-specific marker (e.g., NVSwitch/NVSW node names) rather than any NV# link type.

Suggested change
NVSWITCH_TOPO=$(nvidia-smi topo -m 2>/dev/null | grep -i "nvswitch\|NV[0-9]" || true)
# Look specifically for NVSwitch nodes (e.g., "NVSwitch", "NVSW") rather than generic NVLink labels like "NV1"
NVSWITCH_TOPO=$(nvidia-smi topo -m 2>/dev/null | grep -i "nvswitch\|nvsw" || true)

Comment on lines +243 to 247
READY_PODS=$(kubectl get pods -n "$LLM_D_NAMESPACE" -l app.kubernetes.io/name=llm-d -o jsonpath='{.items[*].status.containerStatuses[*].ready}' 2>/dev/null | tr ' ' '\n' | grep -c true 2>/dev/null) || READY_PODS=0
TOTAL_PODS=$(kubectl get pods -n "$LLM_D_NAMESPACE" -l app.kubernetes.io/name=llm-d --no-headers 2>/dev/null | wc -l | tr -d ' ') || TOTAL_PODS=0

if [ "$READY_PODS" -gt 0 ] && [ "$READY_PODS" -eq "$TOTAL_PODS" ]; then
echo " All pods ready ($READY_PODS/$TOTAL_PODS)"

Copilot AI Feb 6, 2026


The readiness loop compares READY_PODS computed from .status.containerStatuses[*].ready (counts containers) against TOTAL_PODS computed from kubectl get pods (counts pods). If llm-d pods run multiple containers (e.g., sidecars), this condition may never become true even when all pods are Ready. Prefer counting pod Ready conditions (e.g., .status.conditions[?(@.type=="Ready")].status) or use kubectl wait --for=condition=Ready pod -l ... with a timeout.

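One way to act on this suggestion (a sketch, not the PR's actual change; the timeout is an assumed value, and LLM_D_NAMESPACE comes from the surrounding script):

```shell
# Wait on the pod-level Ready condition directly instead of comparing a
# per-container ready count against a pod count.
kubectl wait pod \
  -n "$LLM_D_NAMESPACE" \
  -l app.kubernetes.io/name=llm-d \
  --for=condition=Ready \
  --timeout=600s
```

This sidesteps the container/pod mismatch entirely, since kubectl evaluates each pod's aggregated Ready condition.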
Labels

approved This PR is approved and will be merged soon.

3 participants