Fix dependency related bug in nvidia gpu driver setup #2312

Merged
cb-github-robot merged 3 commits into cloud-barista:main from seokho-son:main
Feb 6, 2026

Conversation

@seokho-son
Member

Fix dependency related bug in nvidia gpu driver setup

It now supports various GPU models.

Signed-off-by: Seokho Son <shsongist@gmail.com>
…mpatibility

Blackwell architecture GPUs (e.g., RTX PRO SERVER 6000, PCI 2bb5) require
open kernel modules (nvidia-open) instead of proprietary (nvidia-dkms).
The proprietary driver installs successfully but nvidia-smi returns
'No devices were found' on Blackwell GPUs.

Switch cuda-drivers -> cuda-drivers-open and nvidia-driver-550 ->
nvidia-driver-550-open. Open modules work on all Turing+ (2018+) GPUs,
covering all modern cloud instances (T4, A10G, A100, L4, H100, B200, etc.).

On multi-GPU HGX systems (A100/H100/H200 SXM with NVSwitch), the Fabric Manager
version MUST match the installed driver major version. Previously the script
installed the unversioned 'nvidia-fabricmanager' meta-package, which could pull
a different version than the driver (especially in the fallback path with
nvidia-driver-550-open), causing a 'Version mismatch between FM and driver'
error and leaving only GPU 0 accessible.

Now extracts the installed driver major version (e.g., 570) from dpkg and
installs nvidia-fabricmanager-570 to guarantee version alignment.
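
The version alignment described above could be sketched roughly as follows. This is not the script's actual code: the helper name `driver_major_from_version`, the `DRIVER_PKG` variable, and the choice of which package to query are assumptions for illustration. `dpkg-query -W -f='${Version}'` is the standard way to print an installed package's version.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reduce a Debian package version string to its major number.
driver_major_from_version() {
    local ver="$1"
    ver="${ver#*:}"             # drop an epoch prefix such as "1:" if present
    printf '%s\n' "${ver%%.*}"  # keep everything before the first dot
}

# Sample version string (illustrative value, not taken from the PR).
driver_major_from_version "570.86.15-0ubuntu1"   # prints 570

# Intended use on a real host (commented out: needs an installed driver and sudo).
# DRIVER_PKG is an assumed variable holding the installed driver package name.
# ver=$(dpkg-query -W -f='${Version}' "$DRIVER_PKG")
# sudo apt-get install -y "nvidia-fabricmanager-$(driver_major_from_version "$ver")"
```

Deriving the suffix from the installed version, rather than hardcoding it, keeps the two packages aligned on both the primary and the fallback install path.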

Also enables nvidia-persistenced on multi-GPU systems to avoid ~2s cold-start
GPU initialization latency.
Copilot AI review requested due to automatic review settings February 6, 2026 09:51
@seokho-son
Member Author

/approve

@github-actions github-actions bot added the "approved" label (This PR is approved and will be merged soon.) Feb 6, 2026
@cb-github-robot cb-github-robot merged commit 6295c7c into cloud-barista:main Feb 6, 2026
6 checks passed
Contributor

Copilot AI left a comment

Pull request overview

This pull request fixes dependency-related bugs in the NVIDIA GPU driver installation script and adds support for various GPU models. The changes improve reliability, error handling, and platform compatibility for automated driver installations on Ubuntu 22.04/24.04 systems.

Changes:

  • Refactored broken package detection and cleanup to handle failed NVIDIA/CUDA installations more robustly
  • Added comprehensive DKMS prerequisites installation including GCC version matching for HWE kernels
  • Switched from cuda-drivers to cuda-drivers-open with fallback to specific driver version
  • Enhanced architecture support (ARM64/x86_64) and improved Ubuntu version detection using /etc/os-release
  • Improved Fabric Manager installation with automatic driver version matching and enhanced multi-GPU detection
  • Added better error handling with set -o pipefail and PIPESTATUS checks throughout
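
The last item can be illustrated with a short sketch. It assumes bash (`PIPESTATUS` is a bash array), and `run_tailed` is a hypothetical helper name, not a function from the script. The point is that piping an installer through `tail` would otherwise report `tail`'s exit status instead of the installer's.

```bash
#!/usr/bin/env bash
set -o pipefail   # a pipeline now fails if any stage fails, not only the last

# Hypothetical helper: run a command, show only its last 5 lines of output,
# and return the command's own exit status rather than tail's.
run_tailed() {
    "$@" 2>&1 | tail -n 5
    return "${PIPESTATUS[0]}"
}

# Demo with a stand-in command that prints 10 lines and then fails with status 3.
if run_tailed sh -c 'seq 1 10; exit 3'; then
    echo "command succeeded"
else
    echo "command failed with status $?"
fi
```

In the demo, only lines 6 to 10 are shown and the failure status 3 is still visible to the caller.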

echo " Cleaning up leftover packages from failed attempt..."
LEFTOVER_PKGS=$(dpkg -l 2>/dev/null | grep -E "nvidia|cuda|libnvidia" | grep -v "^ii " | grep -v "^un " | awk '{print $2}' || true)
if [ -n "$LEFTOVER_PKGS" ]; then
sudo dpkg --force-all --purge $LEFTOVER_PKGS 2>&1 | tail -5 || true

Copilot AI Feb 6, 2026

Missing quotes around $LEFTOVER_PKGS variable. If package names contain spaces or special characters, this will cause word splitting and glob expansion. The variable should be quoted to prevent these issues.

Copilot uses AI. Check for mistakes.
sudo dpkg --configure -a 2>/dev/null || true
fi
# Also remove any version-less nvidia libs that were auto-installed
VERSIONLESS_PKGS=$(dpkg -l 2>/dev/null | grep "^ii" | awk '{print $2}' | grep -E "^libnvidia-|^nvidia-" | grep -v -- "-[0-9]" || true)

Copilot AI Feb 6, 2026

The regex pattern 'grep -v -- "-[0-9]"' to filter out version-suffixed packages is overly broad. It will exclude any package with a hyphen followed by any digit anywhere in the name, not just version numbers at the end. For example, it would exclude "nvidia-headless-450-server" which has "-450-" in the middle. Consider using a more specific pattern like 'grep -v -- "-[0-9]\+$"' (or 'grep -Ev -- "-[0-9]+$"') to match only version numbers at the end of the package name.

Suggested change
VERSIONLESS_PKGS=$(dpkg -l 2>/dev/null | grep "^ii" | awk '{print $2}' | grep -E "^libnvidia-|^nvidia-" | grep -v -- "-[0-9]" || true)
VERSIONLESS_PKGS=$(dpkg -l 2>/dev/null | grep "^ii" | awk '{print $2}' | grep -E "^libnvidia-|^nvidia-" | grep -v -- "-[0-9]\+$" || true)
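
To see what each pattern actually keeps, here is a small comparison on sample package names. The names are illustrative, the three function names are hypothetical, and `keep_segment` is an assumed third option that is not proposed anywhere in this review.

```bash
#!/usr/bin/env bash
# Sample package names (illustrative only).
names='nvidia-utils
libnvidia-common
nvidia-driver-550-open
nvidia-headless-450-server
libnvidia-compute-550'

# Original filter: drop any name containing "-<digit>" anywhere.
keep_broad()    { grep -v  -- '-[0-9]'; }
# Anchored filter: drop only names that end in "-<digits>".
keep_anchored() { grep -Ev -- '-[0-9]+$'; }
# Assumed alternative: drop names with a whole "-<digits>" segment, mid-name or at the end.
keep_segment()  { grep -Ev -- '-[0-9]+(-|$)'; }

for f in keep_broad keep_anchored keep_segment; do
    printf '%s: %s\n' "$f" "$(printf '%s\n' "$names" | "$f" | paste -sd ' ' -)"
done
```

With the anchored pattern, `nvidia-driver-550-open` and `nvidia-headless-450-server` are kept in the "version-less" list and would therefore become purge candidates. Whether that is desirable depends on the script's intent, so the trade-off is worth checking before adopting either pattern.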

echo "=========================================="
echo "Setting up NVIDIA CUDA Repository..."
echo "=========================================="
echo "\n========== NVIDIA CUDA Repository =========="

Copilot AI Feb 6, 2026

The echo command with "\n" does not produce a newline in bash unless the -e flag is used. In the default echo behavior, "\n" is printed literally as two characters (backslash and n). Use either 'echo -e "\n========== NVIDIA CUDA Repository =========="' or simply 'echo ""' followed by 'echo "========== NVIDIA CUDA Repository =========="'.

Suggested change
echo "\n========== NVIDIA CUDA Repository =========="
echo ""
echo "========== NVIDIA CUDA Repository =========="
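
Another option, shown here only as a sketch, is `printf`, which interprets `\n` in its format string the same way in bash and in POSIX `sh`. The `banner` function name is hypothetical and does not appear in the script.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print a blank line followed by a section banner.
# printf interprets \n in the format string regardless of echo's behavior.
banner() {
    printf '\n========== %s ==========\n' "$1"
}

banner "NVIDIA CUDA Repository"
```

A single helper would also keep the section headers for the driver, toolkit, and container toolkit steps consistent.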

echo "=========================================="
echo "Installing NVIDIA Container Toolkit..."
echo "=========================================="
echo "\n========== NVIDIA Container Toolkit =========="

Copilot AI Feb 6, 2026

The echo command with "\n" does not produce a newline in bash unless the -e flag is used. In the default echo behavior, "\n" is printed literally as two characters (backslash and n). Use either 'echo -e "\n========== NVIDIA Container Toolkit =========="' or simply 'echo ""' followed by 'echo "========== NVIDIA Container Toolkit =========="'.

Suggested change
echo "\n========== NVIDIA Container Toolkit =========="
echo ""
echo "========== NVIDIA Container Toolkit =========="

Comment on lines +334 to +342
sudo dpkg --force-all --purge $LEFTOVER_PKGS 2>&1 | tail -5 || true
sudo rm -rf /var/lib/dkms/nvidia 2>/dev/null || true
sudo dpkg --configure -a 2>/dev/null || true
fi
# Also remove any version-less nvidia libs that were auto-installed
VERSIONLESS_PKGS=$(dpkg -l 2>/dev/null | grep "^ii" | awk '{print $2}' | grep -E "^libnvidia-|^nvidia-" | grep -v -- "-[0-9]" || true)
if [ -n "$VERSIONLESS_PKGS" ]; then
echo " Removing version-less nvidia packages: $(echo $VERSIONLESS_PKGS | tr '\n' ' ')"
sudo dpkg --force-all --purge $VERSIONLESS_PKGS 2>&1 | tail -3 || true

Copilot AI Feb 6, 2026

Missing quotes around $VERSIONLESS_PKGS variable. If package names contain spaces or special characters, this will cause word splitting and glob expansion. The variable should be quoted to prevent these issues.

Suggested change
sudo dpkg --force-all --purge $LEFTOVER_PKGS 2>&1 | tail -5 || true
sudo rm -rf /var/lib/dkms/nvidia 2>/dev/null || true
sudo dpkg --configure -a 2>/dev/null || true
fi
# Also remove any version-less nvidia libs that were auto-installed
VERSIONLESS_PKGS=$(dpkg -l 2>/dev/null | grep "^ii" | awk '{print $2}' | grep -E "^libnvidia-|^nvidia-" | grep -v -- "-[0-9]" || true)
if [ -n "$VERSIONLESS_PKGS" ]; then
echo " Removing version-less nvidia packages: $(echo $VERSIONLESS_PKGS | tr '\n' ' ')"
sudo dpkg --force-all --purge $VERSIONLESS_PKGS 2>&1 | tail -3 || true
sudo dpkg --force-all --purge "$LEFTOVER_PKGS" 2>&1 | tail -5 || true
sudo rm -rf /var/lib/dkms/nvidia 2>/dev/null || true
sudo dpkg --configure -a 2>/dev/null || true
fi
# Also remove any version-less nvidia libs that were auto-installed
VERSIONLESS_PKGS=$(dpkg -l 2>/dev/null | grep "^ii" | awk '{print $2}' | grep -E "^libnvidia-|^nvidia-" | grep -v -- "-[0-9]" || true)
if [ -n "$VERSIONLESS_PKGS" ]; then
echo " Removing version-less nvidia packages: $(echo "$VERSIONLESS_PKGS" | tr '\n' ' ')"
sudo dpkg --force-all --purge "$VERSIONLESS_PKGS" 2>&1 | tail -3 || true
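
The effect of quoting can be checked directly. This is a minimal sketch, assuming the package list is newline-separated as `awk '{print $2}'` produces it. Double-quoting the whole variable hands the command a single argument containing every name, whereas a bash array built with `mapfile` (bash 4+) passes one argument per package without glob expansion. `count_args` is a hypothetical stand-in for `sudo dpkg --force-all --purge`, and the package names are sample data.

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for dpkg: report how many arguments it received.
count_args() { echo "$#"; }

# Sample newline-separated package list (illustrative names).
PKGS=$'libnvidia-compute-550\nnvidia-dkms-550\nnvidia-utils-550'

# Quoting the whole variable: all names arrive as ONE argument.
quoted=$(count_args "$PKGS")

# Splitting into an array first: one argument per package, no globbing.
mapfile -t PKG_ARRAY <<< "$PKGS"
as_array=$(count_args "${PKG_ARRAY[@]}")

echo "quoted=$quoted array=$as_array"   # prints: quoted=1 array=3
```

Note that `mapfile` on an empty string yields one empty element, so the existing `[ -n "$VERSIONLESS_PKGS" ]` guard should stay in place before the array is built.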

Comment on lines +55 to +62
for i in $(seq 1 12); do
if ! sudo fuser /var/lib/dpkg/lock-frontend >/dev/null 2>&1 && \
! sudo fuser /var/lib/apt/lists/lock >/dev/null 2>&1; then
break
fi
echo " Waiting for apt lock... (${i}/12)"
sleep 5
done

Copilot AI Feb 6, 2026

The wait loop checks only two lock files (dpkg/lock-frontend and apt/lists/lock) but line 65 attempts to remove four lock files. The loop should also check /var/cache/apt/archives/lock for consistency, or line 65 should only remove the locks that were checked. The previous implementation checked all three locks that matter (the old code checked /var/cache/apt/archives/lock as well).
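
One way to keep the checked locks and the handled locks in sync is to drive the wait loop from a single list. The sketch below is an assumption about how that could look, not the script's code: `APT_LOCKS`, `lock_is_held`, and `wait_for_locks` are hypothetical names, and the list contains only the three paths mentioned in this thread.

```bash
#!/usr/bin/env bash
# Locks to wait on. The first two come from the loop above; the third is the one
# this comment mentions. Treat the list as an assumption to adjust.
APT_LOCKS=(
    /var/lib/dpkg/lock-frontend
    /var/lib/apt/lists/lock
    /var/cache/apt/archives/lock
)

# True if some process holds the file open (fuser exits 0 when it finds one).
lock_is_held() { sudo fuser "$1" >/dev/null 2>&1; }

# wait_for_locks TRIES DELAY FILE... : return 0 once no file is held, 1 on timeout.
wait_for_locks() {
    local tries="$1" delay="$2" i f busy
    shift 2
    for ((i = 1; i <= tries; i++)); do
        busy=0
        for f in "$@"; do
            if lock_is_held "$f"; then busy=1; break; fi
        done
        [ "$busy" -eq 0 ] && return 0
        echo "  Waiting for apt lock... (${i}/${tries})"
        sleep "$delay"
    done
    return 1
}

# Usage on a real host (commented out because it needs sudo and fuser):
# wait_for_locks 12 5 "${APT_LOCKS[@]}" || echo "apt locks still held" >&2
```

Returning a non-zero status on timeout lets the caller decide whether to abort or to proceed with lock cleanup, instead of falling through silently.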

fi

echo ""
echo "Attempting fallback: install nvidia-driver-550-open..."

Copilot AI Feb 6, 2026

The fallback hardcodes nvidia-driver-550-open. This version may become outdated as newer driver versions are released. Consider dynamically determining the latest available version from the repository, or at minimum add a comment noting that this version should be updated periodically. Alternatively, you could attempt to extract the recommended version from cuda-drivers-open package metadata before falling back.
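
A rough sketch of the dynamic approach, under the assumption that the configured repositories expose packages named `nvidia-driver-<N>-open`. The function name `pick_latest_open_driver` is hypothetical. It only does text filtering, so it can be fed from `apt-cache pkgnames`, which lists known package names matching a prefix.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read package names on stdin and print the
# nvidia-driver-<N>-open name with the highest numeric <N>.
pick_latest_open_driver() {
    grep -E '^nvidia-driver-[0-9]+-open$' \
        | sort -t- -k3,3n \
        | tail -n 1
}

# Demo on sample names (illustrative, not a real repository listing).
printf '%s\n' nvidia-driver-550-open nvidia-driver-570-open \
              nvidia-driver-535 nvidia-driver-560-open \
    | pick_latest_open_driver   # prints nvidia-driver-570-open

# Intended use on a real host (commented out: needs apt and sudo):
# pkg=$(apt-cache pkgnames nvidia-driver- | pick_latest_open_driver)
# sudo apt-get install -y "${pkg:-nvidia-driver-550-open}"
```

The `${pkg:-...}` default keeps the current hardcoded version as a last resort when no matching package is found.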

Comment on lines +260 to +264
if [ "$ARCH" = "arm64" ]; then
ARCH_PATH="sbsa" # ARM server (e.g., Grace Hopper)
else
ARCH_PATH="x86_64"
fi

Copilot AI Feb 6, 2026

The architecture detection assumes only 'arm64' and 'x86_64' are possible, but dpkg --print-architecture can return other values like 'aarch64' (synonym for arm64), 'i386', 'armhf', etc. While unlikely on GPU-capable systems, the script should handle these cases more robustly. Consider checking if ARCH contains 'arm' or 'aarch' to be more comprehensive, or explicitly handle unsupported architectures with a clear error message.

Suggested change
if [ "$ARCH" = "arm64" ]; then
ARCH_PATH="sbsa" # ARM server (e.g., Grace Hopper)
else
ARCH_PATH="x86_64"
fi
case "$ARCH" in
arm64|aarch64|arm*)
ARCH_PATH="sbsa" # ARM server (e.g., Grace Hopper)
;;
amd64|x86_64)
ARCH_PATH="x86_64"
;;
*)
echo "ERROR: Unsupported architecture '$ARCH'. This script supports amd64/x86_64 and arm64/aarch64 only." >&2
exit 1
;;
esac
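
A stricter variant is sketched below, with one caveat about the names involved. On Debian and Ubuntu, `dpkg --print-architecture` typically reports `amd64` and `arm64`, while `aarch64` and `x86_64` are what `uname -m` prints. A broad `arm*` match would also send 32-bit `armhf` to the `sbsa` path. The function name `map_arch_to_repo_path` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map a dpkg architecture name to the CUDA repo path
# segment used by the script (sbsa for ARM servers, x86_64 otherwise).
map_arch_to_repo_path() {
    case "$1" in
        arm64) echo "sbsa" ;;
        amd64) echo "x86_64" ;;
        *)
            echo "ERROR: unsupported architecture '$1'" >&2
            return 1
            ;;
    esac
}

map_arch_to_repo_path amd64   # prints x86_64

# Intended use in the script (commented out: needs dpkg):
# ARCH_PATH=$(map_arch_to_repo_path "$(dpkg --print-architecture)") || exit 1
```

Rejecting anything other than the two 64-bit names fails early with a clear message instead of building a repository URL that does not exist.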

# Use open kernel modules (nvidia-open) instead of proprietary (nvidia-dkms).
# Open modules are REQUIRED for Blackwell+ GPUs and work on all Turing+ (2018+) GPUs,
# which covers all modern cloud GPU instances (T4, A10G, A100, L4, H100, H200, B200, etc.).
echo "\n========== NVIDIA Driver =========="

Copilot AI Feb 6, 2026

The echo command with "\n" does not produce a newline in bash unless the -e flag is used. In the default echo behavior, "\n" is printed literally as two characters (backslash and n). Use either 'echo -e "\n========== NVIDIA Driver =========="' or simply 'echo ""' followed by 'echo "========== NVIDIA Driver =========="'.

Suggested change
echo "\n========== NVIDIA Driver =========="
echo ""
echo "========== NVIDIA Driver =========="

echo "=========================================="
echo "Installing CUDA Toolkit..."
echo "=========================================="
echo "\n========== CUDA Toolkit =========="

Copilot AI Feb 6, 2026

The echo command with "\n" does not produce a newline in bash unless the -e flag is used. In the default echo behavior, "\n" is printed literally as two characters (backslash and n). Use either 'echo -e "\n========== CUDA Toolkit =========="' or simply 'echo ""' followed by 'echo "========== CUDA Toolkit =========="'.

Suggested change
echo "\n========== CUDA Toolkit =========="
echo ""
echo "========== CUDA Toolkit =========="

