Fix dependency related bug in nvidia gpu driver setup #2312
cb-github-robot merged 3 commits into cloud-barista:main
Signed-off-by: Seokho Son <shsongist@gmail.com>
…mpatibility Blackwell architecture GPUs (e.g., RTX PRO SERVER 6000, PCI 2bb5) require open kernel modules (nvidia-open) instead of proprietary (nvidia-dkms). The proprietary driver installs successfully but nvidia-smi returns 'No devices were found' on Blackwell GPUs. Switch cuda-drivers -> cuda-drivers-open and nvidia-driver-550 -> nvidia-driver-550-open. Open modules work on all Turing+ (2018+) GPUs, covering all modern cloud instances (T4, A10G, A100, L4, H100, B200, etc.).
On multi-GPU HGX systems (A100/H100/H200 SXM with NVSwitch), Fabric Manager version MUST match the installed driver major version. Previously the script installed unversioned 'nvidia-fabricmanager' meta-package, which could pull a different version than the driver (especially in fallback path with nvidia-driver-550-open), causing 'Version mismatch between FM and driver' error and making only GPU 0 accessible. Now extracts the installed driver major version (e.g., 570) from dpkg and installs nvidia-fabricmanager-570 to guarantee version alignment. Also enables nvidia-persistenced on multi-GPU systems to avoid ~2s cold-start GPU initialization latency.
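The version alignment described in this commit message could be sketched roughly as follows. This is an illustration only, not the script's actual code: the helper name `driver_major_from_dpkg` is hypothetical, and it assumes the installed driver package is named `nvidia-driver-<major>` or `nvidia-driver-<major>-open`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `dpkg -l`-style lines on stdin and print the major
# version of the first fully installed nvidia-driver-<major>[-open] package.
driver_major_from_dpkg() {
    awk '$1 == "ii" && $2 ~ /^nvidia-driver-[0-9]+(-open)?(:[a-z0-9]+)?$/ {
        split($2, parts, "-")      # nvidia / driver / <major> [/ open]
        sub(/:.*/, "", parts[3])   # drop an architecture qualifier such as :amd64
        print parts[3]
        exit
    }'
}

# Intended use on a real system (assumption, not executed here):
#   major=$(dpkg -l | driver_major_from_dpkg)
#   [ -n "$major" ] && sudo apt-get install -y "nvidia-fabricmanager-${major}"
```

Deriving the Fabric Manager package name from the installed driver, rather than from a fixed number, keeps the two aligned even when the fallback driver path is taken.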
/approve
Pull request overview
This pull request fixes dependency-related bugs in the NVIDIA GPU driver installation script and adds support for various GPU models. The changes improve reliability, error handling, and platform compatibility for automated driver installations on Ubuntu 22.04/24.04 systems.
Changes:
- Refactored broken package detection and cleanup to handle failed NVIDIA/CUDA installations more robustly
- Added comprehensive DKMS prerequisites installation including GCC version matching for HWE kernels
- Switched from cuda-drivers to cuda-drivers-open with a fallback to a specific driver version
- Enhanced architecture support (ARM64/x86_64) and improved Ubuntu version detection using /etc/os-release
- Improved Fabric Manager installation with automatic driver version matching and enhanced multi-GPU detection
- Added better error handling with set -o pipefail and PIPESTATUS checks throughout
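To illustrate the last point: many commands in the script pipe their output through tail, which normally hides the exit status of the command on the left. A minimal sketch of how pipefail and PIPESTATUS recover that status is shown below; the wrapper name `run_and_tail` is hypothetical and is not taken from the script.

```shell
#!/usr/bin/env bash
set -o pipefail   # a pipeline fails if any stage fails, not only the last one

# Hypothetical wrapper: run a command, show only the last N lines of its
# combined output, and return the command's own exit status instead of tail's.
run_and_tail() {
    local lines=$1
    shift
    "$@" 2>&1 | tail -n "$lines"
    return "${PIPESTATUS[0]}"   # status of the first stage of the pipeline above
}

# Possible use (assumption, not executed here):
#   run_and_tail 5 sudo apt-get install -y cuda-drivers-open || echo "install failed"
```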
    echo " Cleaning up leftover packages from failed attempt..."
    LEFTOVER_PKGS=$(dpkg -l 2>/dev/null | grep -E "nvidia|cuda|libnvidia" | grep -v "^ii " | grep -v "^un " | awk '{print $2}' || true)
    if [ -n "$LEFTOVER_PKGS" ]; then
        sudo dpkg --force-all --purge $LEFTOVER_PKGS 2>&1 | tail -5 || true
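$LEFTOVER_PKGS is expanded without quotes, so it undergoes word splitting and glob expansion. Simply wrapping it in double quotes is not a drop-in fix, though: the variable holds a newline-separated list, and a quoted expansion would pass the whole list to dpkg as a single argument. Reading the names into an array and expanding it as "${array[@]}" keeps one argument per package without relying on word splitting.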
        sudo dpkg --configure -a 2>/dev/null || true
    fi
    # Also remove any version-less nvidia libs that were auto-installed
    VERSIONLESS_PKGS=$(dpkg -l 2>/dev/null | grep "^ii" | awk '{print $2}' | grep -E "^libnvidia-|^nvidia-" | grep -v -- "-[0-9]" || true)
The pattern in 'grep -v -- "-[0-9]"' is not anchored, so it excludes any package whose name contains a hyphen followed by a digit anywhere, not only names that end in a version number. For example, it excludes "nvidia-headless-450-server", which has "-450-" in the middle. If only a trailing version suffix should count, anchor the pattern at the end of the name with 'grep -vE -- "-[0-9]+$"' (the + quantifier needs -E; in a basic regex, GNU grep accepts "-[0-9]\+$" instead). Note that the anchored form no longer excludes names such as nvidia-driver-550-open, where the version number is followed by a suffix.
Suggested change:
    - VERSIONLESS_PKGS=$(dpkg -l 2>/dev/null | grep "^ii" | awk '{print $2}' | grep -E "^libnvidia-|^nvidia-" | grep -v -- "-[0-9]" || true)
    + VERSIONLESS_PKGS=$(dpkg -l 2>/dev/null | grep "^ii" | awk '{print $2}' | grep -E "^libnvidia-|^nvidia-" | grep -v -- "-[0-9]\+$" || true)
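The difference between the unanchored and the end-anchored pattern is easier to judge on concrete names. The sketch below wraps each pattern in a small filter; the function names are hypothetical and the sample package names in the note that follows are illustrative only.

```shell
#!/usr/bin/env bash
# Hypothetical filters: read package names on stdin, one per line.

# Drop every name that contains "-<digit>" anywhere (the pattern in the script).
drop_any_version() {
    grep -v -- '-[0-9]' || true    # grep exits 1 when no line is left
}

# Drop only names that END in "-<digits>"; -E makes + mean "one or more".
drop_trailing_version() {
    grep -vE -- '-[0-9]+$' || true
}
```

Fed the four names libnvidia-compute-570, nvidia-headless-450-server, nvidia-settings and nvidia-driver-570-open, the first filter keeps only nvidia-settings, while the second keeps everything except libnvidia-compute-570. Which behavior is wanted depends on whether suffixed names such as nvidia-driver-570-open should be treated as version-less.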
    echo "=========================================="
    echo "Setting up NVIDIA CUDA Repository..."
    echo "=========================================="
    echo "\n========== NVIDIA CUDA Repository =========="
The echo command with "\n" does not produce a newline in bash unless the -e flag is used. In the default echo behavior, "\n" is printed literally as two characters (backslash and n). Use either 'echo -e "\n========== NVIDIA CUDA Repository =========="' or simply 'echo ""' followed by 'echo "========== NVIDIA CUDA Repository =========="'.
Suggested change:
    - echo "\n========== NVIDIA CUDA Repository =========="
    + echo ""
    + echo "========== NVIDIA CUDA Repository =========="
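Another option for these banners, offered here only as a sketch, is printf, which interprets \n in its format string regardless of how echo is configured. The helper name `banner` is hypothetical and does not appear in the script.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a blank line followed by a section banner.
banner() {
    printf '\n========== %s ==========\n' "$1"
}

# Possible use:
#   banner "NVIDIA CUDA Repository"
```

A single helper would also keep the four section headers in the script consistent with each other.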
    echo "=========================================="
    echo "Installing NVIDIA Container Toolkit..."
    echo "=========================================="
    echo "\n========== NVIDIA Container Toolkit =========="
The echo command with "\n" does not produce a newline in bash unless the -e flag is used. In the default echo behavior, "\n" is printed literally as two characters (backslash and n). Use either 'echo -e "\n========== NVIDIA Container Toolkit =========="' or simply 'echo ""' followed by 'echo "========== NVIDIA Container Toolkit =========="'.
Suggested change:
    - echo "\n========== NVIDIA Container Toolkit =========="
    + echo ""
    + echo "========== NVIDIA Container Toolkit =========="
        sudo dpkg --force-all --purge $LEFTOVER_PKGS 2>&1 | tail -5 || true
        sudo rm -rf /var/lib/dkms/nvidia 2>/dev/null || true
        sudo dpkg --configure -a 2>/dev/null || true
    fi
    # Also remove any version-less nvidia libs that were auto-installed
    VERSIONLESS_PKGS=$(dpkg -l 2>/dev/null | grep "^ii" | awk '{print $2}' | grep -E "^libnvidia-|^nvidia-" | grep -v -- "-[0-9]" || true)
    if [ -n "$VERSIONLESS_PKGS" ]; then
        echo " Removing version-less nvidia packages: $(echo $VERSIONLESS_PKGS | tr '\n' ' ')"
        sudo dpkg --force-all --purge $VERSIONLESS_PKGS 2>&1 | tail -3 || true
$VERSIONLESS_PKGS is expanded without quotes, so it undergoes word splitting and glob expansion. As with $LEFTOVER_PKGS, quoting the expansion passed to dpkg would turn a multi-package list into a single argument, so an array expanded as "${array[@]}" is the safer way to keep one argument per package.
Suggested change:
    -     sudo dpkg --force-all --purge $LEFTOVER_PKGS 2>&1 | tail -5 || true
    -     sudo rm -rf /var/lib/dkms/nvidia 2>/dev/null || true
    -     sudo dpkg --configure -a 2>/dev/null || true
    - fi
    - # Also remove any version-less nvidia libs that were auto-installed
    - VERSIONLESS_PKGS=$(dpkg -l 2>/dev/null | grep "^ii" | awk '{print $2}' | grep -E "^libnvidia-|^nvidia-" | grep -v -- "-[0-9]" || true)
    - if [ -n "$VERSIONLESS_PKGS" ]; then
    -     echo " Removing version-less nvidia packages: $(echo $VERSIONLESS_PKGS | tr '\n' ' ')"
    -     sudo dpkg --force-all --purge $VERSIONLESS_PKGS 2>&1 | tail -3 || true
    +     sudo dpkg --force-all --purge "$LEFTOVER_PKGS" 2>&1 | tail -5 || true
    +     sudo rm -rf /var/lib/dkms/nvidia 2>/dev/null || true
    +     sudo dpkg --configure -a 2>/dev/null || true
    + fi
    + # Also remove any version-less nvidia libs that were auto-installed
    + VERSIONLESS_PKGS=$(dpkg -l 2>/dev/null | grep "^ii" | awk '{print $2}' | grep -E "^libnvidia-|^nvidia-" | grep -v -- "-[0-9]" || true)
    + if [ -n "$VERSIONLESS_PKGS" ]; then
    +     echo " Removing version-less nvidia packages: $(echo "$VERSIONLESS_PKGS" | tr '\n' ' ')"
    +     sudo dpkg --force-all --purge "$VERSIONLESS_PKGS" 2>&1 | tail -3 || true
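One caveat with a quoted expansion such as "$LEFTOVER_PKGS": it produces a single argument, so a newline-separated list of several packages would reach dpkg as one invalid name. A sketch that avoids both unquoted word splitting and that problem is to read the names into an array. The helper name `purge_packages` is hypothetical, and taking the command as arguments is an assumption made here so the sketch can be dry-run.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read newline-separated package names on stdin and run
# the command given as arguments with one argument per package.
purge_packages() {
    local -a pkgs=()
    local line
    while IFS= read -r line; do
        if [ -n "$line" ]; then     # skip blank lines
            pkgs+=("$line")
        fi
    done
    if [ "${#pkgs[@]}" -eq 0 ]; then
        return 0                    # nothing to purge
    fi
    "$@" "${pkgs[@]}"
}

# Possible use (assumption, not executed here):
#   printf '%s\n' "$LEFTOVER_PKGS" | purge_packages sudo dpkg --force-all --purge
```

Passing `echo` as the command first gives a quick dry run of what would be purged.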
    for i in $(seq 1 12); do
        if ! sudo fuser /var/lib/dpkg/lock-frontend >/dev/null 2>&1 && \
           ! sudo fuser /var/lib/apt/lists/lock >/dev/null 2>&1; then
            break
        fi
        echo " Waiting for apt lock... (${i}/12)"
        sleep 5
    done
The wait loop checks only two lock files (/var/lib/dpkg/lock-frontend and /var/lib/apt/lists/lock), but line 65 attempts to remove four. For consistency, either the loop should also check /var/cache/apt/archives/lock, as the previous implementation did, or line 65 should remove only the locks that were checked.
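One way to keep the checked locks and the cleaned-up locks in step is to drive the wait loop from a single list. The following is a sketch under assumptions: the array `APT_LOCKS`, the predicate `lock_is_held`, and the function `wait_for_apt_locks` are hypothetical names, and the list contains only the three paths named in this review.

```shell
#!/usr/bin/env bash
# Lock files to wait on (assumed list; adjust to what the cleanup step touches).
APT_LOCKS=(
    /var/lib/dpkg/lock-frontend
    /var/lib/apt/lists/lock
    /var/cache/apt/archives/lock
)

# Hypothetical predicate: succeeds if some process has the given file open.
lock_is_held() {
    sudo fuser "$1" >/dev/null 2>&1
}

# Wait until no lock is held; give up after $1 tries spaced $2 seconds apart.
wait_for_apt_locks() {
    local tries=${1:-12} delay=${2:-5} i lock busy
    for ((i = 1; i <= tries; i++)); do
        busy=0
        for lock in "${APT_LOCKS[@]}"; do
            if lock_is_held "$lock"; then
                busy=1
                break
            fi
        done
        if [ "$busy" -eq 0 ]; then
            return 0
        fi
        echo "  Waiting for apt lock... (${i}/${tries})"
        sleep "$delay"
    done
    return 1    # locks were still held after the last try
}
```

Returning a non-zero status on timeout lets the caller decide whether forcing the locks away is acceptable, instead of doing so unconditionally.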
    fi

    echo ""
    echo "Attempting fallback: install nvidia-driver-550-open..."
The fallback hardcodes nvidia-driver-550-open. This version may become outdated as newer driver versions are released. Consider dynamically determining the latest available version from the repository, or at minimum add a comment noting that this version should be updated periodically. Alternatively, you could attempt to extract the recommended version from cuda-drivers-open package metadata before falling back.
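A rough sketch of the dynamic approach is to pick the highest available nvidia-driver-<major>-open package from a list of package names. The helper name `latest_open_driver_major` is hypothetical, and feeding it from `apt-cache pkgnames` is an assumption about how the script might obtain the list.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read package names on stdin (one per line) and print
# the highest <major> among names of the form nvidia-driver-<major>-open.
latest_open_driver_major() {
    sed -n 's/^nvidia-driver-\([0-9][0-9]*\)-open$/\1/p' | sort -n | tail -n 1
}

# Possible use (assumption, not executed here); keeps 550 as a last resort:
#   major=$(apt-cache pkgnames nvidia-driver- | latest_open_driver_major)
#   sudo apt-get install -y "nvidia-driver-${major:-550}-open"
```

Keeping the hardcoded version as the default when nothing is found preserves the current behavior on repositories where the lookup returns no match.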
    if [ "$ARCH" = "arm64" ]; then
        ARCH_PATH="sbsa" # ARM server (e.g., Grace Hopper)
    else
        ARCH_PATH="x86_64"
    fi
The architecture detection treats every value other than 'arm64' as x86_64. dpkg --print-architecture reports Debian architecture names such as amd64, arm64, i386, or armhf (uname -m would report x86_64 or aarch64 instead), so an unsupported architecture silently falls through to the x86_64 repository path. While unlikely on GPU-capable systems, the script should match the supported architectures explicitly and fail on anything else with a clear error message.
Suggested change:
    - if [ "$ARCH" = "arm64" ]; then
    -     ARCH_PATH="sbsa" # ARM server (e.g., Grace Hopper)
    - else
    -     ARCH_PATH="x86_64"
    - fi
    + case "$ARCH" in
    +     arm64|aarch64)
    +         ARCH_PATH="sbsa" # ARM server (e.g., Grace Hopper)
    +         ;;
    +     amd64|x86_64)
    +         ARCH_PATH="x86_64"
    +         ;;
    +     *)
    +         echo "ERROR: Unsupported architecture '$ARCH'. This script supports amd64/x86_64 and arm64/aarch64 only." >&2
    +         exit 1
    +         ;;
    + esac
    # Use open kernel modules (nvidia-open) instead of proprietary (nvidia-dkms).
    # Open modules are REQUIRED for Blackwell+ GPUs and work on all Turing+ (2018+) GPUs,
    # which covers all modern cloud GPU instances (T4, A10G, A100, L4, H100, H200, B200, etc.).
    echo "\n========== NVIDIA Driver =========="
The echo command with "\n" does not produce a newline in bash unless the -e flag is used. In the default echo behavior, "\n" is printed literally as two characters (backslash and n). Use either 'echo -e "\n========== NVIDIA Driver =========="' or simply 'echo ""' followed by 'echo "========== NVIDIA Driver =========="'.
Suggested change:
    - echo "\n========== NVIDIA Driver =========="
    + echo ""
    + echo "========== NVIDIA Driver =========="
    echo "=========================================="
    echo "Installing CUDA Toolkit..."
    echo "=========================================="
    echo "\n========== CUDA Toolkit =========="
The echo command with "\n" does not produce a newline in bash unless the -e flag is used. In the default echo behavior, "\n" is printed literally as two characters (backslash and n). Use either 'echo -e "\n========== CUDA Toolkit =========="' or simply 'echo ""' followed by 'echo "========== CUDA Toolkit =========="'.
Suggested change:
    - echo "\n========== CUDA Toolkit =========="
    + echo ""
    + echo "========== CUDA Toolkit =========="
Fixes a dependency-related bug in the NVIDIA GPU driver setup.
The script now supports various GPU models.