Support llm-d deployment #2310
Conversation
Signed-off-by: Seokho Son <shsongist@gmail.com>
/approve
Pull request overview
Adds automation to support llm-d deployments on Kubernetes and improves GPU driver/container runtime setup for unattended VM/K8s node provisioning.
Changes:
- Expanded NVIDIA driver installer to support Ubuntu 22.04/24.04, container toolkit setup, optional CUDA toolkit, and NVSwitch checks.
- Added a new deploy-llm-d.sh script to validate prerequisites and install llm-d via Helm.
- Enhanced K8s control-plane/worker setup scripts for non-interactive runs and llm-d/GPU-related outputs.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 20 comments.
| File | Description |
|---|---|
| scripts/usecases/llm/installCudaDriver.sh | Reworked driver install to include repo/keyring flow, optional toolkit, container toolkit config, and reboot handling. |
| scripts/usecases/llm/deploy-llm-d.sh | New script to check cluster prerequisites and deploy llm-d via Helm, then print endpoints/status info. |
| scripts/usecases/k8s/k8s-worker-setup.sh | Adds non-interactive/apt cleanup behavior and GPU detection output for worker nodes. |
| scripts/usecases/k8s/k8s-control-plane-setup.sh | Adds llm-d mode to install extra CRDs/GPU Operator and improves idempotency when cluster already exists. |
```bash
sudo DEBIAN_FRONTEND=noninteractive apt-get install -y \
    -o Dpkg::Options::="--force-confdef" \
    -o Dpkg::Options::="--force-confold" \
    cuda-drivers 2>&1 | \
    grep -E "^(Get:|Unpacking|Setting up|cuda-drivers|nvidia)" || true
```
The apt-get install cuda-drivers exit status is masked by piping to grep and then || true (and there’s no set -o pipefail). If apt-get fails, the script will still continue as if the driver had installed. Don’t swallow the exit code; keep the filtered output but check PIPESTATUS[0] or enable pipefail.
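One way to keep the filtered log lines while still propagating apt-get's exit status — a sketch with a made-up helper name, using `false`/`true` as stand-ins for the real `apt-get` call:

```shell
#!/usr/bin/env bash
# Filter a command's noisy output but return the command's own exit status.
# PIPESTATUS[0] must be read before any `|| true` resets it.
run_filtered() {
  "$@" 2>&1 | grep -E "^(Get:|Unpacking|Setting up|cuda-drivers|nvidia)"
  local status=${PIPESTATUS[0]}   # exit code of "$@", not of grep
  return "$status"
}

# Stand-ins for `sudo ... apt-get install -y ... cuda-drivers`:
run_filtered false && echo "install ok" || echo "install failed"   # prints "install failed"
run_filtered true  && echo "install ok" || echo "install failed"   # prints "install ok"
```

Alternatively, `set -o pipefail` makes the whole pipeline fail when apt-get fails, at the cost of also failing when grep simply matches nothing.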
```bash
# Detect Ubuntu version
UBUNTU_VERSION=$(lsb_release -rs | tr -d '.')
if [[ "$UBUNTU_VERSION" == "2404" ]]; then
    REPO_URL="https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64"
    KEYRING_URL="${REPO_URL}/cuda-keyring_1.1-1_all.deb"
```
This section relies on lsb_release to detect Ubuntu version; on minimal images lsb-release may be missing, causing repo selection to fail. Consider installing/checking required tooling (lsb-release, wget, curl, gnupg, ca-certificates) before repo setup so the script fails fast with a clear message.
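A minimal fail-fast pre-check in that spirit (the helper name and exact tool list are assumptions):

```shell
#!/usr/bin/env bash
# Return 0 if every named command exists on PATH; otherwise report the gaps.
check_tools() {
  local cmd missing=()
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || missing+=("$cmd")
  done
  if [ "${#missing[@]}" -gt 0 ]; then
    echo "ERROR: missing required tools: ${missing[*]}" >&2
    echo "       try: sudo apt-get install -y lsb-release wget curl gnupg ca-certificates" >&2
    return 1
  fi
}

# Before repo setup the real call would be e.g.: check_tools lsb_release wget curl gpg
check_tools sh grep && echo "tooling ok"
```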
```bash
--version|-v)
    LLM_D_VERSION="$2"
    shift 2
```
--version / LLM_D_VERSION is accepted and stored but never used to select a chart/app version during installation. Either wire this into the helm upgrade --install call (e.g., chart --version and/or image tag values) or remove the flag to avoid implying version pinning works when it doesn’t.
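If the flag is kept, one way to wire it through — a sketch; whether the llm-d chart supports pinning via Helm's `--version` is an assumption to verify:

```shell
#!/usr/bin/env bash
# Emit the extra Helm arguments for an optional chart-version pin.
build_version_args() {
  local version="${1:-}"
  local args=()
  if [ -n "$version" ]; then
    args+=(--version "$version")   # pins the chart during `helm upgrade --install`
  fi
  echo "${args[@]}"
}

# e.g. helm upgrade --install llm-d llm-d/llm-d $(build_version_args "$LLM_D_VERSION") ...
build_version_args "v1.2.3"   # prints "--version v1.2.3"
```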
```bash
--with-gateway)
    INSTALL_GATEWAY=true
    shift
```
--with-gateway sets INSTALL_GATEWAY=true, but the flag is never used later (no Gateway/Istio installation or configuration occurs). Either implement the gateway install path or remove the option/help text to avoid implying external access is configured.
```bash
# Check for NVIDIA GPU
echo "Checking for NVIDIA GPU..."
GPU_INFO=$(sudo lspci | grep -i nvidia || true)
if [ -z "$GPU_INFO" ]; then
    echo "ERROR: No NVIDIA GPU detected!"
    sudo lspci
    exit 1
else
    echo "NVIDIA GPU detected:"
    echo "$GPU_INFO"
    echo "Check root disk size"
    df -h / | awk '$NF=="/" {print "Total: "$2, "Available: "$4}'
fi
```
lspci is required for GPU detection, but it may not be present on minimal Ubuntu images (pciutils). If it’s missing, this check will fail (or false-negative) and the script exits early. Install/check pciutils (and any other required tools) before running the pre-flight GPU check, and surface a clear error if installation fails.
```bash
# Kill any stuck apt/dpkg processes (only if they are hanging)
sudo pkill -9 -f "apt-get|dpkg" 2>/dev/null || true
sleep 2
```
Force-killing apt-get/dpkg with pkill -9 can interrupt legitimate package operations (e.g., unattended-upgrades) and leave the node in a broken dpkg state. Prefer waiting for locks and failing clearly, or only terminating the specific PID(s) holding the locks after a timeout.
```bash
# Remove stale lock files
sudo rm -f /var/lib/dpkg/lock-frontend 2>/dev/null || true
sudo rm -f /var/lib/dpkg/lock 2>/dev/null || true
sudo rm -f /var/cache/apt/archives/lock 2>/dev/null || true
sudo rm -f /var/lib/apt/lists/lock 2>/dev/null || true
```
Manually deleting dpkg/apt lock files is risky if a package operation is still running or if lock semantics differ. Consider avoiding lock-file deletion, or only removing after verifying no process holds the locks, and exit with a clear error when locks persist.
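A safer alternative is to wait on the lock with a bounded timeout and fail loudly; a sketch with a generic check command (the fuser-based usage in the comment is an assumption about the target hosts):

```shell
#!/usr/bin/env bash
# Poll a lock-check command until it stops succeeding; give up after `timeout` seconds.
# Intended usage on a node: wait_for_lock "sudo fuser /var/lib/dpkg/lock-frontend" 120 || exit 1
wait_for_lock() {
  local check_cmd="$1" timeout="${2:-120}" waited=0
  while eval "$check_cmd" >/dev/null 2>&1; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "ERROR: lock still held after ${timeout}s; aborting instead of deleting lock files." >&2
      return 1
    fi
    sleep 2
    waited=$((waited + 2))
  done
}

wait_for_lock "false" 4 && echo "lock free"   # check fails at once: lock is free
```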
```bash
HELM_VALUES=""
if [ -n "$MODEL_NAME" ]; then
    HELM_VALUES="$HELM_VALUES --set model.name=$MODEL_NAME"
fi
HELM_VALUES="$HELM_VALUES --set replicaCount=$REPLICAS"
HELM_VALUES="$HELM_VALUES --set resources.limits.nvidia\\.com/gpu=1"

# Install llm-d
echo "Installing llm-d Helm chart..."
helm upgrade --install llm-d llm-d/llm-d \
    --namespace "$LLM_D_NAMESPACE" \
    $HELM_VALUES \
```
HELM_VALUES is built by concatenating untrusted CLI parameters such as MODEL_NAME and REPLICAS, and the resulting string is later injected unquoted into the helm upgrade command. Shell metacharacters (e.g., ;, &&) in those parameters can therefore break out of the intended argument and execute arbitrary commands. Because this script is designed for unattended remote execution, an attacker who can influence --model or --replicas can achieve command execution on the host under the script’s privileges. To fix this, avoid concatenating untrusted data into a single shell string; pass --set values as separately quoted arguments or via a Helm values file.
Suggested change:

```bash
HELM_ARGS=()
if [ -n "$MODEL_NAME" ]; then
    HELM_ARGS+=(--set "model.name=$MODEL_NAME")
fi
HELM_ARGS+=(--set "replicaCount=$REPLICAS")
HELM_ARGS+=(--set "resources.limits.nvidia\\.com/gpu=1")

# Install llm-d
echo "Installing llm-d Helm chart..."
helm upgrade --install llm-d llm-d/llm-d \
    --namespace "$LLM_D_NAMESPACE" \
    "${HELM_ARGS[@]}" \
```
```bash
helm repo add llm-d https://llm-d.github.io/llm-d/ 2>/dev/null || true
helm repo update > /dev/null 2>&1

# Create namespace
kubectl create namespace "$LLM_D_NAMESPACE" --dry-run=client -o yaml | kubectl apply -f - > /dev/null

# Prepare Helm values
HELM_VALUES=""
if [ -n "$MODEL_NAME" ]; then
    HELM_VALUES="$HELM_VALUES --set model.name=$MODEL_NAME"
fi
HELM_VALUES="$HELM_VALUES --set replicaCount=$REPLICAS"
HELM_VALUES="$HELM_VALUES --set resources.limits.nvidia\\.com/gpu=1"

# Install llm-d
echo "Installing llm-d Helm chart..."
helm upgrade --install llm-d llm-d/llm-d \
    --namespace "$LLM_D_NAMESPACE" \
    $HELM_VALUES \
    --wait --timeout 10m 2>&1 | grep -E "^(NAME:|STATUS:|REVISION:)" || true
```
This script automatically adds the external Helm repo https://llm-d.github.io/llm-d/ and installs/upgrades the llm-d/llm-d chart without pinning to a specific immutable version or verifying chart signatures, effectively granting third-party code cluster-admin-level execution each time it runs. If the llm-d Helm repository or its distribution channel is compromised, an attacker could ship a malicious chart that reads cluster secrets or escalates privileges. To mitigate this supply-chain risk, pin the chart to a specific immutable version/digest (or vendor it internally) and enable integrity verification where supported.
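A pinned invocation could look like the following sketch (the version shown is a placeholder, not a real llm-d chart release; the command is echoed rather than run):

```shell
#!/usr/bin/env bash
# Build the install command with an immutable chart version instead of floating on latest.
CHART_VERSION="1.0.0"        # placeholder: substitute a vetted llm-d chart release
LLM_D_NAMESPACE="llm-d"
CMD=(helm upgrade --install llm-d llm-d/llm-d
     --namespace "$LLM_D_NAMESPACE"
     --version "$CHART_VERSION"
     --wait --timeout 10m)
echo "${CMD[@]}"   # no cluster here, so just show the command shape
```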
```bash
# Initialize Kubernetes control plane
echo "Initializing Kubernetes control plane..."
sudo kubeadm init \
    --apiserver-advertise-address=${NODE_IP} \
    --pod-network-cidr=${POD_NETWORK_CIDR} \
    --apiserver-cert-extra-sans=${CERT_SANS} \
```
The kubeadm init invocation interpolates NODE_IP, POD_NETWORK_CIDR, and CERT_SANS directly into the command without quoting or validation. Shell metacharacters in these CLI-controlled values can terminate the kubeadm command and execute arbitrary commands (e.g., by passing --ip '1.2.3.4; evil_cmd #'). Since this script is intended for unattended remote execution, a user who can influence these parameters can gain root-level code execution on the control-plane node. To prevent this, strictly validate these options as IP/CIDR/SAN strings and pass them to kubeadm as separately quoted arguments or via a configuration file instead of embedding them unquoted in the shell command.
Suggested change:

```bash
# Validate kubeadm parameters to prevent shell injection
validate_kubeadm_param() {
    local name="$1"
    local value="$2"
    local pattern="$3"
    if [ -z "$value" ]; then
        echo "ERROR: $name is empty. Please provide a valid value."
        exit 1
    fi
    if ! [[ "$value" =~ $pattern ]]; then
        echo "ERROR: $name contains invalid characters: '$value'"
        echo "       Allowed pattern: $pattern"
        exit 1
    fi
}

# NODE_IP: IPv4/IPv6 address characters only
validate_kubeadm_param "NODE_IP" "${NODE_IP}" '^[0-9a-fA-F:.]+$'
# POD_NETWORK_CIDR: IPv4 CIDR (digits, dots, slash)
validate_kubeadm_param "POD_NETWORK_CIDR" "${POD_NETWORK_CIDR}" '^[0-9./]+$'
# CERT_SANS: comma-separated list of hostnames/IPs (no spaces or shell metacharacters)
validate_kubeadm_param "CERT_SANS" "${CERT_SANS}" '^[A-Za-z0-9.:,-]+$'

# Initialize Kubernetes control plane
echo "Initializing Kubernetes control plane..."
sudo kubeadm init \
    --apiserver-advertise-address="${NODE_IP}" \
    --pod-network-cidr="${POD_NETWORK_CIDR}" \
    --apiserver-cert-extra-sans="${CERT_SANS}" \
```
FYI: additional testing is required