Skip to content

Support llm-d deployment#2310

Merged
cb-github-robot merged 3 commits intocloud-barista:mainfrom
seokho-son:main
Feb 3, 2026
Merged

Support llm-d deployment#2310
cb-github-robot merged 3 commits intocloud-barista:mainfrom
seokho-son:main

Conversation

@seokho-son
Copy link
Copy Markdown
Member

  • Support llm-d deployment
  • Enhance gpu driver installation (for both VM / K8s Node)

FYI: additional testing is required

Signed-off-by: Seokho Son <shsongist@gmail.com>
Signed-off-by: Seokho Son <shsongist@gmail.com>
Signed-off-by: Seokho Son <shsongist@gmail.com>
Copilot AI review requested due to automatic review settings February 3, 2026 02:24
@seokho-son
Copy link
Copy Markdown
Member Author

/approve

@github-actions github-actions bot added the approved This PR is approved and will be merged soon. label Feb 3, 2026
@cb-github-robot cb-github-robot merged commit 6e20ec4 into cloud-barista:main Feb 3, 2026
6 checks passed
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds automation to support llm-d deployments on Kubernetes and improves GPU driver/container runtime setup for unattended VM/K8s node provisioning.

Changes:

  • Expanded NVIDIA driver installer to support Ubuntu 22.04/24.04, container toolkit setup, optional CUDA toolkit, and NVSwitch checks.
  • Added a new deploy-llm-d.sh script to validate prerequisites and install llm-d via Helm.
  • Enhanced K8s control-plane/worker setup scripts for non-interactive runs and llm-d/GPU-related outputs.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 20 comments.

File Description
scripts/usecases/llm/installCudaDriver.sh Reworked driver install to include repo/keyring flow, optional toolkit, container toolkit config, and reboot handling.
scripts/usecases/llm/deploy-llm-d.sh New script to check cluster prerequisites and deploy llm-d via Helm, then print endpoints/status info.
scripts/usecases/k8s/k8s-worker-setup.sh Adds non-interactive/apt cleanup behavior and GPU detection output for worker nodes.
scripts/usecases/k8s/k8s-control-plane-setup.sh Adds llm-d mode to install extra CRDs/GPU Operator and improves idempotency when cluster already exists.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +245 to +249
sudo DEBIAN_FRONTEND=noninteractive apt-get install -y \
-o Dpkg::Options::="--force-confdef" \
-o Dpkg::Options::="--force-confold" \
cuda-drivers 2>&1 | \
grep -E "^(Get:|Unpacking|Setting up|cuda-drivers|nvidia)" || true
Copy link

Copilot AI Feb 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The apt-get install cuda-drivers exit status is masked by piping to grep and then || true (and there’s no set -o pipefail). If apt-get fails, the script will still continue as if the driver installed. Don’t swallow the exit code; keep filtered output but check PIPESTATUS[0] / enable pipefail.

Copilot uses AI. Check for mistakes.
Comment on lines +203 to +207
# Detect Ubuntu version
UBUNTU_VERSION=$(lsb_release -rs | tr -d '.')
if [[ "$UBUNTU_VERSION" == "2404" ]]; then
REPO_URL="https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64"
KEYRING_URL="${REPO_URL}/cuda-keyring_1.1-1_all.deb"
Copy link

Copilot AI Feb 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section relies on lsb_release to detect Ubuntu version; on minimal images lsb-release may be missing, causing repo selection to fail. Consider installing/checking required tooling (lsb-release, wget, curl, gnupg, ca-certificates) before repo setup so the script fails fast with a clear message.

Copilot uses AI. Check for mistakes.
Comment on lines +53 to +55
--version|-v)
LLM_D_VERSION="$2"
shift 2
Copy link

Copilot AI Feb 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--version / LLM_D_VERSION is accepted and stored but never used to select a chart/app version during installation. Either wire this into the helm upgrade --install call (e.g., chart --version and/or image tag values) or remove the flag to avoid implying version pinning works when it doesn’t.

Copilot uses AI. Check for mistakes.
Comment on lines +65 to +67
--with-gateway)
INSTALL_GATEWAY=true
shift
Copy link

Copilot AI Feb 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--with-gateway sets INSTALL_GATEWAY=true, but the flag is never used later (no Gateway/Istio installation or configuration occurs). Either implement the gateway install path or remove the option/help text to avoid implying external access is configured.

Copilot uses AI. Check for mistakes.
Comment on lines 156 to +160
# Check for NVIDIA GPU
echo "Checking for NVIDIA GPU..."
GPU_INFO=$(sudo lspci | grep -i nvidia)
GPU_INFO=$(sudo lspci | grep -i nvidia || true)
if [ -z "$GPU_INFO" ]; then
echo "No NVIDIA GPU detected or an error occurred. Exiting..."
sudo lspci
exit 1
else
echo "NVIDIA GPU detected:"
echo "$GPU_INFO"
echo "Check root disk size"
df -h / | awk '$NF=="/" {print "Total: "$2, "Available: "$4}'
echo "ERROR: No NVIDIA GPU detected!"
Copy link

Copilot AI Feb 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lspci is required for GPU detection, but it may not be present on minimal Ubuntu images (pciutils). If it’s missing, this check will fail (or false-negative) and the script exits early. Install/check pciutils (and any other required tools) before running the pre-flight GPU check, and surface a clear error if installation fails.

Copilot uses AI. Check for mistakes.
Comment on lines +55 to +57
# Kill any stuck apt/dpkg processes (only if they are hanging)
sudo pkill -9 -f "apt-get|dpkg" 2>/dev/null || true
sleep 2
Copy link

Copilot AI Feb 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Force-killing apt-get/dpkg with pkill -9 can interrupt legitimate package operations (e.g., unattended-upgrades) and leave the node in a broken dpkg state. Prefer waiting for locks and failing clearly, or only terminating the specific PID(s) holding the locks after a timeout.

Copilot uses AI. Check for mistakes.
Comment on lines +75 to +79
# Remove stale lock files
sudo rm -f /var/lib/dpkg/lock-frontend 2>/dev/null || true
sudo rm -f /var/lib/dpkg/lock 2>/dev/null || true
sudo rm -f /var/cache/apt/archives/lock 2>/dev/null || true
sudo rm -f /var/lib/apt/lists/lock 2>/dev/null || true
Copy link

Copilot AI Feb 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Manually deleting dpkg/apt lock files is risky if a package operation is still running or if lock semantics differ. Consider avoiding lock-file deletion, or only removing after verifying no process holds the locks, and exit with a clear error when locks persist.

Copilot uses AI. Check for mistakes.
Comment on lines +225 to +236
HELM_VALUES=""
if [ -n "$MODEL_NAME" ]; then
HELM_VALUES="$HELM_VALUES --set model.name=$MODEL_NAME"
fi
HELM_VALUES="$HELM_VALUES --set replicaCount=$REPLICAS"
HELM_VALUES="$HELM_VALUES --set resources.limits.nvidia\\.com/gpu=1"

# Install llm-d
echo "Installing llm-d Helm chart..."
helm upgrade --install llm-d llm-d/llm-d \
--namespace "$LLM_D_NAMESPACE" \
$HELM_VALUES \
Copy link

Copilot AI Feb 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Helm values aggregation logic builds HELM_VALUES by concatenating untrusted CLI parameters like MODEL_NAME and REPLICAS and later injects this string unquoted into the helm upgrade command, which allows shell metacharacters (e.g., ;, &&) in those parameters to break out of the intended argument and execute arbitrary commands. Because this script is designed for unattended remote execution, an attacker who can influence --model or --replicas can achieve command execution on the host under the script’s privileges. To fix this, avoid concatenating untrusted data into a single shell string and instead pass --set values as separately quoted arguments or via a Helm values file.

Suggested change
HELM_VALUES=""
if [ -n "$MODEL_NAME" ]; then
HELM_VALUES="$HELM_VALUES --set model.name=$MODEL_NAME"
fi
HELM_VALUES="$HELM_VALUES --set replicaCount=$REPLICAS"
HELM_VALUES="$HELM_VALUES --set resources.limits.nvidia\\.com/gpu=1"
# Install llm-d
echo "Installing llm-d Helm chart..."
helm upgrade --install llm-d llm-d/llm-d \
--namespace "$LLM_D_NAMESPACE" \
$HELM_VALUES \
HELM_ARGS=()
if [ -n "$MODEL_NAME" ]; then
HELM_ARGS+=(--set "model.name=$MODEL_NAME")
fi
HELM_ARGS+=(--set "replicaCount=$REPLICAS")
HELM_ARGS+=(--set "resources.limits.nvidia\\.com/gpu=1")
# Install llm-d
echo "Installing llm-d Helm chart..."
helm upgrade --install llm-d llm-d/llm-d \
--namespace "$LLM_D_NAMESPACE" \
"${HELM_ARGS[@]}" \

Copilot uses AI. Check for mistakes.
Comment on lines +218 to +237
helm repo add llm-d https://llm-d.github.io/llm-d/ 2>/dev/null || true
helm repo update > /dev/null 2>&1

# Create namespace
kubectl create namespace "$LLM_D_NAMESPACE" --dry-run=client -o yaml | kubectl apply -f - > /dev/null

# Prepare Helm values
HELM_VALUES=""
if [ -n "$MODEL_NAME" ]; then
HELM_VALUES="$HELM_VALUES --set model.name=$MODEL_NAME"
fi
HELM_VALUES="$HELM_VALUES --set replicaCount=$REPLICAS"
HELM_VALUES="$HELM_VALUES --set resources.limits.nvidia\\.com/gpu=1"

# Install llm-d
echo "Installing llm-d Helm chart..."
helm upgrade --install llm-d llm-d/llm-d \
--namespace "$LLM_D_NAMESPACE" \
$HELM_VALUES \
--wait --timeout 10m 2>&1 | grep -E "^(NAME:|STATUS:|REVISION:)" || true
Copy link

Copilot AI Feb 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This script automatically adds the external Helm repo https://llm-d.github.io/llm-d/ and installs/upgrades the llm-d/llm-d chart without pinning to a specific immutable version or verifying chart signatures, effectively granting third-party code cluster-admin level execution each time it runs. If the llm-d Helm repository or its distribution channel is compromised, an attacker could ship a malicious chart that gains access to cluster secrets or escalates privileges. To mitigate this supply chain risk, pin the chart to a specific immutable version/digest (or vendor it internally) and enable integrity verification where supported.

Copilot uses AI. Check for mistakes.
Comment on lines +297 to +302
# Initialize Kubernetes control plane
echo "Initializing Kubernetes control plane..."
sudo kubeadm init \
--apiserver-advertise-address=${NODE_IP} \
--pod-network-cidr=${POD_NETWORK_CIDR} \
--apiserver-cert-extra-sans=${CERT_SANS} \
Copy link

Copilot AI Feb 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The kubeadm init invocation interpolates NODE_IP, POD_NETWORK_CIDR, and CERT_SANS directly into the command without quoting or validation, allowing shell metacharacters in these CLI-controlled values to terminate the kubeadm command and execute arbitrary commands (e.g., by passing --ip '1.2.3.4; evil_cmd #'). Since this script is intended for unattended remote execution, a user who can influence these parameters can gain code execution on the control-plane node with root privileges. To prevent this, strictly validate these options as IP/CIDR/SAN strings and pass them to kubeadm as separately quoted arguments or via a configuration file instead of embedding them unquoted in the shell command.

Suggested change
# Initialize Kubernetes control plane
echo "Initializing Kubernetes control plane..."
sudo kubeadm init \
--apiserver-advertise-address=${NODE_IP} \
--pod-network-cidr=${POD_NETWORK_CIDR} \
--apiserver-cert-extra-sans=${CERT_SANS} \
# Validate kubeadm parameters to prevent shell injection
validate_kubeadm_param() {
local name="$1"
local value="$2"
local pattern="$3"
if [ -z "$value" ]; then
echo "ERROR: $name is empty. Please provide a valid value."
exit 1
fi
if ! [[ "$value" =~ $pattern ]]; then
echo "ERROR: $name contains invalid characters: '$value'"
echo " Allowed pattern: $pattern"
exit 1
fi
}
# NODE_IP: IPv4/IPv6 address characters only
validate_kubeadm_param "NODE_IP" "${NODE_IP}" '^[0-9a-fA-F:\.]+$'
# POD_NETWORK_CIDR: IPv4 CIDR (digits, dots, slash)
validate_kubeadm_param "POD_NETWORK_CIDR" "${POD_NETWORK_CIDR}" '^[0-9\.\/]+$'
# CERT_SANS: comma-separated list of hostnames/IPs (no spaces or shell metacharacters)
validate_kubeadm_param "CERT_SANS" "${CERT_SANS}" '^[A-Za-z0-9\.\-:,]+$'
# Initialize Kubernetes control plane
echo "Initializing Kubernetes control plane..."
sudo kubeadm init \
--apiserver-advertise-address="${NODE_IP}" \
--pod-network-cidr="${POD_NETWORK_CIDR}" \
--apiserver-cert-extra-sans="${CERT_SANS}" \

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved This PR is approved and will be merged soon.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants