Support llm-d deployment #2310
Conversation
Signed-off-by: Seokho Son <shsongist@gmail.com>
/approve
Pull request overview
Adds automation to support llm-d deployments on Kubernetes and improves GPU driver/container runtime setup for unattended VM/K8s node provisioning.
Changes:
- Expanded NVIDIA driver installer to support Ubuntu 22.04/24.04, container toolkit setup, optional CUDA toolkit, and NVSwitch checks.
- Added a new deploy-llm-d.sh script to validate prerequisites and install llm-d via Helm.
- Enhanced K8s control-plane/worker setup scripts for non-interactive runs and llm-d/GPU-related outputs.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 20 comments.
| File | Description |
|---|---|
| scripts/usecases/llm/installCudaDriver.sh | Reworked driver install to include repo/keyring flow, optional toolkit, container toolkit config, and reboot handling. |
| scripts/usecases/llm/deploy-llm-d.sh | New script to check cluster prerequisites and deploy llm-d via Helm, then print endpoints/status info. |
| scripts/usecases/k8s/k8s-worker-setup.sh | Adds non-interactive/apt cleanup behavior and GPU detection output for worker nodes. |
| scripts/usecases/k8s/k8s-control-plane-setup.sh | Adds llm-d mode to install extra CRDs/GPU Operator and improves idempotency when cluster already exists. |
```bash
sudo DEBIAN_FRONTEND=noninteractive apt-get install -y \
    -o Dpkg::Options::="--force-confdef" \
    -o Dpkg::Options::="--force-confold" \
    cuda-drivers 2>&1 | \
    grep -E "^(Get:|Unpacking|Setting up|cuda-drivers|nvidia)" || true
```
The apt-get install cuda-drivers exit status is masked by piping to grep and then || true (and there’s no set -o pipefail). If apt-get fails, the script will still continue as if the driver had installed. Don’t swallow the exit code; keep the filtered output but check PIPESTATUS[0] or enable pipefail.
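One way to keep the filtered log lines while still propagating apt-get's exit status — a sketch with a made-up helper name, using `false`/`true` as stand-ins for the real `apt-get` call:

```shell
#!/usr/bin/env bash
# Filter a command's noisy output but return the command's own exit status.
# PIPESTATUS[0] must be read before any `|| true` resets it.
run_filtered() {
  "$@" 2>&1 | grep -E "^(Get:|Unpacking|Setting up|cuda-drivers|nvidia)"
  local status=${PIPESTATUS[0]}   # exit code of "$@", not of grep
  return "$status"
}

# Stand-ins for `sudo ... apt-get install -y ... cuda-drivers`:
run_filtered false && echo "install ok" || echo "install failed"   # prints "install failed"
run_filtered true  && echo "install ok" || echo "install failed"   # prints "install ok"
```

Alternatively, `set -o pipefail` makes the whole pipeline fail when apt-get fails, at the cost of also failing when grep simply matches nothing.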
```bash
# Detect Ubuntu version
UBUNTU_VERSION=$(lsb_release -rs | tr -d '.')
if [[ "$UBUNTU_VERSION" == "2404" ]]; then
    REPO_URL="https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64"
    KEYRING_URL="${REPO_URL}/cuda-keyring_1.1-1_all.deb"
```
This section relies on lsb_release to detect Ubuntu version; on minimal images lsb-release may be missing, causing repo selection to fail. Consider installing/checking required tooling (lsb-release, wget, curl, gnupg, ca-certificates) before repo setup so the script fails fast with a clear message.
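A minimal fail-fast pre-check in that spirit (the helper name and exact tool list are assumptions):

```shell
#!/usr/bin/env bash
# Return 0 if every named command exists on PATH; otherwise report the gaps.
check_tools() {
  local cmd missing=()
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || missing+=("$cmd")
  done
  if [ "${#missing[@]}" -gt 0 ]; then
    echo "ERROR: missing required tools: ${missing[*]}" >&2
    echo "       try: sudo apt-get install -y lsb-release wget curl gnupg ca-certificates" >&2
    return 1
  fi
}

# Before repo setup the real call would be e.g.: check_tools lsb_release wget curl gpg
check_tools sh grep && echo "tooling ok"
```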
```bash
--version|-v)
    LLM_D_VERSION="$2"
    shift 2
```
--version / LLM_D_VERSION is accepted and stored but never used to select a chart/app version during installation. Either wire this into the helm upgrade --install call (e.g., chart --version and/or image tag values) or remove the flag to avoid implying version pinning works when it doesn’t.
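If the flag is kept, one way to wire it through — a sketch; whether the llm-d chart supports pinning via Helm's `--version` is an assumption to verify:

```shell
#!/usr/bin/env bash
# Emit the extra Helm arguments for an optional chart-version pin.
build_version_args() {
  local version="${1:-}"
  local args=()
  if [ -n "$version" ]; then
    args+=(--version "$version")   # pins the chart during `helm upgrade --install`
  fi
  echo "${args[@]}"
}

# e.g. helm upgrade --install llm-d llm-d/llm-d $(build_version_args "$LLM_D_VERSION") ...
build_version_args "v1.2.3"   # prints "--version v1.2.3"
```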
```bash
--with-gateway)
    INSTALL_GATEWAY=true
    shift
```
--with-gateway sets INSTALL_GATEWAY=true, but the flag is never used later (no Gateway/Istio installation or configuration occurs). Either implement the gateway install path or remove the option/help text to avoid implying external access is configured.
```bash
# Check for NVIDIA GPU
echo "Checking for NVIDIA GPU..."
GPU_INFO=$(sudo lspci | grep -i nvidia || true)
if [ -z "$GPU_INFO" ]; then
    echo "ERROR: No NVIDIA GPU detected!"
    sudo lspci
    exit 1
else
    echo "NVIDIA GPU detected:"
    echo "$GPU_INFO"
    echo "Check root disk size"
    df -h / | awk '$NF=="/" {print "Total: "$2, "Available: "$4}'
fi
```
lspci is required for GPU detection, but it may not be present on minimal Ubuntu images (pciutils). If it’s missing, this check will fail (or false-negative) and the script exits early. Install/check pciutils (and any other required tools) before running the pre-flight GPU check, and surface a clear error if installation fails.
```bash
# Kill any stuck apt/dpkg processes (only if they are hanging)
sudo pkill -9 -f "apt-get|dpkg" 2>/dev/null || true
sleep 2
```
Force-killing apt-get/dpkg with pkill -9 can interrupt legitimate package operations (e.g., unattended-upgrades) and leave the node in a broken dpkg state. Prefer waiting for locks and failing clearly, or only terminating the specific PID(s) holding the locks after a timeout.
```bash
# Remove stale lock files
sudo rm -f /var/lib/dpkg/lock-frontend 2>/dev/null || true
sudo rm -f /var/lib/dpkg/lock 2>/dev/null || true
sudo rm -f /var/cache/apt/archives/lock 2>/dev/null || true
sudo rm -f /var/lib/apt/lists/lock 2>/dev/null || true
```
Manually deleting dpkg/apt lock files is risky if a package operation is still running or if lock semantics differ. Consider avoiding lock-file deletion, or only removing after verifying no process holds the locks, and exit with a clear error when locks persist.
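A safer alternative is to wait on the lock with a bounded timeout and fail loudly; a sketch with a generic check command (the fuser-based usage in the comment is an assumption about the target hosts):

```shell
#!/usr/bin/env bash
# Poll a lock-check command until it stops succeeding; give up after `timeout` seconds.
# Intended usage on a node: wait_for_lock "sudo fuser /var/lib/dpkg/lock-frontend" 120 || exit 1
wait_for_lock() {
  local check_cmd="$1" timeout="${2:-120}" waited=0
  while eval "$check_cmd" >/dev/null 2>&1; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "ERROR: lock still held after ${timeout}s; aborting instead of deleting lock files." >&2
      return 1
    fi
    sleep 2
    waited=$((waited + 2))
  done
}

wait_for_lock "false" 4 && echo "lock free"   # check fails at once: lock is free
```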
```bash
HELM_VALUES=""
if [ -n "$MODEL_NAME" ]; then
    HELM_VALUES="$HELM_VALUES --set model.name=$MODEL_NAME"
fi
HELM_VALUES="$HELM_VALUES --set replicaCount=$REPLICAS"
HELM_VALUES="$HELM_VALUES --set resources.limits.nvidia\\.com/gpu=1"

# Install llm-d
echo "Installing llm-d Helm chart..."
helm upgrade --install llm-d llm-d/llm-d \
    --namespace "$LLM_D_NAMESPACE" \
    $HELM_VALUES \
```
HELM_VALUES is built by concatenating untrusted CLI parameters such as MODEL_NAME and REPLICAS, and the resulting string is later injected unquoted into the helm upgrade command. Shell metacharacters (e.g., ;, &&) in those parameters can therefore break out of the intended argument and execute arbitrary commands. Because this script is designed for unattended remote execution, an attacker who can influence --model or --replicas can achieve command execution on the host under the script’s privileges. To fix this, avoid concatenating untrusted data into a single shell string; pass --set values as separately quoted arguments or via a Helm values file.
Suggested change:

```bash
HELM_ARGS=()
if [ -n "$MODEL_NAME" ]; then
    HELM_ARGS+=(--set "model.name=$MODEL_NAME")
fi
HELM_ARGS+=(--set "replicaCount=$REPLICAS")
HELM_ARGS+=(--set "resources.limits.nvidia\\.com/gpu=1")

# Install llm-d
echo "Installing llm-d Helm chart..."
helm upgrade --install llm-d llm-d/llm-d \
    --namespace "$LLM_D_NAMESPACE" \
    "${HELM_ARGS[@]}" \
```
```bash
helm repo add llm-d https://llm-d.github.io/llm-d/ 2>/dev/null || true
helm repo update > /dev/null 2>&1

# Create namespace
kubectl create namespace "$LLM_D_NAMESPACE" --dry-run=client -o yaml | kubectl apply -f - > /dev/null

# Prepare Helm values
HELM_VALUES=""
if [ -n "$MODEL_NAME" ]; then
    HELM_VALUES="$HELM_VALUES --set model.name=$MODEL_NAME"
fi
HELM_VALUES="$HELM_VALUES --set replicaCount=$REPLICAS"
HELM_VALUES="$HELM_VALUES --set resources.limits.nvidia\\.com/gpu=1"

# Install llm-d
echo "Installing llm-d Helm chart..."
helm upgrade --install llm-d llm-d/llm-d \
    --namespace "$LLM_D_NAMESPACE" \
    $HELM_VALUES \
    --wait --timeout 10m 2>&1 | grep -E "^(NAME:|STATUS:|REVISION:)" || true
```
This script automatically adds the external Helm repo https://llm-d.github.io/llm-d/ and installs/upgrades the llm-d/llm-d chart without pinning to a specific immutable version or verifying chart signatures, effectively granting third-party code cluster-admin-level execution each time it runs. If the llm-d Helm repository or its distribution channel is compromised, an attacker could ship a malicious chart that reads cluster secrets or escalates privileges. To mitigate this supply-chain risk, pin the chart to a specific immutable version/digest (or vendor it internally) and enable integrity verification where supported.
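A pinned invocation could look like the following sketch (the version shown is a placeholder, not a real llm-d chart release; the command is echoed rather than run):

```shell
#!/usr/bin/env bash
# Build the install command with an immutable chart version instead of floating on latest.
CHART_VERSION="1.0.0"        # placeholder: substitute a vetted llm-d chart release
LLM_D_NAMESPACE="llm-d"
CMD=(helm upgrade --install llm-d llm-d/llm-d
     --namespace "$LLM_D_NAMESPACE"
     --version "$CHART_VERSION"
     --wait --timeout 10m)
echo "${CMD[@]}"   # no cluster here, so just show the command shape
```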
```bash
# Initialize Kubernetes control plane
echo "Initializing Kubernetes control plane..."
sudo kubeadm init \
    --apiserver-advertise-address=${NODE_IP} \
    --pod-network-cidr=${POD_NETWORK_CIDR} \
    --apiserver-cert-extra-sans=${CERT_SANS} \
```
The kubeadm init invocation interpolates NODE_IP, POD_NETWORK_CIDR, and CERT_SANS directly into the command without quoting or validation. Shell metacharacters in these CLI-controlled values can terminate the kubeadm command and execute arbitrary commands (e.g., by passing --ip '1.2.3.4; evil_cmd #'). Since this script is intended for unattended remote execution, a user who can influence these parameters can gain root-level code execution on the control-plane node. To prevent this, strictly validate these options as IP/CIDR/SAN strings and pass them to kubeadm as separately quoted arguments or via a configuration file instead of embedding them unquoted in the shell command.
Suggested change:

```bash
# Validate kubeadm parameters to prevent shell injection
validate_kubeadm_param() {
    local name="$1"
    local value="$2"
    local pattern="$3"
    if [ -z "$value" ]; then
        echo "ERROR: $name is empty. Please provide a valid value."
        exit 1
    fi
    if ! [[ "$value" =~ $pattern ]]; then
        echo "ERROR: $name contains invalid characters: '$value'"
        echo "       Allowed pattern: $pattern"
        exit 1
    fi
}

# NODE_IP: IPv4/IPv6 address characters only
validate_kubeadm_param "NODE_IP" "${NODE_IP}" '^[0-9a-fA-F:.]+$'
# POD_NETWORK_CIDR: IPv4 CIDR (digits, dots, slash)
validate_kubeadm_param "POD_NETWORK_CIDR" "${POD_NETWORK_CIDR}" '^[0-9./]+$'
# CERT_SANS: comma-separated list of hostnames/IPs (no spaces or shell metacharacters)
validate_kubeadm_param "CERT_SANS" "${CERT_SANS}" '^[A-Za-z0-9.:,-]+$'

# Initialize Kubernetes control plane
echo "Initializing Kubernetes control plane..."
sudo kubeadm init \
    --apiserver-advertise-address="${NODE_IP}" \
    --pod-network-cidr="${POD_NETWORK_CIDR}" \
    --apiserver-cert-extra-sans="${CERT_SANS}" \
```
FYI: additional testing is required