[Hotfix][CI] Unblock CI: pandas auto-heal + CUDA 12 build toolchain #3055
Conversation
Recent vLLM nightlies (>= 0.19.1rc1.dev325) eagerly `import pandas` from vllm/_aiter_ops.py without declaring pandas in their install deps. That causes `vllm serve` to fail at import time in our CI venv, so the k3 integration-test jobs time out waiting on port 8000.

Install pandas explicitly in setup-env.sh as a workaround until vLLM either makes the import lazy or adds pandas to their declared deps.

Signed-off-by: Samuel Shen <slshen@uchciago.edu>
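In setup-env.sh terms the workaround is a single line; its exact placement in the script is assumed here:

```bash
# Workaround for vLLM nightly >= 0.19.1rc1.dev325: vllm/_aiter_ops.py imports
# pandas at module load time without declaring it as a dependency.
# Remove once vLLM makes the import lazy or adds pandas to its deps.
uv pip install pandas
```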
Code Review
This pull request adds an explicit installation of pandas using uv pip in the .buildkite/k3_harness/setup-env.sh script. This serves as a workaround for recent vLLM nightly builds that require pandas but do not list it as a dependency. I have no feedback to provide.
…meout

Replace the explicit `pip install pandas` workaround with a general probe-and-retry loop: after installing vLLM, import its CLI entry point and auto-install any ModuleNotFoundError modules (capped at 5 retries). Self-heals the next time vLLM nightly adds another undeclared runtime dep, and every auto-install is logged so the drift is visible in the build output.

Also extend `wait_for_server` with an optional log_file argument: when the port-wait times out, tail the last 200 lines to stderr so the real failure (e.g. an ImportError at `vllm serve` startup) shows up inline in the Buildkite job instead of requiring a trip through build artifacts. Wire the integration-test harness to pass its per-test log path.

Signed-off-by: Samuel Shen <slshen@uchciago.edu>
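Sketched, such a loop could look like this; the probe module path and the error regex are assumptions, not the committed code (which is excerpted in the review below):

```bash
MAX_AUTO_INSTALL=5
for _ in $(seq 1 "$MAX_AUTO_INSTALL"); do
  # Probe: does vLLM's CLI entry point import cleanly?
  if err=$(python -c "import vllm.entrypoints.cli.main" 2>&1); then
    break  # environment is healthy, nothing to heal
  fi
  # Extract the module name from "ModuleNotFoundError: No module named 'x'".
  mod=$(sed -n "s/.*ModuleNotFoundError: No module named '\([^']*\)'.*/\1/p" <<<"$err" | head -1)
  if [[ -z "$mod" ]]; then
    echo "$err" >&2
    exit 1  # import failed for some reason other than a missing module
  fi
  echo "Auto-installing undeclared vLLM runtime dep: $mod" >&2
  uv pip install "$mod"
done
# Final probe: under set -e this fails the job if imports are still broken
# after the retry budget is spent.
python -c "import vllm.entrypoints.cli.main"
```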
| echo "Hit $MAX_AUTO_INSTALL auto-install retries; last missing module: $mod" >&2 | ||
| echo "$err" >&2 | ||
| exit 1 | ||
| fi |
Off-by-one: loop installs 4 modules, not 5
Low Severity
The guard `if [[ "$i" == "$MAX_AUTO_INSTALL" ]]` fires before `uv pip install`, so on the 5th iteration a missing module is detected but never installed — the script exits instead. Despite `MAX_AUTO_INSTALL=5`, only 4 modules can actually be auto-installed. The error message "Hit 5 auto-install retries" is also misleading since only 4 installs were attempted. Moving the guard after the install (or bumping the range to `MAX_AUTO_INSTALL + 1`) would fix both the count and the message.
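One hedged way to realize the range-bump suggestion (probe and extraction condensed to match the sketch earlier in the thread; not the committed code):

```bash
for i in $(seq 1 $((MAX_AUTO_INSTALL + 1))); do
  if err=$(python -c "import vllm.entrypoints.cli.main" 2>&1); then
    break  # import succeeded
  fi
  mod=$(sed -n "s/.*ModuleNotFoundError: No module named '\([^']*\)'.*/\1/p" <<<"$err" | head -1)
  [[ -n "$mod" ]] || { echo "$err" >&2; exit 1; }
  # Guard now permits exactly MAX_AUTO_INSTALL installs: it fires only on
  # the extra (6th) pass, after the 5th install failed to fix the import.
  if [[ "$i" -gt "$MAX_AUTO_INSTALL" ]]; then
    echo "Hit $MAX_AUTO_INSTALL auto-install retries; last missing module: $mod" >&2
    echo "$err" >&2
    exit 1
  fi
  uv pip install "$mod"
done
```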
Reviewed by Cursor Bugbot for commit 123a21b.
The base image is `nvidia/cuda:13.0.2-devel-ubuntu24.04` (nvcc 13), but
vLLM nightly's bundled torch wheel is compiled against CUDA 12. During
the editable install, torch.utils.cpp_extension._check_cuda_version
aborts with:
RuntimeError: The detected CUDA version (13.0) mismatches the
version that was used to compile PyTorch (12.8).
Install nvidia-cuda-nvcc-cu12 (+ cccl/runtime headers) from PyPI and
point CUDA_HOME at that wheel for the build step only. Runtime still
uses the libcudart12 already present in the base image, and the rest of
the job keeps the system CUDA 13 environment untouched.
Signed-off-by: Samuel Shen <slshen@uchciago.edu>
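The shape of that override, roughly (PyPI package set as named above; the CUDA_HOME discovery is omitted here because the next two commits revise it):

```bash
# CUDA 12 build toolchain from PyPI: nvcc plus cccl/runtime headers.
uv pip install nvidia-cuda-nvcc-cu12 nvidia-cuda-cccl-cu12 nvidia-cuda-runtime-cu12

# Scope the override to the editable install only; runtime keeps the image's
# libcudart12, and the rest of the job keeps the system CUDA 13 toolchain.
# ($CUDA_HOME_CU12 discovery is shown in the follow-up commit below.)
CUDA_HOME="$CUDA_HOME_CU12" uv pip install -e .
```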
The previous commit tried `os.path.dirname(m.__file__)` to locate the
pip-installed CUDA 12 toolchain, but `nvidia.cuda_nvcc` ships as a PEP
420 namespace package (no __init__.py), so `__file__` is None and the
script crashed with:
TypeError: expected str, bytes or os.PathLike object, not NoneType
Use `__path__[0]` instead, and assert `$CUDA_HOME_CU12/bin/nvcc` is
executable before the LMCache build so any future layout change fails
loudly with a directory listing instead of a cryptic torch error.
Signed-off-by: Samuel Shen <slshen@uchciago.edu>
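A sketch of the corrected lookup plus the loud guard (exact script wording assumed):

```bash
# PEP 420 namespace package: __file__ is None, but __path__[0] still points
# at the installed directory of the wheel.
CUDA_HOME_CU12=$(python -c "import nvidia.cuda_nvcc as m; print(m.__path__[0])")

# Fail loudly, with a directory listing, if the wheel layout ever changes.
if [[ ! -x "$CUDA_HOME_CU12/bin/nvcc" ]]; then
  echo "nvcc missing or not executable under $CUDA_HOME_CU12" >&2
  ls -lR "$CUDA_HOME_CU12" >&2
  exit 1
fi
```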
…rdcoded path

Run LMCache#2301 showed nvidia-cuda-nvcc-cu12==12.9.86 installs with an empty bin/ directory — the sanity check fired correctly, but the wheel is unusable. Pin to 12.8.* to match torch's reported CUDA version (the original mismatch error said torch was built against CUDA 12.8).

Also stop hardcoding `$CUDA_HOME/bin/nvcc`: use `find` to locate the nvcc binary anywhere under the package root, then derive CUDA_HOME from its parent directory. This keeps us resilient if the wheel layout shifts again (e.g. moves to nvvm/bin/ or a versioned subfolder).

Signed-off-by: Samuel Shen <slshen@uchciago.edu>
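Pin-and-find, roughly (the find invocation is an assumption):

```bash
uv pip install 'nvidia-cuda-nvcc-cu12==12.8.*'  # align with torch's CUDA 12.8
PKG_ROOT=$(python -c "import nvidia.cuda_nvcc as m; print(m.__path__[0])")

# Locate nvcc wherever it lives under the package root (bin/, nvvm/bin/, ...).
NVCC=$(find "$PKG_ROOT" -type f -name nvcc | head -1)
[[ -n "$NVCC" ]] || { echo "no nvcc under $PKG_ROOT" >&2; ls -lR "$PKG_ROOT" >&2; exit 1; }

# CUDA_HOME is the directory containing nvcc's bin/ directory.
CUDA_HOME_CU12=$(dirname "$(dirname "$NVCC")")
```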
Run LMCache#2302 proved the pip wheel `nvidia-cuda-nvcc-cu12` does NOT ship the nvcc compiler driver — its bin/ directory contains only ptxas:

    /opt/venv/.../nvidia/cuda_nvcc/bin/ptxas (31MB, the only binary)

NVIDIA publishes nvcc only via apt / the runfile installer, not as a standalone pip wheel.

The base image is `nvidia/cuda:13.0.2-devel` and already has NVIDIA's apt repo configured, so install just `cuda-compiler-12-8` alongside the existing CUDA 13 toolchain. That gives us nvcc 12.8 at /usr/local/cuda-12.8/bin/nvcc without a 5GB full toolkit pull. Point CUDA_HOME at it only for the LMCache editable install; system nvcc 13 stays on PATH for everything else.

Guarded by an `-x` check so subsequent jobs on the same pod skip the apt-install when the compiler is already present (no-op on a fresh pod, fast on a reused one).

Signed-off-by: Samuel Shen <slshen@uchciago.edu>
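In outline (apt options assumed):

```bash
# Skip the install when a previous job on this pod already provisioned nvcc 12.8.
if [[ ! -x /usr/local/cuda-12.8/bin/nvcc ]]; then
  apt-get update
  apt-get install -y --no-install-recommends cuda-compiler-12-8
fi

# Use the 12.8 toolchain only for the LMCache editable install;
# system nvcc 13 stays first on PATH for everything else.
CUDA_HOME=/usr/local/cuda-12.8 uv pip install -e .
```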
Verified locally with the current vLLM nightly wheel
(2cdf86044d7e3bdcfbbb39a21a37b57ee9d4e65b): vLLM now ships
torch 2.11.0+cu130 (torch.version.cuda='13.0'), not cu128 as it did a
few hours ago. The hardcoded CUDA_HOME=/usr/local/cuda-12.8 override
from the previous commit turned the original mismatch into the inverse
error ('detected 12.8 mismatches PyTorch 13.0').
Replace the static override with a detect-and-heal block: read
torch.version.cuda, compare its major to the system nvcc major, and
only apt-install `cuda-compiler-<major>-<minor>` + set CUDA_HOME when
they actually disagree. When they agree (the common case on the CUDA
13 base image with a cu13 torch), the system nvcc is used directly.
Detection logic dry-run-verified against the real vLLM nightly wheel
in a clean venv before pushing.
Signed-off-by: Samuel Shen <slshen@uchciago.edu>
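The heal branch in outline (variable names follow the probe shown in the review excerpts below; the apt invocation is assumed):

```bash
# After the probe sets TORCH_CUDA{,_MAJOR,_MINOR} and SYS_NVCC_MAJOR:
if [[ -n "$TORCH_CUDA" && "$TORCH_CUDA_MAJOR" != "$SYS_NVCC_MAJOR" ]]; then
  # Majors disagree: fetch a matching compiler and point CUDA_HOME at it
  # for the build step only.
  apt-get update
  apt-get install -y "cuda-compiler-${TORCH_CUDA_MAJOR}-${TORCH_CUDA_MINOR}"
  export CUDA_HOME="/usr/local/cuda-${TORCH_CUDA_MAJOR}.${TORCH_CUDA_MINOR}"
fi
# When the majors agree (cu13 torch on the CUDA 13 image), the system nvcc
# is used directly and nothing is installed.
```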
Cursor Bugbot has reviewed your changes and found 2 potential issues.
There are 3 total unresolved issues (including 1 from previous review).
Reviewed by Cursor Bugbot for commit 65c0d1f.
    v = torch.version.cuda or ''
    parts = v.split('.') + ['0']
    print(v, parts[0], parts[1])
    ")
CPU-only torch guard broken by read whitespace collapsing
Medium Severity
When torch.version.cuda is None (CPU-only), `v` becomes '', making `parts = ['', '0']`. Then `print('', '', '0')` outputs "  0" — three space-separated values where the first two are empty strings. Bash `read` strips leading whitespace and collapses this into a single token, assigning `TORCH_CUDA="0"` instead of "". The `[[ -z "$TORCH_CUDA" ]]` guard on line 83 then fails to detect the CPU-only case, falling through to the mismatch branch with empty TORCH_CUDA_MAJOR/TORCH_CUDA_MINOR, producing a bogus `apt-get install cuda-compiler--` that fails the build.
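One hedged fix is to print with an explicit separator so the empty fields survive the round-trip into `read` (a sketch, not the committed code):

```bash
# '|' keeps the empty TORCH_CUDA field from being collapsed by IFS word
# splitting when torch.version.cuda is None (CPU-only torch).
IFS='|' read -r TORCH_CUDA TORCH_CUDA_MAJOR TORCH_CUDA_MINOR < <(python -c "
import torch
v = torch.version.cuda or ''
parts = v.split('.') + ['0']
print(v, parts[0], parts[1], sep='|')
")
```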
Reviewed by Cursor Bugbot for commit 65c0d1f.
    parts = v.split('.') + ['0']
    print(v, parts[0], parts[1])
    ")
    SYS_NVCC_MAJOR=$(nvcc --version 2>/dev/null | sed -n 's/.*release \([0-9]\+\).*/\1/p' | head -1)
pipefail crashes script when nvcc is absent
Low Severity
The script runs under `set -euo pipefail` (line 4). When nvcc is not on PATH, `nvcc --version` exits 127. With pipefail, the pipeline's exit status is 127. Since `set -e` does trigger on failed command substitutions in variable assignments, the script exits immediately at this line — before reaching the graceful handling on lines 83–103 that explicitly checks for an empty SYS_NVCC_MAJOR. Appending `|| true` to the assignment would let the downstream logic handle the missing-nvcc case as intended.
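That is, roughly:

```bash
# '|| true' applies to the whole pipeline, so a missing nvcc yields an empty
# SYS_NVCC_MAJOR instead of killing the script under set -euo pipefail.
SYS_NVCC_MAJOR=$(nvcc --version 2>/dev/null | sed -n 's/.*release \([0-9]\+\).*/\1/p' | head -1 || true)
```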
Reviewed by Cursor Bugbot for commit 65c0d1f.
…age (#3061)

The base image was bumped to nvidia/cuda:13.0.2-devel-ubuntu24.04 in #2981, but setup-env.sh installs vLLM from the generic nightly index (wheels.vllm.ai/nightly/vllm/), which resolves non-deterministically to either a cu128 or a cu130 torch wheel. When the resolver picks cu128, torch.utils.cpp_extension._check_cuda_version aborts the LMCache editable install with:

    RuntimeError: The detected CUDA version (13.0) mismatches the
    version that was used to compile PyTorch (12.8).

#3055 tried to paper over this at runtime by apt-installing cuda-compiler-12-8 on mismatch and pointing CUDA_HOME at /usr/local/cuda-12.8, but cuda-compiler-*-* only ships nvcc -- not the CUDA math-library dev headers (cusparse.h, cublas.h, etc.). The build then fails deeper with a cryptic:

    fatal error: cusparse.h: No such file or directory

Every k3 build on dev since 10fd636 (2026-04-16 09:55 UTC) has failed this way because vLLM nightly happens to be publishing cu128 torch today. Extending the apt install list is a band-aid; the real fix is to stop resolving torch non-deterministically.

Pin the install to vLLM's per-CUDA-major cu130 sub-index per the official vllm.ai install instructions. torch.version.cuda is now deterministically "13.0" and matches system nvcc 13, so _check_cuda_version passes with the system toolchain -- no CUDA_HOME override, no apt install, no HTML scraping.

Also drop the HTML index scraper (no longer needed with the proper --extra-index-url flags) and replace the runtime alignment block with a small Python sanity check that fails with a clear message if the pin ever drifts again, so future breakage surfaces here instead of inside ninja.

- Unblocks every k3 pipeline on dev.
- Removes ~40 lines of CUDA version-alignment logic.
- Keeps the pandas auto-heal loop from #3055 (that problem is independent of this one).

Signed-off-by: Yihua Cheng <yihua98@uchicago.edu>
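A sketch of the two pieces; the sub-index URL and the check's wording are assumptions, not verbatim from the script:

```bash
# Deterministic resolve: pull vLLM from the per-CUDA-major sub-index so
# torch is always cu130 (URL assumed from the generic index named above).
uv pip install vllm --extra-index-url https://wheels.vllm.ai/nightly/cu130

# Sanity check: fail with a readable message if the pin ever drifts.
python - <<'PY'
import sys
import torch
v = torch.version.cuda or ''
if not v.startswith('13.'):
    sys.exit(f"Pinned cu130 vLLM index resolved a non-cu130 torch: torch.version.cuda={v!r}")
PY
```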


Recent vLLM nightlies (>= 0.19.1rc1.dev325) eagerly `import pandas` from vllm/_aiter_ops.py without declaring pandas in their install deps. That causes `vllm serve` to fail at import time in our CI venv, so the k3 integration-test jobs time out waiting on port 8000.

Install pandas explicitly in setup-env.sh as a workaround until vLLM either makes the import lazy or adds pandas to their declared deps.
Note
Medium Risk
Touches CI bootstrap scripts and introduces apt-based CUDA compiler installation and dynamic pip installs, which can affect build determinism and fail in unexpected environments, but does not change runtime product code.
Overview
CI setup now auto-recovers from vLLM nightly undeclared runtime deps by probe-importing vLLM's CLI and `uv pip install`ing missing modules on `ModuleNotFoundError` (bounded retries, with logging).

The environment bootstrap also aligns the CUDA compiler toolchain by comparing `torch.version.cuda` with the system `nvcc` major version and, on mismatch, installing the matching `cuda-compiler-<major>-<minor>` via apt and using that `CUDA_HOME` when building LMCache.

Integration test harnesses were updated so `wait_for_server` can accept a log file and dump its tail on timeout; `run-integration.sh` now passes the vLLM log to surface startup errors directly in job output.

Reviewed by Cursor Bugbot for commit 65c0d1f.