[Hotfix][CI] Unblock CI: pandas auto-heal + CUDA 12 build toolchain #3055
Conversation
Recent vLLM nightlies (>= 0.19.1rc1.dev325) eagerly `import pandas` from vllm/_aiter_ops.py without declaring pandas in their install deps. That causes `vllm serve` to fail at import time in our CI venv, so the k3 integration-test jobs time out waiting on port 8000.

Install pandas explicitly in setup-env.sh as a workaround until vLLM either makes the import lazy or adds pandas to their declared deps.

Signed-off-by: Samuel Shen <slshen@uchciago.edu>
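In setup-env.sh terms the workaround is a single line; its exact placement in the script is assumed here:

```bash
# Workaround for vLLM nightly >= 0.19.1rc1.dev325: vllm/_aiter_ops.py imports
# pandas at module load time without declaring it as a dependency.
# Remove once vLLM makes the import lazy or adds pandas to its deps.
uv pip install pandas
```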
Code Review
This pull request adds an explicit installation of pandas using uv pip in the .buildkite/k3_harness/setup-env.sh script. This serves as a workaround for recent vLLM nightly builds that require pandas but do not list it as a dependency. I have no feedback to provide.
…meout

Replace the explicit `pip install pandas` workaround with a general probe-and-retry loop: after installing vLLM, import its CLI entry point and auto-install any ModuleNotFoundError modules (capped at 5 retries). Self-heals the next time vLLM nightly adds another undeclared runtime dep, and every auto-install is logged so the drift is visible in the build output.

Also extend `wait_for_server` with an optional log_file argument: when the port-wait times out, tail the last 200 lines to stderr so the real failure (e.g. an ImportError at `vllm serve` startup) shows up inline in the Buildkite job instead of requiring a trip through build artifacts. Wire the integration-test harness to pass its per-test log path.

Signed-off-by: Samuel Shen <slshen@uchciago.edu>
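Sketched, such a loop could look like this; the probe module path and the error regex are assumptions, not the committed code (which is excerpted in the review below):

```bash
MAX_AUTO_INSTALL=5
for _ in $(seq 1 "$MAX_AUTO_INSTALL"); do
  # Probe: does vLLM's CLI entry point import cleanly?
  if err=$(python -c "import vllm.entrypoints.cli.main" 2>&1); then
    break  # environment is healthy, nothing to heal
  fi
  # Extract the module name from "ModuleNotFoundError: No module named 'x'".
  mod=$(sed -n "s/.*ModuleNotFoundError: No module named '\([^']*\)'.*/\1/p" <<<"$err" | head -1)
  if [[ -z "$mod" ]]; then
    echo "$err" >&2
    exit 1  # import failed for some reason other than a missing module
  fi
  echo "Auto-installing undeclared vLLM runtime dep: $mod" >&2
  uv pip install "$mod"
done
# Final probe: under set -e this fails the job if imports are still broken
# after the retry budget is spent.
python -c "import vllm.entrypoints.cli.main"
```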
| echo "Hit $MAX_AUTO_INSTALL auto-install retries; last missing module: $mod" >&2 | ||
| echo "$err" >&2 | ||
| exit 1 | ||
| fi |
Off-by-one: loop installs 4 modules, not 5
Low Severity
The guard `if [[ "$i" == "$MAX_AUTO_INSTALL" ]]` fires before `uv pip install`, so on the 5th iteration a missing module is detected but never installed — the script exits instead. Despite `MAX_AUTO_INSTALL=5`, only 4 modules can actually be auto-installed. The error message "Hit 5 auto-install retries" is also misleading since only 4 installs were attempted. Moving the guard after the install (or bumping the range to `MAX_AUTO_INSTALL + 1`) would fix both the count and the message.
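One hedged way to realize the range-bump suggestion (probe and extraction condensed to match the sketch earlier in the thread; not the committed code):

```bash
for i in $(seq 1 $((MAX_AUTO_INSTALL + 1))); do
  if err=$(python -c "import vllm.entrypoints.cli.main" 2>&1); then
    break  # import succeeded
  fi
  mod=$(sed -n "s/.*ModuleNotFoundError: No module named '\([^']*\)'.*/\1/p" <<<"$err" | head -1)
  [[ -n "$mod" ]] || { echo "$err" >&2; exit 1; }
  # Guard now permits exactly MAX_AUTO_INSTALL installs: it fires only on
  # the extra (6th) pass, after the 5th install failed to fix the import.
  if [[ "$i" -gt "$MAX_AUTO_INSTALL" ]]; then
    echo "Hit $MAX_AUTO_INSTALL auto-install retries; last missing module: $mod" >&2
    echo "$err" >&2
    exit 1
  fi
  uv pip install "$mod"
done
```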
Reviewed by Cursor Bugbot for commit 123a21b.
The base image is `nvidia/cuda:13.0.2-devel-ubuntu24.04` (nvcc 13), but
vLLM nightly's bundled torch wheel is compiled against CUDA 12. During
the editable install, torch.utils.cpp_extension._check_cuda_version
aborts with:
RuntimeError: The detected CUDA version (13.0) mismatches the
version that was used to compile PyTorch (12.8).
Install nvidia-cuda-nvcc-cu12 (+ cccl/runtime headers) from PyPI and
point CUDA_HOME at that wheel for the build step only. Runtime still
uses the libcudart12 already present in the base image, and the rest of
the job keeps the system CUDA 13 environment untouched.
Signed-off-by: Samuel Shen <slshen@uchciago.edu>
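The shape of that override, roughly (PyPI package set as named above; the CUDA_HOME discovery is omitted here because the next two commits revise it):

```bash
# CUDA 12 build toolchain from PyPI: nvcc plus cccl/runtime headers.
uv pip install nvidia-cuda-nvcc-cu12 nvidia-cuda-cccl-cu12 nvidia-cuda-runtime-cu12

# Scope the override to the editable install only; runtime keeps the image's
# libcudart12, and the rest of the job keeps the system CUDA 13 toolchain.
# ($CUDA_HOME_CU12 discovery is shown in the follow-up commit below.)
CUDA_HOME="$CUDA_HOME_CU12" uv pip install -e .
```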
The previous commit tried `os.path.dirname(m.__file__)` to locate the
pip-installed CUDA 12 toolchain, but `nvidia.cuda_nvcc` ships as a PEP
420 namespace package (no __init__.py), so `__file__` is None and the
script crashed with:
TypeError: expected str, bytes or os.PathLike object, not NoneType
Use `__path__[0]` instead, and assert `$CUDA_HOME_CU12/bin/nvcc` is
executable before the LMCache build so any future layout change fails
loudly with a directory listing instead of a cryptic torch error.
Signed-off-by: Samuel Shen <slshen@uchciago.edu>
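A sketch of the corrected lookup plus the loud guard (exact script wording assumed):

```bash
# PEP 420 namespace package: __file__ is None, but __path__[0] still points
# at the installed directory of the wheel.
CUDA_HOME_CU12=$(python -c "import nvidia.cuda_nvcc as m; print(m.__path__[0])")

# Fail loudly, with a directory listing, if the wheel layout ever changes.
if [[ ! -x "$CUDA_HOME_CU12/bin/nvcc" ]]; then
  echo "nvcc missing or not executable under $CUDA_HOME_CU12" >&2
  ls -lR "$CUDA_HOME_CU12" >&2
  exit 1
fi
```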
…rdcoded path

Run LMCache#2301 showed nvidia-cuda-nvcc-cu12==12.9.86 installs with an empty bin/ directory — the sanity check fired correctly, but the wheel is unusable. Pin to 12.8.* to match torch's reported CUDA version (the original mismatch error said torch was built against CUDA 12.8).

Also stop hardcoding `$CUDA_HOME/bin/nvcc`: use `find` to locate the nvcc binary anywhere under the package root, then derive CUDA_HOME from its parent directory. This keeps us resilient if the wheel layout shifts again (e.g. moves to nvvm/bin/ or a versioned subfolder).

Signed-off-by: Samuel Shen <slshen@uchciago.edu>
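Pin-and-find, roughly (the find invocation is an assumption):

```bash
uv pip install 'nvidia-cuda-nvcc-cu12==12.8.*'  # align with torch's CUDA 12.8
PKG_ROOT=$(python -c "import nvidia.cuda_nvcc as m; print(m.__path__[0])")

# Locate nvcc wherever it lives under the package root (bin/, nvvm/bin/, ...).
NVCC=$(find "$PKG_ROOT" -type f -name nvcc | head -1)
[[ -n "$NVCC" ]] || { echo "no nvcc under $PKG_ROOT" >&2; ls -lR "$PKG_ROOT" >&2; exit 1; }

# CUDA_HOME is the directory containing nvcc's bin/ directory.
CUDA_HOME_CU12=$(dirname "$(dirname "$NVCC")")
```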
Run LMCache#2302 proved the pip wheel `nvidia-cuda-nvcc-cu12` does NOT ship the nvcc compiler driver — its bin/ directory contains only ptxas:

    /opt/venv/.../nvidia/cuda_nvcc/bin/ptxas (31MB, the only binary)

NVIDIA publishes nvcc only via apt / the runfile installer, not as a standalone pip wheel.

The base image is `nvidia/cuda:13.0.2-devel` and already has NVIDIA's apt repo configured, so install just `cuda-compiler-12-8` alongside the existing CUDA 13 toolchain. That gives us nvcc 12.8 at /usr/local/cuda-12.8/bin/nvcc without a 5GB full toolkit pull. Point CUDA_HOME at it only for the LMCache editable install; system nvcc 13 stays on PATH for everything else.

Guarded by an `-x` check so subsequent jobs on the same pod skip the apt-install when the compiler is already present (no-op on a fresh pod, fast on a reused one).

Signed-off-by: Samuel Shen <slshen@uchciago.edu>
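In outline (apt options assumed):

```bash
# Skip the install when a previous job on this pod already provisioned nvcc 12.8.
if [[ ! -x /usr/local/cuda-12.8/bin/nvcc ]]; then
  apt-get update
  apt-get install -y --no-install-recommends cuda-compiler-12-8
fi

# Use the 12.8 toolchain only for the LMCache editable install;
# system nvcc 13 stays first on PATH for everything else.
CUDA_HOME=/usr/local/cuda-12.8 uv pip install -e .
```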
Verified locally with the current vLLM nightly wheel
(2cdf86044d7e3bdcfbbb39a21a37b57ee9d4e65b): vLLM now ships
torch 2.11.0+cu130 (torch.version.cuda='13.0'), not cu128 as it did a
few hours ago. The hardcoded CUDA_HOME=/usr/local/cuda-12.8 override
from the previous commit turned the original mismatch into the inverse
error ('detected 12.8 mismatches PyTorch 13.0').
Replace the static override with a detect-and-heal block: read
torch.version.cuda, compare its major to the system nvcc major, and
only apt-install `cuda-compiler-<major>-<minor>` + set CUDA_HOME when
they actually disagree. When they agree (the common case on the CUDA
13 base image with a cu13 torch), the system nvcc is used directly.
Detection logic dry-run-verified against the real vLLM nightly wheel
in a clean venv before pushing.
Signed-off-by: Samuel Shen <slshen@uchciago.edu>
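The heal branch in outline (variable names follow the probe shown in the review excerpts below; the apt invocation is assumed):

```bash
# After the probe sets TORCH_CUDA{,_MAJOR,_MINOR} and SYS_NVCC_MAJOR:
if [[ -n "$TORCH_CUDA" && "$TORCH_CUDA_MAJOR" != "$SYS_NVCC_MAJOR" ]]; then
  # Majors disagree: fetch a matching compiler and point CUDA_HOME at it
  # for the build step only.
  apt-get update
  apt-get install -y "cuda-compiler-${TORCH_CUDA_MAJOR}-${TORCH_CUDA_MINOR}"
  export CUDA_HOME="/usr/local/cuda-${TORCH_CUDA_MAJOR}.${TORCH_CUDA_MINOR}"
fi
# When the majors agree (cu13 torch on the CUDA 13 image), the system nvcc
# is used directly and nothing is installed.
```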
Cursor Bugbot has reviewed your changes and found 2 potential issues.
There are 3 total unresolved issues (including 1 from previous review).
Reviewed by Cursor Bugbot for commit 65c0d1f.
    v = torch.version.cuda or ''
    parts = v.split('.') + ['0']
    print(v, parts[0], parts[1])
    ")
CPU-only torch guard broken by read whitespace collapsing
Medium Severity
When torch.version.cuda is None (CPU-only), `v` becomes '', making `parts = ['', '0']`. Then `print('', '', '0')` outputs "  0" — three space-separated values where the first two are empty strings. Bash `read` strips leading whitespace and collapses this into a single token, assigning `TORCH_CUDA="0"` instead of "". The `[[ -z "$TORCH_CUDA" ]]` guard on line 83 then fails to detect the CPU-only case, falling through to the mismatch branch with empty TORCH_CUDA_MAJOR/TORCH_CUDA_MINOR, producing a bogus `apt-get install cuda-compiler--` that fails the build.
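One hedged fix is to print with an explicit separator so the empty fields survive the round-trip into `read` (a sketch, not the committed code):

```bash
# '|' keeps the empty TORCH_CUDA field from being collapsed by IFS word
# splitting when torch.version.cuda is None (CPU-only torch).
IFS='|' read -r TORCH_CUDA TORCH_CUDA_MAJOR TORCH_CUDA_MINOR < <(python -c "
import torch
v = torch.version.cuda or ''
parts = v.split('.') + ['0']
print(v, parts[0], parts[1], sep='|')
")
```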
Reviewed by Cursor Bugbot for commit 65c0d1f.
    parts = v.split('.') + ['0']
    print(v, parts[0], parts[1])
    ")
    SYS_NVCC_MAJOR=$(nvcc --version 2>/dev/null | sed -n 's/.*release \([0-9]\+\).*/\1/p' | head -1)
pipefail crashes script when nvcc is absent
Low Severity
The script runs under `set -euo pipefail` (line 4). When nvcc is not on PATH, `nvcc --version` exits 127. With pipefail, the pipeline's exit status is 127. Since `set -e` does trigger on failed command substitutions in variable assignments, the script exits immediately at this line — before reaching the graceful handling on lines 83–103 that explicitly checks for an empty SYS_NVCC_MAJOR. Appending `|| true` to the assignment would let the downstream logic handle the missing-nvcc case as intended.
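That is, roughly:

```bash
# '|| true' applies to the whole pipeline, so a missing nvcc yields an empty
# SYS_NVCC_MAJOR instead of killing the script under set -euo pipefail.
SYS_NVCC_MAJOR=$(nvcc --version 2>/dev/null | sed -n 's/.*release \([0-9]\+\).*/\1/p' | head -1 || true)
```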
Reviewed by Cursor Bugbot for commit 65c0d1f.
…age (#3061)

The base image was bumped to nvidia/cuda:13.0.2-devel-ubuntu24.04 in #2981, but setup-env.sh installs vLLM from the generic nightly index (wheels.vllm.ai/nightly/vllm/), which resolves non-deterministically to either a cu128 or a cu130 torch wheel. When the resolver picks cu128, torch.utils.cpp_extension._check_cuda_version aborts the LMCache editable install with:

    RuntimeError: The detected CUDA version (13.0) mismatches the
    version that was used to compile PyTorch (12.8).

#3055 tried to paper over this at runtime by apt-installing cuda-compiler-12-8 on mismatch and pointing CUDA_HOME at /usr/local/cuda-12.8, but cuda-compiler-*-* only ships nvcc -- not the CUDA math-library dev headers (cusparse.h, cublas.h, etc.). The build then fails deeper with a cryptic:

    fatal error: cusparse.h: No such file or directory

Every k3 build on dev since 10fd636 (2026-04-16 09:55 UTC) has failed this way because vLLM nightly happens to be publishing cu128 torch today. Extending the apt install list is a band-aid; the real fix is to stop resolving torch non-deterministically.

Pin the install to vLLM's per-CUDA-major cu130 sub-index per the official vllm.ai install instructions. torch.version.cuda is now deterministically "13.0" and matches system nvcc 13, so _check_cuda_version passes with the system toolchain -- no CUDA_HOME override, no apt install, no HTML scraping.

Also drop the HTML index scraper (no longer needed with the proper --extra-index-url flags) and replace the runtime alignment block with a small Python sanity check that fails with a clear message if the pin ever drifts again, so future breakage surfaces here instead of inside ninja.

- Unblocks every k3 pipeline on dev.
- Removes ~40 lines of CUDA version-alignment logic.
- Keeps the pandas auto-heal loop from #3055 (that problem is independent of this one).

Signed-off-by: Yihua Cheng <yihua98@uchicago.edu>
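A sketch of the two pieces; the sub-index URL and the check's wording are assumptions, not verbatim from the script:

```bash
# Deterministic resolve: pull vLLM from the per-CUDA-major sub-index so
# torch is always cu130 (URL assumed from the generic index named above).
uv pip install vllm --extra-index-url https://wheels.vllm.ai/nightly/cu130

# Sanity check: fail with a readable message if the pin ever drifts.
python - <<'PY'
import sys
import torch
v = torch.version.cuda or ''
if not v.startswith('13.'):
    sys.exit(f"Pinned cu130 vLLM index resolved a non-cu130 torch: torch.version.cuda={v!r}")
PY
```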


Recent vLLM nightlies (>= 0.19.1rc1.dev325) eagerly `import pandas` from vllm/_aiter_ops.py without declaring pandas in their install deps. That causes `vllm serve` to fail at import time in our CI venv, so the k3 integration-test jobs time out waiting on port 8000.

Install pandas explicitly in setup-env.sh as a workaround until vLLM either makes the import lazy or adds pandas to their declared deps.
Note
Medium Risk
Touches CI bootstrap scripts and introduces apt-based CUDA compiler installation and dynamic pip installs, which can affect build determinism and fail in unexpected environments, but does not change runtime product code.
Overview
CI setup now auto-recovers from vLLM nightly undeclared runtime deps by probe-importing vLLM's CLI and `uv pip install`ing missing modules on `ModuleNotFoundError` (bounded retries, with logging).

The environment bootstrap also aligns the CUDA compiler toolchain by comparing `torch.version.cuda` with the system `nvcc` major version and, on mismatch, installing the matching `cuda-compiler-<major>-<minor>` via apt and using that `CUDA_HOME` when building LMCache.

Integration test harnesses were updated so `wait_for_server` can accept a log file and dump its tail on timeout; `run-integration.sh` now passes the vLLM log to surface startup errors directly in job output.

Reviewed by Cursor Bugbot for commit 65c0d1f.