
[Hotfix][CI] Unblock CI: pandas auto-heal + CUDA 12 build toolchain #3055

Merged

ApostaC merged 7 commits into LMCache:dev from sammshen:fix/ci-install-pandas-for-vllm-nightly on Apr 16, 2026
Conversation

@sammshen (Contributor) commented Apr 16, 2026

Recent vLLM nightlies (>= 0.19.1rc1.dev325) eagerly `import pandas` from vllm/_aiter_ops.py without declaring pandas in their install deps. That causes `vllm serve` to fail at import time in our CI venv, so the k3 integration-test jobs time out waiting on port 8000.

Install pandas explicitly in setup-env.sh as a workaround until vLLM either makes the import lazy or adds pandas to their declared deps.
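
Concretely, the workaround is a one-liner in the bootstrap script (a sketch; its exact position inside setup-env.sh is an assumption):

    # .buildkite/k3_harness/setup-env.sh, after the vLLM nightly install
    uv pip install pandas  # vLLM >= 0.19.1rc1.dev325 imports it eagerly at startup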


Note: Medium Risk

Touches CI bootstrap scripts and introduces apt-based CUDA compiler installation and dynamic pip installs, which can affect build determinism and fail in unexpected environments, but does not change runtime product code.

Overview
CI setup now auto-recovers from undeclared runtime deps in vLLM nightlies: it probe-imports vLLM's CLI and, on ModuleNotFoundError, installs the missing module via uv pip (bounded retries, with logging).
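
A minimal sketch of that loop, assuming `vllm --help` as the probe command and reusing the `MAX_AUTO_INSTALL`/`mod`/`err` names from the snippet quoted later in this thread; everything else is illustrative, not the script verbatim:

    MAX_AUTO_INSTALL=5
    for _ in $(seq 1 "$MAX_AUTO_INSTALL"); do
      # Probe vLLM's CLI; success means all runtime deps import cleanly.
      if err=$(vllm --help 2>&1); then
        break
      fi
      # Pull the missing module name out of the ModuleNotFoundError text.
      mod=$(sed -n "s/.*ModuleNotFoundError: No module named '\([^']*\)'.*/\1/p" <<<"$err" | head -1)
      if [[ -z "$mod" ]]; then
        echo "$err" >&2
        exit 1  # some other failure; don't retry blindly
      fi
      echo "Auto-installing undeclared vLLM dep: $mod" >&2
      uv pip install "$mod"
    done
    # Re-probe after the loop so exhausting the retries fails the job loudly.
    vllm --help >/dev/null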

The environment bootstrap also aligns the CUDA compiler toolchain by comparing torch.version.cuda with the system nvcc major version and, on mismatch, installing the matching cuda-compiler-<major>-<minor> via apt and using that CUDA_HOME when building LMCache.
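
In shape, the alignment block could look like this sketch (variable names are assumptions; it also folds in the two fixes Bugbot suggests below, a sentinel so `read` keeps empty fields and `|| true` so a missing nvcc does not trip pipefail):

    # Ask torch for its CUDA version; '-' sentinels keep `read` from
    # collapsing empty fields when torch.version.cuda is None (CPU-only).
    TORCH_INFO=$(python -c "import torch; v = torch.version.cuda or ''; p = v.split('.') + ['0']; print(v or '-', p[0] or '-', p[1])")
    read -r TORCH_CUDA TORCH_MAJOR TORCH_MINOR <<<"$TORCH_INFO"
    # `|| true`: a missing nvcc must not kill the script under pipefail.
    SYS_NVCC_MAJOR=$(nvcc --version 2>/dev/null | sed -n 's/.*release \([0-9]\+\).*/\1/p' | head -1 || true)
    if [[ "$TORCH_MAJOR" != "-" && "$TORCH_MAJOR" != "$SYS_NVCC_MAJOR" ]]; then
      apt-get update && apt-get install -y "cuda-compiler-${TORCH_MAJOR}-${TORCH_MINOR}"
      export CUDA_HOME="/usr/local/cuda-${TORCH_MAJOR}.${TORCH_MINOR}"
    fi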

Integration test harnesses were updated so wait_for_server can accept a log file and dump its tail on timeout; run-integration.sh now passes the vLLM log to surface startup errors directly in job output.
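
A hypothetical sketch of the extended helper; the `/health` probe, poll interval, and default timeout are assumptions, while the 200-line tail matches the commit message:

    wait_for_server() {
      local port=$1 timeout_s=${2:-300} log_file=${3:-}
      local deadline=$((SECONDS + timeout_s))
      while (( SECONDS < deadline )); do
        # Assumed readiness probe: vLLM exposes /health once the server is up.
        curl -sf "http://localhost:${port}/health" >/dev/null && return 0
        sleep 2
      done
      echo "Timed out after ${timeout_s}s waiting on port ${port}" >&2
      if [[ -n "$log_file" && -f "$log_file" ]]; then
        echo "--- last 200 lines of ${log_file} ---" >&2
        tail -n 200 "$log_file" >&2
      fi
      return 1
    }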

Reviewed by Cursor Bugbot for commit 65c0d1f.


@gemini-code-assist (Bot) left a comment

Code Review

This pull request adds an explicit installation of pandas using uv pip in the .buildkite/k3_harness/setup-env.sh script. This serves as a workaround for recent vLLM nightly builds that require pandas but do not list it as a dependency. I have no feedback to provide.

…meout

Replace the explicit `pip install pandas` workaround with a general
probe-and-retry loop: after installing vLLM, import its CLI entry point
and auto-install any ModuleNotFoundError modules (capped at 5 retries).
Self-heals the next time vLLM nightly adds another undeclared runtime
dep, and every auto-install is logged so the drift is visible in the
build output.

Also extend `wait_for_server` with an optional log_file argument: when
the port-wait times out, tail the last 200 lines to stderr so the real
failure (e.g. an ImportError at vllm serve startup) shows up inline in
the Buildkite job instead of requiring a trip through build artifacts.
Wire the integration-test harness to pass its per-test log path.

Signed-off-by: Samuel Shen <slshen@uchicago.edu>
echo "Hit $MAX_AUTO_INSTALL auto-install retries; last missing module: $mod" >&2
echo "$err" >&2
exit 1
fi
Off-by-one: loop installs 4 modules, not 5

Low Severity

The guard if [[ "$i" == "$MAX_AUTO_INSTALL" ]] fires before uv pip install, so on the 5th iteration a missing module is detected but never installed — the script exits instead. Despite MAX_AUTO_INSTALL=5, only 4 modules can actually be auto-installed. The error message "Hit 5 auto-install retries" is also misleading since only 4 installs were attempted. Moving the guard after the install (or bumping the range to MAX_AUTO_INSTALL + 1) would fix both the count and the message.
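
Sketch of the first suggested fix, with a re-probe added so a successful fifth install is not misreported as a failure (the probe command is an assumption):

    uv pip install "$mod"
    # Guard now runs after the install, so all $MAX_AUTO_INSTALL installs
    # actually happen; re-probe so a successful last install still passes.
    if [[ "$i" == "$MAX_AUTO_INSTALL" ]] && ! vllm --help >/dev/null 2>&1; then
      echo "Hit $MAX_AUTO_INSTALL auto-install retries; last missing module: $mod" >&2
      exit 1
    fi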

Reviewed by Cursor Bugbot for commit 123a21b.

@sammshen changed the title from "[Chore][CI] Install pandas in setup-env to unblock vLLM nightly" to "[Chore][CI] Install pandas in setup-env to unblock vLLM nightly and prevent dep change breaks in the future" on Apr 16, 2026
@sammshen changed the title from "[Chore][CI] Install pandas in setup-env to unblock vLLM nightly and prevent dep change breaks in the future" to "[Hotfix][CI] Install pandas in setup-env to unblock vLLM nightly and prevent dep change breaks in the future" on Apr 16, 2026
The base image is `nvidia/cuda:13.0.2-devel-ubuntu24.04` (nvcc 13), but
vLLM nightly's bundled torch wheel is compiled against CUDA 12. During
the editable install, torch.utils.cpp_extension._check_cuda_version
aborts with:

    RuntimeError: The detected CUDA version (13.0) mismatches the
    version that was used to compile PyTorch (12.8).

Install nvidia-cuda-nvcc-cu12 (+ cccl/runtime headers) from PyPI and
point CUDA_HOME at that wheel for the build step only. Runtime still
uses the libcudart12 already present in the base image, and the rest of
the job keeps the system CUDA 13 environment untouched.

Signed-off-by: Samuel Shen <slshen@uchicago.edu>
@sammshen changed the title from "[Hotfix][CI] Install pandas in setup-env to unblock vLLM nightly and prevent dep change breaks in the future" to "[Hotfix][CI] Install pandas in setup-env to unblock vLLM nightly and prevent dep change breaks in the future and fix CUDA version" on Apr 16, 2026
@sammshen changed the title from "[Hotfix][CI] Install pandas in setup-env to unblock vLLM nightly and prevent dep change breaks in the future and fix CUDA version" to "[Chore][CI] Unblock CI: pandas auto-heal + CUDA 12 build toolchain" on Apr 16, 2026
@ApostaC enabled auto-merge (squash) on April 16, 2026 07:09
The previous commit tried `os.path.dirname(m.__file__)` to locate the
pip-installed CUDA 12 toolchain, but `nvidia.cuda_nvcc` ships as a PEP
420 namespace package (no __init__.py), so `__file__` is None and the
script crashed with:

    TypeError: expected str, bytes or os.PathLike object, not NoneType

Use `__path__[0]` instead, and assert `$CUDA_HOME_CU12/bin/nvcc` is
executable before the LMCache build so any future layout change fails
loudly with a directory listing instead of a cryptic torch error.
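
Sketched from the commit text (the `CUDA_HOME_CU12` name matches the commit; the surrounding script is an assumption):

    # nvidia.cuda_nvcc is a PEP 420 namespace package: __file__ is None,
    # but __path__ still points at the installed directory.
    CUDA_HOME_CU12=$(python -c "import nvidia.cuda_nvcc as m; print(m.__path__[0])")
    if [[ ! -x "$CUDA_HOME_CU12/bin/nvcc" ]]; then
      ls -R "$CUDA_HOME_CU12" >&2  # fail loudly with the actual wheel layout
      exit 1
    fi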

Signed-off-by: Samuel Shen <slshen@uchicago.edu>
@sammshen changed the title from "[Chore][CI] Unblock CI: pandas auto-heal + CUDA 12 build toolchain" to "[Hotfix][CI] Unblock CI: pandas auto-heal + CUDA 12 build toolchain" on Apr 16, 2026
@github-actions (Bot) added the "full" (Run comprehensive tests on this PR) label on Apr 16, 2026
Samuel Shen added 3 commits April 16, 2026 03:13
…rdcoded path

Run LMCache#2301 showed nvidia-cuda-nvcc-cu12==12.9.86 installs with an empty
bin/ directory — the sanity check fired correctly, but the wheel is
unusable. Pin to 12.8.* to match torch's reported CUDA version (the
original mismatch error said torch was built against CUDA 12.8).

Also stop hardcoding `$CUDA_HOME/bin/nvcc`: use `find` to locate the
nvcc binary anywhere under the package root, then derive CUDA_HOME from
its parent directory. This keeps us resilient if the wheel layout
shifts again (e.g. moves to nvvm/bin/ or a versioned subfolder).
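
Roughly, as a sketch (`PKG_ROOT` is a stand-in for however the script resolves the wheel's install directory):

    # Find nvcc anywhere under the package root; derive CUDA_HOME so that
    # <CUDA_HOME>/bin/nvcc holds even if the layout shifts again.
    NVCC_PATH=$(find "$PKG_ROOT" -type f -name nvcc | head -1)
    [[ -n "$NVCC_PATH" ]] || { echo "no nvcc under $PKG_ROOT" >&2; exit 1; }
    CUDA_HOME=$(dirname "$(dirname "$NVCC_PATH")")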

Signed-off-by: Samuel Shen <slshen@uchicago.edu>
Run LMCache#2302 proved the pip wheel `nvidia-cuda-nvcc-cu12` does NOT ship the
nvcc compiler driver — its bin/ directory contains only ptxas:

    /opt/venv/.../nvidia/cuda_nvcc/bin/ptxas  (31MB, the only binary)

NVIDIA publishes nvcc only via apt / the runfile installer, not as a
standalone pip wheel. The base image is `nvidia/cuda:13.0.2-devel` and
already has NVIDIA's apt repo configured, so install just
`cuda-compiler-12-8` alongside the existing CUDA 13 toolchain. That
gives us nvcc 12.8 at /usr/local/cuda-12.8/bin/nvcc without a 5GB full
toolkit pull. Point CUDA_HOME at it only for the LMCache editable
install; system nvcc 13 stays on PATH for everything else.

Guarded by an `-x` check so subsequent jobs on the same pod skip the
apt-install when the compiler is already present (no-op on a fresh
pod, fast on a reused one).
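
As a sketch (the editable-install command is an assumption; the paths and package name come from the commit):

    # Skip the apt work when a previous job on this pod already installed it.
    if [[ ! -x /usr/local/cuda-12.8/bin/nvcc ]]; then
      apt-get update && apt-get install -y cuda-compiler-12-8
    fi
    # CUDA 12.8 for the LMCache build only; system nvcc 13 stays on PATH.
    CUDA_HOME=/usr/local/cuda-12.8 uv pip install -e .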

Signed-off-by: Samuel Shen <slshen@uchicago.edu>
Verified locally with the current vLLM nightly wheel
(2cdf86044d7e3bdcfbbb39a21a37b57ee9d4e65b): vLLM now ships
torch 2.11.0+cu130 (torch.version.cuda='13.0'), not cu128 as it did a
few hours ago. The hardcoded CUDA_HOME=/usr/local/cuda-12.8 override
from the previous commit turned the original mismatch into the inverse
error ('detected 12.8 mismatches PyTorch 13.0').

Replace the static override with a detect-and-heal block: read
torch.version.cuda, compare its major to the system nvcc major, and
only apt-install `cuda-compiler-<major>-<minor>` + set CUDA_HOME when
they actually disagree. When they agree (the common case on the CUDA
13 base image with a cu13 torch), the system nvcc is used directly.

Detection logic dry-run-verified against the real vLLM nightly wheel
in a clean venv before pushing.

Signed-off-by: Samuel Shen <slshen@uchicago.edu>

@cursor (Bot) left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

There are 3 total unresolved issues (including 1 from previous review).


Reviewed by Cursor Bugbot for commit 65c0d1f.

    v = torch.version.cuda or ''
    parts = v.split('.') + ['0']
    print(v, parts[0], parts[1])
    ")

CPU-only torch guard broken by read whitespace collapsing

Medium Severity

When torch.version.cuda is None (CPU-only), v becomes '', making parts = ['', '0']. Then print('', '', '0') outputs " 0" — three space-separated values where the first two are empty strings. Bash read strips leading whitespace and collapses this into a single token, assigning TORCH_CUDA="0" instead of "". The [[ -z "$TORCH_CUDA" ]] guard on line 83 then fails to detect the CPU-only case, falling through to the mismatch branch with empty TORCH_CUDA_MAJOR/TORCH_CUDA_MINOR, producing a bogus apt-get install cuda-compiler-- that fails the build.
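
One possible fix (a sketch, not the script's actual code): print a sentinel instead of empty strings so `read` always sees three non-empty fields, then test for the sentinel in bash:

    v = torch.version.cuda or ''
    parts = v.split('.') + ['0']
    print(v or 'none', parts[0] or 'none', parts[1])
    # bash side: [[ "$TORCH_CUDA" == "none" ]] detects the CPU-only case.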


    parts = v.split('.') + ['0']
    print(v, parts[0], parts[1])
    ")
    SYS_NVCC_MAJOR=$(nvcc --version 2>/dev/null | sed -n 's/.*release \([0-9]\+\).*/\1/p' | head -1)

pipefail crashes script when nvcc is absent

Low Severity

The script runs under set -euo pipefail (line 4). When nvcc is not on PATH, nvcc --version exits 127. With pipefail, the pipeline's exit status is 127. Since set -e does trigger on failed command substitutions in variable assignments, the script exits immediately at this line — before reaching the graceful handling on lines 83–103 that explicitly checks for an empty SYS_NVCC_MAJOR. Appending || true to the assignment would let the downstream logic handle the missing-nvcc case as intended.
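
The suggested one-liner, sketched:

    # `|| true` keeps a missing nvcc from killing the script under pipefail;
    # SYS_NVCC_MAJOR stays empty and the explicit check below handles it.
    SYS_NVCC_MAJOR=$(nvcc --version 2>/dev/null | sed -n 's/.*release \([0-9]\+\).*/\1/p' | head -1 || true)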


@ApostaC merged commit 10fd636 into LMCache:dev on Apr 16, 2026
38 checks passed
ApostaC added a commit to ApostaC/LMCache that referenced this pull request Apr 16, 2026
The base image was bumped to nvidia/cuda:13.0.2-devel-ubuntu24.04 in
LMCache#2981, but setup-env.sh installs vLLM from the generic nightly index
(wheels.vllm.ai/nightly/vllm/), which resolves non-deterministically to
either a cu128 or a cu130 torch wheel. When the resolver picks cu128,
torch.utils.cpp_extension._check_cuda_version aborts the LMCache
editable install with:

    RuntimeError: The detected CUDA version (13.0) mismatches the
    version that was used to compile PyTorch (12.8).

LMCache#3055 tried to paper over this at runtime by apt-installing
cuda-compiler-12-8 on mismatch and pointing CUDA_HOME at
/usr/local/cuda-12.8, but cuda-compiler-*-* only ships nvcc -- not the
CUDA math-library dev headers (cusparse.h, cublas.h, etc.). The build
then fails deeper with a cryptic:

    fatal error: cusparse.h: No such file or directory

Every k3 build on dev since 10fd636 (2026-04-16 09:55 UTC) has failed
this way because vLLM nightly happens to be publishing cu128 torch
today. Extending the apt install list is a band-aid; the real fix is
to stop resolving torch non-deterministically.

Pin the install to vLLM's per-CUDA-major cu130 sub-index per the
official vllm.ai install instructions. torch.version.cuda is now
deterministically "13.0" and matches system nvcc 13, so
_check_cuda_version passes with the system toolchain -- no CUDA_HOME
override, no apt install, no HTML scraping.

Also drop the HTML index scraper (no longer needed with the proper
--extra-index-url flags) and replace the runtime alignment block with
a small Python sanity check that fails with a clear message if the
pin ever drifts again, so future breakage surfaces here instead of
inside ninja.
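
A sketch of what that sanity check might look like (the expected "13." prefix comes from the commit; the wording and placement are assumptions):

    # Fail fast with a readable message if the torch/CUDA pin ever drifts.
    python -c "
    import torch
    v = torch.version.cuda or 'none'
    assert v.startswith('13.'), 'expected cu130 torch from the pinned vLLM index, got CUDA ' + v
    "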

- Unblocks every k3 pipeline on dev.
- Removes ~40 lines of CUDA version-alignment logic.
- Keeps the pandas auto-heal loop from LMCache#3055 (that problem is
  independent of this one).

Signed-off-by: Yihua Cheng <yihua98@uchicago.edu>
sammshen pushed a commit that referenced this pull request Apr 16, 2026
…age (#3061)
