
[Chore][CI] Upgrade CI base image to CUDA 13.0 #2981

Merged
sammshen merged 1 commit into LMCache:dev from sammshen:k3-cuda-130 on Apr 8, 2026
Conversation

@sammshen (Contributor) commented Apr 8, 2026

What this PR does / why we need it:

vLLM nightly now requires PyTorch 2.11.0, which is built against CUDA 13.0. Update the CI base image to match.

Special notes for your reviewers:

If applicable:

  • this PR contains user-facing changes - docs added
  • this PR contains unit tests

Note

Medium Risk
Medium risk because it changes the CUDA base image and runtime libraries used by CI pods, which can break GPU-dependent builds/tests if the new CUDA 13 environment differs from previous images.

Overview
Updates the K3s Buildkite harness CI base image to CUDA 13 by switching the Docker base from NVIDIA’s CUDA DL image to nvidia/cuda:13.0.2-devel-ubuntu24.04.

Adjusts image setup to install libcudart12 and run a generic ldconfig (removing the prior CUDA 12.9 compat path), aligning the CI environment with newer CUDA/PyTorch requirements.
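
For reference, a minimal sketch of the Dockerfile change this describes (not the actual diff; the old image tag is taken from the squashed commit message below, while the registry path and surrounding layout are assumptions):

    # Before (per the squashed commit message; registry path assumed):
    # FROM nvcr.io/nvidia/cuda-dl-base:25.04-cuda12.9
    # After: the plain CUDA 13.0 devel image
    FROM nvidia/cuda:13.0.2-devel-ubuntu24.04

    # Install libcudart12 for backward compatibility (vLLM's prebuilt _C
    # extension still links against libcudart.so.12), then refresh the
    # loader cache with a plain ldconfig -- the CUDA 12.9 compat path
    # used by the previous image no longer exists in this image.
    RUN apt-get update \
        && apt-get install -y --no-install-recommends libcudart12 \
        && rm -rf /var/lib/apt/lists/* \
        && ldconfig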

Reviewed by Cursor Bugbot for commit dcf764c. Bugbot is set up for automated code reviews on this repo.

@gemini-code-assist (Bot) left a comment


Code Review

This pull request updates the base Docker image to nvidia/cuda:13.0.2-devel-ubuntu24.04 and updates the ldconfig path to use the generic /usr/local/cuda/compat/ directory. I have no feedback to provide.

@maobaolong (Collaborator) left a comment


I was aware of this issue too; LGTM for this fix, thanks!

@sammshen sammshen enabled auto-merge (squash) April 8, 2026 18:35
@deng451e (Collaborator) commented Apr 8, 2026

Can we force-merge this PR? The current CI is blocking this CI fix, lol.

vLLM nightly now requires PyTorch 2.11.0 (CUDA 13.0). Update the CI
base image from cuda-dl-base:25.04-cuda12.9 to nvidia/cuda:13.0.2.
Install libcudart12 for backward compat since vLLM's compiled _C
extension still links against libcudart.so.12.

Signed-off-by: Samuel Shen <slshen@uchicago.edu>
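
As a quick way to see why libcudart12 is needed, one could inspect the compiled extension's dynamic dependencies inside the image. A hedged sketch (the extension path varies by Python version and install):

    # Locate vLLM's compiled _C extension and list what it links against.
    ext=$(python -c "import vllm._C as m; print(m.__file__)")
    ldd "$ext" | grep libcudart
    # Expected: "libcudart.so.12 => ..." -- provided by libcudart12, even
    # though the base image itself ships CUDA 13.

    # Confirm the loader can resolve it after ldconfig:
    ldconfig -p | grep libcudart.so.12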
@sammshen (Contributor, Author) commented Apr 8, 2026

Fixed, CI is back online.

@sammshen sammshen added the full Run comprehensive tests on this PR label Apr 8, 2026
@sammshen sammshen merged commit 8885a41 into LMCache:dev Apr 8, 2026
34 of 35 checks passed
Oasis-Git pushed a commit to Oasis-Git/LMCache that referenced this pull request Apr 13, 2026
vLLM nightly now requires PyTorch 2.11.0 which is built against
CUDA 13.0. Update the CI base image to match.

Signed-off-by: Samuel Shen <slshen@uchicago.edu>
Co-authored-by: Samuel Shen <slshen@uchicago.edu>
ApostaC added a commit to ApostaC/LMCache that referenced this pull request Apr 16, 2026 (same commit message as #3061 below)
sammshen pushed a commit that referenced this pull request Apr 16, 2026: …age (#3061)

The base image was bumped to nvidia/cuda:13.0.2-devel-ubuntu24.04 in
#2981, but setup-env.sh installs vLLM from the generic nightly index
(wheels.vllm.ai/nightly/vllm/), which resolves non-deterministically to
either a cu128 or a cu130 torch wheel. When the resolver picks cu128,
torch.utils.cpp_extension._check_cuda_version aborts the LMCache
editable install with:

    RuntimeError: The detected CUDA version (13.0) mismatches the
    version that was used to compile PyTorch (12.8).

#3055 tried to paper over this at runtime by apt-installing
cuda-compiler-12-8 on mismatch and pointing CUDA_HOME at
/usr/local/cuda-12.8, but cuda-compiler-*-* only ships nvcc -- not the
CUDA math-library dev headers (cusparse.h, cublas.h, etc.). The build
then fails deeper with a cryptic:

    fatal error: cusparse.h: No such file or directory

Every k3 build on dev since 10fd636 (2026-04-16 09:55 UTC) has failed
this way because vLLM nightly happens to be publishing cu128 torch
today. Extending the apt install list is a band-aid; the real fix is
to stop resolving torch non-deterministically.

Pin the install to vLLM's per-CUDA-major cu130 sub-index per the
official vllm.ai install instructions. torch.version.cuda is now
deterministically "13.0" and matches system nvcc 13, so
_check_cuda_version passes with the system toolchain -- no CUDA_HOME
override, no apt install, no HTML scraping.

Also drop the HTML index scraper (no longer needed with the proper
--extra-index-url flags) and replace the runtime alignment block with
a small Python sanity check that fails with a clear message if the
pin ever drifts again, so future breakage surfaces here instead of
inside ninja.

- Unblocks every k3 pipeline on dev.
- Removes ~40 lines of CUDA version-alignment logic.
- Keeps the pandas auto-heal loop from #3055 (that problem is
  independent of this one).

Signed-off-by: Yihua Cheng <yihua98@uchicago.edu>
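
The commit message does not reproduce the new setup-env.sh lines; a minimal sketch of the pinned install plus the sanity check it describes (the exact cu130 sub-index URL is an assumption inferred from the "per-CUDA-major sub-index" wording, not copied from the diff):

    # Pin vLLM nightly to the CUDA 13.0 sub-index so pip can no longer
    # resolve a cu128 torch wheel (index URL assumed, not from the diff).
    pip install --pre -U vllm --extra-index-url https://wheels.vllm.ai/nightly/cu130

    # Sanity check: fail fast with a clear message if the pin ever drifts.
    python -c 'import sys, torch; c = torch.version.cuda or ""; sys.exit(0 if c.startswith("13.") else "torch built for CUDA %r, expected 13.x -- the vLLM index pin has drifted" % c)'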

Labels

full Run comprehensive tests on this PR


4 participants