[Docker] Prep for torch 2.11: cu129 fix, image validator, dep cleanup #23593

Merged
Fridge003 merged 14 commits into main from update_docker_images
May 4, 2026

Conversation

@Kangyan-Zhou (Collaborator) commented Apr 24, 2026

Motivation

Torch 2.11 makes cu130 wheels the PyPI default, which surfaces two install-time bugs in the cu129 Docker image build path. This PR fixes those, adds a post-build validator to gate future CUDA-variant regressions, and cleans up NVIDIA package overrides that torch 2.11 now ships at equal or newer versions — mirroring the CI-script cleanup already done in #21247.

Companion to #21247 (torch 2.11 upgrade), which handles python/pyproject.toml, sgl-kernel/Dockerfile, and scripts/ci/cuda/ci_install_dependency.sh. This PR completes the Dockerfile-side work.

Modifications

cu129 torch resolution fixes — docker/Dockerfile

  1. sgl-kernel install on cu128/cu129 was missing --force-reinstall --no-deps, so pip resolved sglang-kernel's torch dep and silently pulled a cu130 torch from PyPI into a cu129 image. Made consistent with the cu126/cu130 branches.

  2. Main sglang dep install relied on --extra-index-url, which isn't strong enough to force cu12x resolution when PyPI and the pytorch.org index both publish the same version 2.11.0. Now pre-installs torch/torchvision/torchaudio from the cu12x index with --index-url before the main install.
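
The pre-install pattern can be sketched as follows (illustrative only; the exact index URL, CUDA tag, and torch pin used in docker/Dockerfile may differ):

```dockerfile
# Force cu12x wheels before the main dependency resolution: --index-url
# replaces PyPI entirely, whereas --extra-index-url only adds a fallback
# and lets pip pick PyPI's cu130 build when both indexes publish the
# same version string (2.11.0).
RUN python3 -m pip install \
        --index-url https://download.pytorch.org/whl/cu129 \
        torch==2.11.0 torchvision torchaudio
```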

Post-build validator — new docker/validate_image.py + workflow integration

Standalone validator copied into /usr/local/bin/validate_image.py and invoked from .github/workflows/_docker-build-and-publish.yml after each push-by-digest build (x86 cu129, x86 cu130, arm64 cu129, arm64 cu130). Pulls the just-pushed image and runs the validator inside it. A failure blocks the downstream create-manifests job, so bad digests never get tagged.

Checks:

  • torch.version.cuda matches the matrix CUDA_VERSION
  • Torch-internal cross-check: the installed nvidia-nccl-cuX and nvidia-cudnn-cuX PyPI wheel versions must be string-prefix matches of torch.cuda.nccl.version() / torch.backends.cudnn.version(). This catches silent wheel downgrades and auto-tracks torch upgrades, with no manual pin sync.
  • Hard-pin assertion for cuda-python (not torch-bundled)
  • Smoke imports: torch, torchaudio, torchvision, sglang, sgl_kernel, flashinfer, nixl

Modeled after pytorch/pytorch's .ci/pytorch/smoke_test/smoke_test.py pattern.
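
A minimal sketch of the first two checks, with assumed semantics (torch.version.cuda reports "major.minor" while the matrix passes a full version, and torch.cuda.nccl.version() returns a tuple; the real logic lives in docker/validate_image.py):

```python
def cuda_variant_ok(torch_cuda: str, matrix_cuda_version: str) -> bool:
    """torch.version.cuda reports e.g. "12.9"; the workflow matrix passes
    a full version like "12.9.1". Prefix-match them."""
    return matrix_cuda_version.startswith(torch_cuda)

def wheel_tracks_torch(wheel_version: str, torch_reported: tuple) -> bool:
    """An nvidia-nccl-cuX / nvidia-cudnn-cuX wheel version must start with
    the version torch reports internally, so the check auto-tracks torch
    upgrades without a manually synced pin."""
    prefix = ".".join(str(part) for part in torch_reported)
    return wheel_version.startswith(prefix)

assert cuda_variant_ok("12.9", "12.9.1")
assert not cuda_variant_ok("13.0", "12.9.1")         # cu130 torch in a cu129 image
assert wheel_tracks_torch("2.28.9", (2, 28, 9))
assert not wheel_tracks_torch("2.28.3", (2, 28, 9))  # silent downgrade
```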

Redundant NVIDIA overrides removed — docker/Dockerfile

Torch 2.11 ships all of these at equal-or-newer versions, so the force-reinstalls were either silent downgrades or no-ops:

| Override (removed) | What torch 2.11 ships |
| --- | --- |
| nvidia-nccl-cu12/cu13==2.28.3 | 2.28.9 (downgrade) |
| nvidia-cudnn-cu12==9.16.0.29 | 9.17.1.4 (downgrade) |
| nvidia-cudnn-cu13==9.16.0.29 | 9.19.0.56 (downgrade) |
| nvidia-cublas==13.1.0.3 | pulled by cuda-toolkit[cublas]==13.0.2 at exactly 13.1.0.3.* (no-op) |
| nvidia-cutlass-dsl==4.4.2 force-reinstall | already pinned in python/pyproject.toml:40 (no-op) |

Matches the cleanup PR #21247 applied in ci_install_dependency.sh.
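
The downgrade-vs-no-op classification above reduces to a numeric comparison of dotted version strings; a small sketch:

```python
def is_downgrade(pinned: str, shipped: str) -> bool:
    # Compare dotted version strings numerically, segment by segment,
    # e.g. "2.28.3" < "2.28.9". Illustrative only; real wheels may carry
    # non-numeric suffixes that need packaging.version instead.
    to_tuple = lambda v: tuple(int(x) for x in v.split("."))
    return to_tuple(pinned) < to_tuple(shipped)

assert is_downgrade("2.28.3", "2.28.9")        # nccl pin was a downgrade
assert is_downgrade("9.16.0.29", "9.17.1.4")   # cudnn-cu12 pin was a downgrade
assert not is_downgrade("2.28.9", "2.28.9")    # equal pin would be a no-op
```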

nixl duplication fix — docker/Dockerfile

The nixl stub package has an unconditional requires_dist on nixl-cu12>=1.0.1, so installing plain nixl in the essential-packages block pulled nixl-cu12 onto cu13 images on top of the subsequent nixl-cu13 install — shipping the wrong-CUDA binary. Removed nixl from the essential-packages list; now installs nixl-cu12 / nixl-cu13 directly in the per-CUDA-major block.
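
Why the stub pulled nixl-cu12 everywhere: a dependency listed in a wheel's Requires-Dist with no environment marker applies to every install, regardless of the image's CUDA variant. A small sketch of that distinction, using hypothetical metadata mirroring the description above:

```python
import re

def unconditional_deps(requires_dist):
    """Return the deps that carry no environment marker, i.e. the ones
    pip installs on every platform and CUDA variant."""
    out = []
    for req in requires_dist:
        if ";" in req:  # a marker such as 'extra == "cu13"' gates the dep
            continue
        out.append(re.split(r"[ (<>=!~]", req, maxsplit=1)[0])
    return out

# Hypothetical metadata shaped like the nixl stub described above:
assert unconditional_deps(["nixl-cu12>=1.0.1"]) == ["nixl-cu12"]
assert unconditional_deps(['nixl-cu13>=1.0.1; extra == "cu13"']) == []
```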

Miscellaneous

  • Default ARG CUDA_VERSION bumped to 13.0.1 (only affects ad-hoc local builds; release workflow always passes --build-arg explicitly)
  • nvidia-cutlass-dsl tightened from >=4.4.1 to ==4.4.2
  • docker/diffusion.Dockerfile removed (no remaining references in workflows/scripts/docs)
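
For reference, the Dockerfile branches per CUDA major via the shell expansion ${CUDA_VERSION%%.*}; a Python analogue of that test (full version strings here are illustrative):

```python
def cuda_major(cuda_version: str) -> str:
    # Python analogue of the Dockerfile's ${CUDA_VERSION%%.*} branch test:
    # everything before the first dot.
    return cuda_version.split(".", 1)[0]

assert cuda_major("13.0.1") == "13"  # new default ARG takes the cu13 branch
assert cuda_major("12.9.1") == "12"  # cu12x branch
```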

Accuracy Tests

N/A — no change to model code paths or kernels.

Speed Tests and Profiling

N/A — build-time and CI-gating change only. No runtime effect.

Checklist

  • Format your code according to Format code with pre-commit: pre-commit run --all-files passes
  • Add unit tests according to Run and add unit tests — post-build validator exercised automatically by the release workflow against every built image
  • Follow the SGLang code style guidance


@Kangyan-Zhou force-pushed the update_docker_images branch from af6d10e to f90d16e on April 24, 2026 00:48
@Kangyan-Zhou changed the title from "[Docker] Fix cu129 torch resolution for torch 2.11 + add post-build v…" to "[Docker] Prep for torch 2.11: cu129 fix, image validator, dep cleanup" on Apr 24, 2026
Comment thread docker/Dockerfile
RUN --mount=type=cache,target=/root/.cache/pip if [ "${CUDA_VERSION%%.*}" = "12" ]; then \
python3 -m pip install nvidia-nccl-cu12==2.28.3 --force-reinstall --no-deps ; \
python3 -m pip install nvidia-cudnn-cu12==9.16.0.29 --force-reinstall --no-deps ; \
python3 -m pip install nixl-cu12 --no-deps ; \
Contributor:
The stub package is needed, so I suggest installing nixl and nixl-cu12 here, and nixl and nixl-cu13 below

Kangyan-Zhou and others added 8 commits May 2, 2026 12:34
…alidator

Torch 2.11 ships cu130 wheels as PyPI's default, which broke two install
paths in the cu12x Dockerfile branch:

1. sgl-kernel install on cu128/cu129 (Dockerfile:205) was missing
   --force-reinstall --no-deps, so pip resolved sglang-kernel's torch dep
   and pulled a cu130 torch from PyPI into a cu129 image. Made consistent
   with the cu126/cu130 branches.

2. The main sglang dep install relied on --extra-index-url, which isn't
   strong enough to force cu12x resolution when both indexes publish the
   same version string. Pre-install torch/torchvision/torchaudio from the
   cu12x index with --index-url before the main install.

Also adds docker/validate_image.py, a post-build validator invoked from
release-docker-dev.yml after push-by-digest. It asserts torch.version.cuda
matches the matrix CUDA_VERSION, cross-checks torch's compiled-in
cudnn/nccl against the installed PyPI wheel (catches silent downgrades),
hard-pins cuda-python and nvidia-cublas, and smoke-imports critical
packages. Modeled after pytorch/pytorch's .ci/pytorch/smoke_test pattern.

Additional changes:
- Default ARG CUDA_VERSION bumped to 13.0.1 (only affects ad-hoc local
  builds; release workflow always passes --build-arg explicitly)
- nvidia-cutlass-dsl tightened from >=4.4.1 to ==4.4.2
- docker/diffusion.Dockerfile removed (no remaining references)

Companion to #21247 (torch 2.11 upgrade).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Torch 2.11's wheel metadata already pins the NVIDIA libs we were
force-reinstalling, and PR #21247 applied the same cleanup in
scripts/ci/cuda/ci_install_dependency.sh. Align the Dockerfile:

- nvidia-nccl-cu12/cu13==2.28.3: torch 2.11 ships 2.28.9 (pinning older
  was a silent downgrade)
- nvidia-cudnn-cu12==9.16.0.29: torch 2.11 ships 9.17.1.4 (downgrade)
- nvidia-cudnn-cu13==9.16.0.29: torch 2.11 ships 9.19.0.56 (downgrade)
- nvidia-cublas==13.1.0.3: already pulled transitively by
  cuda-toolkit[cublas]==13.0.2 at the exact same version
- nvidia-cutlass-dsl==4.4.2 force-reinstall: already pinned in
  python/pyproject.toml and resolved by the main sglang dep install

Also fix a nixl duplication bug on cu13 images: the `nixl` stub package
has an unconditional requires_dist on nixl-cu12>=1.0.1, so installing
plain `nixl` in the essential-packages block pulled nixl-cu12 (~49 MB)
onto cu13 images on top of the subsequent nixl-cu13 install. Install
nixl-cu12 / nixl-cu13 directly in the per-CUDA-major block instead.

Validator (docker/validate_image.py): drop the nvidia-cublas hard-pin
assertion; it's now transitively pinned by torch and the Dockerfile no
longer force-reinstalls it. The torch-internal cross-check for cudnn
and nccl still runs and will assert against whatever torch 2.11 ships.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Correct Dockerfile comment pointer: validator is invoked from
  _docker-build-and-publish.yml, not release-docker-dev.yml
- Drop ~49 MB magnitude from nixl comment (rot-prone, not load-bearing)
- Add nixl smoke import to validator (this area was just restructured
  and otherwise unasserted)
- Convert PackageNotFoundError to AssertionError in the torch-bundled
  cudnn/nccl cross-checks so a missing dep produces a clean FAIL line
  instead of a raw traceback
- Retry docker pull in the 4 validation steps: DockerHub has brief
  eventual-consistency after push-by-digest

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Removes docker/validate_image.py and the four "Validate ..." steps from
.github/workflows/_docker-build-and-publish.yml. The remaining changes
(cu129 torch resolution fix, NVIDIA override cleanup, nixl per-CUDA install)
stand on their own; the validator can ship in a follow-up PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…u13 refs

Address findings from the code review:

1. release-docker.yml: tag_config JSON was malformed (missing comma after
   cu130 entry, trailing comma after cu129). fromJson would fail and break
   every tag-pushed release. Header comment also said latest-cu139.

2. docker/Dockerfile: restore the cu12 --index-url torch pre-install that
   the prior 'upd' commit dropped. With #21247 landed, torch 2.11 is the
   PyPI default at cu130, and --extra-index-url alone won't override it
   when both indexes publish the same version — cu126/cu128/cu129 images
   would silently ship cu130 torch.

3. Update consumers of the dropped dev-cu13 / latest-cu130-runtime tags
   to the new naming (dev = cu13, dev-cu12 = cu12): trivy-scan-dev.yml,
   nightly-72-gpu-gb200.yml, release-docker-dev.yml description,
   _docker-cleanup-nightly.yml examples, and
   scripts/ci/utils/docker_build_metadata_args.py MOVING_TAGS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Kangyan-Zhou force-pushed the update_docker_images branch from e28a506 to 348d4b4 on May 2, 2026 20:35
Kangyan-Zhou and others added 2 commits May 2, 2026 13:40
…0-runtime aliases

- docker/Dockerfile: install the nixl stub alongside nixl-cu12/nixl-cu13
  so the `nixl` import path keeps working. --no-deps prevents the stub's
  unconditional nixl-cu12 dep from shipping wrong-CUDA libs on cu13
  images. Addresses review feedback on PR #23593.

- release-docker-dev.yml: publish dev-cu13 as an alias of dev on the
  cu130 nightly and suffixed builds. Lets external consumers pinned to
  the pre-flip tag keep working.

- release-docker-runtime.yml: publish v{ver}-cu130-runtime and
  latest-cu130-runtime as aliases of the un-suffixed cu130 runtime tags.

- docker_build_metadata_args.py: add dev-cu13 to MOVING_TAGS so the
  metadata script still selects the immutable nightly-dev-{date}-{sha}
  tag for the build arg.

Revert the now-unnecessary description-text edits in trivy-scan-dev.yml,
nightly-72-gpu-gb200.yml, release-docker-dev.yml input description, and
_docker-cleanup-nightly.yml — the dev-cu13 / nightly-dev-cu13 names are
valid again with the alias in place. Trivy matrix stays as ["dev",
"dev-cu12"] so we scan both image variants instead of the same one twice.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…or smooth transition

Apply follow-ups from the review of PR #23593:

(a) release-docker.yml: publish v{ver}-cu130 and latest-cu130 as aliases
    on the cu130 framework release. Mirrors the runtime aliases added in
    the prior commit. Without this, B200/B300 consumers pinned to
    `lmsysorg/sglang:v0.5.X-cu130` would break on the next release.

(b) release-docker-dev.yml: add nightly-dev-cu13-{date}-{short_sha} to the
    cu130 nightly tag list so the immutable history tag keeps being
    published under the pre-flip name. select_tag still picks the
    canonical nightly-dev-{date}-{short_sha} as the build-arg image tag
    (dev-cu13 is in MOVING_TAGS, so it's skipped).

(c) release-docker-dev.yml: add nightly-dev-cu13 to the cleanup-nightly
    tag_prefixes so the historical cu13 tags get GC'd alongside the new
    nightly-dev / nightly-dev-cu12 prefixes.

Note: the cu12-default → cu13-default flip on `dev`, `latest`, `v{ver}`,
`latest-runtime`, and `v{ver}-runtime` is the unavoidable breaking change
in this PR — those tags move by definition and have no clean alias fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread docker/Dockerfile Outdated
RUN --mount=type=cache,target=/root/.cache/pip if [ "${CUDA_VERSION%%.*}" = "12" ]; then \
python3 -m pip install nvidia-nccl-cu12==2.28.3 --force-reinstall --no-deps ; \
python3 -m pip install nvidia-cudnn-cu12==9.16.0.29 --force-reinstall --no-deps ; \
python3 -m pip install nixl nixl-cu12 --no-deps ; \
Contributor:

Could be like this

Suggested change
python3 -m pip install nixl nixl-cu12 --no-deps ; \
python3 -m pip install nixl[cu12] --no-deps ; \

Comment thread docker/Dockerfile Outdated
python3 -m pip install nvidia-cudnn-cu13==9.16.0.29 --force-reinstall --no-deps ; \
python3 -m pip install nvidia-cublas==13.1.0.3 --force-reinstall --no-deps ; \
python3 -m pip install nixl-cu13 --no-deps ; \
python3 -m pip install nixl nixl-cu13 --no-deps ; \
Contributor:
Same

Suggested change
python3 -m pip install nixl nixl-cu13 --no-deps ; \
python3 -m pip install nixl[cu13] --no-deps ; \

Kangyan-Zhou and others added 4 commits May 2, 2026 16:34
…repository

`add-apt-repository ppa:deadsnakes/ppa` calls api.launchpad.net through
launchpadlib to look up the PPA owner. That endpoint has been timing out
on the self-hosted runners building the dev images, causing the nightly
release-docker-dev workflow to fail at the python3.12 install step:

    TimeoutError: [Errno 110] Connection timed out
    add-apt-repository → softwareproperties → launchpadlib → httplib2

Drop the launchpadlib dependency by writing the apt source list directly
and fetching the deadsnakes signing key from keyserver.ubuntu.com. The
build now only talks to ppa.launchpadcontent.net (the package mirror)
and keyserver.ubuntu.com, both of which are reachable from the runners.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The base and runtime stages both ran `add-apt-repository ppa:deadsnakes/ppa`
to pull Python 3.10 and Python 3.12. Ubuntu 24.04 (noble) already ships
python3.12 in `main`, and nothing in the image actually consumes Python
3.10 beyond an unused `update-alternatives` slot. Removing the PPA call
avoids transient Launchpad 504s during `add-apt-repository`, which has
broken the dev-image build.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Fridge003 (Collaborator) left a comment:

Verified locally

@Fridge003 merged commit 52b4609 into main on May 4, 2026
63 of 67 checks passed
@Fridge003 deleted the update_docker_images branch on May 4, 2026 07:37
Fridge003 added a commit that referenced this pull request May 4, 2026
…#23593)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
4 participants