[Docker] Prep for torch 2.11: cu129 fix, image validator, dep cleanup#23593
Merged
Conversation
Force-pushed from af6d10e to f90d16e
ovidiusm reviewed on Apr 24, 2026
RUN --mount=type=cache,target=/root/.cache/pip if [ "${CUDA_VERSION%%.*}" = "12" ]; then \
    python3 -m pip install nvidia-nccl-cu12==2.28.3 --force-reinstall --no-deps ; \
    python3 -m pip install nvidia-cudnn-cu12==9.16.0.29 --force-reinstall --no-deps ; \
    python3 -m pip install nixl-cu12 --no-deps ; \
Contributor
The stub package is needed, so I suggest installing nixl and nixl-cu12 here, and nixl and nixl-cu13 below
…alidator

Torch 2.11 ships cu130 wheels as PyPI's default, which broke two install paths in the cu12x Dockerfile branch:

1. sgl-kernel install on cu128/cu129 (Dockerfile:205) was missing --force-reinstall --no-deps, so pip resolved sglang-kernel's torch dep and pulled a cu130 torch from PyPI into a cu129 image. Made consistent with the cu126/cu130 branches.
2. The main sglang dep install relied on --extra-index-url, which isn't strong enough to force cu12x resolution when both indexes publish the same version string. Pre-install torch/torchvision/torchaudio from the cu12x index with --index-url before the main install.

Also adds docker/validate_image.py, a post-build validator invoked from release-docker-dev.yml after push-by-digest. It asserts torch.version.cuda matches the matrix CUDA_VERSION, cross-checks torch's compiled-in cudnn/nccl against the installed PyPI wheel (catches silent downgrades), hard-pins cuda-python and nvidia-cublas, and smoke-imports critical packages. Modeled after pytorch/pytorch's .ci/pytorch/smoke_test pattern.

Additional changes:
- Default ARG CUDA_VERSION bumped to 13.0.1 (only affects ad-hoc local builds; release workflow always passes --build-arg explicitly)
- nvidia-cutlass-dsl tightened from >=4.4.1 to ==4.4.2
- docker/diffusion.Dockerfile removed (no remaining references)

Companion to #21247 (torch 2.11 upgrade).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Torch 2.11's wheel metadata already pins the NVIDIA libs we were force-reinstalling, and PR #21247 applied the same cleanup in scripts/ci/cuda/ci_install_dependency.sh. Align the Dockerfile:

- nvidia-nccl-cu12/cu13==2.28.3: torch 2.11 ships 2.28.9 (pinning older was a silent downgrade)
- nvidia-cudnn-cu12==9.16.0.29: torch 2.11 ships 9.17.1.4 (downgrade)
- nvidia-cudnn-cu13==9.16.0.29: torch 2.11 ships 9.19.0.56 (downgrade)
- nvidia-cublas==13.1.0.3: already pulled transitively by cuda-toolkit[cublas]==13.0.2 at the exact same version
- nvidia-cutlass-dsl==4.4.2 force-reinstall: already pinned in python/pyproject.toml and resolved by the main sglang dep install

Also fix a nixl duplication bug on cu13 images: the `nixl` stub package has an unconditional requires_dist on nixl-cu12>=1.0.1, so installing plain `nixl` in the essential-packages block pulled nixl-cu12 (~49 MB) onto cu13 images on top of the subsequent nixl-cu13 install. Install nixl-cu12 / nixl-cu13 directly in the per-CUDA-major block instead.

Validator (docker/validate_image.py): drop the nvidia-cublas hard-pin assertion; it's now transitively pinned by torch and the Dockerfile no longer force-reinstalls it. The torch-internal cross-check for cudnn and nccl still runs and will assert against whatever torch 2.11 ships.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Correct Dockerfile comment pointer: validator is invoked from _docker-build-and-publish.yml, not release-docker-dev.yml
- Drop the ~49 MB magnitude from the nixl comment (rot-prone, not load-bearing)
- Add a nixl smoke import to the validator (this area was just restructured and otherwise unasserted)
- Convert PackageNotFoundError to AssertionError in the torch-bundled cudnn/nccl cross-checks so a missing dep produces a clean FAIL line instead of a raw traceback (see the sketch below)
- Retry docker pull in the 4 validation steps: DockerHub has brief eventual-consistency after push-by-digest

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
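A rough illustration of the PackageNotFoundError conversion described above (the `wheel_version` helper name is this sketch's invention, not the PR's exact code):

```python
from importlib.metadata import PackageNotFoundError, version

def wheel_version(name: str) -> str:
    """Look up an installed wheel's version, failing as a clean assertion."""
    try:
        return version(name)
    except PackageNotFoundError:
        # A missing dep now surfaces through the same FAIL path as a version
        # mismatch, instead of an unhandled importlib traceback.
        raise AssertionError(f"FAIL: required wheel '{name}' is not installed")
```

The cudnn/nccl cross-checks can then call such a helper and rely on a single failure path in the workflow log.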
Removes docker/validate_image.py and the four "Validate ..." steps from .github/workflows/_docker-build-and-publish.yml. The remaining changes (cu129 torch resolution fix, NVIDIA override cleanup, nixl per-CUDA install) stand on their own; the validator can ship in a follow-up PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…u13 refs

Address findings from the code review:

1. release-docker.yml: the tag_config JSON was malformed (missing comma after the cu130 entry, trailing comma after cu129). fromJson would fail and break every tag-pushed release. The header comment also said latest-cu139.
2. docker/Dockerfile: restore the cu12 --index-url torch pre-install that the prior 'upd' commit dropped. With #21247 landed, torch 2.11 is the PyPI default at cu130, and --extra-index-url alone won't override it when both indexes publish the same version — cu126/cu128/cu129 images would silently ship cu130 torch.
3. Update consumers of the dropped dev-cu13 / latest-cu130-runtime tags to the new naming (dev = cu13, dev-cu12 = cu12): trivy-scan-dev.yml, nightly-72-gpu-gb200.yml, the release-docker-dev.yml description, _docker-cleanup-nightly.yml examples, and scripts/ci/utils/docker_build_metadata_args.py MOVING_TAGS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Force-pushed from e28a506 to 348d4b4
…0-runtime aliases

- docker/Dockerfile: install the nixl stub alongside nixl-cu12/nixl-cu13 so the `nixl` import path keeps working. --no-deps prevents the stub's unconditional nixl-cu12 dep from shipping wrong-CUDA libs on cu13 images. Addresses review feedback on PR #23593.
- release-docker-dev.yml: publish dev-cu13 as an alias of dev on the cu130 nightly and suffixed builds. Lets external consumers pinned to the pre-flip tag keep working.
- release-docker-runtime.yml: publish v{ver}-cu130-runtime and latest-cu130-runtime as aliases of the un-suffixed cu130 runtime tags.
- docker_build_metadata_args.py: add dev-cu13 to MOVING_TAGS so the metadata script still selects the immutable nightly-dev-{date}-{sha} tag for the build arg.

Revert the now-unnecessary description-text edits in trivy-scan-dev.yml, nightly-72-gpu-gb200.yml, the release-docker-dev.yml input description, and _docker-cleanup-nightly.yml — the dev-cu13 / nightly-dev-cu13 names are valid again with the alias in place. The Trivy matrix stays as ["dev", "dev-cu12"] so we scan both image variants instead of the same one twice.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…or smooth transition

Apply follow-ups from the review of PR #23593:

(a) release-docker.yml: publish v{ver}-cu130 and latest-cu130 as aliases on the cu130 framework release. Mirrors the runtime aliases added in the prior commit. Without this, B200/B300 consumers pinned to `lmsysorg/sglang:v0.5.X-cu130` would break on the next release.
(b) release-docker-dev.yml: add nightly-dev-cu13-{date}-{short_sha} to the cu130 nightly tag list so the immutable history tag keeps being published under the pre-flip name. select_tag still picks the canonical nightly-dev-{date}-{short_sha} as the build-arg image tag (dev-cu13 is in MOVING_TAGS, so it's skipped; see the sketch below).
(c) release-docker-dev.yml: add nightly-dev-cu13 to the cleanup-nightly tag_prefixes so the historical cu13 tags get GC'd alongside the new nightly-dev / nightly-dev-cu12 prefixes.

Note: the cu12-default → cu13-default flip on `dev`, `latest`, `v{ver}`, `latest-runtime`, and `v{ver}-runtime` is the unavoidable breaking change in this PR — those tags move by definition and have no clean alias fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
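A minimal sketch of the select_tag / MOVING_TAGS interplay relied on in (b); `select_tag` and `MOVING_TAGS` live in scripts/ci/utils/docker_build_metadata_args.py, but this body is illustrative rather than the file's actual code:

```python
# Moving tags are republished in place, so they are never safe to bake into
# a build arg; the immutable nightly-dev-{date}-{sha} tag is preferred.
MOVING_TAGS = {"dev", "dev-cu12", "dev-cu13"}

def select_tag(tags: list[str]) -> str:
    """Return the first tag that is not a moving alias."""
    for tag in tags:
        if tag not in MOVING_TAGS:
            return tag
    raise ValueError(f"no immutable tag among {tags!r}")

# The cu130 nightly publishes the moving aliases plus the immutable tag;
# the immutable one is selected (tag values here are illustrative).
assert (select_tag(["dev", "dev-cu13", "nightly-dev-20260504-348d4b4"])
        == "nightly-dev-20260504-348d4b4")
```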
mmangkad reviewed on May 2, 2026
RUN --mount=type=cache,target=/root/.cache/pip if [ "${CUDA_VERSION%%.*}" = "12" ]; then \
    python3 -m pip install nvidia-nccl-cu12==2.28.3 --force-reinstall --no-deps ; \
    python3 -m pip install nvidia-cudnn-cu12==9.16.0.29 --force-reinstall --no-deps ; \
    python3 -m pip install nixl nixl-cu12 --no-deps ; \
Contributor
Could be like this
Suggested change
-    python3 -m pip install nixl nixl-cu12 --no-deps ; \
+    python3 -m pip install nixl[cu12] --no-deps ; \
     python3 -m pip install nvidia-cudnn-cu13==9.16.0.29 --force-reinstall --no-deps ; \
     python3 -m pip install nvidia-cublas==13.1.0.3 --force-reinstall --no-deps ; \
-    python3 -m pip install nixl-cu13 --no-deps ; \
+    python3 -m pip install nixl nixl-cu13 --no-deps ; \
Contributor
Same
Suggested change
-    python3 -m pip install nixl nixl-cu13 --no-deps ; \
+    python3 -m pip install nixl[cu13] --no-deps ; \
…repository
`add-apt-repository ppa:deadsnakes/ppa` calls api.launchpad.net through
launchpadlib to look up the PPA owner. That endpoint has been timing out
on the self-hosted runners building the dev images, causing the nightly
release-docker-dev workflow to fail at the python3.12 install step:
TimeoutError: [Errno 110] Connection timed out
add-apt-repository → softwareproperties → launchpadlib → httplib2
Drop the launchpadlib dependency by writing the apt source list directly
and fetching the deadsnakes signing key from keyserver.ubuntu.com. The
build now only talks to ppa.launchpadcontent.net (the package mirror)
and keyserver.ubuntu.com, both of which are reachable from the runners.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…add-apt-repository"

This reverts commit 128078f.
The base and runtime stages both ran `add-apt-repository ppa:deadsnakes/ppa` to pull Python 3.10 and Python 3.12. Ubuntu 24.04 (noble) already ships python3.12 in `main`, and nothing in the image actually consumes Python 3.10 beyond an unused `update-alternatives` slot. Removing the PPA call avoids transient Launchpad 504s during `add-apt-repository`, which has broken the dev-image build. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fridge003 added a commit that referenced this pull request on May 4, 2026
…#23593) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
Motivation
Torch 2.11 makes cu130 wheels the PyPI default, which surfaces two install-time bugs in the cu129 Docker image build path. This PR fixes those, adds a post-build validator to gate future CUDA-variant regressions, and cleans up NVIDIA package overrides that torch 2.11 now ships at equal or newer versions — mirroring the CI-script cleanup already done in #21247.
Companion to #21247 (torch 2.11 upgrade), which handles `python/pyproject.toml`, `sgl-kernel/Dockerfile`, and `scripts/ci/cuda/ci_install_dependency.sh`. This PR completes the Dockerfile-side work.

Modifications
cu129 torch resolution fixes — `docker/Dockerfile`

- sgl-kernel install on cu128/cu129 was missing `--force-reinstall --no-deps`, so pip resolved sglang-kernel's torch dep and silently pulled a cu130 torch from PyPI into a cu129 image. Made consistent with the cu126/cu130 branches.
- The main sglang dep install relied on `--extra-index-url`, which isn't strong enough to force cu12x resolution when PyPI and the pytorch.org index both publish the same version `2.11.0`. Now pre-installs `torch`/`torchvision`/`torchaudio` from the cu12x index with `--index-url` before the main install.
Post-build validator — new `docker/validate_image.py` + workflow integration

Standalone validator copied into `/usr/local/bin/validate_image.py` and invoked from `.github/workflows/_docker-build-and-publish.yml` after each push-by-digest build (x86 cu129, x86 cu130, arm64 cu129, arm64 cu130). Pulls the just-pushed image and runs the validator inside it. A failure blocks the downstream `create-manifests` job, so bad digests never get tagged.

Checks:

- `torch.version.cuda` matches the matrix `CUDA_VERSION`
- `nvidia-nccl-cuX` and `nvidia-cudnn-cuX` PyPI wheel versions must `startswith` `torch.cuda.nccl.version()` / `torch.backends.cudnn.version()`. Catches silent wheel downgrades and auto-tracks torch upgrades — no manual pin sync.
- hard-pin check on `cuda-python` (not torch-bundled)
- smoke imports: `torch`, `torchaudio`, `torchvision`, `sglang`, `sgl_kernel`, `flashinfer`, `nixl`

Modeled after pytorch/pytorch's `.ci/pytorch/smoke_test/smoke_test.py` pattern.
Redundant NVIDIA overrides removed — `docker/Dockerfile`

Torch 2.11 ships all of these at equal-or-newer versions, so the force-reinstalls were either silent downgrades or no-ops:

- `nvidia-nccl-cu12/cu13==2.28.3` → torch 2.11 ships `2.28.9` (downgrade)
- `nvidia-cudnn-cu12==9.16.0.29` → `9.17.1.4` (downgrade)
- `nvidia-cudnn-cu13==9.16.0.29` → `9.19.0.56` (downgrade)
- `nvidia-cublas==13.1.0.3` → already pulled by `cuda-toolkit[cublas]==13.0.2` at exactly `13.1.0.3.*` (no-op)
- `nvidia-cutlass-dsl==4.4.2` force-reinstall → already pinned at `python/pyproject.toml:40` (no-op)

Matches the cleanup PR #21247 applied in `ci_install_dependency.sh`.
nixl duplication fix — `docker/Dockerfile`

The `nixl` stub package has an unconditional `requires_dist` on `nixl-cu12>=1.0.1`, so installing plain `nixl` in the essential-packages block pulled nixl-cu12 onto cu13 images on top of the subsequent nixl-cu13 install — shipping the wrong-CUDA binary. Removed `nixl` from the essential-packages list; now installs `nixl-cu12`/`nixl-cu13` directly in the per-CUDA-major block.
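The mechanism is easy to confirm from inside any image where the stub is installed, since the dependency lives in the stub's wheel metadata; a minimal sketch:

```python
# Inspect the nixl stub's declared dependencies via standard wheel metadata.
from importlib.metadata import requires

deps = requires("nixl") or []
print(deps)
# Per the PR text this contains an unconditional 'nixl-cu12>=1.0.1' entry,
# i.e. no environment marker gating it to cu12-only images.
assert any(d.startswith("nixl-cu12") for d in deps)
```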
Miscellaneous

- `ARG CUDA_VERSION` bumped to `13.0.1` (only affects ad-hoc local builds; the release workflow always passes `--build-arg` explicitly)
- `nvidia-cutlass-dsl` tightened from `>=4.4.1` to `==4.4.2`
- `docker/diffusion.Dockerfile` removed (no remaining references in workflows/scripts/docs)

Accuracy Tests
N/A — no change to model code paths or kernels.
Speed Tests and Profiling
N/A — build-time and CI-gating change only. No runtime effect.
Checklist
`pre-commit run --all-files` passes