[Docker] Prep for torch 2.11: cu129 fix, image validator, dep cleanup #23593

Merged
Fridge003 merged 14 commits into main from update_docker_images
May 4, 2026

Conversation

@Kangyan-Zhou (Collaborator) commented Apr 24, 2026

Motivation

Torch 2.11 makes cu130 wheels the PyPI default, which surfaces two install-time bugs in the cu129 Docker image build path. This PR fixes those, adds a post-build validator to gate future CUDA-variant regressions, and cleans up NVIDIA package overrides that torch 2.11 now ships at equal or newer versions — mirroring the CI-script cleanup already done in #21247.

Companion to #21247 (torch 2.11 upgrade), which handles python/pyproject.toml, sgl-kernel/Dockerfile, and scripts/ci/cuda/ci_install_dependency.sh. This PR completes the Dockerfile-side work.

Modifications

cu129 torch resolution fixes — docker/Dockerfile

  1. sgl-kernel install on cu128/cu129 was missing --force-reinstall --no-deps, so pip resolved sglang-kernel's torch dep and silently pulled a cu130 torch from PyPI into a cu129 image. Made consistent with the cu126/cu130 branches.

  2. Main sglang dep install relied on --extra-index-url, which isn't strong enough to force cu12x resolution when PyPI and the pytorch.org index both publish the same version 2.11.0. Now pre-installs torch/torchvision/torchaudio from the cu12x index with --index-url before the main install.
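
The pre-install pattern can be sketched as follows (illustrative only; the exact index URL, CUDA tag, and torch pin used in docker/Dockerfile may differ):

```dockerfile
# Force cu12x wheels before the main dependency resolution: --index-url
# replaces PyPI entirely, whereas --extra-index-url only adds a fallback
# and lets pip pick PyPI's cu130 build when both indexes publish the
# same version string (2.11.0).
RUN python3 -m pip install \
        --index-url https://download.pytorch.org/whl/cu129 \
        torch==2.11.0 torchvision torchaudio
```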

Post-build validator — new docker/validate_image.py + workflow integration

Standalone validator copied into /usr/local/bin/validate_image.py and invoked from .github/workflows/_docker-build-and-publish.yml after each push-by-digest build (x86 cu129, x86 cu130, arm64 cu129, arm64 cu130). Pulls the just-pushed image and runs the validator inside it. A failure blocks the downstream create-manifests job, so bad digests never get tagged.

Checks:

  • torch.version.cuda matches the matrix CUDA_VERSION
  • Torch-internal cross-check: the installed nvidia-nccl-cuX and nvidia-cudnn-cuX PyPI wheel versions must be string-prefix matches of torch.cuda.nccl.version() / torch.backends.cudnn.version(). This catches silent wheel downgrades and auto-tracks torch upgrades, with no manual pin sync.
  • Hard-pin assertion for cuda-python (not torch-bundled)
  • Smoke imports: torch, torchaudio, torchvision, sglang, sgl_kernel, flashinfer, nixl

Modeled after pytorch/pytorch's .ci/pytorch/smoke_test/smoke_test.py pattern.
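
A minimal sketch of the first two checks, with assumed semantics (torch.version.cuda reports "major.minor" while the matrix passes a full version, and torch.cuda.nccl.version() returns a tuple; the real logic lives in docker/validate_image.py):

```python
def cuda_variant_ok(torch_cuda: str, matrix_cuda_version: str) -> bool:
    """torch.version.cuda reports e.g. "12.9"; the workflow matrix passes
    a full version like "12.9.1". Prefix-match them."""
    return matrix_cuda_version.startswith(torch_cuda)

def wheel_tracks_torch(wheel_version: str, torch_reported: tuple) -> bool:
    """An nvidia-nccl-cuX / nvidia-cudnn-cuX wheel version must start with
    the version torch reports internally, so the check auto-tracks torch
    upgrades without a manually synced pin."""
    prefix = ".".join(str(part) for part in torch_reported)
    return wheel_version.startswith(prefix)

assert cuda_variant_ok("12.9", "12.9.1")
assert not cuda_variant_ok("13.0", "12.9.1")         # cu130 torch in a cu129 image
assert wheel_tracks_torch("2.28.9", (2, 28, 9))
assert not wheel_tracks_torch("2.28.3", (2, 28, 9))  # silent downgrade
```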

Redundant NVIDIA overrides removed — docker/Dockerfile

Torch 2.11 ships all of these at equal-or-newer versions, so the force-reinstalls were either silent downgrades or no-ops:

| Override (removed) | What torch 2.11 ships |
| --- | --- |
| nvidia-nccl-cu12/cu13==2.28.3 | 2.28.9 (downgrade) |
| nvidia-cudnn-cu12==9.16.0.29 | 9.17.1.4 (downgrade) |
| nvidia-cudnn-cu13==9.16.0.29 | 9.19.0.56 (downgrade) |
| nvidia-cublas==13.1.0.3 | pulled by cuda-toolkit[cublas]==13.0.2 at exactly 13.1.0.3.* (no-op) |
| nvidia-cutlass-dsl==4.4.2 force-reinstall | already pinned in python/pyproject.toml:40 (no-op) |

Matches the cleanup PR #21247 applied in ci_install_dependency.sh.
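
The downgrade-vs-no-op classification above reduces to a numeric comparison of dotted version strings; a small sketch:

```python
def is_downgrade(pinned: str, shipped: str) -> bool:
    # Compare dotted version strings numerically, segment by segment,
    # e.g. "2.28.3" < "2.28.9". Illustrative only; real wheels may carry
    # non-numeric suffixes that need packaging.version instead.
    to_tuple = lambda v: tuple(int(x) for x in v.split("."))
    return to_tuple(pinned) < to_tuple(shipped)

assert is_downgrade("2.28.3", "2.28.9")        # nccl pin was a downgrade
assert is_downgrade("9.16.0.29", "9.17.1.4")   # cudnn-cu12 pin was a downgrade
assert not is_downgrade("2.28.9", "2.28.9")    # equal pin would be a no-op
```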

nixl duplication fix — docker/Dockerfile

The nixl stub package has an unconditional requires_dist on nixl-cu12>=1.0.1, so installing plain nixl in the essential-packages block pulled nixl-cu12 onto cu13 images on top of the subsequent nixl-cu13 install — shipping the wrong-CUDA binary. Removed nixl from the essential-packages list; now installs nixl-cu12 / nixl-cu13 directly in the per-CUDA-major block.
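
Why the stub pulled nixl-cu12 everywhere: a dependency listed in a wheel's Requires-Dist with no environment marker applies to every install, regardless of the image's CUDA variant. A small sketch of that distinction, using hypothetical metadata mirroring the description above:

```python
import re

def unconditional_deps(requires_dist):
    """Return the deps that carry no environment marker, i.e. the ones
    pip installs on every platform and CUDA variant."""
    out = []
    for req in requires_dist:
        if ";" in req:  # a marker such as 'extra == "cu13"' gates the dep
            continue
        out.append(re.split(r"[ (<>=!~]", req, maxsplit=1)[0])
    return out

# Hypothetical metadata shaped like the nixl stub described above:
assert unconditional_deps(["nixl-cu12>=1.0.1"]) == ["nixl-cu12"]
assert unconditional_deps(['nixl-cu13>=1.0.1; extra == "cu13"']) == []
```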

Miscellaneous

  • Default ARG CUDA_VERSION bumped to 13.0.1 (only affects ad-hoc local builds; release workflow always passes --build-arg explicitly)
  • nvidia-cutlass-dsl tightened from >=4.4.1 to ==4.4.2
  • docker/diffusion.Dockerfile removed (no remaining references in workflows/scripts/docs)
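
For reference, the Dockerfile branches per CUDA major via the shell expansion ${CUDA_VERSION%%.*}; a Python analogue of that test (full version strings here are illustrative):

```python
def cuda_major(cuda_version: str) -> str:
    # Python analogue of the Dockerfile's ${CUDA_VERSION%%.*} branch test:
    # everything before the first dot.
    return cuda_version.split(".", 1)[0]

assert cuda_major("13.0.1") == "13"  # new default ARG takes the cu13 branch
assert cuda_major("12.9.1") == "12"  # cu12x branch
```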

Accuracy Tests

N/A — no change to model code paths or kernels.

Speed Tests and Profiling

N/A — build-time and CI-gating change only. No runtime effect.

Checklist

  • Format your code according to Format code with pre-commit: pre-commit run --all-files passes
  • Add unit tests according to Run and add unit tests — post-build validator exercised automatically by the release workflow against every built image
  • Follow the SGLang code style guidance


@Kangyan-Zhou force-pushed the update_docker_images branch from af6d10e to f90d16e on April 24, 2026 00:48
@Kangyan-Zhou changed the title from "[Docker] Fix cu129 torch resolution for torch 2.11 + add post-build v…" to "[Docker] Prep for torch 2.11: cu129 fix, image validator, dep cleanup" on Apr 24, 2026
Comment thread docker/Dockerfile
RUN --mount=type=cache,target=/root/.cache/pip if [ "${CUDA_VERSION%%.*}" = "12" ]; then \
python3 -m pip install nvidia-nccl-cu12==2.28.3 --force-reinstall --no-deps ; \
python3 -m pip install nvidia-cudnn-cu12==9.16.0.29 --force-reinstall --no-deps ; \
python3 -m pip install nixl-cu12 --no-deps ; \
Contributor:
The stub package is needed, so I suggest installing nixl and nixl-cu12 here, and nixl and nixl-cu13 below

Kangyan-Zhou and others added 8 commits May 2, 2026 12:34
…alidator

Torch 2.11 ships cu130 wheels as PyPI's default, which broke two install
paths in the cu12x Dockerfile branch:

1. sgl-kernel install on cu128/cu129 (Dockerfile:205) was missing
   --force-reinstall --no-deps, so pip resolved sglang-kernel's torch dep
   and pulled a cu130 torch from PyPI into a cu129 image. Made consistent
   with the cu126/cu130 branches.

2. The main sglang dep install relied on --extra-index-url, which isn't
   strong enough to force cu12x resolution when both indexes publish the
   same version string. Pre-install torch/torchvision/torchaudio from the
   cu12x index with --index-url before the main install.

Also adds docker/validate_image.py, a post-build validator invoked from
release-docker-dev.yml after push-by-digest. It asserts torch.version.cuda
matches the matrix CUDA_VERSION, cross-checks torch's compiled-in
cudnn/nccl against the installed PyPI wheel (catches silent downgrades),
hard-pins cuda-python and nvidia-cublas, and smoke-imports critical
packages. Modeled after pytorch/pytorch's .ci/pytorch/smoke_test pattern.

Additional changes:
- Default ARG CUDA_VERSION bumped to 13.0.1 (only affects ad-hoc local
  builds; release workflow always passes --build-arg explicitly)
- nvidia-cutlass-dsl tightened from >=4.4.1 to ==4.4.2
- docker/diffusion.Dockerfile removed (no remaining references)

Companion to #21247 (torch 2.11 upgrade).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Torch 2.11's wheel metadata already pins the NVIDIA libs we were
force-reinstalling, and PR #21247 applied the same cleanup in
scripts/ci/cuda/ci_install_dependency.sh. Align the Dockerfile:

- nvidia-nccl-cu12/cu13==2.28.3: torch 2.11 ships 2.28.9 (pinning older
  was a silent downgrade)
- nvidia-cudnn-cu12==9.16.0.29: torch 2.11 ships 9.17.1.4 (downgrade)
- nvidia-cudnn-cu13==9.16.0.29: torch 2.11 ships 9.19.0.56 (downgrade)
- nvidia-cublas==13.1.0.3: already pulled transitively by
  cuda-toolkit[cublas]==13.0.2 at the exact same version
- nvidia-cutlass-dsl==4.4.2 force-reinstall: already pinned in
  python/pyproject.toml and resolved by the main sglang dep install

Also fix a nixl duplication bug on cu13 images: the `nixl` stub package
has an unconditional requires_dist on nixl-cu12>=1.0.1, so installing
plain `nixl` in the essential-packages block pulled nixl-cu12 (~49 MB)
onto cu13 images on top of the subsequent nixl-cu13 install. Install
nixl-cu12 / nixl-cu13 directly in the per-CUDA-major block instead.

Validator (docker/validate_image.py): drop the nvidia-cublas hard-pin
assertion; it's now transitively pinned by torch and the Dockerfile no
longer force-reinstalls it. The torch-internal cross-check for cudnn
and nccl still runs and will assert against whatever torch 2.11 ships.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Correct Dockerfile comment pointer: validator is invoked from
  _docker-build-and-publish.yml, not release-docker-dev.yml
- Drop ~49 MB magnitude from nixl comment (rot-prone, not load-bearing)
- Add nixl smoke import to validator (this area was just restructured
  and otherwise unasserted)
- Convert PackageNotFoundError to AssertionError in the torch-bundled
  cudnn/nccl cross-checks so a missing dep produces a clean FAIL line
  instead of a raw traceback
- Retry docker pull in the 4 validation steps: DockerHub has brief
  eventual-consistency after push-by-digest

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Removes docker/validate_image.py and the four "Validate ..." steps from
.github/workflows/_docker-build-and-publish.yml. The remaining changes
(cu129 torch resolution fix, NVIDIA override cleanup, nixl per-CUDA install)
stand on their own; the validator can ship in a follow-up PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…u13 refs

Address findings from the code review:

1. release-docker.yml: tag_config JSON was malformed (missing comma after
   cu130 entry, trailing comma after cu129). fromJson would fail and break
   every tag-pushed release. Header comment also said latest-cu139.

2. docker/Dockerfile: restore the cu12 --index-url torch pre-install that
   the prior 'upd' commit dropped. With #21247 landed, torch 2.11 is the
   PyPI default at cu130, and --extra-index-url alone won't override it
   when both indexes publish the same version — cu126/cu128/cu129 images
   would silently ship cu130 torch.

3. Update consumers of the dropped dev-cu13 / latest-cu130-runtime tags
   to the new naming (dev = cu13, dev-cu12 = cu12): trivy-scan-dev.yml,
   nightly-72-gpu-gb200.yml, release-docker-dev.yml description,
   _docker-cleanup-nightly.yml examples, and
   scripts/ci/utils/docker_build_metadata_args.py MOVING_TAGS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Kangyan-Zhou force-pushed the update_docker_images branch from e28a506 to 348d4b4 on May 2, 2026 20:35
Kangyan-Zhou and others added 2 commits May 2, 2026 13:40
…0-runtime aliases

- docker/Dockerfile: install the nixl stub alongside nixl-cu12/nixl-cu13
  so the `nixl` import path keeps working. --no-deps prevents the stub's
  unconditional nixl-cu12 dep from shipping wrong-CUDA libs on cu13
  images. Addresses review feedback on PR #23593.

- release-docker-dev.yml: publish dev-cu13 as an alias of dev on the
  cu130 nightly and suffixed builds. Lets external consumers pinned to
  the pre-flip tag keep working.

- release-docker-runtime.yml: publish v{ver}-cu130-runtime and
  latest-cu130-runtime as aliases of the un-suffixed cu130 runtime tags.

- docker_build_metadata_args.py: add dev-cu13 to MOVING_TAGS so the
  metadata script still selects the immutable nightly-dev-{date}-{sha}
  tag for the build arg.

Revert the now-unnecessary description-text edits in trivy-scan-dev.yml,
nightly-72-gpu-gb200.yml, release-docker-dev.yml input description, and
_docker-cleanup-nightly.yml — the dev-cu13 / nightly-dev-cu13 names are
valid again with the alias in place. Trivy matrix stays as ["dev",
"dev-cu12"] so we scan both image variants instead of the same one twice.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…or smooth transition

Apply follow-ups from the review of PR #23593:

(a) release-docker.yml: publish v{ver}-cu130 and latest-cu130 as aliases
    on the cu130 framework release. Mirrors the runtime aliases added in
    the prior commit. Without this, B200/B300 consumers pinned to
    `lmsysorg/sglang:v0.5.X-cu130` would break on the next release.

(b) release-docker-dev.yml: add nightly-dev-cu13-{date}-{short_sha} to the
    cu130 nightly tag list so the immutable history tag keeps being
    published under the pre-flip name. select_tag still picks the
    canonical nightly-dev-{date}-{short_sha} as the build-arg image tag
    (dev-cu13 is in MOVING_TAGS, so it's skipped).

(c) release-docker-dev.yml: add nightly-dev-cu13 to the cleanup-nightly
    tag_prefixes so the historical cu13 tags get GC'd alongside the new
    nightly-dev / nightly-dev-cu12 prefixes.

Note: the cu12-default → cu13-default flip on `dev`, `latest`, `v{ver}`,
`latest-runtime`, and `v{ver}-runtime` is the unavoidable breaking change
in this PR — those tags move by definition and have no clean alias fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread docker/Dockerfile Outdated
RUN --mount=type=cache,target=/root/.cache/pip if [ "${CUDA_VERSION%%.*}" = "12" ]; then \
python3 -m pip install nvidia-nccl-cu12==2.28.3 --force-reinstall --no-deps ; \
python3 -m pip install nvidia-cudnn-cu12==9.16.0.29 --force-reinstall --no-deps ; \
python3 -m pip install nixl nixl-cu12 --no-deps ; \
Contributor:

Could be like this

Suggested change
python3 -m pip install nixl nixl-cu12 --no-deps ; \
python3 -m pip install nixl[cu12] --no-deps ; \

Comment thread docker/Dockerfile Outdated
python3 -m pip install nvidia-cudnn-cu13==9.16.0.29 --force-reinstall --no-deps ; \
python3 -m pip install nvidia-cublas==13.1.0.3 --force-reinstall --no-deps ; \
python3 -m pip install nixl-cu13 --no-deps ; \
python3 -m pip install nixl nixl-cu13 --no-deps ; \
Contributor:
Same

Suggested change
python3 -m pip install nixl nixl-cu13 --no-deps ; \
python3 -m pip install nixl[cu13] --no-deps ; \

Kangyan-Zhou and others added 4 commits May 2, 2026 16:34
…repository

`add-apt-repository ppa:deadsnakes/ppa` calls api.launchpad.net through
launchpadlib to look up the PPA owner. That endpoint has been timing out
on the self-hosted runners building the dev images, causing the nightly
release-docker-dev workflow to fail at the python3.12 install step:

    TimeoutError: [Errno 110] Connection timed out
    add-apt-repository → softwareproperties → launchpadlib → httplib2

Drop the launchpadlib dependency by writing the apt source list directly
and fetching the deadsnakes signing key from keyserver.ubuntu.com. The
build now only talks to ppa.launchpadcontent.net (the package mirror)
and keyserver.ubuntu.com, both of which are reachable from the runners.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The base and runtime stages both ran `add-apt-repository ppa:deadsnakes/ppa`
to pull Python 3.10 and Python 3.12. Ubuntu 24.04 (noble) already ships
python3.12 in `main`, and nothing in the image actually consumes Python
3.10 beyond an unused `update-alternatives` slot. Removing the PPA call
avoids transient Launchpad 504s during `add-apt-repository`, which has
broken the dev-image build.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Fridge003 (Collaborator) left a comment:

Verified locally

@Fridge003 merged commit 52b4609 into main on May 4, 2026
63 of 67 checks passed
@Fridge003 deleted the update_docker_images branch on May 4, 2026 07:37
Fridge003 added a commit that referenced this pull request May 4, 2026
…#23593)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
4 participants