[Dependency] Upgrade to Torch 2.11.0#21247

Merged
Kangyan-Zhou merged 30 commits into main from brayden/torch-211
May 2, 2026

Conversation

@b8zhong
Collaborator

@b8zhong b8zhong commented Mar 24, 2026

Motivation

github.com/pytorch/pytorch/releases/tag/v2.11.0

Modifications

github.com//pull/18862

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@github-actions github-actions Bot added the documentation, dependencies, and sgl-kernel labels Mar 24, 2026
@b8zhong b8zhong changed the title from "[Dependency]" to "[Dependency] Upgrade to Torch 2.11.0" Mar 24, 2026
@b8zhong
Collaborator Author

b8zhong commented Mar 24, 2026

/tag-and-rerun-ci

@johnnynunez
Contributor

johnnynunez commented Mar 24, 2026

viz @Fridge003 @merrymercy @FlamingoPg

@b8zhong b8zhong force-pushed the brayden/torch-211 branch 2 times, most recently from 346a818 to ca7a304 Compare March 25, 2026 14:10
@b8zhong
Collaborator Author

b8zhong commented Mar 26, 2026

/rerun-failed-ci AGAIN

@nvpohanh
Collaborator

@b8zhong What is the ETA of this upgrade? Thanks!

@Kangyan-Zhou Kangyan-Zhou force-pushed the brayden/torch-211 branch 2 times, most recently from df89633 to adb2ead Compare April 20, 2026 02:21
Kangyan-Zhou added a commit to Kangyan-Zhou/sglang that referenced this pull request Apr 20, 2026
The sgl_kernel path filter's extglob `sgl-kernel/**/*.!(md|txt)` puts
the negation inside the extension, which requires a literal dot in the
filename. Extensionless files like `sgl-kernel/Dockerfile`,
`sgl-kernel/Makefile`, and `sgl-kernel/LICENSE` therefore never trip
the filter — so editing only `sgl-kernel/Dockerfile` skips the
`sgl-kernel-build-wheels` job and CI falls back to the pre-built PyPI
wheel (recently hit by sgl-project#21247 when bumping torch to 2.11 via
sgl-kernel/Dockerfile only).

Move the negation to the basename level: `sgl-kernel/**/!(*.md|*.txt)`
matches any file under sgl-kernel/ whose basename does not end in
`.md` or `.txt`, including extensionless files. Single extglob keeps
us clear of the dorny/paths-filter multi-`!` ordering bug
(dorny/paths-filter#113, sgl-project#260).

Applied to all five workflows that shared the pattern: pr-test,
pr-test-amd, pr-test-amd-rocm720, pr-test-xeon, pr-test-xpu.

Verified locally with picomatch@2.3.1 (the version dorny uses, matched
with {dot: true} as in dorny/paths-filter/src/filter.ts):

  Dockerfile      old: skip  → new: match
  Makefile        old: skip  → new: match
  LICENSE         old: skip  → new: match
  README.md       old: skip  → new: skip (preserved)
  CMakeLists.txt  old: skip  → new: skip (preserved)
  build.sh/*.py/*.cu/*.toml: match (unchanged)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
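
The difference between the two extglob patterns in the commit above can be reproduced without picomatch. The sketch below is only an approximation: the two regular expressions are hand-written stand-ins for `sgl-kernel/**/*.!(md|txt)` and `sgl-kernel/**/!(*.md|*.txt)`, not output from picomatch or dorny/paths-filter, and the helper name `trips_filter` is hypothetical.

```python
import re

# Any number of directory levels under sgl-kernel/ (stand-in for `sgl-kernel/**/`).
DIR_PREFIX = r"sgl-kernel/(?:[^/]+/)*"

# Old pattern: negation sits inside the extension, so the basename needs a literal dot.
OLD_FILTER = re.compile(DIR_PREFIX + r"[^/]*\.(?!(?:md|txt)$)[^/]*")

# New pattern: negation applies to the whole basename, so extensionless files match.
NEW_FILTER = re.compile(DIR_PREFIX + r"(?![^/]*\.(?:md|txt)$)[^/]+")


def trips_filter(pattern: re.Pattern, path: str) -> bool:
    """Return True when the whole path matches the given filter regex."""
    return pattern.fullmatch(path) is not None


# Print the old/new verdict for the files listed in the commit message.
for name in ["Dockerfile", "Makefile", "LICENSE", "README.md", "CMakeLists.txt", "build.sh"]:
    path = f"sgl-kernel/{name}"
    old = "match" if trips_filter(OLD_FILTER, path) else "skip"
    new = "match" if trips_filter(NEW_FILTER, path) else "skip"
    print(f"{name:<15} old: {old:<5} new: {new}")
```

The regexes only cover the filenames named in the commit message; behavior on other inputs (for example names with several dots) may differ from picomatch.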
@Kangyan-Zhou
Collaborator

/rerun-stage multimodal-gen-component-accuracy

@Kangyan-Zhou
Collaborator

/rerun-stage multimodal-gen-component-accuracy-1-gpu

@Kangyan-Zhou
Collaborator

/rerun-stage multimodal-gen-component-accuracy-2-gpu

@Kangyan-Zhou
Collaborator

/rerun-stage multimodal-gen-test-1-b200

@github-actions
Contributor

github-actions Bot commented May 2, 2026

✅ Triggered multimodal-gen-component-accuracy to run independently (skipping dependencies). View workflow run

@github-actions
Contributor

github-actions Bot commented May 2, 2026

✅ Triggered multimodal-gen-test-2-gpu to run independently (skipping dependencies). View workflow run

@github-actions
Contributor

github-actions Bot commented May 2, 2026

✅ Triggered multimodal-gen-test-1-gpu to run independently (skipping dependencies). View workflow run

@github-actions
Contributor

github-actions Bot commented May 2, 2026

✅ Triggered multimodal-gen-test-1-b200 to run independently (skipping dependencies). View workflow run

@github-actions
Contributor

github-actions Bot commented May 2, 2026

✅ Triggered multimodal-gen-component-accuracy-1-gpu to run independently (skipping dependencies). View workflow run

@github-actions
Contributor

github-actions Bot commented May 2, 2026

✅ Triggered multimodal-gen-component-accuracy-2-gpu to run independently (skipping dependencies). View workflow run

@Kangyan-Zhou
Collaborator

/rerun-stage multimodal-gen-test-1-gpu

@github-actions
Contributor

github-actions Bot commented May 2, 2026

✅ Triggered multimodal-gen-test-1-gpu to run independently (skipping dependencies). View workflow run

@Kangyan-Zhou
Collaborator

/rerun-stage multimodal-gen-test-1-gpu

@github-actions
Contributor

github-actions Bot commented May 2, 2026

✅ Triggered multimodal-gen-test-1-gpu to run independently (skipping dependencies). View workflow run

vguduruTT pushed a commit to vguduruTT/sglang that referenced this pull request May 2, 2026
sgl-project#24093)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Kangyan-Zhou Kangyan-Zhou merged commit 88bb5df into main May 2, 2026
119 of 171 checks passed
@Kangyan-Zhou Kangyan-Zhou deleted the brayden/torch-211 branch May 2, 2026 19:25
Kangyan-Zhou added a commit that referenced this pull request May 2, 2026
…alidator

Torch 2.11 ships cu130 wheels as PyPI's default, which broke two install
paths in the cu12x Dockerfile branch:

1. sgl-kernel install on cu128/cu129 (Dockerfile:205) was missing
   --force-reinstall --no-deps, so pip resolved sglang-kernel's torch dep
   and pulled a cu130 torch from PyPI into a cu129 image. Made consistent
   with the cu126/cu130 branches.

2. The main sglang dep install relied on --extra-index-url, which isn't
   strong enough to force cu12x resolution when both indexes publish the
   same version string. Pre-install torch/torchvision/torchaudio from the
   cu12x index with --index-url before the main install.

Also adds docker/validate_image.py, a post-build validator invoked from
release-docker-dev.yml after push-by-digest. It asserts torch.version.cuda
matches the matrix CUDA_VERSION, cross-checks torch's compiled-in
cudnn/nccl against the installed PyPI wheel (catches silent downgrades),
hard-pins cuda-python and nvidia-cublas, and smoke-imports critical
packages. Modeled after pytorch/pytorch's .ci/pytorch/smoke_test pattern.

Additional changes:
- Default ARG CUDA_VERSION bumped to 13.0.1 (only affects ad-hoc local
  builds; release workflow always passes --build-arg explicitly)
- nvidia-cutlass-dsl tightened from >=4.4.1 to ==4.4.2
- docker/diffusion.Dockerfile removed (no remaining references)

Companion to #21247 (torch 2.11 upgrade).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
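
The first assertion described for the validator above (torch.version.cuda must match the matrix CUDA_VERSION) could look roughly like the sketch below. This is not the contents of docker/validate_image.py: the function names are hypothetical, and comparing only the major.minor components is an assumption made because `torch.version.cuda` reports values such as "13.0" while the build argument carries a patch level such as "13.0.1".

```python
from typing import Optional, Tuple


def cuda_major_minor(version: str) -> Tuple[int, int]:
    """Reduce a CUDA version string such as '13.0.1' or '12.9' to (major, minor)."""
    parts = version.strip().split(".")
    return int(parts[0]), int(parts[1])


def check_torch_cuda(torch_cuda: Optional[str], matrix_cuda_version: str) -> None:
    """Raise AssertionError when torch's CUDA build differs from the image's CUDA version.

    torch_cuda is the value of torch.version.cuda, which is None on CPU-only builds.
    """
    if torch_cuda is None:
        raise AssertionError("torch was built without CUDA support")
    found = cuda_major_minor(torch_cuda)
    expected = cuda_major_minor(matrix_cuda_version)
    if found != expected:
        raise AssertionError(
            f"torch built for CUDA {found[0]}.{found[1]}, "
            f"image expects {expected[0]}.{expected[1]}"
        )
```

Inside an image this would be called as `check_torch_cuda(torch.version.cuda, cuda_version)`, where `cuda_version` is however the workflow passes the matrix value in (an environment variable or a command-line argument; the source does not say which). A cu129 image that silently received a cu130 torch would fail this check.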
Kangyan-Zhou added a commit that referenced this pull request May 2, 2026
Torch 2.11's wheel metadata already pins the NVIDIA libs we were
force-reinstalling, and PR #21247 applied the same cleanup in
scripts/ci/cuda/ci_install_dependency.sh. Align the Dockerfile:

- nvidia-nccl-cu12/cu13==2.28.3: torch 2.11 ships 2.28.9 (pinning older
  was a silent downgrade)
- nvidia-cudnn-cu12==9.16.0.29: torch 2.11 ships 9.17.1.4 (downgrade)
- nvidia-cudnn-cu13==9.16.0.29: torch 2.11 ships 9.19.0.56 (downgrade)
- nvidia-cublas==13.1.0.3: already pulled transitively by
  cuda-toolkit[cublas]==13.0.2 at the exact same version
- nvidia-cutlass-dsl==4.4.2 force-reinstall: already pinned in
  python/pyproject.toml and resolved by the main sglang dep install

Also fix a nixl duplication bug on cu13 images: the `nixl` stub package
has an unconditional requires_dist on nixl-cu12>=1.0.1, so installing
plain `nixl` in the essential-packages block pulled nixl-cu12 (~49 MB)
onto cu13 images on top of the subsequent nixl-cu13 install. Install
nixl-cu12 / nixl-cu13 directly in the per-CUDA-major block instead.

Validator (docker/validate_image.py): drop the nvidia-cublas hard-pin
assertion; it's now transitively pinned by torch and the Dockerfile no
longer force-reinstalls it. The torch-internal cross-check for cudnn
and nccl still runs and will assert against whatever torch 2.11 ships.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
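
The "silent downgrade" reasoning in the commit above comes down to comparing each removed pin with the version torch 2.11 ships. A minimal sketch of that comparison follows, using only the version numbers quoted in the commit message; the helper names and the `PINS` table are hypothetical, and the parser assumes purely numeric dotted versions.

```python
from typing import Dict, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a numeric dotted version such as '9.16.0.29' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


def is_silent_downgrade(pinned: str, shipped: str) -> bool:
    """True when the force-reinstalled pin is older than what torch already ships."""
    return parse_version(pinned) < parse_version(shipped)


# (pinned in the old Dockerfile, shipped or resolved with torch 2.11), per the commit message.
PINS: Dict[str, Tuple[str, str]] = {
    "nvidia-nccl": ("2.28.3", "2.28.9"),
    "nvidia-cudnn-cu12": ("9.16.0.29", "9.17.1.4"),
    "nvidia-cudnn-cu13": ("9.16.0.29", "9.19.0.56"),
    "nvidia-cublas": ("13.1.0.3", "13.1.0.3"),
}

downgrades = sorted(name for name, (pin, ship) in PINS.items() if is_silent_downgrade(pin, ship))
redundant = sorted(name for name, (pin, ship) in PINS.items() if pin == ship)
print("downgrades:", downgrades)
print("redundant:", redundant)
```

Pins in the first group were replacing newer libraries with older ones; the cublas pin was merely redundant because the same version was already resolved transitively.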
Kangyan-Zhou added a commit that referenced this pull request May 2, 2026
…u13 refs

Address findings from the code review:

1. release-docker.yml: tag_config JSON was malformed (missing comma after
   cu130 entry, trailing comma after cu129). fromJson would fail and break
   every tag-pushed release. Header comment also said latest-cu139.

2. docker/Dockerfile: restore the cu12 --index-url torch pre-install that
   the prior 'upd' commit dropped. With #21247 landed, torch 2.11 is the
   PyPI default at cu130, and --extra-index-url alone won't override it
   when both indexes publish the same version — cu126/cu128/cu129 images
   would silently ship cu130 torch.

3. Update consumers of the dropped dev-cu13 / latest-cu130-runtime tags
   to the new naming (dev = cu13, dev-cu12 = cu12): trivy-scan-dev.yml,
   nightly-72-gpu-gb200.yml, release-docker-dev.yml description,
   _docker-cleanup-nightly.yml examples, and
   scripts/ci/utils/docker_build_metadata_args.py MOVING_TAGS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
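
Both JSON mistakes named in item 1 above (a missing comma between entries and a trailing comma after the last one) are rejected by strict JSON parsers. The sketch below uses Python's `json` module as a stand-in for the workflow's fromJson; the `tag_config` entries shown are hypothetical, since the real contents of release-docker.yml are not quoted in this thread.

```python
import json
from typing import Optional


def tag_config_error(raw: str) -> Optional[str]:
    """Return a short parse-error description, or None when raw is valid JSON."""
    try:
        json.loads(raw)
    except json.JSONDecodeError as exc:
        return f"line {exc.lineno}, column {exc.colno}: {exc.msg}"
    return None


# Hypothetical tag_config variants illustrating the two defects and the fix.
MISSING_COMMA = '[{"cuda": "cu130", "tag": "latest"} {"cuda": "cu129", "tag": "latest-cu129"}]'
TRAILING_COMMA = '[{"cuda": "cu130", "tag": "latest"}, {"cuda": "cu129", "tag": "latest-cu129"},]'
FIXED = '[{"cuda": "cu130", "tag": "latest"}, {"cuda": "cu129", "tag": "latest-cu129"}]'

for label, raw in [("missing comma", MISSING_COMMA), ("trailing comma", TRAILING_COMMA), ("fixed", FIXED)]:
    print(f"{label}: {tag_config_error(raw) or 'ok'}")
```

Running a check like this over the embedded JSON before a release tag is pushed would surface the error at review time rather than in the release workflow.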
Labels

amd, deepseek, dependencies, diffusion, documentation, high priority, Multi-modal, npu, run-ci, sgl-kernel

6 participants