[Dependency] Upgrade to Torch 2.11.0 #21247
Conversation
Warning: You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!
/tag-and-rerun-ci
Force-pushed 346a818 to ca7a304
/rerun-failed-ci AGAIN
@b8zhong What is the ETA of this upgrade? Thanks!
Force-pushed df89633 to adb2ead
The sgl_kernel path filter's extglob `sgl-kernel/**/*.!(md|txt)` puts the negation inside the extension, which requires a literal dot in the filename. Extensionless files like `sgl-kernel/Dockerfile`, `sgl-kernel/Makefile`, and `sgl-kernel/LICENSE` therefore never trip the filter, so editing only `sgl-kernel/Dockerfile` skips the `sgl-kernel-build-wheels` job and CI falls back to the pre-built PyPI wheel (recently hit by sgl-project#21247 when bumping torch to 2.11 via sgl-kernel/Dockerfile only).

Move the negation to the basename level: `sgl-kernel/**/!(*.md|*.txt)` matches any file under sgl-kernel/ whose basename does not end in `.md` or `.txt`, including extensionless files. Keeping it to a single extglob also steers clear of the dorny/paths-filter multi-`!` ordering bug (dorny/paths-filter#113, sgl-project#260).

Applied to all five workflows that shared the pattern: pr-test, pr-test-amd, pr-test-amd-rocm720, pr-test-xeon, pr-test-xpu.

Verified locally with picomatch@2.3.1 (the version dorny/paths-filter uses, matched with {dot: true} as in dorny/paths-filter/src/filter.ts):

- Dockerfile: old skip → new match
- Makefile: old skip → new match
- LICENSE: old skip → new match
- README.md: old skip → new skip (preserved)
- CMakeLists.txt: old skip → new skip (preserved)
- build.sh / *.py / *.cu / *.toml: match (unchanged)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
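For anyone who wants to re-check the filter semantics locally without Node, here is a rough equivalent of that verification. It uses Python's wcmatch library as a stand-in for picomatch; that substitution is an assumption (the extglob/globstar semantics are comparable for these patterns), and the commit's actual picomatch@2.3.1 run remains the source of truth for dorny/paths-filter behavior.

```python
# Sketch: compare old vs. new sgl_kernel path-filter patterns, using wcmatch
# as a stand-in for picomatch (an assumption; not what dorny/paths-filter runs).
from wcmatch import glob

OLD = "sgl-kernel/**/*.!(md|txt)"    # negation inside the extension (needs a dot)
NEW = "sgl-kernel/**/!(*.md|*.txt)"  # negation at the basename level
# DOTGLOB approximates picomatch's {dot: true}; GLOBSTAR lets ** match zero dirs.
FLAGS = glob.EXTGLOB | glob.GLOBSTAR | glob.DOTGLOB

for path in [
    "sgl-kernel/Dockerfile",      # extensionless: must now match
    "sgl-kernel/Makefile",
    "sgl-kernel/LICENSE",
    "sgl-kernel/README.md",       # docs: must stay skipped
    "sgl-kernel/CMakeLists.txt",
    "sgl-kernel/build.sh",        # regular sources: unchanged
]:
    old = glob.globmatch(path, OLD, flags=FLAGS)
    new = glob.globmatch(path, NEW, flags=FLAGS)
    print(f"{path}: old={'match' if old else 'skip'} -> new={'match' if new else 'skip'}")
```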
/rerun-stage multimodal-gen-component-accuracy
/rerun-stage multimodal-gen-component-accuracy-1-gpu
/rerun-stage multimodal-gen-component-accuracy-2-gpu
/rerun-stage multimodal-gen-test-1-b200
✅ Triggered
✅ Triggered
✅ Triggered
✅ Triggered
✅ Triggered
✅ Triggered
/rerun-stage multimodal-gen-test-1-gpu
✅ Triggered
/rerun-stage multimodal-gen-test-1-gpu
✅ Triggered
…(sgl-project#24093) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…alidator

Torch 2.11 ships cu130 wheels as PyPI's default, which broke two install paths in the cu12x Dockerfile branch:

1. The sgl-kernel install on cu128/cu129 (Dockerfile:205) was missing --force-reinstall --no-deps, so pip resolved sglang-kernel's torch dependency and pulled a cu130 torch from PyPI into a cu129 image. Made consistent with the cu126/cu130 branches.
2. The main sglang dep install relied on --extra-index-url, which isn't strong enough to force cu12x resolution when both indexes publish the same version string. Pre-install torch/torchvision/torchaudio from the cu12x index with --index-url before the main install.

Also adds docker/validate_image.py, a post-build validator invoked from release-docker-dev.yml after push-by-digest. It asserts that torch.version.cuda matches the matrix CUDA_VERSION, cross-checks torch's compiled-in cudnn/nccl against the installed PyPI wheel (catches silent downgrades), hard-pins cuda-python and nvidia-cublas, and smoke-imports critical packages. Modeled after pytorch/pytorch's .ci/pytorch/smoke_test pattern.

Additional changes:
- Default ARG CUDA_VERSION bumped to 13.0.1 (only affects ad-hoc local builds; the release workflow always passes --build-arg explicitly)
- nvidia-cutlass-dsl tightened from >=4.4.1 to ==4.4.2
- docker/diffusion.Dockerfile removed (no remaining references)

Companion to #21247 (torch 2.11 upgrade).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
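For context, a minimal sketch of the kind of checks described above, assuming it runs inside the freshly built image. The helper names, the EXPECTED_CUDA variable, and the cudnn version-encoding arithmetic are illustrative assumptions, not the actual contents of docker/validate_image.py.

```python
# Sketch of post-build image validation (hypothetical; mirrors the described
# checks, not the real docker/validate_image.py).
import importlib.metadata
import os

import torch

def check_cuda_version() -> None:
    # torch.version.cuda is e.g. "12.9" for a cu129 build; the workflow is
    # assumed to pass the matrix CUDA_VERSION (e.g. "12.9.1") into the image.
    expected = os.environ["EXPECTED_CUDA"]
    assert expected.startswith(torch.version.cuda), (
        f"torch built for CUDA {torch.version.cuda}, image expects {expected}"
    )

def check_cudnn_matches_wheel(wheel: str = "nvidia-cudnn-cu12") -> None:
    # Cross-check torch's compiled-in cudnn against the installed PyPI wheel
    # to catch a silent downgrade. Encoding assumed: 9.17.1 -> 91701
    # (major*10000 + minor*100 + patch).
    compiled = torch.backends.cudnn.version()
    major, minor, patch = importlib.metadata.version(wheel).split(".")[:3]
    expected = int(major) * 10000 + int(minor) * 100 + int(patch)
    assert compiled == expected, f"cudnn mismatch: {compiled} vs {expected}"

def smoke_import(*modules: str) -> None:
    for name in modules:
        __import__(name)  # fail fast if a critical package cannot import

if __name__ == "__main__":
    check_cuda_version()
    check_cudnn_matches_wheel()
    smoke_import("torchvision", "torchaudio", "sglang")
    print("image validation passed")
```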
Torch 2.11's wheel metadata already pins the NVIDIA libs we were force-reinstalling, and PR #21247 applied the same cleanup in scripts/ci/cuda/ci_install_dependency.sh. Align the Dockerfile by dropping:

- nvidia-nccl-cu12/cu13==2.28.3: torch 2.11 ships 2.28.9 (pinning the older version was a silent downgrade)
- nvidia-cudnn-cu12==9.16.0.29: torch 2.11 ships 9.17.1.4 (downgrade)
- nvidia-cudnn-cu13==9.16.0.29: torch 2.11 ships 9.19.0.56 (downgrade)
- nvidia-cublas==13.1.0.3: already pulled transitively by cuda-toolkit[cublas]==13.0.2 at the exact same version
- nvidia-cutlass-dsl==4.4.2 force-reinstall: already pinned in python/pyproject.toml and resolved by the main sglang dep install

Also fix a nixl duplication bug on cu13 images: the `nixl` stub package has an unconditional requires_dist on nixl-cu12>=1.0.1, so installing plain `nixl` in the essential-packages block pulled nixl-cu12 (~49 MB) onto cu13 images on top of the subsequent nixl-cu13 install. Install nixl-cu12 / nixl-cu13 directly in the per-CUDA-major block instead; the stub's metadata can be confirmed with the sketch below.

Validator (docker/validate_image.py): drop the nvidia-cublas hard-pin assertion; it is now transitively pinned by torch, and the Dockerfile no longer force-reinstalls it. The torch-internal cross-check for cudnn and nccl still runs and will assert against whatever torch 2.11 ships.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
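The requires_dist claim is easy to verify locally in any environment where the `nixl` stub is installed:

```python
# Inspect the nixl stub's declared dependencies; per the commit message, the
# dependency on nixl-cu12 carries no environment marker, so it is pulled
# unconditionally, even on cu13 images.
from importlib.metadata import requires

deps = requires("nixl") or []
print(deps)  # expected to include an entry like 'nixl-cu12>=1.0.1'
assert any(d.startswith("nixl-cu12") for d in deps)
```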
…u13 refs

Address findings from the code review:

1. release-docker.yml: the tag_config JSON was malformed (missing comma after the cu130 entry, trailing comma after cu129). fromJson would fail and break every tag-pushed release. The header comment also said latest-cu139.
2. docker/Dockerfile: restore the cu12 --index-url torch pre-install that the prior 'upd' commit dropped. With #21247 landed, torch 2.11 is the PyPI default at cu130, and --extra-index-url alone won't override it when both indexes publish the same version; cu126/cu128/cu129 images would silently ship cu130 torch.
3. Update consumers of the dropped dev-cu13 / latest-cu130-runtime tags to the new naming (dev = cu13, dev-cu12 = cu12): trivy-scan-dev.yml, nightly-72-gpu-gb200.yml, the release-docker-dev.yml description, _docker-cleanup-nightly.yml examples, and MOVING_TAGS in scripts/ci/utils/docker_build_metadata_args.py.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
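To see why point 1 is release-breaking, note that strict JSON parsers (including the one behind GitHub Actions' fromJson) reject both a missing comma between entries and a trailing comma. The keys and values below are hypothetical; only the two syntax errors mirror the actual tag_config bug.

```python
# Demonstrate that either comma error makes strict JSON parsing fail outright.
import json

malformed = '{"cu130": "latest"  "cu129": "latest-cu129",}'  # missing comma, then trailing comma
try:
    json.loads(malformed)
except json.JSONDecodeError as e:
    print(f"fromJson would fail the same way: {e}")

fixed = '{"cu130": "latest", "cu129": "latest-cu129"}'
print(json.loads(fixed))  # parses cleanly once both commas are corrected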
Motivation
github.com/pytorch/pytorch/releases/tag/v2.11.0
Modifications
github.com/sgl-project/sglang/pull/18862