[BE] Make spmd_type a CI rather than CD dependency by malfet · Pull Request #187067 · pytorch/pytorch

malfet · 2026-06-11T18:23:56Z

Stack from ghstack (oldest at bottom):

-> [BE] Make spmd_type a CI rather than CD dependency #187067

[ghstack-poisoned]

pytorch-bot · 2026-06-11T18:24:00Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/187067

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 68 Pending

As of commit 68013e4 with merge base 428e02e:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pianpwk · 2026-06-11T18:28:46Z

#187015 should be landing for binary populate env

atalman

lgtm

malfet · 2026-06-11T22:39:56Z

@pytorchbot merge -f "Let's test in prod, I don't understand the signal anymore"

pytorchmergebot · 2026-06-11T22:42:07Z

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

huydhn · 2026-06-12T00:24:55Z

@pytorchbot drci

pytorch-auto-revert · 2026-06-12T01:00:50Z

@pytorchbot revert -m "Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable" -c autorevert

This PR is attributed to have caused regression in:

inductor: unit-test / inductor-cpu-build / build-osdc (hud), inductor-build / build-osdc (hud), unit-test / inductor-cpu-core-build (3.11) / build-osdc (hud), unit-test / inductor-build / build-osdc (hud)
trunk: inductor-build / build-osdc (hud), verify-cachebench-cpu-build / build-osdc (hud)

Note: This revert was accelerated by the AI advisor: AI advisor: revert (confidence=0.95); AI advisor: revert (confidence=0.90); AI advisor: revert (confidence=0.90); AI advisor: revert (confidence=0.92); AI advisor: revert (confidence=0.92); AI advisor: revert (confidence=0.90)

Please investigate and fix the issues.
@claude Can you please read this revert comment, follow the links and read the errors, to then give a brief diagnostics on the cause of the error? If you judge the error to be legitimate reason for a revert, please provide brief guidance on how the author could fix it.

claude · 2026-06-12T01:01:19Z

Claude finished @pytorch-auto-revert[bot]'s task in 1m 59s

Revert diagnostics

Read the revert comment and PR changes
Followed all six cited build-osdc job links and read the logs
Diagnosed root cause
Judged legitimacy + fix guidance

What the failing jobs actually show

All six cited jobs (inductor-cpu-build, inductor-build, inductor-cpu-core-build, verify-cachebench-cpu-build, all the build-osdc variants) fail before the build starts, with the runner pod never coming online; none of them is a compile/build error. The representative log:

Pod ...-workflow: phase=Pending, waiting=[job: ImagePullBackOff] / ErrImagePull  (7207s/7200s)
[runner-container-hooks] FATAL: Error: pod failed to come online ... Pending: backoff timeout
##[error]Process completed with exit code 1.

The runner pod sat in ErrImagePull/ImagePullBackOff for the full 2-hour timeout and never even started the build. The image it was trying to pull:

.../pytorch/ci-image:pytorch-linux-jammy-cuda13.0-cudnn9-py3-gcc11-inductor-benchmarks-<hash>

So the jobs failed because the inductor-benchmarks CI Docker image was never successfully built/pushed to ECR — the consumers had nothing to pull.

(Note: the unit-test / inductor-cpu-build / build-osdc job in the list is a separate transient infra failure: its pod failed with a plain backoff timeout and no ErrImagePull. The real, reproducible signal is the ImagePullBackOff on the benchmarks image.)
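
The distinction drawn in this comment between an image-pull failure and a plain backoff timeout can be sketched as a small log classifier. This is only an illustration, assuming the pod log looks like the excerpt quoted here; `classify_pod_failure` and its return labels are hypothetical names, not part of any PyTorch CI tooling.

```python
import re

# Markers taken from the log excerpt quoted in this comment.
IMAGE_PULL_RE = re.compile(r"ErrImagePull|ImagePullBackOff")
BACKOFF_TIMEOUT = "backoff timeout"


def classify_pod_failure(log_text: str) -> str:
    """Label a runner-pod log (hypothetical helper, for illustration only)."""
    if IMAGE_PULL_RE.search(log_text):
        # The pod could not pull its container image.
        return "image-pull"
    if BACKOFF_TIMEOUT in log_text:
        # The pod never came online, but no image-pull marker is present.
        return "backoff-timeout"
    return "other"


if __name__ == "__main__":
    sample = (
        "Pod ...-workflow: phase=Pending, waiting=[job: ImagePullBackOff] / ErrImagePull\n"
        "[runner-container-hooks] FATAL: Error: pod failed to come online ... "
        "Pending: backoff timeout\n"
    )
    print(classify_pod_failure(sample))
```

The image-pull check runs first because, as in the excerpt, an image-pull failure also ends in a backoff timeout message.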

Why this PR caused it

This PR adds spmd-types==0.2.1 to .ci/docker/requirements-ci.txt. Per CLAUDE.md, any change under .ci/docker/ rebuilds the CI Docker images, so this PR forces a rebuild of the inductor-benchmarks image. That rebuild is broken because spmd-types declares torch>=2.10.0 and therefore drags torch into the base conda env, where it collides with the explicit cu128/cu130 torch install in install_inductor_benchmark_deps.sh:48. This is the torch-version-mismatch chain huydhn documented in this thread:

spmd-types==0.2.1 (torch>=2.10.0)
  -> install_conda.sh pre-installs torch (newest dev build) from PyPI
    -> cu12x/cu13x install sees torch "already satisfied" (wrong index)
      -> resolver backtracks -> incoherent torch/vision/audio trio

The image build doesn't produce a valid image -> ECR has no image -> every downstream build-osdc pod hits ImagePullBackOff.

Is the revert legitimate?

Yes. The failures are caused by this PR (it triggers the broken inductor-benchmarks image rebuild), they are reproducible across trunk and PR workflows, and the merge was done with -f ("test in prod"), bypassing the signal that would have caught this. Reverting unblocks trunk.

Fix guidance for the author

The PR's intent is sound (moving spmd_types from a CD/binary_populate_env.sh dependency to a CI dependency). The problem is how it's added to requirements-ci.txt. Recommended fixes, in order of preference:

Stop spmd-types from pulling in torch. Install it without its deps so it can't poison the base env, e.g. pip install --no-deps spmd-types==0.2.1 in the relevant install step rather than letting the resolver pull torch>=2.10.0. spmd-types is described as type stubs, so --no-deps should be safe.
Harden install_inductor_benchmark_deps.sh:48 to be order-independent: pip uninstall -y torch torchvision torchaudio first, or add --force-reinstall, so the cu128/cu130 trio always re-resolves cleanly regardless of what's already in the env. This also protects against future regressions of the same shape.
After fixing, do not -f merge — let the docker-builds / inductor-benchmarks image job run so the rebuild is validated before landing.
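
The first two suggestions can be sketched as an ordered list of pip invocations. This is a rough sketch, not the contents of install_inductor_benchmark_deps.sh: `pip_cmd` and `fix_commands` are hypothetical helpers, the cu128 index URL is the one used in the reproduction snippet in this thread, and the commands are only built here, never executed.

```python
import sys
from typing import List

# Index URL as used in the reproduction snippet in this thread.
CU128_INDEX = "https://download.pytorch.org/whl/cu128"
TRIO = ["torch", "torchvision", "torchaudio"]


def pip_cmd(*args: str) -> List[str]:
    """Build a 'python -m pip ...' argument list (hypothetical helper)."""
    return [sys.executable, "-m", "pip", *args]


def fix_commands(index_url: str = CU128_INDEX) -> List[List[str]]:
    """Return the suggested commands in order, without running them."""
    return [
        # 1. Install spmd-types without its dependencies so it cannot pull torch.
        pip_cmd("install", "--no-deps", "spmd-types==0.2.1"),
        # 2. Remove any torch trio already present in the environment.
        pip_cmd("uninstall", "-y", *TRIO),
        # 3. Install the trio from the CUDA-specific index.
        pip_cmd("install", *TRIO, "--index-url", index_url),
    ]


if __name__ == "__main__":
    for cmd in fix_commands():
        print(" ".join(cmd))
```

Each list could be handed to `subprocess.run(cmd, check=True)`. Whether `--no-deps` is safe still depends on spmd-types not needing torch at import time, which this sketch does not verify.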

--- · branch claude/pr-187067-20260612-0101

pytorchmergebot · 2026-06-12T01:03:10Z

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

This reverts commit d4c98cd. Reverted #187067 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable (comment on #187067)

pytorchmergebot · 2026-06-12T01:03:18Z

@malfet your PR has been successfully reverted.

huydhn · 2026-06-12T01:04:37Z

Heads up: this PR causes a torch version skew in the inductor-benchmarks image build -- it installs torch-2.10.0 with torchaudio-2.11.0 / torchvision-0.25.0 instead of the matching torch-2.11.0 trio.

Cause: the new spmd-types==0.2.1 dep (torch>=2.10.0) makes install_conda.sh pre-install torch 2.12.0 from PyPI. Then install_inductor_benchmark_deps.sh:48 installs from the cu128 index (max 2.11.0); since torch 2.12.0 is "already satisfied" but unavailable there, pip backtracks torch+torchvision to 2.10.0 / 0.25.0. torchaudio 2.11.0 stays put because it no longer pins torch (2.10.0 / 2.9.1 do; 2.11.0 dropped the pin), so torchvision drives the downgrade and the skew goes unnoticed.

Why it lands on 2.10.0 / 0.25.0 instead of the 2.11.0 / 0.26.0 pair (theory): pip does try torch 2.11.0 + torchvision 0.26.0 first, then rejects it -- and I think the pivot is cupti, not torch itself. torch 2.11.0+cu128 pulls its CUDA deps via the cuda-toolkit==12.8.1 metapackage, whose cupti extra needs the unsuffixed nvidia-cuda-cupti==12.8.x. But requirements-ci.txt:410 directly pins cupti-python>13.0, which requires nvidia-cuda-cupti~=13.0. Those collide on the same package, so the 2.11.0 branch is unsatisfiable while keeping cupti-python. torch 2.10.0+cu128 instead pins the cu12-suffixed nvidia-cuda-cupti-cu12==12.8.90 (a different package cupti-python doesn't constrain), so pip backtracks to it cleanly.
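
The cupti theory above comes down to two requirements on the same package that no single version can meet. A toy model of that collision follows, assuming the two specifiers are exactly as stated in this comment (`==12.8.x` and `~=13.0`). The candidate version strings are made up for illustration and are not a listing of published releases.

```python
from typing import Iterable, List, Tuple

Version = Tuple[int, ...]


def parse(version: str) -> Version:
    """'12.8.90' -> (12, 8, 90). Handles plain dotted integers only."""
    return tuple(int(part) for part in version.split("."))


def satisfies_eq_12_8_x(v: Version) -> bool:
    """nvidia-cuda-cupti==12.8.x (what the cuda-toolkit cupti extra is said to need)."""
    return v[:2] == (12, 8)


def satisfies_compat_13_0(v: Version) -> bool:
    """nvidia-cuda-cupti~=13.0, i.e. any 13.x release (>=13.0, ==13.*)."""
    return v[0] == 13


def jointly_satisfying(candidates: Iterable[str]) -> List[str]:
    """Return the candidates that meet both requirements at once."""
    return [
        c for c in candidates
        if satisfies_eq_12_8_x(parse(c)) and satisfies_compat_13_0(parse(c))
    ]


if __name__ == "__main__":
    # Illustrative candidates only (assumption, not real release data).
    print(jointly_satisfying(["12.8.0", "12.8.90", "13.0.0", "13.1.0"]))
```

Because the two ranges are disjoint, the result is empty for any candidate list. That is consistent with pip abandoning the 2.11.0 branch, but it does not prove this is what the resolver actually did.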

Fix: uninstall torch/vision/audio (or --force-reinstall) before the cu128 install so the trio re-resolves cleanly; or keep spmd-types from pulling torch into the base env.

(Comment generated with the help of Claude.)

huydhn · 2026-06-12T01:35:31Z

Reproducible snippet on an empty venv

python --version
Python 3.10.20

pip install spmd-types==0.2.1
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128 --no-cache-dir
# Note the mismatch: torchaudio-2.11.0+cu128 is installed alongside torch-2.10.0+cu128

python -c 'import torchaudio'
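
The mismatch this snippet produces could also be flagged programmatically. Below is a minimal sketch, assuming torchaudio shares torch's major.minor version and using only the torch/torchvision pairs named in this thread (2.10 with 0.25, 2.11 with 0.26). `find_skew` and `major_minor` are hypothetical helpers.

```python
from typing import Dict, List

# torch -> torchvision major.minor pairs named in this thread only.
EXPECTED_VISION: Dict[str, str] = {"2.10": "0.25", "2.11": "0.26"}


def major_minor(version: str) -> str:
    """'2.10.0+cu128' -> '2.10' (drops the local '+cu128' suffix and the patch)."""
    public = version.split("+", 1)[0]
    return ".".join(public.split(".")[:2])


def find_skew(torch_v: str, vision_v: str, audio_v: str) -> List[str]:
    """Return human-readable mismatches; an empty list means the trio is coherent."""
    problems = []
    torch_mm = major_minor(torch_v)
    # Assumption: torchaudio is released with the same major.minor as torch.
    if major_minor(audio_v) != torch_mm:
        problems.append(f"torchaudio {audio_v} does not match torch {torch_v}")
    expected = EXPECTED_VISION.get(torch_mm)
    # Unknown torch versions are skipped rather than guessed at.
    if expected is not None and major_minor(vision_v) != expected:
        problems.append(f"torchvision {vision_v} does not match torch {torch_v}")
    return problems


if __name__ == "__main__":
    # Versions reported in this thread for the skewed install.
    for line in find_skew("2.10.0+cu128", "0.25.0+cu128", "2.11.0+cu128"):
        print(line)
```

In a real environment the three version strings could be read with `importlib.metadata.version("torch")` and its equivalents for the other two packages.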

[ghstack-poisoned]

ghstack-source-id: 756af3e Pull-Request: #187067

[ghstack-poisoned]

ghstack-source-id: 877122e Pull-Request: #187067

Update

4ab90a4

[ghstack-poisoned]

malfet requested a review from a team as a code owner June 11, 2026 18:23

pytorch-bot Bot added the ciflow/docker and release notes: releng (release notes category) labels Jun 11, 2026

pianpwk approved these changes Jun 11, 2026

atalman approved these changes Jun 11, 2026

atalman added this to the 2.13.0 milestone Jun 11, 2026

fegin approved these changes Jun 11, 2026

This was referenced Jun 11, 2026

[distributed] Fix max_seqlen mismatch in ring attention backward #185493 (Open)

remove spmd_types dependency from nightly wheel #187015 (Open)

malfet added the ciflow/trunk (Trigger trunk jobs on your pull request) label Jun 11, 2026

pytorchmergebot added the merging label Jun 11, 2026

pytorchmergebot closed this in d4c98cd Jun 11, 2026

pytorchmergebot added Merged and removed merging labels Jun 11, 2026

pytorchmergebot added the Reverted and ci-no-td (Do not run TD on this PR) labels Jun 12, 2026

pytorchmergebot reopened this Jun 12, 2026

Update

00509d5

[ghstack-poisoned]

malfet added a commit that referenced this pull request Jun 12, 2026

[BE] Make spmd_type a CI rather than CD dependency

0fdc88b

ghstack-source-id: 756af3e Pull-Request: #187067

Update

68013e4

[ghstack-poisoned]

malfet added a commit that referenced this pull request Jun 12, 2026

[BE] Make spmd_type a CI rather than CD dependency

201a8d6

ghstack-source-id: 877122e Pull-Request: #187067

malfet wants to merge 3 commits into gh/malfet/949/base from gh/malfet/949/head

6 participants
