
[BE] Make spmd_type a CI rather than CD dependency#187067

Open
malfet wants to merge 3 commits into gh/malfet/949/base from gh/malfet/949/head


@malfet malfet commented Jun 11, 2026

Contributor

[ghstack-poisoned]
@malfet malfet requested a review from a team as a code owner June 11, 2026 18:23

pytorch-bot Bot commented Jun 11, 2026


🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/187067

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 68 Pending

As of commit 68013e4 with merge base 428e02e:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot Bot added the ciflow/docker and release notes: releng (release notes category) labels Jun 11, 2026

pianpwk commented Jun 11, 2026

Contributor

#187015 should be landing for binary populate env

@atalman atalman left a comment

Collaborator


lgtm

@atalman atalman added this to the 2.13.0 milestone Jun 11, 2026

malfet commented Jun 11, 2026

Contributor Author

@pytorchbot merge -f "Let's test in prod, I don't understand the signal anymore"

@pytorchmergebot

Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here


huydhn commented Jun 12, 2026

Contributor

@pytorchbot drci

@pytorch-auto-revert


@pytorchbot revert -m "Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable" -c autorevert

This PR is attributed to have caused regression in:

Note: This revert was accelerated by the AI advisor: AI advisor: revert (confidence=0.95); AI advisor: revert (confidence=0.90); AI advisor: revert (confidence=0.90); AI advisor: revert (confidence=0.92); AI advisor: revert (confidence=0.92); AI advisor: revert (confidence=0.90)

Please investigate and fix the issues.
@claude Can you please read this revert comment, follow the links and read the errors, to then give a brief diagnostics on the cause of the error? If you judge the error to be legitimate reason for a revert, please provide brief guidance on how the author could fix it.


claude Bot commented Jun 12, 2026


Claude finished @pytorch-auto-revert[bot]'s task in 1m 59s


Revert diagnostics

  • Read the revert comment and PR changes
  • Followed all six cited build-osdc job links and read the logs
  • Diagnosed root cause
  • Judged legitimacy + fix guidance

What the failing jobs actually show

The cited jobs (inductor-cpu-build, inductor-build, inductor-cpu-core-build, verify-cachebench-cpu-build, all the build-osdc variants) fail with the same error, apart from the one exception noted below, and it is not a compile/build error:

Pod ...-workflow: phase=Pending, waiting=[job: ImagePullBackOff] / ErrImagePull  (7207s/7200s)
[runner-container-hooks] FATAL: Error: pod failed to come online ... Pending: backoff timeout
##[error]Process completed with exit code 1.

The runner pod sat in ErrImagePull/ImagePullBackOff for the full 2-hour timeout and never even started the build. The image it was trying to pull:

.../pytorch/ci-image:pytorch-linux-jammy-cuda13.0-cudnn9-py3-gcc11-inductor-benchmarks-<hash>

So the jobs failed because the inductor-benchmarks CI Docker image was never successfully built/pushed to ECR — the consumers had nothing to pull.

(Note: the unit-test / inductor-cpu-build / build-osdc job in the list is a separate transient infra failure — its pod failed with a plain backoff timeout with no ErrImagePull. The real, reproducible signal is the ImagePullBackOff on the benchmarks image.)
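The distinction drawn above, between an image-pull failure and a plain backoff timeout, can be checked mechanically. The following is a minimal sketch, assuming the only inputs are log excerpts like the ones quoted above; `classify_pod_failure` and its return labels are hypothetical names for illustration, not part of any PyTorch CI tooling.

```python
# Hypothetical triage helper. The marker strings are taken from the log
# excerpt quoted above; the function name and labels are illustrative only.
IMAGE_PULL_MARKERS = ("ErrImagePull", "ImagePullBackOff")


def classify_pod_failure(log_text: str) -> str:
    """Return 'image-pull', 'backoff-timeout', or 'other' for a runner log excerpt."""
    # An image-pull marker is the stronger signal, so it is checked first:
    # the ImagePullBackOff pods also end in a backoff timeout.
    if any(marker in log_text for marker in IMAGE_PULL_MARKERS):
        return "image-pull"
    if "backoff timeout" in log_text:
        return "backoff-timeout"
    return "other"
```

Checking the image-pull markers first matters because the pods stuck in ImagePullBackOff also report a backoff timeout, so the order of the two checks decides which bucket they land in.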

Why this PR caused it

This PR adds spmd-types==0.2.1 to .ci/docker/requirements-ci.txt. Per CLAUDE.md, any change under .ci/docker/ rebuilds the CI Docker images, so this PR forces a rebuild of the inductor-benchmarks image. That rebuild is broken because spmd-types drags torch into the base conda env (it declares torch>=2.10.0), which collides with the explicit cu128/cu130 torch install in install_inductor_benchmark_deps.sh:48. This is exactly the torch-version-mismatch chain huydhn documented in this thread:

spmd-types==0.2.1 (torch>=2.10.0)
  -> install_conda.sh pre-installs torch (newest dev build) from PyPI
    -> cu12x/cu13x install sees torch "already satisfied" (wrong index)
      -> resolver backtracks -> incoherent torch/vision/audio trio

The image build doesn't produce a valid image -> ECR has no image -> every downstream build-osdc pod hits ImagePullBackOff.
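The first link in the chain above (a new requirement quietly declaring torch as a dependency) is something that could be checked before a package is added to requirements-ci.txt. The sketch below is one possible way to do that, assuming the package is already installed in the environment being inspected; the helper names are hypothetical, and only `importlib.metadata.requires` is a real standard-library call.

```python
import re
from importlib import metadata


def requirement_name(requirement: str) -> str:
    """Extract the normalized project name from a string such as 'torch>=2.10.0'."""
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", requirement)
    if match is None:
        raise ValueError(f"not a requirement string: {requirement!r}")
    # Name normalization: runs of '-', '_', '.' compare equal, case-insensitively.
    return re.sub(r"[-_.]+", "-", match.group(1)).lower()


def pulls_in(requirements, project: str) -> bool:
    """True if any declared requirement names `project` (markers and extras are ignored)."""
    wanted = requirement_name(project)
    return any(requirement_name(r) == wanted for r in requirements or [])


def installed_package_pulls_in(package: str, project: str) -> bool:
    """Check an installed distribution; metadata.requires() returns None if it declares nothing."""
    return pulls_in(metadata.requires(package), project)
```

For example, `pulls_in(["torch>=2.10.0"], "torch")` is true for the requirement string quoted in this thread, while a `torchvision` pin alone would not match. The check is deliberately conservative: it also flags requirements that only apply under an extra or environment marker.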

Is the revert legitimate?

Yes. The failures are caused by this PR (it triggers the broken inductor-benchmarks image rebuild), they are reproducible across trunk and PR workflows, and the merge was done with -f ("test in prod"), bypassing the signal that would have caught this. Reverting unblocks trunk.

Fix guidance for the author

The PR's intent is sound (moving spmd_types from a CD/binary_populate_env.sh dependency to a CI dependency). The problem is how it's added to requirements-ci.txt. Recommended fixes, in order of preference:

  1. Stop spmd-types from pulling in torch. Install it without its deps so it can't poison the base env, e.g. pip install --no-deps spmd-types==0.2.1 in the relevant install step rather than letting the resolver pull torch>=2.10.0. spmd-types is described as type stubs, so --no-deps should be safe.
  2. Harden install_inductor_benchmark_deps.sh:48 to be order-independent: pip uninstall -y torch torchvision torchaudio first, or add --force-reinstall, so the cu128/cu130 trio always re-resolves cleanly regardless of what's already in the env. This also protects against future regressions of the same shape.
  3. After fixing, do not -f merge — let the docker-builds / inductor-benchmarks image job run so the rebuild is validated before landing.
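Options 1 and 2 above could be combined roughly as follows. This is a sketch of the command sequence only, not the contents of install_conda.sh or install_inductor_benchmark_deps.sh; the helper names are hypothetical, the index URL is the cu128 one used in the repro in this thread, and the pip flags (`--no-deps`, `uninstall -y`, `--index-url`) are standard pip options.

```python
import sys

# cu128 index URL as used in the repro in this thread.
CU128_INDEX = "https://download.pytorch.org/whl/cu128"
TRIO = ["torch", "torchvision", "torchaudio"]


def pip(*args: str) -> list[str]:
    """Build a pip invocation bound to the current interpreter."""
    return [sys.executable, "-m", "pip", *args]


def fix_commands(index_url: str = CU128_INDEX) -> list[list[str]]:
    """Hypothetical ordering: no-deps install, clean slate, then the pinned-index trio."""
    return [
        # Option 1: keep spmd-types from pulling torch into the base env.
        pip("install", "--no-deps", "spmd-types==0.2.1"),
        # Option 2: remove any pre-existing trio so nothing is "already satisfied".
        pip("uninstall", "-y", *TRIO),
        # Re-resolve torch/vision/audio together from the intended index.
        pip("install", *TRIO, "--index-url", index_url),
    ]
```

Each entry can be passed to `subprocess.run(cmd, check=True)`. Building the argument lists separately from running them keeps the sequence easy to inspect or log before anything touches the environment.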


Branch: claude/pr-187067-20260612-0101

@pytorchmergebot

Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request Jun 12, 2026
This reverts commit d4c98cd.

Reverted #187067 on behalf of https://github.com/pytorch-auto-revert due to: Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable (see the comment on #187067)
@pytorchmergebot

Collaborator

@malfet your PR has been successfully reverted.

@pytorchmergebot pytorchmergebot added the Reverted and ci-no-td (Do not run TD on this PR) labels Jun 12, 2026

huydhn commented Jun 12, 2026

Contributor

Heads up: this PR causes a torch version skew in the inductor-benchmarks image build -- it installs torch-2.10.0 with torchaudio-2.11.0 / torchvision-0.25.0 instead of the matching torch-2.11.0 trio.

Cause: the new spmd-types==0.2.1 dep (torch>=2.10.0) makes install_conda.sh pre-install torch 2.12.0 from PyPI. Then install_inductor_benchmark_deps.sh:48 installs from the cu128 index (max 2.11.0); since torch 2.12.0 is "already satisfied" but unavailable there, pip backtracks torch+torchvision to 2.10.0 / 0.25.0. torchaudio 2.11.0 stays put because it no longer pins torch (2.10.0 / 2.9.1 do; 2.11.0 dropped the pin), so torchvision drives the downgrade and the skew goes unnoticed.

Why it lands on 2.10.0 / 0.25.0 instead of the 2.11.0 / 0.26.0 pair (theory): pip does try torch 2.11.0 + torchvision 0.26.0 first, then rejects it -- and I think the pivot is cupti, not torch itself. torch 2.11.0+cu128 pulls its CUDA deps via the cuda-toolkit==12.8.1 metapackage, whose cupti extra needs the unsuffixed nvidia-cuda-cupti==12.8.x. But requirements-ci.txt:410 directly pins cupti-python>13.0, which requires nvidia-cuda-cupti~=13.0. Those collide on the same package, so the 2.11.0 branch is unsatisfiable while keeping cupti-python. torch 2.10.0+cu128 instead pins the cu12-suffixed nvidia-cuda-cupti-cu12==12.8.90 (a different package cupti-python doesn't constrain), so pip backtracks to it cleanly.
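The cupti collision in the theory above comes down to two version constraints on the same package that no single version can satisfy. The sketch below illustrates that with hand-rolled checks for the two specifiers named in the comment (`==12.8.x` and `~=13.0`); the candidate versions are illustrative, not a real index listing, and the helper names are hypothetical.

```python
def parse(version: str) -> tuple[int, ...]:
    """Split a plain dotted release such as '12.8.90' into integer components."""
    return tuple(int(part) for part in version.split("."))


def satisfies_12_8_x(version: str) -> bool:
    """'==12.8.*': the release must start with 12.8."""
    return parse(version)[:2] == (12, 8)


def satisfies_compatible_13_0(version: str) -> bool:
    """'~=13.0' is shorthand for '>=13.0, ==13.*', so any 13.x release qualifies."""
    return parse(version)[0] == 13


# Illustrative candidates only; these are assumptions, not a real package index listing.
candidates = ["12.8.90", "12.9.0", "13.0.0", "13.1.2"]
satisfying_both = [
    v for v in candidates if satisfies_12_8_x(v) and satisfies_compatible_13_0(v)
]
```

`satisfying_both` comes out empty: one constraint fixes the major version at 12 and the other at 13, which is why a resolver would have to drop one side of the pair, consistent with the backtracking described above.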

Fix: uninstall torch/vision/audio (or --force-reinstall) before the cu128 install so the trio re-resolves cleanly; or keep spmd-types from pulling torch into the base env.

(Comment generated with the help of Claude.)


huydhn commented Jun 12, 2026

Contributor

Reproducible snippet on an empty venv

python --version
Python 3.10.20

pip install spmd-types==0.2.1
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128 --no-cache-dir
# Notice the mismatch torchaudio-2.11.0+cu128 and torch-2.10.0+cu128

python -c 'import torchaudio'
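The mismatch in the snippet above could also be caught without importing anything, by comparing the installed version strings. This is a minimal sketch, assuming that torch and torchaudio are expected to share a major.minor release series (as the "matching torch-2.11.0 trio" wording in this thread suggests); the function names are hypothetical.

```python
def release_series(version: str) -> tuple[int, int]:
    """Return (major, minor) from a version such as '2.10.0+cu128'."""
    public = version.split("+", 1)[0]  # drop a local tag like '+cu128'
    major, minor = public.split(".")[:2]
    return int(major), int(minor)


def same_release_series(torch_version: str, torchaudio_version: str) -> bool:
    """True when both packages come from the same major.minor series (assumed pairing rule)."""
    return release_series(torch_version) == release_series(torchaudio_version)
```

In an image build, the inputs could come from `importlib.metadata.version("torch")` and `importlib.metadata.version("torchaudio")`, with the build failing when `same_release_series` returns false. For the pair in the repro, `same_release_series("2.10.0+cu128", "2.11.0+cu128")` is false.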

[ghstack-poisoned]
malfet added a commit that referenced this pull request Jun 12, 2026
[ghstack-poisoned]
malfet added a commit that referenced this pull request Jun 12, 2026

Labels

ci-no-td (Do not run TD on this PR), ciflow/docker, ciflow/trunk (Trigger trunk jobs on your pull request), Merged, release notes: releng (release notes category), Reverted


6 participants