[Reland] Upgrade NCCL to 2.29.3 #176424

Closed
kwen2501 wants to merge 1 commit into main from 2.29.3

kwen2501 (Collaborator) commented Mar 4, 2026

Trying to reland after the original PR's revert (#174338).

kwen2501 requested review from a team and jeffdaily as code owners March 4, 2026 12:49

pytorch-bot (bot) commented Mar 4, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/176424

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 3 Unrelated Failures

As of commit 6d17806 with merge base ad92985 (image):

NEW FAILURE - The following job has failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot bot added the ci-no-td, ciflow/inductor, and topic: not user facing labels Mar 4, 2026
kwen2501 requested a review from huydhn March 4, 2026 12:49

kwen2501 (Collaborator, Author) commented Mar 4, 2026

@huydhn can you please rerun vLLM CI with NCCL_DEBUG=INFO? Thanks!
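
For anyone reproducing this rerun outside CI, here is a minimal sketch of passing `NCCL_DEBUG=INFO` to a child process from Python. The helper name `nccl_debug_env` is hypothetical and is not part of PyTorch or the vLLM CI; only the `NCCL_DEBUG` environment variable itself comes from NCCL.

```python
import os


def nccl_debug_env(base=None, level="INFO"):
    """Return a copy of an environment mapping with NCCL_DEBUG set.

    `base` defaults to the current process environment. The helper name is
    hypothetical; only the NCCL_DEBUG variable itself comes from NCCL.
    """
    env = dict(os.environ if base is None else base)
    env["NCCL_DEBUG"] = level
    return env


# Example: build an environment for a child process without mutating our own.
child_env = nccl_debug_env({"PATH": "/usr/bin"})
```

The resulting mapping can be passed as the `env` argument of `subprocess.run` when launching whatever benchmark command is being debugged.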

Skylion007 (Collaborator) commented

We should probably update NVSHMEM soon, too.

huydhn (Contributor) left a comment Mar 4, 2026

The mock benchmark looks good: https://github.com/pytorch/pytorch/actions/runs/22694184506/job/65806064390 . This means that whatever caused the earlier failure was on the vLLM side and has since been fixed.

huydhn (Contributor) commented Mar 5, 2026

@pytorchbot merge

pytorch-bot bot added the ciflow/trunk label Mar 5, 2026

pytorchmergebot (Collaborator) commented

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

pytorchmergebot (Collaborator) commented

Merge failed

Reason: 1 jobs have failed, first few of them are: vLLM Benchmark / Run vLLM benchmarks (linux.dgx.b200.8, openai/gpt-oss-120b) / benchmark

Details for Dev Infra team Raised by workflow job

huydhn (Contributor) commented Mar 5, 2026

The gpt-oss-120b benchmark failure comes from trunk and is not related to this change.

huydhn (Contributor) commented Mar 5, 2026

@pytorchbot merge -i

pytorchmergebot (Collaborator) commented

Merge started

Your change will be merged while ignoring the following 2 checks: inductor / inductor-cpu-test / test (cpu_inductor_torchbench, 1, 2, linux.2xlarge.amx, unstable), vLLM Benchmark / Run vLLM benchmarks (linux.dgx.b200.8, openai/gpt-oss-120b) / benchmark

kwen2501 (Collaborator, Author) commented Mar 5, 2026

Thanks a lot @huydhn

kwen2501 (Collaborator, Author) commented Mar 5, 2026

@pytorchbot cherry-pick --onto release/2.11 -c release

pytorchbot (Collaborator) commented

Cherry picking #176424

Command git -C /home/runner/work/pytorch/pytorch cherry-pick -x a2ab1624d05251ca75f57ecb967ab8aaf8ae356b returned non-zero exit code 1

Auto-merging .github/scripts/generate_binary_build_matrix.py
Auto-merging .github/workflows/generated-linux-aarch64-binary-manywheel-nightly.yml
CONFLICT (content): Merge conflict in .github/workflows/generated-linux-aarch64-binary-manywheel-nightly.yml
Auto-merging .github/workflows/generated-linux-binary-manywheel-nightly.yml
CONFLICT (content): Merge conflict in .github/workflows/generated-linux-binary-manywheel-nightly.yml
error: could not apply a2ab1624d05... [Reland] Upgrade NCCL to 2.29.3 (#176424)
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git cherry-pick --continue".
hint: You can instead skip this commit with "git cherry-pick --skip".
hint: To abort and get back to the state before "git cherry-pick",
hint: run "git cherry-pick --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Details for Dev Infra team Raised by workflow job
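
When following up on a failed cherry-pick like the one above, it can help to pull the conflicted paths out of the git output before resolving them by hand. The following is a minimal sketch that assumes the `CONFLICT (...): Merge conflict in <path>` line format shown in this log; the function name `conflicted_paths` and the `SAMPLE_LOG` constant are hypothetical and are not part of pytorchbot.

```python
import re

# Matches lines such as:
#   CONFLICT (content): Merge conflict in path/to/file.yml
CONFLICT_RE = re.compile(r"^CONFLICT \([^)]*\): Merge conflict in (?P<path>.+)$")


def conflicted_paths(log):
    """Return the paths git reported as conflicted, in order of appearance."""
    paths = []
    for line in log.splitlines():
        match = CONFLICT_RE.match(line.strip())
        if match:
            paths.append(match.group("path"))
    return paths


# Excerpt copied from the cherry-pick output above.
SAMPLE_LOG = """\
Auto-merging .github/scripts/generate_binary_build_matrix.py
Auto-merging .github/workflows/generated-linux-aarch64-binary-manywheel-nightly.yml
CONFLICT (content): Merge conflict in .github/workflows/generated-linux-aarch64-binary-manywheel-nightly.yml
Auto-merging .github/workflows/generated-linux-binary-manywheel-nightly.yml
CONFLICT (content): Merge conflict in .github/workflows/generated-linux-binary-manywheel-nightly.yml
"""

for path in conflicted_paths(SAMPLE_LOG):
    print(path)
```

Here both conflicts fall in generated workflow files, while `generate_binary_build_matrix.py` auto-merged cleanly.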

Skylion007 (Collaborator) commented

When are we updating to 2.29.7 now? :)

kwen2501 (Collaborator, Author) commented Mar 6, 2026

> When are we updating to 2.29.7 now? :)

Here is the 2.29.7 PR (which you approved, thanks!):
#176299

This 2.29.3 PR is for cherry-picking back into the 2.11 release. (The original had already been reverted by the time of the release cut.)

Labels: ci-no-td, ciflow/inductor, ciflow/trunk, Merged, open source, topic: not user facing
