Upgrade NCCL to 2.29.3 #174338

Closed
kwen2501 wants to merge 2 commits into gh/kwen2501/316/base from gh/kwen2501/316/head
kwen2501 (Collaborator) commented Feb 4, 2026

Stack from ghstack (oldest at bottom):

2.29.3 is a patch release that fixes the regression (hang) in 2.29.2. We didn't upgrade to 2.29.2 last time.
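
To see which NCCL release a given PyTorch build carries, something like the following may help. It is a minimal sketch: `has_hang_regression` and `installed_nccl_version` are hypothetical helper names, the only fact encoded is the one stated above (the hang is in 2.29.2), and `torch.cuda.nccl.version()` is assumed to return a version tuple as it does in recent PyTorch builds.

```python
from typing import Optional, Tuple

# Per the PR description, NCCL 2.29.2 is the release with the hang regression.
AFFECTED_VERSION = (2, 29, 2)


def has_hang_regression(version: Tuple[int, ...]) -> bool:
    """Hypothetical helper: True only for the release the PR names as affected."""
    return tuple(version[:3]) == AFFECTED_VERSION


def installed_nccl_version() -> Optional[Tuple[int, ...]]:
    """Return the NCCL version reported by PyTorch, or None if unavailable."""
    try:
        import torch

        return tuple(torch.cuda.nccl.version())
    except Exception:
        # torch is not installed, or this build has no NCCL support.
        return None


if __name__ == "__main__":
    version = installed_nccl_version()
    if version is None:
        print("NCCL version unavailable")
    elif has_hang_regression(version):
        print(version, "is the release named as affected")
    else:
        print(version, "is not the release named as affected")
```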

[ghstack-poisoned]
@kwen2501 kwen2501 requested review from a team and jeffdaily as code owners February 4, 2026 23:32
pytorch-bot bot commented Feb 4, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/174338

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 1 Unrelated Failure

As of commit f0df89d with merge base 460a3f6:

NEW FAILURE - The following job has failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

kwen2501 added a commit that referenced this pull request Feb 4, 2026
ghstack-source-id: a248b09
Pull-Request: #174338
malfet (Contributor) left a comment

If CI is green...

[ghstack-poisoned]
kwen2501 added a commit that referenced this pull request Feb 4, 2026
ghstack-source-id: 9bc5a22
Pull-Request: #174338
kwen2501 (Collaborator, Author) commented Feb 5, 2026

The CI error happens during Python 3.12 setup:

2026-02-04T23:57:35.9370116Z #11 162.8 Setting up python3.12-minimal (3.12.3-1ubuntu0.10) ...
2026-02-04T23:57:35.9370522Z #11 163.4 Segmentation fault (core dumped)
2026-02-04T23:57:35.9371075Z #11 163.4 dpkg: error processing package python3.12-minimal (--configure):
2026-02-04T23:57:35.9371888Z #11 163.4  installed python3.12-minimal package post-installation script subprocess returned error exit status 139
2026-02-04T23:57:35.9372564Z #11 163.5 Errors were encountered while processing:
2026-02-04T23:57:35.9372938Z #11 163.5  python3.12-minimal
2026-02-04T23:57:35.9373297Z #11 163.5 E: Sub-process /usr/bin/dpkg returned an error code (1)

I consider this unrelated to the NCCL upgrade.
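
For readers unfamiliar with the exit status in the log: 139 follows the common shell convention of 128 plus the signal number, which matches the "Segmentation fault" line. The sketch below decodes it; the function names are hypothetical and the sample line is copied from the log above.

```python
import re
import signal
from typing import Optional

# Sample line copied from the CI log above.
LOG_LINE = (
    "2026-02-04T23:57:35.9371888Z #11 163.4  installed python3.12-minimal "
    "package post-installation script subprocess returned error exit status 139"
)


def signal_from_exit_status(status: int) -> Optional[str]:
    """Return the signal name if status follows the 128+N shell convention."""
    if status <= 128:
        return None
    try:
        return signal.Signals(status - 128).name
    except ValueError:
        # The offset does not correspond to a known signal on this platform.
        return None


def decode_dpkg_failure(line: str) -> Optional[str]:
    """Extract 'exit status N' from a dpkg log line and decode the signal."""
    match = re.search(r"exit status (\d+)", line)
    if match is None:
        return None
    return signal_from_exit_status(int(match.group(1)))


if __name__ == "__main__":
    print(decode_dpkg_failure(LOG_LINE))
```

A crash in the package's post-installation script during image setup happens before any NCCL code is built or run, which is consistent with treating the failure as unrelated.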

kwen2501 (Collaborator, Author) commented Feb 5, 2026

@pytorchbot merge -f "Failure is python setup, unrelated"

pytorchmergebot (Collaborator)

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: Check the merge workflow status here

huydhn (Contributor) commented Feb 12, 2026

@pytorchbot revert -m 'This change seems to break vLLM benchmark on a couple of models' -c nosignal

https://github.com/pytorch/pytorch/actions/runs/21867580232/job/63123532634#step:15:36062

  • openai/gpt-oss-120b
  • and mistralai/mixtral-8x7b-instruct-v0.1

pytorchmergebot (Collaborator)

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot (Collaborator)

Reverting PR 174338 failed

Reason: Command git -C /home/runner/work/pytorch/pytorch revert --no-edit 881aff4654c9d8521317aa755e8e554b5fefc5d8 returned non-zero exit code 1

Auto-merging .github/scripts/generate_binary_build_matrix.py
CONFLICT (content): Merge conflict in .github/scripts/generate_binary_build_matrix.py
Auto-merging .github/workflows/generated-linux-aarch64-binary-manywheel-nightly.yml
CONFLICT (content): Merge conflict in .github/workflows/generated-linux-aarch64-binary-manywheel-nightly.yml
Auto-merging .github/workflows/generated-linux-binary-manywheel-nightly.yml
CONFLICT (content): Merge conflict in .github/workflows/generated-linux-binary-manywheel-nightly.yml
error: could not revert 881aff4654c... Upgrade NCCL to 2.29.3 (#174338)
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git revert --continue".
hint: You can instead skip this commit with "git revert --skip".
hint: To abort and get back to the state before "git revert",
hint: run "git revert --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
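
When preparing a manual revert, it can help to pull the conflicted paths out of output like the above. This is a minimal sketch, not part of any PyTorch tooling: `conflicted_paths` is a hypothetical helper, and it only matches "Merge conflict in" lines of the shape shown in this log.

```python
import re
from typing import List

# Excerpt copied from the failed revert output above.
REVERT_OUTPUT = """\
Auto-merging .github/scripts/generate_binary_build_matrix.py
CONFLICT (content): Merge conflict in .github/scripts/generate_binary_build_matrix.py
Auto-merging .github/workflows/generated-linux-aarch64-binary-manywheel-nightly.yml
CONFLICT (content): Merge conflict in .github/workflows/generated-linux-aarch64-binary-manywheel-nightly.yml
Auto-merging .github/workflows/generated-linux-binary-manywheel-nightly.yml
CONFLICT (content): Merge conflict in .github/workflows/generated-linux-binary-manywheel-nightly.yml
error: could not revert 881aff4654c... Upgrade NCCL to 2.29.3 (#174338)
"""

# Matches only lines of the form "CONFLICT (<kind>): Merge conflict in <path>".
CONFLICT_LINE = re.compile(
    r"^CONFLICT \([^)]*\): Merge conflict in (.+)$", re.MULTILINE
)


def conflicted_paths(git_output: str) -> List[str]:
    """Return the paths git reported as conflicted, in order of appearance."""
    return CONFLICT_LINE.findall(git_output)


if __name__ == "__main__":
    for path in conflicted_paths(REVERT_OUTPUT):
        print(path)
```

All three conflicted files are under `.github/`: the build matrix script and two generated nightly workflow files.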

huydhn (Contributor) commented Feb 12, 2026

cc @atalman: I will revert this manually, since the automatic revert conflicts with #174310 and #174390.

huydhn added a commit that referenced this pull request Feb 12, 2026
Signed-off-by: Huy Do <huydhn@gmail.com>
pytorchmergebot pushed a commit that referenced this pull request Feb 12, 2026
#174338 (comment) causes an issue with vLLM benchmark for some models, so I need to revert it.
Pull Request resolved: #174838
Approved by: https://github.com/atalman
radeksm pushed a commit to radeksm/pytorch that referenced this pull request Feb 20, 2026
2.29.3 is a patch release that fixes the regression (hang) in 2.29.2. We didn't upgrade to 2.29.2 last time.
Pull Request resolved: pytorch#174338
Approved by: https://github.com/malfet
pytorchmergebot pushed a commit that referenced this pull request Mar 5, 2026
Trying to reland after original PR's revert #174338
Pull Request resolved: #176424
Approved by: https://github.com/Skylion007, https://github.com/huydhn
@github-actions github-actions bot deleted the gh/kwen2501/316/head branch March 14, 2026 02:22