[BE]: Update NCCL to 2.27.5 #157108

Closed
Skylion007 wants to merge 1 commit into pytorch:main from Skylion007:skylion007/update-nccl-2-27-5

@Skylion007
Collaborator

Update NCCL to 2.27.5. This is a minor version bump that improves Blackwell support and Symmem FP8 support, and fixes a bug with MNNVL.

@Skylion007 requested review from atalman, eqy and nWEIdia June 27, 2025 15:35
@Skylion007 requested review from a team and jeffdaily as code owners June 27, 2025 15:35
@pytorch-bot

pytorch-bot bot commented Jun 27, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/157108

Note: Links to docs will display an error until the docs builds have been completed.

❌ 10 New Failures, 1 Cancelled Job, 1 Pending, 7 Unrelated Failures

As of commit 6659c69 with merge base d26ca5d (image):

NEW FAILURES - The following jobs have failed:

CANCELLED JOB - The following job was cancelled. Please retry:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@Skylion007
Collaborator Author

@atalman Will need new binaries.

@Skylion007 requested a review from albanD June 29, 2025 14:04
@atalman added labels Jul 2, 2025:
    ciflow/binaries (Trigger all binary build and upload jobs on the PR)
    ciflow/trunk (Trigger trunk jobs on your pull request)
    ciflow/periodic (Trigger jobs ran periodically on master (periodic.yml) on the PR)
    ciflow/inductor-periodic
    keep-going (Don't stop on first failure, keep running tests until the end)
    ci-no-td (Do not run TD on this PR)
@Skylion007
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Collaborator

Merge failed

Reason: 2 jobs have failed, first few of them are: inductor-rocm / rocm-py3.10-inductor / test (inductor, 2, 2, linux.rocm.gpu.2), inductor-rocm / rocm-py3.10-inductor / test (inductor, 1, 2, linux.rocm.gpu.2)

Details for Dev Infra team Raised by workflow job

@albanD removed their request for review July 2, 2025 16:44
@Skylion007
Collaborator Author

@pytorchbot -i "unrelated failures"

@pytorch-bot

pytorch-bot bot commented Jul 4, 2025

❌ 🤖 pytorchbot command failed:

@pytorchbot: error: argument command: invalid choice: 'unrelated failures' (choose from 'merge', 'revert', 'rebase', 'label', 'drci', 'cherry-pick')

usage: @pytorchbot [-h] {merge,revert,rebase,label,drci,cherry-pick} ...

Try @pytorchbot --help for more info.

@Skylion007
Collaborator Author

@pytorchbot merge -i "unrelated rocm failures"

@pytorch-bot

pytorch-bot bot commented Jul 4, 2025

❌ 🤖 pytorchbot command failed:

@pytorchbot: error: unrecognized arguments: unrelated rocm failures

usage: @pytorchbot [-h] {merge,revert,rebase,label,drci,cherry-pick} ...

Try @pytorchbot --help for more info.

@Skylion007 force-pushed the skylion007/update-nccl-2-27-5 branch from 1fa19e7 to 0cabbb4 July 4, 2025 13:55
@Skylion007
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: linux-binary-libtorch / libtorch-rocm6_3-shared-with-deps-release-build / build

Details for Dev Infra team Raised by workflow job

@Skylion007
Collaborator Author

@pytorchbot merge -i "unrelated"

@pytorch-bot

pytorch-bot bot commented Jul 4, 2025

❌ 🤖 pytorchbot command failed:

@pytorchbot: error: unrecognized arguments: unrelated

usage: @pytorchbot [-h] {merge,revert,rebase,label,drci,cherry-pick} ...

Try @pytorchbot --help for more info.

@Skylion007
Collaborator Author

@pytorchbot merge -r

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased skylion007/update-nccl-2-27-5 onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout skylion007/update-nccl-2-27-5 && git pull --rebase)

@pytorchmergebot force-pushed the skylion007/update-nccl-2-27-5 branch from 0cabbb4 to 6659c69 July 7, 2025 13:29
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

@pytorchmergebot
Collaborator

@Skylion007
Collaborator Author

@pytorchbot merge -f "inductor periodic is unrelated"

@Skylion007 added the better-engineering (Relatively self-contained tasks for better engineering contributors) label Jul 8, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

@huydhn
Contributor

huydhn commented Jul 16, 2025

@pytorchbot --help

@pytorch-bot

pytorch-bot bot commented Jul 16, 2025

PyTorchBot Help

usage: @pytorchbot [-h] {merge,revert,rebase,label,drci,cherry-pick} ...

In order to invoke the bot on your PR, include a line that starts with
@pytorchbot anywhere in a comment. That line will form the command; no
multi-line commands are allowed. Some commands may be used on issues as specified below.

Example:
    Some extra context, blah blah, wow this PR looks awesome

    @pytorchbot merge

optional arguments:
  -h, --help            Show this help message and exit.

command:
  {merge,revert,rebase,label,drci,cherry-pick}
    merge               Merge a PR
    revert              Revert a PR
    rebase              Rebase a PR
    label               Add label to a PR
    drci                Update Dr. CI
    cherry-pick         Cherry pick a PR onto a release branch

Merge

usage: @pytorchbot merge [-f MESSAGE | -i] [-ic] [-r [{viable/strict,main}]]

Merge an accepted PR, subject to the rules in .github/merge_rules.json.
By default, this will wait for all required checks (lint, pull) to succeed before merging.

optional arguments:
  -f MESSAGE, --force MESSAGE
                        Merge without checking anything. This requires a reason for auditing purposes, for example:
                        @pytorchbot merge -f 'Minor update to fix lint. Expecting all PR tests to pass'
                        
                        Please use `-f` as last resort, prefer `--ignore-current` to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.
  -i, --ignore-current  Merge while ignoring the currently failing jobs.  Behaves like -f if there are no pending jobs.
  -ic                   Old flag for --ignore-current. Deprecated in favor of -i.
  -r [{viable/strict,main}], --rebase [{viable/strict,main}]
                        Rebase the PR to re run checks before merging.  Accepts viable/strict or main as branch options and will default to viable/strict if not specified.
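
For readers puzzling over why `merge -i "unrelated"` was rejected earlier in this thread: the usage line above shows `-i` as a bare flag, while only `-f` takes a MESSAGE. The following is a minimal sketch, assuming the bot's parser behaves like a plain Python `argparse` parser built from that usage line. It is not the actual pytorchbot source, and `NoExitParser`, `build_merge_parser`, and `try_parse` are hypothetical names used only for illustration.

```python
import argparse


class NoExitParser(argparse.ArgumentParser):
    """ArgumentParser that raises ValueError instead of exiting the process."""

    def error(self, message):
        raise ValueError(message)


def build_merge_parser() -> argparse.ArgumentParser:
    # Hypothetical reconstruction from the usage line:
    #   merge [-f MESSAGE | -i] [-ic] [-r [{viable/strict,main}]]
    parser = NoExitParser(prog="@pytorchbot merge", add_help=False)
    group = parser.add_mutually_exclusive_group()
    group.add_argument("-f", "--force", metavar="MESSAGE")  # takes a reason string
    group.add_argument("-i", "--ignore-current", action="store_true")  # bare flag
    parser.add_argument("-ic", action="store_true")  # deprecated alias per the help text
    parser.add_argument(
        "-r",
        "--rebase",
        nargs="?",              # the branch is optional
        const="viable/strict",  # value used when -r is given without a branch
        choices=["viable/strict", "main"],
    )
    return parser


def try_parse(argv):
    """Return the parsed options as a dict, or an 'error: ...' string."""
    try:
        return vars(build_merge_parser().parse_args(argv))
    except ValueError as exc:
        return f"error: {exc}"


if __name__ == "__main__":
    print(try_parse(["-i", "unrelated"]))  # the string is a stray positional
    print(try_parse(["-i"]))               # accepted: -i carries no message
    print(try_parse(["-f", "inductor periodic is unrelated"]))
    print(try_parse(["-r"]))               # rebase falls back to viable/strict
```

Under this assumption, a quoted reason after `-i` is left over as an unrecognized argument, which matches the error text the bot printed; a reason string is only accepted after `-f`.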

Revert

usage: @pytorchbot revert -m MESSAGE -c
                          {nosignal,ignoredsignal,landrace,weird,ghfirst}

Revert a merged PR. This requires that you are a Meta employee.

Example:
  @pytorchbot revert -m="This is breaking tests on trunk. hud.pytorch.org/" -c=nosignal

optional arguments:
  -m MESSAGE, --message MESSAGE
                        The reason you are reverting, will be put in the commit message. Must be longer than 3 words.
  -c {nosignal,ignoredsignal,landrace,weird,ghfirst}, --classification {nosignal,ignoredsignal,landrace,weird,ghfirst}
                        A machine-friendly classification of the revert reason.
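
The two constraints stated above (a message longer than 3 words, and a classification from a fixed set) can be checked before posting a comment. This is a minimal sketch under stated assumptions: `build_revert_command` is a hypothetical helper, the word count is assumed to be a simple whitespace split, and the bot itself may validate differently.

```python
# Classification values copied from the help text above.
REVERT_CLASSIFICATIONS = ("nosignal", "ignoredsignal", "landrace", "weird", "ghfirst")


def build_revert_command(message: str, classification: str) -> str:
    """Compose a revert comment line in the form shown in the help example."""
    # Assumption: "longer than 3 words" means more than 3 whitespace-separated tokens.
    if len(message.split()) <= 3:
        raise ValueError("revert message must be longer than 3 words")
    if classification not in REVERT_CLASSIFICATIONS:
        raise ValueError(f"unknown classification: {classification!r}")
    if '"' in message or "\n" in message:
        # The command must fit on one line and is wrapped in double quotes here.
        raise ValueError("message must be a single line without double quotes")
    return f'@pytorchbot revert -m="{message}" -c={classification}'


if __name__ == "__main__":
    print(build_revert_command(
        "This is breaking tests on trunk. hud.pytorch.org/", "nosignal"
    ))
```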

Rebase

usage: @pytorchbot rebase [-s | -b BRANCH]

Rebase a PR. Rebasing defaults to the stable viable/strict branch of pytorch.
Repeat contributors may use this command to rebase their PR.

optional arguments:
  -s, --stable          [DEPRECATED] Rebase onto viable/strict
  -b BRANCH, --branch BRANCH
                        Branch you would like to rebase to

Label

usage: @pytorchbot label labels [labels ...]

Adds label to a PR or Issue [Can be used on Issues]

positional arguments:
  labels  Labels to add to given Pull Request or Issue [Can be used on Issues]

Dr CI

usage: @pytorchbot drci 

Update Dr. CI. Updates the Dr. CI comment on the PR in case it's gotten out of sync with actual CI results.

cherry-pick

usage: @pytorchbot cherry-pick --onto ONTO [--fixes FIXES] -c
                               {regression,critical,fixnewfeature,docs,release}

Cherry pick a pull request onto a release branch for inclusion in a release

optional arguments:
  --onto ONTO, --into ONTO
                        Branch you would like to cherry pick onto (Example: release/2.1)
  --fixes FIXES         Link to the issue that your PR fixes (Example: https://github.com/pytorch/pytorch/issues/110666)
  -c {regression,critical,fixnewfeature,docs,release}, --classification {regression,critical,fixnewfeature,docs,release}
                        A machine-friendly classification of the cherry-pick reason.

@huydhn
Contributor

huydhn commented Jul 16, 2025

@pytorchbot cherry-pick --onto release/2.8 --fixes https://github.com/vllm-project/vllm/pull/20358/files#r2208979278 -c regression

@pytorchbot
Collaborator

Cherry picking #157108

Command git -C /home/runner/work/pytorch/pytorch cherry-pick -x 476874b37fff42a46d25dfac720ef4c71ec74fe0 returned non-zero exit code 1

Auto-merging .github/scripts/generate_binary_build_matrix.py
CONFLICT (content): Merge conflict in .github/scripts/generate_binary_build_matrix.py
Auto-merging .github/workflows/generated-linux-aarch64-binary-manywheel-nightly.yml
CONFLICT (content): Merge conflict in .github/workflows/generated-linux-aarch64-binary-manywheel-nightly.yml
Auto-merging .github/workflows/generated-linux-binary-manywheel-main.yml
CONFLICT (content): Merge conflict in .github/workflows/generated-linux-binary-manywheel-main.yml
Auto-merging .github/workflows/generated-linux-binary-manywheel-nightly.yml
CONFLICT (content): Merge conflict in .github/workflows/generated-linux-binary-manywheel-nightly.yml
error: could not apply 476874b37ff... [BE]: Update NCCL to 2.27.5 (#157108)
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git cherry-pick --continue".
hint: You can instead skip this commit with "git cherry-pick --skip".
hint: To abort and get back to the state before "git cherry-pick",
hint: run "git cherry-pick --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Details for Dev Infra team Raised by workflow job
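
The conflicted files in the output above can be pulled out mechanically before resolving them by hand. This is a minimal sketch, assuming the `CONFLICT (...): Merge conflict in <path>` line format shown in this log; `conflicted_paths` is a hypothetical helper and not part of the bot.

```python
import re

# Matches lines such as:
#   CONFLICT (content): Merge conflict in path/to/file
CONFLICT_RE = re.compile(r"^CONFLICT \((?P<kind>[^)]+)\): Merge conflict in (?P<path>.+)$")


def conflicted_paths(git_output: str) -> list[str]:
    """Return the paths git reported as conflicted, in order of appearance."""
    paths = []
    for line in git_output.splitlines():
        match = CONFLICT_RE.match(line.strip())
        if match:
            paths.append(match.group("path"))
    return paths


if __name__ == "__main__":
    # Two line pairs copied from the cherry-pick log above.
    sample = """\
Auto-merging .github/scripts/generate_binary_build_matrix.py
CONFLICT (content): Merge conflict in .github/scripts/generate_binary_build_matrix.py
Auto-merging .github/workflows/generated-linux-binary-manywheel-main.yml
CONFLICT (content): Merge conflict in .github/workflows/generated-linux-binary-manywheel-main.yml
"""
    for path in conflicted_paths(sample):
        print(path)
```

In this log, all four conflicts are in the binary build matrix script and workflow files under `.github/workflows/` whose names start with `generated-`.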

Labels

    better-engineering (Relatively self-contained tasks for better engineering contributors)
    ci-no-td (Do not run TD on this PR)
    ciflow/binaries (Trigger all binary build and upload jobs on the PR)
    ciflow/inductor
    ciflow/inductor-periodic
    ciflow/periodic (Trigger jobs ran periodically on master (periodic.yml) on the PR)
    ciflow/trunk (Trigger trunk jobs on your pull request)
    keep-going (Don't stop on first failure, keep running tests until the end)
    Merged
    open source
    topic: not user facing (topic category)

5 participants