Modify cuda aarch64 install for cudnn and nccl. Cleanup aarch64 cuda 12.6 docker #149540
atalman wants to merge 4 commits into pytorch:main
popd
rm -rf tmp_cusparselt
}
NCCL_VERSION=v2.26.2-1
Does this cause any issues for older CUDA versions? For the x86 build we need to use the old NCCL version for CUDA 11.8? cc @kwen2501
This is only for the CUDA aarch64 build. Currently we support only CUDA 12.8 for this build.
Has this NCCL_VERSION gone through distributed CI testing? Does CI have sufficient signal? Or would we only know once we merge this?
Oh, this is the aarch64 binary, so we don't have GH100 machines (multiple of them) in CI yet.
malfet left a comment
LGTM, but please file a follow-up issue to unify the NCCL definitions across the aarch64 and x86 builds.
revert previous commit
@pytorchmergebot merge -f "lint is green and aarch64 docker build as well"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).
@pytorchbot cherry-pick --onto release/2.7 --fixes "aarch64 cuda failures with nccl" -c critical
Cherry picking #149540
…12.6 docker (pytorch#149540)
1. Use NCCL_VERSION=v2.26.2-1. Fixes the NCCL CUDA aarch64 failure seen here: https://github.com/pytorch/pytorch/actions/runs/13955856471/job/39066681549?pr=149443 . After landing: pytorch#149351
2. TODO: Follow-up required to unify NCCL definitions across the x86 and aarch64 builds.
3. Cleanup: remove older CUDA versions for aarch64 builds. CUDA 12.6 builds were removed by: pytorch#148895
Pull Request resolved: pytorch#149540
Approved by: https://github.com/seemethere, https://github.com/malfet, https://github.com/nWEIdia
… aarch64 cuda 12.6 docker #149540 (#149624)
Modify cuda aarch64 install for cudnn and nccl. Cleanup aarch64 cuda 12.6 docker (#149540)
1. Use NCCL_VERSION=v2.26.2-1. Fixes the NCCL CUDA aarch64 failure seen here: https://github.com/pytorch/pytorch/actions/runs/13955856471/job/39066681549?pr=149443 . After landing: #149351
2. TODO: Follow-up required to unify NCCL definitions across the x86 and aarch64 builds.
3. Cleanup: remove older CUDA versions for aarch64 builds. CUDA 12.6 builds were removed by: #148895
Pull Request resolved: #149540
Approved by: https://github.com/seemethere, https://github.com/malfet, https://github.com/nWEIdia
1. Use NCCL_VERSION=v2.26.2-1. Fixes the NCCL CUDA aarch64 failure seen here: https://github.com/pytorch/pytorch/actions/runs/13955856471/job/39066681549?pr=149443 . After landing: nccl: upgrade to 2.26.2 to avoid hang on ncclCommAbort #149351
2. TODO: Follow-up required to unify NCCL definitions across the x86 and aarch64 builds.
3. Cleanup: remove older CUDA versions for aarch64 builds. CUDA 12.6 builds were removed by: Remove 12.4 x86 builds and 12.6 sbsa builds from nightly #148895
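
For the TODO above (unifying NCCL definitions across the x86 and aarch64 builds), one possible shape is to keep the pinned tag in a single shared file and have both install scripts read it. This is only a sketch under assumptions: the helper name `read_nccl_version` and the pin file are hypothetical and are not part of this PR or of the existing scripts.

```shell
# Hypothetical helper (not from this PR): read the pinned NCCL tag from one
# shared file so the x86 and aarch64 install scripts cannot drift apart.
read_nccl_version() {
  # The pin-file path is an assumption; pass whatever file holds the tag.
  pin_file="${1:?usage: read_nccl_version <pin-file>}"
  # Take the first line that is neither blank nor a comment, then strip whitespace.
  version="$(grep -v -E '^[[:space:]]*(#|$)' "${pin_file}" | head -n 1 | tr -d '[:space:]')"
  if [ -z "${version}" ]; then
    echo "no NCCL version found in ${pin_file}" >&2
    return 1
  fi
  echo "${version}"
}
```

An install script could then set `NCCL_VERSION="$(read_nccl_version nccl_version.txt)"` (the file name is again hypothetical) instead of hard-coding `NCCL_VERSION=v2.26.2-1` in each script.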