nccl: upgrade to 2.26.2 to avoid hang on ncclCommAbort#149351
nccl: upgrade to 2.26.2 to avoid hang on ncclCommAbort#149351
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/149351
Note: Links to docs will display an error until the docs builds have been completed. ✅ You can merge normally! (3 Unrelated Failures)As of commit 6b79c80 with merge base 8bc7bd9 ( FLAKY - The following job failed but was likely due to flakiness present on trunk:
UNSTABLE - The following jobs are marked as unstable, possibly due to flakiness on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
|
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Merge failedReason: 1 mandatory check(s) failed. The first few are: Dig deeper by viewing the failures on hud |
|
@pytorchbot merge -i |
Merge startedYour change will be merged while ignoring the following 2 checks: pull / linux-jammy-py3-clang12-executorch / test (executorch, 1, 1, ephemeral.linux.2xlarge), pull / cuda12.4-py3.10-gcc9-sm75 / test (pr_time_benchmarks, 1, 1, linux.g4dn.metal.nvidia.gpu) Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
|
Hi @d4l3k looks like this causes failure with aarch64 CUDA builds : https://github.com/pytorch/pytorch/actions/runs/13937741027/job/39008788394#step:14:4197 Errors: |
Observing aarch64 failure in nightly: https://github.com/pytorch/pytorch/actions/runs/13917778961/job/38943911228 Similar to: pytorch/vision#8982 ``` 2025-03-18T08:44:58.4128744Z Repairing Wheel with AuditWheel 2025-03-18T08:44:58.5440988Z INFO:auditwheel.main_repair:Repairing torch-2.8.0.dev20250318+cpu-cp39-cp39-linux_aarch64.whl 2025-03-18T08:45:20.3393288Z Traceback (most recent call last): 2025-03-18T08:45:20.3393732Z File "/opt/python/cp39-cp39/bin/auditwheel", line 8, in <module> 2025-03-18T08:45:20.3394115Z sys.exit(main()) 2025-03-18T08:45:20.3394559Z File "/opt/_internal/cpython-3.9.21/lib/python3.9/site-packages/auditwheel/main.py", line 53, in main 2025-03-18T08:45:20.3395064Z result: int | None = args.func(args, p) 2025-03-18T08:45:20.3395626Z File "/opt/_internal/cpython-3.9.21/lib/python3.9/site-packages/auditwheel/main_repair.py", line 203, in execute 2025-03-18T08:45:20.3396163Z out_wheel = repair_wheel( 2025-03-18T08:45:20.3396657Z File "/opt/_internal/cpython-3.9.21/lib/python3.9/site-packages/auditwheel/repair.py", line 84, in repair_wheel 2025-03-18T08:45:20.3397184Z raise ValueError(msg) 2025-03-18T08:45:20.3397620Z ValueError: Cannot repair wheel, because required library "libarm_compute.so" could not be located 2025-03-18T08:45:20.3678843Z Traceback (most recent call last): 2025-03-18T08:45:20.3679267Z File "/pytorch/.ci/aarch64_linux/aarch64_wheel_ci_build.py", line 236, in <module> 2025-03-18T08:45:20.3680988Z pytorch_wheel_name = complete_wheel("/pytorch/") 2025-03-18T08:45:20.3681449Z File "/pytorch/.ci/aarch64_linux/aarch64_wheel_ci_build.py", line 141, in complete_wheel 2025-03-18T08:45:20.3681976Z check_call(["auditwheel", "repair", f"dist/{wheel_name}"], cwd=folder) 2025-03-18T08:45:20.3682860Z File "/opt/python/cp39-cp39/lib/python3.9/subprocess.py", line 373, in check_call 2025-03-18T08:45:20.3683308Z raise CalledProcessError(retcode, cmd) 2025-03-18T08:45:20.3684034Z subprocess.CalledProcessError: Command '['auditwheel', 'repair', 'dist/torch-2.8.0.dev20250318+cpu-cp39-cp39-linux_aarch64.whl']' returned non-zero exit status 1. 2025-03-18T08:45:20.3790063Z ##[error]Process completed with exit code 1. 2025-03-18T08:45:20.3862012Z ##[group]Run pytorch/test-infra/.github/actions/teardown-linux@main 2025-03-18T08:45:20.3862448Z with: ``` Please note aarch64 CUDA failures are related to: #149351 Pull Request resolved: #149471 Approved by: https://github.com/malfet
Observing aarch64 failure in nightly: https://github.com/pytorch/pytorch/actions/runs/13917778961/job/38943911228 Similar to: pytorch/vision#8982 ``` 2025-03-18T08:44:58.4128744Z Repairing Wheel with AuditWheel 2025-03-18T08:44:58.5440988Z INFO:auditwheel.main_repair:Repairing torch-2.8.0.dev20250318+cpu-cp39-cp39-linux_aarch64.whl 2025-03-18T08:45:20.3393288Z Traceback (most recent call last): 2025-03-18T08:45:20.3393732Z File "/opt/python/cp39-cp39/bin/auditwheel", line 8, in <module> 2025-03-18T08:45:20.3394115Z sys.exit(main()) 2025-03-18T08:45:20.3394559Z File "/opt/_internal/cpython-3.9.21/lib/python3.9/site-packages/auditwheel/main.py", line 53, in main 2025-03-18T08:45:20.3395064Z result: int | None = args.func(args, p) 2025-03-18T08:45:20.3395626Z File "/opt/_internal/cpython-3.9.21/lib/python3.9/site-packages/auditwheel/main_repair.py", line 203, in execute 2025-03-18T08:45:20.3396163Z out_wheel = repair_wheel( 2025-03-18T08:45:20.3396657Z File "/opt/_internal/cpython-3.9.21/lib/python3.9/site-packages/auditwheel/repair.py", line 84, in repair_wheel 2025-03-18T08:45:20.3397184Z raise ValueError(msg) 2025-03-18T08:45:20.3397620Z ValueError: Cannot repair wheel, because required library "libarm_compute.so" could not be located 2025-03-18T08:45:20.3678843Z Traceback (most recent call last): 2025-03-18T08:45:20.3679267Z File "/pytorch/.ci/aarch64_linux/aarch64_wheel_ci_build.py", line 236, in <module> 2025-03-18T08:45:20.3680988Z pytorch_wheel_name = complete_wheel("/pytorch/") 2025-03-18T08:45:20.3681449Z File "/pytorch/.ci/aarch64_linux/aarch64_wheel_ci_build.py", line 141, in complete_wheel 2025-03-18T08:45:20.3681976Z check_call(["auditwheel", "repair", f"dist/{wheel_name}"], cwd=folder) 2025-03-18T08:45:20.3682860Z File "/opt/python/cp39-cp39/lib/python3.9/subprocess.py", line 373, in check_call 2025-03-18T08:45:20.3683308Z raise CalledProcessError(retcode, cmd) 2025-03-18T08:45:20.3684034Z subprocess.CalledProcessError: Command '['auditwheel', 'repair', 'dist/torch-2.8.0.dev20250318+cpu-cp39-cp39-linux_aarch64.whl']' returned non-zero exit status 1. 2025-03-18T08:45:20.3790063Z ##[error]Process completed with exit code 1. 2025-03-18T08:45:20.3862012Z ##[group]Run pytorch/test-infra/.github/actions/teardown-linux@main 2025-03-18T08:45:20.3862448Z with: ``` Please note aarch64 CUDA failures are related to: #149351 Pull Request resolved: #149471 Approved by: https://github.com/malfet (cherry picked from commit 4df66e0)
Fixes pytorch#149153 Yaml generated from: ``` python .github/scripts/generate_ci_workflows.py ``` Test plan: Repro in https://gist.github.com/d4l3k/16a19b475952bc40ddd7f2febcc297b7 ``` rm -rf third_party/nccl python setup.py develop ``` Pull Request resolved: pytorch#149351 Approved by: https://github.com/kwen2501, https://github.com/atalman, https://github.com/malfet
…12.6 docker (#149540) 1. Use NCCL_VERSION=v2.26.2-1 . Fixes nccl cuda aarch64 related failure we see here: https://github.com/pytorch/pytorch/actions/runs/13955856471/job/39066681549?pr=149443 . After landing: #149351 TODO: Followup required to unify NCCL definitions across the x86 and aarch64 builds 3. Cleanup Remove older CUDA versions for aarch64 builds . CUDA 12.6 where removed by: #148895 Pull Request resolved: #149540 Approved by: https://github.com/seemethere, https://github.com/malfet, https://github.com/nWEIdia
…12.6 docker (pytorch#149540) 1. Use NCCL_VERSION=v2.26.2-1 . Fixes nccl cuda aarch64 related failure we see here: https://github.com/pytorch/pytorch/actions/runs/13955856471/job/39066681549?pr=149443 . After landing: pytorch#149351 TODO: Followup required to unify NCCL definitions across the x86 and aarch64 builds 3. Cleanup Remove older CUDA versions for aarch64 builds . CUDA 12.6 where removed by: pytorch#148895 Pull Request resolved: pytorch#149540 Approved by: https://github.com/seemethere, https://github.com/malfet, https://github.com/nWEIdia
| @@ -1 +1 @@ | |||
| v2.25.1-1 | |||
There was a problem hiding this comment.
Ugh, we forgot to update some of the common docker install sh scripts to the latest NCCL version too?
There was a problem hiding this comment.
@Skylion007 sorry about that! Do you have code pointers? I can send another PR
Also do you have any repro instructions on how to trigger those so I don't miss anything again?
…12.6 docker (pytorch#149540) 1. Use NCCL_VERSION=v2.26.2-1 . Fixes nccl cuda aarch64 related failure we see here: https://github.com/pytorch/pytorch/actions/runs/13955856471/job/39066681549?pr=149443 . After landing: pytorch#149351 TODO: Followup required to unify NCCL definitions across the x86 and aarch64 builds 3. Cleanup Remove older CUDA versions for aarch64 builds . CUDA 12.6 where removed by: pytorch#148895 Pull Request resolved: pytorch#149540 Approved by: https://github.com/seemethere, https://github.com/malfet, https://github.com/nWEIdia
…149351) (#149546) * nccl: upgrade to 2.26.2 to avoid hang on ncclCommAbort (#149351) Fixes #149153 Yaml generated from: ``` python .github/scripts/generate_ci_workflows.py ``` Test plan: Repro in https://gist.github.com/d4l3k/16a19b475952bc40ddd7f2febcc297b7 ``` rm -rf third_party/nccl python setup.py develop ``` Pull Request resolved: #149351 Approved by: https://github.com/kwen2501, https://github.com/atalman, https://github.com/malfet * fixed_regenerations --------- Co-authored-by: Tristan Rice <rice@fn.lc>
… aarch64 cuda 12.6 docker #149540 (#149624) Modify cuda aarch64 install for cudnn and nccl. Cleanup aarch64 cuda 12.6 docker (#149540) 1. Use NCCL_VERSION=v2.26.2-1 . Fixes nccl cuda aarch64 related failure we see here: https://github.com/pytorch/pytorch/actions/runs/13955856471/job/39066681549?pr=149443 . After landing: #149351 TODO: Followup required to unify NCCL definitions across the x86 and aarch64 builds 3. Cleanup Remove older CUDA versions for aarch64 builds . CUDA 12.6 where removed by: #148895 Pull Request resolved: #149540 Approved by: https://github.com/seemethere, https://github.com/malfet, https://github.com/nWEIdia
…12.6 docker (pytorch#149540) 1. Use NCCL_VERSION=v2.26.2-1 . Fixes nccl cuda aarch64 related failure we see here: https://github.com/pytorch/pytorch/actions/runs/13955856471/job/39066681549?pr=149443 . After landing: pytorch#149351 TODO: Followup required to unify NCCL definitions across the x86 and aarch64 builds 3. Cleanup Remove older CUDA versions for aarch64 builds . CUDA 12.6 where removed by: pytorch#148895 Pull Request resolved: pytorch#149540 Approved by: https://github.com/seemethere, https://github.com/malfet, https://github.com/nWEIdia
Fixes #149153
Yaml generated from:
Test plan:
Repro in https://gist.github.com/d4l3k/16a19b475952bc40ddd7f2febcc297b7