[BE]: Update NCCL submodule to 2.22.3#133593
[BE]: Update NCCL submodule to 2.22.3#133593Skylion007 wants to merge 2 commits intopytorch:mainfrom
Conversation
|
@pytorchbot merge |
Merge startedYour change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team |
Merge failedReason: 2 jobs have failed, first few of them are: linux-binary-manywheel / manywheel-py3_8-cuda12_4-test / test, linux-binary-manywheel / manywheel-py3_8-cuda12_4-split-test / test Details for Dev Infra teamRaised by workflow job |
18a29c0 to
bd355a2
Compare
bd355a2 to
9a4e880
Compare
| "nvidia-curand-cu11==10.3.0.86; platform_system == 'Linux' and platform_machine == 'x86_64' | " | ||
| "nvidia-cusolver-cu11==11.4.1.48; platform_system == 'Linux' and platform_machine == 'x86_64' | " | ||
| "nvidia-cusparse-cu11==11.7.5.86; platform_system == 'Linux' and platform_machine == 'x86_64' | " | ||
| "nvidia-nccl-cu11==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | " |
There was a problem hiding this comment.
Here https://pypi.org/project/nvidia-nccl-cu11/ version: 2.21.5 is the latest one. I guess we only would need to update cu 12.1 and 12.4
There was a problem hiding this comment.
@eqy Do you know why there are no binaries released for 11.8 here? Is this intentional or a release bug?
There was a problem hiding this comment.
There was a problem hiding this comment.
If 2.22.3 is not supported for 11.8 and we need 2.22.3, we should stop supporting 11.8, but IMO this is a pretty big step to make... On the other hand, those constrains are only needed for poetry, aren't they? In that case, we might build/link with nccl statically for 11.8, but dynamically for 12.x
There was a problem hiding this comment.
We kinda decided to stop supporting 11.8, so maybe we could update nccl? cc @kwen2501
malfet
left a comment
There was a problem hiding this comment.
Please do not remove "nvidia-nccl-cu11==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | " constrains or just add statement, that we don't care about poetry users anymore...
That's not the issue, the issue is we have a divergence of supported NCCL versions. 11.8 only supports up to 2.21.5 while 12 supports newer versions. This will also be a problem for CUDNN especially as we update it to get needed CUDNN attention fixes. For NCCL it's a big problem since the CONDA 11.8 and the pypi builds willl diverge due to using different versions of submodule compiled code vs shipped binaries. |
|
Let's try to outline pros and cons of continuing to maintain 11.8 support in 2.5 (First PyTorch version to be released with 11.8 were PyTorch-2.0) |
|
Yeah, we could try to do some submodule hacking to change the tags in different dockers, but that seems like a pain. |
|
Let's deprecate CUDA 11.8 and update nccl at the same time, for 2.6 release probably. |
|
Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as |
Update NCCL submodule to 2.22.3 . This is a pretty major release that has been out for a while and hasn't encountered any issues. It also fixes a lot bugs and enables some new NCCL features: https://docs.nvidia.com/deeplearning/nccl/release-notes/rel_2-22-3.html . It's been out since June so it should be fairly stable.