[BE] Update nvshmem dependency to 3.3.20 #160458
malfet wants to merge 7 commits into gh/malfet/484/base
And surprise, the nvshmem tarball for 3.3.20 is not published to https://developer.download.nvidia.com/compute/redist/nvshmem/
It seems NVSHMEM changed the location where its packages are officially published.
For the install_nvshmem function:
filename="libnvshmem_cuda${cuda_major_version}-linux-${arch_path}-${nvshmem_version}"
url="https://developer.download.nvidia.com/compute/redist/nvshmem/${nvshmem_version}/builds/cuda${cuda_major_version}/txz/agnostic/${dl_arch}/${filename}.tar.gz"
changing to below should work:
filename="libnvshmem-linux-${arch_path}-${nvshmem_version}_cuda${cuda_major_version}-archive"
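To make the rename easier to see, here is a minimal sketch that prints both naming schemes side by side. `nvshmem_archive_name` is a hypothetical helper written for this comment, not something that exists in install_cuda.sh, and the argument values are only examples.

```shell
# Sketch only: compares the two archive naming schemes quoted in this thread.
# nvshmem_archive_name is a hypothetical helper, not part of install_cuda.sh.
# Arguments: scheme (old|new), arch_path, nvshmem_version, cuda_major_version
nvshmem_archive_name() {
  case "$1" in
    old)  # naming used by the existing install_nvshmem function
      printf 'libnvshmem_cuda%s-linux-%s-%s\n' "$4" "$2" "$3" ;;
    new)  # naming suggested above for 3.3.20
      printf 'libnvshmem-linux-%s-%s_cuda%s-archive\n' "$2" "$3" "$4" ;;
    *)
      echo "unknown scheme: $1" >&2
      return 1 ;;
  esac
}

nvshmem_archive_name old x86_64 3.3.9 12
nvshmem_archive_name new x86_64 3.3.20 12
```

The CUDA major version moves from the front of the name to a suffix after the NVSHMEM version, and the new name gains a trailing `-archive`.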
|
Not sure if these are needed, but a search shows other references to 3.3.9. To be consistent, we might as well update all 3.3.9 to 3.3.20.
|
Hi @malfet, we fixed the link and 3.3.20 is now available: https://developer.download.nvidia.com/compute/redist/nvshmem/3.3.20/builds/cuda12/txz/agnostic/ Please note that these packages now seem to have an "xz" suffix, so pytorch/.ci/docker/common/install_cuda.sh (lines 65 to 66 in a354fa9) needs a matching update.
|
Uh, in addition to the .gz to .xz change, the file name also seems to have changed.
nWEIdia
left a comment
Uh looks like there are other references to tar.gz.
e.g. after wget -q "${url}"
we have:
tar xf "${filename}.tar.gz"
So this needs yet another update...
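One way to avoid this kind of drift is to build the tarball name once and reuse it for both the download and the extraction. The following is a minimal sketch under stated assumptions: the URL layout is copied from the snippet quoted earlier in this thread, the `tarball` variable is introduced here for illustration, and `x64` is an assumed value for `dl_arch`. It only prints the commands instead of running them.

```shell
# Sketch only: keep the tarball name in a single variable so the URL passed to
# wget and the file passed to tar cannot get out of sync.
nvshmem_version="3.3.20"
cuda_major_version="12"
arch_path="x86_64"
dl_arch="x64"   # assumption for illustration; not confirmed in this thread

filename="libnvshmem-linux-${arch_path}-${nvshmem_version}_cuda${cuda_major_version}-archive"
tarball="${filename}.tar.xz"   # hypothetical variable: the one place the extension lives
url="https://developer.download.nvidia.com/compute/redist/nvshmem/${nvshmem_version}/builds/cuda${cuda_major_version}/txz/agnostic/${dl_arch}/${tarball}"

# Print rather than execute, so the sketch needs no network access.
echo "would run: wget -q \"${url}\""
echo "would run: tar xf \"${tarball}\""
```

With this shape, a future change of extension touches one line instead of two or three.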
Updated, and obviously I didn't plan to merge -f with all the signals red
|
I guess I should have stacked the comments together:
filename="libnvshmem-linux-${arch_path}-${nvshmem_version}_cuda${cuda_major_version}-archive"
I promise this should be THE last change that is needed.
nWEIdia
left a comment
Sorry, I have to eat my own words again.
the following:
cp -a "libnvshmem/include/"* /usr/local/cuda/include/
cp -a "libnvshmem/lib/"* /usr/local/cuda/lib64/
needs to be changed to:
cp -a "${filename}/include/"* /usr/local/cuda/include/
cp -a "${filename}/lib/"* /usr/local/cuda/lib64/
to accommodate the layout of the released tar.xz file.
Extracting v3.3.9 yields a directory named libnvshmem,
whereas v3.3.20, as it currently stands, yields e.g. libnvshmem-linux-x86_64-3.3.20_cuda12-archive. Apologies for the inconvenience!
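The copy step can be tried safely on a scratch directory before touching /usr/local/cuda. Below is a minimal sketch; `copy_nvshmem_files` is a hypothetical helper, and the destination prefix is a parameter introduced here purely so the example does not need root access.

```shell
# Sketch only: copy headers and libraries out of the extracted archive directory.
# copy_nvshmem_files is a hypothetical helper, not part of install_cuda.sh.
# Arguments:
#   $1  extracted directory, e.g. libnvshmem-linux-x86_64-3.3.20_cuda12-archive
#   $2  destination prefix, e.g. /usr/local/cuda
copy_nvshmem_files() {
  mkdir -p "$2/include" "$2/lib64"
  # The trailing slash before the glob copies the directory contents,
  # not the include/ or lib/ directory itself.
  cp -a "$1/include/"* "$2/include/"
  cp -a "$1/lib/"* "$2/lib64/"
}

# Usage against a scratch tree containing empty placeholder files.
scratch="$(mktemp -d)"
src="${scratch}/libnvshmem-linux-x86_64-3.3.20_cuda12-archive"
mkdir -p "${src}/include" "${src}/lib"
: > "${src}/include/nvshmem.h"
: > "${src}/lib/libnvshmem_host.so.3"
copy_nvshmem_files "${src}" "${scratch}/cuda"
ls "${scratch}/cuda/include" "${scratch}/cuda/lib64"
```

Passing the extracted directory name as an argument keeps the function independent of whether the archive unpacks to libnvshmem or to the longer versioned name.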
|
To clarify, I initially had to fix the URL to unblock the CUDA 13 PR, since the build was failing (when the packages had not yet been moved to the current location), so fixing the URL here was a nice added bonus. Since this PR has been merged and the CUDA 13 one will need a rebase, I will open a separate PR to update the URL.
|
@pytorchmergebot revert -m "need to rerun workflow generation (failing workflow-checks)" -c landrace |
|
@pytorchbot successfully started a revert job. Check the current status here. |
This reverts commit e0488d9. Reverted #160458 on behalf of https://github.com/wdvr due to need to rerun workflow generation (failing workflow-checks) (comment)
|
@malfet your PR has been successfully reverted. |
@wdvr it should have been a landrace with @atalman's #160788
|
@pytorchbot merge -f "2nd time is the charm" |
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).
Which is manylinux2_28 compatible, even on aarch64 platform
archive contents and URL pattern changed quite drastically between 3.3.9 and 3.3.20, but hopefully it still works.
Package `libnvshmem_host.so.3` into gigantic aarch64+CUDA wheel
Should fix pytorch#160425
Pull Request resolved: pytorch#160458
Approved by: https://github.com/Skylion007, https://github.com/kwen2501, https://github.com/nWEIdia, https://github.com/atalman, https://github.com/tinglvv
Stack from ghstack (oldest at bottom):
Which is manylinux2_28 compatible, even on aarch64 platform
archive contents and URL pattern changed quite drastically between 3.3.9 and 3.3.20, but hopefully it still works.
Package `libnvshmem_host.so.3` into gigantic aarch64+CUDA wheel
Should fix #160425