third-party: Improvements to NVSHMEM Integration #295

Merged
sphish merged 9 commits into deepseek-ai:main from seth-howell:main
Jul 16, 2025

@seth-howell
Contributor

  1. Use upstream NVSHMEM binaries when building DeepEP.
  2. Add back support for CPU-Assisted IBGDA.
  3. Remove the nvshmem host-side patch.

NVSHMEM 3.3 and above support the host-side features
in the patch.

Note: Removed recv queue support

Signed-off-by: Seth Howell <sethh@nvidia.com>
This allows users to use NVSHMEM without setting the driver regkey.

Signed-off-by: Seth Howell <sethh@nvidia.com>
Signed-off-by: Seth Howell <sethh@nvidia.com>
setup.py Outdated
-nvcc_dlink.extend(['-dlink', f'-L{nvshmem_dir}/lib', '-lnvshmem'])
-extra_link_args.extend(['-l:libnvshmem.a', '-l:nvshmem_bootstrap_uid.so', f'-Wl,-rpath,{nvshmem_dir}/lib'])
+nvcc_dlink.extend(['-dlink', f'-L{nvshmem_dir}/lib', '-lnvshmem_device'])
+extra_link_args.extend(['-l:libnvshmem_host.so', '-l:libnvshmem_device.a', f'-Wl,-rpath,{nvshmem_dir}/lib'])

Quality-of-life suggestion: since we now use unmodified NVSHMEM binaries, why not have setup.py fall back to locating the NVSHMEM directory from the nvshmem wheel when it is not specified, e.g. via import nvshmem; nvshmem.__file__? That would simplify compilation a lot.

Contributor Author

@seth-howell seth-howell Jul 14, 2025

I was validating this on my local system, and it seems that some NVIDIA wheels that ship only C++ binaries (NVSHMEM, NCCL, etc.) don't include __init__.py files, so this isn't possible right now (nvshmem.__file__ is None). I've raised a ticket internally to fix that, and as soon as the fix is available I can push another change to update setup.py.

Contributor Author

Disregard, I got a little extra guidance on how we're expected to do this with namespace packages.

Contributor Author

Done.
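
The thread above does not show the final setup.py change, so the following is only a minimal sketch of how a namespace package (one without __init__.py, whose __file__ is None) can still be located on disk. The helper names, the NVSHMEM_DIR environment variable, and the package name nvidia.nvshmem are assumptions for illustration, not details confirmed by this PR.

```python
import importlib.util
import os
from typing import Optional


def find_package_dir(package: str) -> Optional[str]:
    """Return the on-disk directory of `package`, or None if it cannot be found.

    Hypothetical helper: works for regular packages and for namespace
    packages, which have no __init__.py and therefore no usable __file__.
    """
    try:
        spec = importlib.util.find_spec(package)
    except (ImportError, ValueError):
        # find_spec imports parent packages; a missing parent raises here.
        return None
    if spec is None:
        return None
    # Packages (including namespace packages) list their directories here.
    locations = list(spec.submodule_search_locations or [])
    if locations:
        return locations[0]
    # Plain modules: fall back to the directory containing the module file.
    if spec.has_location and spec.origin:
        return os.path.dirname(spec.origin)
    return None


def resolve_nvshmem_dir(env_var: str = "NVSHMEM_DIR",
                        package: str = "nvidia.nvshmem") -> Optional[str]:
    """Prefer an explicit directory from the environment, else try the wheel.

    Both default names are assumptions made for this sketch.
    """
    return os.environ.get(env_var) or find_package_dir(package)
```

A build script could call resolve_nvshmem_dir() once and raise a clear error if it returns None, so an explicit setting always wins over automatic discovery.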

This will give consumers an opportunity to update their builds.

Signed-off-by: Seth Howell <sethh@nvidia.com>
Signed-off-by: Seth Howell <sethh@nvidia.com>
Responding to review comments.

Signed-off-by: Seth Howell <sethh@nvidia.com>
This enables the CPU-Assisted data path.

Signed-off-by: Seth Howell <sethh@nvidia.com>
Signed-off-by: Seth Howell <sethh@nvidia.com>
@sphish
Collaborator

sphish commented Jul 16, 2025

LGTM. Any suggestions, @youkaichao?

@youkaichao
Contributor

LGTM now, thanks!

@sphish sphish merged commit b6ce310 into deepseek-ai:main Jul 16, 2025
@ishandhanani

ishandhanani commented Jul 31, 2025

@youkaichao - we're having some trouble when running SGLang + DeepEP after this version bump. Specifically we see some cuMemCreate failed errors. Curious on if you've seen that recently after this PR?

Resolved by fixing the CUDA graph batch size (bs) setting in SGLang.

@youkaichao
Contributor

> we're having some trouble when running SGLang + DeepEP after this version bump. Specifically we see some cuMemCreate failed errors. Curious on if you've seen that recently after this PR?

Sorry, I have no idea.

@alpha-baby
Contributor

> @youkaichao - we're having some trouble when running SGLang + DeepEP after this version bump. Specifically we see some cuMemCreate failed errors. Curious on if you've seen that recently after this PR?

Can you share the detailed log?

@HPC4AI

HPC4AI commented Oct 28, 2025

> @youkaichao - we're having some trouble when running SGLang + DeepEP after this version bump. Specifically we see some cuMemCreate failed errors. Curious on if you've seen that recently after this PR?
>
> Resolved by fixing the CUDA graph batch size (bs) setting in SGLang.

Hello, I’m also encountering the same issue. Could you share how you resolved it? Why does CUDA Graph cause this error?
Thanks

@ishandhanani

> > @youkaichao - we're having some trouble when running SGLang + DeepEP after this version bump. Specifically we see some cuMemCreate failed errors. Curious on if you've seen that recently after this PR?
> > Resolved by fixing the CUDA graph batch size (bs) setting in SGLang.
>
> Hello, I’m also encountering the same issue. Could you share how you resolved it? Why does CUDA Graph cause this error? Thanks

NVSHMEM allocates memory during initialization. CUDA graphs were taking too much memory, which caused that allocation to fail.

7 participants