third-party: Improvements to NVSHMEM Integration #295
sphish merged 9 commits into deepseek-ai:main
seth-howell commented on Jul 12, 2025
- Use upstream NVSHMEM binaries when building DeepEP.
- Add back support for CPU-Assisted IBGDA.
- Remove the nvshmem host-side patch.
NVSHMEM 3.3 and above support the host-side features in the patch. Note: Removed recv queue support. Signed-off-by: Seth Howell <sethh@nvidia.com>
This allows users to use NVSHMEM without setting the driver regkey. Signed-off-by: Seth Howell <sethh@nvidia.com>
Signed-off-by: Seth Howell <sethh@nvidia.com>
setup.py (outdated)
- nvcc_dlink.extend(['-dlink', f'-L{nvshmem_dir}/lib', '-lnvshmem'])
- extra_link_args.extend(['-l:libnvshmem.a', '-l:nvshmem_bootstrap_uid.so', f'-Wl,-rpath,{nvshmem_dir}/lib'])
+ nvcc_dlink.extend(['-dlink', f'-L{nvshmem_dir}/lib', '-lnvshmem_device'])
+ extra_link_args.extend(['-l:libnvshmem_host.so', '-l:libnvshmem_device.a', f'-Wl,-rpath,{nvshmem_dir}/lib'])
Quality-of-life suggestion: since we now use unmodified NVSHMEM binaries, why not have setup.py fall back to locating the NVSHMEM directory from the nvshmem wheel (via `import nvshmem; nvshmem.__file__`) when it is not specified? That would simplify compilation a lot.
I validated this on my local system, and it seems that some NVIDIA wheels that ship only C++ binaries (NVSHMEM, NCCL, etc.) have no __init__.py files, so this is impossible right now (nvshmem.__file__ is None). I've raised a ticket internally to fix that, and as soon as the fix is available I can push another change to update setup.py.
Disregard, I got a little extra guidance on how we're expected to do this with namespace packages.
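For readers following this thread, here is a minimal sketch of how a directory can be located for a namespace package, where `__file__` is None but the import spec still carries search locations. This is not necessarily the code that landed in setup.py. The helper names, the `NVSHMEM_DIR` environment variable, and the default module name `nvidia.nvshmem` are assumptions made for illustration.

```python
import importlib.util
import os
from typing import Optional


def find_package_dir(module_name: str) -> Optional[str]:
    """Return the first on-disk directory of an importable package, or None.

    Works for namespace packages (no __init__.py), whose __file__ is None
    but whose spec still lists submodule_search_locations.
    """
    try:
        spec = importlib.util.find_spec(module_name)
    except (ImportError, ValueError):
        # Raised when a parent package is missing or a spec is malformed.
        return None
    if spec is None or not spec.submodule_search_locations:
        return None
    return list(spec.submodule_search_locations)[0]


def resolve_nvshmem_dir(module_name: str = "nvidia.nvshmem") -> Optional[str]:
    """Prefer an explicit directory, then fall back to the installed wheel.

    Both the NVSHMEM_DIR variable and the default module name are
    assumptions; adjust them to match the actual build setup.
    """
    explicit = os.environ.get("NVSHMEM_DIR")
    if explicit:
        return explicit
    return find_package_dir(module_name)
```

A build script could call `resolve_nvshmem_dir()` once and treat a `None` result as "NVSHMEM not available", keeping the explicit setting as the override for custom builds.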
This will give consumers an opportunity to update their builds. Signed-off-by: Seth Howell <sethh@nvidia.com>
Signed-off-by: Seth Howell <sethh@nvidia.com>
Signed-off-by: Seth Howell <sethh@nvidia.com>
Responding to review comments. Signed-off-by: Seth Howell <sethh@nvidia.com>
This enables the CPU-Assisted data path. Signed-off-by: Seth Howell <sethh@nvidia.com>
Signed-off-by: Seth Howell <sethh@nvidia.com>
LGTM, any suggestion? @youkaichao
LGTM now, thanks!
Resolved by fixing the CUDA graph batch size (bs) in SGLang.
Sorry, I have no ideas.
Can you show a detailed log?
Hello, I'm also encountering the same issue. Could you share how you resolved it? Why does CUDA Graph cause this error?
NVSHMEM allocates memory during init. CUDA graph was taking too much memory, causing the failure.