Fix DOCA build/path regressions from NIXL integration (#606) #616

Merged
jershi425 merged 1 commit into deepseek-ai:hybrid-ep from QiaoK:doca_path_fix on Apr 29, 2026

QiaoK commented Apr 28, 2026

The NIXL integration commit (1b8f467) broke DOCA-only builds in two ways:

  1. internode_doca.cuh lost the torch / pybind11 / standard headers that the original internode.cuh provided. Building with HYBRID_EP_MULTINODE=1 and USE_NIXL=0 (the documented DOCA path) failed with 9 nvcc errors on the torch::Tensor / py::list / memcpy expressions inside RDMACoordinator::exchange_remote_rdma_info.

  2. A committed empty .use_nixl marker file plus a setup.py fallback forced use_nixl=True on every build, regardless of the USE_NIXL env var. This silently disabled the DOCA path even when the user explicitly opted out.
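The second regression can be illustrated with a small reconstruction. This is a hypothetical sketch, not the code from 1b8f467: the function name `resolve_use_nixl_with_marker`, the `repo_root` parameter, and the assumption that `USE_NIXL=1` is the opt-in value are all illustrative. It only shows why a marker-file fallback defeats an explicit opt-out.

```python
import os
import tempfile


def resolve_use_nixl_with_marker(env, repo_root):
    """Hypothetical pre-fix logic: env var first, then a marker-file fallback."""
    requested = env.get("USE_NIXL", "0") == "1"
    marker = os.path.join(repo_root, ".use_nixl")
    # The fallback cannot tell "unset" from an explicit USE_NIXL=0, so a
    # committed marker file turns NIXL on for every checkout.
    return requested or os.path.exists(marker)


with tempfile.TemporaryDirectory() as repo_root:
    # Simulate the committed, empty marker file.
    open(os.path.join(repo_root, ".use_nixl"), "w").close()
    opted_out = resolve_use_nixl_with_marker({"USE_NIXL": "0"}, repo_root)
    print(opted_out)  # True: the explicit opt-out is ignored
```

Because the marker is part of the repository, no value of the environment variable can select the DOCA path under logic of this shape.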

This change:

  • Restores the missing system / torch / pybind11 includes in internode_doca.cuh (relocated from the original internode.cuh; guarded by #ifndef USE_NIXL so they cannot leak into the NIXL build).
  • Removes the .use_nixl marker file and the corresponding fallback branch in setup.py, making USE_NIXL=0 the actual default again.
  • Restores the DOCA branch in setup.py to the pre-NIXL behavior: subprocess.run(make src.build, ..., check=True), without the SystemExit message that suggested switching to NIXL.
  • Updates docs/README_Hybrid-EP.md to drop the .use_nixl workaround and point users at pip install --no-build-isolation when env vars are stripped by build isolation.
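The restored behavior described in the list above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the actual setup.py: `use_nixl_from_env` and `run_build_step` are hypothetical helper names, treating only the literal `1` as opt-in is an assumption, and a failing Python one-liner stands in for the real `make` invocation so the example runs anywhere.

```python
import subprocess
import sys


def use_nixl_from_env(env):
    """Hypothetical flag resolution: the env var is the only input.

    Assumes only the literal "1" opts in; unset or any other value
    selects the DOCA path.
    """
    return env.get("USE_NIXL", "0") == "1"


def run_build_step(cmd, cwd=None):
    """Run a build command and let failures propagate.

    check=True makes subprocess.run raise CalledProcessError on a
    non-zero exit code, so the underlying build error is surfaced
    instead of being replaced by a custom SystemExit message.
    """
    subprocess.run(cmd, cwd=cwd, check=True)


print(use_nixl_from_env({}))                 # False: DOCA is the default
print(use_nixl_from_env({"USE_NIXL": "1"}))  # True: NIXL is opt-in

# Stand-in for a failing build command (the real one invokes make).
failing_cmd = [sys.executable, "-c", "import sys; sys.exit(2)"]
failed_code = None
try:
    run_build_step(failing_cmd)
except subprocess.CalledProcessError as exc:
    failed_code = exc.returncode
print(failed_code)  # 2
```

Letting `CalledProcessError` propagate keeps the original exit code and command visible in the traceback, which is the practical difference from catching the failure and raising a generic message.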

DOCA build configuration (sources, includes, libs, extra_objects, link args, NVCC flags) is now byte-for-byte identical to the pre-1b8f467 state. NIXL paths are unchanged and remain opt-in via USE_NIXL=1.

Fixes #606

jershi425 merged commit d28bd67 into deepseek-ai:hybrid-ep on Apr 29, 2026
Autumn1998 pushed a commit to Autumn1998/DeepEP that referenced this pull request on May 9, 2026