Fix DOCA build/path regressions from NIXL integration (#606) #616
Merged
The NIXL integration commit (1b8f467) broke DOCA-only builds in two ways:

1. internode_doca.cuh lost the torch / pybind11 / standard headers that the original internode.cuh provided. Building with HYBRID_EP_MULTINODE=1 and USE_NIXL=0 (the documented DOCA path) failed with 9 nvcc errors on the torch::Tensor / py::list / memcpy expressions inside RDMACoordinator::exchange_remote_rdma_info.
2. A committed empty .use_nixl marker file plus a setup.py fallback forced use_nixl=True on every build, regardless of the USE_NIXL env var. This silently disabled the DOCA path even when the user explicitly opted out.

This change:

- Restores the missing system / torch / pybind11 includes in internode_doca.cuh (relocated from the original internode.cuh; guarded by #ifndef USE_NIXL so they cannot leak into the NIXL build).
- Removes the .use_nixl marker file and the corresponding fallback branch in setup.py, making USE_NIXL=0 the actual default again.
- Restores the DOCA branch in setup.py to the pre-NIXL behavior: subprocess.run(make src.build, ..., check=True) with no NIXL-suggesting SystemExit message.
- Updates docs/README_Hybrid-EP.md to drop the .use_nixl workaround and point users at pip install --no-build-isolation when env vars are stripped by build isolation.

DOCA build configuration (sources, includes, libs, extra_objects, link args, NVCC flags) is now byte-for-byte identical to the pre-1b8f467 state. NIXL paths are unchanged and remain opt-in via USE_NIXL=1.

Fixes deepseek-ai#606
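For reviewers, the backend selection this description aims for can be sketched roughly as follows. This is an illustrative sketch, not the actual setup.py: the helper names (`use_nixl_from_env`, `select_backend`, `build_doca_objects`) are hypothetical, the assumption that only the literal value `1` enables a flag is mine, and the real `make src.build` invocation carries arguments that are elided in the description above.

```python
import os
import subprocess


def use_nixl_from_env(env=None):
    """Return True only when USE_NIXL is explicitly set to "1".

    Assumption: no marker file or other fallback is consulted, so an
    unset variable behaves like USE_NIXL=0.
    """
    env = os.environ if env is None else env
    return env.get("USE_NIXL", "0").strip() == "1"


def select_backend(env=None):
    """Pick the internode backend from the environment (hypothetical helper).

    Returns None for a single-node build, "nixl" when the user opted in,
    and "doca" otherwise.
    """
    env = os.environ if env is None else env
    if env.get("HYBRID_EP_MULTINODE", "0").strip() != "1":
        return None
    return "nixl" if use_nixl_from_env(env) else "doca"


def build_doca_objects(make_dir, run=subprocess.run):
    """Run the DOCA make target; check=True lets a failed make abort the build.

    Assumption: the target is invoked as `make src.build` from make_dir.
    The real call passes further arguments that are not shown here.
    """
    run(["make", "src.build"], cwd=make_dir, check=True)
```

With this shape, `select_backend({"HYBRID_EP_MULTINODE": "1"})` resolves to the DOCA path unless `USE_NIXL=1` is also set, which is the default the PR describes restoring.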