fix build error without GPU for doca/NCCL #606

Open
alpha-baby wants to merge 1 commit into deepseek-ai:hybrid-ep from alpha-baby:fujh/build_hybrid-ep_without_gpu

@alpha-baby

Contributor

Fix a build error in csrc/hybrid_ep/buffer/internode_doca.cu when building without a GPU.

Build command:

rm -rf .use_nixl
HYBRID_EP_MULTINODE=1 LIBRARY_PATH=/usr/local/cuda/lib64/stubs:$LIBRARY_PATH TORCH_CUDA_ARCH_LIST='10.0' python setup.py bdist_wheel

Error log:

FAILED: /root/hybrid-ep-open-source/build/temp.linux-x86_64-cpython-312/csrc/hybrid_ep/buffer/internode_doca.o
/usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /root/hybrid-ep-open-source/build/temp.linux-x86_64-cpython-312/csrc/hybrid_ep/buffer/internode_doca.o.d -I/root/hybrid-ep-open-source/csrc/hybrid_ep/ -I/root/hybrid-ep-open-source/csrc/hybrid_ep/backend/ -I/root/hybrid-ep-open-source/third-party/nccl/src/transport/net_ib/gdaki/doca-gpunetio/include -Iinclude -I/usr/local/lib/python3.12/dist-packages/torch/include -I/usr/local/lib/python3.12/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/usr/include/python3.12 -c -c /root/hybrid-ep-open-source/csrc/hybrid_ep/buffer/internode_doca.cu -o /root/hybrid-ep-open-source/build/temp.linux-x86_64-cpython-312/csrc/hybrid_ep/buffer/internode_doca.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -std=c++17 -Xcompiler -fPIC --expt-relaxed-constexpr -O3 --shared '-DSM_ARCH="10.0"' -DHYBRID_EP_BUILD_MULTINODE_ENABLE '-DRDMA_CORE_HOME=""' -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1016"' -DTORCH_EXTENSION_NAME=hybrid_ep_cpp -gencode=arch=compute_100,code=sm_100
In file included from /root/hybrid-ep-open-source/third-party/nccl/src/transport/net_ib/gdaki/doca-gpunetio/include/doca_gpunetio_device.h:41,
                 from /root/hybrid-ep-open-source/csrc/hybrid_ep/buffer/internode_doca.cuh:8,
                 from /root/hybrid-ep-open-source/csrc/hybrid_ep/buffer/internode_doca.cu:4:
/root/hybrid-ep-open-source/third-party/nccl/src/transport/net_ib/gdaki/doca-gpunetio/include/device/doca_gpunetio_dev_verbs_common.cuh:58:2: warning: #warning "warning: doca_gpunetio should be used with a CUDA version >= 12020." [-Wcpp]
   58 | #warning "warning: doca_gpunetio should be used with a CUDA version >= 12020."
      |  ^~~~~~~
In file included from /root/hybrid-ep-open-source/third-party/nccl/src/transport/net_ib/gdaki/doca-gpunetio/include/doca_gpunetio_device.h:41,
                 from /root/hybrid-ep-open-source/csrc/hybrid_ep/buffer/internode_doca.cuh:8,
                 from /root/hybrid-ep-open-source/csrc/hybrid_ep/buffer/internode_doca.cu:4:
/root/hybrid-ep-open-source/third-party/nccl/src/transport/net_ib/gdaki/doca-gpunetio/include/device/doca_gpunetio_dev_verbs_common.cuh:58:2: warning: #warning "warning: doca_gpunetio should be used with a CUDA version >= 12020." [-Wcpp]
   58 | #warning "warning: doca_gpunetio should be used with a CUDA version >= 12020."
      |  ^~~~~~~
/root/hybrid-ep-open-source/csrc/hybrid_ep/buffer/internode_doca.cu(796): error: name followed by "::" must be a class or namespace name
    torch::Tensor buffer = torch::empty({num_bytes}, at::device(at::kCPU).dtype(at::kByte));
    ^

/root/hybrid-ep-open-source/csrc/hybrid_ep/buffer/internode_doca.cu(796): error: expected a ";"
    torch::Tensor buffer = torch::empty({num_bytes}, at::device(at::kCPU).dtype(at::kByte));
                  ^

/root/hybrid-ep-open-source/csrc/hybrid_ep/buffer/internode_doca.cu(797): error: identifier "buffer" is undefined
    memcpy(buffer.data_ptr<uint8_t>(), reinterpret_cast<void *>(src), num_of_qps * sizeof(remote_info));
           ^

/root/hybrid-ep-open-source/csrc/hybrid_ep/buffer/internode_doca.cu(797): error: type name is not allowed
    memcpy(buffer.data_ptr<uint8_t>(), reinterpret_cast<void *>(src), num_of_qps * sizeof(remote_info));
                           ^

/root/hybrid-ep-open-source/csrc/hybrid_ep/buffer/internode_doca.cu(797): error: expected an expression
    memcpy(buffer.data_ptr<uint8_t>(), reinterpret_cast<void *>(src), num_of_qps * sizeof(remote_info));
                                    ^

/root/hybrid-ep-open-source/csrc/hybrid_ep/buffer/internode_doca.cu(805): error: name followed by "::" must be a class or namespace name
      output_list.append(torch::empty_like(buffer));
                         ^

/root/hybrid-ep-open-source/csrc/hybrid_ep/buffer/internode_doca.cu(812): error: name followed by "::" must be a class or namespace name
      auto tensor = output_list[i].cast<torch::Tensor>().cpu();
                                        ^

/root/hybrid-ep-open-source/csrc/hybrid_ep/buffer/internode_doca.cu(813): error: type name is not allowed
      memcpy(dst + num_of_qps * (i / buffer_config.num_of_ranks_per_node), tensor.data_ptr<uint8_t>(), num_of_qps * sizeof(remote_info));
                                                                                           ^

/root/hybrid-ep-open-source/csrc/hybrid_ep/buffer/internode_doca.cu(813): error: expected an expression
      memcpy(dst + num_of_qps * (i / buffer_config.num_of_ranks_per_node), tensor.data_ptr<uint8_t>(), num_of_qps * sizeof(remote_info));
                                                                                                    ^

9 errors detected in the compilation of "/root/hybrid-ep-open-source/csrc/hybrid_ep/buffer/internode_doca.cu".

@alpha-baby

Contributor Author

Hi @Autumn1998. I would greatly appreciate it if you could take a look when you get a chance. Please let me know if any changes are needed. Thank you for your time and guidance!

@shouyil

shouyil commented Apr 27, 2026


I hit the same issue when building a container. Thanks @alpha-baby for this PR. For me, adding the includes fixed the issue. @Autumn1998 Please take a look when you get a chance.
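
The errors in the log all trace back to the `torch` namespace not being declared at the point where `internode_doca.cu` uses it, which is why adding includes resolves them. The thread does not say which headers the PR adds, so the following is only a minimal sketch of the failure mode. It assumes a hypothetical stand-in namespace `fake_torch` and a hypothetical record type `remote_info_demo`; neither is part of the real code base or of PyTorch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for the header that declares the tensor namespace.
// In the real file this role is played by a PyTorch header; which one the
// PR adds is not stated in this thread. Deleting this namespace block makes
// every use of fake_torch:: below fail to compile, which is the same class
// of error the log reports for torch::.
namespace fake_torch {
struct Tensor {
    std::vector<uint8_t> bytes;
    uint8_t *data_ptr() { return bytes.data(); }
};
inline Tensor empty(std::size_t num_bytes) {
    return Tensor{std::vector<uint8_t>(num_bytes)};
}
}  // namespace fake_torch

// Hypothetical record type; the real remote_info layout is not shown here.
struct remote_info_demo {
    uint32_t qpn;
    uint64_t addr;
};

// Same shape as the failing statements: allocate a byte buffer, then memcpy
// num_of_qps records into it.
inline fake_torch::Tensor pack(const remote_info_demo *src, int num_of_qps) {
    const std::size_t num_bytes =
        static_cast<std::size_t>(num_of_qps) * sizeof(remote_info_demo);
    fake_torch::Tensor buffer = fake_torch::empty(num_bytes);
    std::memcpy(buffer.data_ptr(), src, num_bytes);
    return buffer;
}
```

Once the first declaration fails, the compiler can no longer parse the later lines that depend on `buffer`, so a single missing declaration fans out into several follow-on errors like those reported for lines 797 and 813.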

@Autumn1998

Autumn1998 commented Apr 28, 2026

Collaborator

@alpha-baby @shouyil This looks like an issue introduced by the latest nixl backend commit. Please temporarily drop/revert the latest commit.
cc @QiaoK, could you help take a look and fix it as soon as possible?

@QiaoK

QiaoK commented Apr 28, 2026


Addressed in #616. I was able to reproduce the reported issue. With the fix in PR #616, the build no longer fails with USE_NIXL=0. Please let me know if any issues remain.
