I'm trying to build sgl-kernel from source, but the build fails during the sgl-attn stage. The detailed error output is at the end.
environment
hardware: H20 * 8
software:
host driver version: 570.158.01 (535.161.07 hits the same error)
docker image:
cuda: 12.6
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Aug_14_10:10:22_PDT_2024
Cuda compilation tools, release 12.6, V12.6.68
Build cuda_12.6.r12.6/compiler.34714021_0
I previously compiled sgl-kernel successfully with CUDA driver 535.161.07 and CUDA 12.4. Because of the greenctx_stream feature, we updated the docker image to CUDA 12.6 to compile the newest sgl-kernel, but hit this error during the sgl-attn compile stage.
I can't tell from the error log what caused this. It does not seem to be related to the CUDA driver version, since both drivers listed above fail the same way. I don't understand what causes the error, or why there was no such problem with CUDA 12.4.
I'd appreciate any advice or help in solving the problem. Thanks.
specific error output
[311/314] Building CUDA object CMakeFiles/flash_ops.dir/cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper/instantiations/flash_fwd_hdimdiff_bf16_paged_softcap_sm90.cu.o
FAILED: CMakeFiles/flash_ops.dir/cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper/instantiations/flash_fwd_hdimdiff_bf16_paged_softcap_sm90.cu.o
ccache /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DFLASHATTENTION_DISABLE_BACKWARD -DFLASHATTENTION_DISABLE_DROPOUT -DFLASHATTENTION_DISABLE_UNEVEN_K -DFLASHATTENTION_VARLEN_ONLY -DPy_LIMITED_API=0x03090000 -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dflash_ops_EXPORTS -I/cfs/xtchen/repositories/sglang/sgl-kernel/include -I/cfs/xtchen/repositories/sglang/sgl-kernel/csrc -I/cfs/xtchen/repositories/sgl-kernel-sub/cutlass/include -I/cfs/xtchen/repositories/sgl-kernel-sub/cutlass/tools/util/include -I/cfs/xtchen/repositories/sgl-kernel-sub/flashinfer/include -I/cfs/xtchen/repositories/sgl-kernel-sub/flashinfer/csrc -I/cfs/xtchen/repositories/sgl-kernel-sub/mscclpp/include -I/cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper -isystem /usr/include/python3.10 -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda/targets/x86_64-linux/include -DONNX_NAMESPACE=onnx_c2 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -D_GLIBCXX_USE_CXX11_ABI=1 -O3 -DNDEBUG -std=c++17 -Xcompiler=-fPIC -DNDEBUG -DOPERATOR_NAMESPACE=sgl-kernel -O3 -Xcompiler -fPIC -gencode=arch=compute_90a,code=sm_90a -std=c++17 -DCUTE_USE_PACKED_TUPLE=1 -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -DCUTLASS_VERSIONS_GENERATED -DCUTLASS_TEST_LEVEL=0 -DCUTLASS_TEST_ENABLE_CACHED_RESULTS=1 -DCUTLASS_DEBUG_TRACE_LEVEL=0 --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -Xcompiler=-Wconversion -Xcompiler=-fno-strict-aliasing -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_86,code=sm_86 -D_GLIBCXX_USE_CXX11_ABI=1 -MD -MT 
CMakeFiles/flash_ops.dir/cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper/instantiations/flash_fwd_hdimdiff_bf16_paged_softcap_sm90.cu.o -MF CMakeFiles/flash_ops.dir/cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper/instantiations/flash_fwd_hdimdiff_bf16_paged_softcap_sm90.cu.o.d -x cu -c /cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper/instantiations/flash_fwd_hdimdiff_bf16_paged_softcap_sm90.cu -o CMakeFiles/flash_ops.dir/cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper/instantiations/flash_fwd_hdimdiff_bf16_paged_softcap_sm90.cu.o
ptxas info : (C7517) warpgroup.wait is injected in around line 399385 by compiler to allow use of registers defined by GMMA in function '_ZN7cutlass13device_kernelIN5flash20enable_sm90_or_laterINS1_16FlashAttnFwdSm90INS1_25CollectiveMainloopFwdSm90ILi2EN4cute5tupleIJNS5_1CILi1EEES8_S8_EEENS6_IJNS7_ILi64EEESA_SA_EEELi512ENS_10bfloat16_tEfNS_4arch4Sm90ELb0ELb1ELb1ELb0ELb1ELb0ELb1ELb0ELb0ELb1ELb0ELb0EEENS1_21CollectiveEpilogueFwdINS6_IJSA_NS7_ILi512EEESA_EEES9_SC_SE_Li256ELb0ELb1ELb0ELb0EEENS1_30DynamicPersistentTileSchedulerILi256ELi128ELb0ELb1ELb1EEEEEEEEEvNT_6ParamsE'
ptxas info : (C7510) Potential Performance Loss: wgmma.mma_async instructions are serialized due to wgmma pipeline crossing function boundary at a function call in the function '_ZN7cutlass13device_kernelIN5flash20enable_sm90_or_laterINS1_16FlashAttnFwdSm90INS1_25CollectiveMainloopFwdSm90ILi2EN4cute5tupleIJNS5_1CILi1EEES8_S8_EEENS6_IJNS7_ILi64EEESA_SA_EEELi512ENS_10bfloat16_tEfNS_4arch4Sm90ELb0ELb1ELb1ELb0ELb1ELb0ELb1ELb0ELb0ELb1ELb0ELb0EEENS1_21CollectiveEpilogueFwdINS6_IJSA_NS7_ILi512EEESA_EEES9_SC_SE_Li256ELb0ELb1ELb0ELb0EEENS1_30DynamicPersistentTileSchedulerILi256ELi128ELb0ELb1ELb1EEEEEEEEEvNT_6ParamsE'
ptxas info : (C7517) warpgroup.wait is injected in around line 354007 by compiler to allow use of registers defined by GMMA in function '_ZN7cutlass13device_kernelIN5flash20enable_sm90_or_laterINS1_16FlashAttnFwdSm90INS1_25CollectiveMainloopFwdSm90ILi2EN4cute5tupleIJNS5_1CILi1EEES8_S8_EEENS6_IJNS7_ILi64EEESA_SA_EEELi512ENS_10bfloat16_tEfNS_4arch4Sm90ELb0ELb1ELb1ELb1ELb1ELb0ELb1ELb0ELb0ELb1ELb0ELb0EEENS1_21CollectiveEpilogueFwdINS6_IJSA_NS7_ILi512EEESA_EEES9_SC_SE_Li256ELb1ELb1ELb0ELb0EEENS1_36VarlenDynamicPersistentTileSchedulerILi64ELi256ELi128ELb0ELb1ELb1EEEEEEEEEvNT_6ParamsE'
ptxas info : (C7510) Potential Performance Loss: wgmma.mma_async instructions are serialized due to wgmma pipeline crossing function boundary at a function call in the function '_ZN7cutlass13device_kernelIN5flash20enable_sm90_or_laterINS1_16FlashAttnFwdSm90INS1_25CollectiveMainloopFwdSm90ILi2EN4cute5tupleIJNS5_1CILi1EEES8_S8_EEENS6_IJNS7_ILi64EEESA_SA_EEELi512ENS_10bfloat16_tEfNS_4arch4Sm90ELb0ELb1ELb1ELb1ELb1ELb0ELb1ELb0ELb0ELb1ELb0ELb0EEENS1_21CollectiveEpilogueFwdINS6_IJSA_NS7_ILi512EEESA_EEES9_SC_SE_Li256ELb1ELb1ELb0ELb0EEENS1_36VarlenDynamicPersistentTileSchedulerILi64ELi256ELi128ELb0ELb1ELb1EEEEEEEEEvNT_6ParamsE'
ptxas info : (C7517) warpgroup.wait is injected in around line 325475 by compiler to allow use of registers defined by GMMA in function '_ZN7cutlass13device_kernelIN5flash20enable_sm90_or_laterINS1_16FlashAttnFwdSm90INS1_25CollectiveMainloopFwdSm90ILi2EN4cute5tupleIJNS5_1CILi1EEES8_S8_EEENS6_IJNS7_ILi64EEESA_SA_EEELi512ENS_10bfloat16_tEfNS_4arch4Sm90ELb0ELb1ELb1ELb1ELb1ELb1ELb1ELb0ELb0ELb1ELb0ELb0EEENS1_21CollectiveEpilogueFwdINS6_IJSA_NS7_ILi512EEESA_EEES9_SC_SE_Li256ELb1ELb1ELb0ELb0EEENS1_36VarlenDynamicPersistentTileSchedulerILi64ELi256ELi128ELb0ELb1ELb1EEEEEEEEEvNT_6ParamsE'
ptxas info : (C7510) Potential Performance Loss: wgmma.mma_async instructions are serialized due to wgmma pipeline crossing function boundary at a function call in the function '_ZN7cutlass13device_kernelIN5flash20enable_sm90_or_laterINS1_16FlashAttnFwdSm90INS1_25CollectiveMainloopFwdSm90ILi2EN4cute5tupleIJNS5_1CILi1EEES8_S8_EEENS6_IJNS7_ILi64EEESA_SA_EEELi512ENS_10bfloat16_tEfNS_4arch4Sm90ELb0ELb1ELb1ELb1ELb1ELb1ELb1ELb0ELb0ELb1ELb0ELb0EEENS1_21CollectiveEpilogueFwdINS6_IJSA_NS7_ILi512EEESA_EEES9_SC_SE_Li256ELb1ELb1ELb0ELb0EEENS1_36VarlenDynamicPersistentTileSchedulerILi64ELi256ELi128ELb0ELb1ELb1EEEEEEEEEvNT_6ParamsE'
Segmentation fault (core dumped)
[312/314] Building CUDA object CMakeFiles/flash_ops.dir/cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper/instantiations/flash_fwd_hdimall_e4m3_paged_split_softcap_sm90.cu.o
FAILED: CMakeFiles/flash_ops.dir/cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper/instantiations/flash_fwd_hdimall_e4m3_paged_split_softcap_sm90.cu.o
ccache /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DFLASHATTENTION_DISABLE_BACKWARD -DFLASHATTENTION_DISABLE_DROPOUT -DFLASHATTENTION_DISABLE_UNEVEN_K -DFLASHATTENTION_VARLEN_ONLY -DPy_LIMITED_API=0x03090000 -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dflash_ops_EXPORTS -I/cfs/xtchen/repositories/sglang/sgl-kernel/include -I/cfs/xtchen/repositories/sglang/sgl-kernel/csrc -I/cfs/xtchen/repositories/sgl-kernel-sub/cutlass/include -I/cfs/xtchen/repositories/sgl-kernel-sub/cutlass/tools/util/include -I/cfs/xtchen/repositories/sgl-kernel-sub/flashinfer/include -I/cfs/xtchen/repositories/sgl-kernel-sub/flashinfer/csrc -I/cfs/xtchen/repositories/sgl-kernel-sub/mscclpp/include -I/cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper -isystem /usr/include/python3.10 -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda/targets/x86_64-linux/include -DONNX_NAMESPACE=onnx_c2 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -D_GLIBCXX_USE_CXX11_ABI=1 -O3 -DNDEBUG -std=c++17 -Xcompiler=-fPIC -DNDEBUG -DOPERATOR_NAMESPACE=sgl-kernel -O3 -Xcompiler -fPIC -gencode=arch=compute_90a,code=sm_90a -std=c++17 -DCUTE_USE_PACKED_TUPLE=1 -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -DCUTLASS_VERSIONS_GENERATED -DCUTLASS_TEST_LEVEL=0 -DCUTLASS_TEST_ENABLE_CACHED_RESULTS=1 -DCUTLASS_DEBUG_TRACE_LEVEL=0 --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -Xcompiler=-Wconversion -Xcompiler=-fno-strict-aliasing -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_86,code=sm_86 -D_GLIBCXX_USE_CXX11_ABI=1 -MD -MT 
CMakeFiles/flash_ops.dir/cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper/instantiations/flash_fwd_hdimall_e4m3_paged_split_softcap_sm90.cu.o -MF CMakeFiles/flash_ops.dir/cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper/instantiations/flash_fwd_hdimall_e4m3_paged_split_softcap_sm90.cu.o.d -x cu -c /cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper/instantiations/flash_fwd_hdimall_e4m3_paged_split_softcap_sm90.cu -o CMakeFiles/flash_ops.dir/cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper/instantiations/flash_fwd_hdimall_e4m3_paged_split_softcap_sm90.cu.o
Segmentation fault (core dumped)
[313/314] Building CUDA object CMakeFiles/flash_ops.dir/cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper/instantiations/flash_fwd_hdimall_bf16_paged_split_sm90.cu.o
ninja: build stopped: subcommand failed.
*** CMake build failed
× Failed to build `/cfs/xtchen/repositories/sglang/sgl-kernel`
├─▶ The build backend returned an error
╰─▶ Call to `scikit_build_core.build.build_wheel` failed (exit status: 1)
hint: This usually indicates a problem with the package or the build environment.
make: *** [Makefile:29:build] Error 2
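Since the log above is long, here is a minimal sketch of how the failing translation units can be pulled out of it. This is my own hypothetical helper (`summarize_failures` is not part of sgl-kernel or ninja), and the sample log uses abbreviated paths (`a/...`) in place of the full ones. It only assumes that ninja prints a `FAILED:` line for each failing object and that a crash shows up as a later `Segmentation fault` line.

```python
import re

# ninja prints "FAILED: <object file>" for every failing build step.
FAILED_RE = re.compile(r"^FAILED:\s+(\S+)")


def summarize_failures(log_text):
    """Return (object_file, crashed) pairs found in ninja build output.

    crashed is True when a "Segmentation fault" line appears after the
    FAILED line and before the next FAILED line.
    """
    results = []
    current = None  # object file of the FAILED block currently being read
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        match = FAILED_RE.match(line)
        if match:
            # A new FAILED block starts; close the previous one as non-crash.
            if current is not None:
                results.append((current, False))
            current = match.group(1)
        elif current is not None and line.startswith("Segmentation fault"):
            results.append((current, True))
            current = None
    if current is not None:
        results.append((current, False))
    return results


# Abbreviated sample in the shape of the log above (paths shortened to "a/").
sample_log = """\
[311/314] Building CUDA object a/flash_fwd_hdimdiff_bf16_paged_softcap_sm90.cu.o
FAILED: a/flash_fwd_hdimdiff_bf16_paged_softcap_sm90.cu.o
ptxas info : (C7510) Potential Performance Loss
Segmentation fault (core dumped)
[312/314] Building CUDA object a/flash_fwd_hdimall_e4m3_paged_split_softcap_sm90.cu.o
FAILED: a/flash_fwd_hdimall_e4m3_paged_split_softcap_sm90.cu.o
Segmentation fault (core dumped)
"""

failures = summarize_failures(sample_log)
for obj, crashed in failures:
    name = obj.rsplit("/", 1)[-1]
    print(name, "segfault" if crashed else "other error")
```

In this log, both failing objects are `paged` + `softcap` instantiations under `sgl-attn/hopper/instantiations`, and both end in a compiler segmentation fault rather than a source-level compile error.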