Building sgl-kernel fails on the latest code with nvcc/CUDA 12.6 #8097

I'm trying to build `sgl-kernel` from source, but the build fails during the `sgl-attn` stage. The full error output is at the end.
### environments
```shell
hardware: H20 * 8

software:
host driver version: 570.158.01 (535.161.07 hits the same error)

docker images: 
cuda: 12.6
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Aug_14_10:10:22_PDT_2024
Cuda compilation tools, release 12.6, V12.6.68
Build cuda_12.6.r12.6/compiler.34714021_0
```
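
For anyone trying to reproduce this, the toolkit and driver versions listed above can be re-collected with a small helper along these lines. This is only a sketch: the function name `cuda_release` is hypothetical, and it assumes `nvcc` and `nvidia-smi` are on `PATH` inside the container.

```shell
#!/bin/sh
# Hypothetical helper: extract the CUDA release number (e.g. "12.6")
# from `nvcc --version` output supplied on stdin.
cuda_release() {
  sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Print the toolkit release only if nvcc is actually installed.
if command -v nvcc >/dev/null 2>&1; then
  echo "cuda release: $(nvcc --version | cuda_release)"
fi

# Print the host driver version only if nvidia-smi is available.
if command -v nvidia-smi >/dev/null 2>&1; then
  echo "driver: $(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
fi
```

Both checks are guarded, so the script is silent on machines without the CUDA toolchain rather than failing.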
I previously compiled `sgl-kernel` successfully with CUDA driver 535.161.07 and CUDA 12.4. Because of the `greenctx_stream` feature, we updated the Docker image to CUDA 12.6 to compile the latest `sgl-kernel`, but hit this error during the `sgl-attn` compile stage.
From the error log, I can't tell what causes the failure. It does not appear to be related to the CUDA driver version, since both drivers listed above fail the same way, and I don't understand why the problem did not occur with CUDA 12.4.
Any advice or help solving this would be appreciated. Thanks.

### specific error info
```shell
[311/314] Building CUDA object CMakeFiles/flash_ops.dir/cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper/instantiations/flash_fwd_hdimdiff_bf16_paged_softcap_sm90.cu.o
FAILED: CMakeFiles/flash_ops.dir/cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper/instantiations/flash_fwd_hdimdiff_bf16_paged_softcap_sm90.cu.o
ccache /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DFLASHATTENTION_DISABLE_BACKWARD -DFLASHATTENTION_DISABLE_DROPOUT -DFLASHATTENTION_DISABLE_UNEVEN_K -DFLASHATTENTION_VARLEN_ONLY -DPy_LIMITED_API=0x03090000 -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dflash_ops_EXPORTS -I/cfs/xtchen/repositories/sglang/sgl-kernel/include -I/cfs/xtchen/repositories/sglang/sgl-kernel/csrc -I/cfs/xtchen/repositories/sgl-kernel-sub/cutlass/include -I/cfs/xtchen/repositories/sgl-kernel-sub/cutlass/tools/util/include -I/cfs/xtchen/repositories/sgl-kernel-sub/flashinfer/include -I/cfs/xtchen/repositories/sgl-kernel-sub/flashinfer/csrc -I/cfs/xtchen/repositories/sgl-kernel-sub/mscclpp/include -I/cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper -isystem /usr/include/python3.10 -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda/targets/x86_64-linux/include -DONNX_NAMESPACE=onnx_c2 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -D_GLIBCXX_USE_CXX11_ABI=1 -O3 -DNDEBUG -std=c++17 -Xcompiler=-fPIC -DNDEBUG -DOPERATOR_NAMESPACE=sgl-kernel -O3 -Xcompiler -fPIC -gencode=arch=compute_90a,code=sm_90a -std=c++17 -DCUTE_USE_PACKED_TUPLE=1 -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -DCUTLASS_VERSIONS_GENERATED -DCUTLASS_TEST_LEVEL=0 -DCUTLASS_TEST_ENABLE_CACHED_RESULTS=1 -DCUTLASS_DEBUG_TRACE_LEVEL=0 --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -Xcompiler=-Wconversion -Xcompiler=-fno-strict-aliasing -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_86,code=sm_86 -D_GLIBCXX_USE_CXX11_ABI=1 -MD -MT 
CMakeFiles/flash_ops.dir/cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper/instantiations/flash_fwd_hdimdiff_bf16_paged_softcap_sm90.cu.o -MF CMakeFiles/flash_ops.dir/cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper/instantiations/flash_fwd_hdimdiff_bf16_paged_softcap_sm90.cu.o.d -x cu -c /cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper/instantiations/flash_fwd_hdimdiff_bf16_paged_softcap_sm90.cu -o CMakeFiles/flash_ops.dir/cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper/instantiations/flash_fwd_hdimdiff_bf16_paged_softcap_sm90.cu.o
ptxas info    : (C7517) warpgroup.wait is injected in around line 399385 by compiler to allow use of registers defined by GMMA in function '_ZN7cutlass13device_kernelIN5flash20enable_sm90_or_laterINS1_16FlashAttnFwdSm90INS1_25CollectiveMainloopFwdSm90ILi2EN4cute5tupleIJNS5_1CILi1EEES8_S8_EEENS6_IJNS7_ILi64EEESA_SA_EEELi512ENS_10bfloat16_tEfNS_4arch4Sm90ELb0ELb1ELb1ELb0ELb1ELb0ELb1ELb0ELb0ELb1ELb0ELb0EEENS1_21CollectiveEpilogueFwdINS6_IJSA_NS7_ILi512EEESA_EEES9_SC_SE_Li256ELb0ELb1ELb0ELb0EEENS1_30DynamicPersistentTileSchedulerILi256ELi128ELb0ELb1ELb1EEEEEEEEEvNT_6ParamsE'
ptxas info    : (C7510) Potential Performance Loss: wgmma.mma_async instructions are serialized due to wgmma pipeline crossing function boundary at a function call in the function '_ZN7cutlass13device_kernelIN5flash20enable_sm90_or_laterINS1_16FlashAttnFwdSm90INS1_25CollectiveMainloopFwdSm90ILi2EN4cute5tupleIJNS5_1CILi1EEES8_S8_EEENS6_IJNS7_ILi64EEESA_SA_EEELi512ENS_10bfloat16_tEfNS_4arch4Sm90ELb0ELb1ELb1ELb0ELb1ELb0ELb1ELb0ELb0ELb1ELb0ELb0EEENS1_21CollectiveEpilogueFwdINS6_IJSA_NS7_ILi512EEESA_EEES9_SC_SE_Li256ELb0ELb1ELb0ELb0EEENS1_30DynamicPersistentTileSchedulerILi256ELi128ELb0ELb1ELb1EEEEEEEEEvNT_6ParamsE'
ptxas info    : (C7517) warpgroup.wait is injected in around line 354007 by compiler to allow use of registers defined by GMMA in function '_ZN7cutlass13device_kernelIN5flash20enable_sm90_or_laterINS1_16FlashAttnFwdSm90INS1_25CollectiveMainloopFwdSm90ILi2EN4cute5tupleIJNS5_1CILi1EEES8_S8_EEENS6_IJNS7_ILi64EEESA_SA_EEELi512ENS_10bfloat16_tEfNS_4arch4Sm90ELb0ELb1ELb1ELb1ELb1ELb0ELb1ELb0ELb0ELb1ELb0ELb0EEENS1_21CollectiveEpilogueFwdINS6_IJSA_NS7_ILi512EEESA_EEES9_SC_SE_Li256ELb1ELb1ELb0ELb0EEENS1_36VarlenDynamicPersistentTileSchedulerILi64ELi256ELi128ELb0ELb1ELb1EEEEEEEEEvNT_6ParamsE'
ptxas info    : (C7510) Potential Performance Loss: wgmma.mma_async instructions are serialized due to wgmma pipeline crossing function boundary at a function call in the function '_ZN7cutlass13device_kernelIN5flash20enable_sm90_or_laterINS1_16FlashAttnFwdSm90INS1_25CollectiveMainloopFwdSm90ILi2EN4cute5tupleIJNS5_1CILi1EEES8_S8_EEENS6_IJNS7_ILi64EEESA_SA_EEELi512ENS_10bfloat16_tEfNS_4arch4Sm90ELb0ELb1ELb1ELb1ELb1ELb0ELb1ELb0ELb0ELb1ELb0ELb0EEENS1_21CollectiveEpilogueFwdINS6_IJSA_NS7_ILi512EEESA_EEES9_SC_SE_Li256ELb1ELb1ELb0ELb0EEENS1_36VarlenDynamicPersistentTileSchedulerILi64ELi256ELi128ELb0ELb1ELb1EEEEEEEEEvNT_6ParamsE'
ptxas info    : (C7517) warpgroup.wait is injected in around line 325475 by compiler to allow use of registers defined by GMMA in function '_ZN7cutlass13device_kernelIN5flash20enable_sm90_or_laterINS1_16FlashAttnFwdSm90INS1_25CollectiveMainloopFwdSm90ILi2EN4cute5tupleIJNS5_1CILi1EEES8_S8_EEENS6_IJNS7_ILi64EEESA_SA_EEELi512ENS_10bfloat16_tEfNS_4arch4Sm90ELb0ELb1ELb1ELb1ELb1ELb1ELb1ELb0ELb0ELb1ELb0ELb0EEENS1_21CollectiveEpilogueFwdINS6_IJSA_NS7_ILi512EEESA_EEES9_SC_SE_Li256ELb1ELb1ELb0ELb0EEENS1_36VarlenDynamicPersistentTileSchedulerILi64ELi256ELi128ELb0ELb1ELb1EEEEEEEEEvNT_6ParamsE'
ptxas info    : (C7510) Potential Performance Loss: wgmma.mma_async instructions are serialized due to wgmma pipeline crossing function boundary at a function call in the function '_ZN7cutlass13device_kernelIN5flash20enable_sm90_or_laterINS1_16FlashAttnFwdSm90INS1_25CollectiveMainloopFwdSm90ILi2EN4cute5tupleIJNS5_1CILi1EEES8_S8_EEENS6_IJNS7_ILi64EEESA_SA_EEELi512ENS_10bfloat16_tEfNS_4arch4Sm90ELb0ELb1ELb1ELb1ELb1ELb1ELb1ELb0ELb0ELb1ELb0ELb0EEENS1_21CollectiveEpilogueFwdINS6_IJSA_NS7_ILi512EEESA_EEES9_SC_SE_Li256ELb1ELb1ELb0ELb0EEENS1_36VarlenDynamicPersistentTileSchedulerILi64ELi256ELi128ELb0ELb1ELb1EEEEEEEEEvNT_6ParamsE'
Segmentation fault (core dumped)
[312/314] Building CUDA object CMakeFiles/flash_ops.dir/cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper/instantiations/flash_fwd_hdimall_e4m3_paged_split_softcap_sm90.cu.o
FAILED: CMakeFiles/flash_ops.dir/cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper/instantiations/flash_fwd_hdimall_e4m3_paged_split_softcap_sm90.cu.o
ccache /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DFLASHATTENTION_DISABLE_BACKWARD -DFLASHATTENTION_DISABLE_DROPOUT -DFLASHATTENTION_DISABLE_UNEVEN_K -DFLASHATTENTION_VARLEN_ONLY -DPy_LIMITED_API=0x03090000 -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dflash_ops_EXPORTS -I/cfs/xtchen/repositories/sglang/sgl-kernel/include -I/cfs/xtchen/repositories/sglang/sgl-kernel/csrc -I/cfs/xtchen/repositories/sgl-kernel-sub/cutlass/include -I/cfs/xtchen/repositories/sgl-kernel-sub/cutlass/tools/util/include -I/cfs/xtchen/repositories/sgl-kernel-sub/flashinfer/include -I/cfs/xtchen/repositories/sgl-kernel-sub/flashinfer/csrc -I/cfs/xtchen/repositories/sgl-kernel-sub/mscclpp/include -I/cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper -isystem /usr/include/python3.10 -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda/targets/x86_64-linux/include -DONNX_NAMESPACE=onnx_c2 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -D_GLIBCXX_USE_CXX11_ABI=1 -O3 -DNDEBUG -std=c++17 -Xcompiler=-fPIC -DNDEBUG -DOPERATOR_NAMESPACE=sgl-kernel -O3 -Xcompiler -fPIC -gencode=arch=compute_90a,code=sm_90a -std=c++17 -DCUTE_USE_PACKED_TUPLE=1 -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1 -DCUTLASS_VERSIONS_GENERATED -DCUTLASS_TEST_LEVEL=0 -DCUTLASS_TEST_ENABLE_CACHED_RESULTS=1 -DCUTLASS_DEBUG_TRACE_LEVEL=0 --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math -Xcompiler=-Wconversion -Xcompiler=-fno-strict-aliasing -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_86,code=sm_86 -D_GLIBCXX_USE_CXX11_ABI=1 -MD -MT 
CMakeFiles/flash_ops.dir/cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper/instantiations/flash_fwd_hdimall_e4m3_paged_split_softcap_sm90.cu.o -MF CMakeFiles/flash_ops.dir/cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper/instantiations/flash_fwd_hdimall_e4m3_paged_split_softcap_sm90.cu.o.d -x cu -c /cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper/instantiations/flash_fwd_hdimall_e4m3_paged_split_softcap_sm90.cu -o CMakeFiles/flash_ops.dir/cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper/instantiations/flash_fwd_hdimall_e4m3_paged_split_softcap_sm90.cu.o
Segmentation fault (core dumped)
[313/314] Building CUDA object CMakeFiles/flash_ops.dir/cfs/xtchen/repositories/sgl-kernel-sub/sgl-attn/hopper/instantiations/flash_fwd_hdimall_bf16_paged_split_sm90.cu.o
ninja: build stopped: subcommand failed.

*** CMake build failed
  × Failed to build `/cfs/xtchen/repositories/sglang/sgl-kernel`
  ├─▶ The build backend returned an error
  ╰─▶ Call to `scikit_build_core.build.build_wheel` failed (exit status: 1)
      hint: This usually indicates a problem with the package or the build environment.
make: *** [Makefile:29: build] Error 2
```
