
Conversation


Copilot AI commented Oct 3, 2025

Problem

Users with RTX 5090 GPUs are experiencing runtime errors when using onnxruntime-gpu:

[ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running Slice node. 
Name:'Slice_34' Status Message: CUDA error cudaErrorNoKernelImageForDevice:
no kernel image is available for execution on the device

This occurs because RTX 5090 uses CUDA compute architecture 12.0 (SM 12.0), which was not included in the packaged builds due to PyPI size constraints. The current onnxruntime-gpu packages only include architectures: 52, 61, 75, 86, 89, and 90a-virtual.
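
A quick way for affected users to confirm the mismatch (not part of this PR) is to compare the GPU's reported compute capability with what the installed wheel can run. A minimal sketch, assuming PyTorch and onnxruntime-gpu are installed, with a placeholder model path:

```python
# Sketch: confirm the GPU reports SM 12.0 and reproduce the failure mode.
# "model.onnx" is a placeholder; use any model executed on the CUDA provider.
import torch
import onnxruntime as ort

major, minor = torch.cuda.get_device_capability(0)
print(f"GPU compute capability: SM {major}.{minor}")  # RTX 5090 reports 12.0

sess = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
print(sess.get_providers())
# Without a matching kernel image in the wheel, session creation may still
# succeed, but the first sess.run(...) fails with cudaErrorNoKernelImageForDevice.
```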

Solution

This PR adds CUDA compute architecture 120 with accelerated features (120a-real) to all packaging build configurations:

  • Python GPU wheels (Windows and Linux)
  • NuGet packages (Windows CUDA and TensorRT variants)
  • C API packages (Linux)
  • Node.js packages (Linux)

The format 120a-real is used because:

  • 120 = CUDA compute capability 12.0 (RTX 5090)
  • a suffix = Enables accelerated features (WGMMA, TMA, setmaxnreg) for SM >= 90, as defined in cmake/external/cuda_configuration.cmake
  • -real suffix = Compiles for specific hardware (vs. -virtual for PTX); see the sketch after this list for how these specifiers map to nvcc flags
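
To make the specifier semantics concrete, here is an illustrative sketch (not code from this repository) of how a CMAKE_CUDA_ARCHITECTURES entry translates into the nvcc code-generation flag that CMake emits:

```python
# Illustrative sketch (not ONNX Runtime build code): map a CMAKE_CUDA_ARCHITECTURES
# entry such as "120a-real" to the nvcc --generate-code flag CMake produces for it.
def gencode_flag(spec: str) -> str:
    if spec.endswith("-real"):        # real: device binary (cubin) for that exact SM
        arch, code = spec[:-len("-real")], "sm"
    elif spec.endswith("-virtual"):   # virtual: PTX only, JIT-compilable by newer drivers
        arch, code = spec[:-len("-virtual")], "compute"
    else:                             # bare entry: CMake emits both cubin and PTX (simplified here)
        arch, code = spec, "sm"
    return f"--generate-code=arch=compute_{arch},code=[{code}_{arch}]"

for spec in ("90a-virtual", "120a-real"):
    print(spec, "->", gencode_flag(spec))
# 90a-virtual -> --generate-code=arch=compute_90a,code=[compute_90a]
# 120a-real -> --generate-code=arch=compute_120a,code=[sm_120a]
```

Because a -virtual entry ships PTX that the driver can JIT-compile for newer GPUs, it is forward compatible, whereas a -real entry (and any architecture-specific 90a/120a variant) only runs on the hardware it was compiled for.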

Changes

Seven files are modified, with minimal changes to the CMAKE_CUDA_ARCHITECTURES definitions in the packaging pipelines:

  1. tools/ci_build/github/azure-pipelines/stages/py-gpu-packaging-stage.yml - Windows Python wheels
  2. tools/ci_build/github/azure-pipelines/custom-nuget-packaging-pipeline.yml - Custom NuGet packages
  3. tools/ci_build/github/azure-pipelines/stages/nuget-win-cuda-packaging-stage.yml - Windows CUDA/TensorRT NuGet packages
  4. tools/ci_build/github/linux/build_linux_python_package.sh - Linux Python wheels
  5. tools/ci_build/github/linux/build_cuda_c_api_package.sh - Linux CUDA C API packages
  6. tools/ci_build/github/linux/build_nodejs_package.sh - Linux Node.js packages
  7. tools/ci_build/github/linux/build_tensorrt_c_api_package.sh - Linux TensorRT C API packages

CI/test pipelines targeting specific hardware were intentionally left unchanged as they are not for distribution.

Impact

After these changes are built and released, RTX 5090 users will be able to run ONNX Runtime GPU workloads without the "no kernel image is available" error.

Fixes #26181 (the referenced issue in ComfyUI is comfyanonymous/ComfyUI#10028)

Co-authored-by: @snnn

Original prompt

This section details the original issue you should resolve

<issue_title>no kernel image is available for execution on the device [rtx 5090 laptop, wan2.2 animate, DWPreprocessor, onnxruntime-gpu]</issue_title>
<issue_description>Hello.
At first I created a thread on ComfyUI: https://github.com/comfyanonymous/ComfyUI/issues/10028
There I found out that other people have the same issue and that it's related to onnxruntime.

error
[ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running Slice node. Name:'Slice_34' Status Message: CUDA error cudaErrorNoKernelImageForDevice:no kernel image is available for execution on the device

So I have installed:

  • latest pytorch with cu129 (tried also 128)
  • cuda toolkit
  • latest onnxruntime-gpu
  • latest nvidia gpu driver

Some people said it's probably a problem between onnxruntime and the RTX 50 GPU series.

ComfyUI support also replied with the following:

means that ONNX Runtime’s GPU kernels are not fully compatible with your current GPU/driver/PyTorch setup.

ComfyUI Error Report

Error Details

  • Node ID: 233
  • Node Type: DWPreprocessor
  • Exception Type: onnxruntime.capi.onnxruntime_pybind11_state.Fail
  • Exception Message: [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running Slice node. Name:'Slice_34' Status Message: CUDA error cudaErrorNoKernelImageForDevice:no kernel image is available for execution on the device

Stack Trace

File "C:\comfy\ComfyUI\execution.py", line 496, in execute
output_data, output_ui, has_subgraph, has_pending_tasks = await get_output_data(prompt_id, unique_id, obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, hidden_inputs=hidden_inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

File "C:\comfy\ComfyUI\execution.py", line 315, in get_output_data
return_values = await _async_map_node_over_list(prompt_id, unique_id, obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, hidden_inputs=hidden_inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

File "C:\comfy\ComfyUI\execution.py", line 289, in _async_map_node_over_list
await process_inputs(input_dict, i)

File "C:\comfy\ComfyUI\execution.py", line 277, in process_inputs
result = f(**inputs)

File "C:\comfy\ComfyUI\custom_nodes\comfyui_controlnet_aux\node_wrappers\dwpose.py", line 91, in estimate_pose
out = common_annotator_call(func, image, include_hand=detect_hand, include_face=detect_face, include_body=detect_body, image_and_json=True, resolution=resolution, xinsr_stick_scaling=scale_stick_for_xinsr_cn)

File "C:\comfy\ComfyUI\custom_nodes\comfyui_controlnet_aux\utils.py", line 85, in common_annotator_call
np_result = model(np_image, output_type="np", detect_resolution=detect_resolution, **kwargs)

File "C:\comfy\ComfyUI\custom_nodes\comfyui_controlnet_aux\node_wrappers\dwpose.py", line 87, in func
pose_img, openpose_dict = model(image, **kwargs)
~~~~~^^^^^^^^^^^^^^^^^

File "C:\comfy\ComfyUI\custom_nodes\comfyui_controlnet_aux\src\custom_controlnet_aux\dwpose_init_.py", line 266, in call
poses = self.detect_poses(input_image)

File "C:\comfy\ComfyUI\custom_nodes\comfyui_controlnet_aux\src\custom_controlnet_aux\dwpose_init_.py", line 255, in detect_poses
keypoints_info = self.dw_pose_estimation(oriImg.copy())

File "C:\comfy\ComfyUI\custom_nodes\comfyui_controlnet_aux\src\custom_controlnet_aux\dwpose\wholebody.py", line 93, in call
det_result = inference_onnx_yolox(self.det, oriImg, detect_classes=[0], dtype=np.float32)

File "C:\comfy\ComfyUI\custom_nodes\comfyui_controlnet_aux\src\custom_controlnet_aux\dwpose\dw_onnx\cv_ox_det.py", line 104, in inference_detector
output = session.run(None, {input_name: input})

File "c:\comfy.venv\Lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 275, in run
return self._sess.run(output_names, input_feed, run_options)
~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

System Information

  • ComfyUI Version: 0.3.60
  • Arguments: main.py --gpu
  • OS: nt
  • Python Version: 3.13.7 (main, Sep 18 2025, 19:43:45) [MSC v.1944 64 bit (AMD64)]
  • Embedded Python: false
  • PyTorch Version: 2.8.0+cu129

Devices

  • Name: cuda:0 NVIDIA GeForce RTX 5090 Laptop GPU : cudaMallocAsync
    • Type: cuda
    • VRAM Total: 25650855936
    • VRAM Free: 24151851008
    • Torch VRAM Total: 0
    • Torch VRAM Free: 0

Log...

Fixes #26181


Co-authored-by: snnn <856316+snnn@users.noreply.github.com>
Copilot AI changed the title [WIP] no kernel image is available for execution on the device [rtx 5090 laptop, wan2.2 animate, DWPreprocessor, onnxruntime-gpu] Add CUDA architecture 120 support for RTX 5090 GPUs Oct 3, 2025
Copilot AI requested a review from snnn October 3, 2025 02:19
docker run -e SYSTEM_COLLECTIONURI --rm --volume \
$BUILD_SOURCESDIRECTORY:/onnxruntime_src --volume $BUILD_BINARIESDIRECTORY:/build -e NIGHTLY_BUILD onnxruntimecuda${CUDA_VERSION_MAJOR}build \
/bin/bash -c "/usr/bin/python3 /onnxruntime_src/tools/ci_build/build.py --enable_lto --build_java --build_nodejs --build_dir /build --config Release --skip_submodule_sync --parallel --use_binskim_compliant_compile_flags --build_shared_lib --use_cuda --cuda_version=$CUDA_VERSION --cuda_home=/usr/local/cuda-$CUDA_VERSION --cudnn_home=/usr/local/cuda-$CUDA_VERSION --skip_tests --use_vcpkg --use_vcpkg_ms_internal_asset_cache --cmake_extra_defines 'CMAKE_CUDA_ARCHITECTURES=60-real;70-real;75-real;80-real;90a-real;90a-virtual' 'onnxruntime_USE_FPA_INTB_GEMM=OFF' && cd /build/Release && make install DESTDIR=/build/installed"
/bin/bash -c "/usr/bin/python3 /onnxruntime_src/tools/ci_build/build.py --enable_lto --build_java --build_nodejs --build_dir /build --config Release --skip_submodule_sync --parallel --use_binskim_compliant_compile_flags --build_shared_lib --use_cuda --cuda_version=$CUDA_VERSION --cuda_home=/usr/local/cuda-$CUDA_VERSION --cudnn_home=/usr/local/cuda-$CUDA_VERSION --skip_tests --use_vcpkg --use_vcpkg_ms_internal_asset_cache --cmake_extra_defines 'CMAKE_CUDA_ARCHITECTURES=60-real;70-real;75-real;80-real;90a-real;90a-virtual;120a-real' 'onnxruntime_USE_FPA_INTB_GEMM=OFF' && cd /build/Release && make install DESTDIR=/build/installed"

snnn commented Oct 3, 2025


We didn't have 120a there. Why did it work before the 1.23.0 release?


Even if the package becomes too big, can you publish a version as a release here that has all the CUDA archs?



120a-real needs CUDA 12.8 or above in the CI pipeline, so this PR will not work.

The 1.22 release used 90-virtual; 1.23 uses 90a-virtual. That could be the root cause. We should make a change to go back to 90-virtual.

@snnn snnn closed this Oct 5, 2025
tianleiwu added a commit that referenced this pull request Oct 7, 2025
Users with RTX 5090 GPUs are experiencing runtime errors when using
onnxruntime-gpu:
```
[ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running Slice node. 
Name:'Slice_34' Status Message: CUDA error cudaErrorNoKernelImageForDevice:
no kernel image is available for execution on the device
```
This occurs because RTX 5090 uses CUDA compute architecture 12.0 (SM
12.0), while `onnxruntime-gpu` 1.23 was built with `90a-virtual`. The
`90a` architecture is a specialized, non-forward-compatible variant of
the Hopper architecture, making it incompatible with future GPU
generations such as Blackwell.

This change reverts `90a-virtual` back to `90-virtual`, as used in
1.22, which should restore compatibility with Blackwell GPUs.

FPA_INTB_GEMM is disabled by default; it needs some extra work to be
compatible with the 90-virtual (no 90a-real) use case.

Related:
#26002
#26226
#26181
apsonawane pushed a commit that referenced this pull request Oct 17, 2025
apsonawane pushed a commit that referenced this pull request Oct 20, 2025
fs-eire pushed a commit that referenced this pull request Oct 24, 2025
naomiOvad pushed a commit to naomiOvad/onnxruntime that referenced this pull request Nov 2, 2025