
Conversation


Copilot AI commented Oct 3, 2025

Problem

Users with RTX 5090 GPUs are experiencing runtime errors when using onnxruntime-gpu:

[ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running Slice node. 
Name:'Slice_34' Status Message: CUDA error cudaErrorNoKernelImageForDevice:
no kernel image is available for execution on the device

This occurs because RTX 5090 uses CUDA compute architecture 12.0 (SM 12.0), which was not included in the packaged builds due to PyPI size constraints. The current onnxruntime-gpu packages only include architectures: 52, 61, 75, 86, 89, and 90a-virtual.
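
A quick way for affected users to confirm the mismatch (not part of this PR) is to compare the GPU's reported compute capability with what the installed wheel can run. A minimal sketch, assuming PyTorch and onnxruntime-gpu are installed, with a placeholder model path:

```python
# Sketch: confirm the GPU reports SM 12.0 and reproduce the failure mode.
# "model.onnx" is a placeholder; use any model executed on the CUDA provider.
import torch
import onnxruntime as ort

major, minor = torch.cuda.get_device_capability(0)
print(f"GPU compute capability: SM {major}.{minor}")  # RTX 5090 reports 12.0

sess = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
print(sess.get_providers())
# Without a matching kernel image in the wheel, session creation may still
# succeed, but the first sess.run(...) fails with cudaErrorNoKernelImageForDevice.
```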

Solution

This PR adds CUDA compute architecture 120 with accelerated features (120a-real) to all packaging build configurations:

  • Python GPU wheels (Windows and Linux)
  • NuGet packages (Windows CUDA and TensorRT variants)
  • C API packages (Linux)
  • Node.js packages (Linux)

The format 120a-real is used because:

  • 120 = CUDA compute capability 12.0 (RTX 5090)
  • a suffix = Enables accelerated features (WGMMA, TMA, setmaxnreg) for SM >= 90, as defined in cmake/external/cuda_configuration.cmake
  • -real suffix = Compiles for specific hardware (vs. -virtual for PTX); see the sketch after this list for how these specifiers map to nvcc flags
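
To make the specifier semantics concrete, here is an illustrative sketch (not code from this repository) of how a CMAKE_CUDA_ARCHITECTURES entry translates into the nvcc code-generation flag that CMake emits:

```python
# Illustrative sketch (not ONNX Runtime build code): map a CMAKE_CUDA_ARCHITECTURES
# entry such as "120a-real" to the nvcc --generate-code flag CMake produces for it.
def gencode_flag(spec: str) -> str:
    if spec.endswith("-real"):        # real: device binary (cubin) for that exact SM
        arch, code = spec[:-len("-real")], "sm"
    elif spec.endswith("-virtual"):   # virtual: PTX only, JIT-compilable by newer drivers
        arch, code = spec[:-len("-virtual")], "compute"
    else:                             # bare entry: CMake emits both cubin and PTX (simplified here)
        arch, code = spec, "sm"
    return f"--generate-code=arch=compute_{arch},code=[{code}_{arch}]"

for spec in ("90a-virtual", "120a-real"):
    print(spec, "->", gencode_flag(spec))
# 90a-virtual -> --generate-code=arch=compute_90a,code=[compute_90a]
# 120a-real -> --generate-code=arch=compute_120a,code=[sm_120a]
```

Because a -virtual entry ships PTX that the driver can JIT-compile for newer GPUs, it is forward compatible, whereas a -real entry (and any architecture-specific 90a/120a variant) only runs on the hardware it was compiled for.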

Changes

Seven files are modified, with minimal changes to the CMAKE_CUDA_ARCHITECTURES definitions in the packaging pipelines:

  1. tools/ci_build/github/azure-pipelines/stages/py-gpu-packaging-stage.yml - Windows Python wheels
  2. tools/ci_build/github/azure-pipelines/custom-nuget-packaging-pipeline.yml - Custom NuGet packages
  3. tools/ci_build/github/azure-pipelines/stages/nuget-win-cuda-packaging-stage.yml - Windows CUDA/TensorRT NuGet packages
  4. tools/ci_build/github/linux/build_linux_python_package.sh - Linux Python wheels
  5. tools/ci_build/github/linux/build_cuda_c_api_package.sh - Linux CUDA C API packages
  6. tools/ci_build/github/linux/build_nodejs_package.sh - Linux Node.js packages
  7. tools/ci_build/github/linux/build_tensorrt_c_api_package.sh - Linux TensorRT C API packages

CI/test pipelines targeting specific hardware were intentionally left unchanged as they are not for distribution.

Impact

After these changes are built and released, RTX 5090 users will be able to run ONNX Runtime GPU workloads without the "no kernel image is available" error.

Fixes #26181 (the referenced issue in ComfyUI is comfyanonymous/ComfyUI#10028)

Co-authored-by: @snnn

Original prompt

This section details the original issue you should resolve

<issue_title>no kernel image is available for execution on the device [rtx 5090 laptop, wan2.2 animate, DWPreprocessor, onnxruntime-gpu]</issue_title>
<issue_description>Hello.
At first I created a thread on ComfyUI: https://github.com/comfyanonymous/ComfyUI/issues/10028
There I found out that other people have the same issue and that it's related to onnxruntime.

error
[ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running Slice node. Name:'Slice_34' Status Message: CUDA error cudaErrorNoKernelImageForDevice:no kernel image is available for execution on the device

So I have installed:

  • latest pytorch with cu129 (tried also 128)
  • cuda toolkit
  • latest onnxruntime-gpu
  • latest nvidia gpu driver

Some people said it's probably a problem between onnxruntime and the RTX 50 GPU series.

ComfyUI support also replied with the following:

means that ONNX Runtime’s GPU kernels are not fully compatible with your current GPU/driver/PyTorch setup.

ComfyUI Error Report

Error Details

  • Node ID: 233
  • Node Type: DWPreprocessor
  • Exception Type: onnxruntime.capi.onnxruntime_pybind11_state.Fail
  • Exception Message: [ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running Slice node. Name:'Slice_34' Status Message: CUDA error cudaErrorNoKernelImageForDevice:no kernel image is available for execution on the device

Stack Trace

File "C:\comfy\ComfyUI\execution.py", line 496, in execute
output_data, output_ui, has_subgraph, has_pending_tasks = await get_output_data(prompt_id, unique_id, obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, hidden_inputs=hidden_inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

File "C:\comfy\ComfyUI\execution.py", line 315, in get_output_data
return_values = await _async_map_node_over_list(prompt_id, unique_id, obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, hidden_inputs=hidden_inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

File "C:\comfy\ComfyUI\execution.py", line 289, in _async_map_node_over_list
await process_inputs(input_dict, i)

File "C:\comfy\ComfyUI\execution.py", line 277, in process_inputs
result = f(**inputs)

File "C:\comfy\ComfyUI\custom_nodes\comfyui_controlnet_aux\node_wrappers\dwpose.py", line 91, in estimate_pose
out = common_annotator_call(func, image, include_hand=detect_hand, include_face=detect_face, include_body=detect_body, image_and_json=True, resolution=resolution, xinsr_stick_scaling=scale_stick_for_xinsr_cn)

File "C:\comfy\ComfyUI\custom_nodes\comfyui_controlnet_aux\utils.py", line 85, in common_annotator_call
np_result = model(np_image, output_type="np", detect_resolution=detect_resolution, **kwargs)

File "C:\comfy\ComfyUI\custom_nodes\comfyui_controlnet_aux\node_wrappers\dwpose.py", line 87, in func
pose_img, openpose_dict = model(image, **kwargs)
~~~~~^^^^^^^^^^^^^^^^^

File "C:\comfy\ComfyUI\custom_nodes\comfyui_controlnet_aux\src\custom_controlnet_aux\dwpose_init_.py", line 266, in call
poses = self.detect_poses(input_image)

File "C:\comfy\ComfyUI\custom_nodes\comfyui_controlnet_aux\src\custom_controlnet_aux\dwpose_init_.py", line 255, in detect_poses
keypoints_info = self.dw_pose_estimation(oriImg.copy())

File "C:\comfy\ComfyUI\custom_nodes\comfyui_controlnet_aux\src\custom_controlnet_aux\dwpose\wholebody.py", line 93, in call
det_result = inference_onnx_yolox(self.det, oriImg, detect_classes=[0], dtype=np.float32)

File "C:\comfy\ComfyUI\custom_nodes\comfyui_controlnet_aux\src\custom_controlnet_aux\dwpose\dw_onnx\cv_ox_det.py", line 104, in inference_detector
output = session.run(None, {input_name: input})

File "c:\comfy.venv\Lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 275, in run
return self._sess.run(output_names, input_feed, run_options)
~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

System Information

  • ComfyUI Version: 0.3.60
  • Arguments: main.py --gpu
  • OS: nt
  • Python Version: 3.13.7 (main, Sep 18 2025, 19:43:45) [MSC v.1944 64 bit (AMD64)]
  • Embedded Python: false
  • PyTorch Version: 2.8.0+cu129

Devices

  • Name: cuda:0 NVIDIA GeForce RTX 5090 Laptop GPU : cudaMallocAsync
    • Type: cuda
    • VRAM Total: 25650855936
    • VRAM Free: 24151851008
    • Torch VRAM Total: 0
    • Torch VRAM Free: 0

Log...

Fixes #26181


Co-authored-by: snnn <856316+snnn@users.noreply.github.com>
Copilot AI changed the title [WIP] no kernel image is available for execution on the device [rtx 5090 laptop, wan2.2 animate, DWPreprocessor, onnxruntime-gpu] Add CUDA architecture 120 support for RTX 5090 GPUs Oct 3, 2025
Copilot AI requested a review from snnn October 3, 2025 02:19
docker run -e SYSTEM_COLLECTIONURI --rm --volume \
$BUILD_SOURCESDIRECTORY:/onnxruntime_src --volume $BUILD_BINARIESDIRECTORY:/build -e NIGHTLY_BUILD onnxruntimecuda${CUDA_VERSION_MAJOR}build \
/bin/bash -c "/usr/bin/python3 /onnxruntime_src/tools/ci_build/build.py --enable_lto --build_java --build_nodejs --build_dir /build --config Release --skip_submodule_sync --parallel --use_binskim_compliant_compile_flags --build_shared_lib --use_cuda --cuda_version=$CUDA_VERSION --cuda_home=/usr/local/cuda-$CUDA_VERSION --cudnn_home=/usr/local/cuda-$CUDA_VERSION --skip_tests --use_vcpkg --use_vcpkg_ms_internal_asset_cache --cmake_extra_defines 'CMAKE_CUDA_ARCHITECTURES=60-real;70-real;75-real;80-real;90a-real;90a-virtual' 'onnxruntime_USE_FPA_INTB_GEMM=OFF' && cd /build/Release && make install DESTDIR=/build/installed"
/bin/bash -c "/usr/bin/python3 /onnxruntime_src/tools/ci_build/build.py --enable_lto --build_java --build_nodejs --build_dir /build --config Release --skip_submodule_sync --parallel --use_binskim_compliant_compile_flags --build_shared_lib --use_cuda --cuda_version=$CUDA_VERSION --cuda_home=/usr/local/cuda-$CUDA_VERSION --cudnn_home=/usr/local/cuda-$CUDA_VERSION --skip_tests --use_vcpkg --use_vcpkg_ms_internal_asset_cache --cmake_extra_defines 'CMAKE_CUDA_ARCHITECTURES=60-real;70-real;75-real;80-real;90a-real;90a-virtual;120a-real' 'onnxruntime_USE_FPA_INTB_GEMM=OFF' && cd /build/Release && make install DESTDIR=/build/installed"

snnn commented Oct 3, 2025


We didn't have 120a there. Why did it work before the 1.23.0 release?


Even if the package becomes too big, can you publish a version as a release here that has all the CUDA archs?



120a-real needs CUDA 12.8 or above in the CI pipeline, so this PR will not work.

The 1.22 release used 90-virtual; 1.23 uses 90a-virtual. That could be the root cause. We should make a change to go back to 90-virtual.

@snnn snnn closed this Oct 5, 2025
tianleiwu added a commit that referenced this pull request Oct 7, 2025
Users with RTX 5090 GPUs are experiencing runtime errors when using
onnxruntime-gpu:
```
[ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running Slice node. 
Name:'Slice_34' Status Message: CUDA error cudaErrorNoKernelImageForDevice:
no kernel image is available for execution on the device
```
This occurs because RTX 5090 uses CUDA compute architecture 12.0 (SM
12.0), while `onnxruntime-gpu` 1.23 was built with `90a-virtual`. The
`90a` architecture is a specialized, non-forward-compatible variant of
the Hopper architecture, making it incompatible with future GPU
generations such as Blackwell.

This change reverts `90a-virtual` back to `90-virtual`, as used in
1.22, which should restore compatibility with Blackwell GPUs.

FPA_INTB_GEMM is disabled by default; it needs some extra work to be
compatible with the 90-virtual (no 90a-real) use case.

Related:
#26002
#26226
#26181
apsonawane pushed a commit that referenced this pull request Oct 17, 2025
apsonawane pushed a commit that referenced this pull request Oct 20, 2025
fs-eire pushed a commit that referenced this pull request Oct 24, 2025
naomiOvad pushed a commit to naomiOvad/onnxruntime that referenced this pull request Nov 2, 2025