Add CUDA architecture 120 support for RTX 5090 GPUs #26226
Conversation
Co-authored-by: snnn <856316+snnn@users.noreply.github.com>
```diff
  docker run -e SYSTEM_COLLECTIONURI --rm --volume \
    $BUILD_SOURCESDIRECTORY:/onnxruntime_src --volume $BUILD_BINARIESDIRECTORY:/build -e NIGHTLY_BUILD onnxruntimecuda${CUDA_VERSION_MAJOR}build \
-   /bin/bash -c "/usr/bin/python3 /onnxruntime_src/tools/ci_build/build.py --enable_lto --build_java --build_nodejs --build_dir /build --config Release --skip_submodule_sync --parallel --use_binskim_compliant_compile_flags --build_shared_lib --use_cuda --cuda_version=$CUDA_VERSION --cuda_home=/usr/local/cuda-$CUDA_VERSION --cudnn_home=/usr/local/cuda-$CUDA_VERSION --skip_tests --use_vcpkg --use_vcpkg_ms_internal_asset_cache --cmake_extra_defines 'CMAKE_CUDA_ARCHITECTURES=60-real;70-real;75-real;80-real;90a-real;90a-virtual' 'onnxruntime_USE_FPA_INTB_GEMM=OFF' && cd /build/Release && make install DESTDIR=/build/installed"
+   /bin/bash -c "/usr/bin/python3 /onnxruntime_src/tools/ci_build/build.py --enable_lto --build_java --build_nodejs --build_dir /build --config Release --skip_submodule_sync --parallel --use_binskim_compliant_compile_flags --build_shared_lib --use_cuda --cuda_version=$CUDA_VERSION --cuda_home=/usr/local/cuda-$CUDA_VERSION --cudnn_home=/usr/local/cuda-$CUDA_VERSION --skip_tests --use_vcpkg --use_vcpkg_ms_internal_asset_cache --cmake_extra_defines 'CMAKE_CUDA_ARCHITECTURES=60-real;70-real;75-real;80-real;90a-real;90a-virtual;120a-real' 'onnxruntime_USE_FPA_INTB_GEMM=OFF' && cd /build/Release && make install DESTDIR=/build/installed"
```
We didn't have `120a` there. Why did it work before the 1.23.0 release?
Even if the package becomes too big, could you publish a version as a GitHub release here that includes all the CUDA archs?
`120a-real` needs CUDA 12.8 or above in the CI pipeline, so this PR will not work as-is.
The 1.22 release used `90-virtual`; 1.23 uses `90a-virtual`. That could be the root cause. We shall change it back to `90-virtual`.
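To illustrate the difference (a sketch; `kernel.cu` and the library name are placeholders): plain `compute_90` PTX can be JIT-compiled by the driver on later architectures, while the `90a` variant cannot. `cuobjdump` can show which targets a shipped binary actually contains.

```bash
# 90-virtual embeds generic compute_90 PTX, which the driver can JIT for later
# GPUs such as Blackwell (SM 12.0); the "a" variants are arch-specific and are
# not forward compatible.
nvcc -c kernel.cu -o portable.o \
  -gencode arch=compute_90,code=compute_90
# Inspect which SASS/PTX targets a built provider library ships:
cuobjdump --list-elf --list-ptx libonnxruntime_providers_cuda.so
```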
Users with RTX 5090 GPUs are experiencing runtime errors when using onnxruntime-gpu:

```
[ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running Slice node. Name:'Slice_34' Status Message: CUDA error cudaErrorNoKernelImageForDevice: no kernel image is available for execution on the device
```

This occurs because the RTX 5090 uses CUDA compute architecture 12.0 (SM 12.0), while `onnxruntime-gpu` 1.23 was built with `90a-virtual`. The `90a` architecture is a specialized, non-forward-compatible variant of the Hopper architecture, which makes it incompatible with later GPU generations such as Blackwell. This change reverts `90a-virtual` back to `90-virtual`, as used in 1.22, which restores compatibility with Blackwell GPUs. FPA_INTB_GEMM remains disabled by default; it needs some extra work to support the 90-virtual (no 90a-real) case. Related: #26002 #26226 #26181
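A quick way to check whether an installed wheel can serve a Blackwell card (a sketch; `model.onnx` is a placeholder):

```bash
python - <<'EOF'
# If the wheel lacks SM 12.0 device code and forward-compatible PTX, session
# creation or the first run typically fails with
# cudaErrorNoKernelImageForDevice, as in the reports above.
import onnxruntime as ort
print(ort.__version__, ort.get_available_providers())
sess = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
print(sess.get_providers())  # CUDAExecutionProvider should be listed first
EOF
```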
Problem
Users with RTX 5090 GPUs are experiencing runtime errors when using onnxruntime-gpu:
This occurs because RTX 5090 uses CUDA compute architecture 12.0 (SM 12.0), which was not included in the packaged builds due to PyPI size constraints. The current onnxruntime-gpu packages only include architectures: 52, 61, 75, 86, 89, and 90a-virtual.
Solution
This PR adds CUDA compute architecture 120 with accelerated features (`120a-real`) to all packaging build configurations.

The format `120a-real` is used because:

- `120` = CUDA compute capability 12.0 (RTX 5090)
- `a` suffix = enables accelerated features (WGMMA, TMA, setmaxnreg) for SM >= 90, as defined in `cmake/external/cuda_configuration.cmake`
- `-real` suffix = compiles for specific hardware (vs. `-virtual` for PTX)
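For a local build, the same architecture list can be passed through `build.py` (a sketch mirroring this PR's changes; paths are placeholders, and CUDA 12.8+ is assumed for the `120a` target):

```bash
python tools/ci_build/build.py --build_dir build --config Release \
  --use_cuda --cuda_home /usr/local/cuda-12.8 --cudnn_home /usr/local/cuda-12.8 \
  --build_wheel --parallel --skip_tests \
  --cmake_extra_defines 'CMAKE_CUDA_ARCHITECTURES=60-real;70-real;75-real;80-real;90a-real;90a-virtual;120a-real'
```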
Changes

7 files modified, with minimal changes to `CMAKE_CUDA_ARCHITECTURES` definitions in packaging pipelines:

- `tools/ci_build/github/azure-pipelines/stages/py-gpu-packaging-stage.yml` - Windows Python wheels
- `tools/ci_build/github/azure-pipelines/custom-nuget-packaging-pipeline.yml` - custom NuGet packages
- `tools/ci_build/github/azure-pipelines/stages/nuget-win-cuda-packaging-stage.yml` - Windows CUDA/TensorRT NuGet packages
- `tools/ci_build/github/linux/build_linux_python_package.sh` - Linux Python wheels
- `tools/ci_build/github/linux/build_cuda_c_api_package.sh` - Linux CUDA C API packages
- `tools/ci_build/github/linux/build_nodejs_package.sh` - Linux Node.js packages
- `tools/ci_build/github/linux/build_tensorrt_c_api_package.sh` - Linux TensorRT C API packages

CI/test pipelines targeting specific hardware were intentionally left unchanged, as they are not for distribution.
Impact
After these changes are built and released, RTX 5090 users will be able to run ONNX Runtime GPU workloads without the "no kernel image is available" error.
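To confirm the device's compute capability after upgrading (an RTX 5090 should report 12.0; the `compute_cap` query needs a reasonably recent driver):

```bash
nvidia-smi --query-gpu=name,compute_cap --format=csv
```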
Fixes #10028 (referenced issue in ComfyUI)
Co-authored-by: @snnn
Original prompt
This section details the original issue you should resolve
<issue_title>no kernel image is available for execution on the device [rtx 5090 laptop, wan2.2 animate, DWPreprocessor, onnxruntime-gpu]</issue_title>
<issue_description>Hello.
at first I've created thread on ComfyUI https://github.com/comfyanonymous/ComfyUI/issues/10028
There I found out that other people have the same issue and it's related to onnxruntime
error
```
[ONNXRuntimeError] : 1 : FAIL : Non-zero status code returned while running Slice node. Name:'Slice_34' Status Message: CUDA error cudaErrorNoKernelImageForDevice: no kernel image is available for execution on the device
```
So I have installed:
Some people said it's probably a problem between onnxruntime and the RTX 50 GPU series.
Comfy support also replied with the following:
# ComfyUI Error Report
Error Details
Stack Trace
File "C:\comfy\ComfyUI\execution.py", line 496, in execute
output_data, output_ui, has_subgraph, has_pending_tasks = await get_output_data(prompt_id, unique_id, obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, hidden_inputs=hidden_inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\comfy\ComfyUI\execution.py", line 315, in get_output_data
return_values = await _async_map_node_over_list(prompt_id, unique_id, obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, hidden_inputs=hidden_inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\comfy\ComfyUI\execution.py", line 289, in _async_map_node_over_list
await process_inputs(input_dict, i)
File "C:\comfy\ComfyUI\execution.py", line 277, in process_inputs
result = f(**inputs)
File "C:\comfy\ComfyUI\custom_nodes\comfyui_controlnet_aux\node_wrappers\dwpose.py", line 91, in estimate_pose
out = common_annotator_call(func, image, include_hand=detect_hand, include_face=detect_face, include_body=detect_body, image_and_json=True, resolution=resolution, xinsr_stick_scaling=scale_stick_for_xinsr_cn)
File "C:\comfy\ComfyUI\custom_nodes\comfyui_controlnet_aux\utils.py", line 85, in common_annotator_call
np_result = model(np_image, output_type="np", detect_resolution=detect_resolution, **kwargs)
File "C:\comfy\ComfyUI\custom_nodes\comfyui_controlnet_aux\node_wrappers\dwpose.py", line 87, in func
pose_img, openpose_dict = model(image, **kwargs)
~~~~~^^^^^^^^^^^^^^^^^
File "C:\comfy\ComfyUI\custom_nodes\comfyui_controlnet_aux\src\custom_controlnet_aux\dwpose_init_.py", line 266, in call
poses = self.detect_poses(input_image)
File "C:\comfy\ComfyUI\custom_nodes\comfyui_controlnet_aux\src\custom_controlnet_aux\dwpose_init_.py", line 255, in detect_poses
keypoints_info = self.dw_pose_estimation(oriImg.copy())
File "C:\comfy\ComfyUI\custom_nodes\comfyui_controlnet_aux\src\custom_controlnet_aux\dwpose\wholebody.py", line 93, in call
det_result = inference_onnx_yolox(self.det, oriImg, detect_classes=[0], dtype=np.float32)
File "C:\comfy\ComfyUI\custom_nodes\comfyui_controlnet_aux\src\custom_controlnet_aux\dwpose\dw_onnx\cv_ox_det.py", line 104, in inference_detector
output = session.run(None, {input_name: input})
File "c:\comfy.venv\Lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 275, in run
return self._sess.run(output_names, input_feed, run_options)
~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
System Information
Devices
Log...