
Conversation


@ashrit-ms ashrit-ms commented Mar 17, 2025

Description

This change reverts the following PRs made to win-ort-main:
0420687 Update win-ort-main to tip main 250211 (#23646)
480bcdf [VitisAI] Add vaip Integration Using FetchContent (Cherry-pick of PR#22038 to win-ort-main branch) (#23608)
4b5b5f7 Update win-ort-main to tip main 250123 (#23473)
df87317 Update win-ort-main to tip main 250116 (#23398)

and cherry-picks the commits between 6806174 and e0b66ca.

yf711 and others added 30 commits March 17, 2025 11:48
### Description
For legacy Jetson users on JetPack 5.x, the latest TensorRT version is
8.5. Add version checks to the newer TRT features to fix the build on
JetPack 5.x (CUDA 11.8 + GCC 11 are required).
### Description
Changed all supported tensor types from IR version 9 to IR version 10.

### Motivation and Context
- See issue #23205

Co-authored-by: Yueqing Zhang <yueqingz@amd.com>
### Description
The Web CI pipeline uses three different Windows machine pools:
1. onnxruntime-Win2022-webgpu-A10
2. onnxruntime-Win2022-VS2022-webgpu-A10
3. onnxruntime-Win-CPU-2022-web

This PR merges them into a single pool to reduce ongoing maintenance cost.
### Description

Use `https.get` instead of `fetch` in ORT Nodejs binding package install
script.

### Motivation and Context

According to discussions in #23232, the package `global-agent` cannot
work with `fetch` API. To make it work with the proxy agent, this PR
replaces the `fetch` API with `https.get` in the install script.
### Description
This PR makes it convenient to post-process the generated JSON file when
profiling is enabled. The kernel type can be used to aggregate the overall
time of kernels of the same type.
Move the Linux GPU CI pipeline to A10 machines, which are more advanced.
Retire the onnxruntime-Linux-GPU-T4 machine pool.
Disable the run_lean_attention test because the new machines do not have
enough shared memory.

```
skip loading trt attention kernel fmha_mhca_fp16_128_256_sm86_kernel because no enough shared memory
[E:onnxruntime:, sequential_executor.cc:505 ExecuteKernel] Non-zero status code returned while running MultiHeadAttention node. Name:'MultiHeadAttention_0' Status Message: CUDA error cudaErrorInvalidValue:invalid argument
```
…#23232)

### Description
Add proxy agent to fetch request



### Motivation and Context
Fixes #23231

---------

Signed-off-by: Junze Wu <junze.wu@intel.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
### Description

Update `mocha` to v11.0.1 and `fs-extra` to v11.2.0

```
# npm audit report

nanoid  <3.3.8
Severity: moderate
Predictable results in nanoid generation when given non-integer values - GHSA-mwcw-c2x4-8c55
fix available via `npm audit fix`
node_modules/nanoid
  mocha  8.2.0 - 10.2.0
  Depends on vulnerable versions of nanoid
  node_modules/mocha

2 moderate severity vulnerabilities
```
### Description
1. Currently the Python-Cuda-Publishing-Pipeline only publishes Linux
wheels, not Windows wheels. This is because we recently refactored the
upstream pipeline ("Python-CUDA-Packaging-Pipeline") to use 1ES PT. This
PR fixes the issue.
2. tools/ci_build/github/azure-pipelines/stages/py-win-gpu-stage.yml no
longer includes component-governance-component-detection-steps.yml,
because 1ES PT already inserts it.
3. Delete tools/ci_build/github/windows/eager/requirements.txt because
it is no longer used.

### Motivation and Context
The "Python-CUDA-Packaging-Pipeline" is for CUDA 12.
"Python CUDA ALT Packaging Pipeline" is for CUDA 11.

The two pipelines are very similar, except the CUDA versions are
different.
Each of them has three parts: build, test, publish.
"Python-CUDA-Packaging-Pipeline" is the first part: build.
"Python CUDA12 Package Test Pipeline" is the second part.
"Python-Cuda-Publishing-Pipeline" is the third part that publishes the
packages to an internal ADO feed.
### Description
Separates the result processor out from profiler.py without changing the
behavior of the current profile.py.



### Motivation and Context
Fewer dependencies and less code for processing profiles from other
scenarios.

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
The input should first be added with skip and bias (if it exists).
### Description
This PR 1) uses the override shape instead of the tensor's original shape in
the shader key to reduce the number of shader variants; 2) adds the indices
shape rank to the shader key to guard against potential errors.
### Description
Fusing Pad & AveragePool requires AveragePool to use
`count_include_pad=1`. If the AveragePool already set some padding and
`count_include_pad=0`, fusion can't happen.

This PR adds a condition to perform fusion depending on those
attributes. If fusion occurs, `count_include_pad` is always set to `1`.
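
A minimal Python sketch of the fusion condition as described above (a hypothetical helper, not the actual optimizer code):

```python
def can_fuse_pad_into_average_pool(pool_pads, count_include_pad):
    # Fusion folds the Pad node's padding into AveragePool and forces
    # count_include_pad=1. If the pool already has its own padding with
    # count_include_pad=0, that change would alter the divisor of the
    # average, so the fusion must be skipped.
    pool_has_padding = any(p != 0 for p in pool_pads)
    return count_include_pad == 1 or not pool_has_padding

print(can_fuse_pad_into_average_pool([0, 0, 0, 0], count_include_pad=0))  # True
print(can_fuse_pad_into_average_pool([1, 1, 1, 1], count_include_pad=0))  # False
```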

### Motivation and Context
Fix #22177 (mislabelled as a performance issue, but there's an actual bug
in the implementation). The bug was introduced in #21556.
Mitigates #23183 while we investigate the final solution.
### Description
Fix comparison of narrow type with wide type in loop condition.

### Motivation and Context
Comparison between types of different widths in a loop condition can
cause the loop to fail to terminate.
Some quantized models have QDQ around Conv/Gemm but the weight and/or
bias are not quantized. This PR adds a WeightBiasQuantization optimizer to
quantize the float weight and/or bias to INT8 and INT32 tensors,
respectively. We only do this for weight and/or bias initializers so that
ConstantFolding will fold the sub-graph into real quantized initializers
during the next round of graph optimization.
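
For reference, a minimal numpy sketch of the usual symmetric INT8 weight / INT32 bias quantization this refers to (illustrative only, not the optimizer's actual code; per-tensor scales assumed):

```python
import numpy as np

def quantize_weight_int8(weight):
    # Symmetric per-tensor INT8: scale maps the largest |w| to 127.
    scale = np.max(np.abs(weight)) / 127.0
    q = np.clip(np.round(weight / scale), -127, 127).astype(np.int8)
    return q, scale

def quantize_bias_int32(bias, input_scale, weight_scale):
    # Bias scale is input_scale * weight_scale so the INT32 bias adds
    # directly into the integer accumulator of the quantized Conv/Gemm.
    bias_scale = input_scale * weight_scale
    return np.round(bias / bias_scale).astype(np.int32), bias_scale

w_q, w_scale = quantize_weight_int8(np.random.randn(8, 8).astype(np.float32))
b_q, b_scale = quantize_bias_int32(np.random.randn(8).astype(np.float32), 0.02, w_scale)
```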
ONNX's MatMul is the same as numpy.matmul, which supports input tensors with
rank >= 1, but QNN's MatMul can only support input tensors with rank >= 2.
This PR adds a MatMulOpBuilder for QNN EP to build a QNN graph that supports
all possible cases of ONNX's MatMul, by adding Reshape nodes if
necessary, e.g., reshaping a 1D input to 2D if present, and reshaping the
output to the expected shape at the end.

This PR also tries to use the FullyConnected op for MatMul if the 2nd input is
a 2D initializer or a 1D tensor, because FullyConnected is faster than MatMul
on QNN EP. If the 2nd input is a 2D tensor, we require it to be an initializer
because FullyConnected requires the 2nd input in [n, k] shape; we can
transpose it during graph building if it's an initializer (we don't want
to add an extra Transpose node).

Using the swin_base model as an example (it contains several MatMul nodes
whose 2nd input is a 2D initializer, not followed by Add), running on a Gen3
mobile device: before the change it takes 34.8876 ms, after this change it
takes 27.0639 ms.
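
A small numpy sketch of the rank handling described above (illustrative only; the actual builder inserts QNN Reshape nodes rather than numpy views):

```python
import numpy as np

def matmul_with_rank2_backend(a, b):
    # Mimic a backend that only accepts rank >= 2 inputs: pad 1D inputs
    # to 2D, run the matmul, then squeeze the padded dims away again,
    # matching numpy.matmul / ONNX MatMul semantics.
    a2 = a[np.newaxis, :] if a.ndim == 1 else a
    b2 = b[:, np.newaxis] if b.ndim == 1 else b
    out = a2 @ b2
    if a.ndim == 1:
        out = out[..., 0, :]
    if b.ndim == 1:
        out = out[..., 0]
    return out

x = np.random.randn(4).astype(np.float32)     # 1D input
w = np.random.randn(4, 3).astype(np.float32)  # 2D initializer
assert np.allclose(matmul_with_rank2_backend(x, w), x @ w)
```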
### Description
Add a temporary path to RN 0.69.3 to update the boost url


### Motivation and Context
Fix the React-native CI until we update the RN to 0.70.15 or 0.73.3+
versions
### Description

Changes vcpkg manifest and configuration file (vcpkg.json &
vcpkg-configuration.json)

* Update vcpkg version to
https://github.com/microsoft/vcpkg/releases/tag/2024.12.16
* Use protobuf 3.21.12(= `v21.12`) to sync with
[cmake/deps.txt](https://github.com/microsoft/onnxruntime/blob/main/cmake/deps.txt)
  * Resolve #22750
* Add `onnx` to vcpkg manifest so `find_package(ONNX)` and
`find_dependency(Protobuf)` can work as expected.
  * Currently, it uses 1.16.2
* v1.17.0 will become available after
microsoft/vcpkg#42942

However, `onnx` in vcpkg doesn't configure
`ONNX_DISABLE_STATIC_REGISTRATION` build option.

* microsoft/vcpkg#38879
* Create "cmake/vcpkg-triplets/" folder and triplet files which use
`VCPKG_CMAKE_CONFIGURE_OPTIONS` for the option
* This requires `VCPKG_OVERLAY_TRIPLETS` environment variable for CI
steps, which is a bit inconvenient.
     I will try to find simple way to get same result

### Motivation and Context

* Help #23158 
  * "ONNX is not consumed from vcpkg"
* "Mismatch protobuf version. When vcpkg is enabled , we should not
fetch protoc from Github which may cause version mismatches."
* microsoft/vcpkg#43126
* #21348
### Description
Fix the issue with Gather int64 indices handling. Still insert a Cast node if it's a non-quantized Gather node.
### Description
Always make sure resources and callbacks are cleaned up



### Motivation and Context
We've seen problems where the log callback isn't deregistered, which can lead to crashes.

---------

Co-authored-by: Adrian Lizarraga <adrianlm2@gmail.com>
Update the min iOS version to 15.1 to align with React Native 0.76. We need
to update React Native.
See
react-native-community/discussions-and-proposals#812
for background.

Similar to PR #20773
### Description
Update documentation for Nuget packages for OVEP

Co-authored-by: jatinwadhwa921 <jatin.wadhwa@intel.com>
Currently we have Clip/Relu with Q fusion at level 2, but for EPs that
use NodeUnit, these optimizers are not applied. If we want to remove such
redundant Clip/Relu nodes, we would need to add code to handle it for each
EP separately.

This PR detects when a Clip/Relu is made redundant by a following Q node and
adds this information to the corresponding QDQ NodeUnit, so that EPs can
ignore the Clip/Relu and handle only the target node in the QDQ NodeUnit.
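
A simplified Python sketch of the redundancy condition (ignoring rounding edge cases; a hypothetical helper, not the actual optimizer code): a Clip/Relu feeding a Q node does nothing extra when the values representable by the quantized type already lie inside the clip range.

```python
def clip_is_redundant_with_q(clip_min, clip_max, scale, zero_point, qmin, qmax):
    # Values outside [representable_lo, representable_hi] saturate during
    # quantization anyway, so a wider (or equal) Clip range changes nothing.
    representable_lo = (qmin - zero_point) * scale
    representable_hi = (qmax - zero_point) * scale
    return clip_min <= representable_lo and representable_hi <= clip_max

# Relu (clip to [0, inf)) before a uint8 Q with zero_point=0 is redundant:
print(clip_is_redundant_with_q(0.0, float("inf"), 0.02, 0, qmin=0, qmax=255))  # True
```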
Fix build error.
… attribute. (#23287)

### Description

Added a fatal error message for the unsupported GroupQueryAttention
do_rotary attribute.

### Motivation and Context
#22987
Helps users understand that this attribute is not supported.
adrianlizarraga and others added 26 commits March 17, 2025 11:49
### Description
Updates packaging pipelines to build onnxruntime_qnn wheel for Python
3.13.


### Motivation and Context
Enable use of ONNX Runtime QNN EP on Python 3.13.
### Description
Removed the schema when unloading.

### Motivation and Context
This fixes the crash when onnxruntime reloads the VitisAI EP.

Co-authored-by: Yueqing Zhang <yueqingz@amd.com>
…23749)

### Description
- The calibrator uses `np.max/np.min` to get min/max values from the
collected data. However, these functions return `nan` if any of the
array values is `nan`, which subsequently leads to an invalid scale and a
failure during quantization at
https://github.com/microsoft/onnxruntime/blob/93689c5995dcacbb99c3afa9ec477b305c71159f/onnxruntime/python/tools/quantization/quant_utils.py#L293.
- When quantizing models with `GroupQueryAttention`, the intermediate
activations corresponding to padded tokens can become `nan`. We can safely
ignore such values as they don't contribute to the final model output.
- Using `np.nanmax/np.nanmin` ensures that the calibrator can handle
`nan` values. If all values are `nan`, numpy emits a `RuntimeWarning:
All-NaN slice encountered` warning, which can help debug the eventual
scale failure.

```python
import numpy as np

no_nans = np.array([1, 2, 3], dtype=np.float32)
some_nans = np.array([np.nan, 1, 2, 3, np.nan, np.nan], dtype=np.float32)
all_nans = np.array([np.nan, np.nan], dtype=np.float32)

for array in [no_nans, some_nans, all_nans]:
    print("np.max/np.min:", np.max(array), np.min(array))
    print("np.nanmax/np.nanmin:", np.nanmax(array), np.nanmin(array))
```
Output
```bash
np.max/np.min: 3.0 1.0
np.nanmax/np.nanmin: 3.0 1.0

np.max/np.min: nan nan
np.nanmax/np.nanmin: 3.0 1.0

np.max/np.min: nan nan
np.nanmax/np.nanmin: nan nan

RuntimeWarning: All-NaN slice encountered
  print("np.nanmax/np.nanmin:", np.nanmax(array), np.nanmin(array))
```

### Description
The latest driver for Linux A10 machines is 535.
The latest driver for Windows A10 machines is 550.
We have some unused A100 quota. If we can replace the A10 machines with
A100 machines, we might be able to upgrade our CUDA from 12.2 to 12.4.
### Description
To support the latest cmake, this PR makes the following changes:

- Instead of using `${CMAKE_SYSTEM_NAME}`, use `CMAKE_SYSTEM_NAME`.
- For onnxruntime_providers_shared and custom_op_library, set the
AIX_SHARED_LIBRARY_ARCHIVE property to OFF.
   The explanation is included as a comment.
…23756)

Include QNN error handle value in fallback error message so we can see the actual error code in case we can't get the error message.
### Description
This PR reverts changes from [this
PR](https://github.com/microsoft/onnxruntime/pull/15759/files).

### Motivation and Context
This fixes a security vulnerability that was raised internally.
### Description
Changes in this PR are for:

- Cleaning up the patch for Eigen on AIX; it is not needed anymore.
- Fixing recent test failures:

```
1: [----------] Global test environment tear-down
1: [==========] 4737 tests from 310 test suites ran. (94682 ms total)
1: [  PASSED  ] 4733 tests.
1: [  SKIPPED ] 2 tests, listed below:
1: [  SKIPPED ] MatMulFpQ4.MatMul2DSym
1: [  SKIPPED ] MatMulFpQ4.MatMul2DBlkZp
1: [  FAILED  ] 2 tests, listed below:
1: [  FAILED  ] GraphTransformationTests.MatMulAddFusion_three_input_with_1d
1: [  FAILED  ] GraphTransformationTests.MatMulAddFusion_NeedReshape_3D
```

---------

Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
For phi3.5-gqa-static sum_long (>1000 tokens) on Meteor Lake.

Before:
300 tokens in 27.0sec, e2e:11.1 tps, prompt: 212.4 tps, gen: 14.2 tps,
ttft: 5.85 sec

After:
300 tokens in 23.0sec, e2e:13.0 tps, prompt: 248.9 tps, gen: 16.6 tps,
ttft: 4.99 sec
### Description
 - Enable hgemm and softmax fp16 kernels for GQA
 - add intra-loop parallelism to RoPE fp16 kernel

__Benchmarking models__
- float32: [phi-3 cpu accuracy level
0](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-onnx/tree/main/cpu_and_mobile/cpu-int4-rtn-block-32)
- float16: [phi-3 gpu accuracy level
0](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-onnx/tree/main/cuda/cuda-int4-rtn-block-32)

Note: 
- Both fp32 and fp16 models share the same model structure and operator
settings.
- GQA takes ~15% of the runtime.
- prompt length 256, token generation length 512

Linux (ubuntu 24.04) Standard D16pls v5 (16 vcpus, 32 GiB memory)
| | fp32 (tps) | old fp16 (tps) | new fp16 (tps) | new fp16 vs old fp16 | new fp16 vs fp32 |
|--|--|--|--|--|--|
| prompt processing | 31.22 | 44.24 | 46.29 | +4.6% | +48.25% |
| token generation | 4.75  | 7.2 | 7.95 | +10.39% | +67.43% |

### Motivation and Context
Speed up GQA on FP16
### Description
Allow users to specify per EP specific resource constraints.
Currently, models that do not fit into device memory error out.

This PR lays groundwork for EP specific resource constrained graph
partitioning, subject to incremental feature additions.

Partitioning in this context means assigning graph nodes to a specific
device (Execution Provider)
up to a certain limit that is either automatically inferred or provided
by configuration.

In this implementation, we stop assigning nodes to CUDA once we reach
the specified memory limit.

This allows users to run models on devices with limited memory or other
limited resources and offload parts of the graph to the CPU or other EPs
as configured.

The PR also introduces an ability to profile and save resource
consumption on a per node basis.
The results of one or more runs are saved into a CSV file which can then
be loaded to assist
partitioning.

Model architecture-based partitioning (like put N transformer blocks on
GPU and embedding on CPU) is not implemented in this PR but will be
coming in the future.

### Motivation and Context
We want to allow models to run in constrained environments.

### Pending
Annotation assisted partitioning
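
A conceptual sketch of the kind of greedy, memory-budgeted assignment described above (not ORT's actual partitioner; the per-node byte estimate is a hypothetical stand-in for the profiled costs loaded from the CSV):

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    estimated_bytes: int  # hypothetical per-node cost, e.g. from the profiling CSV

def assign_nodes_by_memory_budget(nodes, budget_bytes):
    # Assign nodes to the GPU EP until the estimated memory use reaches
    # the configured limit, then fall back to the CPU EP for the rest.
    assignments, used = {}, 0
    for node in nodes:
        if used + node.estimated_bytes <= budget_bytes:
            assignments[node.name] = "CUDAExecutionProvider"
            used += node.estimated_bytes
        else:
            assignments[node.name] = "CPUExecutionProvider"
    return assignments

nodes = [Node("embed", 2 << 20), Node("block0", 8 << 20), Node("block1", 8 << 20)]
print(assign_nodes_by_memory_budget(nodes, budget_bytes=12 << 20))
```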
1. **Add new flag to build.py**: Introduced a
`--use_vcpkg_ms_internal_asset_cache` flag to `build.py`. The flag is
intended for internal use only.
2. **Reduce excessive logs**: Removed some excessive logs from
`vcpkg_helper.py`.
### Description
<!-- Describe your changes. -->
Credit to [chethanpk](https://github.com/chethanpk), who provided the RoPE
embedding implementation in a patch. The patch is in the first commit of this PR.

I have been confirming the perf improvement with this code change. My
analysis is based on phi-3-mini-4k-instruct-int4-int8-blklen32. The
benchmark from onnxruntime-genai does not show a clear improvement; this is
because GQA only takes a small portion of the whole model (<10%) and
RoPE within GQA only takes a small portion of the whole GQA (12%). The
following is the profile with and without AVX2.

We see the cost of RoPE dropped from 82.42 to 18.86. Therefore I still
recommend merging this PR.

with avx2 RoPE:
Name: GroupQueryAttention_rotary, Mean Duration: 18.86, Percentage:
3.16%

plain c++ RoPE:
Name: GroupQueryAttention_rotary, Mean Duration: 82.42, Percentage:
12.20%

mlas benchmark:
dim|interleaved|baseline|new
-|-|-|-
128 |false|735|18.1
256 |false|1470|31.7
512 |false|2938|59.2
1024 |false|5876|81.5
128 |true|368|23.1
256 |true|735|34.3
512 |true|1470|62.0
1024 |true|2937|125
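
For context, a reference numpy sketch of the non-interleaved rotary embedding math that such a kernel computes (not the MLAS AVX2 implementation):

```python
import numpy as np

def rope_non_interleaved(x, positions, base=10000.0):
    # Rotate pairs formed by the first and second halves of the last dim.
    dim = x.shape[-1]
    half = dim // 2
    inv_freq = base ** (-np.arange(half) / half)
    angles = positions[:, None] * inv_freq[None, :]        # [seq, half]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

x = np.random.randn(4, 128).astype(np.float32)  # [seq_len, head_dim]
print(rope_non_interleaved(x, np.arange(4)).shape)  # (4, 128)
```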

---------

Signed-off-by: Liqun Fu <liqun.fu@microsoft.com>
Signed-off-by: liqunfu <liqun.fu@microsoft.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
…ce (#23530)

[onnxruntime/build] Add CI testing for ORT build with generic interface

Summary:
- Remove unused cmake variables
- Add target-specific logic when the generic interface is used.
- Add a QNN EP test case that uses the ORT generic interface build
Adjusts scatter-nd kernel implementation for the case when
reduction=none and there are duplicate values in the indices input
tensor. If duplicates are detected, a single thread processes all
indices to ensure correct results.
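
A small numpy illustration of why duplicate indices matter when reduction=none: the result depends on write order, so concurrent threads could race (illustrative only, not the actual kernel):

```python
import numpy as np

data = np.zeros(4, dtype=np.float32)
indices = np.array([[1], [1]])                       # duplicate target index
updates = np.array([10.0, 20.0], dtype=np.float32)

# Sequential semantics: the last write to a duplicated index wins.
# Threads writing index 1 concurrently could yield either 10.0 or 20.0,
# which is why duplicates fall back to single-threaded processing.
for idx, upd in zip(indices, updates):
    data[tuple(idx)] = upd
print(data)  # [ 0. 20.  0.  0.]
```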
### Description
The existing implementation of session options for the QNN EP does not
honor the various bindings available. As such, even if set at runtime
they are ignored. Fix is to follow the pattern of the `webgpu` provider
and parse/populate the options accordingly.

Existing defaults are preserved, such that if options are not set the
prior behavior will persist.

### Motivation and Context
During debugging and development of Node implementations using the QNN
EP the need to set various parameters became apparent. Currently the
parameters can only be set via changes to the ORT dll code itself, which
is inflexible and slows development.

---------

Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
### Description
Add shape infer dispatcher for `GatherBlockQuantized` contrib op. It
reuses the dispatcher for `Gather` op since the first two inputs have
the same specs. The output elem type comes from input 2 (scales) for
`GatherBlockQuantized`.

### Motivation and Context
Support shape inference for models with `GatherBlockQuantized` op.
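
A standalone Python sketch of the inference rule described above (not the actual dispatcher code): the output shape follows Gather's rule and the element type is taken from the scales input.

```python
import numpy as np

def infer_gather_block_quantized(data_shape, indices_shape, scales_dtype, axis=0):
    # Gather shape rule: splice the indices shape in at `axis`;
    # the output element type comes from input 2 (scales).
    out_shape = list(data_shape[:axis]) + list(indices_shape) + list(data_shape[axis + 1:])
    return out_shape, scales_dtype

shape, dtype = infer_gather_block_quantized((1024, 4096), (8,), np.float16, axis=0)
print(shape, dtype)  # [8, 4096] <class 'numpy.float16'>
```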
### Description
This pull request combines multiple improvements, bug fixes for the
OpenVINO Execution Provider (OVEP). The changes are summarized as
follows:

1. Support for various contrib Ops in OVEP.

2. Dimension Check Fixes for Greater, Pad, and MAX Ops: Fixed dimension
check failures for the Greater, Pad, and MAX ops in OVEP, ensuring they
now pass validation for all supported models.

3. Refactor Core and Shared Context Lifetimes: Refactored the lifetimes
of the OpenVINO core and shared context to remove dependency on shutdown
calls. This change avoids relying on static lifetime management and
improves stability and resource cleanup.

4. Fix for Duplicate DQ Node Removal: Addressed an issue where duplicate
Dequantize (DQ) nodes that were initializers were incorrectly removed.
Initializers should always be preserved, and this fix ensures that all
duplicate DQ nodes that are initializers are retained.

---------

Co-authored-by: Ankit Maheshkar <ankit.maheshkar@intel.com>
Co-authored-by: n1harika <niharika.sathish@intel.com>
Co-authored-by: rayngun <103146671+rayngun@users.noreply.github.com>
Co-authored-by: jatinwadhwa921 <110383850+jatinwadhwa921@users.noreply.github.com>
Co-authored-by: Eric Crawford <eric.r.crawford@intel.com>
Co-authored-by: Surendar Rama Sitaraman <surendar.rama.sitaraman@intel.com>
…ypes (#23771)

### Description
Fixes a bug in the IsQDQPairSupported utility function, which is used by
various QDQ optimizers (e.g., DoubleQDQPairsRemover, QDQFinalCleanup,
etc.). The bug causes an exception when IsQDQPairSupported() is called
with a `Q(scale_f32) -> DQ(scale_f16)` sequence that uses different
scale types.



### Motivation and Context
Fix bug that prevents creating QDQ models that use scales of different
types.
### Description
Recent progress with the SubGroupMatrix prototype in Dawn
(https://issues.chromium.org/issues/348702031) exposes SIMD-group matrix
functions to WebGPU. This shader implements MatMulNBits using that
primitive.

Observed perf gains in terms of LLM inference speed: prefill for Phi 3.5
with a 1K-token prompt sees a ~3x improvement, 5.4s down from ~15s.

With Changes
```
./model_benchmark -i ~/Phi-3.5-mini-instruct-onnx-web -l 1000
Batch size: 1, prompt tokens: 1001, tokens to generate: 128
Prompt processing (time to first token):
	avg (us):       5.42498e+06                    <<< SubGroupMatrix 5.4s
	avg (tokens/s): 184.517
	p50 (us):       5.41982e+06
	stddev (us):    12023.8
	n:              5 * 1001 token(s)
Token generation:
	avg (us):       91138.5
	avg (tokens/s): 10.9723
	p50 (us):       89488.5
	stddev (us):    35136.2
	n:              635 * 1 token(s)

```
Baseline
```
./model_benchmark -i ~/Phi-3.5-mini-instruct-onnx-web -l 1000
Batch size: 1, prompt tokens: 1001, tokens to generate: 128
Prompt processing (time to first token):
	avg (us):       1.45507e+07                     <<< Baseline 14.5s
	avg (tokens/s): 68.7938
	p50 (us):       1.45413e+07
	stddev (us):    22208.9
	n:              5 * 1001 token(s)
Token generation:
	avg (us):       94109.8
	avg (tokens/s): 10.6259
	p50 (us):       89660
	stddev (us):    61579
	n:              635 * 1 token(s)
```
### Description
<!-- Describe your changes. -->
Action items:
* ~~Add LTO support when CUDA 12.8 & Relocatable Device Code
(RDC)/separate_compilation are enabled, to reduce potential perf
regression~~ (LTO needs further testing)

* Reduce nuget/whl package size by selecting devices & their cuda
binary/PTX assembly during ORT build;
  * make sure ORT nuget package < 250 MB, python wheel < 300 MB
  
* Suggest creating internal repo to publish pre-built package with
Blackwell sm100/120 SASS and sm120 PTX to repo like
[onnxruntime-blackwell](https://aiinfra.visualstudio.com/PublicPackages/_artifacts/feed/onnxruntime-blackwell),
since the package size will be much larger than nuget/pypi repo limit
  
* Considering the most popular datacenter/consumer GPUs, here's the
cuda_arch list for Linux/Windows:
* With this change, perf of the next ORT release is optimal on Linux with
Tesla P100 (sm60), V100 (sm70), T4 (sm75), A100 (sm80), A10 (sm86, py
whl), and H100 (sm90); on Windows with GTX 980 (sm52), GTX 1080 (sm61), RTX
2080 (sm75), RTX 3090 (sm86), and RTX 4090 (sm89). Other, newer GPU
architectures are compatible.
  
  
| OS | cmake_cuda_architecture | package size |
| --- | --- | --- |
| Linux nupkg | 60-real;70-real;75-real;80-real;90 | 215 MB |
| Linux whl | 60-real;70-real;75-real;80-real;86-real;90 | 268 MB |
| Windows nupkg | 52-real;61-real;75-real;86-real;89-real;90-virtual | 197 MB |
| Windows whl | 52-real;61-real;75-real;86-real;89-real;90-virtual | 204 MB |

* [TODO] Validate on Windows CUDA CI pipeline with cu128

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?

- If it fixes an open issue, please link to the issue here. -->
Address discussed topics in
#23562 and
#23309

#### Stats

| libonnxruntime_providers_cuda lib size | Main 75;80;90 | 75-real;80-real;90-virtual | 75-real;80;90-virtual | 75-real;80-real;86-virtual;89-virtual;90-virtual | 75-real;86-real;89 | 75-real;80;90 | 75-real;80-real;90 | 61-real;75-real;86-real;89 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Linux | 446 MB | 241 MB | 362 MB | 482 MB | N/A | 422 MB | 301 MB | |
| Windows | 417 MB | 224 MB | 338 MB | 450 MB | 279 MB | N/A | | 292 MB |

| nupkg size | Main 75;80;90 | 75-real;80-real;90-virtual | 75-real;80;90-virtual | 75-real;80-real;86-virtual;89-virtual;90-virtual | 75-real;86-real;89 | 75-real;80;90 | 75-real;80-real;90 | 61-real;75-real;86-real;89 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Linux | 287 MB | TBD | 224 MB | 299 MB | | | 197 MB | N/A |
| Windows | 264 MB | TBD | 205 MB | 274 MB | | | N/A | 188 MB |

| whl size | Main 75;80;90 | 75-real;80-real;90-virtual | 75-real;80;90-virtual | 75-real;80-real;86-virtual;89-virtual;90-virtual | 75-real;86-real;89 | 75-real;80;90 | 75-real;80-real;90 | 61-real;75-real;86-real;89 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Linux | 294 MB | 154 MB | TBD | TBD | N/A | 278 MB | 203 MB | N/A |
| Windows | 271 MB | 142 MB | TBD | 280 MB | 184 MB | N/A | N/A | 194 MB |

### Reference
https://developer.nvidia.com/cuda-gpus
[Improving GPU Application Performance with NVIDIA CUDA 11.2 Device Link
Time
Optimization](https://developer.nvidia.com/blog/improving-gpu-app-performance-with-cuda-11-2-device-lto/)
[PTX
Compatibility](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#ptx-compatibility)
[Application Compatibility on the NVIDIA Ada GPU
Architecture](https://docs.nvidia.com/cuda/ada-compatibility-guide/#application-compatibility-on-the-nvidia-ada-gpu-architecture)
[Software Migration Guide for NVIDIA Blackwell RTX GPUs: A Guide to CUDA
12.8, PyTorch, TensorRT, and
Llama.cpp](https://forums.developer.nvidia.com/t/software-migration-guide-for-nvidia-blackwell-rtx-gpus-a-guide-to-cuda-12-8-pytorch-tensorrt-and-llama-cpp/321330)

### Track some failed/unfinished experiments to control package size:
1. Building ORT with `CUDNN_FRONTEND_SKIP_JSON_LIB=ON` doesn't help much with
package size.
2. ORT packaging uses 7z to pack the package, which can only use zip's
deflate compression. In that format, setting the compression ratio to ultra
(`-mx=9`) doesn't help much to control size (7z's LZMA compression is much
better but not supported by nuget/pypi).
3. Simply replacing `sm_xx` with `lto_xx` would increase the CUDA EP library
size by ~50% (haven't tested perf yet). This needs further validation.
…TRT (#23705)

This PR removes the implicit filtering-out of DDS ops from running on TRT.
In other words, by default, DDS nodes will be run by TRT if it supports them.

Moreover, it adds a new provider option `trt_op_types_to_exclude`:
- Users can provide a list of op types to be excluded from running on TRT
- e.g. `trt_op_types_to_exclude="NonMaxSuppression,NonZero,RoiAlign"`

(This PR basically adds back the feature from #22681 that was previously
held from merging.)


[Note]
There may be potential performance issues in TRT 10 when running models
that contain DDS operations such as NonMaxSuppression, NonZero, and
RoiAlign (e.g., Faster-RCNN).
If users encounter significant performance degradation, we suggest
excluding those DDS ops from running on TRT, i.e.
`trt_op_types_to_exclude="NonMaxSuppression,NonZero,RoiAlign"`. Those
DDS nodes will then be run by the CUDA EP or the CPU.
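
A minimal sketch of passing the new provider option through the ONNX Runtime Python API (the option name comes from this PR; the model path is a placeholder):

```python
import onnxruntime as ort

trt_options = {
    # DDS ops listed here are excluded from TRT and fall back to CUDA EP/CPU.
    "trt_op_types_to_exclude": "NonMaxSuppression,NonZero,RoiAlign",
}

session = ort.InferenceSession(
    "model.onnx",  # placeholder path
    providers=[
        ("TensorrtExecutionProvider", trt_options),
        "CUDAExecutionProvider",
        "CPUExecutionProvider",
    ],
)
```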
### Description
Set the build user's uid when creating MIGraphX/ROCm docker images.
Cherry-pick the following changes into
[rel-1.21.0](https://github.com/microsoft/onnxruntime/tree/rel-1.21.0).
- (#23791)
- (#23710)
- (#23789)
- (#23829)

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Yifan Li <109183385+yf711@users.noreply.github.com>
Co-authored-by: Ankit Maheshkar <ankit.maheshkar@intel.com>
Co-authored-by: n1harika <niharika.sathish@intel.com>
Co-authored-by: Changming Sun <chasun@microsoft.com>
The second round of cherry-picks into
[rel-1.21.0](https://github.com/microsoft/onnxruntime/tree/rel-1.21.0).
The first one was done in
#23846.
- #23779
- #23856
- #23827
- #23834
- #23876
- #23892

---------

Co-authored-by: Jambay Kinley <jambaykinley@microsoft.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
Co-authored-by: Ashish Garg <quic_ashigarg@quicinc.com>
Co-authored-by: Ashish Garg <ashigarg@qti.qualcomm.com>
@ashrit-ms ashrit-ms requested a review from a team as a code owner March 17, 2025 18:59
@ashrit-ms ashrit-ms self-assigned this Mar 17, 2025
@ashrit-ms ashrit-ms merged commit b1b4c44 into win-ort-main Mar 17, 2025
29 of 30 checks passed
@ashrit-ms ashrit-ms deleted the ashritms/v1.21.0-update branch March 17, 2025 19:41