
Conversation


@ashrit-ms ashrit-ms commented Mar 17, 2025

Description

This change reverts the following PRs made to win-ort-main:
0420687 Update win-ort-main to tip main 250211 (#23646)
480bcdf [VitisAI] Add vaip Integration Using FetchContent (Cherry-pick of PR#22038 to win-ort-main branch) (#23608)
4b5b5f7 Update win-ort-main to tip main 250123 (#23473)
df87317 Update win-ort-main to tip main 250116 (#23398)

and cherry-picks the commits between 6806174 and e0b66ca.

yf711 and others added 30 commits March 17, 2025 11:48
### Description
For legacy Jetson users on JetPack 5.x, the latest TensorRT version is
8.5. Add version checks to the newer TRT features to fix the build on
JetPack 5.x (CUDA 11.8 + GCC 11 are required).
### Description
Changed all supported tensor types from IR version 9 to IR version 10.

### Motivation and Context
- See issue #23205

Co-authored-by: Yueqing Zhang <yueqingz@amd.com>
### Description
The Web CI pipeline uses three different Windows machine pools:
1. onnxruntime-Win2022-webgpu-A10
2. onnxruntime-Win2022-VS2022-webgpu-A10
3. onnxruntime-Win-CPU-2022-web

This PR merges them into a single pool to reduce ongoing maintenance cost.
### Description

Use `https.get` instead of `fetch` in ORT Nodejs binding package install
script.

### Motivation and Context

According to discussions in #23232, the package `global-agent` cannot
work with `fetch` API. To make it work with the proxy agent, this PR
replaces the `fetch` API with `https.get` in the install script.
### Description
This PR makes it convenient to post-process the generated JSON file when
profiling is enabled. The kernel type can be used to aggregate the overall
time of kernels of the same type.
Move the Linux GPU CI pipeline to A10 machines, which are more advanced.
Retire the onnxruntime-Linux-GPU-T4 machine pool.
Disable the run_lean_attention test because the new machines do not have
enough shared memory.

```
skip loading trt attention kernel fmha_mhca_fp16_128_256_sm86_kernel because no enough shared memory
[E:onnxruntime:, sequential_executor.cc:505 ExecuteKernel] Non-zero status code returned while running MultiHeadAttention node. Name:'MultiHeadAttention_0' Status Message: CUDA error cudaErrorInvalidValue:invalid argument
```
…#23232)

### Description
Add proxy agent to fetch request



### Motivation and Context
Fixes #23231

---------

Signed-off-by: Junze Wu <junze.wu@intel.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
### Description

Update `mocha` to v11.0.1 and `fs-extra` to v11.2.0

```
# npm audit report

nanoid  <3.3.8
Severity: moderate
Predictable results in nanoid generation when given non-integer values - GHSA-mwcw-c2x4-8c55
fix available via `npm audit fix`
node_modules/nanoid
  mocha  8.2.0 - 10.2.0
  Depends on vulnerable versions of nanoid
  node_modules/mocha

2 moderate severity vulnerabilities
```
### Description
1. Currently the Python-Cuda-Publishing-Pipeline only publishes Linux
wheels, not Windows wheels. This is because we recently refactored the
upstream pipeline ("Python-CUDA-Packaging-Pipeline") to use 1ES PT. This
PR fixes the issue.
2. tools/ci_build/github/azure-pipelines/stages/py-win-gpu-stage.yml no
longer includes component-governance-component-detection-steps.yml,
because 1ES PT already inserts it.
3. Delete tools/ci_build/github/windows/eager/requirements.txt because
it is no longer used.

### Motivation and Context
The "Python-CUDA-Packaging-Pipeline" is for CUDA 12.
"Python CUDA ALT Packaging Pipeline" is for CUDA 11.

The two pipelines are very similar, except the CUDA versions are
different.
Each of them has three parts: build, test, publish.
"Python-CUDA-Packaging-Pipeline" is the first part: build.
"Python CUDA12 Package Test Pipeline" is the second part.
"Python-Cuda-Publishing-Pipeline" is the third part that publishes the
packages to an internal ADO feed.
### Description
Separates the result processor out from profiler.py without changing the
behavior of the current profile.py.



### Motivation and Context
Fewer dependencies and less code for processing profiles from other
scenarios.

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
The input should first be added with skip and bias (if it exists).
### Description
This PR 1) uses the override shape instead of the tensor's original shape in
the shader key to reduce the number of shader variants; 2) adds the indices
shape rank to the shader key to guard against potential errors.
### Description
Fusing Pad & AveragePool requires AveragePool to use
`count_include_pad=1`. If the AveragePool already set some padding and
`count_include_pad=0`, fusion can't happen.

This PR adds a condition to perform fusion depending on those
attributes. If fusion occurs, `count_include_pad` is always set to `1`.
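
A minimal Python sketch of the fusion condition as described above (a hypothetical helper, not the actual optimizer code):

```python
def can_fuse_pad_into_average_pool(pool_pads, count_include_pad):
    # Fusion folds the Pad node's padding into AveragePool and forces
    # count_include_pad=1. If the pool already has its own padding with
    # count_include_pad=0, that change would alter the divisor of the
    # average, so the fusion must be skipped.
    pool_has_padding = any(p != 0 for p in pool_pads)
    return count_include_pad == 1 or not pool_has_padding

print(can_fuse_pad_into_average_pool([0, 0, 0, 0], count_include_pad=0))  # True
print(can_fuse_pad_into_average_pool([1, 1, 1, 1], count_include_pad=0))  # False
```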

### Motivation and Context
Fix #22177 (mislabelled as a performance issue, but there's an actual bug
in the implementation). The bug was introduced in #21556.
Mitigates #23183 while we investigate the final solution.
### Description
Fix comparison of narrow type with wide type in loop condition.

### Motivation and Context
Comparison between types of different widths in a loop condition can
cause the loop to fail to terminate.
Some quantized models have QDQ around Conv/Gemm but the weight and/or
bias are not quantized. This PR adds a WeightBiasQuantization optimizer to
quantize the float weight and/or bias to INT8 and INT32 tensors,
respectively. We only do this for weight and/or bias initializers so that
ConstantFolding will fold the sub-graph into real quantized initializers
during the next round of graph optimization.
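
For reference, a minimal numpy sketch of the usual symmetric INT8 weight / INT32 bias quantization this refers to (illustrative only, not the optimizer's actual code; per-tensor scales assumed):

```python
import numpy as np

def quantize_weight_int8(weight):
    # Symmetric per-tensor INT8: scale maps the largest |w| to 127.
    scale = np.max(np.abs(weight)) / 127.0
    q = np.clip(np.round(weight / scale), -127, 127).astype(np.int8)
    return q, scale

def quantize_bias_int32(bias, input_scale, weight_scale):
    # Bias scale is input_scale * weight_scale so the INT32 bias adds
    # directly into the integer accumulator of the quantized Conv/Gemm.
    bias_scale = input_scale * weight_scale
    return np.round(bias / bias_scale).astype(np.int32), bias_scale

w_q, w_scale = quantize_weight_int8(np.random.randn(8, 8).astype(np.float32))
b_q, b_scale = quantize_bias_int32(np.random.randn(8).astype(np.float32), 0.02, w_scale)
```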
ONNX's MatMul is the same as numpy.matmul, which supports input tensors with
rank >= 1, but QNN's MatMul can only support input tensors with rank >= 2.
This PR adds a MatMulOpBuilder for QNN EP to build a QNN graph that supports
all possible cases of ONNX's MatMul, by adding Reshape nodes if
necessary, e.g., reshaping a 1D input to 2D if present, and reshaping the
output to the expected shape at the end.

This PR also tries to use the FullyConnected op for MatMul if the 2nd input is
a 2D initializer or a 1D tensor, because FullyConnected is faster than MatMul
on QNN EP. If the 2nd input is a 2D tensor, we require it to be an initializer
because FullyConnected requires the 2nd input in [n, k] shape; we can
transpose it during graph building if it's an initializer (we don't want
to add an extra Transpose node).

Using the swin_base model as an example (it contains several MatMul nodes
whose 2nd input is a 2D initializer, not followed by Add), running on a Gen3
mobile device: before the change it takes 34.8876 ms, after this change it
takes 27.0639 ms.
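
A small numpy sketch of the rank handling described above (illustrative only; the actual builder inserts QNN Reshape nodes rather than numpy views):

```python
import numpy as np

def matmul_with_rank2_backend(a, b):
    # Mimic a backend that only accepts rank >= 2 inputs: pad 1D inputs
    # to 2D, run the matmul, then squeeze the padded dims away again,
    # matching numpy.matmul / ONNX MatMul semantics.
    a2 = a[np.newaxis, :] if a.ndim == 1 else a
    b2 = b[:, np.newaxis] if b.ndim == 1 else b
    out = a2 @ b2
    if a.ndim == 1:
        out = out[..., 0, :]
    if b.ndim == 1:
        out = out[..., 0]
    return out

x = np.random.randn(4).astype(np.float32)     # 1D input
w = np.random.randn(4, 3).astype(np.float32)  # 2D initializer
assert np.allclose(matmul_with_rank2_backend(x, w), x @ w)
```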
### Description
Add a temporary path to RN 0.69.3 to update the boost url


### Motivation and Context
Fix the React-native CI until we update the RN to 0.70.15 or 0.73.3+
versions
### Description

Changes vcpkg manifest and configuration file (vcpkg.json &
vcpkg-configuration.json)

* Update vcpkg version to
https://github.com/microsoft/vcpkg/releases/tag/2024.12.16
* Use protobuf 3.21.12(= `v21.12`) to sync with
[cmake/deps.txt](https://github.com/microsoft/onnxruntime/blob/main/cmake/deps.txt)
  * Resolve #22750
* Add `onnx` to vcpkg manifest so `find_package(ONNX)` and
`find_dependency(Protobuf)` can work as expected.
  * Currently, it uses 1.16.2
* v1.17.0 will become available after
microsoft/vcpkg#42942

However, `onnx` in vcpkg doesn't configure
`ONNX_DISABLE_STATIC_REGISTRATION` build option.

* microsoft/vcpkg#38879
* Create "cmake/vcpkg-triplets/" folder and triplet files which use
`VCPKG_CMAKE_CONFIGURE_OPTIONS` for the option
* This requires `VCPKG_OVERLAY_TRIPLETS` environment variable for CI
steps, which is a bit inconvenient.
     I will try to find simple way to get same result

### Motivation and Context

* Help #23158 
  * "ONNX is not consumed from vcpkg"
* "Mismatch protobuf version. When vcpkg is enabled , we should not
fetch protoc from Github which may cause version mismatches."
* microsoft/vcpkg#43126
* #21348
### Description
Fix the issue with Gather int64 indices handling. Still insert a Cast node if it's a non-quantized Gather node.
### Description
Always make sure resources and callbacks are cleaned up



### Motivation and Context
We've seen problems where the log callback isn't deregistered, which can lead to crashes.

---------

Co-authored-by: Adrian Lizarraga <adrianlm2@gmail.com>
Update the min iOS version to 15.1 to align with React Native 0.76. We need
to update React Native.
See
react-native-community/discussions-and-proposals#812
for background.

Similar to PR #20773
### Description
Update documentation for Nuget packages for OVEP

Co-authored-by: jatinwadhwa921 <jatin.wadhwa@intel.com>
Currently we have Clip/Relu with Q fusion at level 2, but for EPs that
use NodeUnit, these optimizers are not applied. If we want to remove such
redundant Clip/Relu nodes, we would need to add code to handle it for each
EP separately.

This PR detects when a Clip/Relu is made redundant by a following Q node and
adds this information to the corresponding QDQ NodeUnit, so that EPs can
ignore the Clip/Relu and handle only the target node in the QDQ NodeUnit.
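
A simplified Python sketch of the redundancy condition (ignoring rounding edge cases; a hypothetical helper, not the actual optimizer code): a Clip/Relu feeding a Q node does nothing extra when the values representable by the quantized type already lie inside the clip range.

```python
def clip_is_redundant_with_q(clip_min, clip_max, scale, zero_point, qmin, qmax):
    # Values outside [representable_lo, representable_hi] saturate during
    # quantization anyway, so a wider (or equal) Clip range changes nothing.
    representable_lo = (qmin - zero_point) * scale
    representable_hi = (qmax - zero_point) * scale
    return clip_min <= representable_lo and representable_hi <= clip_max

# Relu (clip to [0, inf)) before a uint8 Q with zero_point=0 is redundant:
print(clip_is_redundant_with_q(0.0, float("inf"), 0.02, 0, qmin=0, qmax=255))  # True
```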
Fix build error.
… attribute. (#23287)

### Description

Added a fatal error message for the unsupported GroupQueryAttention
do_rotary attribute.

### Motivation and Context
#22987
Helps users understand that this attribute is not supported.
adrianlizarraga and others added 26 commits March 17, 2025 11:49
### Description
Updates packaging pipelines to build onnxruntime_qnn wheel for Python
3.13.


### Motivation and Context
Enable use of ONNX Runtime QNN EP on Python 3.13.
### Description
Removed the schema when unloading.

### Motivation and Context
This fixes the crash when onnxruntime reloads the VitisAI EP.

Co-authored-by: Yueqing Zhang <yueqingz@amd.com>
…23749)

### Description
- The calibrator uses `np.max/np.min` to get min/max values from the
collected data. However, these functions return `nan` if any of the
array values is `nan`, which subsequently leads to an invalid scale and a
failure during quantization at
https://github.com/microsoft/onnxruntime/blob/93689c5995dcacbb99c3afa9ec477b305c71159f/onnxruntime/python/tools/quantization/quant_utils.py#L293.
- When quantizing models with `GroupQueryAttention`, the intermediate
activations corresponding to padded tokens can become `nan`. We can safely
ignore such values as they don't contribute to the final model output.
- Using `np.nanmax/np.nanmin` ensures that the calibrator can handle
`nan` values. If all values are `nan`, numpy emits a `RuntimeWarning:
All-NaN slice encountered` warning, which can help debug the eventual
scale failure.

```python
import numpy as np

no_nans = np.array([1, 2, 3], dtype=np.float32)
some_nans = np.array([np.nan, 1, 2, 3, np.nan, np.nan], dtype=np.float32)
all_nans = np.array([np.nan, np.nan], dtype=np.float32)

for array in [no_nans, some_nans, all_nans]:
    print("np.max/np.min:", np.max(array), np.min(array))
    print("np.nanmax/np.nanmin:", np.nanmax(array), np.nanmin(array))
```
Output
```bash
np.max/np.min: 3.0 1.0
np.nanmax/np.nanmin: 3.0 1.0

np.max/np.min: nan nan
np.nanmax/np.nanmin: 3.0 1.0

np.max/np.min: nan nan
np.nanmax/np.nanmin: nan nan

RuntimeWarning: All-NaN slice encountered
  print("np.nanmax/np.nanmin:", np.nanmax(array), np.nanmin(array))
```

### Description
The latest driver for Linux A10 machines is 535.
The latest driver for Windows A10 machines is 550.
We have some unused A100 quota. If we can replace the A10 machines with
A100 machines, we might be able to upgrade our CUDA from 12.2 to 12.4.
### Description
To support the latest cmake, this PR makes the following changes:

- Instead of using `${CMAKE_SYSTEM_NAME}`, use `CMAKE_SYSTEM_NAME`.
- For onnxruntime_providers_shared and custom_op_library, set the
AIX_SHARED_LIBRARY_ARCHIVE property to OFF.
   The explanation is included as a comment.
…23756)

Include QNN error handle value in fallback error message so we can see the actual error code in case we can't get the error message.
### Description
This PR reverts changes from [this
PR](https://github.com/microsoft/onnxruntime/pull/15759/files).

### Motivation and Context
This fixes a security vulnerability that was raised internally.
### Description
Changes in this PR are for:

- Cleaning up the patch for Eigen on AIX; it is not needed anymore.
- Fixing recent test failures:

```
1: [----------] Global test environment tear-down
1: [==========] 4737 tests from 310 test suites ran. (94682 ms total)
1: [  PASSED  ] 4733 tests.
1: [  SKIPPED ] 2 tests, listed below:
1: [  SKIPPED ] MatMulFpQ4.MatMul2DSym
1: [  SKIPPED ] MatMulFpQ4.MatMul2DBlkZp
1: [  FAILED  ] 2 tests, listed below:
1: [  FAILED  ] GraphTransformationTests.MatMulAddFusion_three_input_with_1d
1: [  FAILED  ] GraphTransformationTests.MatMulAddFusion_NeedReshape_3D
```

---------

Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
For phi3.5-gqa-static sum_long (>1000 tokens) on Meteor Lake.

Before:
300 tokens in 27.0sec, e2e:11.1 tps, prompt: 212.4 tps, gen: 14.2 tps,
ttft: 5.85 sec

After:
300 tokens in 23.0sec, e2e:13.0 tps, prompt: 248.9 tps, gen: 16.6 tps,
ttft: 4.99 sec
### Description
 - Enable hgemm and softmax fp16 kernels for GQA
 - add intra-loop parallelism to RoPE fp16 kernel

__Benchmarking models__
- float32: [phi-3 cpu accuracy level
0](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-onnx/tree/main/cpu_and_mobile/cpu-int4-rtn-block-32)
- float16: [phi-3 gpu accuracy level
0](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-onnx/tree/main/cuda/cuda-int4-rtn-block-32)

Note: 
- Both fp32 and fp16 models share the same model structure and operator
settings.
- GQA takes ~15% of the runtime.
- prompt length 256, token generation length 512

Linux (ubuntu 24.04) Standard D16pls v5 (16 vcpus, 32 GiB memory)
| | fp32 (tps) | old fp16 (tps) | new fp16 (tps) | new fp16 vs old fp16 | new fp16 vs fp32 |
|--|--|--|--|--|--|
| prompt processing | 31.22 | 44.24 | 46.29 | +4.6% | +48.25% |
| token generation | 4.75  | 7.2 | 7.95 | +10.39% | +67.43% |

### Motivation and Context
Speed up GQA on FP16
### Description
Allow users to specify per EP specific resource constraints.
Currently, models that do not fit into device memory error out.

This PR lays groundwork for EP specific resource constrained graph
partitioning, subject to incremental feature additions.

Partitioning in this context means assigning graph nodes to a specific
device (Execution Provider)
up to a certain limit that is either automatically inferred or provided
by configuration.

In this implementation, we stop assigning nodes to CUDA once we reach
the specified memory limit.

This allows users to run models on devices with limited memory or other
limited resources and offload parts of the graph to the CPU or other EPs
as configured.

The PR also introduces an ability to profile and save resource
consumption on a per node basis.
The results of one or more runs are saved into a CSV file which can then
be loaded to assist
partitioning.

Model architecture-based partitioning (like put N transformer blocks on
GPU and embedding on CPU) is not implemented in this PR but will be
coming in the future.

### Motivation and Context
We want to allow models to run in constrained environments.

### Pending
Annotation assisted partitioning
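
A conceptual sketch of the kind of greedy, memory-budgeted assignment described above (not ORT's actual partitioner; the per-node byte estimate is a hypothetical stand-in for the profiled costs loaded from the CSV):

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    estimated_bytes: int  # hypothetical per-node cost, e.g. from the profiling CSV

def assign_nodes_by_memory_budget(nodes, budget_bytes):
    # Assign nodes to the GPU EP until the estimated memory use reaches
    # the configured limit, then fall back to the CPU EP for the rest.
    assignments, used = {}, 0
    for node in nodes:
        if used + node.estimated_bytes <= budget_bytes:
            assignments[node.name] = "CUDAExecutionProvider"
            used += node.estimated_bytes
        else:
            assignments[node.name] = "CPUExecutionProvider"
    return assignments

nodes = [Node("embed", 2 << 20), Node("block0", 8 << 20), Node("block1", 8 << 20)]
print(assign_nodes_by_memory_budget(nodes, budget_bytes=12 << 20))
```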
1. **Add new flag to build.py**: Introduced a
`--use_vcpkg_ms_internal_asset_cache` flag to `build.py`. The flag is
intended for internal use only.
2. **Reduce excessive logs**: Removed some excessive logs from
`vcpkg_helper.py`.
### Description
<!-- Describe your changes. -->
Credit to [chethanpk](https://github.com/chethanpk), who provided the RoPE
embedding implementation in a patch. The patch is in the first commit of this PR.

I have been confirming the perf improvement with this code change. My
analysis is based on phi-3-mini-4k-instruct-int4-int8-blklen32. The
benchmark from onnxruntime-genai does not show a clear improvement; this is
because GQA only takes a small portion of the whole model (<10%) and
RoPE within GQA only takes a small portion of the whole GQA (12%). The
following is the profile with and without AVX2.

We see the cost of RoPE dropped from 82.42 to 18.86. Therefore I still
recommend merging this PR.

with avx2 RoPE:
Name: GroupQueryAttention_rotary, Mean Duration: 18.86, Percentage:
3.16%

plain c++ RoPE:
Name: GroupQueryAttention_rotary, Mean Duration: 82.42, Percentage:
12.20%

mlas benchmark:
dim|interleaved|baseline|new
-|-|-|-
128 |false|735|18.1
256 |false|1470|31.7
512 |false|2938|59.2
1024 |false|5876|81.5
128 |true|368|23.1
256 |true|735|34.3
512 |true|1470|62.0
1024 |true|2937|125
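
For context, a reference numpy sketch of the non-interleaved rotary embedding math that such a kernel computes (not the MLAS AVX2 implementation):

```python
import numpy as np

def rope_non_interleaved(x, positions, base=10000.0):
    # Rotate pairs formed by the first and second halves of the last dim.
    dim = x.shape[-1]
    half = dim // 2
    inv_freq = base ** (-np.arange(half) / half)
    angles = positions[:, None] * inv_freq[None, :]        # [seq, half]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

x = np.random.randn(4, 128).astype(np.float32)  # [seq_len, head_dim]
print(rope_non_interleaved(x, np.arange(4)).shape)  # (4, 128)
```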

---------

Signed-off-by: Liqun Fu <liqun.fu@microsoft.com>
Signed-off-by: liqunfu <liqun.fu@microsoft.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
…ce (#23530)

[onnxruntime/build] Add CI testing for ORT build with generic interface

Summary:
- Remove unused cmake variables
- Add target-specific logic when the generic interface is used.
- Add a QNN EP test case that uses the ORT generic interface build
Adjusts scatter-nd kernel implementation for the case when
reduction=none and there are duplicate values in the indices input
tensor. If duplicates are detected, a single thread processes all
indices to ensure correct results.
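
A small numpy illustration of why duplicate indices matter when reduction=none: the result depends on write order, so concurrent threads could race (illustrative only, not the actual kernel):

```python
import numpy as np

data = np.zeros(4, dtype=np.float32)
indices = np.array([[1], [1]])                       # duplicate target index
updates = np.array([10.0, 20.0], dtype=np.float32)

# Sequential semantics: the last write to a duplicated index wins.
# Threads writing index 1 concurrently could yield either 10.0 or 20.0,
# which is why duplicates fall back to single-threaded processing.
for idx, upd in zip(indices, updates):
    data[tuple(idx)] = upd
print(data)  # [ 0. 20.  0.  0.]
```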
### Description
The existing implementation of session options for the QNN EP does not
honor the various bindings available. As such, even if set at runtime
they are ignored. Fix is to follow the pattern of the `webgpu` provider
and parse/populate the options accordingly.

Existing defaults are preserved, such that if options are not set the
prior behavior will persist.

### Motivation and Context
During debugging and development of Node implementations using the QNN
EP the need to set various parameters became apparent. Currently the
parameters can only be set via changes to the ORT dll code itself, which
is inflexible and slows development.

---------

Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
### Description
Add shape infer dispatcher for `GatherBlockQuantized` contrib op. It
reuses the dispatcher for `Gather` op since the first two inputs have
the same specs. The output elem type comes from input 2 (scales) for
`GatherBlockQuantized`.

### Motivation and Context
Support shape inference for models with `GatherBlockQuantized` op.
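
A standalone Python sketch of the inference rule described above (not the actual dispatcher code): the output shape follows Gather's rule and the element type is taken from the scales input.

```python
import numpy as np

def infer_gather_block_quantized(data_shape, indices_shape, scales_dtype, axis=0):
    # Gather shape rule: splice the indices shape in at `axis`;
    # the output element type comes from input 2 (scales).
    out_shape = list(data_shape[:axis]) + list(indices_shape) + list(data_shape[axis + 1:])
    return out_shape, scales_dtype

shape, dtype = infer_gather_block_quantized((1024, 4096), (8,), np.float16, axis=0)
print(shape, dtype)  # [8, 4096] <class 'numpy.float16'>
```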
### Description
This pull request combines multiple improvements, bug fixes for the
OpenVINO Execution Provider (OVEP). The changes are summarized as
follows:

1. Support for various contrib Ops in OVEP.

2. Dimension Check Fixes for Greater, Pad, and MAX Ops: Fixed dimension
check failures for the Greater, Pad, and MAX ops in OVEP, ensuring they
now pass validation for all supported models.

3. Refactor Core and Shared Context Lifetimes: Refactored the lifetimes
of the OpenVINO core and shared context to remove dependency on shutdown
calls. This change avoids relying on static lifetime management and
improves stability and resource cleanup.

4. Fix for Duplicate DQ Node Removal: Addressed an issue where duplicate
Dequantize (DQ) nodes that were initializers were incorrectly removed.
Initializers should always be preserved, and this fix ensures that all
duplicate DQ nodes that are initializers are retained.

---------

Co-authored-by: Ankit Maheshkar <ankit.maheshkar@intel.com>
Co-authored-by: n1harika <niharika.sathish@intel.com>
Co-authored-by: rayngun <103146671+rayngun@users.noreply.github.com>
Co-authored-by: jatinwadhwa921 <110383850+jatinwadhwa921@users.noreply.github.com>
Co-authored-by: Eric Crawford <eric.r.crawford@intel.com>
Co-authored-by: Surendar Rama Sitaraman <surendar.rama.sitaraman@intel.com>
…ypes (#23771)

### Description
Fixes a bug in the IsQDQPairSupported utility function, which is used by
various QDQ optimizers (e.g., DoubleQDQPairsRemover, QDQFinalCleanup,
etc.). The bug causes an exception when IsQDQPairSupported() is called
with a `Q(scale_f32) -> DQ(scale_f16)` sequence that uses different
scale types.



### Motivation and Context
Fix bug that prevents creating QDQ models that use scales of different
types.
### Description
Recent progress with the SubGroupMatrix prototype in Dawn
(https://issues.chromium.org/issues/348702031) exposes SIMD-group matrix
functions to WebGPU. This shader implements MatMulNBits using that
primitive.

Observed perf gains in terms of LLM inference speed: prefill for Phi 3.5
with a 1K-token prompt sees a ~3x improvement, 5.4s down from ~15s.

With Changes
```
./model_benchmark -i ~/Phi-3.5-mini-instruct-onnx-web -l 1000
Batch size: 1, prompt tokens: 1001, tokens to generate: 128
Prompt processing (time to first token):
	avg (us):       5.42498e+06                    <<< SubGroupMatrix 5.4s
	avg (tokens/s): 184.517
	p50 (us):       5.41982e+06
	stddev (us):    12023.8
	n:              5 * 1001 token(s)
Token generation:
	avg (us):       91138.5
	avg (tokens/s): 10.9723
	p50 (us):       89488.5
	stddev (us):    35136.2
	n:              635 * 1 token(s)

```
Baseline
```
./model_benchmark -i ~/Phi-3.5-mini-instruct-onnx-web -l 1000
Batch size: 1, prompt tokens: 1001, tokens to generate: 128
Prompt processing (time to first token):
	avg (us):       1.45507e+07                     <<< Baseline 14.5s
	avg (tokens/s): 68.7938
	p50 (us):       1.45413e+07
	stddev (us):    22208.9
	n:              5 * 1001 token(s)
Token generation:
	avg (us):       94109.8
	avg (tokens/s): 10.6259
	p50 (us):       89660
	stddev (us):    61579
	n:              635 * 1 token(s)
```
### Description
<!-- Describe your changes. -->
Action items:
* ~~Add LTO support when CUDA 12.8 & Relocatable Device Code
(RDC)/separate_compilation are enabled, to reduce potential perf
regression~~ (LTO needs further testing)

* Reduce nuget/whl package size by selecting devices & their cuda
binary/PTX assembly during ORT build;
  * make sure ORT nuget package < 250 MB, python wheel < 300 MB
  
* Suggest creating internal repo to publish pre-built package with
Blackwell sm100/120 SASS and sm120 PTX to repo like
[onnxruntime-blackwell](https://aiinfra.visualstudio.com/PublicPackages/_artifacts/feed/onnxruntime-blackwell),
since the package size will be much larger than nuget/pypi repo limit
  
* Considering the most popular datacenter/consumer GPUs, here's the
cuda_arch list for Linux/Windows:
* With this change, perf of the next ORT release is optimal on Linux with
Tesla P100 (sm60), V100 (sm70), T4 (sm75), A100 (sm80), A10 (sm86, py
whl), and H100 (sm90); on Windows with GTX 980 (sm52), GTX 1080 (sm61), RTX
2080 (sm75), RTX 3090 (sm86), and RTX 4090 (sm89). Other, newer GPU
architectures are compatible.
  
  
| OS | cmake_cuda_architecture | package size |
| --- | --- | --- |
| Linux nupkg | 60-real;70-real;75-real;80-real;90 | 215 MB |
| Linux whl | 60-real;70-real;75-real;80-real;86-real;90 | 268 MB |
| Windows nupkg | 52-real;61-real;75-real;86-real;89-real;90-virtual | 197 MB |
| Windows whl | 52-real;61-real;75-real;86-real;89-real;90-virtual | 204 MB |

* [TODO] Validate on Windows CUDA CI pipeline with cu128

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?

- If it fixes an open issue, please link to the issue here. -->
Address discussed topics in
#23562 and
#23309

#### Stats

| libonnxruntime_providers_cuda lib size | Main 75;80;90 | 75-real;80-real;90-virtual | 75-real;80;90-virtual | 75-real;80-real;86-virtual;89-virtual;90-virtual | 75-real;86-real;89 | 75-real;80;90 | 75-real;80-real;90 | 61-real;75-real;86-real;89 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Linux | 446 MB | 241 MB | 362 MB | 482 MB | N/A | 422 MB | 301 MB | |
| Windows | 417 MB | 224 MB | 338 MB | 450 MB | 279 MB | N/A | | 292 MB |

| nupkg size | Main 75;80;90 | 75-real;80-real;90-virtual | 75-real;80;90-virtual | 75-real;80-real;86-virtual;89-virtual;90-virtual | 75-real;86-real;89 | 75-real;80;90 | 75-real;80-real;90 | 61-real;75-real;86-real;89 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Linux | 287 MB | TBD | 224 MB | 299 MB | | | 197 MB | N/A |
| Windows | 264 MB | TBD | 205 MB | 274 MB | | | N/A | 188 MB |

| whl size | Main 75;80;90 | 75-real;80-real;90-virtual | 75-real;80;90-virtual | 75-real;80-real;86-virtual;89-virtual;90-virtual | 75-real;86-real;89 | 75-real;80;90 | 75-real;80-real;90 | 61-real;75-real;86-real;89 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Linux | 294 MB | 154 MB | TBD | TBD | N/A | 278 MB | 203 MB | N/A |
| Windows | 271 MB | 142 MB | TBD | 280 MB | 184 MB | N/A | N/A | 194 MB |

### Reference
https://developer.nvidia.com/cuda-gpus
[Improving GPU Application Performance with NVIDIA CUDA 11.2 Device Link
Time
Optimization](https://developer.nvidia.com/blog/improving-gpu-app-performance-with-cuda-11-2-device-lto/)
[PTX
Compatibility](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#ptx-compatibility)
[Application Compatibility on the NVIDIA Ada GPU
Architecture](https://docs.nvidia.com/cuda/ada-compatibility-guide/#application-compatibility-on-the-nvidia-ada-gpu-architecture)
[Software Migration Guide for NVIDIA Blackwell RTX GPUs: A Guide to CUDA
12.8, PyTorch, TensorRT, and
Llama.cpp](https://forums.developer.nvidia.com/t/software-migration-guide-for-nvidia-blackwell-rtx-gpus-a-guide-to-cuda-12-8-pytorch-tensorrt-and-llama-cpp/321330)

### Track some failed/unfinished experiments to control package size:
1. Building ORT with `CUDNN_FRONTEND_SKIP_JSON_LIB=ON` doesn't help much with
package size.
2. ORT packaging uses 7z to pack the package, which can only use zip's
deflate compression. In that format, setting the compression ratio to ultra
(`-mx=9`) doesn't help much to control size (7z's LZMA compression is much
better but not supported by nuget/pypi).
3. Simply replacing `sm_xx` with `lto_xx` would increase the CUDA EP library
size by ~50% (haven't tested perf yet). This needs further validation.
…TRT (#23705)

This PR removes the implicit filtering-out of DDS ops from running on TRT.
In other words, by default, DDS nodes will be run by TRT if it supports them.

Moreover, it adds a new provider option `trt_op_types_to_exclude`:
- Users can provide a list of op types to be excluded from running on TRT
- e.g. `trt_op_types_to_exclude="NonMaxSuppression,NonZero,RoiAlign"`

(This PR basically adds back the feature from #22681 that was previously
held from merging.)


[Note]
There may be potential performance issues in TRT 10 when running models
that contain DDS operations such as NonMaxSuppression, NonZero, and
RoiAlign (e.g., Faster-RCNN).
If users encounter significant performance degradation, we suggest
excluding those DDS ops from running on TRT, i.e.
`trt_op_types_to_exclude="NonMaxSuppression,NonZero,RoiAlign"`. Those
DDS nodes will then be run by the CUDA EP or the CPU.
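
A minimal sketch of passing the new provider option through the ONNX Runtime Python API (the option name comes from this PR; the model path is a placeholder):

```python
import onnxruntime as ort

trt_options = {
    # DDS ops listed here are excluded from TRT and fall back to CUDA EP/CPU.
    "trt_op_types_to_exclude": "NonMaxSuppression,NonZero,RoiAlign",
}

session = ort.InferenceSession(
    "model.onnx",  # placeholder path
    providers=[
        ("TensorrtExecutionProvider", trt_options),
        "CUDAExecutionProvider",
        "CPUExecutionProvider",
    ],
)
```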
### Description
Set the build user's uid when creating MIGraphX/ROCm docker images.
Cherry-pick the following changes into
[rel-1.21.0](https://github.com/microsoft/onnxruntime/tree/rel-1.21.0).
- (#23791)
- (#23710)
- (#23789)
- (#23829)

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Yifan Li <109183385+yf711@users.noreply.github.com>
Co-authored-by: Ankit Maheshkar <ankit.maheshkar@intel.com>
Co-authored-by: n1harika <niharika.sathish@intel.com>
Co-authored-by: Changming Sun <chasun@microsoft.com>
The second round of cherry-picks into
[rel-1.21.0](https://github.com/microsoft/onnxruntime/tree/rel-1.21.0).
The first one was done in
#23846.
- #23779
- #23856
- #23827
- #23834
- #23876
- #23892

---------

Co-authored-by: Jambay Kinley <jambaykinley@microsoft.com>
Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>
Co-authored-by: Ashish Garg <quic_ashigarg@quicinc.com>
Co-authored-by: Ashish Garg <ashigarg@qti.qualcomm.com>
@ashrit-ms ashrit-ms requested a review from a team as a code owner March 17, 2025 18:59
@ashrit-ms ashrit-ms self-assigned this Mar 17, 2025
@ashrit-ms ashrit-ms merged commit b1b4c44 into win-ort-main Mar 17, 2025
29 of 30 checks passed
@ashrit-ms ashrit-ms deleted the ashritms/v1.21.0-update branch March 17, 2025 19:41