@tianleiwu
Contributor

@tianleiwu tianleiwu commented Apr 26, 2025

Description

  1. Add benchmark script for MatMulNBits.
  2. Update kernel based on benchmark results:
  • Change kernel back to handle m=1
  • Use simple loop kernel instead of unrolling
  • Accumulate partial sums in float to trade off precision and performance (less precision loss, no obvious performance drop)
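
To illustrate why the accumulator type matters for precision, here is a minimal sketch in plain Python. It is not the kernel's code: the helper names `round_to` and `accumulate`, the constant product value 0.01, and the choice of 8192 terms (matching the largest K in the table) are all assumptions made for illustration. It emulates fp16 and fp32 rounding with the standard `struct` module and rounds the running total after every addition.

```python
import math
import struct


def round_to(fmt: str, x: float) -> float:
    """Round x to the precision of a struct format ('e' = fp16, 'f' = fp32)."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]


def accumulate(values, fmt: str) -> float:
    """Sum values, rounding the running total to `fmt` after every add."""
    total = 0.0
    for v in values:
        total = round_to(fmt, total + v)
    return total


# Hypothetical workload: K identical fp16 products (illustrative only).
K = 8192
products = [round_to("e", 0.01)] * K

exact = math.fsum(products)              # reference sum in double precision
half_sum = accumulate(products, "e")     # partial sum kept in fp16
float_sum = accumulate(products, "f")    # partial sum kept in fp32

print(f"exact={exact:.4f} fp16 accumulator={half_sum:.4f} fp32 accumulator={float_sum:.4f}")
```

In this toy case the fp16 accumulator stops growing once each addend falls below half a unit in the last place of the running total, while the fp32 accumulator stays close to the reference. Real kernels reduce across threads in a different order, so the size of the effect will differ.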

Example output of benchmark:

------------------------------------------------------------------------------------------------------------------------
Benchmarking MatMulNBits on NVIDIA A100-SXM4-80GB (Compute Capability: 8.0)
------------------------------------------------------------------------------------------------------------------------
CUDA Graph   | M        | N        | K        | Bits   | Block Size | Threads  | Latency (us)    | StdDev (us)  | TFLOPS
------------------------------------------------------------------------------------------------------------------------
True         | 1        | 3072     | 8192     | 4      | 32         | 0        | 95.7            | 5.7          | 0.526
True         | 1        | 3072     | 8192     | 8      | 32         | 0        | 110.7           | 81.1         | 0.454
True         | 1        | 3072     | 8192     | 4      | 128        | 0        | 93.7            | 41.2         | 0.537
True         | 1        | 3072     | 8192     | 8      | 128        | 0        | 105.0           | 129.3        | 0.479
True         | 1        | 5120     | 3072     | 4      | 32         | 0        | 86.7            | 49.9         | 0.363
True         | 1        | 5120     | 3072     | 8      | 32         | 0        | 90.1            | 41.1         | 0.349
True         | 1        | 5120     | 3072     | 4      | 128        | 0        | 83.9            | 46.7         | 0.375
True         | 1        | 5120     | 3072     | 8      | 128        | 0        | 85.2            | 57.1         | 0.369
True         | 1        | 8192     | 3072     | 4      | 32         | 0        | 107.3           | 29.2         | 0.469
True         | 1        | 8192     | 3072     | 8      | 32         | 0        | 102.3           | 57.1         | 0.492
True         | 1        | 8192     | 3072     | 4      | 128        | 0        | 99.2            | 61.2         | 0.507
True         | 1        | 8192     | 3072     | 8      | 128        | 0        | 97.5            | 47.4         | 0.516
True         | 1        | 200064   | 3072     | 4      | 32         | 0        | 1456.4          | 11.0         | 0.844
True         | 1        | 200064   | 3072     | 8      | 32         | 0        | 1336.4          | 10.3         | 0.920
True         | 1        | 200064   | 3072     | 4      | 128        | 0        | 1261.6          | 16.6         | 0.974
True         | 1        | 200064   | 3072     | 8      | 128        | 0        | 1232.6          | 17.9         | 0.997
True         | 256      | 3072     | 8192     | 4      | 32         | 0        | 211.1           | 5.8          | 61.030
True         | 256      | 3072     | 8192     | 8      | 32         | 0        | 217.8           | 62.8         | 59.154
True         | 256      | 3072     | 8192     | 4      | 128        | 0        | 208.7           | 63.3         | 61.751
True         | 256      | 3072     | 8192     | 8      | 128        | 0        | 213.0           | 58.2         | 60.491
True         | 256      | 5120     | 3072     | 4      | 32         | 0        | 151.9           | 57.4         | 53.028
True         | 256      | 5120     | 3072     | 8      | 32         | 0        | 156.2           | 71.1         | 51.554
True         | 256      | 5120     | 3072     | 4      | 128        | 0        | 151.4           | 22.6         | 53.198
True         | 256      | 5120     | 3072     | 8      | 128        | 0        | 154.6           | 47.1         | 52.092
True         | 256      | 8192     | 3072     | 4      | 32         | 0        | 219.0           | 4.4          | 58.847
True         | 256      | 8192     | 3072     | 8      | 32         | 0        | 226.6           | 14.5         | 56.860
True         | 256      | 8192     | 3072     | 4      | 128        | 0        | 206.7           | 39.9         | 62.333
True         | 256      | 8192     | 3072     | 8      | 128        | 0        | 216.2           | 41.3         | 59.587
True         | 256      | 200064   | 3072     | 4      | 32         | 0        | 3110.9          | 11.3         | 101.152
True         | 256      | 200064   | 3072     | 8      | 32         | 0        | 3290.9          | 8.3          | 95.619
True         | 256      | 200064   | 3072     | 4      | 128        | 0        | 3055.2          | 10.2         | 102.995
True         | 256      | 200064   | 3072     | 8      | 128        | 0        | 3220.4          | 9.8          | 97.712
True         | 1024     | 3072     | 8192     | 4      | 32         | 0        | 363.6           | 40.2         | 141.754
True         | 1024     | 3072     | 8192     | 8      | 32         | 0        | 369.0           | 46.0         | 139.669
True         | 1024     | 3072     | 8192     | 4      | 128        | 0        | 362.8           | 55.6         | 142.052
True         | 1024     | 3072     | 8192     | 8      | 128        | 0        | 367.5           | 56.5         | 140.256
True         | 1024     | 5120     | 3072     | 4      | 32         | 0        | 221.6           | 58.1         | 145.383
True         | 1024     | 5120     | 3072     | 8      | 32         | 0        | 225.4           | 56.6         | 142.938
True         | 1024     | 5120     | 3072     | 4      | 128        | 0        | 220.2           | 36.9         | 146.306
True         | 1024     | 5120     | 3072     | 8      | 128        | 0        | 224.1           | 57.8         | 143.751
True         | 1024     | 8192     | 3072     | 4      | 32         | 0        | 346.2           | 41.8         | 148.854
True         | 1024     | 8192     | 3072     | 8      | 32         | 0        | 352.8           | 21.6         | 146.097
True         | 1024     | 8192     | 3072     | 4      | 128        | 0        | 344.5           | 18.9         | 149.627
True         | 1024     | 8192     | 3072     | 8      | 128        | 0        | 350.6           | 10.6         | 147.016
True         | 1024     | 200064   | 3072     | 4      | 32         | 0        | 6822.0          | 44.1         | 184.504
True         | 1024     | 200064   | 3072     | 8      | 32         | 0        | 7018.5          | 38.4         | 179.339
True         | 1024     | 200064   | 3072     | 4      | 128        | 0        | 6757.8          | 51.5         | 186.257
True         | 1024     | 200064   | 3072     | 8      | 128        | 0        | 6947.7          | 38.1         | 181.167
------------------------------------------------------------------------------------------------------------------------
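
The TFLOPS column appears to follow from the other columns if a matrix multiply is counted as 2·M·N·K floating-point operations. The sketch below makes that assumption explicit; the function name `matmul_tflops` is hypothetical and is not taken from the benchmark script.

```python
def matmul_tflops(m: int, n: int, k: int, latency_us: float) -> float:
    """Throughput in TFLOPS, assuming 2*M*N*K operations per matmul."""
    flops = 2.0 * m * n * k              # one multiply and one add per (m, n, k) triple
    seconds = latency_us * 1e-6          # microseconds to seconds
    return flops / seconds / 1e12        # FLOP/s to TFLOPS


# Rows taken from the table above: (M, N, K, latency in us, reported TFLOPS).
rows = [
    (1, 3072, 8192, 95.7, 0.526),
    (1024, 200064, 3072, 6822.0, 184.504),
]
for m, n, k, latency_us, reported in rows:
    computed = matmul_tflops(m, n, k, latency_us)
    print(f"M={m} N={n} K={k}: computed {computed:.3f}, reported {reported:.3f}")
```

Small differences from the reported values are expected because the printed latencies are rounded to one decimal place.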

Motivation and Context

Follow-up to #24509.

@snnn
Contributor

snnn commented Apr 26, 2025

/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline, Windows x64 QNN CI Pipeline

@azure-pipelines

Azure Pipelines successfully started running 5 pipeline(s).

@kunal-vaishnavi
Contributor

The fourth row in the benchmark table appears to have a higher standard deviation (129.3) than the other rows. Is there a specific reason as to why that happens?

@tianleiwu
Contributor Author

> The fourth row in the benchmark table appears to have a higher standard deviation (129.3) than the other rows. Is there a specific reason as to why that happens?

The results came from a shared VM, so they may have been affected by other users' jobs running at the time.

@snnn snnn merged commit 1dd9b99 into main Apr 26, 2025
85 of 89 checks passed
@snnn snnn deleted the tlwu/benchmark_matmul_8bits branch April 26, 2025 05:27
jywu-msft pushed a commit that referenced this pull request Apr 30, 2025
### Description

Cherry pick the following into
[rel-1.22.0](https://github.com/microsoft/onnxruntime/tree/rel-1.22.0)


- (#24487)
- (#24466)
- (#24493)
- (#24484)
- (#24494)
- (#24489)
- (#24504)
- (#24510)
- (#24456)
- (#24537)
- (#24501)
- (#24519)
- (#24513)
- (#24539)
- (#24514)
- (#24542)
- (#24585)

Not added:

Planning to cherry-pick the CUDA MatMulNBits PRs once the fix for the
failing CUDA pipeline is ready
- (#24491)
- (#24509)
- (#24564)

---------

Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
Co-authored-by: minfhong-quic <quic_minfhong@quicinc.com>
Co-authored-by: minfhong-quic <minfhong-quic@quicinc.com>
Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
Co-authored-by: Prathik Rao <prathik.rao@gmail.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Ankan Banerjee <ankan.ban@gmail.com>
Co-authored-by: Maximilian Müller <maximilianm@nvidia.com>
Co-authored-by: Gaurav Garg <gaugarg@nvidia.com>
Co-authored-by: iraut <iraut@nvidia.com>
Co-authored-by: Hrishikesh Manohar <hrishikeshm@nvidia.com>
Co-authored-by: Maximilian Müller <44298237+gedoensmax@users.noreply.github.com>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
Co-authored-by: Jiajia Qin <jiajiaqin@microsoft.com>
Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>
Co-authored-by: xhcao <xinghua.cao@intel.com>
jatinwadhwa921 pushed a commit to intel/onnxruntime that referenced this pull request Apr 30, 2025
### Description

Cherry pick the following into
[rel-1.22.0](https://github.com/microsoft/onnxruntime/tree/rel-1.22.0)


- (microsoft#24487)
- (microsoft#24466)
- (microsoft#24493)
- (microsoft#24484)
- (microsoft#24494)
- (microsoft#24489)
- (microsoft#24504)
- (microsoft#24510)
- (microsoft#24456)
- (microsoft#24537)
- (microsoft#24501)
- (microsoft#24519)
- (microsoft#24513)
- (microsoft#24539)
- (microsoft#24514)
- (microsoft#24542)
- (microsoft#24585)

Not added:

Planning to cherry-pick the CUDA MatMulNBits PRs once the fix for the
failing CUDA pipeline is ready
- (microsoft#24491)
- (microsoft#24509)
- (microsoft#24564)

---------

Co-authored-by: vraspar <vrajang@outlook.com>
Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
Co-authored-by: minfhong-quic <quic_minfhong@quicinc.com>
Co-authored-by: minfhong-quic <minfhong-quic@quicinc.com>
Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
Co-authored-by: Prathik Rao <prathik.rao@gmail.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Ankan Banerjee <ankan.ban@gmail.com>
Co-authored-by: Maximilian Müller <maximilianm@nvidia.com>
Co-authored-by: Gaurav Garg <gaugarg@nvidia.com>
Co-authored-by: iraut <iraut@nvidia.com>
Co-authored-by: Hrishikesh Manohar <hrishikeshm@nvidia.com>
Co-authored-by: Maximilian Müller <44298237+gedoensmax@users.noreply.github.com>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
Co-authored-by: Jiajia Qin <jiajiaqin@microsoft.com>
Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>
Co-authored-by: xhcao <xinghua.cao@intel.com>
vraspar pushed a commit that referenced this pull request May 1, 2025
jywu-msft pushed a commit that referenced this pull request May 1, 2025
### Description

Cherry pick the following into
[rel-1.22.0](https://github.com/microsoft/onnxruntime/tree/rel-1.22.0)

- (#24491)
- (#24509)
- (#24564)
- (#24574)
- (#24582)
- (#24584)
- (#24568)
- (#24587)
- (#24563)
- (#24592)
- (#24526)
- (#24552)
- (#24588)
- (#24605)
- (#24606)

---------

Co-authored-by: Jing Fang <126209182+fajin-corp@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
Co-authored-by: Mark Schofield <mschofie@microsoft.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Ashwath Shankarnarayan <quic_ashwshan@quicinc.com>
Co-authored-by: saurabh <saurabh1.kale@intel.com>
Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
Co-authored-by: Hector Li <hecli@microsoft.com>
ankitm3k pushed a commit to intel/onnxruntime that referenced this pull request May 12, 2025
@snnn
Contributor

snnn commented Sep 5, 2025

This PR has been included in the rel-1.22.0 branch. Removing the release:1.22.0 label.
