[CUDA] MatMulNBits benchmark #24564
Conversation
/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline, Windows x64 QNN CI Pipeline

Azure Pipelines successfully started running 5 pipeline(s).
The fourth row in the benchmark table appears to have a much higher standard deviation (129.3) than the other rows. Is there a specific reason why that happens?

The result was from a shared VM; it might have been impacted by other people's jobs running at the same time.
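The effect described above is easy to demonstrate: a single contended run on a shared machine inflates both the mean and the standard deviation of a latency sample, while the median stays close to the typical value. The numbers below are made up for illustration and are not from the benchmark run.

```python
import statistics

# Hypothetical latency samples (us): mostly ~105 us, with one slow run
# caused by contention on a shared VM.
samples = [104.0, 105.0, 103.0, 106.0, 104.0, 105.0, 980.0, 104.0]

mean = statistics.mean(samples)      # pulled up by the outlier
stdev = statistics.stdev(samples)    # dominated by the outlier
median = statistics.median(samples)  # robust to the outlier

print(f"mean={mean:.1f} us, stdev={stdev:.1f} us, median={median:.1f} us")
```

This is why a large StdDev in one row of the table is more likely a measurement artifact than a property of the kernel configuration.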
### Description
Cherry pick the following into [rel-1.22.0](https://github.com/microsoft/onnxruntime/tree/rel-1.22.0):
- (#24487)
- (#24466)
- (#24493)
- (#24484)
- (#24494)
- (#24489)
- (#24504)
- (#24510)
- (#24456)
- (#24537)
- (#24501)
- (#24519)
- (#24513)
- (#24539)
- (#24514)
- (#24542)
- (#24585)

Not added: planning to cherry pick the CUDA MatMulNBits PRs once the fix for the failing CUDA pipeline is ready:
- (#24491)
- (#24509)
- (#24564)

---------

Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
Co-authored-by: minfhong-quic <quic_minfhong@quicinc.com>
Co-authored-by: minfhong-quic <minfhong-quic@quicinc.com>
Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
Co-authored-by: Prathik Rao <prathik.rao@gmail.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Ankan Banerjee <ankan.ban@gmail.com>
Co-authored-by: Maximilian Müller <maximilianm@nvidia.com>
Co-authored-by: Gaurav Garg <gaugarg@nvidia.com>
Co-authored-by: iraut <iraut@nvidia.com>
Co-authored-by: Hrishikesh Manohar <hrishikeshm@nvidia.com>
Co-authored-by: Maximilian Müller <44298237+gedoensmax@users.noreply.github.com>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
Co-authored-by: Jiajia Qin <jiajiaqin@microsoft.com>
Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>
Co-authored-by: xhcao <xinghua.cao@intel.com>
### Description
1. Add a benchmark script for MatMulNBits.
2. Update the kernel based on benchmark results:
   - Change the kernel back to handle m=1.
   - Use a simple loop kernel instead of unrolling.
   - Change the partial sum to float type to trade off precision and performance (less precision loss, no obvious performance drop).

Example output of benchmark:
```
------------------------------------------------------------------------------------------------------------------------
Benchmarking MatMulNBits on NVIDIA A100-SXM4-80GB (Compute Capability: 8.0)
------------------------------------------------------------------------------------------------------------------------
CUDA Graph | M | N | K | Bits | Block Size | Threads | Latency (us) | StdDev (us) | TFLOPS
------------------------------------------------------------------------------------------------------------------------
True | 1 | 3072 | 8192 | 4 | 32 | 0 | 95.7 | 5.7 | 0.526
True | 1 | 3072 | 8192 | 8 | 32 | 0 | 110.7 | 81.1 | 0.454
True | 1 | 3072 | 8192 | 4 | 128 | 0 | 93.7 | 41.2 | 0.537
True | 1 | 3072 | 8192 | 8 | 128 | 0 | 105.0 | 129.3 | 0.479
True | 1 | 5120 | 3072 | 4 | 32 | 0 | 86.7 | 49.9 | 0.363
True | 1 | 5120 | 3072 | 8 | 32 | 0 | 90.1 | 41.1 | 0.349
True | 1 | 5120 | 3072 | 4 | 128 | 0 | 83.9 | 46.7 | 0.375
True | 1 | 5120 | 3072 | 8 | 128 | 0 | 85.2 | 57.1 | 0.369
True | 1 | 8192 | 3072 | 4 | 32 | 0 | 107.3 | 29.2 | 0.469
True | 1 | 8192 | 3072 | 8 | 32 | 0 | 102.3 | 57.1 | 0.492
True | 1 | 8192 | 3072 | 4 | 128 | 0 | 99.2 | 61.2 | 0.507
True | 1 | 8192 | 3072 | 8 | 128 | 0 | 97.5 | 47.4 | 0.516
True | 1 | 200064 | 3072 | 4 | 32 | 0 | 1456.4 | 11.0 | 0.844
True | 1 | 200064 | 3072 | 8 | 32 | 0 | 1336.4 | 10.3 | 0.920
True | 1 | 200064 | 3072 | 4 | 128 | 0 | 1261.6 | 16.6 | 0.974
True | 1 | 200064 | 3072 | 8 | 128 | 0 | 1232.6 | 17.9 | 0.997
True | 256 | 3072 | 8192 | 4 | 32 | 0 | 211.1 | 5.8 | 61.030
True | 256 | 3072 | 8192 | 8 | 32 | 0 | 217.8 | 62.8 | 59.154
True | 256 | 3072 | 8192 | 4 | 128 | 0 | 208.7 | 63.3 | 61.751
True | 256 | 3072 | 8192 | 8 | 128 | 0 | 213.0 | 58.2 | 60.491
True | 256 | 5120 | 3072 | 4 | 32 | 0 | 151.9 | 57.4 | 53.028
True | 256 | 5120 | 3072 | 8 | 32 | 0 | 156.2 | 71.1 | 51.554
True | 256 | 5120 | 3072 | 4 | 128 | 0 | 151.4 | 22.6 | 53.198
True | 256 | 5120 | 3072 | 8 | 128 | 0 | 154.6 | 47.1 | 52.092
True | 256 | 8192 | 3072 | 4 | 32 | 0 | 219.0 | 4.4 | 58.847
True | 256 | 8192 | 3072 | 8 | 32 | 0 | 226.6 | 14.5 | 56.860
True | 256 | 8192 | 3072 | 4 | 128 | 0 | 206.7 | 39.9 | 62.333
True | 256 | 8192 | 3072 | 8 | 128 | 0 | 216.2 | 41.3 | 59.587
True | 256 | 200064 | 3072 | 4 | 32 | 0 | 3110.9 | 11.3 | 101.152
True | 256 | 200064 | 3072 | 8 | 32 | 0 | 3290.9 | 8.3 | 95.619
True | 256 | 200064 | 3072 | 4 | 128 | 0 | 3055.2 | 10.2 | 102.995
True | 256 | 200064 | 3072 | 8 | 128 | 0 | 3220.4 | 9.8 | 97.712
True | 1024 | 3072 | 8192 | 4 | 32 | 0 | 363.6 | 40.2 | 141.754
True | 1024 | 3072 | 8192 | 8 | 32 | 0 | 369.0 | 46.0 | 139.669
True | 1024 | 3072 | 8192 | 4 | 128 | 0 | 362.8 | 55.6 | 142.052
True | 1024 | 3072 | 8192 | 8 | 128 | 0 | 367.5 | 56.5 | 140.256
True | 1024 | 5120 | 3072 | 4 | 32 | 0 | 221.6 | 58.1 | 145.383
True | 1024 | 5120 | 3072 | 8 | 32 | 0 | 225.4 | 56.6 | 142.938
True | 1024 | 5120 | 3072 | 4 | 128 | 0 | 220.2 | 36.9 | 146.306
True | 1024 | 5120 | 3072 | 8 | 128 | 0 | 224.1 | 57.8 | 143.751
True | 1024 | 8192 | 3072 | 4 | 32 | 0 | 346.2 | 41.8 | 148.854
True | 1024 | 8192 | 3072 | 8 | 32 | 0 | 352.8 | 21.6 | 146.097
True | 1024 | 8192 | 3072 | 4 | 128 | 0 | 344.5 | 18.9 | 149.627
True | 1024 | 8192 | 3072 | 8 | 128 | 0 | 350.6 | 10.6 | 147.016
True | 1024 | 200064 | 3072 | 4 | 32 | 0 | 6822.0 | 44.1 | 184.504
True | 1024 | 200064 | 3072 | 8 | 32 | 0 | 7018.5 | 38.4 | 179.339
True | 1024 | 200064 | 3072 | 4 | 128 | 0 | 6757.8 | 51.5 | 186.257
True | 1024 | 200064 | 3072 | 8 | 128 | 0 | 6947.7 | 38.1 | 181.167
------------------------------------------------------------------------------------------------------------------------
```

### Motivation and Context
Follow-up to #24509.
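The TFLOPS column in the table above can be sanity-checked from the other columns: a matmul of an (M, K) input against a (K, N) weight performs 2·M·N·K floating-point operations, so dividing by the measured latency gives throughput. The helper name `tflops` below is just for illustration; it is not the benchmark script's API.

```python
def tflops(m: int, n: int, k: int, latency_us: float) -> float:
    """Throughput in TFLOPS for an (M,K) x (K,N) matmul: 2*M*N*K FLOPs per call."""
    flops = 2.0 * m * n * k
    return flops / (latency_us * 1e-6) / 1e12

# First row of the table: M=1, N=3072, K=8192, latency 95.7 us.
print(f"{tflops(1, 3072, 8192, 95.7):.3f}")  # -> 0.526, matching the table
```

This also explains why the M=1 rows sit well under 1 TFLOPS: at batch size 1 the kernel is memory-bandwidth bound, so compute throughput is nowhere near the A100's peak.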
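The precision/performance trade-off behind the "partial sum to float type" change can be simulated outside the kernel. This NumPy sketch (illustrative only, not the CUDA code) reduces a K-length vector of fp16 products once with an fp16 accumulator and once with an fp32 accumulator, then compares both against an fp64 reference:

```python
import numpy as np

# Simulate reducing K=8192 products for one output element of the matmul.
rng = np.random.default_rng(0)
k = 8192
products = rng.standard_normal(k).astype(np.float16)

# fp16 partial sum: every intermediate add rounds to half precision.
acc16 = np.float16(0.0)
for p in products:
    acc16 = np.float16(acc16 + p)

# fp32 partial sum (the kernel change): inputs stay fp16, accumulator is float.
acc32 = np.float32(0.0)
for p in products:
    acc32 += np.float32(p)

# fp64 reference sum of the same fp16 inputs.
exact = products.astype(np.float64).sum()
print(f"fp16 acc error: {abs(float(acc16) - exact):.4f}")
print(f"fp32 acc error: {abs(float(acc32) - exact):.6f}")
```

The fp32 accumulator's error is orders of magnitude smaller, while on the GPU the extra register width costs essentially nothing for a bandwidth-bound kernel, consistent with the "less precision loss, no obvious performance drop" observation.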
### Description
Cherry pick the following into [rel-1.22.0](https://github.com/microsoft/onnxruntime/tree/rel-1.22.0):
- (#24491)
- (#24509)
- (#24564)
- (#24574)
- (#24582)
- (#24584)
- (#24568)
- (#24587)
- (#24563)
- (#24592)
- (#24526)
- (#24552)
- (#24588)
- (#24605)
- (#24606)

---------

Co-authored-by: Jing Fang <126209182+fajin-corp@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
Co-authored-by: Mark Schofield <mschofie@microsoft.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Ashwath Shankarnarayan <quic_ashwshan@quicinc.com>
Co-authored-by: saurabh <saurabh1.kale@intel.com>
Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
Co-authored-by: Hector Li <hecli@microsoft.com>
This PR has been included in the release.