Conversation

@fajin-corp fajin-corp commented Apr 21, 2025

Description

Add 8-bit support for MatMulNBits on x86.

AVX512 VNNI

| M | N | K | 8-bit Time (ns) | 4-bit Time (ns) | Slowdown (8-bit / 4-bit) |
|:---:|:---:|:---:|:---:|:---:|:---:|
| 1 | 4096 | 4096 | 34145 | 27723 | 1.23× |
| 1 | 11008 | 4096 | 415285 | 68656 | 6.05× |
| 1 | 4096 | 11008 | 407801 | 68061 | 5.99× |
| 1 | 11008 | 11008 | 2674538 | 1003532 | 2.67× |
| 4096 | 4096 | 4096 | 80338759 | 86321713 | 0.93× |
| 4096 | 11008 | 4096 | 213421935 | 225245276 | 0.95× |
| 4096 | 4096 | 11008 | 240164365 | 228966953 | 1.05× |
| 4096 | 11008 | 11008 | 628352046 | 596738340 | 1.05× |

AVX512

| M | N | K | 8-bit Time (ns) | 4-bit Time (ns) | Slowdown (8-bit / 4-bit) |
|:---:|:---:|:---:|:---:|:---:|:---:|
| 1 | 4096 | 4096 | 53324 | 37882 | 1.41× |
| 1 | 11008 | 4096 | 244560 | 103255 | 2.37× |
| 1 | 4096 | 11008 | 435131 | 95734 | 4.55× |
| 1 | 11008 | 11008 | 2790710 | 1075216 | 2.60× |
| 4096 | 4096 | 4096 | 200629000 | 132841540 | 1.51× |
| 4096 | 11008 | 4096 | 532141914 | 350613184 | 1.52× |
| 4096 | 4096 | 11008 | 544011977 | 351679619 | 1.55× |
| 4096 | 11008 | 11008 | 1421865147 | 925593210 | 1.54× |

Token generation is bottlenecked by memory access; the 8-bit model's 2× size is the major reason for the token-generation slowdown.

On non-VNNI platforms, the sum of i8 products cannot fit in an i16 accumulator, so extra instructions are needed to avoid overflow. This is the major reason for the non-VNNI slowdown.

Motivation and Context

The MatMul4Bits model has a repetition issue; the 8-bit model resolves it.

@fajin-corp fajin-corp requested a review from a team as a code owner April 21, 2025 23:07

@github-actions github-actions bot left a comment


You can commit the suggested changes from lintrunner.

liqunfu previously approved these changes Apr 21, 2025

@liqunfu liqunfu left a comment


LGTM. It would be better if you added performance results, either the mlas benchmark or, even better, the ort-genai benchmark. The ort-genai benchmark is preferred because I have found the mlas benchmark tends to show improvements that do not materialize with genai.


@fajin-corp fajin-corp closed this Apr 23, 2025
@fajin-corp fajin-corp reopened this Apr 23, 2025

liqunfu commented Apr 24, 2025

"For non-vnni platform, the i16 cannot fit in 4 i8. To avoid overflow extra instructions are needed. This is the major reason of non-vnni slow down." does this issue exist with 4bit non-vnni?

@fajin-corp fajin-corp merged commit 7801c51 into main Apr 24, 2025
87 of 89 checks passed
@fajin-corp fajin-corp deleted the fajin/matmul8bit_x64_kernel branch April 24, 2025 17:15
@fajin-corp (Contributor, Author)

"For non-vnni platform, the i16 cannot fit in 4 i8. To avoid overflow extra instructions are needed. This is the major reason of non-vnni slow down." does this issue exist with 4bit non-vnni?

4-bit does not have this issue. The overflow comes from (i8 * i8) * 2: the result is put in an i16, which it can exceed, whereas (i4 * i4) * 2 always fits in an i16.

jywu-msft pushed a commit that referenced this pull request Apr 30, 2025
### Description

Cherry pick the following into
[rel-1.22.0](https://github.com/microsoft/onnxruntime/tree/rel-1.22.0)


- (#24487)
- (#24466)
- (#24493)
- (#24484)
- (#24494)
- (#24489)
- (#24504)
- (#24510)
- (#24456)
- (#24537)
- (#24501)
- (#24519)
- (#24513)
- (#24539)
- (#24514)
- (#24542)
- (#24585)

Not added:

Planning to cherry pick Cuda Matmulnbits PRs once the fix for failing
cuda pipeline is ready
- (#24491)
- (#24509)
- (#24564)

---------

Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
Co-authored-by: minfhong-quic <quic_minfhong@quicinc.com>
Co-authored-by: minfhong-quic <minfhong-quic@quicinc.com>
Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
Co-authored-by: Prathik Rao <prathik.rao@gmail.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Ankan Banerjee <ankan.ban@gmail.com>
Co-authored-by: Maximilian Müller <maximilianm@nvidia.com>
Co-authored-by: Gaurav Garg <gaugarg@nvidia.com>
Co-authored-by: iraut <iraut@nvidia.com>
Co-authored-by: Hrishikesh Manohar <hrishikeshm@nvidia.com>
Co-authored-by: Maximilian Müller <44298237+gedoensmax@users.noreply.github.com>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
Co-authored-by: Jiajia Qin <jiajiaqin@microsoft.com>
Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>
Co-authored-by: xhcao <xinghua.cao@intel.com>
jatinwadhwa921 pushed a commit to intel/onnxruntime that referenced this pull request Apr 30, 2025
### Description

Cherry pick the following into
[rel-1.22.0](https://github.com/microsoft/onnxruntime/tree/rel-1.22.0)


- (microsoft#24487)
- (microsoft#24466)
- (microsoft#24493)
- (microsoft#24484)
- (microsoft#24494)
- (microsoft#24489)
- (microsoft#24504)
- (microsoft#24510)
- (microsoft#24456)
- (microsoft#24537)
- (microsoft#24501)
- (microsoft#24519)
- (microsoft#24513)
- (microsoft#24539)
- (microsoft#24514)
- (microsoft#24542)
- (microsoft#24585)

Not added:

Planning to cherry pick Cuda Matmulnbits PRs once the fix for failing
cuda pipeline is ready
- (microsoft#24491)
- (microsoft#24509)
- (microsoft#24564)

---------

Co-authored-by: vraspar <vrajang@outlook.com>
Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
Co-authored-by: minfhong-quic <quic_minfhong@quicinc.com>
Co-authored-by: minfhong-quic <minfhong-quic@quicinc.com>
Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
Co-authored-by: Prathik Rao <prathik.rao@gmail.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Ankan Banerjee <ankan.ban@gmail.com>
Co-authored-by: Maximilian Müller <maximilianm@nvidia.com>
Co-authored-by: Gaurav Garg <gaugarg@nvidia.com>
Co-authored-by: iraut <iraut@nvidia.com>
Co-authored-by: Hrishikesh Manohar <hrishikeshm@nvidia.com>
Co-authored-by: Maximilian Müller <44298237+gedoensmax@users.noreply.github.com>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
Co-authored-by: Jiajia Qin <jiajiaqin@microsoft.com>
Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>
Co-authored-by: xhcao <xinghua.cao@intel.com>
vraspar pushed a commit that referenced this pull request May 1, 2025
### Description
Add 8bits support for matmulnbits on x86

__AVX512 VNNI__
| M | N | K | 8-bit Time (ns) | 4-bit Time (ns) | Slowdown (8-bit / 4-bit) |
|:-----:|:-------:|:-------:|:----------------:|:----------------:|:------------------------:|
| 1 | 4096 | 4096 | 34145 | 27723 | **1.23×** |
| 1 | 11008 | 4096 | 415285 | 68656 | **6.05×** |
| 1 | 4096 | 11008 | 407801 | 68061 | **5.99×** |
| 1 | 11008 | 11008 | 2674538 | 1003532 | **2.67×** |
| 4096 | 4096 | 4096 | 80338759 | 86321713 | **0.93×** |
| 4096 | 11008 | 4096 | 213421935 | 225245276 | **0.95×** |
| 4096 | 4096 | 11008 | 240164365 | 228966953 | **1.05×** |
| 4096 | 11008 | 11008 | 628352046 | 596738340 | **1.05×** |

__AVX512__
| M | N | K | 8-bit Time (ns) | 4-bit Time (ns) | Slowdown (8-bit / 4-bit) |
|:-----:|:-------:|:-------:|:----------------:|:----------------:|:------------------------:|
| 1 | 4096 | 4096 | 53324 | 37882 | **1.41×** |
| 1 | 11008 | 4096 | 244560 | 103255 | **2.37×** |
| 1 | 4096 | 11008 | 435131 | 95734 | **4.55×** |
| 1 | 11008 | 11008 | 2790710 | 1075216 | **2.60×** |
| 4096 | 4096 | 4096 | 200629000 | 132841540 | **1.51×** |
| 4096 | 11008 | 4096 | 532141914 | 350613184 | **1.52×** |
| 4096 | 4096 | 11008 | 544011977 | 351679619 | **1.55×** |
| 4096 | 11008 | 11008 | 1421865147 | 925593210 | **1.54×** |

Token generation is bottlenecked at memory access. 8b model's 2x size is
major reason of token generation slow down.

For non-vnni platform, the i16 cannot fit in 4 i8. To avoid overflow
extra instructions are needed. This is the major reason of non-vnni slow
down.

### Motivation and Context
MatMul4Bits model has repetition issue. 6b model resolved this issue.
jywu-msft pushed a commit that referenced this pull request May 1, 2025
### Description

Cherry pick the following into
[rel-1.22.0](https://github.com/microsoft/onnxruntime/tree/rel-1.22.0)

- (#24491)
- (#24509)
- (#24564)
- (#24574)
- (#24582)
- (#24584)
- (#24568)
- (#24587)
- (#24563)
- (#24592)
- (#24526)
- (#24552)
- (#24588)
- (#24605)
- (#24606)

---------

Co-authored-by: Jing Fang <126209182+fajin-corp@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: Baiju Meswani <bmeswani@microsoft.com>
Co-authored-by: Scott McKay <skottmckay@gmail.com>
Co-authored-by: Mark Schofield <mschofie@microsoft.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Ashwath Shankarnarayan <quic_ashwshan@quicinc.com>
Co-authored-by: saurabh <saurabh1.kale@intel.com>
Co-authored-by: Adrian Lizarraga <adlizarraga@microsoft.com>
Co-authored-by: Hector Li <hecli@microsoft.com>
ankitm3k pushed a commit to intel/onnxruntime that referenced this pull request May 12, 2025
### Description
Add 8bits support for matmulnbits on x86

__AVX512 VNNI__
| M | N | K | 8-bit Time (ns) | 4-bit Time (ns) | Slowdown (8-bit / 4-bit) |
|:-----:|:-------:|:-------:|:----------------:|:----------------:|:------------------------:|
| 1 | 4096 | 4096 | 34145 | 27723 | **1.23×** |
| 1 | 11008 | 4096 | 415285 | 68656 | **6.05×** |
| 1 | 4096 | 11008 | 407801 | 68061 | **5.99×** |
| 1 | 11008 | 11008 | 2674538 | 1003532 | **2.67×** |
| 4096 | 4096 | 4096 | 80338759 | 86321713 | **0.93×** |
| 4096 | 11008 | 4096 | 213421935 | 225245276 | **0.95×** |
| 4096 | 4096 | 11008 | 240164365 | 228966953 | **1.05×** |
| 4096 | 11008 | 11008 | 628352046 | 596738340 | **1.05×** |

__AVX512__
| M | N | K | 8-bit Time (ns) | 4-bit Time (ns) | Slowdown (8-bit / 4-bit) |
|:-----:|:-------:|:-------:|:----------------:|:----------------:|:------------------------:|
| 1 | 4096 | 4096 | 53324 | 37882 | **1.41×** |
| 1 | 11008 | 4096 | 244560 | 103255 | **2.37×** |
| 1 | 4096 | 11008 | 435131 | 95734 | **4.55×** |
| 1 | 11008 | 11008 | 2790710 | 1075216 | **2.60×** |
| 4096 | 4096 | 4096 | 200629000 | 132841540 | **1.51×** |
| 4096 | 11008 | 4096 | 532141914 | 350613184 | **1.52×** |
| 4096 | 4096 | 11008 | 544011977 | 351679619 | **1.55×** |
| 4096 | 11008 | 11008 | 1421865147 | 925593210 | **1.54×** |

Token generation is bottlenecked at memory access. 8b model's 2x size is
major reason of token generation slow down.

For non-vnni platform, the i16 cannot fit in 4 i8. To avoid overflow
extra instructions are needed. This is the major reason of non-vnni slow
down.

### Motivation and Context
MatMul4Bits model has repetition issue. 6b model resolved this issue.

snnn commented Sep 5, 2025

This PR has been included in the rel-1.22.0 branch. Removing the release:1.22.0 label.


6 participants