[Feature] KV cache per-token-head INT8/FP8 quantization #38378

Merged: mgoin merged 119 commits into vllm-project:main from JartX:feature/kvcache_per_token on Apr 2, 2026

JartX commented Mar 27, 2026

Contributor

Continuation of PR #36893, with the changes requested by @mgoin in this comment:
#36893 (comment)

This PR adds per-token-head KV cache quantization to the Triton attention backend, with two new modes: INT8_PER_TOKEN and FP8_PER_TOKEN.
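
The description does not spell out the quantization scheme, so the following is only a minimal sketch of the general idea. It assumes that "per-token-head" means one scale per (token, head) pair, and that INT8 uses symmetric absmax scaling. The function names are hypothetical and are not part of the vLLM API; the actual implementation in this PR lives in Triton kernels.

```python
from typing import List, Tuple

INT8_MAX = 127  # symmetric range [-127, 127]; an assumption of this sketch

Tensor3D = List[List[List[float]]]  # [num_tokens][num_heads][head_dim]


def quantize_per_token_head(kv: Tensor3D) -> Tuple[List[List[List[int]]], List[List[float]]]:
    """Quantize to INT8 with one scale per (token, head) pair (hypothetical helper)."""
    q_out, scales = [], []
    for token in kv:
        q_token, s_token = [], []
        for head_vec in token:
            # Absmax of this single head vector decides its scale.
            amax = max(abs(x) for x in head_vec)
            scale = amax / INT8_MAX if amax > 0 else 1.0
            # Round to the nearest integer and clamp to the INT8 range.
            q_vec = [max(-INT8_MAX, min(INT8_MAX, round(x / scale))) for x in head_vec]
            q_token.append(q_vec)
            s_token.append(scale)
        q_out.append(q_token)
        scales.append(s_token)
    return q_out, scales


def dequantize_per_token_head(q: List[List[List[int]]], scales: List[List[float]]) -> Tensor3D:
    """Recover approximate values by multiplying each vector by its own scale."""
    return [
        [[v * s for v in q_vec] for q_vec, s in zip(q_token, s_token)]
        for q_token, s_token in zip(q, scales)
    ]


if __name__ == "__main__":
    # One token, two heads with very different magnitudes.
    kv = [[[0.5, -1.0, 0.25, 0.0], [10.0, -20.0, 5.0, 2.5]]]
    q, scales = quantize_per_token_head(kv)
    print("scales:", scales)
    print("round trip:", dequantize_per_token_head(q, scales))
```

Because each (token, head) vector gets its own scale, a head with large activations does not inflate the quantization error of a head with small ones. The cost is one extra stored scale per token per head.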

| Metric | FP16 | FP8 | INT8_PER_TOKEN | FP8_PER_TOKEN |
|---|---|---|---|---|
| Duration (s) | 89.17 | 91.21 | **80.95 ✅** | 104.45 |
| Request throughput (req/s) | 2.24 | 2.19 | **2.47 ✅** | 1.91 |
| Output tok/s | 1148.30 | 1122.64 | **1264.91 ✅** | 980.33 |
| Total tok/s | 2296.61 | 2245.28 | **2529.83 ✅** | 1960.66 |
| Mean TTFT (ms) | 10130 | 11453 | **10072 ✅** | 11572 |
| Median TTFT (ms) | 9689 | 11017 | **9628 ✅** | 11078 |
| P99 TTFT (ms) | 21286 | 22612 | **21119 ✅** | 24015 |
| Mean TPOT (ms) | 150.93 | 152.30 | **135.66 ✅** | 178.22 |
| Median TPOT (ms) | 152.19 | 153.55 | **136.88 ✅** | 179.77 |
| P99 TPOT (ms) | 164.52 | 165.89 | **150.01 ✅** | 193.52 |
| Mean ITL (ms) | 150.93 | 152.30 | **135.66 ✅** | 178.22 |
| Median ITL (ms) | 137.34 | 139.27 | **121.57 ✅** | 164.21 |
| P99 ITL (ms) | 468.12 | 467.41 | **461.64 ✅** | 518.82 |
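
To put the table above in relative terms, the short sketch below computes the percentage change of each mode against the FP16 column. The numbers are copied from the table; the helper name `pct_change` and the dictionary layout are illustrative choices, not part of any benchmark tooling.

```python
def pct_change(value: float, baseline: float) -> float:
    """Relative change of `value` versus `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


# Values copied from the benchmark table above (FP16 is the baseline).
output_tok_s = {
    "FP16": 1148.30,
    "FP8": 1122.64,
    "INT8_PER_TOKEN": 1264.91,
    "FP8_PER_TOKEN": 980.33,
}
mean_tpot_ms = {
    "FP16": 150.93,
    "FP8": 152.30,
    "INT8_PER_TOKEN": 135.66,
    "FP8_PER_TOKEN": 178.22,
}

if __name__ == "__main__":
    for mode in ("FP8", "INT8_PER_TOKEN", "FP8_PER_TOKEN"):
        tput = pct_change(output_tok_s[mode], output_tok_s["FP16"])
        tpot = pct_change(mean_tpot_ms[mode], mean_tpot_ms["FP16"])
        print(f"{mode}: output tok/s {tput:+.2f}%, mean TPOT {tpot:+.2f}%")
```

On these figures, INT8_PER_TOKEN comes out roughly 10% ahead of FP16 in output throughput, while FP8_PER_TOKEN is roughly 15% behind.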

JartX and others added 30 commits March 12, 2026 15:32
Signed-off-by: JartX <sagformas@epdcenter.es>
Co-authored-by: yangyang4991 <yangyang4991@gmail.com>
…e_pr_34327_thanks_ericccyang

Co-authored-by: Isotr0py <2037008807@qq.com>
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>

tjtanaa commented Apr 2, 2026

Member

> @tjtanaa could you help benchmark the Triton backend to make sure we aren't introducing any regressions to the non-KV-quant case?

Sure, let me get some figures.

tjtanaa left a comment

Member

LGTM, amazing work.

There is no noticeable regression between before and after this PR's changes.

| Configuration | Output Throughput | Throughput Diff % | Mean TTFT | TTFT Diff % | Mean TPOT | TPOT Diff % | Duration | Duration Diff % |
|---|---|---|---|---|---|---|---|---|
| Ori, no quant, no prefix | 7434.96 tok/s | 0% | 3366.76 ms | 0% | 30.90 ms | 0% | 35.26 s | 0% |
| Ori, fp8 quant, no cache | 7309.90 tok/s | -1.68% | 3411.33 ms | +1.32% | 31.44 ms | +1.75% | 35.86 s | +1.70% |
| After, no quant, no cache | 7416.04 tok/s | -0.25% | 3504.87 ms | +4.10% | 30.85 ms | -0.16% | 35.35 s | +0.26% |
| After, fp8 quant, no cache | 7292.52 tok/s | -1.92% | 3428.95 ms | +1.85% | 31.51 ms | +1.97% | 35.95 s | +1.96% |

@mgoin mgoin merged commit 2ce3d0c into vllm-project:main Apr 2, 2026
79 of 80 checks passed
@github-project-automation github-project-automation Bot moved this from Todo to Done in AMD Apr 2, 2026
orozery added a commit that referenced this pull request Apr 2, 2026
@MatthewBonanni MatthewBonanni mentioned this pull request Apr 2, 2026
5 tasks
HenryTangDev pushed a commit to HenryTangMain/vllm that referenced this pull request Apr 6, 2026
…#38378)

Signed-off-by: JartX <sagformas@epdcenter.es>
Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: yangyang4991 <yangyang4991@gmail.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: Isotr0py <2037008807@qq.com>
@ashok-arora

@JartX INT8 seems to be lagging behind FP16. Is that expected even for longer context lengths and higher batch sizes?

JartX commented Apr 7, 2026

Contributor Author

@ashok-arora
Hi! Could you give me more details, please? Tell me which graphics cards you're using and what you mean by "behind." Performance? Accuracy?

Thanks a lot!

puririshi98 pushed a commit to puririshi98/vllm that referenced this pull request Apr 7, 2026
Signed-off-by: Rishi Puri <riship@nvidia.com>
mtparet pushed a commit to blackfuel-ai/vllm that referenced this pull request Apr 9, 2026
mystous pushed a commit to mystous/vllm_hybrid that referenced this pull request May 10, 2026
my-other-github-account pushed a commit to my-other-github-account/vllm that referenced this pull request May 15, 2026
jhu960213 pushed a commit to jhu960213/vllm that referenced this pull request May 20, 2026
mvanhorn pushed a commit to mvanhorn/vllm that referenced this pull request Jun 4, 2026
Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>

Labels

documentation, quantization, ready, rocm, v1

Projects

Status: Done

9 participants