[AMD] Add mha fp8-kv support by kkHuang-amd · Pull Request #21253 · sgl-project/sglang

kkHuang-amd · 2026-03-24T02:44:00Z

Motivation

Support an FP8 KV cache when running models that use MHA attention.

Modifications

Updated aiter_backend to support an FP8 KV cache.
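For readers unfamiliar with what an fp8_e4m3 KV cache stores, here is a minimal pure-Python sketch of rounding values to the e4m3 grid (4 exponent bits, 3 mantissa bits, largest finite value 448). It is not the code from this PR: the real cache holds FP8 tensors and the aiter kernels do the attention math. The function names and the `scale` parameter are hypothetical, and saturating at ±448 is an assumption. Per the commit messages in this PR, only K and V are stored in FP8; q stays in bf16 or fp16.

```python
import math

E4M3_MAX = 448.0     # largest finite value in the e4m3 (fn) encoding
E4M3_MIN_EXP = -6    # exponent of the smallest normal number
MANTISSA_BITS = 3


def round_to_e4m3(x):
    """Round a float to the nearest e4m3 value (assumption: saturate at +/-448)."""
    if x == 0.0 or math.isnan(x):
        return x
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)
    # frexp gives a = m * 2**ex with 0.5 <= m < 1, so floor(log2(a)) == ex - 1.
    _, ex = math.frexp(a)
    # Below the smallest normal exponent, values share the subnormal spacing.
    e = max(ex - 1, E4M3_MIN_EXP)
    step = 2.0 ** (e - MANTISSA_BITS)
    # Python's round() rounds halves to even, matching round-to-nearest-even.
    return sign * round(a / step) * step


def quantize_kv(values, scale=1.0):
    """Hypothetical helper: divide by a per-tensor scale, then round to e4m3."""
    return [round_to_e4m3(v / scale) for v in values]


def dequantize_kv(codes, scale=1.0):
    """Hypothetical helper: map stored e4m3 values back to the original range."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    k = [0.3, -2.0, 1000.0]
    stored = quantize_kv(k)
    print(stored, dequantize_kv(stored))
```

The sketch shows why FP8 KV storage trades some precision (only 3 mantissa bits) for half the memory of bf16, which is why the accuracy test below matters.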

Server command:

SGLANG_USE_AITER=1 python3 -m sglang.launch_server \
  --model-path /dockerx/raid/models/gpt-oss-120b/ \
  --tp 8 \
  --trust-remote-code \
  --chunked-prefill-size 131072 \
  --max-running-requests 128 \
  --mem-fraction-static 0.85 \
  --prefill-attention-backend aiter \
  --decode-attention-backend aiter \
  --page-size 64 \
  --disable-radix-cache \
  --kv-cache-dtype fp8_e4m3 \
  --port 8000
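
Once the server is up, a quick smoke test can confirm that it responds. The following is a minimal sketch, assuming the server exposes SGLang's native `/generate` endpoint on the port passed to `--port`; the helper names, the prompt, and the sampling values are illustrative and not part of this PR.

```python
import json
import urllib.request


def build_generate_request(prompt, host="localhost", port=8000, max_new_tokens=32):
    """Build a POST request for the /generate endpoint (not sent yet)."""
    payload = {
        "text": prompt,
        "sampling_params": {"max_new_tokens": max_new_tokens, "temperature": 0.0},
    }
    return urllib.request.Request(
        f"http://{host}:{port}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def smoke_test(prompt="The capital of France is", **kwargs):
    """Send one request and return the generated text. Requires a running server."""
    req = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read())["text"]
```

Calling `smoke_test(port=8000)` against the command above should return a short completion; pass a different `port` if the server was started on another one.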

Accuracy Tests

Accuracy test command:

python3 benchmark/gsm8k/bench_sglang.py \
  --num-questions 2000 --parallel 2000 --num-shots 5 --port 9000

Accuracy: 0.836
Invalid: 0.014
Latency: 43.406 s
Output throughput: 10142.803 token/s

Benchmarking and Profiling

Serving benchmark command:

python3 -m sglang.bench_serving \
  --host localhost --port 9000 \
  --model openai/gpt-oss-120b \
  --dataset-name random \
  --random-input 8192 --random-output 1024 \
  --random-range-ratio 1.0 \
  --max-concurrency 1 --num-prompt 8

Latency and Throughput

| Metric | TP8 (BF16) | TP8 (FP8) | Improvement (FP8 vs BF16) |
| --- | --- | --- | --- |
| Median E2E Latency (ms) | 3382.41 | 3356.55 | 0.7% |
| Total Throughput (tok/s) | 9498.85 | 9567.93 | 0.7% |

TTFT and ITL

| Metric | TP8 (BF16) | TP8 (FP8) | Improvement (FP8 vs BF16) |
| --- | --- | --- | --- |
| Median TTFT (ms) | 66.41 | 63.13 | 5.2% |
| Median ITL (ms) | 3.5 | 3.46 | 1.2% |
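
The PR does not say how the improvement column was computed or rounded. As a hedged sketch, one plausible convention is the speedup ratio minus one: BF16 over FP8 for lower-is-better metrics, FP8 over BF16 for throughput. The function name is hypothetical, and the printed values may differ from the tables in the last digit depending on rounding.

```python
def improvement_pct(baseline, candidate, lower_is_better):
    """Relative gain of candidate over baseline, in percent (assumed convention)."""
    if lower_is_better:
        return (baseline / candidate - 1.0) * 100.0
    return (candidate / baseline - 1.0) * 100.0


# (metric, BF16 value, FP8 value, lower_is_better) taken from the tables above.
rows = [
    ("Median E2E Latency (ms)", 3382.41, 3356.55, True),
    ("Total Throughput (tok/s)", 9498.85, 9567.93, False),
    ("Median TTFT (ms)", 66.41, 63.13, True),
    ("Median ITL (ms)", 3.5, 3.46, True),
]

for name, bf16, fp8, lower in rows:
    print(f"{name}: {improvement_pct(bf16, fp8, lower):.2f}%")
```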

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.

Review Process

Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
After green CI and required approvals, ask Merge Oncalls to merge.

gemini-code-assist · 2026-03-24T02:44:05Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

gemini-code-assist · 2026-03-24T06:57:53Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

HaiShaw · 2026-03-24T23:23:18Z

/tag-and-rerun-ci

Co-authored-by: wunhuang <wunhuang@amd.com>

Add mha fp8-kv support

452f96f

Merge branch 'main' into mha-fp8kv-support

74f85a5

hzh0425 assigned hzh0425 and unassigned hzh0425 Mar 24, 2026

Keep q precision in bf16 or fp16 in the unified-attention computation to get better performance when the KV cache type is fp8

e43a495

kkHuang-amd marked this pull request as ready for review March 24, 2026 06:57

kkHuang-amd requested review from Fridge003, HaiShaw, Qiaolin-Yu, hebiao064, ispobock and merrymercy as code owners March 24, 2026 06:57

Refactor code

a86cfc3

kkHuang-amd changed the title ~~Add mha fp8-kv support~~ [AMD] Add mha fp8-kv support Mar 24, 2026

HaiShaw added the amd label Mar 24, 2026

github-actions Bot added the run-ci label Mar 24, 2026

HaiShaw reviewed Mar 25, 2026

Comment thread python/sglang/srt/layers/attention/aiter_backend.py

Comment thread python/sglang/srt/layers/attention/aiter_backend.py

Add some comments

f769de9

HaiShaw approved these changes Mar 25, 2026

HaiShaw merged commit 86e2622 into sgl-project:main Mar 25, 2026
45 of 68 checks passed

HaiShaw mentioned this pull request Mar 25, 2026

[AMD][bugfix] Allocate cuda_graph_kv_last_page_len on GPU device #21306

Closed

5 tasks

0-693 pushed a commit to 0-693/sglang that referenced this pull request Mar 25, 2026

[AMD] Add mha fp8-kv support (sgl-project#21253)

9fce95a

Co-authored-by: wunhuang <wunhuang@amd.com>

johnnycxm pushed a commit to johnnycxm/sglang that referenced this pull request Mar 25, 2026

[AMD] Add mha fp8-kv support (sgl-project#21253)

0add592

Co-authored-by: wunhuang <wunhuang@amd.com>

johnnycxm pushed a commit to johnnycxm/sglang that referenced this pull request Mar 25, 2026

[AMD] Add mha fp8-kv support (sgl-project#21253)

366f636

Co-authored-by: wunhuang <wunhuang@amd.com>

ZiguanWang mentioned this pull request Mar 26, 2026

[AMD]:fix initialize cuda_graph_kv_last_page_len on correct device #21394

Closed

5 tasks

satyamk7054 pushed a commit to satyamk7054/sglang that referenced this pull request Apr 3, 2026

[AMD] Add mha fp8-kv support (sgl-project#21253)

27471d5

Co-authored-by: wunhuang <wunhuang@amd.com>

JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026

[AMD] Add mha fp8-kv support (sgl-project#21253)

7213590

Co-authored-by: wunhuang <wunhuang@amd.com>

yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026

[AMD] Add mha fp8-kv support (sgl-project#21253)

f9beecb

Co-authored-by: wunhuang <wunhuang@amd.com>

HaiShaw merged 5 commits into sgl-project:main from HaiShaw:mha-fp8kv-support
