[AMD] Add mha fp8-kv support #21253

Merged
HaiShaw merged 5 commits into sgl-project:main from HaiShaw:mha-fp8kv-support
Mar 25, 2026

@kkHuang-amd
Collaborator

kkHuang-amd commented Mar 24, 2026

Motivation

Support an FP8 KV cache when running models that use multi-head attention (MHA).

Modifications

Updated the aiter attention backend (python/sglang/srt/layers/attention/aiter_backend.py) to support an FP8 KV cache dtype.
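
For readers unfamiliar with what storing the KV cache as fp8_e4m3 means numerically, here is a minimal pure-Python sketch. It is not the aiter backend's code. It assumes the OCP E4M3 layout (3 mantissa bits, largest finite value 448) and a single per-tensor scale; the helper names `round_to_e4m3` and `fake_quantize` are hypothetical, and the hardware variant and scaling scheme actually used by this PR may differ.

```python
import math

E4M3_MAX = 448.0  # largest finite value in the OCP FP8 E4M3 format (assumed variant)


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value, saturating at the max."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E4M3_MAX)
    # frexp gives mag = m * 2**e with 0.5 <= m < 1, so floor(log2(mag)) == e - 1.
    exp = max(math.frexp(mag)[1] - 1, -6)  # -6 is the smallest normal exponent
    step = 2.0 ** (exp - 3)  # 3 mantissa bits give 8 steps per power of two
    return sign * min(round(mag / step) * step, E4M3_MAX)


def fake_quantize(values):
    """Scale into the E4M3 range, round, and scale back (one scale per tensor)."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / E4M3_MAX if max_abs > 0 else 1.0
    return [round_to_e4m3(v / scale) * scale for v in values], scale


kv_slice = [0.5, -1.25, 3.0, 0.01]  # made-up numbers standing in for K or V entries
dequantized, scale = fake_quantize(kv_slice)
for original, restored in zip(kv_slice, dequantized):
    print(f"{original:+.4f} -> {restored:+.4f}")
```

With 3 mantissa bits, each restored value stays within roughly 1/16 of the original as long as the scaled value is not in the subnormal range, which is the kind of precision loss the accuracy test is meant to check.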

Server command:
SGLANG_USE_AITER=1 python3 -m sglang.launch_server \
  --model-path /dockerx/raid/models/gpt-oss-120b/ \
  --tp 8 \
  --trust-remote-code \
  --chunked-prefill-size 131072 \
  --max-running-requests 128 \
  --mem-fraction-static 0.85 \
  --prefill-attention-backend aiter \
  --decode-attention-backend aiter \
  --page-size 64 \
  --disable-radix-cache \
  --kv-cache-dtype fp8_e4m3 \
  --port 8000
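
As a rough illustration of what `--kv-cache-dtype fp8_e4m3` changes in memory terms, the sketch below compares KV cache bytes per token at 2 bytes per element (BF16) and 1 byte per element (FP8). The layer count, KV head count, and head dimension are placeholder values chosen for illustration, not values read from the model config, and the calculation ignores any storage for scaling factors and any tensor-parallel sharding.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_elem: int) -> int:
    # The factor 2 covers the separate K and V tensors kept for every layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


# Placeholder dimensions for illustration only (assumption, not from the PR).
NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM = 36, 8, 64

bf16_bytes = kv_bytes_per_token(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, bytes_per_elem=2)
fp8_bytes = kv_bytes_per_token(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, bytes_per_elem=1)

print(f"BF16 KV cache: {bf16_bytes} bytes/token")
print(f"FP8  KV cache: {fp8_bytes} bytes/token")
print(f"Ratio: {bf16_bytes / fp8_bytes:.1f}x")
```

Whatever the real dimensions are, the element size is the only term that changes, so the same memory budget holds about twice as many cached tokens.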

Accuracy Tests

Accuracy test command:
python3 benchmark/gsm8k/bench_sglang.py \
  --num-questions 2000 --parallel 2000 --num-shots 5 --port 9000

Accuracy: 0.836
Invalid: 0.014
Latency: 43.406 s
Output throughput: 10142.803 token/s

Benchmarking and Profiling

Serving benchmark command:

python3 -m sglang.bench_serving \
  --host localhost --port 9000 \
  --model openai/gpt-oss-120b \
  --dataset-name random \
  --random-input 8192 --random-output 1024 \
  --random-range-ratio 1.0 \
  --max-concurrency 1 --num-prompt 8

Latency and Throughput

| Metric | TP8 (BF16) | TP8 (FP8) | Improvement (FP8 vs BF16) |
| --- | --- | --- | --- |
| Median E2E Latency (ms) | 3382.41 | 3356.55 | 0.7% |
| Total Throughput (tok/s) | 9498.85 | 9567.93 | 0.7% |

TTFT and ITL

| Metric | TP8 (BF16) | TP8 (FP8) | Improvement (FP8 vs BF16) |
| --- | --- | --- | --- |
| Median TTFT (ms) | 66.41 | 63.13 | 5.2% |
| Median ITL (ms) | 3.5 | 3.46 | 1.2% |
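
The PR does not say how the improvement column was computed. One plausible reading is a speedup ratio: BF16 divided by FP8 for latency metrics, FP8 divided by BF16 for throughput. The sketch below applies that assumed convention to the reported numbers; the function name `speedup_pct` is hypothetical, and results may differ from the table in the last digit depending on rounding.

```python
def speedup_pct(bf16: float, fp8: float, lower_is_better: bool) -> float:
    """Percent speedup of FP8 over BF16 under an assumed ratio convention."""
    ratio = bf16 / fp8 if lower_is_better else fp8 / bf16
    return (ratio - 1.0) * 100.0


# (metric name, BF16 value, FP8 value, lower_is_better), copied from the tables.
reported = [
    ("Median E2E Latency (ms)", 3382.41, 3356.55, True),
    ("Total Throughput (tok/s)", 9498.85, 9567.93, False),
    ("Median TTFT (ms)", 66.41, 63.13, True),
    ("Median ITL (ms)", 3.5, 3.46, True),
]

results = {name: speedup_pct(bf16, fp8, lower) for name, bf16, fp8, lower in reported}
for name, pct in results.items():
    print(f"{name}: {pct:.2f}%")
```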

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

hzh0425 assigned hzh0425 and unassigned hzh0425 Mar 24, 2026
…to get the better performance when kv cache type is fp8
kkHuang-amd marked this pull request as ready for review March 24, 2026 06:57

kkHuang-amd changed the title from "Add mha fp8-kv support" to "[AMD] Add mha fp8-kv support" Mar 24, 2026
HaiShaw added the amd label Mar 24, 2026
@HaiShaw
Collaborator

HaiShaw commented Mar 24, 2026

/tag-and-rerun-ci

Comment thread python/sglang/srt/layers/attention/aiter_backend.py
Comment thread python/sglang/srt/layers/attention/aiter_backend.py
HaiShaw merged commit 86e2622 into sgl-project:main Mar 25, 2026
45 of 68 checks passed
0-693 pushed a commit to 0-693/sglang that referenced this pull request Mar 25, 2026
Co-authored-by: wunhuang <wunhuang@amd.com>
johnnycxm pushed a commit to johnnycxm/sglang that referenced this pull request Mar 25, 2026
Co-authored-by: wunhuang <wunhuang@amd.com>
satyamk7054 pushed a commit to satyamk7054/sglang that referenced this pull request Apr 3, 2026
Co-authored-by: wunhuang <wunhuang@amd.com>
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026
Co-authored-by: wunhuang <wunhuang@amd.com>
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026
Co-authored-by: wunhuang <wunhuang@amd.com>