
Support kv8 (FP8) with torch_native attention backend #12596

Merged
Fridge003 merged 2 commits into sgl-project:main from bytedance-iaas:horenc/torch_native_kv8_support_on_main_release on Dec 28, 2025

Conversation

@JackChuang
Contributor

@JackChuang JackChuang commented Nov 4, 2025

This patch fixes the issue where KV8 could not run when the attention backend was set to torch_native.

Motivation

Currently, when using --attention-backend torch_native, the --kv-cache-dtype fp8_e4m3 option is not supported, causing the FP8 KV cache to fail at runtime. This patch fixes the issue by ensuring that the query, key, and value tensors are cast to the same dtype before calling scaled_dot_product_attention.

Modifications

  • Modified TorchNativeAttnBackend in torch_native_backend.py
  • Added dtype casting for per_req_key and per_req_value to match per_req_query
  • Ensures scaled_dot_product_attention works correctly with FP8 KV cache

Accuracy Tests

Tested in another PR #12612

Benchmarking and Profiling

Tested in another PR #12612



@Fridge003
Collaborator

@JackChuang Please update this doc
https://github.com/sgl-project/sglang/blob/main/docs/advanced_features/attention_backend.md?plain=1#L22

@JackChuang JackChuang force-pushed the horenc/torch_native_kv8_support_on_main_release branch from 773c7a0 to 64ff639 on November 14, 2025 04:25
@github-actions github-actions Bot added the documentation Improvements or additions to documentation label Nov 14, 2025
@JackChuang
Contributor Author

@Fridge003 Thanks for your review and approval. Could someone help merge this PR? Thanks~

@Fridge003
Collaborator

@JackChuang Please fix conflict

@JackChuang JackChuang force-pushed the horenc/torch_native_kv8_support_on_main_release branch from 64ff639 to 5faf913 on November 14, 2025 22:39
@JackChuang
Contributor Author

@Fridge003 Could you please help merge this PR when you have free cycles? Thank you.

This patch fixes the issue where KV8 could not run when
the attention backend was set to torch_native.

Updates the attention backend support document.

Signed-off-by: Ho-Ren (Jack) Chuang <horenchuang@bytedance.com>
@JackChuang JackChuang force-pushed the horenc/torch_native_kv8_support_on_main_release branch from 5faf913 to 8ee8295 on December 12, 2025 12:29
@Fridge003
Collaborator

@JackChuang Do you have any example of accuracy benchmarking when enabling fp8 kv cache with the torch native backend?

@JackChuang
Contributor Author

> @JackChuang Do you have any example of accuracy benchmarking when enabling fp8 kv cache with the torch native backend?

I tested performance but not accuracy. I'll run the accuracy tests and update here.

@JackChuang
Contributor Author

JackChuang commented Dec 16, 2025

@Fridge003 Using torch_native with KV8, the precision is essentially lossless.

[KV16]
Accuracy: 0.947
Invalid: 0.000
Latency: 2783.572 s
Output throughput: 73.740 token/s

$ CUDA_VISIBLE_DEVICES=4,5,6,7 python3 -m sglang.launch_server --model-path /data02/models/Qwen3-235B-A22B --tp 4 --trust-remote-code --port 8042 --attention-backend torch_native --disable-radix-cache --enable-torch-compile

$ python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319 --port 8042

[KV8]
Accuracy: 0.949
Invalid: 0.001
Latency: 2984.291 s
Output throughput: 68.772 token/s

$ CUDA_VISIBLE_DEVICES=0,1,2,3 python3 -m sglang.launch_server --model-path /data02/models/Qwen3-235B-A22B --tp 4 --trust-remote-code --port 8041 --kv-cache-dtype fp8_e4m3 --disable-radix-cache --enable-torch-compile --attention-backend torch_native

$ python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319 --port 8041

Experiments were run on B200 GPUs.

@Fridge003
Collaborator

@JackChuang Please merge the main branch.

@Fridge003 Fridge003 merged commit 349ce2d into sgl-project:main Dec 28, 2025
256 of 272 checks passed
YChange01 pushed a commit to YChange01/sglang that referenced this pull request Jan 13, 2026

Labels

documentation Improvements or additions to documentation run-ci
