
Fix regression caused by fa3 block_table #15009

Merged
Fridge003 merged 1 commit into sgl-project:main from wenscarl:fix_fa3_metadata_regression on Dec 13, 2025

Conversation

@wenscarl (Collaborator) commented Dec 12, 2025

Motivation

Fixes: #14665 [Bug] Llama4 on DGXH100 has regressed by 14.23%

Modifications

Accuracy Tests

Benchmarking and Profiling

Benchmark script:

python3 -m sglang.bench_one_batch \
    --model-path meta-llama/Llama-4-Scout-17B-16E \
    --disable-radix-cache \
    --context-length 2048 \
    --attention-backend fa3 \
    --load-format dummy \
    --batch-size 128 \
    --tp-size 8 \
    --input-len 1000 \
    --output-len 1000

Before:

Prefill. latency: 2.12386 s, throughput:  60267.49 token/s
Decode 0. Batch size: 128, latency: 0.17131 s, throughput:    747.20 token/s
Decode 1. Batch size: 128, latency: 0.11283 s, throughput:   1134.45 token/s
Decode 2. Batch size: 128, latency: 0.11220 s, throughput:   1140.83 token/s
Decode 3. Batch size: 128, latency: 0.11324 s, throughput:   1130.34 token/s
Decode 4. Batch size: 128, latency: 0.11100 s, throughput:   1153.15 token/s
Decode.  median latency: 0.11281 s, median throughput:   1134.62 token/s
Total. latency:  5.681 s, throughput:  23253.82 token/s
Benchmark ...
Prefill. latency: 1.81066 s, throughput:  70692.62 token/s
Decode 0. Batch size: 128, latency: 0.11238 s, throughput:   1139.00 token/s
Decode 1. Batch size: 128, latency: 0.11217 s, throughput:   1141.11 token/s
Decode 2. Batch size: 128, latency: 0.11157 s, throughput:   1147.31 token/s
Decode 3. Batch size: 128, latency: 0.11164 s, throughput:   1146.54 token/s
Decode 4. Batch size: 128, latency: 0.11063 s, throughput:   1157.03 token/s
Decode.  median latency: 0.15820 s, median throughput:    809.09 token/s
Total. latency: 160.106 s, throughput:   1598.95 token/s

After:

Prefill. latency: 2.03622 s, throughput:  62861.66 token/s
Decode 0. Batch size: 128, latency: 0.08541 s, throughput:   1498.60 token/s
Decode 1. Batch size: 128, latency: 0.01811 s, throughput:   7067.50 token/s
Decode 2. Batch size: 128, latency: 0.01751 s, throughput:   7309.17 token/s
Decode 3. Batch size: 128, latency: 0.01738 s, throughput:   7365.30 token/s
Decode 4. Batch size: 128, latency: 0.01736 s, throughput:   7371.69 token/s
Decode.  median latency: 0.01744 s, median throughput:   7338.32 token/s
Total. latency:  2.646 s, throughput:  49923.53 token/s
Benchmark ...
Prefill. latency: 1.72161 s, throughput:  74349.17 token/s
Decode 0. Batch size: 128, latency: 0.01989 s, throughput:   6435.87 token/s
Decode 1. Batch size: 128, latency: 0.01811 s, throughput:   7067.99 token/s
Decode 2. Batch size: 128, latency: 0.01759 s, throughput:   7276.41 token/s
Decode 3. Batch size: 128, latency: 0.01768 s, throughput:   7241.46 token/s
Decode 4. Batch size: 128, latency: 0.01753 s, throughput:   7303.33 token/s
Decode.  median latency: 0.01828 s, median throughput:   7003.46 token/s
Total. latency: 20.002 s, throughput:  12798.53 token/s
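As a quick sanity check on the numbers above, the decode speedup implied by the reported median latencies can be recomputed directly (per-step decode throughput is batch size divided by step latency; the figures here are the medians from the first run before and after the fix, so small rounding differences versus the logged throughput are expected):

```python
# Sanity check of the decode speedup implied by the reported medians.
# Latencies are taken verbatim from the first benchmark run above.
batch_size = 128
before_median_latency_s = 0.11281  # decode median, before the fix
after_median_latency_s = 0.01744   # decode median, after the fix

# Per-step decode throughput: one token per sequence per step.
before_tput = batch_size / before_median_latency_s  # ~1134 token/s
after_tput = batch_size / after_median_latency_s    # ~7339 token/s

speedup = before_median_latency_s / after_median_latency_s
print(f"decode speedup: {speedup:.2f}x")  # → decode speedup: 6.47x
```

So the fix recovers roughly a 6.5x improvement in median decode throughput in this configuration, well beyond the 14.23% regression reported in #14665 (the regression measurement in the issue used a different workload).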

Checklist


@Kangyan-Zhou
Collaborator

/tag-and-rerun-ci


@Fridge003 Fridge003 left a comment


Nice catch

@Fridge003
Collaborator

/rerun-failed-ci

@Fridge003
Collaborator

Failed B200 tests are unrelated to the fa3 backend.

@Fridge003 Fridge003 merged commit 665cb02 into sgl-project:main Dec 13, 2025
210 of 233 checks passed
Liwansi added a commit to iforgetmyname/sglang that referenced this pull request Dec 13, 2025
…n_eagle3_npu

* 'main' of https://github.com/sgl-project/sglang: (25 commits)
  [NPU] perf update with kvcache nz & w4a8 quant (sgl-project#14423)
  [PP Prefill][NIXL] Fix PP mode transfer completion tracking to wait for all ranks (sgl-project#15027)
  Fix GLM-4.6 tool calls don't support streaming output for arguments i… (sgl-project#13989)
  feature: adding nightly wheel workflow and indexer (sgl-project#14924)
  [diffusion] feat: Improve LoRA compatibility by adding unified format detection and diffusers-based normalization (sgl-project#14659)
  [Fix] Disable trtllm moe backend for draft model for a qucik fix (sgl-project#15002)
  [diffusion] fix: use NDRotaryEmbedding in flux_2   (sgl-project#15034)
  Mistral Large 3 NVFP4 support (sgl-project#14485)
  call check_quantized_moe_compatibility after initialize (sgl-project#13876)
  Add sgl_router_attempt_http_responses_total for single attempt information (sgl-project#15037)
  Add error code in prometheus metrics and add X-SMG-Error-Code header (sgl-project#15036)
  Provide more fine grained error reason for reqwest error (sgl-project#15032)
  Tiny change http router response format to unify (sgl-project#15031)
  Tiny unify grpc existing error responses into new format (sgl-project#15030)
  Add `code` field and unify error responses for router (sgl-project#15028)
  Super tiny remove unused log_request (sgl-project#15035)
  Fix decode OOM caused by retraction (sgl-project#14939)
  [CI]Add gb200 runner back (sgl-project#15024)
  Add a special label for b200 CI runner that can run kernel tests (sgl-project#15033)
  Fix regression caused by fa3 block_table (sgl-project#15009)
  ...

# Conflicts:
#	python/sglang/srt/hardware_backend/npu/attention/ascend_backend.py
Prozac614 pushed a commit to Prozac614/sglang that referenced this pull request Dec 17, 2025
YChange01 pushed a commit to YChange01/sglang that referenced this pull request Jan 13, 2026

Successfully merging this pull request may close these issues.

[Bug] Llama4 on DGXH100 has regressed by 14.23%
