[V32/GLM5] Control the threshold of applying dense attention with an environ by Fridge003 · Pull Request #20062 · sgl-project/sglang

Fridge003 · 2026-03-06T22:25:26Z

Motivation

Add an environment variable, SGLANG_NSA_DENSE_ATTN_KV_LEN_THRESHOLD, that controls whether the dense MHA kernel or the sparse MLA kernel is used. It defaults to index.topk, so the original logic is unchanged.
For the GLM-5 model on Blackwell, this variable is set to 0 to avoid kernel issues.
Also remove the SGLANG_VERIFY_FUSED_METADATA_COPY environment variable, which is no longer needed.
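
The threshold behaviour described above can be illustrated with a minimal Python sketch. This is not the code from this PR: the helper names `dense_attn_threshold` and `use_dense_attention` are hypothetical, and the comparison (dense attention when the KV length is at or below the threshold) is an assumption inferred from the description, where a threshold of 0 turns the dense path off and a larger threshold extends it to longer sequences.

```python
import os

# Name of the environment variable introduced by this PR.
ENV_NAME = "SGLANG_NSA_DENSE_ATTN_KV_LEN_THRESHOLD"


def dense_attn_threshold(index_topk: int, environ=os.environ) -> int:
    """Return the KV-length threshold, falling back to index_topk when unset.

    Hypothetical helper: the fallback mirrors the "set to index.topk by
    default" behaviour stated in the PR description.
    """
    raw = environ.get(ENV_NAME)
    return index_topk if raw is None else int(raw)


def use_dense_attention(max_kv_len: int, threshold: int) -> bool:
    """Assumed rule: take the dense MHA path only for short KV lengths.

    The direction and inclusivity of the comparison are assumptions,
    not taken from the actual SGLang source.
    """
    return max_kv_len <= threshold


# Unset: the threshold follows index_topk (2048 in the benchmark below).
default = dense_attn_threshold(2048, environ={})
# Raised threshold: longer prefills would also take the dense path.
raised = dense_attn_threshold(2048, environ={ENV_NAME: "16384"})
# Threshold 0 (the GLM-5 on Blackwell setting): dense path never taken.
disabled = dense_attn_threshold(2048, environ={ENV_NAME: "0"})

print(default, raised, disabled)
print(use_dense_attention(4096, default))
print(use_dense_attention(4096, raised))
print(use_dense_attention(1, disabled))
```

Passing `environ` as a parameter keeps the sketch testable without mutating the real process environment.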

Accuracy Tests

Benchmarking and Profiling

Bench serving with V32 TP8 on Hopper

python3 -m sglang.bench_serving --backend sglang --num-prompts 16 --dataset-name random --random-input 2048 --random-output 1024 --random-range-ratio 1.0 --max-concurrency 16

Default (threshold 2048=topk):

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 16        
Successful requests:                     16        
Benchmark duration (s):                  23.25     
Total input tokens:                      32768     
Total input text tokens:                 32768     
Total generated tokens:                  16384     
Total generated tokens (retokenized):    16360     
Request throughput (req/s):              0.69      
Input token throughput (tok/s):          1409.62   
Output token throughput (tok/s):         704.81    
Peak output token throughput (tok/s):    848.00    
Peak concurrent requests:                16        
Total token throughput (tok/s):          2114.43   
Concurrency:                             15.99     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   23228.47  
Median E2E Latency (ms):                 23228.64  
P90 E2E Latency (ms):                    23229.51  
P99 E2E Latency (ms):                    23229.58  
---------------Time to First Token----------------
Mean TTFT (ms):                          3434.62   
Median TTFT (ms):                        3485.04   
P99 TTFT (ms):                           3808.01   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          19.35     
Median TPOT (ms):                        19.30     
P99 TPOT (ms):                           20.01     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           19.35     
Median ITL (ms):                         18.99     
P95 ITL (ms):                            19.33     
P99 ITL (ms):                            19.52     
Max ITL (ms):                            1124.70   
==================================================

SGLANG_NSA_DENSE_ATTN_KV_LEN_THRESHOLD=16384 (mean TTFT drops from 3434.62 ms to 1422.21 ms, i.e. it is more than halved):

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 16        
Successful requests:                     16        
Benchmark duration (s):                  21.30     
Total input tokens:                      32768     
Total input text tokens:                 32768     
Total generated tokens:                  16384     
Total generated tokens (retokenized):    16374     
Request throughput (req/s):              0.75      
Input token throughput (tok/s):          1538.39   
Output token throughput (tok/s):         769.19    
Peak output token throughput (tok/s):    848.00    
Peak concurrent requests:                16        
Total token throughput (tok/s):          2307.58   
Concurrency:                             15.99     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   21281.47  
Median E2E Latency (ms):                 21281.40  
P90 E2E Latency (ms):                    21282.55  
P99 E2E Latency (ms):                    21282.73  
---------------Time to First Token----------------
Mean TTFT (ms):                          1422.21   
Median TTFT (ms):                        1473.14   
P99 TTFT (ms):                           1787.25   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          19.41     
Median TPOT (ms):                        19.36     
P99 TPOT (ms):                           20.05     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           19.41     
Median ITL (ms):                         19.04     
P95 ITL (ms):                            19.37     
P99 ITL (ms):                            19.54     
Max ITL (ms):                            1099.13   
==================================================
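
To make the comparison between the two runs explicit, the following short Python sketch computes the relative change of a few headline metrics. The numbers are copied from the two result tables above; the dictionary keys and the `pct_change` helper are illustrative names only.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Values copied from the "Default (threshold 2048=topk)" table.
baseline = {
    "mean_ttft_ms": 3434.62,
    "mean_e2e_ms": 23228.47,
    "mean_tpot_ms": 19.35,
    "total_tok_per_s": 2114.43,
}

# Values copied from the SGLANG_NSA_DENSE_ATTN_KV_LEN_THRESHOLD=16384 table.
tuned = {
    "mean_ttft_ms": 1422.21,
    "mean_e2e_ms": 21281.47,
    "mean_tpot_ms": 19.41,
    "total_tok_per_s": 2307.58,
}

changes = {name: pct_change(baseline[name], tuned[name]) for name in baseline}

for name, delta in changes.items():
    print(f"{name}: {delta:+.1f}%")
```

In this run, mean TTFT falls by roughly 59%, mean end-to-end latency by about 8%, and total token throughput rises by about 9%, while TPOT is essentially unchanged. That is consistent with the threshold affecting prefill rather than decode.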

GLM-5 launches successfully on B200 with this change.

TP8 GSM8k 8shot:
Accuracy: 0.953
Invalid: 0.000
Latency: 109.110 s
Output throughput: 1258.606 token/s

DP8 GSM8k 8shot:
Accuracy: 0.955
Invalid: 0.000
Latency: 24.158 s
Output throughput: 5707.262 token/s

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.

Review Process

Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
After green CI and required approvals, ask Merge Oncalls to merge.

gemini-code-assist · 2026-03-06T22:25:29Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

gemini-code-assist · 2026-03-06T23:47:04Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

Fridge003 · 2026-03-06T23:47:09Z

/tag-and-rerun-ci

Fridge003 · 2026-03-07T08:34:44Z

/rerun-stage stage-c-test-8-gpu-h200

Fridge003 · 2026-03-07T08:34:55Z

/rerun-stage stage-c-test-4-gpu-b200

github-actions · 2026-03-07T08:35:04Z

✅ Triggered stage-c-test-8-gpu-h200 to run independently (skipping dependencies).

github-actions · 2026-03-07T08:35:09Z

🔗 View workflow run

github-actions · 2026-03-07T08:35:12Z

✅ Triggered stage-c-test-4-gpu-b200 to run independently (skipping dependencies).

github-actions · 2026-03-07T08:35:17Z

🔗 View workflow run

Fridge003 · 2026-03-09T20:39:31Z

/rerun-stage stage-c-test-4-gpu-b200

github-actions · 2026-03-09T20:39:53Z

✅ Triggered stage-c-test-4-gpu-b200 to run independently (skipping dependencies).

github-actions · 2026-03-09T20:39:59Z

🔗 View workflow run

Fridge003 added 3 commits March 6, 2026 22:11

fix environs

239e8e9

auto set threshold to topk

a086642

fix

005a8f7

Fridge003 requested review from HaiShaw, Qiaolin-Yu, hebiao064, ispobock and merrymercy as code owners March 6, 2026 22:25

github-actions Bot added the documentation Improvements or additions to documentation label Mar 6, 2026

Fridge003 marked this pull request as draft March 6, 2026 22:29

fix

71f90f5

Fridge003 marked this pull request as ready for review March 6, 2026 23:47

Merge branch 'main' into baizhou/v32_dense_control

2e4f1ff

github-actions Bot added the run-ci label Mar 6, 2026

Fridge003 added 2 commits March 6, 2026 16:47

upd

9b8a03b

upd

b3779bb

Fridge003 added 4 commits March 7, 2026 15:15

Merge branch 'main' into baizhou/v32_dense_control

c9d03e0

Merge branch 'main' into baizhou/v32_dense_control

80fdbb9

upd

5c0e601

fix

105aa44

github-actions Bot added the deepseek label Mar 9, 2026

upd

1160474

Fridge003 force-pushed the baizhou/v32_dense_control branch from dd6d6c7 to 1160474 Compare March 9, 2026 20:41

Fridge003 merged commit be63f98 into main Mar 9, 2026
55 of 105 checks passed

Fridge003 deleted the baizhou/v32_dense_control branch March 9, 2026 21:36

liubiyongge pushed a commit to liubiyongge/sglang that referenced this pull request Mar 13, 2026

[V32/GLM5] Control the threshold of applying dense attention with an …

1f61bae

…environ (sgl-project#20062)

Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026

[V32/GLM5] Control the threshold of applying dense attention with an …

f0c1a43

…environ (sgl-project#20062)

JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026

[V32/GLM5] Control the threshold of applying dense attention with an …

13dac01

…environ (sgl-project#20062)

yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026

[V32/GLM5] Control the threshold of applying dense attention with an …

16754ba

…environ (sgl-project#20062)

Fridge003 merged 12 commits into main from baizhou/v32_dense_control
