[V32/GLM5] Control the threshold of applying dense attention with an environ #20062

Merged
Fridge003 merged 12 commits into main from baizhou/v32_dense_control on Mar 9, 2026

@Fridge003
Collaborator

@Fridge003 Fridge003 commented Mar 6, 2026

Motivation

  • Add an environment variable, SGLANG_NSA_DENSE_ATTN_KV_LEN_THRESHOLD, that controls whether the dense MHA kernel or the sparse MLA kernel is used. It defaults to index.topk, so the original logic is unchanged.
  • For the GLM-5 model on Blackwell, this variable is set to 0 to avoid kernel issues.
  • Remove the unneeded SGLANG_VERIFY_FUSED_METADATA_COPY environment variable.
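
A minimal sketch of how such a threshold could gate kernel selection, for illustration only. The helper names (`dense_attn_threshold`, `use_dense_mha`), the `index_topk` parameter, and the `<=` comparison are assumptions, not the code in this PR; only the environment variable name and its default of index.topk come from the description above.

```python
import os
from typing import Mapping

ENV_NAME = "SGLANG_NSA_DENSE_ATTN_KV_LEN_THRESHOLD"


def dense_attn_threshold(index_topk: int, environ: Mapping[str, str] = os.environ) -> int:
    """Return the KV-length threshold, falling back to index_topk when unset."""
    raw = environ.get(ENV_NAME)
    return index_topk if raw is None else int(raw)


def use_dense_mha(kv_len: int, index_topk: int, environ: Mapping[str, str] = os.environ) -> bool:
    """Hypothetical selector: dense MHA for short KV, sparse MLA otherwise.

    Assumption: the comparison is `kv_len <= threshold`; the real code may
    use a strict comparison or a different length measure.
    """
    return kv_len <= dense_attn_threshold(index_topk, environ)


if __name__ == "__main__":
    # Unset: behaves like the default (threshold == index_topk).
    print(use_dense_mha(1024, index_topk=2048, environ={}))
    # Threshold 0: dense MHA is never selected for a non-empty KV cache.
    print(use_dense_mha(1, index_topk=2048, environ={ENV_NAME: "0"}))
```

Under these assumptions, a threshold of 0 disables the dense path entirely, which matches the stated intent for GLM-5 on Blackwell, and raising it widens the range of KV lengths served by dense MHA.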

Accuracy Tests

Benchmarking and Profiling

Bench serving with V32 TP8 on Hopper

python3 -m sglang.bench_serving --backend sglang --num-prompts 16 --dataset-name random --random-input 2048 --random-output 1024 --random-range-ratio 1.0 --max-concurrency 16

Default (threshold = index.topk = 2048):

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 16        
Successful requests:                     16        
Benchmark duration (s):                  23.25     
Total input tokens:                      32768     
Total input text tokens:                 32768     
Total generated tokens:                  16384     
Total generated tokens (retokenized):    16360     
Request throughput (req/s):              0.69      
Input token throughput (tok/s):          1409.62   
Output token throughput (tok/s):         704.81    
Peak output token throughput (tok/s):    848.00    
Peak concurrent requests:                16        
Total token throughput (tok/s):          2114.43   
Concurrency:                             15.99     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   23228.47  
Median E2E Latency (ms):                 23228.64  
P90 E2E Latency (ms):                    23229.51  
P99 E2E Latency (ms):                    23229.58  
---------------Time to First Token----------------
Mean TTFT (ms):                          3434.62   
Median TTFT (ms):                        3485.04   
P99 TTFT (ms):                           3808.01   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          19.35     
Median TPOT (ms):                        19.30     
P99 TPOT (ms):                           20.01     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           19.35     
Median ITL (ms):                         18.99     
P95 ITL (ms):                            19.33     
P99 ITL (ms):                            19.52     
Max ITL (ms):                            1124.70   
==================================================

SGLANG_NSA_DENSE_ATTN_KV_LEN_THRESHOLD=16384 (mean TTFT drops from 3434.62 ms to 1422.21 ms, about 59% lower):

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Max request concurrency:                 16        
Successful requests:                     16        
Benchmark duration (s):                  21.30     
Total input tokens:                      32768     
Total input text tokens:                 32768     
Total generated tokens:                  16384     
Total generated tokens (retokenized):    16374     
Request throughput (req/s):              0.75      
Input token throughput (tok/s):          1538.39   
Output token throughput (tok/s):         769.19    
Peak output token throughput (tok/s):    848.00    
Peak concurrent requests:                16        
Total token throughput (tok/s):          2307.58   
Concurrency:                             15.99     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   21281.47  
Median E2E Latency (ms):                 21281.40  
P90 E2E Latency (ms):                    21282.55  
P99 E2E Latency (ms):                    21282.73  
---------------Time to First Token----------------
Mean TTFT (ms):                          1422.21   
Median TTFT (ms):                        1473.14   
P99 TTFT (ms):                           1787.25   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          19.41     
Median TPOT (ms):                        19.36     
P99 TPOT (ms):                           20.05     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           19.41     
Median ITL (ms):                         19.04     
P95 ITL (ms):                            19.37     
P99 ITL (ms):                            19.54     
Max ITL (ms):                            1099.13   
==================================================
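
The relative changes between the two runs can be recomputed from the reported means. The sketch below uses only figures copied from the two result tables; the `pct_change` helper and the dictionary keys are illustrative names.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# (default threshold, threshold=16384), copied from the two result tables.
results = {
    "mean_ttft_ms": (3434.62, 1422.21),
    "mean_e2e_ms": (23228.47, 21281.47),
    "mean_tpot_ms": (19.35, 19.41),
    "total_tok_per_s": (2114.43, 2307.58),
}

changes = {name: pct_change(before, after) for name, (before, after) in results.items()}

for name, delta in changes.items():
    print(f"{name}: {delta:+.1f}%")
```

Mean TTFT falls by roughly 59% and total token throughput rises by about 9%, while mean TPOT is essentially unchanged, which suggests the gain in this run comes from the prefill phase.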

GLM-5 launches successfully on B200 with this change.

TP8 GSM8k 8shot:
Accuracy: 0.953
Invalid: 0.000
Latency: 109.110 s
Output throughput: 1258.606 token/s

DP8 GSM8k 8shot:
Accuracy: 0.955
Invalid: 0.000
Latency: 24.158 s
Output throughput: 5707.262 token/s

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@github-actions github-actions Bot added the documentation label Mar 6, 2026
@Fridge003 Fridge003 marked this pull request as draft March 6, 2026 22:29
@Fridge003 Fridge003 marked this pull request as ready for review March 6, 2026 23:47

@Fridge003
Collaborator Author

/tag-and-rerun-ci

@github-actions github-actions Bot added the run-ci label Mar 6, 2026
@Fridge003
Collaborator Author

/rerun-stage stage-c-test-8-gpu-h200

@Fridge003
Collaborator Author

/rerun-stage stage-c-test-4-gpu-b200

@github-actions
Contributor

github-actions Bot commented Mar 7, 2026

✅ Triggered stage-c-test-8-gpu-h200 to run independently (skipping dependencies).

@github-actions
Contributor

github-actions Bot commented Mar 7, 2026

✅ Triggered stage-c-test-4-gpu-b200 to run independently (skipping dependencies).

@Fridge003
Collaborator Author

/rerun-stage stage-c-test-4-gpu-b200

@github-actions
Contributor

github-actions Bot commented Mar 9, 2026

✅ Triggered stage-c-test-4-gpu-b200 to run independently (skipping dependencies).

@Fridge003 Fridge003 force-pushed the baizhou/v32_dense_control branch from dd6d6c7 to 1160474 March 9, 2026 20:41
@Fridge003 Fridge003 merged commit be63f98 into main Mar 9, 2026
55 of 105 checks passed
@Fridge003 Fridge003 deleted the baizhou/v32_dense_control branch March 9, 2026 21:36
liubiyongge pushed a commit to liubiyongge/sglang that referenced this pull request Mar 13, 2026
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026
JustinTong0323 pushed a commit to JustinTong0323/sglang that referenced this pull request Apr 7, 2026
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026

Labels

deepseek, documentation, run-ci
