
[HiSparse] Optimize the scheduling of decode backup. #21932

Merged
xiezhq-hermann merged 12 commits into sgl-project:main from antgroup:hisparse_fix
Apr 7, 2026

Conversation

@huangtingwei9988 (Collaborator) commented Apr 2, 2026

Motivation

In overlap scheduling, backing up decode tokens inside the `prepare_for_decode` method currently has to wait for the forward stream to complete, which creates significant CPU bubbles. This PR moves the backup so that it is enqueued at the end of the forward pass, and verifies that the backup has finished just before the next forward iteration begins (see the sketch after the diagram below).

sequenceDiagram
    participant S as Scheduler
    participant C as HiSparseCoordinator
    participant P as Decode Producer Stream
    participant B as Decode Backup Stream

    S->>C: map_last_loc_to_buffer(...)
    C->>C: prepare previous-token backup metadata
    C->>B: enqueue backup work
    Note over B: Backup is queued immediately<br/>but waits on forward_done_event
    C->>C: grow device buffer / remap reserved slot

    P->>C: wait_for_pending_backup()
    Note over P,C: Before each decode pass,<br/>wait for the previous backup to finish
    C-->>P: clear pending_backup_done_event

    P->>P: run decode forward
    P->>C: note_decode_forward_done()
    C->>C: record decode_forward_done_event

    B->>B: wait(decode_forward_done_event)
    B->>B: host_locs = alloc(...)
    B->>B: req_to_host_pool[...] = host_locs
    B->>B: backup_from_device_all_layer(...)
    B->>C: record backup_done_event

    C->>C: publish pending_backup_done_event
    Note over C: The next decode pass<br/>waits on this event

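Below is a minimal sketch of the event-driven pattern in the diagram, assuming PyTorch CUDA streams and events. The class and method names (`DeferredDecodeBackup`, `wait_for_pending_backup`, `note_decode_forward_done`) mirror the diagram but are hypothetical; the real `HiSparseCoordinator` additionally handles host-slot allocation, multi-layer backup, and device-buffer remapping, all omitted here.

```python
import torch

class DeferredDecodeBackup:
    """Sketch: defer the device-to-host KV backup until after the decode forward."""

    def __init__(self) -> None:
        self.backup_stream = torch.cuda.Stream()  # dedicated backup stream
        self.forward_done = torch.cuda.Event()
        self.backup_done = torch.cuda.Event()
        self.has_pending_backup = False

    def wait_for_pending_backup(self) -> None:
        # Called on the forward stream just before the next decode pass.
        # The wait is enqueued on the GPU timeline; the CPU does not block.
        if self.has_pending_backup:
            torch.cuda.current_stream().wait_event(self.backup_done)
            self.has_pending_backup = False

    def note_decode_forward_done(
        self, device_kv: torch.Tensor, pinned_host_slice: torch.Tensor
    ) -> None:
        # Record that the decode forward finished, then enqueue the
        # device-to-host copy on the backup stream. The copy is ordered
        # after forward_done, so the scheduler thread never waits on it.
        self.forward_done.record(torch.cuda.current_stream())
        with torch.cuda.stream(self.backup_stream):
            self.backup_stream.wait_event(self.forward_done)
            pinned_host_slice.copy_(device_kv, non_blocking=True)
            self.backup_done.record(self.backup_stream)
        self.has_pending_backup = True
```

The key property is that both the copy and its ordering constraints live entirely on the GPU timeline: the scheduler's Python thread only records and waits on CUDA events, which is what removes the CPU bubble previously incurred inside `prepare_for_decode`.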

With this optimization, end-to-end mean TPOT improves by about 5% (33.89 ms → 32.07 ms in the benchmark below).

Benchmark

H20-96G

export FLASHINFER_DISABLE_VERSION_CHECK=1
python3 -m sglang.launch_server \
      --model-path /data/nas/yongke.zyk/model_hub/DeepSeek-V3.2-Exp \
      --trust-remote-code \
      --port 8188 \
      --host 0.0.0.0 \
      --chunked-prefill-size 4096 \
      --max-running-requests 160 \
      --tp-size 8  --load-balance-method round_robin --prefill-round-robin-balance \
      --page-size 64 \
      --mem-fraction-static 0.91 \
      --watchdog-timeout 10000 \
      --kv-cache-dtype bfloat16 --nsa-decode-backend flashmla_sparse  \
      --model-loader-extra-config='{"enable_multithread_load": "true","num_threads": 64}' \
      --nnodes 1 --node-rank 0  --disable-radix-cache --enable-hisparse --hicache-ratio 2 --hisparse-config '{"top_k": 2048, "device_buffer_size": 4096}'

bench serving

BATCH_SIZE=5
DATA_SIZE=30
RANDOM_INPUT=8192
RANDOM_OUTPUT=1500
WARM_UP_REQUESTS=2

for REQUEST_RATES in 5.6
do
python3 -m sglang.bench_serving \
      --host 10.13.3.162 --port 8188 \
      --backend sglang-oai-chat \
      --model /data/nas/shenghai.htw/DeepSeek-V3.2-Exp \
      --dataset-path /data/nas/moyun.zty/data/ShareGPT_V3_unfiltered_cleaned_split.json \
      --dataset-name random \
      --num-prompt $DATA_SIZE \
      --random-input $RANDOM_INPUT \
      --random-output $RANDOM_OUTPUT \
      --random-range-ratio 1 \
      --request-rate $REQUEST_RATES \
      --max-concurrency $BATCH_SIZE \
      --warmup-requests $WARM_UP_REQUESTS
done

before:

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    5.6       
Max request concurrency:                 5         
Successful requests:                     30        
Benchmark duration (s):                  348.05    
Total input tokens:                      245760    
Total input text tokens:                 245760    
Total generated tokens:                  45000     
Total generated tokens (retokenized):    39983     
Request throughput (req/s):              0.09      
Input token throughput (tok/s):          706.10    
Output token throughput (tok/s):         129.29    
Peak output token throughput (tok/s):    165.00    
Peak concurrent requests:                10        
Total token throughput (tok/s):          835.39    
Concurrency:                             5.00      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   57958.54  
Median E2E Latency (ms):                 57913.35  
P90 E2E Latency (ms):                    58639.01  
P99 E2E Latency (ms):                    58937.89  
---------------Time to First Token----------------
Mean TTFT (ms):                          7160.04   
Median TTFT (ms):                        7709.10   
P99 TTFT (ms):                           10149.52  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          33.89     
Median TPOT (ms):                        33.51     
P99 TPOT (ms):                           36.86     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           33.94     
Median ITL (ms):                         30.71     
P95 ITL (ms):                            46.76     
P99 ITL (ms):                            54.12     
Max ITL (ms):                            7489.29   
==================================================

after:

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    5.6       
Max request concurrency:                 5         
Successful requests:                     30        
Benchmark duration (s):                  331.69    
Total input tokens:                      245760    
Total input text tokens:                 245760    
Total generated tokens:                  45000     
Total generated tokens (retokenized):    39230     
Request throughput (req/s):              0.09      
Input token throughput (tok/s):          740.94    
Output token throughput (tok/s):         135.67    
Peak output token throughput (tok/s):    175.00    
Peak concurrent requests:                10        
Total token throughput (tok/s):          876.61    
Concurrency:                             5.00      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   55234.13  
Median E2E Latency (ms):                 55181.58  
P90 E2E Latency (ms):                    55790.93  
P99 E2E Latency (ms):                    56089.99  
---------------Time to First Token----------------
Mean TTFT (ms):                          7168.33   
Median TTFT (ms):                        7703.16   
P99 TTFT (ms):                           10233.19  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          32.07     
Median TPOT (ms):                        31.71     
P99 TPOT (ms):                           34.97     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           32.12     
Median ITL (ms):                         29.93     
P95 ITL (ms):                            38.50     
P99 ITL (ms):                            47.34     
Max ITL (ms):                            7521.45   
==================================================

Accuracy Tests

gsm8k

before

#python3 bench_sglang.py --num-questions 200 --port 8188
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:54<00:00,  3.67it/s]
Accuracy: 0.975
Invalid: 0.000
Latency: 54.487 s
Output throughput: 335.898 token/s

after

#python3 bench_sglang.py --num-questions 200 --port 8188
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:53<00:00,  3.76it/s]
Accuracy: 0.975
Invalid: 0.000
Latency: 53.207 s
Output throughput: 346.646 token/s

Speed Tests and Profiling

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

@gemini-code-assist (Bot) left a comment

Code Review

This pull request implements an asynchronous backup mechanism for decode tokens within the HiSparseCoordinator, aiming to overlap host memory transfers with model execution using a dedicated stream and CUDA events. The review feedback identifies a critical issue regarding Tensor Parallelism where backups might be skipped on non-scheduler ranks, and suggests several performance optimizations, such as using collections.deque for the pending backup queue and removing redundant tensor operations like .clone(), .contiguous(), and inefficient list comprehensions.
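
On the deque suggestion specifically: for a FIFO of pending backup work, popping from the head of a Python list is O(n) while `collections.deque.popleft()` is O(1). A small, self-contained illustration (the queue contents are hypothetical, not from this PR):

```python
from collections import deque

# A FIFO of pending backup work items (contents are illustrative).
pending_list = [("req", i) for i in range(4)]
pending_list.pop(0)      # O(n): every remaining element shifts left

pending_deque = deque(("req", i) for i in range(4))
pending_deque.popleft()  # O(1): no shifting
```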

Comment thread python/sglang/srt/managers/schedule_batch.py
Comment thread python/sglang/srt/managers/hisparse_coordinator.py Outdated
Comment thread python/sglang/srt/managers/hisparse_coordinator.py Outdated
Comment thread python/sglang/srt/managers/hisparse_coordinator.py Outdated
Comment thread python/sglang/srt/managers/schedule_batch.py Outdated
Comment thread python/sglang/srt/managers/hisparse_coordinator.py Outdated
Comment thread python/sglang/srt/managers/hisparse_coordinator.py Outdated
Comment thread python/sglang/srt/model_executor/model_runner.py
Comment thread python/sglang/srt/model_executor/model_runner.py Outdated
huangtingwei9988 and others added 2 commits April 4, 2026 17:48
Co-authored-by: hzh0425 <hzh0425@apache.org>
Co-authored-by: hzh0425 <hzh0425@apache.org>
Comment thread python/sglang/srt/managers/hisparse_coordinator.py Outdated
huangtingwei9988 and others added 6 commits April 6, 2026 12:28
Co-authored-by: hzh0425 <hzh0425@apache.org>
Co-authored-by: hzh0425 <hzh0425@apache.org>
Co-authored-by: hzh0425 <hzh0425@apache.org>
Co-authored-by: hzh0425 <hzh0425@apache.org>
@github-actions Bot added labels: documentation, quant, amd, dependencies, lora, multi-modal, deepseek, speculative-decoding, hicache, sgl-kernel, blackwell, npu, diffusion, model-gateway, mthreads, jit-kernel (Apr 7, 2026)
@xiezhq-hermann removed labels: documentation, quant (Apr 7, 2026)
@hzh0425 (Collaborator) left a comment

Looks good

@huangtingwei9988 (Collaborator, Author):

/rerun-failed-ci

@hnyls2002 hnyls2002 mentioned this pull request Apr 29, 2026

Labels

hicache, ready-to-merge, run-ci
