
[HiSparse] Optimize the scheduling of decode backup. #21932

Merged
xiezhq-hermann merged 12 commits into sgl-project:main from antgroup:hisparse_fix
Apr 7, 2026

Conversation

@huangtingwei9988 (Collaborator) commented Apr 2, 2026

Motivation

In overlap scheduling, backing up decode tokens inside the `prepare_for_decode` method currently has to wait for the forward stream to complete, which creates significant CPU bubbles. This PR moves the backup so that it is enqueued at the end of the forward pass, and verifies that the backup has finished just before the next forward iteration begins (see the sketch after the diagram below).

sequenceDiagram
    participant S as Scheduler
    participant C as HiSparseCoordinator
    participant P as Decode Producer Stream
    participant B as Decode Backup Stream

    S->>C: map_last_loc_to_buffer(...)
    C->>C: prepare previous-token backup metadata
    C->>B: enqueue backup work
    Note over B: Backup is queued immediately<br/>but waits on forward_done_event
    C->>C: grow device buffer / remap reserved slot

    P->>C: wait_for_pending_backup()
    Note over P,C: Before each decode pass,<br/>wait for the previous backup to finish
    C-->>P: clear pending_backup_done_event

    P->>P: run decode forward
    P->>C: note_decode_forward_done()
    C->>C: record decode_forward_done_event

    B->>B: wait(decode_forward_done_event)
    B->>B: host_locs = alloc(...)
    B->>B: req_to_host_pool[...] = host_locs
    B->>B: backup_from_device_all_layer(...)
    B->>C: record backup_done_event

    C->>C: publish pending_backup_done_event
    Note over C: The next decode pass<br/>waits on this event

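Below is a minimal sketch of the event-driven pattern in the diagram, assuming PyTorch CUDA streams and events. The class and method names (`DeferredDecodeBackup`, `wait_for_pending_backup`, `note_decode_forward_done`) mirror the diagram but are hypothetical; the real `HiSparseCoordinator` additionally handles host-slot allocation, multi-layer backup, and device-buffer remapping, all omitted here.

```python
import torch

class DeferredDecodeBackup:
    """Sketch: defer the device-to-host KV backup until after the decode forward."""

    def __init__(self) -> None:
        self.backup_stream = torch.cuda.Stream()  # dedicated backup stream
        self.forward_done = torch.cuda.Event()
        self.backup_done = torch.cuda.Event()
        self.has_pending_backup = False

    def wait_for_pending_backup(self) -> None:
        # Called on the forward stream just before the next decode pass.
        # The wait is enqueued on the GPU timeline; the CPU does not block.
        if self.has_pending_backup:
            torch.cuda.current_stream().wait_event(self.backup_done)
            self.has_pending_backup = False

    def note_decode_forward_done(
        self, device_kv: torch.Tensor, pinned_host_slice: torch.Tensor
    ) -> None:
        # Record that the decode forward finished, then enqueue the
        # device-to-host copy on the backup stream. The copy is ordered
        # after forward_done, so the scheduler thread never waits on it.
        self.forward_done.record(torch.cuda.current_stream())
        with torch.cuda.stream(self.backup_stream):
            self.backup_stream.wait_event(self.forward_done)
            pinned_host_slice.copy_(device_kv, non_blocking=True)
            self.backup_done.record(self.backup_stream)
        self.has_pending_backup = True
```

The key property is that both the copy and its ordering constraints live entirely on the GPU timeline: the scheduler's Python thread only records and waits on CUDA events, which is what removes the CPU bubble previously incurred inside `prepare_for_decode`.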

With this optimization, end-to-end mean TPOT improves by about 5% (33.89 ms → 32.07 ms in the benchmark below).

Benchmark

H20-96G

export FLASHINFER_DISABLE_VERSION_CHECK=1
python3 -m sglang.launch_server \
      --model-path /data/nas/yongke.zyk/model_hub/DeepSeek-V3.2-Exp \
      --trust-remote-code \
      --port 8188 \
      --host 0.0.0.0 \
      --chunked-prefill-size 4096 \
      --max-running-requests 160 \
      --tp-size 8  --load-balance-method round_robin --prefill-round-robin-balance \
      --page-size 64 \
      --mem-fraction-static 0.91 \
      --watchdog-timeout 10000 \
      --kv-cache-dtype bfloat16 --nsa-decode-backend flashmla_sparse  \
      --model-loader-extra-config='{"enable_multithread_load": "true","num_threads": 64}' \
      --nnodes 1 --node-rank 0  --disable-radix-cache --enable-hisparse --hicache-ratio 2 --hisparse-config '{"top_k": 2048, "device_buffer_size": 4096}'

bench serving

BATCH_SIZE=5
DATA_SIZE=30
RANDOM_INPUT=8192
RANDOM_OUTPUT=1500
WARM_UP_REQUESTS=2

for REQUEST_RATES in 5.6
do
python3 -m sglang.bench_serving \
      --host 10.13.3.162 --port 8188 \
      --backend sglang-oai-chat \
      --model /data/nas/shenghai.htw/DeepSeek-V3.2-Exp \
      --dataset-path /data/nas/moyun.zty/data/ShareGPT_V3_unfiltered_cleaned_split.json \
      --dataset-name random \
      --num-prompt $DATA_SIZE \
      --random-input $RANDOM_INPUT \
      --random-output $RANDOM_OUTPUT \
      --random-range-ratio 1 \
      --request-rate $REQUEST_RATES \
      --max-concurrency $BATCH_SIZE \
      --warmup-requests $WARM_UP_REQUESTS
done

before:

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    5.6       
Max request concurrency:                 5         
Successful requests:                     30        
Benchmark duration (s):                  348.05    
Total input tokens:                      245760    
Total input text tokens:                 245760    
Total generated tokens:                  45000     
Total generated tokens (retokenized):    39983     
Request throughput (req/s):              0.09      
Input token throughput (tok/s):          706.10    
Output token throughput (tok/s):         129.29    
Peak output token throughput (tok/s):    165.00    
Peak concurrent requests:                10        
Total token throughput (tok/s):          835.39    
Concurrency:                             5.00      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   57958.54  
Median E2E Latency (ms):                 57913.35  
P90 E2E Latency (ms):                    58639.01  
P99 E2E Latency (ms):                    58937.89  
---------------Time to First Token----------------
Mean TTFT (ms):                          7160.04   
Median TTFT (ms):                        7709.10   
P99 TTFT (ms):                           10149.52  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          33.89     
Median TPOT (ms):                        33.51     
P99 TPOT (ms):                           36.86     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           33.94     
Median ITL (ms):                         30.71     
P95 ITL (ms):                            46.76     
P99 ITL (ms):                            54.12     
Max ITL (ms):                            7489.29   
==================================================

after:

============ Serving Benchmark Result ============
Backend:                                 sglang-oai-chat
Traffic request rate:                    5.6       
Max request concurrency:                 5         
Successful requests:                     30        
Benchmark duration (s):                  331.69    
Total input tokens:                      245760    
Total input text tokens:                 245760    
Total generated tokens:                  45000     
Total generated tokens (retokenized):    39230     
Request throughput (req/s):              0.09      
Input token throughput (tok/s):          740.94    
Output token throughput (tok/s):         135.67    
Peak output token throughput (tok/s):    175.00    
Peak concurrent requests:                10        
Total token throughput (tok/s):          876.61    
Concurrency:                             5.00      
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   55234.13  
Median E2E Latency (ms):                 55181.58  
P90 E2E Latency (ms):                    55790.93  
P99 E2E Latency (ms):                    56089.99  
---------------Time to First Token----------------
Mean TTFT (ms):                          7168.33   
Median TTFT (ms):                        7703.16   
P99 TTFT (ms):                           10233.19  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          32.07     
Median TPOT (ms):                        31.71     
P99 TPOT (ms):                           34.97     
---------------Inter-Token Latency----------------
Mean ITL (ms):                           32.12     
Median ITL (ms):                         29.93     
P95 ITL (ms):                            38.50     
P99 ITL (ms):                            47.34     
Max ITL (ms):                            7521.45   
==================================================

Accuracy Tests

gsm8k

before

#python3 bench_sglang.py --num-questions 200 --port 8188
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:54<00:00,  3.67it/s]
Accuracy: 0.975
Invalid: 0.000
Latency: 54.487 s
Output throughput: 335.898 token/s

after

#python3 bench_sglang.py --num-questions 200 --port 8188
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:53<00:00,  3.76it/s]
Accuracy: 0.975
Invalid: 0.000
Latency: 53.207 s
Output throughput: 346.646 token/s

Speed Tests and Profiling

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

@gemini-code-assist (Bot) left a comment

Code Review

This pull request implements an asynchronous backup mechanism for decode tokens within the HiSparseCoordinator, aiming to overlap host memory transfers with model execution using a dedicated stream and CUDA events. The review feedback identifies a critical issue regarding Tensor Parallelism where backups might be skipped on non-scheduler ranks, and suggests several performance optimizations, such as using collections.deque for the pending backup queue and removing redundant tensor operations like .clone(), .contiguous(), and inefficient list comprehensions.
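
On the deque suggestion specifically: for a FIFO of pending backup work, popping from the head of a Python list is O(n) while `collections.deque.popleft()` is O(1). A small, self-contained illustration (the queue contents are hypothetical, not from this PR):

```python
from collections import deque

# A FIFO of pending backup work items (contents are illustrative).
pending_list = [("req", i) for i in range(4)]
pending_list.pop(0)      # O(n): every remaining element shifts left

pending_deque = deque(("req", i) for i in range(4))
pending_deque.popleft()  # O(1): no shifting
```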

Comment thread python/sglang/srt/managers/schedule_batch.py
Comment thread python/sglang/srt/managers/hisparse_coordinator.py Outdated
Comment thread python/sglang/srt/managers/hisparse_coordinator.py Outdated
Comment thread python/sglang/srt/managers/hisparse_coordinator.py Outdated
Comment thread python/sglang/srt/managers/schedule_batch.py Outdated
Comment thread python/sglang/srt/managers/hisparse_coordinator.py Outdated
Comment thread python/sglang/srt/managers/hisparse_coordinator.py Outdated
Comment thread python/sglang/srt/model_executor/model_runner.py
Comment thread python/sglang/srt/model_executor/model_runner.py Outdated
huangtingwei9988 and others added 2 commits April 4, 2026 17:48
Co-authored-by: hzh0425 <hzh0425@apache.org>
Co-authored-by: hzh0425 <hzh0425@apache.org>
Comment thread python/sglang/srt/managers/hisparse_coordinator.py Outdated
huangtingwei9988 and others added 6 commits April 6, 2026 12:28
Co-authored-by: hzh0425 <hzh0425@apache.org>
Co-authored-by: hzh0425 <hzh0425@apache.org>
Co-authored-by: hzh0425 <hzh0425@apache.org>
Co-authored-by: hzh0425 <hzh0425@apache.org>
@github-actions Bot added labels: documentation, quant, amd, dependencies, lora, multi-modal, deepseek, speculative-decoding, hicache, sgl-kernel, blackwell, npu, diffusion, model-gateway, mthreads, jit-kernel (Apr 7, 2026)
@xiezhq-hermann removed labels: documentation, quant (Apr 7, 2026)
@hzh0425 (Collaborator) left a comment

Looks good

@huangtingwei9988 (Collaborator, Author):

/rerun-failed-ci

@hnyls2002 hnyls2002 mentioned this pull request Apr 29, 2026

Labels

hicache, ready-to-merge, run-ci
