
Optimize the update flashinfer indices #1262

Merged
Ying1123 merged 5 commits into sgl-project:main from xiaobochen123:opt/cpu on Sep 1, 2024

Conversation

@xiaobochen123 (Contributor) commented Aug 30, 2024

When running a large batch, sglang also hits a CPU bottleneck. One such bottleneck occurs when updating the flashinfer KV indices: the naive PyTorch implementation is slow when the batch is very large.
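For context, flashinfer addresses the paged KV cache through a ragged `(kv_indptr, kv_indices)` pair, which has to be rebuilt from the running requests every step. A toy numpy sketch of that layout (variable names are illustrative, not sglang's actual code):

```python
import numpy as np

# Toy setup: the KV pool maps (request row, position) -> token slot.
req_to_token = np.arange(12).reshape(3, 4)   # 3 request rows, 4 positions each
req_pool_indices = np.array([2, 0])          # pool rows of the 2 running requests
seq_lens = np.array([3, 2])                  # current KV length per request

# kv_indptr: prefix sums marking where each request's slots start/end.
kv_indptr = np.concatenate(([0], np.cumsum(seq_lens)))

# kv_indices: all requests' token slots flattened into one array.
kv_indices = np.concatenate(
    [req_to_token[r, :n] for r, n in zip(req_pool_indices, seq_lens)]
)

print(kv_indptr.tolist())   # [0, 3, 5]
print(kv_indices.tolist())  # [8, 9, 10, 0, 1]
```

Doing this flattening request-by-request on the CPU is what becomes expensive once thousands of requests are running.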

  • Hardware: 1xH800
  • Model: Llama-3-8B
# launch server
python3 -m sglang.launch_server    \
    --trust-remote-code                  \
    --disable-cuda-graph    \
    --model  xxxx   \
    --context-length 4096    \
    --max-running-requests 4096        \
    --tensor-parallel-size 1        \
    --chunked-prefill-size -1    \
    --disable-radix-cache 

# test
python3 bench_serving.py \
        --backend sglang    \
        --tokenizer xxxxx    \
        --dataset-name random     \
        --num-prompts 5000    \
        --random-output-len 128 \
        --random-input-len 20  

When the running batch is very large, this PR reduces CPU time between steps (decoding stage) by about 30% and improves end-to-end performance by about 10%.


@zhyncs (Collaborator) commented Aug 30, 2024

Hi @xiaobochen123, nice work! Could you take a look at the unit test failure?

@leo6022 commented Aug 30, 2024

> Hi @xiaobochen123, nice work! Could you take a look at the unit test failure?

@zhyncs Yes, I am looking at it.

zhyncs changed the title from "Optimize the update flash-infer indices" to "Optimize the update flashinfer indices" on Aug 30, 2024
@merrymercy (Contributor)

@xiaobochen123 @leo6022 This is pretty good! How did you find this bottleneck? Can we fix the test cases and merge this as soon as possible?

@xiaobochen123 (Contributor, Author)

@merrymercy I used nsys to profile the server and found it.

The original code hit a Triton error (cause unknown), so I wrote it in a different way to avoid the error.
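The approach in this PR is to move the per-request copy into a Triton kernel, with one program per running request. A hedged sketch of that shape of kernel (the name, signature, and block size are assumptions, not the exact code merged here):

```python
# Illustrative sketch only: names and arguments are assumptions, not the exact
# code in this PR. Requires a GPU to run.
import triton
import triton.language as tl

@triton.jit
def create_kv_indices_kernel(
    req_to_token_ptr,      # [max_requests, max_context_len] token-slot table
    req_pool_indices_ptr,  # [batch] pool row of each running request
    seq_lens_ptr,          # [batch] current KV length per request
    kv_indptr_ptr,         # [batch + 1] prefix sums of seq_lens
    kv_indices_ptr,        # [sum(seq_lens)] flat output
    req_to_token_stride,   # row stride of req_to_token
    BLOCK: tl.constexpr,
):
    pid = tl.program_id(axis=0)                  # one program per request
    req_idx = tl.load(req_pool_indices_ptr + pid)
    seq_len = tl.load(seq_lens_ptr + pid)
    out_start = tl.load(kv_indptr_ptr + pid)

    # Copy this request's token slots in BLOCK-sized chunks.
    for i in range(tl.cdiv(seq_len, BLOCK)):
        offs = i * BLOCK + tl.arange(0, BLOCK)
        mask = offs < seq_len
        vals = tl.load(
            req_to_token_ptr + req_idx * req_to_token_stride + offs, mask=mask
        )
        tl.store(kv_indices_ptr + out_start + offs, vals, mask=mask)

# Host side (sketch): launch one program per running request, e.g.
# create_kv_indices_kernel[(batch_size,)](..., BLOCK=512)
```

This replaces a Python-level loop of thousands of small tensor copies with a single kernel launch, which is why the CPU time between steps drops.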

@zhyncs (Collaborator) commented Aug 30, 2024

@xiaobochen123 Could you post the latest benchmark result?

@zhyncs (Collaborator) commented Aug 30, 2024

CI E2E Test

this PR

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Successful requests:                     400       
Benchmark duration (s):                  159.43    
Total input tokens:                      817076    
Total generated tokens:                  408097    
Total generated tokens (retokenized):    408188    
Request throughput (req/s):              2.51      
Input token throughput (tok/s):          5124.83   
Output token throughput (tok/s):         2559.65   
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   98060.09  
Median E2E Latency (ms):                 105310.69 
---------------Time to First Token----------------
Mean TTFT (ms):                          45585.95  
Median TTFT (ms):                        33951.94  
P99 TTFT (ms):                           66.38 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          57.78     
Median TPOT (ms):                        57.19     
P99 TPOT (ms):                           139.83    
---------------Inter-token Latency----------------
Mean ITL (ms):                           51.49     
Median ITL (ms):                         47.75     
P99 ITL (ms):                            194.01    
==================================================

main

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Successful requests:                     400       
Benchmark duration (s):                  160.64    
Total input tokens:                      817076    
Total generated tokens:                  408097    
Total generated tokens (retokenized):    408200    
Request throughput (req/s):              2.49      
Input token throughput (tok/s):          5086.49   
Output token throughput (tok/s):         2540.50   
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   97010.70  
Median E2E Latency (ms):                 102937.17 
---------------Time to First Token----------------
Mean TTFT (ms):                          43815.59  
Median TTFT (ms):                        32881.86  
P99 TTFT (ms):                           114076.64 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          57.97     
Median TPOT (ms):                        58.53     
P99 TPOT (ms):                           183.71    
---------------Inter-token Latency----------------
Mean ITL (ms):                           52.20     
Median ITL (ms):                         48.81     
P99 ITL (ms):                            184.93    
==================================================

Based on the CI results, there is almost no improvement. Is the performance of the new implementation as expected? @xiaobochen123

@xiaobochen123 (Contributor, Author)

@zhyncs You're not testing with enough concurrency and batch size. The CPU bottleneck only shows up at very high concurrency, e.g. 4000+ running requests in my tests. That is why my bench_serving.py run sets input-len=32 and output-len=128: to keep a very large number of requests running concurrently.

My new profile result:

  • Base: QPS=178, in-throughput=2961, out-throughput=11486
  • This PR: QPS=190, in-throughput=3155, out-throughput=12237
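A quick arithmetic check of the relative improvement implied by these numbers (a sanity check on the reported figures, not part of the PR):

```python
# Numbers copied from the profile result above.
base = {"qps": 178, "in_tps": 2961, "out_tps": 11486}
pr = {"qps": 190, "in_tps": 3155, "out_tps": 12237}

# Relative improvement of the PR over base on each metric.
speedup = {k: pr[k] / base[k] - 1 for k in base}
print({k: f"{v:.1%}" for k, v in speedup.items()})  # roughly 6-7% on each metric
```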

@zhyncs (Collaborator) commented Aug 30, 2024

> my bench_serving.py set input-len=32 and output-len=128

Where is this information mentioned?

@zhyncs (Collaborator) commented Aug 30, 2024

> (Quoting the CI E2E Test results posted above.)

In this case, the Median TTFT even got worse.

@zhyncs (Collaborator) commented Aug 30, 2024

Hold on, I'll verify with your benchmark settings.

@zhyncs (Collaborator) commented Aug 30, 2024

# H100 SXM

# server
python3 -m sglang.launch_server --model meta-llama/Meta-Llama-3.1-8B-Instruct --disable-radix-cache

# client
python3 -m sglang.bench_serving --backend sglang --dataset-name random  --num-prompts 5000 --random-output-len 128 --random-input-len 32

# main
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Successful requests:                     5000
Benchmark duration (s):                  26.01
Total input tokens:                      82922
Total generated tokens:                  321605
Total generated tokens (retokenized):    321028
Request throughput (req/s):              192.25
Input token throughput (tok/s):          3188.31
Output token throughput (tok/s):         12365.56
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   18592.04
Median E2E Latency (ms):                 20878.07
---------------Time to First Token----------------
Mean TTFT (ms):                          10666.49
Median TTFT (ms):                        8186.85
P99 TTFT (ms):                           21438.90
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          129.69
Median TPOT (ms):                        152.35
P99 TPOT (ms):                           182.63
---------------Inter-token Latency----------------
Mean ITL (ms):                           145.53
Median ITL (ms):                         97.44
P99 ITL (ms):                            540.48
==================================================

# pr
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Successful requests:                     5000
Benchmark duration (s):                  26.41
Total input tokens:                      82922
Total generated tokens:                  321605
Total generated tokens (retokenized):    321065
Request throughput (req/s):              189.31
Input token throughput (tok/s):          3139.66
Output token throughput (tok/s):         12176.88
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   18240.40
Median E2E Latency (ms):                 20422.47
---------------Time to First Token----------------
Mean TTFT (ms):                          10233.91
Median TTFT (ms):                        6640.07
P99 TTFT (ms):                           23509.95
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          131.43
Median TPOT (ms):                        159.41
P99 TPOT (ms):                           191.21
---------------Inter-token Latency----------------
Mean ITL (ms):                           150.97
Median ITL (ms):                         106.25
P99 ITL (ms):                            570.93
==================================================

This is my benchmark result on the H100 SXM; compared to main, there is no improvement, even some decline. I think this PR still needs further confirmation of the details. cc @merrymercy @Ying1123

@xiaobochen123 (Contributor, Author)

@zhyncs I profiled the Triton kernel and the native PyTorch implementation. With batch=4096 and max_context_len=4096, the Triton kernel took only 70 µs, while the native PyTorch implementation took 15 ms.
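A gap of that size is plausible: the native path issues thousands of small slice-copies on the CPU, one per request, while a fused kernel (or a single vectorized gather) does the whole copy at once. A numpy illustration of the two shapes of the computation (illustrative names, not the PR's code):

```python
import numpy as np

def copy_indices_naive(req_to_token, req_pool_indices, seq_lens, kv_indptr):
    # One small slice-copy per request: ~batch_size separate ops on the CPU.
    out = np.empty(kv_indptr[-1], dtype=req_to_token.dtype)
    for i in range(len(seq_lens)):
        out[kv_indptr[i]:kv_indptr[i + 1]] = \
            req_to_token[req_pool_indices[i], :seq_lens[i]]
    return out

def copy_indices_vectorized(req_to_token, req_pool_indices, seq_lens, kv_indptr):
    # One fancy-indexing gather for the whole batch.
    rows = np.repeat(req_pool_indices, seq_lens)
    cols = np.arange(kv_indptr[-1]) - np.repeat(kv_indptr[:-1], seq_lens)
    return req_to_token[rows, cols]
```

Both produce the same flat kv_indices; the difference is whether the work is dispatched once or once per request, which is exactly what dominates at batch 4096.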

I tested the server a few times and found that the performance fluctuates. I will look into the reason.

@Ying1123 (Contributor) commented Aug 31, 2024

I also observe non-trivial fluctuations. Overall, the e2e performance improvement could be ~2-3%. The code change is straightforward, and although the performance check is not fully conclusive, I think this is a safe merge. @zhyncs

@Ying1123 Ying1123 merged commit d134c13 into sgl-project:main Sep 1, 2024
timethink pushed a commit to timethink/sglang that referenced this pull request Mar 9, 2025