use sgl_per_token_group_quant_fp8 kernel by BBuf · Pull Request #3493 · sgl-project/sglang

BBuf · 2025-02-11T16:22:31Z

End-to-end performance:

```shell
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code --port 30000

# Run the benchmark command twice:

python3 -m sglang.bench_serving --backend sglang --num-prompts 1000 --request-rate 8

# First run

main:

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    8.0       
Max reqeuest concurrency:                not set   
Successful requests:                     1000      
Benchmark duration (s):                  234.70    
Total input tokens:                      301701    
Total generated tokens:                  188375    
Total generated tokens (retokenized):    187534    
Request throughput (req/s):              4.26      
Input token throughput (tok/s):          1285.46   
Output token throughput (tok/s):         802.61    
Total token throughput (tok/s):          2088.07   
Concurrency:                             262.34    
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   61571.47  
Median E2E Latency (ms):                 52159.16  
---------------Time to First Token----------------
Mean TTFT (ms):                          2047.59   
Median TTFT (ms):                        630.04    
P99 TTFT (ms):                           19879.67  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          580.62    
Median TPOT (ms):                        461.36    
P99 TPOT (ms):                           2277.66   
---------------Inter-token Latency----------------
Mean ITL (ms):                           319.34    
Median ITL (ms):                         157.48    
P99 ITL (ms):                            3867.63   
==================================================

pr:

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    8.0       
Max reqeuest concurrency:                not set   
Successful requests:                     1000      
Benchmark duration (s):                  234.13    
Total input tokens:                      301701    
Total generated tokens:                  188375    
Total generated tokens (retokenized):    187556    
Request throughput (req/s):              4.27      
Input token throughput (tok/s):          1288.58   
Output token throughput (tok/s):         804.56    
Total token throughput (tok/s):          2093.14   
Concurrency:                             261.91    
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   61322.63  
Median E2E Latency (ms):                 51756.45  
---------------Time to First Token----------------
Mean TTFT (ms):                          2185.78   
Median TTFT (ms):                        605.41    
P99 TTFT (ms):                           21702.75  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          593.39    
Median TPOT (ms):                        484.39    
P99 TPOT (ms):                           2621.74   
---------------Inter-token Latency----------------
Mean ITL (ms):                           318.73    
Median ITL (ms):                         154.98    
P99 ITL (ms):                            2892.38   
==================================================

# Second run

main:

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    8.0       
Max reqeuest concurrency:                not set   
Successful requests:                     1000      
Benchmark duration (s):                  231.16    
Total input tokens:                      301701    
Total generated tokens:                  188375    
Total generated tokens (retokenized):    187505    
Request throughput (req/s):              4.33      
Input token throughput (tok/s):          1305.18   
Output token throughput (tok/s):         814.92    
Total token throughput (tok/s):          2120.10   
Concurrency:                             245.80    
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   56819.06  
Median E2E Latency (ms):                 47473.95  
---------------Time to First Token----------------
Mean TTFT (ms):                          834.04    
Median TTFT (ms):                        522.79    
P99 TTFT (ms):                           4564.58   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          465.60    
Median TPOT (ms):                        388.09    
P99 TPOT (ms):                           1439.86   
---------------Inter-token Latency----------------
Mean ITL (ms):                           299.96    
Median ITL (ms):                         157.45    
P99 ITL (ms):                            2379.64   
==================================================

pr:

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    8.0       
Max reqeuest concurrency:                not set   
Successful requests:                     1000      
Benchmark duration (s):                  229.98    
Total input tokens:                      301701    
Total generated tokens:                  188375    
Total generated tokens (retokenized):    187525    
Request throughput (req/s):              4.35      
Input token throughput (tok/s):          1311.87   
Output token throughput (tok/s):         819.10    
Total token throughput (tok/s):          2130.97   
Concurrency:                             242.99    
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   55881.33  
Median E2E Latency (ms):                 46573.34  
---------------Time to First Token----------------
Mean TTFT (ms):                          772.20    
Median TTFT (ms):                        513.58    
P99 TTFT (ms):                           3324.78   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          447.66    
Median TPOT (ms):                        388.91    
P99 TPOT (ms):                           1173.11   
---------------Inter-token Latency----------------
Mean ITL (ms):                           295.57    
Median ITL (ms):                         155.52    
P99 ITL (ms):                            2161.58   
==================================================
```

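In both runs the PR is slightly ahead of main: mean E2E latency improves by about 0.4% and 1.7%, and total token throughput by about 0.2% and 0.5%, which is an end-to-end speedup of approximately 1% for DeepSeek V3/R1.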
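For reference, the relative differences between main and the PR can be recomputed from the four reports above. This is a minimal sketch: the numbers are copied from the benchmark output, while the `pct_change` and `summarize` helpers are hypothetical names introduced only for this illustration.

```python
# (main, pr) pairs copied from the benchmark reports above.
runs = {
    "first": {
        "mean_e2e_ms": (61571.47, 61322.63),
        "median_e2e_ms": (52159.16, 51756.45),
        "total_tok_per_s": (2088.07, 2093.14),
    },
    "second": {
        "mean_e2e_ms": (56819.06, 55881.33),
        "median_e2e_ms": (47473.95, 46573.34),
        "total_tok_per_s": (2120.10, 2130.97),
    },
}


def pct_change(main: float, pr: float) -> float:
    """Relative change of the PR value versus main, in percent."""
    return (pr - main) / main * 100.0


def summarize(data):
    """Return {run: {metric: percent change}} rounded to two decimals."""
    return {
        run: {name: round(pct_change(*pair), 2) for name, pair in metrics.items()}
        for run, metrics in data.items()
    }


if __name__ == "__main__":
    for run, metrics in summarize(runs).items():
        for name, change in metrics.items():
            # Negative is better for latency, positive is better for throughput.
            print(f"{run:>6} {name:<16} {change:+.2f}%")
```

With only two runs per branch and visible run-to-run variance (compare the two main runs), these differences are best read as indicative rather than conclusive.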

@zhyncs

HandH1998 · 2025-02-13T02:35:00Z

@BBuf This kernel only supports Nvidia GPU for now, right?

BBuf · 2025-02-13T02:39:51Z

> @BBuf This kernel only supports Nvidia GPU for now, right?

Yeah, but I think supporting AMD GPUs would also be fairly simple; it only requires a slight rewrite of this kernel. However, I am not familiar with AMD development.

HandH1998 · 2025-02-13T02:51:27Z

Thanks. I think I will need to handle some AMD-specific cases when applying this kernel.
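For readers unfamiliar with the operation named in the PR title, here is a plain-Python sketch of what per-token-group FP8 quantization generally computes: each token's hidden vector is split into fixed-size groups, and each group gets its own scale derived from its absolute maximum. This is not the sgl-kernel implementation. The function name, the group size of 128, the `eps` floor, and the E4M3 maximum of 448 are assumptions for illustration, and the sketch scales and clamps values without emulating the final rounding to FP8.

```python
FP8_E4M3_MAX = 448.0  # largest finite E4M3 value; assumed target format


def per_token_group_quant_reference(x, group_size=128, fp8_max=FP8_E4M3_MAX, eps=1e-10):
    """Hypothetical reference: x is a list of token rows (lists of floats).

    Returns (scaled_rows, scale_rows). Scaled values are clamped to
    [-fp8_max, fp8_max] but NOT rounded to an actual FP8 encoding.
    """
    scaled_rows, scale_rows = [], []
    for row in x:
        if len(row) % group_size != 0:
            raise ValueError("hidden size must be divisible by group_size")
        scaled, scales = [], []
        for start in range(0, len(row), group_size):
            group = row[start:start + group_size]
            # One scale per (token, group): map the group's absmax onto fp8_max.
            absmax = max(max(abs(v) for v in group), eps)
            scale = absmax / fp8_max
            scaled.extend(min(max(v / scale, -fp8_max), fp8_max) for v in group)
            scales.append(scale)
        scaled_rows.append(scaled)
        scale_rows.append(scales)
    return scaled_rows, scale_rows


if __name__ == "__main__":
    q, s = per_token_group_quant_reference([[1.0, -2.0, 0.5, 4.0]], group_size=2)
    print(q, s)
```

Keeping `fp8_max` as a parameter means a different FP8 variant can be substituted without touching the grouping logic.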

BBuf added 2 commits February 12, 2025 00:18

upd

871ef0c

upd

14e5b1d

BBuf requested review from HaiShaw, Ying1123, ispobock, merrymercy and zhyncs as code owners February 11, 2025 16:22

BBuf changed the title ~~Use sgl per token group quant fp8 in Deepseek V3/R1~~ Use sgl_per_token_group_quant_fp8 kernel in Deepseek V3/R1 (Speed up 1% end2end latency) Feb 11, 2025

zhyncs changed the title ~~Use sgl_per_token_group_quant_fp8 kernel in Deepseek V3/R1 (Speed up 1% end2end latency)~~ use sgl_per_token_group_quant_fp8 kernel Feb 11, 2025

Merge branch 'main' into use_sgl_per_token_group_quant_fp8

11f8a22

zhyncs suggested changes Feb 11, 2025

Comment thread python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py Outdated

Comment thread python/sglang/srt/layers/quantization/fp8_kernel.py Outdated

Comment thread python/sglang/srt/layers/quantization/fp8_kernel.py

BBuf and others added 3 commits February 12, 2025 09:47

ud

8fe3f18

ud

2226ab3

Merge branch 'main' into use_sgl_per_token_group_quant_fp8

f90642a

HandH1998 mentioned this pull request Feb 12, 2025

Apply sgl w8a8 fp8 kernel #3148

Merged

zhyncs mentioned this pull request Feb 12, 2025

chore: bump 0.0.3.post4 sgl-kernel #3523

Merged

5 tasks

zhyncs added 2 commits February 12, 2025 17:29

Merge branch 'main' into use_sgl_per_token_group_quant_fp8

7da8359

upd

f5ff0e1

zhyncs approved these changes Feb 12, 2025

zhyncs merged commit 45e3a7b into main Feb 12, 2025

zhyncs deleted the use_sgl_per_token_group_quant_fp8 branch February 12, 2025 10:40

zhyncs merged 8 commits into main from use_sgl_per_token_group_quant_fp8

3 participants
