use sgl_per_token_group_quant_fp8 kernel #3493

Merged
zhyncs merged 8 commits into main from
use_sgl_per_token_group_quant_fp8
Feb 12, 2025

@BBuf
Collaborator

@BBuf BBuf commented Feb 11, 2025

End-to-end performance:

```shell
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code --port 30000

# Run the following benchmark command twice:

python3 -m sglang.bench_serving --backend sglang --num-prompts 1000 --request-rate 8

first time

main:

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    8.0       
Max reqeuest concurrency:                not set   
Successful requests:                     1000      
Benchmark duration (s):                  234.70    
Total input tokens:                      301701    
Total generated tokens:                  188375    
Total generated tokens (retokenized):    187534    
Request throughput (req/s):              4.26      
Input token throughput (tok/s):          1285.46   
Output token throughput (tok/s):         802.61    
Total token throughput (tok/s):          2088.07   
Concurrency:                             262.34    
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   61571.47  
Median E2E Latency (ms):                 52159.16  
---------------Time to First Token----------------
Mean TTFT (ms):                          2047.59   
Median TTFT (ms):                        630.04    
P99 TTFT (ms):                           19879.67  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          580.62    
Median TPOT (ms):                        461.36    
P99 TPOT (ms):                           2277.66   
---------------Inter-token Latency----------------
Mean ITL (ms):                           319.34    
Median ITL (ms):                         157.48    
P99 ITL (ms):                            3867.63   
==================================================

pr:

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    8.0       
Max reqeuest concurrency:                not set   
Successful requests:                     1000      
Benchmark duration (s):                  234.13    
Total input tokens:                      301701    
Total generated tokens:                  188375    
Total generated tokens (retokenized):    187556    
Request throughput (req/s):              4.27      
Input token throughput (tok/s):          1288.58   
Output token throughput (tok/s):         804.56    
Total token throughput (tok/s):          2093.14   
Concurrency:                             261.91    
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   61322.63  
Median E2E Latency (ms):                 51756.45  
---------------Time to First Token----------------
Mean TTFT (ms):                          2185.78   
Median TTFT (ms):                        605.41    
P99 TTFT (ms):                           21702.75  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          593.39    
Median TPOT (ms):                        484.39    
P99 TPOT (ms):                           2621.74   
---------------Inter-token Latency----------------
Mean ITL (ms):                           318.73    
Median ITL (ms):                         154.98    
P99 ITL (ms):                            2892.38   
==================================================

second time

main:

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    8.0       
Max reqeuest concurrency:                not set   
Successful requests:                     1000      
Benchmark duration (s):                  231.16    
Total input tokens:                      301701    
Total generated tokens:                  188375    
Total generated tokens (retokenized):    187505    
Request throughput (req/s):              4.33      
Input token throughput (tok/s):          1305.18   
Output token throughput (tok/s):         814.92    
Total token throughput (tok/s):          2120.10   
Concurrency:                             245.80    
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   56819.06  
Median E2E Latency (ms):                 47473.95  
---------------Time to First Token----------------
Mean TTFT (ms):                          834.04    
Median TTFT (ms):                        522.79    
P99 TTFT (ms):                           4564.58   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          465.60    
Median TPOT (ms):                        388.09    
P99 TPOT (ms):                           1439.86   
---------------Inter-token Latency----------------
Mean ITL (ms):                           299.96    
Median ITL (ms):                         157.45    
P99 ITL (ms):                            2379.64   
==================================================

pr:

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    8.0       
Max reqeuest concurrency:                not set   
Successful requests:                     1000      
Benchmark duration (s):                  229.98    
Total input tokens:                      301701    
Total generated tokens:                  188375    
Total generated tokens (retokenized):    187525    
Request throughput (req/s):              4.35      
Input token throughput (tok/s):          1311.87   
Output token throughput (tok/s):         819.10    
Total token throughput (tok/s):          2130.97   
Concurrency:                             242.99    
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   55881.33  
Median E2E Latency (ms):                 46573.34  
---------------Time to First Token----------------
Mean TTFT (ms):                          772.20    
Median TTFT (ms):                        513.58    
P99 TTFT (ms):                           3324.78   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          447.66    
Median TPOT (ms):                        388.91    
P99 TPOT (ms):                           1173.11   
---------------Inter-token Latency----------------
Mean ITL (ms):                           295.57    
Median ITL (ms):                         155.52    
P99 ITL (ms):                            2161.58   
==================================================
```

Both latency and throughput achieved an end-to-end speedup of approximately 1% in DeepSeek R1.
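
To put the "approximately 1%" figure in context, the relative change between main and this PR can be computed directly from the reported numbers. The following is a minimal sketch: the values are copied from the two benchmark runs above, while the `runs` layout and the `pct_change` helper are illustrative names and not part of sglang.

```python
# (main, pr) pairs copied from the Serving Benchmark Result blocks above.
runs = {
    "first": {
        "Total token throughput (tok/s)": (2088.07, 2093.14),
        "Mean E2E Latency (ms)": (61571.47, 61322.63),
    },
    "second": {
        "Total token throughput (tok/s)": (2120.10, 2130.97),
        "Mean E2E Latency (ms)": (56819.06, 55881.33),
    },
}


def pct_change(main: float, pr: float) -> float:
    """Relative change of the PR value versus main, in percent."""
    return (pr - main) / main * 100.0


def summarize(all_runs: dict) -> dict:
    """Return {run: {metric: percent change}} for every reported pair."""
    return {
        run: {metric: pct_change(main, pr) for metric, (main, pr) in metrics.items()}
        for run, metrics in all_runs.items()
    }


if __name__ == "__main__":
    for run, metrics in summarize(runs).items():
        for metric, delta in metrics.items():
            # Positive is better for throughput, negative is better for latency.
            print(f"{run:>6} run | {metric:<32} {delta:+.2f}%")
```

With only two runs per branch and visible run-to-run variation, these percentages are best read as a rough indication rather than a precise measurement.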

@zhyncs

@BBuf changed the title from "Use sgl per token group quant fp8 in Deepseek V3/R1" to "Use sgl_per_token_group_quant_fp8 kernel in Deepseek V3/R1 (Speed up 1% end2end latency)" Feb 11, 2025
@zhyncs changed the title from "Use sgl_per_token_group_quant_fp8 kernel in Deepseek V3/R1 (Speed up 1% end2end latency)" to "use sgl_per_token_group_quant_fp8 kernel" Feb 11, 2025
Comment thread python/sglang/srt/layers/moe/fused_moe_triton/fused_moe.py Outdated
Comment thread python/sglang/srt/layers/quantization/fp8_kernel.py Outdated
Comment thread python/sglang/srt/layers/quantization/fp8_kernel.py
@zhyncs zhyncs merged commit 45e3a7b into main Feb 12, 2025
@zhyncs zhyncs deleted the use_sgl_per_token_group_quant_fp8 branch February 12, 2025 10:40
@HandH1998
Collaborator

@BBuf This kernel only supports Nvidia GPU for now, right?

@BBuf
Collaborator Author

BBuf commented Feb 13, 2025

> @BBuf This kernel only supports Nvidia GPU for now, right?

Yeah, but I think supporting AMD GPUs would also be quite simple; it only requires a slight rewrite of this kernel. However, I am not familiar with AMD development.

@HandH1998
Collaborator

Thanks. I think I will need to handle some specific cases for AMD GPUs when applying this kernel.
