[Kernels] Improve H200 Fused MoE Config#28992

Merged
robertgshaw2-redhat merged 2 commits into main from robertgshaw2-redhat-patch-1
Nov 19, 2025

@robertgshaw2-redhat (Collaborator) commented Nov 19, 2025

Purpose

  • justfile
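MODEL := "deepseek-ai/DeepSeek-V3.1"
# PORT is used by bench_decode but was not defined in the original snippet;
# 8000 (the `vllm serve` default) is assumed here.
PORT := "8000"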

launch_vllm:
    VLLM_ALLREDUCE_USE_SYMM_MEM=0 VLLM_MOE_USE_DEEP_GEMM=0 VLLM_USE_DEEP_GEMM=1 VLLM_TORCH_PROFILER_DIR=$(pwd)/profiles-vllm-dsr1 chg run --gpus 8 -- vllm serve {{MODEL}} -tp 8

bench_decode BATCH_SIZE NUM_PROMPTS:
    vllm bench serve \
        --port {{PORT}} \
        --model {{MODEL}} \
        --dataset-name random \
        --random-input-len 2 \
        --random-output-len 100 \
        --max-concurrency {{BATCH_SIZE}} \
        --num-prompts {{NUM_PROMPTS}} \
        --seed $(date +%M%H%M%S) \
        --percentile-metrics ttft,tpot,itl \
        --ignore-eos

sweep_decode:
    just bench_decode 4 40 && \
    just bench_decode 8 80 && \
    just bench_decode 16 160 && \
    just bench_decode 32 320 && \
    just bench_decode 64 640 && \
    just bench_decode 128 1280
  • before (batch 16)
============ Serving Benchmark Result ============
Successful requests:                     160       
Failed requests:                         0         
Maximum request concurrency:             16        
Benchmark duration (s):                  23.38     
Total input tokens:                      160       
Total generated tokens:                  16000     
Request throughput (req/s):              6.84      
Output token throughput (tok/s):         684.26    
Peak output token throughput (tok/s):    704.00    
Peak concurrent requests:                32.00     
Total Token throughput (tok/s):          691.10    
---------------Time to First Token----------------
Mean TTFT (ms):                          44.03     
Median TTFT (ms):                        43.80     
P99 TTFT (ms):                           64.44     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          23.15     
Median TPOT (ms):                        23.13     
P99 TPOT (ms):                           23.41     
---------------Inter-token Latency----------------
Mean ITL (ms):                           23.15     
Median ITL (ms):                         23.16     
P99 ITL (ms):                            25.07     
==================================================
  • after (batch 16)
============ Serving Benchmark Result ============
Successful requests:                     160       
Failed requests:                         0         
Maximum request concurrency:             16        
Benchmark duration (s):                  19.72     
Total input tokens:                      160       
Total generated tokens:                  16000     
Request throughput (req/s):              8.12      
Output token throughput (tok/s):         811.55    
Peak output token throughput (tok/s):    832.00    
Peak concurrent requests:                32.00     
Total Token throughput (tok/s):          819.67    
---------------Time to First Token----------------
Mean TTFT (ms):                          40.21     
Median TTFT (ms):                        40.70     
P99 TTFT (ms):                           49.83     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          19.49     
Median TPOT (ms):                        19.50     
P99 TPOT (ms):                           19.59     
---------------Inter-token Latency----------------
Mean ITL (ms):                           19.49     
Median ITL (ms):                         19.47     
P99 ITL (ms):                            20.93     
==================================================
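
To make the before/after comparison easier to read, here is a minimal Python sketch that computes the relative change for a few of the batch-16 metrics. The numbers are copied from the two result blocks above; the dictionary keys and the `pct_change` helper are illustrative names, not part of vLLM or its benchmark tooling.

```python
# Metrics copied from the "before" and "after" batch-16 results above.
# Key names are hypothetical labels chosen for this sketch.
before = {
    "output_tok_per_s": 684.26,
    "mean_ttft_ms": 44.03,
    "mean_tpot_ms": 23.15,
    "p99_itl_ms": 25.07,
}
after = {
    "output_tok_per_s": 811.55,
    "mean_ttft_ms": 40.21,
    "mean_tpot_ms": 19.49,
    "p99_itl_ms": 20.93,
}


def pct_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


changes = {name: pct_change(before[name], after[name]) for name in before}

for name, delta in changes.items():
    print(f"{name:>18}: {before[name]:8.2f} -> {after[name]:8.2f} ({delta:+.1f}%)")
```

On these numbers, output token throughput rises by roughly 18.6% and mean TPOT drops by roughly 15.8%. This is a single run at one concurrency level, so treat the figures as indicative rather than a full characterization.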

Test Plan

Test Result


Update config

Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
@robertgshaw2-redhat marked this pull request as ready for review November 19, 2025 05:17
@robertgshaw2-redhat enabled auto-merge (squash) November 19, 2025 14:59
@github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) Nov 19, 2025
@robertgshaw2-redhat merged commit fe69f33 into main Nov 19, 2025
50 checks passed
@robertgshaw2-redhat deleted the robertgshaw2-redhat-patch-1 branch November 19, 2025 19:23
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
kitaekatt pushed a commit to kitaekatt/vllm that referenced this pull request Dec 1, 2025
Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>

3 participants