[AMD][ROCm] MoRI EP: a high-performance all2all backend #27273
alexsun07 wants to merge 0 commits into vllm-project:main
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Code Review
This pull request integrates MoRI, a high-performance all-to-all communication kernel, as a new backend for vLLM, primarily targeting AMD GPUs. The changes span across several files to add the necessary configurations, manager class, and logic to use this new backend. While the integration is mostly well-structured, I've identified a couple of areas for improvement related to code duplication and consistency, which I've detailed in the comments.
This pull request has merge conflicts that must be resolved before it can be merged.
CC @sunway513
Cc @houseroad @robertgshaw2-redhat @HAIAI for review.
Why was this closed?
Hi @mgoin, I accidentally force-pushed while rebasing onto upstream, and GitHub force-closed the PR. Do you know how to reopen it?
This PR was force-closed by GitHub and cannot be reopened, so I'll open a new PR instead: #28664. Sorry for the trouble. Please help review the new PR. Thanks!
Purpose
This PR integrates MoRI-EP, a high-performance all2all communication kernel, into vLLM as an all2all backend. See the MoRI project here. MoRI also supports CUDA graphs.
This PR follows the design of vLLM's Fused MoE Modular Kernel, which is composed of the following components:
[Router] → [Quantize-Dispatch] → [Permute-Experts-Unpermute] → [Combine]
For the MoRI+AITER path, AMD's recommended high-performance configuration, the pipeline becomes:
[Router] → [Quantize-Dispatch] → [Experts] → [Combine]
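The stages above can be sketched as a toy single-process flow. This is purely illustrative: the function names and shapes below are invented for clarity, and the real MoRI dispatch/combine kernels move quantized tokens between EP ranks on device, not Python lists.

```python
# Toy sketch of [Router] -> [Dispatch] -> [Experts] -> [Combine].
# All names here are illustrative, not vLLM's actual Modular Kernel API.

def route(scores, top_k):
    """Pick the top_k expert ids per token, with normalized gate weights."""
    routing = []
    for row in scores:
        ranked = sorted(range(len(row)), key=lambda e: row[e], reverse=True)[:top_k]
        total = sum(row[e] for e in ranked)
        routing.append([(e, row[e] / total) for e in ranked])
    return routing

def dispatch(routing):
    """Group (token, weight) pairs by expert id; in MoRI-EP this grouping
    is what the all2all dispatch kernel moves across EP ranks."""
    per_expert = {}
    for t, choices in enumerate(routing):
        for expert, weight in choices:
            per_expert.setdefault(expert, []).append((t, weight))
    return per_expert

def combine(tokens, per_expert, expert_fn):
    """Run each expert on its tokens and sum weighted outputs per token."""
    out = [0.0] * len(tokens)
    for expert, items in per_expert.items():
        for t, weight in items:
            out[t] += weight * expert_fn(expert, tokens[t])
    return out
```

In the MoRI+AITER path, the permute/unpermute steps are fused away because the dispatch kernel already delivers tokens in the layout the AITER expert kernels consume.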
Two new classes are introduced:
Summary of performance comparison between MoRI-EP and naive backend (bs=128 per DP rank):
How to install MoRI
See https://github.com/ROCm/mori
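After installing per the repo instructions, one quick sanity check is to verify the package is importable before selecting the backend. The top-level module name `mori` is an assumption here; consult the repo's README for the authoritative name.

```python
import importlib.util

def has_mori() -> bool:
    # True if the MoRI Python package is importable in this environment.
    # The module name "mori" is an assumption; check the ROCm/mori README.
    return importlib.util.find_spec("mori") is not None
```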
Test Plan
Test platform: MI300X
Accuracy
Serve on DeepSeek-V3/R1 (Block scale quant)
Serve on DeepSeek-R1-PTPC (per token per channel quant)
See here for more info about PTPC.
Evaluate by gsm8k
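One way to run this evaluation (a sketch only: the model name, port, and few-shot setting are assumptions) is lm-eval's `local-completions` backend pointed at the running vLLM server:

```shell
# Hedged sketch; model path and endpoint are assumptions.
lm_eval --model local-completions \
  --model_args model=deepseek-ai/DeepSeek-V3,base_url=http://127.0.0.1:8000/v1/completions \
  --tasks gsm8k \
  --num_fewshot 5
```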
Performance
Test EP8 and EP16 performance, compare with naive all2all backend
EP8 with mori backend
EP8 with naive backend:
Replace `--all2all-backend mori` with `--all2all-backend naive`.
EP16 with mori backend
EP16 with naive backend:
Replace `--all2all-backend mori` with `--all2all-backend naive`, and use `--enforce-eager`.
Benchmark:
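For reference, a serve invocation along these lines could be used. This is a sketch: the model path and parallelism flags are assumptions; only `--all2all-backend` is the selector this PR is concerned with.

```shell
# Hedged sketch of an EP8 launch; flags other than --all2all-backend
# are illustrative. For the naive baseline, swap mori for naive.
vllm serve deepseek-ai/DeepSeek-R1 \
  --data-parallel-size 8 \
  --enable-expert-parallel \
  --all2all-backend mori
```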
Use `--random-input-len 1 --random-prefix-len 1023` to simulate PD disaggregation and measure decode performance without prefill.
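A benchmark invocation consistent with this might look like the following sketch; only the two random-length flags come from the PR text, the rest (output length, concurrency) are assumptions.

```shell
# Hedged sketch; only --random-input-len/--random-prefix-len are
# quoted from this PR, the remaining flags are illustrative.
vllm bench serve \
  --dataset-name random \
  --random-input-len 1 \
  --random-prefix-len 1023 \
  --random-output-len 128 \
  --max-concurrency 128
```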
Test Result
Accuracy
MoRI-EP with DeepSeek-V3
MoRI-EP with DeepSeek-R1-PTPC
Decode Performance
Summary
EP8 mori all2all backend
EP8 naive all2all backend
EP16 mori all2all backend
EP16 naive all2all backend