[AMD][ROCm] MoRI EP: a high-performance all2all backend #27273
alexsun07 wants to merge 0 commits into vllm-project:main
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Code Review
This pull request integrates MoRI, a high-performance all-to-all communication kernel, as a new backend for vLLM, primarily targeting AMD GPUs. The changes span across several files to add the necessary configurations, manager class, and logic to use this new backend. While the integration is mostly well-structured, I've identified a couple of areas for improvement related to code duplication and consistency, which I've detailed in the comments.
This pull request has merge conflicts that must be resolved before it can be merged.
CC @sunway513
Cc @houseroad @robertgshaw2-redhat @HAIAI for review.
Why was this closed?
Hi @mgoin, I accidentally force-pushed while rebasing onto upstream, and GitHub force-closed the PR. Do you know how to reopen it?
This PR was force-closed by GitHub and cannot be reopened, so I'll open a new PR instead: #28664. Sorry for the trouble. Please help review the new PR. Thanks!
Purpose
This PR integrates MoRI-EP, a high-performance all2all communication kernel, into vLLM as an all2all backend. See the MoRI project here. MoRI also supports CUDA graphs.
This PR follows the design of vLLM's Fused MoE Modular Kernel, which is composed of the following components:
[Router] → [Quantize-Dispatch] → [Permute-Experts-Unpermute] → [Combine]
For the MoRI+AITER path, AMD's recommended high-performance configuration, the pipeline becomes:
[Router] → [Quantize-Dispatch] → [Experts] → [Combine]
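The stages above can be sketched as a toy single-process flow. This is purely illustrative: the function names and shapes below are invented for clarity, and the real MoRI dispatch/combine kernels move quantized tokens between EP ranks on device, not Python lists.

```python
# Toy sketch of [Router] -> [Dispatch] -> [Experts] -> [Combine].
# All names here are illustrative, not vLLM's actual Modular Kernel API.

def route(scores, top_k):
    """Pick the top_k expert ids per token, with normalized gate weights."""
    routing = []
    for row in scores:
        ranked = sorted(range(len(row)), key=lambda e: row[e], reverse=True)[:top_k]
        total = sum(row[e] for e in ranked)
        routing.append([(e, row[e] / total) for e in ranked])
    return routing

def dispatch(routing):
    """Group (token, weight) pairs by expert id; in MoRI-EP this grouping
    is what the all2all dispatch kernel moves across EP ranks."""
    per_expert = {}
    for t, choices in enumerate(routing):
        for expert, weight in choices:
            per_expert.setdefault(expert, []).append((t, weight))
    return per_expert

def combine(tokens, per_expert, expert_fn):
    """Run each expert on its tokens and sum weighted outputs per token."""
    out = [0.0] * len(tokens)
    for expert, items in per_expert.items():
        for t, weight in items:
            out[t] += weight * expert_fn(expert, tokens[t])
    return out
```

In the MoRI+AITER path, the permute/unpermute steps are fused away because the dispatch kernel already delivers tokens in the layout the AITER expert kernels consume.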
Two new classes are introduced:
Summary of performance comparison between MoRI-EP and naive backend (bs=128 per DP rank):
How to install MoRI
See https://github.com/ROCm/mori
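After installing per the repo instructions, one quick sanity check is to verify the package is importable before selecting the backend. The top-level module name `mori` is an assumption here; consult the repo's README for the authoritative name.

```python
import importlib.util

def has_mori() -> bool:
    # True if the MoRI Python package is importable in this environment.
    # The module name "mori" is an assumption; check the ROCm/mori README.
    return importlib.util.find_spec("mori") is not None
```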
Test Plan
Test platform: MI300X
Accuracy
Serve on DeepSeek-V3/R1 (Block scale quant)
Serve on DeepSeek-R1-PTPC (per token per channel quant)
See here for more info about PTPC.
Evaluate by gsm8k
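One way to run this evaluation (a sketch only: the model name, port, and few-shot setting are assumptions) is lm-eval's `local-completions` backend pointed at the running vLLM server:

```shell
# Hedged sketch; model path and endpoint are assumptions.
lm_eval --model local-completions \
  --model_args model=deepseek-ai/DeepSeek-V3,base_url=http://127.0.0.1:8000/v1/completions \
  --tasks gsm8k \
  --num_fewshot 5
```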
Performance
Test EP8 and EP16 performance, compare with naive all2all backend
EP8 with mori backend
EP8 with naive backend:
Replace `--all2all-backend mori` with `--all2all-backend naive`.
EP16 with mori backend
EP16 with naive backend:
Replace `--all2all-backend mori` with `--all2all-backend naive`, and use `--enforce-eager`.
Benchmark:
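For reference, a serve invocation along these lines could be used. This is a sketch: the model path and parallelism flags are assumptions; only `--all2all-backend` is the selector this PR is concerned with.

```shell
# Hedged sketch of an EP8 launch; flags other than --all2all-backend
# are illustrative. For the naive baseline, swap mori for naive.
vllm serve deepseek-ai/DeepSeek-R1 \
  --data-parallel-size 8 \
  --enable-expert-parallel \
  --all2all-backend mori
```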
Use `--random-input-len 1 --random-prefix-len 1023` to simulate PD disaggregation and measure decode performance without prefill.
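A benchmark invocation consistent with this might look like the following sketch; only the two random-length flags come from the PR text, the rest (output length, concurrency) are assumptions.

```shell
# Hedged sketch; only --random-input-len/--random-prefix-len are
# quoted from this PR, the remaining flags are illustrative.
vllm bench serve \
  --dataset-name random \
  --random-input-len 1 \
  --random-prefix-len 1023 \
  --random-output-len 128 \
  --max-concurrency 128
```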
Test Result
Accuracy
MoRI-EP with DeepSeek-V3
MoRI-EP with DeepSeek-R1-PTPC
Decode Performance
Summary
EP8 mori all2all backend
EP8 naive all2all backend
EP16 mori all2all backend
EP16 naive all2all backend