
MNNVL MoE All-to-All Support#1134

Merged
yzh119 merged 13 commits into flashinfer-ai:main from yzh119:comm-all2all on Jun 24, 2025

@cyx-6
Collaborator

@cyx-6 cyx-6 commented Jun 10, 2025

📌 Description

Introduce MnnvlMemory and MnnvlMoe from TensorRT-LLM for large-scale expert parallelism. MnnvlMoe uses an MnnvlMemory workspace for its all-to-all(v) communication operation, which is aligned with the MPI alltoallv interface and functionality.
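
For readers unfamiliar with the alltoallv pattern that the description refers to, here is a minimal single-process sketch of its semantics in plain Python. It is not the FlashInfer or TensorRT-LLM implementation and does not use MnnvlMemory. The function names `exclusive_prefix_sum` and `simulate_alltoallv` are hypothetical, chosen only for this illustration. The assumption is the usual MPI convention: each rank packs its outgoing items by destination rank, and `send_counts[src][dst]` says how many items rank `src` sends to rank `dst`.

```python
from itertools import accumulate
from typing import List, Sequence, Tuple


def exclusive_prefix_sum(counts: Sequence[int]) -> List[int]:
    """Return the displacement (start offset) of each peer's segment."""
    return [0] + list(accumulate(counts))[:-1]


def simulate_alltoallv(
    send_bufs: Sequence[Sequence[str]],
    send_counts: Sequence[Sequence[int]],
) -> Tuple[List[List[str]], List[List[int]]]:
    """Simulate alltoallv across `world` ranks inside one process.

    send_bufs[src] is rank src's packed send buffer, ordered by destination.
    send_counts[src][dst] is the number of items rank src sends to rank dst.
    Returns (recv_bufs, recv_counts), where recv_bufs[dst] is ordered by source.
    """
    world = len(send_bufs)
    # What dst receives from src is exactly what src sends to dst.
    recv_counts = [
        [send_counts[src][dst] for src in range(world)] for dst in range(world)
    ]
    send_displs = [exclusive_prefix_sum(counts) for counts in send_counts]

    recv_bufs: List[List[str]] = []
    for dst in range(world):
        received: List[str] = []
        for src in range(world):
            start = send_displs[src][dst]
            stop = start + send_counts[src][dst]
            received.extend(send_bufs[src][start:stop])
        recv_bufs.append(received)
    return recv_bufs, recv_counts


# Three ranks with uneven per-destination counts, as happens when tokens
# are routed to experts hosted on different ranks.
example_send_counts = [[1, 2, 0], [0, 1, 1], [2, 0, 1]]
example_send_bufs = [["a0", "a1", "a2"], ["b0", "b1"], ["c0", "c1", "c2"]]
example_recv_bufs, example_recv_counts = simulate_alltoallv(
    example_send_bufs, example_send_counts
)
```

In this toy run, rank 0 ends up with one item from itself and two from rank 2, while rank 1 receives nothing from rank 2. The variable-count case is what distinguishes alltoallv from a fixed-size all-to-all.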

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Comment thread csrc/trtllm_alltoall.cu Outdated
Comment thread csrc/trtllm_alltoall.cu Outdated
Comment thread csrc/trtllm_alltoall.cu Outdated
Comment thread csrc/trtllm_alltoall.cu Outdated
@cyx-6
Collaborator Author

cyx-6 commented Jun 11, 2025

The above issues are all fixed.

Comment thread csrc/pytorch_extension_utils.h Outdated
Comment thread csrc/pytorch_extension_utils.h Outdated
Comment thread flashinfer/comm.py Outdated
@cyx-6
Collaborator Author

cyx-6 commented Jun 11, 2025

removed and decoupled

@yongwww yongwww marked this pull request as ready for review June 11, 2025 15:47
@yongwww
Member

yongwww commented Jun 11, 2025

The multi-GPU tests are skipped in CI due to CI resource limits. They pass on my multi-B200 node.


Update (Jun 15, 2025): It turns out the MNNVL fabric wasn’t actually being used for data transfers in the multi-GPU tests, so I’ll remove those tests. The MNNVL setup, along with the updated multi-GPU and multi-node tests, will be added shortly.

@yongwww yongwww changed the title from "All-to-all communication operator support" to "alltoallv communication operator support" Jun 11, 2025
@yongwww yongwww changed the title from "alltoallv communication operator support" to "MNNVL AllToAllV communication operator support" Jun 16, 2025
@yongwww yongwww mentioned this pull request Jun 16, 2025
5 tasks
yzh119 pushed a commit that referenced this pull request Jun 17, 2025
<!-- .github/pull_request_template.md -->

## 📌 Description

Install the python packages for CI docker: mpi4py, pynvml. They will be
used for the comm ops.

## 🔍 Related Issues

#1145,
#1134

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->
yongwww and others added 2 commits June 22, 2025 04:59
Use trtllm_alltoall.cuh instead of trtllm_alltoall.cu

Upd

Use pytorch_extension_utils

upd

upd

upd

compiled

Fix build

Register python ops

fix

Upd

Add unittest

fix

fix

fix

Add multi-gpu test cases

Add cross-gpu test

Remove the invalid cross-gpu test

add mnnvl (wip)

comm module
cyx-6 added 2 commits June 23, 2025 00:48
upd

.

.

.

.

.
Comment thread flashinfer/comm/mnnvl.py
@yzh119
Collaborator

yzh119 commented Jun 23, 2025

Thanks @yongwww and @cyx-6 for the great work.

As our communication kernel dependencies become complicated, we should update the documentation on how to install MPI/gdrcopy/etc.

@yzh119
Collaborator

yzh119 commented Jun 24, 2025

cc @yyihuang for another look at 26cdc5e

@yzh119 yzh119 merged commit 3dd4f03 into flashinfer-ai:main Jun 24, 2025
2 checks passed
@cyx-6 cyx-6 changed the title from "MNNVL AllToAllV communication operator support" to "MNNVL MoE All-to-All Support" Jun 24, 2025
4 participants