Tensor parallelism for Mixture of Experts #2074
Conversation
Let's add some functional unit tests to ensure the new code paths are triggered in our tests. This will ensure things are at least functionally working in the future. It would be great to have basic correctness unit tests as well, but we can discuss that offline.
@jeffra I have added some tests in
Hi, I tested the tensor parallelism for MoE. The loss curve is still higher, and it then goes to NaN after about 600 million tokens.

Note: This PR is used in conjunction with a companion PR on the Megatron-Deepspeed repo.
This PR adds tensor parallelism for the non-expert parameters. Combined with ZeRO-2, this allows us to scale to roughly 2x larger base models than ZeRO-2 alone. When tensor parallelism is enabled only for non-experts, each gate sees duplicate copies of the same tokens. It is important to drop these duplicates before they reach the experts; otherwise we run into convergence issues. In the current implementation, we drop tokens right before the AlltoAll and gather them right after the AlltoAll. These calls are made in sharded_moe.py.

Update: This PR now supports tensor parallelism for experts as well.
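To make the drop/gather placement concrete, here is a minimal PyTorch sketch of the pattern described above. It is not the code from sharded_moe.py; the helper names and the `tp_group` / `ep_group` handles are assumptions for illustration, and it presumes the token count divides evenly across the tensor-parallel ranks.

```python
# Minimal sketch (not the PR's exact implementation): drop the replicated tokens
# across the tensor-parallel group before the expert all-to-all, then gather them
# back right after it. `tp_group`, `ep_group`, and the function names are
# hypothetical placeholders used only for this illustration.
import torch
import torch.distributed as dist


def drop_duplicate_tokens(tokens: torch.Tensor, tp_group) -> torch.Tensor:
    """Keep only this rank's slice of the tokens, since every tensor-parallel
    rank holds an identical copy of the gate input."""
    tp_world_size = dist.get_world_size(group=tp_group)
    if tp_world_size == 1:
        return tokens
    tp_rank = dist.get_rank(group=tp_group)
    # Split along the token dimension and keep only the local chunk.
    return tokens.chunk(tp_world_size, dim=0)[tp_rank].contiguous()


def gather_tokens(tokens: torch.Tensor, tp_group) -> torch.Tensor:
    """Reconstruct the full (duplicated) token tensor on every tensor-parallel rank."""
    tp_world_size = dist.get_world_size(group=tp_group)
    if tp_world_size == 1:
        return tokens
    gathered = [torch.empty_like(tokens) for _ in range(tp_world_size)]
    dist.all_gather(gathered, tokens, group=tp_group)
    return torch.cat(gathered, dim=0)


def dispatch_to_experts(dispatched_input: torch.Tensor, tp_group, ep_group) -> torch.Tensor:
    """Drop duplicates -> all-to-all across expert-parallel ranks -> gather back."""
    local_input = drop_duplicate_tokens(dispatched_input, tp_group)
    output = torch.empty_like(local_input)
    dist.all_to_all_single(output, local_input, group=ep_group)
    return gather_tokens(output, tp_group)
```

The key point the sketch captures is that the drop happens immediately before the AlltoAll and the gather immediately after it, so each expert sees each token exactly once while the surrounding tensor-parallel layers still operate on the full replicated activations.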