[Dev] Add DeepEP v2 flex dispatcher backend #4793
c80462a to 1cd0541
/ok to test 804a926
)
if ctx.async_finish:
    event.current_stream_wait()
return None, grad_x, None, None, None, None, None
DeepepV2Combine.forward has six inputs after ctx (buffer, x, handle, num_sms, async_finish, allocate_on_comm_stream), but this backward returns seven gradients. PyTorch autograd will raise an incorrect-gradient-count error when this combine participates in training backward. Can we drop one trailing None here?
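The counting rule behind this comment can be checked mechanically: `torch.autograd.Function.backward` must return one value per `forward` input, not counting `ctx`. The sketch below uses only the standard library so it runs without PyTorch or DeepEP. `CombineSignatureSketch`, `expected_grad_count`, and `backward_grads` are hypothetical names introduced for illustration; only the parameter list is taken from the comment above.

```python
import inspect


def expected_grad_count(forward_fn):
    """Return how many values backward must return for this forward.

    An autograd Function's backward returns one value per forward input,
    not counting ctx; non-differentiable inputs get None.
    """
    params = list(inspect.signature(forward_fn).parameters)
    return len(params) - 1  # drop the leading ctx


class CombineSignatureSketch:
    """Hypothetical stand-in that only mirrors the quoted forward signature."""

    @staticmethod
    def forward(ctx, buffer, x, handle, num_sms, async_finish, allocate_on_comm_stream):
        raise NotImplementedError("signature-only sketch, not the real combine")


def backward_grads(grad_x):
    # Only x (the second forward input) is differentiable; the rest get None.
    return None, grad_x, None, None, None, None


n_expected = expected_grad_count(CombineSignatureSketch.forward)
n_returned = len(backward_grads("grad_x placeholder"))
print(n_expected, n_returned)
```

Both counts come out as six here, which matches the return statement with one trailing `None` dropped.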
return None, grad_x, None, None, None, None

self.router_dtype = config.moe_router_dtype
self.capacity_factor = config.moe_expert_capacity_factor
self.permute_fusion = config.moe_permute_fusion
self.num_sms = config.moe_deepep_num_sms
Question: should the v2 path leave num_sms as 0, or use buffer.get_theoretical_num_sms(...), instead of reusing moe_deepep_num_sms? DeepEP v2's ElasticBuffer can analytically choose SM/QP counts, so carrying over the v1 default may override the intended v2 behavior/perf tuning.
However, that would also remove the argument for adjusting the number of SMs used by DeepEP v2. I would prefer to keep the argument but change its default to None, and let v1 and v2 each pick their own default when it is None.
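A minimal sketch of the "default to None, resolve per backend" idea from this reply. Everything here is an assumption made for illustration: the helper name `resolve_num_sms`, the backend keys, and both default values are hypothetical and are not taken from the PR. In particular, treating 0 as "let the v2 buffer choose" only follows the wording of the review question above.

```python
from typing import Optional

# Hypothetical per-backend defaults, for illustration only.
DEFAULT_NUM_SMS = {
    "deepep_v1": 20,  # assumed v1 default
    "deepep_v2": 0,   # assumed to mean "let the v2 buffer decide"
}


def resolve_num_sms(configured: Optional[int], backend: str) -> int:
    """Return the user's value if one was given, else the backend's default."""
    if backend not in DEFAULT_NUM_SMS:
        raise ValueError(f"unknown backend: {backend!r}")
    if configured is not None:
        # An explicit setting always wins, so users keep the tuning knob.
        return configured
    return DEFAULT_NUM_SMS[backend]


print(resolve_num_sms(None, "deepep_v1"))
print(resolve_num_sms(None, "deepep_v2"))
print(resolve_num_sms(24, "deepep_v2"))
```

Checking `is not None` rather than truthiness matters in this design: an explicit 0 from the user stays distinguishable from "not set".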
/ok to test 98c7373
Signed-off-by: tongliu <tongliu@nvidia.com>
98c7373 to a71d0d1
/ok to test 30e602f
@Autumn1998, there was an error processing your request. See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/
/ok to test a71d0d1
What does this PR do?
Adds DeepEP v2 support as a backend of the flex dispatcher.
PR on main: #5153
Issue tracking
For PRs from open-source community contributors:
Linked issue:
Contribution process
Pre-checks
Code review
Feel free to message or comment the @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!
All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.
Step 1: Mark PR as "Ready for Review"
.github/CODEOWNERS.
Final Review might get declined if these requirements are not fulfilled.
Step 2: Final Review
For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned.
For PRs outside megatron/core, this step is skipped.
Step 3: Approved
Once all required reviewers have approved, the Approved label is applied automatically.
Merge
Any member of mcore-engineers will be able to merge your PR.
For MRs into `dev` branch
The proposed review process for `dev` branch is under active discussion.
MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.