[model] feat: add MiMo-V2-Flash model support #3163
Drafted this WIP PR for MiMo-V2-Flash support according to this issue

@sbhavani Hey,
Should I expose support similar to MLA but in standard SelfAttention for decoupled

@beccohov: usually we patch the config / implementation / forward in the bridge directly and use a custom layer spec to specify the customized version of it. Here's the GitHub link to the OLMoE provider file:

Hey @yaoyu-33, I have a few questions before I mark this PR as "ready for review":

@beccohov I'd recommend creating an issue in TE to track CP support. I think the incomplete CP coverage for non-vanilla softmax should be fixed eventually.

Created an issue in TE.

The TE issue with CP is fixed in TE 2.15.

@sbhavani hey, would it be possible to get a review, please?

CC @snowmanwwg

@sbhavani the label is removed automatically anyway for some reason
cuichenx left a comment
Could you add a README page and run scripts following the example here? https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/examples/models/minimax_m2
Can you also add a functional test following this example: https://github.com/NVIDIA-NeMo/Megatron-Bridge/pull/2602/changes

@@ -0,0 +1,22 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
Suggested change:
- # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+ # Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.

Thanks, changed in all places
@cuichenx I believe I've addressed all your comments; would appreciate a re-review!

@beccohov We are working on caching this model on the CI server. I will re-run the failed CI test when the fix is in place. No action needed from your side. Thanks.

@kamran-nvidia can you please share the PR or anything else where I can track the progress, so that I know when I'll be able to finish this PR?

@JRD971000 Can you help @beccohov please? Thanks.

/ok to test 017062b

Afaiu, the tests failed due to (2) from my comment here. So what should I do? Should I add

@beccohov nvm, I will force merge this. It's a CI caching issue.
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

/ok to test 341094f

Will follow up to add back the functional tests after the caching issue is resolved.
Hi @beccohov @yaoyu-33, small follow-up while reading this merged work.
The scale can't be folded into
I opened a two-line follow-up: #4155. Would appreciate a quick look when you have a moment; happy to add a focused unit test if that helps the review.

Hi @Eisenhower, one way is to not upscale the HF weights back while loading them and then use this scaling in training. This is especially valuable when you train from scratch.

@beccohov Thanks for the context. I think the HF implementation you checked was likely the old version. I raised this issue with the Xiaomi MiMo team afterwards, and they have since updated the HF modeling code to apply
I agree the extra allocation from a naive

@Eisenhower Sure, if the HF weights are updated then the situation changes. You can then add a follow-up PR with this small change!
Signed-off-by: Arkadii Be <beccohov@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Co-authored-by: Chen Cui <chcui@nvidia.com>
Co-authored-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com>
Co-authored-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
What does this PR do?
Add Megatron Bridge support for MiMo-V2-Flash (Xiaomi), a 309B-parameter LLM (~15B active) with hybrid attention, fine-grained MoE, value scaling, asymmetric V head dims, and Multi-Token Prediction.
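For readers unfamiliar with two of the features named above, here is a minimal, framework-free sketch of what "value scaling" and "asymmetric V head dims" can look like in a single attention head. It is not the code from this PR or from the HF implementation. The function names, the toy shapes, and the `V_SCALE` value are all hypothetical, and it assumes the value scale is a single scalar.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v, v_scale=1.0):
    """Single-head, non-causal attention with an optional scalar applied to V.

    q and k are lists of d_qk-dim vectors; v is a list of d_v-dim vectors.
    d_v may differ from d_qk (the "asymmetric V head dim" case).
    """
    d_qk = len(q[0])
    d_v = len(v[0])
    out = []
    for q_i in q:
        # Scaled dot-product scores of this query against every key.
        scores = [sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d_qk) for k_j in k]
        weights = softmax(scores)
        # Weighted sum over (scaled) values; the output row has d_v entries.
        out.append([
            sum(w * v_scale * v_j[c] for w, v_j in zip(weights, v))
            for c in range(d_v)
        ])
    return out


# Toy shapes: d_qk = 4, d_v = 2 (hypothetical, not the model's real sizes).
q = [[1.0, 0.0, 0.5, -0.5], [0.2, 0.3, -0.1, 0.4]]
k = [[0.5, 0.1, 0.0, 0.2], [-0.3, 0.8, 0.6, 0.1]]
v = [[1.0, 2.0], [3.0, -1.0]]
V_SCALE = 0.5  # hypothetical value, chosen only for illustration

scaled_inside = attention(q, k, v, v_scale=V_SCALE)
scaled_after = [[V_SCALE * x for x in row] for row in attention(q, k, v)]
```

Under the scalar assumption, scaling V before the weighted sum and scaling the attention output afterwards give the same result, because the weighted sum is linear in V. The output width follows `d_v` rather than `d_qk`.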