[model] feat: add MiMo-V2-Flash model support by beccohov · Pull Request #3163 · NVIDIA-NeMo/Megatron-Bridge

beccohov · 2026-04-05T17:19:55Z

What does this PR do ?

Add Megatron Bridge support for MiMo-V2-Flash (Xiaomi), an LLM with 309B total / ~15B active parameters that uses hybrid attention, fine-grained MoE, value scaling, asymmetric V head dims, and Multi-Token Prediction.

Changelog

Add MiMoV2FlashBridge with FP8 block-wise dequantization (supports non-uniform block sizes). Return fp32 weights to allow internal type cast.
Add MiMoV2FlashModelProvider with dual-base RoPE, per-layer KV head switching, and asymmetric V head dim
Add custom MiMoV2FlashSelfAttention that rebuilds linear_qkv/linear_proj for V head dim ≠ K head dim (128 vs 192)
Add custom MiMoV2FlashTEDotProductAttention with per-layer sliding window, attention sink bias, and TE k/v channel support
Add MiMoV2FlashQKVMapping for asymmetric QKV merge/split during checkpoint conversion
Add MTP support: dense MLP spec (not MoE), SWA attention, with MTP layer count auto-detected from safetensor keys
Add TP/CP assertions for known limitations (TP ≤ min KV groups; CP unsupported because TE has no backend for learnable softmax with CP). I decided not to replicate KV heads because it's not efficient.
Add unit tests: provider bridge config mapping, MTP detection, config round-trip, mapping registry coverage, QKV round-trip, FP8 dequant, weight loading hooks
Validated: per-layer fp32 numerics match HF reference, end-to-end generation produces correct output, parallelism tested (TP, EP, SP, TP+EP combinations). Also tested generation on 8 GPUs.
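
To illustrate the block-wise dequantization item in the changelog, here is a minimal sketch in plain Python. It is not the bridge's code. It assumes that "non-uniform block sizes" means edge blocks may be smaller than the nominal block when a weight dimension is not a multiple of the block size, and it treats the FP8 values as already decoded to floats. The names `dequantize_blockwise` and `scale_inv` are hypothetical.

```python
import math


def dequantize_blockwise(q, scale_inv, block_rows, block_cols):
    """Multiply each element by the scale of the block it falls in.

    q         : 2-D list of quantized values (assumed already decoded to float)
    scale_inv : 2-D list with one scale per block (hypothetical name)
    Edge blocks may be smaller than block_rows x block_cols.
    """
    rows, cols = len(q), len(q[0])
    # One scale per block, rounding up so partial edge blocks get their own scale.
    assert len(scale_inv) == math.ceil(rows / block_rows)
    assert len(scale_inv[0]) == math.ceil(cols / block_cols)
    return [
        [q[r][c] * scale_inv[r // block_rows][c // block_cols] for c in range(cols)]
        for r in range(rows)
    ]


# 3x3 weight with 2x2 blocks: the last row and last column form partial blocks.
q = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0],
     [7.0, 8.0, 9.0]]
scale_inv = [[0.5, 2.0],
             [1.0, 4.0]]
w = dequantize_blockwise(q, scale_inv, block_rows=2, block_cols=2)
```

A real implementation would operate on tensors and return fp32, as the changelog notes, so that the caller can cast to the target dtype afterwards.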

GitHub Actions CI

The CI requires approval from an NVIDIA developer to run for external contributors.

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items you can still open "Draft" PR.

Additional Information

Related to [QUESTION] Support for Xiaomi MiMo-V2-Flash Architecture (Hybrid Attention + Fine-Grained MoE) NVIDIA/Megatron-LM#2976

copy-pr-bot · 2026-04-05T17:20:00Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

beccohov · 2026-04-05T17:21:17Z

Drafted this WIP PR for MiMoV2 Flash support according to this issue

beccohov · 2026-04-05T19:57:29Z

@sbhavani Hey,
MiMo-V2-Flash uses asymmetric attention head dimensions: Q/K use head_dim=192 but V uses v_head_dim=128. This is currently not supported by MCore's standard SelfAttention for two reasons:

linear_proj input size — Attention.__init__ sizes linear_proj input as kv_channels * num_attention_heads = 192 * 64 = 12288 (here), but the correct size is v_head_dim * num_attention_heads = 128 * 64 = 8192.
QKV split in forward pass — get_query_key_value_tensors splits the fused QKV output using hidden_size_per_attention_head (= kv_channels = 192) for both K and V, so V would be extracted with the wrong size (here).
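
The two size mismatches above can be checked with a few lines of arithmetic. This is only a sketch: the head count and head dims come from the comment, while `num_kv_heads = 4` and the helper names are hypothetical, and the fused QKV output is assumed to be laid out per KV group as [queries of the group, one K head, one V head].

```python
num_attention_heads = 64  # from the comment above
head_dim = 192            # Q/K head dim (kv_channels)
v_head_dim = 128          # V head dim
num_kv_heads = 4          # hypothetical value, for illustration only


def proj_input_size(per_head_dim, heads):
    """Input width of the output projection: concatenated per-head attention outputs."""
    return per_head_dim * heads


def qkv_split_sizes(heads, kv_heads, qk_dim, v_dim):
    """Widths of the Q, K and V slices inside one KV group of a fused QKV output."""
    q_heads_per_group = heads // kv_heads
    return [q_heads_per_group * qk_dim, qk_dim, v_dim]


wrong = proj_input_size(head_dim, num_attention_heads)    # sized from kv_channels
right = proj_input_size(v_head_dim, num_attention_heads)  # sized from v_head_dim

symmetric = qkv_split_sizes(num_attention_heads, num_kv_heads, head_dim, head_dim)
asymmetric = qkv_split_sizes(num_attention_heads, num_kv_heads, head_dim, v_head_dim)

print(wrong, right)           # 12288 8192
print(symmetric, asymmetric)  # [3072, 192, 192] [3072, 192, 128]
```

With the symmetric split, each group's V slice would be 64 elements too wide, which is why both the projection size and the split need to know `v_head_dim`.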

Should I expose support similar to MLA but in standard SelfAttention for decoupled V size in megatron? Alternatively, is there a recommended pattern for bridging models with asymmetric V dims without modifying MCore?

yaoyu-33 · 2026-04-06T18:02:17Z

Should I expose support similar to MLA but in standard SelfAttention for decoupled V size in megatron? Alternatively, is there a recommended pattern for bridging models with asymmetric V dims without modifying MCore?

@beccohov : usually we patch the config / implementation / fwd in bridge directly and use custom layer spec to specify the customized version of it.

Here's the GitHub link to the OLMoE provider file:
https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/src/megatron/bridge/models/olmoe/olmoe_provider.py
Layer spec builder — L41-45
Custom OLMoESelfAttention class — L98-188

beccohov · 2026-04-11T17:03:38Z

Hey, @yaoyu-33, I have a few questions before I mark this PR as "ready for review":

Currently my implementation of MiMo-V2-Flash asserts that CP is disabled, because TE lacks a backend for CP + learnable softmax (attention sink bias). SWA layers use softmax_type="learnable", which is refused by all TE backends when CP is enabled. However, it seems relatively easy to fix in TE (here we go via the MLA branch and get the wrong v.shape because of GQA). I tried to patch it in the provider, but there are lots of TE assertions that strictly limit such hacks. Is asserting acceptable, or should we implement a workaround?
The HF config has attention_value_scale=0.707, but the released HF modeling code does not apply it in the forward pass (i.e. the released weights rescale V back). We currently skip it too (matching HF behavior). For training, however, I think we should re-enable the scale (i.e. value = value * 0.707) to match MiMo-V2-Flash fully. What do you think? This is basically a minor fix, and I think it plays a role only for large-scale pretraining with fp8. For finetunes I think it's better to skip this scaling, because a naive implementation would consume more memory on activations.
Should mimo_v2_flash/ be merged into the existing mimo/ directory? The existing mimo/ contains both the original Xiaomi MiMo bridge (Qwen2-based) and the unrelated multimodal MIMO provider. I kept it separate to avoid confusion, but happy to merge if preferred.
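
For readers unfamiliar with the "learnable softmax" in the first question, here is a small sketch of softmax with an attention sink. It assumes the sink is one extra logit per head that is added to the softmax denominator only, so the attention weights sum to less than 1. That is my understanding of the idea, not a description of TE's kernels, and `softmax_with_sink` is a hypothetical helper.

```python
import math


def softmax_with_sink(logits, sink_logit):
    """Softmax over `logits` with an extra sink term in the denominator.

    The sink absorbs probability mass but contributes no value vector,
    so the returned weights sum to less than 1.
    """
    m = max(max(logits), sink_logit)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    denom = sum(exps) + math.exp(sink_logit - m)
    return [e / denom for e in exps]


# Two equal logits plus a sink logit of the same size: each key gets 1/3.
probs = softmax_with_sink([0.0, 0.0], sink_logit=0.0)

# A very negative sink logit recovers the ordinary softmax.
plain = softmax_with_sink([0.0, 0.0], sink_logit=-1e9)
```

In training the sink logit would be a learned parameter, which is why a backend has to support it explicitly instead of reusing a vanilla softmax path.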

sbhavani · 2026-04-11T17:21:53Z

@beccohov I'd recommend creating an issue in TE to track CP support. I think the incomplete CP coverage for non-vanilla softmax should be fixed eventually.

beccohov · 2026-04-11T18:32:11Z

Created issue in TE.
Apart from this issue, what about v_scale and the directory?

beccohov · 2026-04-24T09:53:38Z

The TE issue with CP is fixed in TE 2.15.

beccohov · 2026-05-03T08:51:12Z

@sbhavani hey, would it be possible to have a review, please?
Thanks!

sbhavani · 2026-05-04T15:06:43Z

CC @snowmanwwg

beccohov · 2026-05-04T19:21:49Z

@sbhavani the label is removed automatically anyways for some reason

sbhavani · 2026-05-04T21:43:04Z

@beccohov thanks! we need to fix that automation bug

CC @yaoyu-33

cuichenx

Could you add a readme page and run scripts following the example here? https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/examples/models/minimax_m2

Can you also add a functional test following this example: https://github.com/NVIDIA-NeMo/Megatron-Bridge/pull/2602/changes

cuichenx · 2026-05-11T20:05:13Z

@@ -0,0 +1,22 @@
+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.

Suggested change

-# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.

beccohov (May 17, 2026): Thanks, changed in all places

beccohov · 2026-05-17T19:49:23Z

@cuichenx I believe I've addressed all your comments — would appreciate a re-review!

kamran-nvidia · 2026-05-26T12:42:23Z

@beccohov We are working on caching this model on the CI server. I will re-run the failed CI test when the fix is in place. No action needed from your side. Thanks.

beccohov · 2026-05-27T19:58:02Z

@kamran-nvidia can you please share the PR / something where I can track the progress, so that I can understand when I'll be able to finish this PR?

kamran-nvidia · 2026-05-27T20:01:56Z

@JRD971000 Can you help @beccohov please? Thanks.

yaoyu-33 · 2026-05-28T17:22:18Z

/ok to test 017062b

beccohov · 2026-05-30T11:57:45Z

Afaiu, the tests failed due to (2) from my comment here. So what should I do? Should I add trust_remote_code?
cc @yaoyu-33 @kamran-nvidia

yaoyu-33 · 2026-05-31T22:36:49Z

@beccohov nvm, I will force merge this. It's a CI caching issue.

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

yaoyu-33 · 2026-05-31T22:39:44Z

/ok to test 341094f

yaoyu-33 · 2026-05-31T22:40:20Z

Will follow up to add back the functional tests after the caching issue is resolved.

Eisenhower · 2026-06-04T12:46:30Z

Hi @beccohov @yaoyu-33 — small follow-up while reading this merged work.

MiMoV2FlashTEDotProductAttention.__init__ reads attention_value_scale from config and stores it as self._attention_value_scale (modeling_mimo_v2_flash.py#L256), but the value is never consumed in forward (L270-L271). For the released XiaomiMiMo/MiMo-V2-Flash checkpoint (attention_value_scale = 0.707) this silently scales the attention output by ~1.414× at every layer relative to the reference implementations.

The scale can't be folded into softmax_scale (softmax is non-linear) and linear_proj.weight is mapped 1:1 from HF o_proj.weight, so it has to land on V (or equivalently on the attention output). Both upstream references do this on V before the kernel:

HF: https://huggingface.co/XiaomiMiMo/MiMo-V2-Flash/blob/main/modeling_mimo_v2_flash.py — value_states = value_states * self.v_scale
vLLM: https://github.com/vllm-project/vllm/blob/e68988a24807c9dfb2bf6936eb17425ce7812c5f/vllm/model_executor/models/mimo_v2.py#L327-L329 — v = v * self.v_scale
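
The argument above (the scale commutes with the weighted sum over V but not with the softmax) can be checked numerically. This is a single-query sketch in plain Python with made-up scores and values; only `v_scale = 0.707` comes from the thread, and the helper names are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(scores, values):
    """Single-query attention: softmax-weighted sum of the value rows."""
    probs = softmax(scores)
    dim = len(values[0])
    return [sum(p * v[d] for p, v in zip(probs, values)) for d in range(dim)]


v_scale = 0.707
scores = [0.3, -1.2, 2.0]                        # made-up attention logits
values = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]   # made-up value rows

# Scaling V before attention equals scaling the attention output afterwards.
scaled_v = attend(scores, [[x * v_scale for x in row] for row in values])
scaled_out = [x * v_scale for x in attend(scores, values)]

# Scaling the logits instead changes the attention distribution, so it differs.
scaled_scores = attend([s * v_scale for s in scores], values)

same = all(abs(a - b) < 1e-12 for a, b in zip(scaled_v, scaled_out))
different = any(abs(a - b) > 1e-3 for a, b in zip(scaled_scores, scaled_out))
print(same, different)
```

The first two results agree because the output is linear in V. The third does not, which is the reason the scale cannot be absorbed into softmax_scale.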

I opened a two-line follow-up: #4155. Would appreciate a quick look when you have a moment — happy to add a focused unit test if that helps the review.

beccohov · 2026-06-04T13:56:00Z

Hi @Eisenhower,
thanks for your comment. Actually, the reason I did not scale is that the HF weights (the V weights) seem to have been statically scaled in advance. I noticed this from quality degradation when I was implementing it.
So you don't need to apply this scale to reproduce results if you load from the pretrained model, because it was already trained with suitable magnitudes.
However, for training dynamics you may want to scale (this will affect gradient magnitudes for V, and if you train from scratch it may change the magnitude of attention scores). But naive scaling will increase memory consumption (and this would be noticeable on a model this large), so ideally the scaling should be implemented in a way that avoids the memory overhead.

So, one way is to upscale the HF weights back while loading them and then use this scaling in training. This is especially valuable when you train from scratch.
Another way is to skip this scaling completely (as in my implementation) and keep in mind that gradients for V would be slightly different. Since we initialize from a pretrained model, I believe this won't significantly change model convergence (since we use a smaller lr and the model is more stable during post-training). But it will save you memory (if you don't use full AC).
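
The trade-off described here rests on a simple identity: multiplying the V projection output by a constant is the same as multiplying the V projection weights by that constant once. The sketch below uses a made-up 3x2 weight and hypothetical helper names. It only demonstrates the identity and says nothing about whether the released checkpoint was actually folded this way.

```python
def matvec(w, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


v_scale = 0.707
w_v = [[0.2, -0.4], [1.0, 0.5], [0.0, 2.0]]  # made-up V projection weight
x = [1.5, -2.0]                               # made-up hidden state

# Option A: scale at runtime (allocates an extra activation in a naive version).
runtime = [v * v_scale for v in matvec(w_v, x)]

# Option B: fold the scale into the weights once, no runtime multiply.
folded_w = [[wi * v_scale for wi in row] for row in w_v]
folded = matvec(folded_w, x)

# Undoing the fold at load time recovers the original weights, after which
# the runtime scale can be applied in the forward pass instead.
unfolded_w = [[wi / v_scale for wi in row] for row in folded_w]

match = all(abs(a - b) < 1e-12 for a, b in zip(runtime, folded))
```

The forward results are identical either way. The difference is in training: with the fold, gradients reach the V weights without the extra factor, which is the gradient-magnitude effect mentioned above.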

Eisenhower · 2026-06-10T12:45:53Z

Thanks for the context. I think the HF implementation you checked was likely the old version. I raised this issue with the Xiaomi MiMo team afterwards, and they have since updated the HF modeling code to apply attention_value_scale on value_states in forward. vLLM also applies the same scale before attention.

I agree the extra allocation from a naive value = value * scale is worth optimizing. My concern is mainly RL consistency: rollout usually follows the vLLM/HF inference path, while training/logprob/KL are computed by the actor path. If rollout applies the scale but training does not, the two forward paths are no longer equivalent. For that reason I would prefer Megatron-Bridge to match the HF/vLLM semantics first, and then optimize the implementation or fold the scale only if we can keep the two paths numerically equivalent.

beccohov · 2026-06-10T14:03:41Z

@Eisenhower Sure, if the HF weights are updated then the situation changes. You can then add a follow-up PR with this small change!

Signed-off-by: Arkadii Be <beccohov@gmail.com> Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Co-authored-by: Chen Cui <chcui@nvidia.com> Co-authored-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com> Co-authored-by: yaoyu-33 <yaoyu.094@gmail.com> Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>

github-actions Bot added the community-request label Apr 5, 2026

beccohov mentioned this pull request Apr 5, 2026

[QUESTION] Support for Xiaomi MiMo-V2-Flash Architecture (Hybrid Attention + Fine-Grained MoE) NVIDIA/Megatron-LM#2976

Closed

beccohov mentioned this pull request Apr 11, 2026

[Bug] Context parallel crashes with asymmetric K/V head dims (GQA + enable_mla path) NVIDIA/TransformerEngine#2868

Open

beccohov marked this pull request as ready for review April 24, 2026 09:53

svcnvidia-nemo-ci added the waiting-on-maintainers Waiting on maintainers to respond label Apr 26, 2026

svcnvidia-nemo-ci removed the waiting-on-maintainers Waiting on maintainers to respond label May 3, 2026

sbhavani added the waiting-on-maintainers Waiting on maintainers to respond label May 4, 2026

svcnvidia-nemo-ci removed the waiting-on-maintainers Waiting on maintainers to respond label May 4, 2026

svcnvidia-nemo-ci added the waiting-on-maintainers Waiting on maintainers to respond label May 7, 2026

cuichenx mentioned this pull request May 8, 2026

[NeMo FW 26.06 Release] MBridge v0.5.0 Roadmap #3754

Open

cuichenx reviewed May 11, 2026

Comment thread src/megatron/bridge/models/mimo_v2_flash/mimo_v2_flash_provider.py Outdated

svcnvidia-nemo-ci added waiting-on-customer Waiting on the original author to respond and removed waiting-on-maintainers Waiting on maintainers to respond labels May 11, 2026

yaoyu-33 added area:model Model implementations and HF bridge logic feature New capabilities, enhancements, or enablement work labels May 12, 2026

svcnvidia-nemo-ci removed the waiting-on-customer Waiting on the original author to respond label May 17, 2026

Merge branch 'main' into beccohov/mimo-v2-flash

017062b

[test] chore: move MiMo V2 Flash CI to flaky

341094f

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

yaoyu-33 dismissed cuichenx’s stale review via 341094f May 31, 2026 22:39

yaoyu-33 merged commit 38803fd into NVIDIA-NeMo:main May 31, 2026
13 checks passed
