fix(mimo): adapt training layer for MCore submodule bump by yashaswikarnati · Pull Request #2979 · NVIDIA-NeMo/Megatron-Bridge

yashaswikarnati · 2026-03-25T06:34:17Z

Summary

Add module_output_ndim to MultiModulePipelineCommunicator for correct 2D/3D tensor routing (vision encoders produce 2D [S, H], LLM produces 3D [S, B, H])
Use MIMO_LANGUAGE_MODULE_KEY instead of removed role.language_module_name attribute in mimo_step.py and train_mimo.py
Remove language_module_key assertion from pretrain_mimo.py (removed from MimoModelConfig)
Clean up stale language_module_key / language_module_name references in test mocks
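
For readers unfamiliar with the 2D/3D distinction in the first bullet, here is a minimal sketch of how a per-module output rank could drive the shape of a receive buffer. The dictionary, the module names, and the `recv_shape` helper are all hypothetical and only illustrate the idea; they are not the `MultiModulePipelineCommunicator` API.

```python
from typing import Dict, Tuple

# Hypothetical mapping from module name to the rank of the tensor it emits.
# The names and this layout are illustrative assumptions, not MCore code.
MODULE_OUTPUT_NDIM: Dict[str, int] = {
    "vision_encoder": 2,  # emits [S, H]
    "language": 3,        # emits [S, B, H]
}


def recv_shape(module: str, seq_len: int, batch: int, hidden: int) -> Tuple[int, ...]:
    """Return the buffer shape a receiver would allocate for `module`'s output."""
    ndim = MODULE_OUTPUT_NDIM[module]
    if ndim == 2:
        return (seq_len, hidden)
    if ndim == 3:
        return (seq_len, batch, hidden)
    raise ValueError(f"unsupported output rank {ndim} for module {module!r}")
```

If the receiver assumed a 3D layout for every module, the buffer allocated for a 2D encoder output would have the wrong rank, which is the kind of mismatch the new argument is meant to avoid.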

Depends on: #2978 (phase 4 model layer fixes)

Test plan

Existing MIMO unit tests pass (159/162, 3 pre-existing failures in test_mimo_step)
E2e training test passes on 8 GPUs (torchrun --nproc_per_node=8 tests/e2e/mimo/test_mimo_training_e2e.py)

🤖 Generated with Claude Code

copy-pr-bot · 2026-03-25T06:34:21Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

aroshanghias-nvd · 2026-03-25T17:52:43Z

cross_entropy_loss_fusion=True — Is this required by the new MCore, or is it an independent improvement? It's only added in the checkpoint resume test's _make_language_config(), not in the other e2e tests (test_mimo_training_e2e.py, test_mimo_training_llava.py). Should those tests also get this flag for consistency?

pruprakash · 2026-04-28T19:25:23Z

QA RCCA Analysis

1. Fix Reference

PR: #2979 - fix(mimo): adapt training layer for MCore submodule bump
Issue: PR is the fix itself (MCore compatibility)

2. Root Cause

MCore submodule bump required MIMO training layer adaptations:

Add module_output_ndim to MultiModulePipelineCommunicator for correct 2D/3D tensor routing
Replace removed role.language_module_name with MIMO_LANGUAGE_MODULE_KEY
Clean up stale references in tests
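
The second adaptation can be pictured as replacing a per-object attribute with a single shared constant. The sketch below is only an illustration: the constant's value, the `modules` dictionary, and the `get_language_module` helper are assumptions, not code from the PR.

```python
from typing import Any, Dict

# Assumed value for illustration; the real constant is defined in the codebase.
MIMO_LANGUAGE_MODULE_KEY = "language"


def get_language_module(modules: Dict[str, Any]) -> Any:
    """Look up the language module by the shared key instead of a role attribute."""
    if MIMO_LANGUAGE_MODULE_KEY not in modules:
        raise KeyError(
            f"expected a module registered under {MIMO_LANGUAGE_MODULE_KEY!r}, "
            f"found {sorted(modules)}"
        )
    return modules[MIMO_LANGUAGE_MODULE_KEY]
```

With one constant, call sites such as the training step no longer depend on an attribute that can be removed upstream.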

3. Trigger Configuration

MIMO training after MCore bump
Vision encoders producing 2D tensors, LLM producing 3D tensors

4. Nature of the Bug

Classification: CODE BUG - MCore API compatibility issues in MIMO training

5. Existing Test Coverage

In Fix PR: YES - 5 test files:

tests/e2e/mimo/test_mimo_checkpoint_resume_e2e.py
tests/e2e/mimo/test_mimo_training_e2e.py
tests/e2e/mimo/test_mimo_training_llava.py
tests/unit_tests/training/mimo/test_mimo_checkpointing.py
tests/unit_tests/training/mimo/test_pretrain_mimo.py

In NMFW Tests: Not specifically covering MIMO MCore compatibility

6. Coverage Assessment

Test Type	Exists	Covers Bug
Fix PR unit tests	YES	YES
Fix PR e2e tests	YES	YES
NMFW regression tests	NO	N/A

7. New Regression Test

NOT NEEDED - Fix PR includes comprehensive unit and e2e tests for MIMO training.

8. Conclusion

Verdict: ADEQUATE COVERAGE - Fix PR includes 5 test files covering MIMO training layer.

yashaswikarnati force-pushed the mimo/phase5-mcore-bump-fixes branch 3 times, most recently from 339f237 to 8959b16 Compare March 25, 2026 07:29

aroshanghias-nvd force-pushed the mimo/phase5-checkpointing-rebuild branch from cd3f7fc to e4d2fdf Compare March 25, 2026 17:01

fix(mimo): bump MCore submodule and migrate llm key to language

339d640

aroshanghias-nvd force-pushed the mimo/phase5-mcore-bump-fixes branch from 8959b16 to 339d640 Compare March 25, 2026 17:09

yaoyu-33 added the area:model, area:training, and bug labels Mar 25, 2026

yaoyu-33 approved these changes Mar 25, 2026

yaoyu-33 merged commit 28aa989 into NVIDIA-NeMo:mimo/phase5-checkpointing-rebuild Mar 25, 2026
2 checks passed

pruprakash added the qa_rcca_done label Apr 28, 2026

yaoyu-33 merged 1 commit into NVIDIA-NeMo:mimo/phase5-checkpointing-rebuild from yashaswikarnati:mimo/phase5-mcore-bump-fixes

4 participants
