
fix(mimo): adapt training layer for MCore submodule bump #2979

Merged
yaoyu-33 merged 1 commit into NVIDIA-NeMo:mimo/phase5-checkpointing-rebuild from yashaswikarnati:mimo/phase5-mcore-bump-fixes
Mar 25, 2026

Conversation

@yashaswikarnati

Contributor

Summary

  • Add module_output_ndim to MultiModulePipelineCommunicator for correct 2D/3D tensor routing (vision encoders produce 2D [S, H], LLM produces 3D [S, B, H])
  • Use MIMO_LANGUAGE_MODULE_KEY instead of removed role.language_module_name attribute in mimo_step.py and train_mimo.py
  • Remove language_module_key assertion from pretrain_mimo.py (removed from MimoModelConfig)
  • Clean up stale language_module_key / language_module_name references in test mocks
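The 2D/3D routing in the first bullet can be pictured with a minimal sketch. It assumes a per-module mapping from module name to output rank. The module names, the helper `expected_output_shape`, and the default of 3 for unlisted modules are hypothetical and are not taken from MCore. Only the `[S, H]` versus `[S, B, H]` distinction comes from this PR.

```python
from typing import Dict, Tuple

# Hypothetical mapping: module name -> rank of that module's output tensor.
# The names are illustrative; the PR only states that vision encoders emit
# 2D [S, H] tensors and the LLM emits 3D [S, B, H] tensors.
MODULE_OUTPUT_NDIM: Dict[str, int] = {"vision_encoder": 2, "language_model": 3}


def expected_output_shape(
    module: str,
    seq_len: int,
    batch_size: int,
    hidden: int,
    module_output_ndim: Dict[str, int] = MODULE_OUTPUT_NDIM,
) -> Tuple[int, ...]:
    """Return the shape a receiver should expect from `module`.

    Assumption: modules missing from the mapping default to 3D output.
    """
    ndim = module_output_ndim.get(module, 3)
    if ndim == 2:
        return (seq_len, hidden)  # [S, H]
    if ndim == 3:
        return (seq_len, batch_size, hidden)  # [S, B, H]
    raise ValueError(f"unsupported output rank {ndim} for module {module!r}")
```

A receiver that assumes one rank for every module would allocate a buffer of the wrong shape for one of the two producers, which is presumably the mismatch the new parameter addresses.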

Depends on: #2978 (phase 4 model layer fixes)

Test plan

  • Existing MIMO unit tests pass (159/162, 3 pre-existing failures in test_mimo_step)
  • E2e training test passes on 8 GPUs (torchrun --nproc_per_node=8 tests/e2e/mimo/test_mimo_training_e2e.py)

🤖 Generated with Claude Code

@copy-pr-bot

copy-pr-bot Bot commented Mar 25, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@yashaswikarnati force-pushed the mimo/phase5-mcore-bump-fixes branch 3 times, most recently from 339f237 to 8959b16 (March 25, 2026 07:29)
@aroshanghias-nvd force-pushed the mimo/phase5-checkpointing-rebuild branch from cd3f7fc to e4d2fdf (March 25, 2026 17:01)
@aroshanghias-nvd force-pushed the mimo/phase5-mcore-bump-fixes branch from 8959b16 to 339d640 (March 25, 2026 17:09)
@yaoyu-33 added labels area:model (Model implementations and HF bridge logic), area:training (Training loop, callbacks, and runtime integration), and bug (Something isn't working) on Mar 25, 2026
@yaoyu-33 merged commit 28aa989 into NVIDIA-NeMo:mimo/phase5-checkpointing-rebuild on Mar 25, 2026
2 checks passed
@aroshanghias-nvd

Contributor

cross_entropy_loss_fusion=True — Is this required by the new MCore, or is it an independent improvement? It's only added in the checkpoint resume test's _make_language_config(), not in the other e2e tests (test_mimo_training_e2e.py, test_mimo_training_llava.py). Should those tests also get this flag for consistency?

@pruprakash

QA RCCA Analysis

1. Fix Reference

2. Root Cause

MCore submodule bump required MIMO training layer adaptations:

  • Add module_output_ndim to MultiModulePipelineCommunicator for correct 2D/3D tensor routing
  • Replace removed role.language_module_name with MIMO_LANGUAGE_MODULE_KEY
  • Clean up stale references in tests
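As a rough illustration of the second item, the lookup can go through a shared constant instead of a per-role attribute. This is a sketch under stated assumptions: the constant's value here is a placeholder (the real value is not shown in this PR), and `get_language_module` is a hypothetical helper, not a function from `mimo_step.py` or `train_mimo.py`.

```python
from typing import Any, Dict

# Placeholder stand-in for the real constant; its actual value is not
# shown in this PR and may differ.
MIMO_LANGUAGE_MODULE_KEY = "language"


def get_language_module(modules: Dict[str, Any]) -> Any:
    """Fetch the language module by the shared key.

    Hypothetical helper: callers no longer read a name from a role object,
    so a removed attribute cannot break the lookup.
    """
    try:
        return modules[MIMO_LANGUAGE_MODULE_KEY]
    except KeyError as exc:
        raise KeyError(
            f"no module registered under {MIMO_LANGUAGE_MODULE_KEY!r}; "
            f"available keys: {sorted(modules)}"
        ) from exc
```

Test mocks written this way only need a dict containing the shared key, which is consistent with the cleanup of stale `language_module_key` / `language_module_name` references described in the summary.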

3. Trigger Configuration

  • MIMO training after MCore bump
  • Vision encoders producing 2D tensors, LLM producing 3D tensors

4. Nature of the Bug

Classification: CODE BUG - MCore API compatibility issues in MIMO training

5. Existing Test Coverage

In Fix PR: YES - 5 test files:

  • tests/e2e/mimo/test_mimo_checkpoint_resume_e2e.py
  • tests/e2e/mimo/test_mimo_training_e2e.py
  • tests/e2e/mimo/test_mimo_training_llava.py
  • tests/unit_tests/training/mimo/test_mimo_checkpointing.py
  • tests/unit_tests/training/mimo/test_pretrain_mimo.py

In NMFW Tests: Not specifically covering MIMO MCore compatibility

6. Coverage Assessment

| Test Type | Exists | Covers Bug |
|---|---|---|
| Fix PR unit tests | YES | YES |
| Fix PR e2e tests | YES | YES |
| NMFW regression tests | NO | N/A |

7. New Regression Test

NOT NEEDED - Fix PR includes comprehensive unit and e2e tests for MIMO training.

8. Conclusion

Verdict: ADEQUATE COVERAGE - Fix PR includes 5 test files covering MIMO training layer.

Labels

  • area:model: Model implementations and HF bridge logic
  • area:training: Training loop, callbacks, and runtime integration
  • bug: Something isn't working
  • qa_rcca_done

4 participants