rename models/mimo infra to MegatronMIMO#3340
Conversation
Signed-off-by: Li Ding <liding@nvidia.com>
|
/claude review |
|
A couple of files outside the PR still reference the old names:
Neither is a code-level breakage (just docstrings), but worth sweeping while you're at it. Otherwise the mechanical rename looks correct — imports, class names, test patches, and error messages are all consistently updated. |
📝 WalkthroughWalkthroughThis pull request migrates the codebase from MIMO-specific providers and configuration classes to OmniModal equivalents throughout examples, core bridge infrastructure, training utilities, and tests. Key renames include Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 3 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (3)
src/megatron/bridge/training/pretrain_mimo.py (1)
57-59: Enforce the documentedcfg.modelcontract at runtime.Line 57 documents
OmniModalProvider, butpretrain_mimo()does not validate it before callingsetup_mimo(). Add a fail-fast check so invalid configs raise a clear error instead of later attribute errors.Suggested patch
from megatron.bridge.training.config import ConfigContainer, mimo_runtime_config_update +from megatron.bridge.models.omni_modal.omni_modal_provider import OmniModalProvider from megatron.bridge.training.pretrain import _maybe_destroy_process_group from megatron.bridge.training.setup_mimo import setup_mimo @@ def pretrain_mimo( @@ ) -> None: @@ logger.info("Starting MIMO pretraining") + if not isinstance(cfg.model, OmniModalProvider): + raise TypeError( + f"pretrain_mimo requires cfg.model to be OmniModalProvider, got {type(cfg.model).__name__}." + )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/training/pretrain_mimo.py` around lines 57 - 59, pretrain_mimo currently documents that cfg.model must be an OmniModalProvider but doesn't validate this at runtime; add a fail-fast type/attribute check at the start of pretrain_mimo (before calling setup_mimo) to verify cfg is a ConfigContainer and cfg.model implements the OmniModalProvider contract (e.g., check isinstance(cfg.model, OmniModalProvider) or the presence of required methods/attributes used by setup_mimo), and raise a clear ValueError/TypeError with a descriptive message if the check fails so callers get an immediate, actionable error instead of later attribute errors.src/megatron/bridge/models/omni_modal/__init__.py (1)
14-20: Sort__all__to satisfy Ruff RUF022.
__all__is currently unsorted; aligning ordering avoids lint noise and keeps exports deterministic.As per coding guidelines: `**/*.py` must pass Ruff linting/formatting.♻️ Suggested change
__all__ = [ "LlavaOmniModalProvider", + "ModuleParallelismConfig", "OmniModalInfra", + "OmniModalParallelismConfig", "OmniModalProvider", - "OmniModalParallelismConfig", - "ModuleParallelismConfig", ]🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/models/omni_modal/__init__.py` around lines 14 - 20, The __all__ list in omni_modal/__init__.py is unsorted and causing Ruff RUF022; reorder the exported names alphabetically (e.g., "LlavaOmniModalProvider", "ModuleParallelismConfig", "OmniModalInfra", "OmniModalParallelismConfig", "OmniModalProvider") so the __all__ sequence is deterministic and lint-compliant; update the __all__ assignment to contain the same symbols sorted lexicographically.src/megatron/bridge/training/mimo_parallel_utils.py (1)
273-277: Add*separator for multiple same-type parameters invalidate_data_loader_contract.Line 273-277 has three
intparameters and should enforce keyword-only separation per repo style rule.♻️ Proposed refactor
def validate_data_loader_contract( infra: OmniModalInfra, + *, global_batch_size: int, micro_batch_size: int, num_microbatches: int, ):As per coding guidelines, "**/*.py: Use
*separator for functions with multiple same-type parameters".🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/training/mimo_parallel_utils.py` around lines 273 - 277, The function validate_data_loader_contract has three same-typed int params (global_batch_size, micro_batch_size, num_microbatches) and needs a keyword-only separator; update the signature to add a '*' before those params (e.g., def validate_data_loader_contract(infra: OmniModalInfra, *, global_batch_size: int, micro_batch_size: int, num_microbatches: int):) and adjust any callers to pass those three arguments by keyword if necessary.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/models/omni_modal/run_hetero_llava.sh`:
- Around line 7-10: Replace the distributed-launch command invocation: locate
the shell script that contains the current command starting with "uv run
torchrun" (in examples/models/omni_modal/run_hetero_llava.sh) and update that
invocation to use the repository convention "uv run python -m
torch.distributed.run" while keeping the existing flags/arguments
(--nproc_per_node "$GPUS_PER_NODE", --nnodes "$NUM_NODES", and the target script
examples/models/omni_modal/test_mimo_training_llava.py) unchanged so the
behavior and parameters remain the same.
In `@src/megatron/bridge/data/mimo/loaders.py`:
- Around line 75-78: The raised ValueError message referencing "run
build_model()" is misleading; update the message where OmniModalProvider._grids
is checked (the block raising ValueError for cfg.model._grids is None) to
instruct callers to run the provider initialization path (e.g., setup_mimo() or
the provider infra that triggers build_model()) so it points to the actual setup
entrypoint; modify the error string to mention OmniModalProvider._grids,
build_model(), and setup_mimo()/provider initialization so users know to
initialize the provider via setup_mimo() (or equivalent) before building data
loaders.
In `@tests/unit_tests/data/mimo/test_loaders.py`:
- Line 46: The test assertion using pytest.raises should escape the literal dot
so the regex matches exactly; update the pytest.raises(...) call in
tests/unit_tests/data/mimo/test_loaders.py (the line using
pytest.raises(ValueError, match="cfg.model must be OmniModalProvider")) to use
an escaped/raw regex for the dot (e.g., r"cfg\.model must be OmniModalProvider")
so the period is treated literally and Ruff RUF043 is resolved.
In `@tests/unit_tests/models/omni_modal/test_omni_modal_ddp.py`:
- Line 14: Update the stale docstring that reads "Create a mock MimoModel
(MCore) for testing." in the test_omni_modal_ddp.py test to use OmniModal
terminology instead (e.g., "Create a mock OmniModal (MCore) for testing.") so
the docstring matches the OmniModal test path and naming.
---
Nitpick comments:
In `@src/megatron/bridge/models/omni_modal/__init__.py`:
- Around line 14-20: The __all__ list in omni_modal/__init__.py is unsorted and
causing Ruff RUF022; reorder the exported names alphabetically (e.g.,
"LlavaOmniModalProvider", "ModuleParallelismConfig", "OmniModalInfra",
"OmniModalParallelismConfig", "OmniModalProvider") so the __all__ sequence is
deterministic and lint-compliant; update the __all__ assignment to contain the
same symbols sorted lexicographically.
In `@src/megatron/bridge/training/mimo_parallel_utils.py`:
- Around line 273-277: The function validate_data_loader_contract has three
same-typed int params (global_batch_size, micro_batch_size, num_microbatches)
and needs a keyword-only separator; update the signature to add a '*' before
those params (e.g., def validate_data_loader_contract(infra: OmniModalInfra, *,
global_batch_size: int, micro_batch_size: int, num_microbatches: int):) and
adjust any callers to pass those three arguments by keyword if necessary.
In `@src/megatron/bridge/training/pretrain_mimo.py`:
- Around line 57-59: pretrain_mimo currently documents that cfg.model must be an
OmniModalProvider but doesn't validate this at runtime; add a fail-fast
type/attribute check at the start of pretrain_mimo (before calling setup_mimo)
to verify cfg is a ConfigContainer and cfg.model implements the
OmniModalProvider contract (e.g., check isinstance(cfg.model, OmniModalProvider)
or the presence of required methods/attributes used by setup_mimo), and raise a
clear ValueError/TypeError with a descriptive message if the check fails so
callers get an immediate, actionable error instead of later attribute errors.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 96e377e7-7820-4ef1-8c38-df91efb21837
📒 Files selected for processing (41)
examples/models/omni_modal/__init__.pyexamples/models/omni_modal/convert_hf_clip_to_megatron.pyexamples/models/omni_modal/convert_hf_llama_to_megatron.pyexamples/models/omni_modal/run_conversion_verification.shexamples/models/omni_modal/run_hetero_llava.shexamples/models/omni_modal/run_hetero_llava_parallelism_tests.shexamples/models/omni_modal/run_hetero_llava_parallelism_tests_unfrozen_llm.shexamples/models/omni_modal/run_mimo_checkpoint_resume.shexamples/models/omni_modal/run_mimo_parallelism_tests.shexamples/models/omni_modal/test_mimo_checkpoint_resume_e2e.pyexamples/models/omni_modal/test_mimo_training_e2e.pyexamples/models/omni_modal/test_mimo_training_llava.pyexamples/models/omni_modal/verify_clip_conversion.pyexamples/models/omni_modal/verify_llama_conversion.pysrc/megatron/bridge/data/loaders.pysrc/megatron/bridge/data/mimo/dp_utils.pysrc/megatron/bridge/data/mimo/loaders.pysrc/megatron/bridge/models/mimo/__init__.pysrc/megatron/bridge/models/omni_modal/__init__.pysrc/megatron/bridge/models/omni_modal/llava_provider.pysrc/megatron/bridge/models/omni_modal/omni_modal_builder.pysrc/megatron/bridge/models/omni_modal/omni_modal_config.pysrc/megatron/bridge/models/omni_modal/omni_modal_ddp.pysrc/megatron/bridge/models/omni_modal/omni_modal_provider.pysrc/megatron/bridge/training/checkpointing.pysrc/megatron/bridge/training/config.pysrc/megatron/bridge/training/mimo_parallel_utils.pysrc/megatron/bridge/training/pretrain_mimo.pysrc/megatron/bridge/training/setup_mimo.pysrc/megatron/bridge/training/train_mimo.pysrc/megatron/bridge/training/utils/theoretical_memory_utils.pytests/functional_tests/test_groups/training/test_pretrain_mimo.pytests/unit_tests/data/mimo/test_dp_utils.pytests/unit_tests/data/mimo/test_loaders.pytests/unit_tests/models/omni_modal/__init__.pytests/unit_tests/models/omni_modal/test_llava_provider.pytests/unit_tests/models/omni_modal/test_omni_modal_builder.pytests/unit_tests/models/omni_modal/test_omni_modal_ddp.pytests/unit_tests/models/omni_modal/test_omni_modal_provider.pytests/unit_tests/training/mimo/test_mimo_checkpointing.pytests/unit_tests/training/mimo/test_mimo_config.py
| uv run torchrun \ | ||
| --nproc_per_node "$GPUS_PER_NODE" \ | ||
| --nnodes "$NUM_NODES" \ | ||
| tests/e2e/mimo/test_mimo_training_llava.py \ | ||
| examples/models/omni_modal/test_mimo_training_llava.py \ |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check whether `uv run torchrun` is still used elsewhere.
# Expected: no remaining occurrences once standardized.
rg -n --type=sh -C1 'uv run torchrun|uv run python -m torch\.distributed\.run'Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 35077
Replace uv run torchrun with uv run python -m torch.distributed.run to align with repository conventions.
The codebase standardizes on uv run python -m torch.distributed.run for distributed execution. Update lines 7–10 accordingly.
Suggested diff
-uv run torchrun \
+uv run python -m torch.distributed.run \
--nproc_per_node "$GPUS_PER_NODE" \
--nnodes "$NUM_NODES" \
examples/models/omni_modal/test_mimo_training_llava.py \📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| uv run torchrun \ | |
| --nproc_per_node "$GPUS_PER_NODE" \ | |
| --nnodes "$NUM_NODES" \ | |
| tests/e2e/mimo/test_mimo_training_llava.py \ | |
| examples/models/omni_modal/test_mimo_training_llava.py \ | |
| uv run python -m torch.distributed.run \ | |
| --nproc_per_node "$GPUS_PER_NODE" \ | |
| --nnodes "$NUM_NODES" \ | |
| examples/models/omni_modal/test_mimo_training_llava.py \ |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/models/omni_modal/run_hetero_llava.sh` around lines 7 - 10, Replace
the distributed-launch command invocation: locate the shell script that contains
the current command starting with "uv run torchrun" (in
examples/models/omni_modal/run_hetero_llava.sh) and update that invocation to
use the repository convention "uv run python -m torch.distributed.run" while
keeping the existing flags/arguments (--nproc_per_node "$GPUS_PER_NODE",
--nnodes "$NUM_NODES", and the target script
examples/models/omni_modal/test_mimo_training_llava.py) unchanged so the
behavior and parameters remain the same.
| if cfg.model._grids is None: | ||
| raise ValueError( | ||
| "MimoModelProvider._grids is None. Ensure build_model() is called before building data loaders." | ||
| "OmniModalProvider._grids is None. Ensure build_model() is called before building data loaders." | ||
| ) |
There was a problem hiding this comment.
Clarify the _grids error guidance to reference the actual setup path.
Line 77 asks callers to run build_model(), but this path is driven by setup_mimo() / provider infra initialization. The current message can misdirect debugging.
Suggested patch
if cfg.model._grids is None:
raise ValueError(
- "OmniModalProvider._grids is None. Ensure build_model() is called before building data loaders."
+ "OmniModalProvider._grids is None. Ensure setup_mimo() (or OmniModalProvider.build_infra()) "
+ "has run before building data loaders."
)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/megatron/bridge/data/mimo/loaders.py` around lines 75 - 78, The raised
ValueError message referencing "run build_model()" is misleading; update the
message where OmniModalProvider._grids is checked (the block raising ValueError
for cfg.model._grids is None) to instruct callers to run the provider
initialization path (e.g., setup_mimo() or the provider infra that triggers
build_model()) so it points to the actual setup entrypoint; modify the error
string to mention OmniModalProvider._grids, build_model(), and
setup_mimo()/provider initialization so users know to initialize the provider
via setup_mimo() (or equivalent) before building data loaders.
| @@ -43,7 +43,7 @@ def test_build_mimo_data_loaders_raises_when_model_not_mimo(monkeypatch): | |||
| cfg = SimpleNamespace(model=object(), train=SimpleNamespace(micro_batch_size=2)) | |||
| provider = FakeProvider() | |||
|
|
|||
| with pytest.raises(ValueError, match="cfg.model must be MimoModelProvider"): | |||
| with pytest.raises(ValueError, match="cfg.model must be OmniModalProvider"): | |||
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify the regex warning is resolved in this file.
uv run ruff check tests/unit_tests/data/mimo/test_loaders.py --select RUF043Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 113
🏁 Script executed:
# Read the file to see the current state at line 46
sed -n '40,52p' tests/unit_tests/data/mimo/test_loaders.pyRepository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 618
🏁 Script executed:
# Check Python regex behavior: does unescaped dot match only dot or any character?
python3 << 'EOF'
import re
# Test the pattern with unescaped dot
pattern_unescaped = "cfg.model must be OmniModalProvider"
pattern_escaped = r"cfg\.model must be OmniModalProvider"
test_string = "cfg.model must be OmniModalProvider"
# Test unescaped (dot matches any char)
match_unescaped = re.search(pattern_unescaped, test_string)
print(f"Unescaped pattern matches: {match_unescaped is not None}")
# Test escaped (dot matches literal dot)
match_escaped = re.search(pattern_escaped, test_string)
print(f"Escaped pattern matches: {match_escaped is not None}")
# Show what unescaped dot matches
test_string2 = "cfgXmodel must be OmniModalProvider" # Replace . with X
match_unescaped2 = re.search(pattern_unescaped, test_string2)
print(f"Unescaped pattern matches 'cfgXmodel...' (X instead of dot): {match_unescaped2 is not None}")
EOFRepository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 198
🏁 Script executed:
# Find where the ValueError is raised to see the exact error message
rg 'cfg\.model must be OmniModalProvider' --type pyRepository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 322
🏁 Script executed:
# Try running ruff check with python -m ruff if available
python3 -m ruff check tests/unit_tests/data/mimo/test_loaders.py --select RUF043 2>&1 || echo "ruff not available via python -m"Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 877
🏁 Script executed:
# Check if ruff is installed and callable directly
ruff check tests/unit_tests/data/mimo/test_loaders.py --select RUF043 2>&1 || echo "ruff check failed"Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 862
Escape the dot in the match regex at Line 46.
Line 46 uses match="cfg.model must be OmniModalProvider", where . is treated as a regex wildcard matching any character. Use a raw escaped pattern for exact matching and to clear the RUF043 linting error.
Suggested fix
- with pytest.raises(ValueError, match="cfg.model must be OmniModalProvider"):
+ with pytest.raises(ValueError, match=r"cfg\.model must be OmniModalProvider"):Per the coding guidelines, all Python files must pass Ruff checks.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| with pytest.raises(ValueError, match="cfg.model must be OmniModalProvider"): | |
| with pytest.raises(ValueError, match=r"cfg\.model must be OmniModalProvider"): |
🧰 Tools
🪛 Ruff (0.15.10)
[warning] 46-46: Pattern passed to match= contains metacharacters but is neither escaped nor raw
(RUF043)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/unit_tests/data/mimo/test_loaders.py` at line 46, The test assertion
using pytest.raises should escape the literal dot so the regex matches exactly;
update the pytest.raises(...) call in tests/unit_tests/data/mimo/test_loaders.py
(the line using pytest.raises(ValueError, match="cfg.model must be
OmniModalProvider")) to use an escaped/raw regex for the dot (e.g., r"cfg\.model
must be OmniModalProvider") so the period is treated literally and Ruff RUF043
is resolved.
| def _create_mock_mimo_model(self, has_language_model=True, modality_names=None): | ||
| """Create a mock MimoModel for testing.""" | ||
| def _create_mock_omni_modal_model(self, has_language_model=True, modality_names=None): | ||
| """Create a mock MimoModel (MCore) for testing.""" |
There was a problem hiding this comment.
Update stale docstring naming to OmniModal terminology.
Line 14 still mentions MimoModel; this is now testing the OmniModal path and should use consistent naming.
✏️ Suggested text fix
- """Create a mock MimoModel (MCore) for testing."""
+ """Create a mock OmniModal model (MCore) for testing."""📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| """Create a mock MimoModel (MCore) for testing.""" | |
| """Create a mock OmniModal model (MCore) for testing.""" |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/unit_tests/models/omni_modal/test_omni_modal_ddp.py` at line 14, Update
the stale docstring that reads "Create a mock MimoModel (MCore) for testing." in
the test_omni_modal_ddp.py test to use OmniModal terminology instead (e.g.,
"Create a mock OmniModal (MCore) for testing.") so the docstring matches the
OmniModal test path and naming.
|
/ok to test 1d467d5 |
Signed-off-by: Li Ding <liding@nvidia.com>
Signed-off-by: Li Ding <liding@nvidia.com>
|
/ok to test e8ed80e |
|
Rename to MegatronMIMO |
Signed-off-by: Li Ding <liding@nvidia.com>
|
/ok to test 1d58f11 |
|
/claude review |
| Args: | ||
| cfg: ConfigContainer with training configuration. ``cfg.model`` must be | ||
| a ``MimoModelProvider``. ``cfg.optimizer`` (a ``BridgeOptimizerConfig``) | ||
| an ``MegatronMIMOProvider``. ``cfg.optimizer`` (a ``BridgeOptimizerConfig``) |
There was a problem hiding this comment.
Grammar: "MegatronMIMO" starts with a consonant sound → should be "a", not "an". The old code correctly had `a ``MimoModelProvider```.
| an ``MegatronMIMOProvider``. ``cfg.optimizer`` (a ``BridgeOptimizerConfig``) | |
| a ``MegatronMIMOProvider``. ``cfg.optimizer`` (a ``BridgeOptimizerConfig``) |
Same issue in setup_megatron_mimo.py lines 178 and 188.
| provider = MimoModelProvider( | ||
| provider = MegatronMIMOProvider( | ||
| language_model_spec=language_spec, | ||
| modality_submodules_spec={"images": modality_spec}, |
There was a problem hiding this comment.
Same "an" → "a" grammar fix:
| modality_submodules_spec={"images": modality_spec}, | |
| """Create a MegatronMIMOParallelismConfig.""" |
|
Light review — clean mechanical rename overall. Two minor issues:
No logic issues found. The old |
|
/ok to test 43d7a22 |
|
/ok to test 43d7a22 |
|
/ok to test 43d7a22 |
Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com> Signed-off-by: Chen Cui <cxcui@alumni.cmu.edu>
|
/ok to test fa4bf87 |
|
/ok to test c8457a6 |
Signed-off-by: Li Ding <liding@nvidia.com> Signed-off-by: Chen Cui <cxcui@alumni.cmu.edu> Co-authored-by: Chen Cui <chcui@nvidia.com> Co-authored-by: claude[bot] <209825114+claude[bot]@users.noreply.github.com> Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
Rename our heterogeneous-parallelism framework to MegatronMIMO /
megatron_mimoto disambiguatefrom MCore's upstream
MimoModeland the community MiMo model bridge (MimoBridge), both of whichstay unchanged.
Convention
MegatronMIMO*(PascalCase, acronym UPPER)megatron_mimo_*(snake_case)Preserved (not renamed):
megatron.core.models.mimo.*(MimoModel,MimoOptimizer,MIMO_LANGUAGE_MODULE_KEY,.mimo_config)src/megatron/bridge/models/mimo/mimo_bridge.py(MimoBridge,"MiMoForCausalLM")Pure structural refactor — no logic changes