[model, recipe, test] refactor: remove specific model provider dataclasses by yaoyu-33 · Pull Request #2599 · NVIDIA-NeMo/Megatron-Bridge

yaoyu-33 · 2026-02-28T18:59:41Z

Summary

Remove size-specific model provider dataclasses that are no longer needed now that bridges dynamically configure model architectures via GPTModelProvider, MLAModelProvider, and MambaModelProvider.

Deleted provider files (7):

llama_provider.py — Llama{2,3,31,32,4}ModelProvider* (26 classes)
nemotron_provider.py — Nemotron{3,4}ModelProvider* (8 classes)
gpt_oss_provider.py — GPTOSSModelProvider* (4 classes)
deepseek_provider.py — DeepSeek{V2,V3}ModelProvider* (4 classes)
kimi_provider.py — KimiK2Provider
glm45_provider.py — GLM45{Air,Plus}ModelProvider* (5 classes)
nemotron_h_provider.py — NemotronH{,Nano}ModelProvider* (6 classes)

Refactored providers with cross-dependencies (3):

llama_nemotron_provider.py — now inherits directly from GPTModelProvider
glm_45v_provider.py — now inherits directly from GPTModelProvider (inlined GLM MoE defaults)
nemotron_vl_provider.py — now inherits directly from MambaModelProvider (inlined NemotronH defaults)

Updated recipes to use base providers (GPTModelProvider, MLAModelProvider, MambaModelProvider) with inlined architecture parameters instead of importing deleted specific providers.

Cleaned up tests:

Deleted 7 unit test files that exclusively tested removed providers
Updated remaining unit and functional tests to use base provider classes

Net: -1,947 lines (~3,527 deleted, ~1,580 added for inlined parameters)

Test plan

Unit tests pass for updated recipe tests (kimi, moonlight, nemotronh, glm45, run_plugins)
Unit tests pass for updated model tests (llama_nemotron, glm_vl, models_imports)
Functional training tests pass with inlined GPTModelProvider parameters
No import errors from removed provider exports

Made with Cursor

Summary by CodeRabbit

Release Notes

Refactor
- Consolidated model provider APIs: Specific model provider classes have been simplified and unified through Bridge-based interfaces for improved maintainability.
- Updated model configuration architecture across supported models for consistency.
Chores
- Removed deprecated model provider implementations and associated tests.

copy-pr-bot · 2026-02-28T18:59:44Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

yaoyu-33 · 2026-02-28T18:59:48Z

/ok to test 0bca980

coderabbitai · 2026-02-28T19:08:08Z

📝 Walkthrough

Walkthrough

This pull request consolidates and refactors the model provider API by removing specialized provider classes across multiple model families (Llama, Llama Nemotron, Nemotron, NemotronH, DeepSeek, GLM, GPT-OSS, Kimi) and replacing them with unified providers (GPTModelProvider, MLAModelProvider, MambaModelProvider). Corresponding recipe and test updates reflect the new provider structure, with bridges becoming the primary public API surface.

Changes

Cohort / File(s)	Summary
Package initialization and provider removals `src/megatron/bridge/models/__init__.py`, `src/megatron/bridge/models/deepseek/__init__.py`, `src/megatron/bridge/models/glm/__init__.py`, `src/megatron/bridge/models/gpt_oss/__init__.py`, `src/megatron/bridge/models/kimi/__init__.py`, `src/megatron/bridge/models/llama/__init__.py`, `src/megatron/bridge/models/nemotron/__init__.py`, `src/megatron/bridge/models/nemotron_vl/__init__.py`, `src/megatron/bridge/models/nemotronh/__init__.py`, `src/megatron/bridge/models/llama_nemotron/__init__.py`	Removed numerous model-specific provider class exports from public APIs; consolidated to Bridge classes as the primary public interface (e.g., LlamaBridge, NemotronHBridge, DeepSeekV2Bridge). Updated all lists to reflect narrowed export surfaces.
Deleted provider implementation files `src/megatron/bridge/models/deepseek/deepseek_provider.py`, `src/megatron/bridge/models/gpt_oss/gpt_oss_provider.py`, `src/megatron/bridge/models/kimi/kimi_provider.py`, `src/megatron/bridge/models/llama/llama_provider.py`, `src/megatron/bridge/models/nemotron/nemotron_provider.py`, `src/megatron/bridge/models/nemotronh/nemotron_h_provider.py`	Entirely removed model-specific provider dataclass definitions (e.g., LlamaModelProvider, Llama2ModelProvider7B, NemotronModelProvider, NemotronHModelProvider, GLMMoEModelProvider, GPTOSSProvider). Elimination of per-model configuration defaults and deprecation helpers.
Provider inheritance refactoring `src/megatron/bridge/models/glm_vl/glm_45v_provider.py`, `src/megatron/bridge/models/llama_nemotron/llama_nemotron_provider.py`, `src/megatron/bridge/models/llama_nemotron/llama_nemotron_bridge.py`, `src/megatron/bridge/models/nemotron_vl/nemotron_vl_provider.py`, `src/megatron/bridge/models/nemotron_vl/nemotron_vl_bridge.py`	Updated base classes from removed model-specific providers (GLM45AirModelProvider106B, Llama31ModelProvider, NemotronNano12Bv2Provider) to unified providers (GPTModelProvider, MambaModelProvider). Added extensive public configuration fields to support flexible model instantiation without pre-baked defaults.
Bridge provider type updates `src/megatron/bridge/models/conversion/model_bridge.py`	Updated provider_bridge method return type from model-specific providers (e.g., LlamaModelProvider) to unified providers (GPTModelProvider).
Recipe configuration migration `src/megatron/bridge/recipes/kimi/kimi_k2.py`, `src/megatron/bridge/recipes/moonlight/moonlight_16b.py`, `src/megatron/bridge/recipes/nemotronh/nemotron_3_nano.py`, `src/megatron/bridge/recipes/nemotronh/nemotron_nano_v2.py`, `src/megatron/bridge/recipes/nemotronh/nemotronh.py`	Replaced model-specific provider class usage with AutoBridge or unified providers (MambaModelProvider, MLAModelProvider). Migrated architecture parameters from provider-class defaults to explicit configuration attributes passed to unified providers.
Unit test deletions `tests/unit_tests/models/deepseek/test_deepseek_provider.py`, `tests/unit_tests/models/glm/test_glm45_provider.py`, `tests/unit_tests/models/gpt_oss/test_gpt_oss_provider.py`, `tests/unit_tests/models/kimi/test_kimi_provider.py`, `tests/unit_tests/models/llama/test_llama_provider.py`, `tests/unit_tests/models/nemotron/test_nemotron_provider.py`, `tests/unit_tests/models/nemotronh/test_nemotron_h_provider.py`, `tests/unit_tests/models/llama_nemotron/test_llama_nemotron_bridge.py`	Removed test suites that validated defaults and configurations of deleted provider classes. Tests no longer needed as provider classes no longer exist.
Functional test updates `tests/functional_tests/data/test_samplers.py`, `tests/functional_tests/models/gpt_oss/test_gpt_oss_provider.py`, `tests/functional_tests/training/test_callbacks.py`, `tests/functional_tests/training/test_decentralized_pg.py`, `tests/functional_tests/training/test_finetune_dora.py`, `tests/functional_tests/training/test_finetune_lora.py`, `tests/functional_tests/training/test_inprocess_restart.py`, `tests/functional_tests/training/test_megatron_fsdp.py`, `tests/functional_tests/training/test_nvrx_straggler.py`, `tests/functional_tests/training/test_pretrain.py`, `tests/functional_tests/training/test_pretrain_resume.py`, `tests/functional_tests/training/test_sample_based_training.py`, `tests/functional_tests/training/test_sft.py`, `tests/functional_tests/training/test_tensor_inspect.py`	Replaced test model provider configurations from model-specific classes (Llama32ModelProvider1B, Llama3ModelProvider8B) to unified providers (GPTModelProvider). Extended configuration parameters to support explicit specification of model architecture details previously embedded in provider defaults.
Unit test recipe/integration updates `tests/unit_tests/models/glm_vl/test_glm_45v_provider.py`, `tests/unit_tests/recipes/kimi/test_kimi_k2.py`, `tests/unit_tests/recipes/nemotronh/test_nemotron_3_nano.py`, `tests/unit_tests/recipes/nemotronh/test_nemotron_nano_v2.py`, `tests/unit_tests/recipes/nemotronh/test_nemotronh.py`, `tests/unit_tests/recipes/test_glm45_recipes.py`, `tests/unit_tests/recipes/test_moonlight_recipes.py`, `tests/unit_tests/recipes/test_run_plugins.py`, `tests/unit_tests/training/test_config.py`, `tests/unit_tests/training/test_log_non_default_values.py`, `tests/unit_tests/models/test_models_imports.py`	Updated recipe tests to expect unified providers (MambaModelProvider) instead of removed model-specific providers. Added AutoBridge mocking fixtures. Updated import assertions and isinstance checks to reflect new provider structure.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

[model, refactor] refactor: Centralize provider_bridge config mapping in base class for Nemotron models #2225 — Directly refactors provider_bridge implementations and bridge decorators, consolidating specialized provider classes to unified providers (GPTModelProvider, MLAModelProvider), aligning with this PR's provider consolidation strategy.
[model, refactor] refactor: Centralize provider_bridge config mapping in base class #2052 — Continues provider_bridge consolidation to GPTModelProvider/MLAModelProvider with CONFIG_MAPPING/PROVIDER_CLASS pattern alignment, extending the refactoring initiated in this PR.
Nemotron-3-Nano Model Support #1914 — Introduces Nemotron-3-Nano and extends NemotronH provider/bridge exports, directly conflicting with and related to this PR's removal of Nemotron-H and Nemotron-Nano provider classes.

Suggested labels

enhancement

Suggested reviewers

ko3n1g
maanug-nv

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Test Results For Major Changes	⚠️ Warning	PR removes 7 provider modules and 1,408+ lines of tests with only a test plan, no actual CI logs or pass/fail evidence, plus confirmed code issues including type annotation error and redundant code.	Document CI/test logs, resolve identified code review issues, verify imports work for deleted providers, and include regression testing results before merging this large refactoring.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title '[model, recipe, test] refactor: remove specific model provider dataclasses' is clear, specific, and directly describes the main change—removing redundant size-specific model provider classes now replaced by bridges that dynamically configure base providers.
Docstring Coverage	✅ Passed	Docstring coverage is 81.25% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings (stacked PR)
📝 Generate docstrings (commit on current branch)

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch yuya/remove-model-providers

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/megatron/bridge/recipes/kimi/kimi_k2.py (1)
83-86: ⚠️ Potential issue | 🔴 Critical

KimiBridge implementation is missing – the recipe will fail at runtime.

The recipe imports KimiBridge from src/megatron/bridge/models/kimi/kimi_bridge.py (via kimi/__init__.py), but this file does not exist in the codebase. When AutoBridge.from_hf_pretrained("moonshotai/Kimi-K2-Instruct") is called, the dispatch system will fail to find a registered bridge implementation for the Kimi model architecture, causing a runtime error before cfg.model.vocab_size can even be accessed.

The bridge must be implemented similar to DeepSeekV2Bridge (which uses MLAModelProvider and @MegatronModelBridge.register_bridge) to register with the dispatch system. Once implemented, vocab_size will be available from the HF config mapping (GPTModelProvider has vocab_size: Optional[int] = None and model_bridge.py maps it via CONFIG_MAPPING).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/recipes/kimi/kimi_k2.py` around lines 83 - 86, The recipe
fails at runtime because the KimiBridge implementation is missing; create a new
bridge class named KimiBridge that follows the pattern of DeepSeekV2Bridge:
annotate it with `@MegatronModelBridge.register_bridge` and implement the required
interface methods, use MLAModelProvider (or the appropriate ModelProvider) to
map HF config fields (ensure vocab_size is exposed via
CONFIG_MAPPING/GPTModelProvider semantics) and register the bridge so
AutoBridge.from_hf_pretrained("moonshotai/Kimi-K2-Instruct") can dispatch
correctly; mirror DeepSeekV2Bridge’s constructors and provider wiring to ensure
cfg.model.vocab_size is populated.

🧹 Nitpick comments (12)

tests/functional_tests/data/test_samplers.py (1)

620-622: LGTM!

Migration is correct.

Optional: The _DummyBridge class with GPTModelProvider is duplicated across 5 test methods. Consider extracting to a module-level helper or pytest fixture if this pattern expands further.

,

♻️ Optional refactor using pytest fixture

# At module level or in conftest.py
`@pytest.fixture`
def mock_llama3_bridge():
    """Mock AutoBridge to avoid HF downloads, returning GPTModelProvider."""
    from unittest import mock as _mock

    class _DummyBridge:
        def to_megatron_provider(self, load_weights=False):
            from megatron.bridge.models.gpt_provider import GPTModelProvider
            return GPTModelProvider()

    with _mock.patch(
        "megatron.bridge.recipes.llama.llama3.AutoBridge.from_hf_pretrained",
        return_value=_DummyBridge()
    ):
        yield

Then tests can simply use def test_...(self, mock_llama3_bridge): without the repeated boilerplate.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/data/test_samplers.py` around lines 620 - 622, Extract
the repeated _DummyBridge + GPTModelProvider pattern into a reusable pytest
fixture or module-level helper to avoid duplication: create a fixture (e.g.,
mock_llama3_bridge) that defines class _DummyBridge with method
to_megatron_provider returning GPTModelProvider, and use unittest.mock.patch to
patch "megatron.bridge.recipes.llama.llama3.AutoBridge.from_hf_pretrained" to
return that _DummyBridge, then update the five tests to accept the fixture
(e.g., def test_...(mock_llama3_bridge)) instead of repeating the inline class
and patch.

tests/unit_tests/recipes/test_moonlight_recipes.py (1)

136-136: Consolidate repeated MLAModelProvider monkeypatching into a pytest fixture.

The identical monkeypatch.setattr(mod, "MLAModelProvider", _FakeMoonlightModelProvider16B) setup is repeated in many tests. Please move this into a shared fixture (module-level or reusable helper fixture) to reduce duplication and keep setup consistent.

As per coding guidelines, "tests/**/*.py: Use pytest fixtures for common setup in unit tests".

Also applies to: 171-171, 196-196, 222-222, 238-238, 261-261, 284-284, 307-307, 329-329
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit_tests/recipes/test_moonlight_recipes.py` at line 136, Create a
reusable pytest fixture that performs the repeated monkeypatching of
MLAModelProvider to _FakeMoonlightModelProvider16B and replace the duplicated
monkeypatch.setattr(mod, "MLAModelProvider", _FakeMoonlightModelProvider16B)
lines in tests with that fixture; implement a fixture (e.g.,
fake_mlamodel_provider or patch_mla_provider) that accepts monkeypatch, calls
monkeypatch.setattr(mod, "MLAModelProvider", _FakeMoonlightModelProvider16B),
mark it module- or session-scoped as appropriate, and add the fixture name to
the tests that previously had the inline monkeypatch to eliminate duplication
while preserving behavior.

src/megatron/bridge/recipes/moonlight/moonlight_16b.py (2)

413-468: Optional: Extract shared architecture parameters to reduce duplication.

The MLAModelProvider instantiation in moonlight_16b_peft_config duplicates ~40 architecture parameters that are identical to moonlight_16b_sft_config (lines 200-241 vs 414-454). Only the parallelism settings differ.

This is acceptable for clarity and explicit configuration, but if maintainability becomes a concern, consider extracting the shared architecture parameters into a module-level constant.
♻️ Example: Extract shared architecture parameters
# Module-level constant for shared architecture
_MOONLIGHT_16B_ARCH = dict(
    num_layers=27,
    hidden_size=2048,
    ffn_hidden_size=11264,
    # ... remaining shared params
)

def moonlight_16b_sft_config() -> ConfigContainer:
    cfg.model = MLAModelProvider(
        **_MOONLIGHT_16B_ARCH,
        # SFT-specific parallelism
        tensor_model_parallel_size=2,
        expert_model_parallel_size=8,
        sequence_parallel=True,
        # ...
    )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/recipes/moonlight/moonlight_16b.py` around lines 413 -
468, The MLAModelProvider block in moonlight_16b_peft_config duplicates many
architecture params already used in moonlight_16b_sft_config; extract the shared
architecture into a module-level constant (e.g., _MOONLIGHT_16B_ARCH) containing
the common keys (num_layers, hidden_size, ffn_hidden_size, num_attention_heads,
kv_channels, vocab_size, rotary_*, layernorm_epsilon, init_method_std,
activation_func, gated_linear_unit, etc.), then update both
moonlight_16b_sft_config and moonlight_16b_peft_config to call
MLAModelProvider(**_MOONLIGHT_16B_ARCH, <their specific overrides>) so only
parallelism and config-specific overrides remain in each function.
199-255: Consider removing redundant parameter assignments.

The MLAModelProvider constructor sets moe_token_dispatcher_type="alltoall" (line 241) and moe_grouped_gemm=True (line 240), but these are reassigned later in the function:

cfg.model.moe_token_dispatcher_type at lines 284

cfg.model.moe_grouped_gemm at line 308

cfg.model.moe_permute_fusion at lines 274 and 307

While functionally correct, the duplication could lead to drift if only one location is updated.
♻️ Suggested cleanup: remove redundant assignments from constructor or later code

Either remove from the constructor (if they're always overridden by the DeePEP logic):
         moe_grouped_gemm=True,
-        moe_token_dispatcher_type="alltoall",
         # Parallelism
Or consolidate all MoE settings in one place and remove the later redundant assignments.
Note: The Ruff S106 warnings about moe_token_dispatcher_type are false positives—this is a Mixture-of-Experts configuration parameter, not a password.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/recipes/moonlight/moonlight_16b.py` around lines 199 -
255, The constructor for MLAModelProvider sets MoE-related defaults
(moe_grouped_gemm, moe_token_dispatcher_type, moe_permute_fusion) that are later
reassigned on cfg.model (e.g., cfg.model.moe_token_dispatcher_type,
cfg.model.moe_grouped_gemm, cfg.model.moe_permute_fusion); remove the redundant
assignments either from the MLAModelProvider(...) argument list or from the
later reassignment block so all MoE settings are defined in one place—locate
MLAModelProvider(...) in moonlight_16b.py and the subsequent lines where
cfg.model.moe_permute_fusion, cfg.model.moe_token_dispatcher_type, and
cfg.model.moe_grouped_gemm are set, then keep only the preferred location and
delete the duplicates.

tests/unit_tests/training/test_config.py (1)

79-90: Consider renaming the helper to reflect the new abstraction.

create_test_deepseek_config now returns MLAModelProvider. A neutral helper name (e.g., create_test_mla_config) would reduce ambiguity as MLA coverage expands.

♻️ Optional clarity refactor

-def create_test_deepseek_config(**kwargs: Any) -> MLAModelProvider:
-    """Creates an instance of MLAModelProvider for testing."""
+def create_test_mla_config(**kwargs: Any) -> MLAModelProvider:
+    """Creates an instance of MLAModelProvider for testing."""
     defaults = {
         "num_layers": 1,
         "hidden_size": 128,
         "num_attention_heads": 4,
         "seq_length": 512,
         "apply_rope_fusion": False,
     }
     defaults.update(kwargs)
     return MLAModelProvider(**defaults)
+
+# Backward-compatible alias for existing tests.
+create_test_deepseek_config = create_test_mla_config

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/unit_tests/training/test_config.py` around lines 79 - 90, Rename the
helper function create_test_deepseek_config to a neutral name like
create_test_mla_config since it now returns MLAModelProvider; update the
function definition and all call sites to the new name (search for
create_test_deepseek_config and replace with create_test_mla_config) and keep
the same parameters/defaults and return MLAModelProvider to preserve behavior.

tests/unit_tests/recipes/nemotronh/test_nemotronh.py (1)

163-177: Consider simplifying the parameterized test.

Since all variants now use MambaModelProvider, the provider_cls parameter is redundant—every tuple maps to the same class. You could simplify by removing the parameter and using MambaModelProvider directly in the assertion.

That said, keeping the current structure is fine if you prefer consistency with other parameterized tests or want flexibility for future divergence.

♻️ Optional simplification

     `@pytest.mark.parametrize`(
-        "recipe_fn,provider_cls",
+        "recipe_fn",
         [
-            (nemotronh_4b_pretrain_config, MambaModelProvider),
-            (nemotronh_8b_pretrain_config, MambaModelProvider),
-            (nemotronh_47b_pretrain_config, MambaModelProvider),
-            (nemotronh_56b_pretrain_config, MambaModelProvider),
+            nemotronh_4b_pretrain_config,
+            nemotronh_8b_pretrain_config,
+            nemotronh_47b_pretrain_config,
+            nemotronh_56b_pretrain_config,
         ],
     )
-    def test_config_container_structure(self, recipe_fn, provider_cls):
+    def test_config_container_structure(self, recipe_fn):
         """Test that all configs return proper ConfigContainer with correct model provider."""
         config = recipe_fn()
 
         assert isinstance(config, ConfigContainer)
-        assert isinstance(config.model, provider_cls)
+        assert isinstance(config.model, MambaModelProvider)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/unit_tests/recipes/nemotronh/test_nemotronh.py` around lines 163 - 177,
The parameterized test test_config_container_structure currently includes a
redundant provider_cls parameter because all tuples use MambaModelProvider;
simplify by parameterizing only the recipe functions
(nemotronh_4b_pretrain_config, nemotronh_8b_pretrain_config,
nemotronh_47b_pretrain_config, nemotronh_56b_pretrain_config) and replace the
second assertion to directly assert isinstance(config.model, MambaModelProvider)
while keeping the initial assert isinstance(config, ConfigContainer); update the
`@pytest.mark.parametrize` decorator and the test signature accordingly to remove
provider_cls.

src/megatron/bridge/recipes/nemotronh/nemotron_3_nano.py (1)

211-268: Configuration duplication across pretrain/SFT/PEFT is notable but acceptable.

The SFT and PEFT configs duplicate the MambaModelProvider instantiation with ~40 identical architecture parameters. While this could be extracted into a helper function, this pattern appears consistent with other recipes in the codebase.

♻️ Optional: Extract common architecture parameters

def _nemotron_3_nano_model_kwargs(seq_length: int = 8192) -> dict:
    """Common Nemotron 3 Nano architecture parameters."""
    return dict(
        hybrid_override_pattern="MEMEM*EMEMEM*EMEMEM*EMEMEM*EMEMEM*EMEMEMEM*EMEMEMEME",
        num_layers=52,
        hidden_size=2688,
        mamba_num_heads=64,
        kv_channels=128,
        mamba_state_dim=128,
        # ... remaining common params
        seq_length=seq_length,
    )

# Usage in configs:
cfg.model = MambaModelProvider(
    **_nemotron_3_nano_model_kwargs(seq_length=8192),
    tensor_model_parallel_size=4,
    # ... parallelism params
)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/recipes/nemotronh/nemotron_3_nano.py` around lines 211 -
268, Multiple recipes duplicate the same MambaModelProvider architecture block;
extract those shared parameters into a helper function (e.g.,
_nemotron_3_nano_model_kwargs) that returns a dict of the common keys
(hybrid_override_pattern, num_layers, hidden_size, mamba_num_heads, kv_channels,
mamba_state_dim, ffn_hidden_size, seq_length, etc.), then update cfg.model
assignments in pretrain/SFT/PEFT to call
MambaModelProvider(**_nemotron_3_nano_model_kwargs(seq_length=...), <override or
additional parallelism args>) so only recipe-specific fields (like
tensor_model_parallel_size, pipeline_model_parallel_size, pipeline_dtype,
expert_model_parallel_size) remain inline.

tests/functional_tests/training/test_pretrain_resume.py (1)

47-72: LGTM with optional consolidation opportunity.

The Llama3ModelProvider145M class correctly inherits from GPTModelProvider with Llama-3-style defaults. Note that this class definition is nearly identical to the one in test_sft.py (with minor differences in seq_length and num_layers defaults). Consider extracting a shared test utility class in a future refactor to reduce duplication across functional tests.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/training/test_pretrain_resume.py` around lines 47 -
72, The Llama3ModelProvider145M test class duplicates a nearly identical
definition found in test_sft.py; refactor by extracting the shared defaults into
a single reusable test utility (e.g., a base class or factory function) and have
both Llama3ModelProvider145M and the counterpart in test_sft.py inherit from or
invoke that utility; update the unique overrides (seq_length, num_layers) in
each test-specific subclass or factory call and remove the duplicated field
definitions to centralize maintenance.

tests/functional_tests/training/test_finetune_lora.py (1)

53-80: LGTM!

The Llama3ModelProvider145M class correctly inherits from GPTModelProvider with Llama-3-style defaults. The explicit vocab_size=128256 is appropriate for LoRA finetuning tests that use HuggingFace tokenizers.

This is the third definition of Llama3ModelProvider145M across functional test files (also in test_sft.py and test_pretrain_resume.py). Consider consolidating these into a shared test utility module (e.g., tests/functional_tests/utils.py or a dedicated tests/functional_tests/training/model_providers.py) in a future refactor to reduce maintenance burden.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/training/test_finetune_lora.py` around lines 53 - 80,
Duplicate Llama3ModelProvider145M definitions should be consolidated into a
single shared test utility: create a new test helper module (e.g.,
tests/functional_tests/training/model_providers.py or utils.py), move the class
Llama3ModelProvider145M into that module, update usages in tests that currently
define the class (references in test_sft.py and test_pretrain_resume.py and this
file) to import Llama3ModelProvider145M from the new module, and remove the
duplicate class definitions so tests import the single canonical provider.

tests/functional_tests/training/test_decentralized_pg.py (1)

104-140: High duplication in model config setup across decentralized-PG tests.

A shared builder (with per-test overrides for TP/PP/CP/share-embeddings) would make these tests easier to maintain and less brittle.

Also applies to: 255-290, 410-446, 566-602, 722-758, 878-914

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/training/test_decentralized_pg.py` around lines 104 -
140, The duplicated GPTModelProvider block should be extracted into a shared
builder function (e.g., make_decentralized_model_cfg or build_base_gpt_model)
that returns a GPTModelProvider pre-populated with the common args shown
(normalization, activation_func, gated_linear_unit, position_embedding_type, ...
hidden_size, ffn_hidden_size, num_attention_heads, init_method_std,
layernorm_epsilon, rotary/rope settings, seq_length,
make_vocab_size_divisible_by, vocab_size, num_layers, etc.), and each test
should call that builder and apply per-test overrides for
tensor_model_parallel_size, pipeline_model_parallel_size, context_parallel_size,
sequence_parallel, and share_embeddings_and_output_weights; update all
referenced tests (the blocks at ~104-140, 255-290, 410-446, 566-602, 722-758,
878-914) to use the builder and only change the TP/PP/CP/share-embeddings fields
to reduce duplication and improve maintainability.

src/megatron/bridge/recipes/nemotronh/nemotron_nano_v2.py (1)

38-77: Consider factoring shared MambaModelProvider kwargs into reusable builders/constants.

These six blocks duplicate a large common parameter surface; extracting shared defaults would reduce maintenance overhead and accidental divergence across pretrain/SFT/PEFT variants.

Also applies to: 173-212, 310-349, 436-475, 571-610, 718-757

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/recipes/nemotronh/nemotron_nano_v2.py` around lines 38 -
77, Multiple MambaModelProvider invocations (e.g., the cfg.model =
MambaModelProvider block) duplicate a large common parameter set across
variants; factor those shared kwargs into a reusable constant or builder (for
example a MAMBA_DEFAULTS dict or a build_mamba_kwargs(name, overrides) helper)
and then construct each provider via unpacking the shared defaults plus only the
variant-specific overrides (use the existing symbols like MambaModelProvider and
cfg.model to locate callers and merge with per-variant keys such as num_layers,
seq_length, pipeline_dtype, etc.). Ensure the new helper/constant is
imported/defined once and replace the repeated blocks at the other occurrences
noted so differences remain explicit and minimal.

tests/functional_tests/training/test_pretrain.py (1)

73-108: Consider extracting a shared GPT model-config factory.

These two constructor blocks are mostly duplicated; a small helper would reduce config drift across tests when provider defaults/fields change.

Also applies to: 219-255

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/training/test_pretrain.py` around lines 73 - 108, The
two nearly identical GPTModelProvider constructor blocks (creating model_cfg)
should be extracted into a shared factory to avoid duplication and drift; add a
helper function (e.g., make_gpt_model_cfg or build_gpt_model_config) that
returns a configured GPTModelProvider with the common parameters shown
(normalization, activation_func, gated_linear_unit, position_embedding_type,
add_bias_linear, attention_dropout, hidden_dropout, bias_activation_fusion,
masked_softmax_fusion, persist_layer_norm, bias_dropout_fusion,
apply_rope_fusion, num_query_groups, init_method_std, layernorm_epsilon,
rotary_percent, rope_scaling, rope_scaling_factor,
share_embeddings_and_output_weights, rotary_base, hidden_size, ffn_hidden_size,
num_attention_heads, tensor_model_parallel_size, pipeline_model_parallel_size,
context_parallel_size, sequence_parallel, attention_softmax_in_fp32,
pipeline_dtype, bf16, seq_length, make_vocab_size_divisible_by, vocab_size,
num_layers) and replace the two inline constructor blocks (the model_cfg
assignments at the shown location and the similar block around lines 219–255)
with calls to that helper, allowing optional overrides for seq_length,
num_layers, vocab_size, etc.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/megatron/bridge/models/glm_vl/glm_45v_provider.py`:
- Line 56: The type hint for the variable init_method_std is incorrect: it is
declared as int but initialized with a float (0.02); update the annotation from
int to float for init_method_std in glm_45v_provider.py so the signature matches
the default value (leave the default 0.02 unchanged) — locate the
init_method_std declaration and change its type to float.

In `@src/megatron/bridge/recipes/nemotronh/nemotronh.py`:
- Line 455: Remove the misleading explicit attention_backend=AttnBackend.auto
argument from the MambaModelProvider constructor calls in the three 56B variants
(the places that currently pass attention_backend=AttnBackend.auto) because each
function later sets cfg.model.attention_backend = None; edit those
MambaModelProvider(...) invocations to omit the attention_backend parameter so
the intended None configuration (used by Mamba models) is the effective setting
and will not confuse maintainers.

---

Outside diff comments:
In `@src/megatron/bridge/recipes/kimi/kimi_k2.py`:
- Around line 83-86: The recipe fails at runtime because the KimiBridge
implementation is missing; create a new bridge class named KimiBridge that
follows the pattern of DeepSeekV2Bridge: annotate it with
`@MegatronModelBridge.register_bridge` and implement the required interface
methods, use MLAModelProvider (or the appropriate ModelProvider) to map HF
config fields (ensure vocab_size is exposed via CONFIG_MAPPING/GPTModelProvider
semantics) and register the bridge so
AutoBridge.from_hf_pretrained("moonshotai/Kimi-K2-Instruct") can dispatch
correctly; mirror DeepSeekV2Bridge’s constructors and provider wiring to ensure
cfg.model.vocab_size is populated.

---

Nitpick comments:
In `@src/megatron/bridge/recipes/moonlight/moonlight_16b.py`:
- Around line 413-468: The MLAModelProvider block in moonlight_16b_peft_config
duplicates many architecture params already used in moonlight_16b_sft_config;
extract the shared architecture into a module-level constant (e.g.,
_MOONLIGHT_16B_ARCH) containing the common keys (num_layers, hidden_size,
ffn_hidden_size, num_attention_heads, kv_channels, vocab_size, rotary_*,
layernorm_epsilon, init_method_std, activation_func, gated_linear_unit, etc.),
then update both moonlight_16b_sft_config and moonlight_16b_peft_config to call
MLAModelProvider(**_MOONLIGHT_16B_ARCH, <their specific overrides>) so only
parallelism and config-specific overrides remain in each function.
- Around line 199-255: The constructor for MLAModelProvider sets MoE-related
defaults (moe_grouped_gemm, moe_token_dispatcher_type, moe_permute_fusion) that
are later reassigned on cfg.model (e.g., cfg.model.moe_token_dispatcher_type,
cfg.model.moe_grouped_gemm, cfg.model.moe_permute_fusion); remove the redundant
assignments either from the MLAModelProvider(...) argument list or from the
later reassignment block so all MoE settings are defined in one place—locate
MLAModelProvider(...) in moonlight_16b.py and the subsequent lines where
cfg.model.moe_permute_fusion, cfg.model.moe_token_dispatcher_type, and
cfg.model.moe_grouped_gemm are set, then keep only the preferred location and
delete the duplicates.

In `@src/megatron/bridge/recipes/nemotronh/nemotron_3_nano.py`:
- Around line 211-268: Multiple recipes duplicate the same MambaModelProvider
architecture block; extract those shared parameters into a helper function
(e.g., _nemotron_3_nano_model_kwargs) that returns a dict of the common keys
(hybrid_override_pattern, num_layers, hidden_size, mamba_num_heads, kv_channels,
mamba_state_dim, ffn_hidden_size, seq_length, etc.), then update cfg.model
assignments in pretrain/SFT/PEFT to call
MambaModelProvider(**_nemotron_3_nano_model_kwargs(seq_length=...), <override or
additional parallelism args>) so only recipe-specific fields (like
tensor_model_parallel_size, pipeline_model_parallel_size, pipeline_dtype,
expert_model_parallel_size) remain inline.

In `@src/megatron/bridge/recipes/nemotronh/nemotron_nano_v2.py`:
- Around line 38-77: Multiple MambaModelProvider invocations (e.g., the
cfg.model = MambaModelProvider block) duplicate a large common parameter set
across variants; factor those shared kwargs into a reusable constant or builder
(for example a MAMBA_DEFAULTS dict or a build_mamba_kwargs(name, overrides)
helper) and then construct each provider via unpacking the shared defaults plus
only the variant-specific overrides (use the existing symbols like
MambaModelProvider and cfg.model to locate callers and merge with per-variant
keys such as num_layers, seq_length, pipeline_dtype, etc.). Ensure the new
helper/constant is imported/defined once and replace the repeated blocks at the
other occurrences noted so differences remain explicit and minimal.

In `@tests/functional_tests/data/test_samplers.py`:
- Around line 620-622: Extract the repeated _DummyBridge + GPTModelProvider
pattern into a reusable pytest fixture or module-level helper to avoid
duplication: create a fixture (e.g., mock_llama3_bridge) that defines class
_DummyBridge with method to_megatron_provider returning GPTModelProvider, and
use unittest.mock.patch to patch
"megatron.bridge.recipes.llama.llama3.AutoBridge.from_hf_pretrained" to return
that _DummyBridge, then update the five tests to accept the fixture (e.g., def
test_...(mock_llama3_bridge)) instead of repeating the inline class and patch.

In `@tests/functional_tests/training/test_decentralized_pg.py`:
- Around line 104-140: The duplicated GPTModelProvider block should be extracted
into a shared builder function (e.g., make_decentralized_model_cfg or
build_base_gpt_model) that returns a GPTModelProvider pre-populated with the
common args shown (normalization, activation_func, gated_linear_unit,
position_embedding_type, ... hidden_size, ffn_hidden_size, num_attention_heads,
init_method_std, layernorm_epsilon, rotary/rope settings, seq_length,
make_vocab_size_divisible_by, vocab_size, num_layers, etc.), and each test
should call that builder and apply per-test overrides for
tensor_model_parallel_size, pipeline_model_parallel_size, context_parallel_size,
sequence_parallel, and share_embeddings_and_output_weights; update all
referenced tests (the blocks at ~104-140, 255-290, 410-446, 566-602, 722-758,
878-914) to use the builder and only change the TP/PP/CP/share-embeddings fields
to reduce duplication and improve maintainability.

In `@tests/functional_tests/training/test_finetune_lora.py`:
- Around line 53-80: Duplicate Llama3ModelProvider145M definitions should be
consolidated into a single shared test utility: create a new test helper module
(e.g., tests/functional_tests/training/model_providers.py or utils.py), move the
class Llama3ModelProvider145M into that module, update usages in tests that
currently define the class (references in test_sft.py and
test_pretrain_resume.py and this file) to import Llama3ModelProvider145M from
the new module, and remove the duplicate class definitions so tests import the
single canonical provider.

In `@tests/functional_tests/training/test_pretrain_resume.py`:
- Around line 47-72: The Llama3ModelProvider145M test class duplicates a nearly
identical definition found in test_sft.py; refactor by extracting the shared
defaults into a single reusable test utility (e.g., a base class or factory
function) and have both Llama3ModelProvider145M and the counterpart in
test_sft.py inherit from or invoke that utility; update the unique overrides
(seq_length, num_layers) in each test-specific subclass or factory call and
remove the duplicated field definitions to centralize maintenance.

In `@tests/functional_tests/training/test_pretrain.py`:
- Around line 73-108: The two nearly identical GPTModelProvider constructor
blocks (creating model_cfg) should be extracted into a shared factory to avoid
duplication and drift; add a helper function (e.g., make_gpt_model_cfg or
build_gpt_model_config) that returns a configured GPTModelProvider with the
common parameters shown (normalization, activation_func, gated_linear_unit,
position_embedding_type, add_bias_linear, attention_dropout, hidden_dropout,
bias_activation_fusion, masked_softmax_fusion, persist_layer_norm,
bias_dropout_fusion, apply_rope_fusion, num_query_groups, init_method_std,
layernorm_epsilon, rotary_percent, rope_scaling, rope_scaling_factor,
share_embeddings_and_output_weights, rotary_base, hidden_size, ffn_hidden_size,
num_attention_heads, tensor_model_parallel_size, pipeline_model_parallel_size,
context_parallel_size, sequence_parallel, attention_softmax_in_fp32,
pipeline_dtype, bf16, seq_length, make_vocab_size_divisible_by, vocab_size,
num_layers) and replace the two inline constructor blocks (the model_cfg
assignments at the shown location and the similar block around lines 219–255)
with calls to that helper, allowing optional overrides for seq_length,
num_layers, vocab_size, etc.

In `@tests/unit_tests/recipes/nemotronh/test_nemotronh.py`:
- Around line 163-177: The parameterized test test_config_container_structure
currently includes a redundant provider_cls parameter because all tuples use
MambaModelProvider; simplify by parameterizing only the recipe functions
(nemotronh_4b_pretrain_config, nemotronh_8b_pretrain_config,
nemotronh_47b_pretrain_config, nemotronh_56b_pretrain_config) and replace the
second assertion to directly assert isinstance(config.model, MambaModelProvider)
while keeping the initial assert isinstance(config, ConfigContainer); update the
`@pytest.mark.parametrize` decorator and the test signature accordingly to remove
provider_cls.

In `@tests/unit_tests/recipes/test_moonlight_recipes.py`:
- Line 136: Create a reusable pytest fixture that performs the repeated
monkeypatching of MLAModelProvider to _FakeMoonlightModelProvider16B and replace
the duplicated monkeypatch.setattr(mod, "MLAModelProvider",
_FakeMoonlightModelProvider16B) lines in tests with that fixture; implement a
fixture (e.g., fake_mlamodel_provider or patch_mla_provider) that accepts
monkeypatch, calls monkeypatch.setattr(mod, "MLAModelProvider",
_FakeMoonlightModelProvider16B), mark it module- or session-scoped as
appropriate, and add the fixture name to the tests that previously had the
inline monkeypatch to eliminate duplication while preserving behavior.

In `@tests/unit_tests/training/test_config.py`:
- Around line 79-90: Rename the helper function create_test_deepseek_config to a
neutral name like create_test_mla_config since it now returns MLAModelProvider;
update the function definition and all call sites to the new name (search for
create_test_deepseek_config and replace with create_test_mla_config) and keep
the same parameters/defaults and return MLAModelProvider to preserve behavior.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 383b610 and 0bca980.

📒 Files selected for processing (61)

src/megatron/bridge/models/__init__.py
src/megatron/bridge/models/conversion/model_bridge.py
src/megatron/bridge/models/deepseek/__init__.py
src/megatron/bridge/models/deepseek/deepseek_provider.py
src/megatron/bridge/models/glm/__init__.py
src/megatron/bridge/models/glm/glm45_provider.py
src/megatron/bridge/models/glm_vl/glm_45v_provider.py
src/megatron/bridge/models/gpt_oss/__init__.py
src/megatron/bridge/models/gpt_oss/gpt_oss_provider.py
src/megatron/bridge/models/kimi/__init__.py
src/megatron/bridge/models/kimi/kimi_provider.py
src/megatron/bridge/models/llama/__init__.py
src/megatron/bridge/models/llama/llama_provider.py
src/megatron/bridge/models/llama_nemotron/__init__.py
src/megatron/bridge/models/llama_nemotron/llama_nemotron_bridge.py
src/megatron/bridge/models/llama_nemotron/llama_nemotron_provider.py
src/megatron/bridge/models/nemotron/__init__.py
src/megatron/bridge/models/nemotron/nemotron_provider.py
src/megatron/bridge/models/nemotron_vl/__init__.py
src/megatron/bridge/models/nemotron_vl/nemotron_vl_bridge.py
src/megatron/bridge/models/nemotron_vl/nemotron_vl_provider.py
src/megatron/bridge/models/nemotronh/__init__.py
src/megatron/bridge/models/nemotronh/nemotron_h_provider.py
src/megatron/bridge/recipes/kimi/kimi_k2.py
src/megatron/bridge/recipes/moonlight/moonlight_16b.py
src/megatron/bridge/recipes/nemotronh/nemotron_3_nano.py
src/megatron/bridge/recipes/nemotronh/nemotron_nano_v2.py
src/megatron/bridge/recipes/nemotronh/nemotronh.py
tests/functional_tests/data/test_samplers.py
tests/functional_tests/models/gpt_oss/test_gpt_oss_provider.py
tests/functional_tests/training/test_callbacks.py
tests/functional_tests/training/test_decentralized_pg.py
tests/functional_tests/training/test_finetune_dora.py
tests/functional_tests/training/test_finetune_lora.py
tests/functional_tests/training/test_inprocess_restart.py
tests/functional_tests/training/test_megatron_fsdp.py
tests/functional_tests/training/test_nvrx_straggler.py
tests/functional_tests/training/test_pretrain.py
tests/functional_tests/training/test_pretrain_resume.py
tests/functional_tests/training/test_sample_based_training.py
tests/functional_tests/training/test_sft.py
tests/functional_tests/training/test_tensor_inspect.py
tests/unit_tests/models/deepseek/test_deepseek_provider.py
tests/unit_tests/models/glm/test_glm45_provider.py
tests/unit_tests/models/glm_vl/test_glm_45v_provider.py
tests/unit_tests/models/gpt_oss/test_gpt_oss_provider.py
tests/unit_tests/models/kimi/test_kimi_provider.py
tests/unit_tests/models/llama/test_llama_provider.py
tests/unit_tests/models/llama_nemotron/test_llama_nemotron_bridge.py
tests/unit_tests/models/nemotron/test_nemotron_provider.py
tests/unit_tests/models/nemotronh/test_nemotron_h_provider.py
tests/unit_tests/models/test_models_imports.py
tests/unit_tests/recipes/kimi/test_kimi_k2.py
tests/unit_tests/recipes/nemotronh/test_nemotron_3_nano.py
tests/unit_tests/recipes/nemotronh/test_nemotron_nano_v2.py
tests/unit_tests/recipes/nemotronh/test_nemotronh.py
tests/unit_tests/recipes/test_glm45_recipes.py
tests/unit_tests/recipes/test_moonlight_recipes.py
tests/unit_tests/recipes/test_run_plugins.py
tests/unit_tests/training/test_config.py
tests/unit_tests/training/test_log_non_default_values.py

💤 Files with no reviewable changes (22)

src/megatron/bridge/models/glm/init.py
src/megatron/bridge/models/glm/glm45_provider.py
tests/unit_tests/models/deepseek/test_deepseek_provider.py
src/megatron/bridge/models/nemotron/init.py
tests/unit_tests/models/kimi/test_kimi_provider.py
tests/functional_tests/models/gpt_oss/test_gpt_oss_provider.py
src/megatron/bridge/models/kimi/kimi_provider.py
tests/unit_tests/models/glm/test_glm45_provider.py
src/megatron/bridge/models/gpt_oss/gpt_oss_provider.py
src/megatron/bridge/models/nemotron_vl/init.py
src/megatron/bridge/models/llama_nemotron/init.py
tests/unit_tests/models/llama/test_llama_provider.py
tests/unit_tests/models/nemotron/test_nemotron_provider.py
src/megatron/bridge/models/gpt_oss/init.py
tests/unit_tests/models/gpt_oss/test_gpt_oss_provider.py
src/megatron/bridge/models/llama/llama_provider.py
src/megatron/bridge/models/deepseek/deepseek_provider.py
tests/unit_tests/models/nemotronh/test_nemotron_h_provider.py
src/megatron/bridge/models/nemotronh/init.py
src/megatron/bridge/models/nemotron/nemotron_provider.py
src/megatron/bridge/models/nemotronh/nemotron_h_provider.py
tests/unit_tests/models/llama_nemotron/test_llama_nemotron_bridge.py

coderabbitai · 2026-02-28T19:08:11Z

+    add_bias_linear: bool = False
+    add_qkv_bias: bool = True
+    seq_length: int = 131072
+    init_method_std: int = 0.02


⚠️ Potential issue | 🟡 Minor

Type annotation mismatch: init_method_std should be float, not int.

The default value 0.02 is a float, but the type hint declares int.

🐛 Proposed fix

- init_method_std: int = 0.02 + init_method_std: float = 0.02

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

init_method_std: int = 0.02

init_method_std: float = 0.02

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/models/glm_vl/glm_45v_provider.py` at line 56, The type hint for the variable init_method_std is incorrect: it is declared as int but initialized with a float (0.02); update the annotation from int to float for init_method_std in glm_45v_provider.py so the signature matches the default value (leave the default 0.02 unchanged) — locate the init_method_std declaration and change its type to float.

coderabbitai · 2026-02-28T19:08:12Z

+        mamba_num_heads=256,
+        ffn_hidden_size=32768,
+        num_attention_heads=64,
+        attention_backend=AttnBackend.auto,


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Get context around the attention_backend assignments and overrides rg -n -B5 -A5 'attention_backend=AttnBackend\.auto|cfg\.model\.attention_backend = None' \ src/megatron/bridge/recipes/nemotronh/nemotronh.py | head -200

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 6074

🏁 Script executed:

#!/bin/bash # Get function signatures and full context around the assignments # First, find which functions contain these lines rg -n -B30 'attention_backend=AttnBackend\.auto' src/megatron/bridge/recipes/nemotronh/nemotronh.py | \ grep -E '(^[0-9]+-def |attention_backend=AttnBackend\.auto)' | head -20

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 366

🏁 Script executed:

#!/bin/bash # Check if MambaModelProvider has attention_backend parameter rg -A20 'class MambaModelProvider|def __init__' src/megatron/bridge/models/mimo/mamba_provider.py | head -80

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 160

🏁 Script executed:

#!/bin/bash # Find MambaModelProvider definition fd -t f '*.py' -x grep -l 'class MambaModelProvider' {}

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 354

🏁 Script executed:

#!/bin/bash # Search for attention_backend in MambaModelProvider rg -l 'MambaModelProvider' --type py | head -10

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 626

🏁 Script executed:

#!/bin/bash # Check MambaModelProvider's __init__ signature rg -A30 'class MambaModelProvider' src/megatron/bridge/models/mamba/mamba_provider.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 1319

🏁 Script executed:

#!/bin/bash # Check what parameters MambaModelProvider accepts rg 'attention_backend' src/megatron/bridge/models/mamba/mamba_provider.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 125

🏁 Script executed:

#!/bin/bash # Check if Mamba actually uses attention_backend or if it's just a placeholder # Look at the full context of why attention_backend is set to None rg -B10 -A5 'cfg.model.attention_backend = None' src/megatron/bridge/recipes/nemotronh/nemotronh.py | \ grep -A15 'Kernel selections'

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 5400

🏁 Script executed:

#!/bin/bash # Check if Mamba is truly an attention-free architecture rg -C5 'attention' src/megatron/bridge/models/mamba/mamba_provider.py | head -40

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 1547

🏁 Script executed:

#!/bin/bash # Check if there's any code between constructor call and the None assignment that might depend on attention_backend # Get full function bodies for the three functions rg -A100 'def nemotronh_56b_pretrain_config' src/megatron/bridge/recipes/nemotronh/nemotronh.py | \ head -120

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 3447

🏁 Script executed:

#!/bin/bash # Verify: does EVERY function that sets AttnBackend.auto also override with None? # Count occurrences of each pattern echo "=== Functions with AttnBackend.auto ===" rg -B30 'attention_backend=AttnBackend\.auto' src/megatron/bridge/recipes/nemotronh/nemotronh.py | \ grep 'def ' | wc -l echo "=== Functions with cfg.model.attention_backend = None ===" rg -B50 'cfg\.model\.attention_backend = None' src/megatron/bridge/recipes/nemotronh/nemotronh.py | \ grep 'def ' | wc -l echo "=== All function names that set AttnBackend.auto ===" rg -B30 'attention_backend=AttnBackend\.auto' src/megatron/bridge/recipes/nemotronh/nemotronh.py | \ grep 'def ' | awk '{print $NF}'

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 286

🏁 Script executed:

#!/bin/bash # Check: is there any code between AttnBackend.auto and the None override that uses attention_backend? rg -B1 -A70 'attention_backend=AttnBackend\.auto' src/megatron/bridge/recipes/nemotronh/nemotronh.py | \ grep -E '(attention_backend|^[0-9]+-.*cfg\.model\.attention_backend = None)' | head -20

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 319

Remove redundant attention_backend=AttnBackend.auto from MambaModelProvider constructors in all three 56B variants.

Lines 455, 982, and 1563 set attention_backend=AttnBackend.auto, but each function immediately overrides it with cfg.model.attention_backend = None (lines 524, 1032, and 1613). Since Mamba models do not use traditional attention, the explicit None assignment is the intended configuration; remove the constructor argument to avoid misleading maintainers about the actual backend choice.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/recipes/nemotronh/nemotronh.py` at line 455, Remove the misleading explicit attention_backend=AttnBackend.auto argument from the MambaModelProvider constructor calls in the three 56B variants (the places that currently pass attention_backend=AttnBackend.auto) because each function later sets cfg.model.attention_backend = None; edit those MambaModelProvider(...) invocations to omit the attention_backend parameter so the intended None configuration (used by Mamba models) is the effective setting and will not confuse maintainers.

yaoyu-33 · 2026-03-12T22:14:42Z

/ok to test c3afb89

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

- Remove NemotronNano12Bv2Provider from nemotron_vl/__init__.py (was a deprecated alias from deleted nemotron_h_provider.py) - Remove invalid max_position_embeddings kwarg from kimi and moonlight recipes (not a field on MLAModelProvider) - Update moonlight test to monkeypatch MLAModelProvider instead of deleted MoonlightModelProvider16B Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Made-with: Cursor

DeepSeekV2Bridge and DeepSeekV3Bridge were defined but never imported in the parent models/__init__.py, so their @register_bridge decorators never fired and DeepseekV3ForCausalLM was not a supported architecture. Signed-off-by: Yu Yao <yaoyu.094@gmail.com> Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Made-with: Cursor

Move yarn_* fields (yarn_rotary_scaling_factor, yarn_original_max_position_embeddings, yarn_beta_fast, yarn_beta_slow, yarn_mscale, yarn_mscale_all_dim, yarn_correction_range_round_to_int) from dynamic setattr injection to proper dataclass fields on GPTModelProvider. This enables models like llama, qwen, gpt-oss, and ministral3 to toggle YARN configs through the standard provider without needing model-specific provider subclasses. - Remove GPTOSSProvider import and re-wrapping hack in gpt_oss_bridge.py - Pass YARN params directly as provider kwargs in model_bridge.py instead of stashing in _yarn_params dict for later setattr - Fix megatron_to_hf_config to check value is not None instead of hasattr (since fields always exist now but default to None) Signed-off-by: Yu Yao <yaoyu.094@gmail.com> Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Made-with: Cursor

yarn_* fields are now proper dataclass fields on GPTModelProvider defaulting to None, so None means 'unset' and should be omitted from exported rope_scaling dict rather than preserved. Made-with: Cursor Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

…resets Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Made-with: Cursor

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Made-with: Cursor

…n NemotronNano12Bv2VLModelProvider After PR #2628 renamed the field from hybrid_override_pattern to hybrid_layer_pattern in MambaModelProvider, NemotronNano12Bv2VLModelProvider still used the old name as a class default. When finalize() was called, both the class default (hybrid_override_pattern) and the bridge CONFIG_MAPPING-set value (hybrid_layer_pattern) were non-None, triggering an assertion error. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

…rovider finalize() derives num_layers from hybrid_layer_pattern; having both a class-level default num_layers=62 and a user-supplied custom pattern caused finalize() to assert 62==N(pattern) and fail. Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

…d_layer_pattern) - Replace removed Llama3ModelProvider with GPTModelProvider in test_local_checkpointing.py (added from main after provider removal) - Rename deprecated hybrid_override_pattern to hybrid_layer_pattern in all nemotronh recipes (nemotronh.py, nemotron_nano_v2.py, nemotron_3_nano.py) to match PR #2628 field rename in MambaModelProvider Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Made-with: Cursor

…w fixes - Add num_layers=3 to MEGATRON_NEMOTRON_3_NANO_OVERRIDES so it matches the toy hybrid_layer_pattern="M*E" (3 layers vs recipe's 52) - Fix init_method_std type hint from int to float in glm_45v_provider.py - Remove redundant attention_backend=AttnBackend.auto from 56B nemotronh recipe constructors (each function overrides it to None later) Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Made-with: Cursor

Signed-off-by: ruit <ruit@nvidia.com>

The base provider_bridge only forced RoPE when rope_type was missing, so HF configs with rope_type="llama3" (or "default", "longrope", …) fell back to the provider default learned_absolute. Fix by setting position_embedding_type="rope" for every rope_type except "yarn" (which is already handled by hf_config_to_provider_kwargs). Also align the key lookup to check both "type" and "rope_type" keys consistently. Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Made-with: Cursor

…xport Add the missing KimiK2Bridge implementation for Kimi K2 model support and fix the kimi __init__.py to export KimiK2Bridge (matching the actual class name) instead of the non-existent KimiBridge. Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Made-with: Cursor

yaoyu-33 · 2026-03-12T22:30:50Z

/ok to test c7aa37a

The base class provider_bridge now sets position_embedding_type="rope" for all non-YARN models, but Mamba models don't use position embeddings. Override it back to "none" in NemotronHBridge. Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Made-with: Cursor

yaoyu-33 · 2026-03-13T20:50:12Z

/ok to test 827d325

The test_provider_bridge_mamba_config test was checking hybrid_layer_pattern without calling finalize(), but this field is only populated during finalize() from the deprecated hybrid_override_pattern. Signed-off-by: Yu Yao <yaoyu.094@gmail.com> Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Made-with: Cursor

yaoyu-33 · 2026-03-14T23:37:40Z

/ok to test f932780

yaoyu-33 · 2026-03-15T01:21:53Z

/ok to test c72bb59

copy-pr-bot Bot temporarily deployed to test February 28, 2026 19:00 Inactive

coderabbitai Bot reviewed Feb 28, 2026

View reviewed changes

copy-pr-bot Bot temporarily deployed to nemo-ci February 28, 2026 19:09 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci February 28, 2026 19:16 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci February 28, 2026 19:26 Inactive

copy-pr-bot Bot had a problem deploying to nemo-ci February 28, 2026 19:26 Failure

copy-pr-bot Bot temporarily deployed to nemo-ci February 28, 2026 19:26 Inactive

yaoyu-33 and others added 16 commits March 12, 2026 15:25

renmove deprecated providers

c468962

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

fix: reset mcore submodule, patch yarn bridge-side, remove GPT size p…

6bc725b

…resets Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Made-with: Cursor

fix: restore GPTProvider175B for GPT3-175B recipe and test

33dd451

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Made-with: Cursor

[ci] trigger: force CI re-run

01b02cd

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

ci: re-trigger after flaky SIGSEGV/NCCL failures

f600bdd

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

feat: Implement dynamic position embedding type (#2768)

56e283c

Signed-off-by: ruit <ruit@nvidia.com>

yaoyu-33 mentioned this pull request Mar 12, 2026

Add GLM5 support #2469

Closed

5 tasks

fix

c72bb59

cuichenx approved these changes Mar 16, 2026

View reviewed changes

This was referenced Mar 20, 2026

[model] fix: restore position_embedding_type and gradient_accumulation_fusion for MLA models #2926

Closed

[model] fix: restore position_embedding_type and gradient_accumulation_fusion for MLA models #2927

Merged

shanecmoran mentioned this pull request Mar 23, 2026

[model, training] fix: MoE checkpoint export with YaRN RoPE and flex dispatcher #2641

Closed

4 tasks

kevalmorabia97 mentioned this pull request May 1, 2026

Add previously removed NemotronHBridge SequentialMLP mappings #3628

Merged

5 tasks

Conversation

yaoyu-33 commented Feb 28, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Summary by CodeRabbit

Release Notes

Uh oh!

copy-pr-bot Bot commented Feb 28, 2026

Uh oh!

yaoyu-33 commented Feb 28, 2026

Uh oh!

coderabbitai Bot commented Feb 28, 2026

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 28, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Feb 28, 2026

Choose a reason for hiding this comment

Uh oh!

yaoyu-33 commented Mar 12, 2026

Uh oh!

yaoyu-33 commented Mar 12, 2026

Uh oh!

yaoyu-33 commented Mar 13, 2026

Uh oh!

yaoyu-33 commented Mar 14, 2026

Uh oh!

yaoyu-33 commented Mar 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

yaoyu-33 commented Feb 28, 2026 •

edited by coderabbitai Bot

Loading