Skip to content

[model, recipe, test] refactor: remove specific model provider dataclasses#2599

Merged
yaoyu-33 merged 19 commits into
mainfrom
yuya/remove-model-providers
Mar 16, 2026
Merged

[model, recipe, test] refactor: remove specific model provider dataclasses#2599
yaoyu-33 merged 19 commits into
mainfrom
yuya/remove-model-providers

Conversation

@yaoyu-33

@yaoyu-33 yaoyu-33 commented Feb 28, 2026

Copy link
Copy Markdown
Contributor

Summary

Remove size-specific model provider dataclasses that are no longer needed now that bridges dynamically configure model architectures via GPTModelProvider, MLAModelProvider, and MambaModelProvider.

Deleted provider files (7):

  • llama_provider.pyLlama{2,3,31,32,4}ModelProvider* (26 classes)
  • nemotron_provider.pyNemotron{3,4}ModelProvider* (8 classes)
  • gpt_oss_provider.pyGPTOSSModelProvider* (4 classes)
  • deepseek_provider.pyDeepSeek{V2,V3}ModelProvider* (4 classes)
  • kimi_provider.pyKimiK2Provider
  • glm45_provider.pyGLM45{Air,Plus}ModelProvider* (5 classes)
  • nemotron_h_provider.pyNemotronH{,Nano}ModelProvider* (6 classes)

Refactored providers with cross-dependencies (3):

  • llama_nemotron_provider.py — now inherits directly from GPTModelProvider
  • glm_45v_provider.py — now inherits directly from GPTModelProvider (inlined GLM MoE defaults)
  • nemotron_vl_provider.py — now inherits directly from MambaModelProvider (inlined NemotronH defaults)

Updated recipes to use base providers (GPTModelProvider, MLAModelProvider, MambaModelProvider) with inlined architecture parameters instead of importing deleted specific providers.

Cleaned up tests:

  • Deleted 7 unit test files that exclusively tested removed providers
  • Updated remaining unit and functional tests to use base provider classes

Net: -1,947 lines (~3,527 deleted, ~1,580 added for inlined parameters)

Test plan

  • Unit tests pass for updated recipe tests (kimi, moonlight, nemotronh, glm45, run_plugins)
  • Unit tests pass for updated model tests (llama_nemotron, glm_vl, models_imports)
  • Functional training tests pass with inlined GPTModelProvider parameters
  • No import errors from removed provider exports

Made with Cursor

Summary by CodeRabbit

Release Notes

  • Refactor

    • Consolidated model provider APIs: Specific model provider classes have been simplified and unified through Bridge-based interfaces for improved maintainability.
    • Updated model configuration architecture across supported models for consistency.
  • Chores

    • Removed deprecated model provider implementations and associated tests.

@copy-pr-bot

copy-pr-bot Bot commented Feb 28, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@yaoyu-33

Copy link
Copy Markdown
Contributor Author

/ok to test 0bca980

@coderabbitai

coderabbitai Bot commented Feb 28, 2026

Copy link
Copy Markdown
Contributor
📝 Walkthrough

Walkthrough

This pull request consolidates and refactors the model provider API by removing specialized provider classes across multiple model families (Llama, Llama Nemotron, Nemotron, NemotronH, DeepSeek, GLM, GPT-OSS, Kimi) and replacing them with unified providers (GPTModelProvider, MLAModelProvider, MambaModelProvider). Corresponding recipe and test updates reflect the new provider structure, with bridges becoming the primary public API surface.

Changes

Cohort / File(s) Summary
Package initialization and provider removals
src/megatron/bridge/models/__init__.py, src/megatron/bridge/models/deepseek/__init__.py, src/megatron/bridge/models/glm/__init__.py, src/megatron/bridge/models/gpt_oss/__init__.py, src/megatron/bridge/models/kimi/__init__.py, src/megatron/bridge/models/llama/__init__.py, src/megatron/bridge/models/nemotron/__init__.py, src/megatron/bridge/models/nemotron_vl/__init__.py, src/megatron/bridge/models/nemotronh/__init__.py, src/megatron/bridge/models/llama_nemotron/__init__.py
Removed numerous model-specific provider class exports from public APIs; consolidated to Bridge classes as the primary public interface (e.g., LlamaBridge, NemotronHBridge, DeepSeekV2Bridge). Updated all lists to reflect narrowed export surfaces.
Deleted provider implementation files
src/megatron/bridge/models/deepseek/deepseek_provider.py, src/megatron/bridge/models/gpt_oss/gpt_oss_provider.py, src/megatron/bridge/models/kimi/kimi_provider.py, src/megatron/bridge/models/llama/llama_provider.py, src/megatron/bridge/models/nemotron/nemotron_provider.py, src/megatron/bridge/models/nemotronh/nemotron_h_provider.py
Entirely removed model-specific provider dataclass definitions (e.g., LlamaModelProvider, Llama2ModelProvider7B, NemotronModelProvider, NemotronHModelProvider*, GLMMoEModelProvider, GPTOSSProvider*). Elimination of per-model configuration defaults and deprecation helpers.
Provider inheritance refactoring
src/megatron/bridge/models/glm_vl/glm_45v_provider.py, src/megatron/bridge/models/llama_nemotron/llama_nemotron_provider.py, src/megatron/bridge/models/llama_nemotron/llama_nemotron_bridge.py, src/megatron/bridge/models/nemotron_vl/nemotron_vl_provider.py, src/megatron/bridge/models/nemotron_vl/nemotron_vl_bridge.py
Updated base classes from removed model-specific providers (GLM45AirModelProvider106B, Llama31ModelProvider, NemotronNano12Bv2Provider) to unified providers (GPTModelProvider, MambaModelProvider). Added extensive public configuration fields to support flexible model instantiation without pre-baked defaults.
Bridge provider type updates
src/megatron/bridge/models/conversion/model_bridge.py
Updated provider_bridge method return type from model-specific providers (e.g., LlamaModelProvider) to unified providers (GPTModelProvider).
Recipe configuration migration
src/megatron/bridge/recipes/kimi/kimi_k2.py, src/megatron/bridge/recipes/moonlight/moonlight_16b.py, src/megatron/bridge/recipes/nemotronh/nemotron_3_nano.py, src/megatron/bridge/recipes/nemotronh/nemotron_nano_v2.py, src/megatron/bridge/recipes/nemotronh/nemotronh.py
Replaced model-specific provider class usage with AutoBridge or unified providers (MambaModelProvider, MLAModelProvider). Migrated architecture parameters from provider-class defaults to explicit configuration attributes passed to unified providers.
Unit test deletions
tests/unit_tests/models/deepseek/test_deepseek_provider.py, tests/unit_tests/models/glm/test_glm45_provider.py, tests/unit_tests/models/gpt_oss/test_gpt_oss_provider.py, tests/unit_tests/models/kimi/test_kimi_provider.py, tests/unit_tests/models/llama/test_llama_provider.py, tests/unit_tests/models/nemotron/test_nemotron_provider.py, tests/unit_tests/models/nemotronh/test_nemotron_h_provider.py, tests/unit_tests/models/llama_nemotron/test_llama_nemotron_bridge.py
Removed test suites that validated defaults and configurations of deleted provider classes. Tests no longer needed as provider classes no longer exist.
Functional test updates
tests/functional_tests/data/test_samplers.py, tests/functional_tests/models/gpt_oss/test_gpt_oss_provider.py, tests/functional_tests/training/test_callbacks.py, tests/functional_tests/training/test_decentralized_pg.py, tests/functional_tests/training/test_finetune_dora.py, tests/functional_tests/training/test_finetune_lora.py, tests/functional_tests/training/test_inprocess_restart.py, tests/functional_tests/training/test_megatron_fsdp.py, tests/functional_tests/training/test_nvrx_straggler.py, tests/functional_tests/training/test_pretrain.py, tests/functional_tests/training/test_pretrain_resume.py, tests/functional_tests/training/test_sample_based_training.py, tests/functional_tests/training/test_sft.py, tests/functional_tests/training/test_tensor_inspect.py
Replaced test model provider configurations from model-specific classes (Llama32ModelProvider1B, Llama3ModelProvider8B) to unified providers (GPTModelProvider). Extended configuration parameters to support explicit specification of model architecture details previously embedded in provider defaults.
Unit test recipe/integration updates
tests/unit_tests/models/glm_vl/test_glm_45v_provider.py, tests/unit_tests/recipes/kimi/test_kimi_k2.py, tests/unit_tests/recipes/nemotronh/test_nemotron_3_nano.py, tests/unit_tests/recipes/nemotronh/test_nemotron_nano_v2.py, tests/unit_tests/recipes/nemotronh/test_nemotronh.py, tests/unit_tests/recipes/test_glm45_recipes.py, tests/unit_tests/recipes/test_moonlight_recipes.py, tests/unit_tests/recipes/test_run_plugins.py, tests/unit_tests/training/test_config.py, tests/unit_tests/training/test_log_non_default_values.py, tests/unit_tests/models/test_models_imports.py
Updated recipe tests to expect unified providers (MambaModelProvider) instead of removed model-specific providers. Added AutoBridge mocking fixtures. Updated import assertions and isinstance checks to reflect new provider structure.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

Suggested labels

enhancement

Suggested reviewers

  • ko3n1g
  • maanug-nv
🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Test Results For Major Changes ⚠️ Warning PR removes 7 provider modules and 1,408+ lines of tests with only a test plan, no actual CI logs or pass/fail evidence, plus confirmed code issues including type annotation error and redundant code. Document CI/test logs, resolve identified code review issues, verify imports work for deleted providers, and include regression testing results before merging this large refactoring.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title '[model, recipe, test] refactor: remove specific model provider dataclasses' is clear, specific, and directly describes the main change—removing redundant size-specific model provider classes now replaced by bridges that dynamically configure base providers.
Docstring Coverage ✅ Passed Docstring coverage is 81.25% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch yuya/remove-model-providers

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/megatron/bridge/recipes/kimi/kimi_k2.py (1)

83-86: ⚠️ Potential issue | 🔴 Critical

KimiBridge implementation is missing – the recipe will fail at runtime.

The recipe imports KimiBridge from src/megatron/bridge/models/kimi/kimi_bridge.py (via kimi/__init__.py), but this file does not exist in the codebase. When AutoBridge.from_hf_pretrained("moonshotai/Kimi-K2-Instruct") is called, the dispatch system will fail to find a registered bridge implementation for the Kimi model architecture, causing a runtime error before cfg.model.vocab_size can even be accessed.

The bridge must be implemented similar to DeepSeekV2Bridge (which uses MLAModelProvider and @MegatronModelBridge.register_bridge) to register with the dispatch system. Once implemented, vocab_size will be available from the HF config mapping (GPTModelProvider has vocab_size: Optional[int] = None and model_bridge.py maps it via CONFIG_MAPPING).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/recipes/kimi/kimi_k2.py` around lines 83 - 86, The recipe
fails at runtime because the KimiBridge implementation is missing; create a new
bridge class named KimiBridge that follows the pattern of DeepSeekV2Bridge:
annotate it with `@MegatronModelBridge.register_bridge` and implement the required
interface methods, use MLAModelProvider (or the appropriate ModelProvider) to
map HF config fields (ensure vocab_size is exposed via
CONFIG_MAPPING/GPTModelProvider semantics) and register the bridge so
AutoBridge.from_hf_pretrained("moonshotai/Kimi-K2-Instruct") can dispatch
correctly; mirror DeepSeekV2Bridge’s constructors and provider wiring to ensure
cfg.model.vocab_size is populated.
🧹 Nitpick comments (12)
tests/functional_tests/data/test_samplers.py (1)

620-622: LGTM!

Migration is correct.

Optional: The _DummyBridge class with GPTModelProvider is duplicated across 5 test methods. Consider extracting to a module-level helper or pytest fixture if this pattern expands further.

,

♻️ Optional refactor using pytest fixture
# At module level or in conftest.py
`@pytest.fixture`
def mock_llama3_bridge():
    """Mock AutoBridge to avoid HF downloads, returning GPTModelProvider."""
    from unittest import mock as _mock

    class _DummyBridge:
        def to_megatron_provider(self, load_weights=False):
            from megatron.bridge.models.gpt_provider import GPTModelProvider
            return GPTModelProvider()

    with _mock.patch(
        "megatron.bridge.recipes.llama.llama3.AutoBridge.from_hf_pretrained",
        return_value=_DummyBridge()
    ):
        yield

Then tests can simply use def test_...(self, mock_llama3_bridge): without the repeated boilerplate.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/data/test_samplers.py` around lines 620 - 622, Extract
the repeated _DummyBridge + GPTModelProvider pattern into a reusable pytest
fixture or module-level helper to avoid duplication: create a fixture (e.g.,
mock_llama3_bridge) that defines class _DummyBridge with method
to_megatron_provider returning GPTModelProvider, and use unittest.mock.patch to
patch "megatron.bridge.recipes.llama.llama3.AutoBridge.from_hf_pretrained" to
return that _DummyBridge, then update the five tests to accept the fixture
(e.g., def test_...(mock_llama3_bridge)) instead of repeating the inline class
and patch.
tests/unit_tests/recipes/test_moonlight_recipes.py (1)

136-136: Consolidate repeated MLAModelProvider monkeypatching into a pytest fixture.

The identical monkeypatch.setattr(mod, "MLAModelProvider", _FakeMoonlightModelProvider16B) setup is repeated in many tests. Please move this into a shared fixture (module-level or reusable helper fixture) to reduce duplication and keep setup consistent.

As per coding guidelines, "tests/**/*.py: Use pytest fixtures for common setup in unit tests".

Also applies to: 171-171, 196-196, 222-222, 238-238, 261-261, 284-284, 307-307, 329-329

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit_tests/recipes/test_moonlight_recipes.py` at line 136, Create a
reusable pytest fixture that performs the repeated monkeypatching of
MLAModelProvider to _FakeMoonlightModelProvider16B and replace the duplicated
monkeypatch.setattr(mod, "MLAModelProvider", _FakeMoonlightModelProvider16B)
lines in tests with that fixture; implement a fixture (e.g.,
fake_mlamodel_provider or patch_mla_provider) that accepts monkeypatch, calls
monkeypatch.setattr(mod, "MLAModelProvider", _FakeMoonlightModelProvider16B),
mark it module- or session-scoped as appropriate, and add the fixture name to
the tests that previously had the inline monkeypatch to eliminate duplication
while preserving behavior.
src/megatron/bridge/recipes/moonlight/moonlight_16b.py (2)

413-468: Optional: Extract shared architecture parameters to reduce duplication.

The MLAModelProvider instantiation in moonlight_16b_peft_config duplicates ~40 architecture parameters that are identical to moonlight_16b_sft_config (lines 200-241 vs 414-454). Only the parallelism settings differ.

This is acceptable for clarity and explicit configuration, but if maintainability becomes a concern, consider extracting the shared architecture parameters into a module-level constant.

♻️ Example: Extract shared architecture parameters
# Module-level constant for shared architecture
_MOONLIGHT_16B_ARCH = dict(
    num_layers=27,
    hidden_size=2048,
    ffn_hidden_size=11264,
    # ... remaining shared params
)

def moonlight_16b_sft_config() -> ConfigContainer:
    cfg.model = MLAModelProvider(
        **_MOONLIGHT_16B_ARCH,
        # SFT-specific parallelism
        tensor_model_parallel_size=2,
        expert_model_parallel_size=8,
        sequence_parallel=True,
        # ...
    )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/recipes/moonlight/moonlight_16b.py` around lines 413 -
468, The MLAModelProvider block in moonlight_16b_peft_config duplicates many
architecture params already used in moonlight_16b_sft_config; extract the shared
architecture into a module-level constant (e.g., _MOONLIGHT_16B_ARCH) containing
the common keys (num_layers, hidden_size, ffn_hidden_size, num_attention_heads,
kv_channels, vocab_size, rotary_*, layernorm_epsilon, init_method_std,
activation_func, gated_linear_unit, etc.), then update both
moonlight_16b_sft_config and moonlight_16b_peft_config to call
MLAModelProvider(**_MOONLIGHT_16B_ARCH, <their specific overrides>) so only
parallelism and config-specific overrides remain in each function.

199-255: Consider removing redundant parameter assignments.

The MLAModelProvider constructor sets moe_token_dispatcher_type="alltoall" (line 241) and moe_grouped_gemm=True (line 240), but these are reassigned later in the function:

  • cfg.model.moe_token_dispatcher_type at lines 284
  • cfg.model.moe_grouped_gemm at line 308
  • cfg.model.moe_permute_fusion at lines 274 and 307

While functionally correct, the duplication could lead to drift if only one location is updated.

♻️ Suggested cleanup: remove redundant assignments from constructor or later code

Either remove from the constructor (if they're always overridden by the DeePEP logic):

         moe_grouped_gemm=True,
-        moe_token_dispatcher_type="alltoall",
         # Parallelism

Or consolidate all MoE settings in one place and remove the later redundant assignments.

Note: The Ruff S106 warnings about moe_token_dispatcher_type are false positives—this is a Mixture-of-Experts configuration parameter, not a password.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/recipes/moonlight/moonlight_16b.py` around lines 199 -
255, The constructor for MLAModelProvider sets MoE-related defaults
(moe_grouped_gemm, moe_token_dispatcher_type, moe_permute_fusion) that are later
reassigned on cfg.model (e.g., cfg.model.moe_token_dispatcher_type,
cfg.model.moe_grouped_gemm, cfg.model.moe_permute_fusion); remove the redundant
assignments either from the MLAModelProvider(...) argument list or from the
later reassignment block so all MoE settings are defined in one place—locate
MLAModelProvider(...) in moonlight_16b.py and the subsequent lines where
cfg.model.moe_permute_fusion, cfg.model.moe_token_dispatcher_type, and
cfg.model.moe_grouped_gemm are set, then keep only the preferred location and
delete the duplicates.
tests/unit_tests/training/test_config.py (1)

79-90: Consider renaming the helper to reflect the new abstraction.

create_test_deepseek_config now returns MLAModelProvider. A neutral helper name (e.g., create_test_mla_config) would reduce ambiguity as MLA coverage expands.

♻️ Optional clarity refactor
-def create_test_deepseek_config(**kwargs: Any) -> MLAModelProvider:
-    """Creates an instance of MLAModelProvider for testing."""
+def create_test_mla_config(**kwargs: Any) -> MLAModelProvider:
+    """Creates an instance of MLAModelProvider for testing."""
     defaults = {
         "num_layers": 1,
         "hidden_size": 128,
         "num_attention_heads": 4,
         "seq_length": 512,
         "apply_rope_fusion": False,
     }
     defaults.update(kwargs)
     return MLAModelProvider(**defaults)
+
+# Backward-compatible alias for existing tests.
+create_test_deepseek_config = create_test_mla_config
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit_tests/training/test_config.py` around lines 79 - 90, Rename the
helper function create_test_deepseek_config to a neutral name like
create_test_mla_config since it now returns MLAModelProvider; update the
function definition and all call sites to the new name (search for
create_test_deepseek_config and replace with create_test_mla_config) and keep
the same parameters/defaults and return MLAModelProvider to preserve behavior.
tests/unit_tests/recipes/nemotronh/test_nemotronh.py (1)

163-177: Consider simplifying the parameterized test.

Since all variants now use MambaModelProvider, the provider_cls parameter is redundant—every tuple maps to the same class. You could simplify by removing the parameter and using MambaModelProvider directly in the assertion.

That said, keeping the current structure is fine if you prefer consistency with other parameterized tests or want flexibility for future divergence.

♻️ Optional simplification
     `@pytest.mark.parametrize`(
-        "recipe_fn,provider_cls",
+        "recipe_fn",
         [
-            (nemotronh_4b_pretrain_config, MambaModelProvider),
-            (nemotronh_8b_pretrain_config, MambaModelProvider),
-            (nemotronh_47b_pretrain_config, MambaModelProvider),
-            (nemotronh_56b_pretrain_config, MambaModelProvider),
+            nemotronh_4b_pretrain_config,
+            nemotronh_8b_pretrain_config,
+            nemotronh_47b_pretrain_config,
+            nemotronh_56b_pretrain_config,
         ],
     )
-    def test_config_container_structure(self, recipe_fn, provider_cls):
+    def test_config_container_structure(self, recipe_fn):
         """Test that all configs return proper ConfigContainer with correct model provider."""
         config = recipe_fn()
 
         assert isinstance(config, ConfigContainer)
-        assert isinstance(config.model, provider_cls)
+        assert isinstance(config.model, MambaModelProvider)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit_tests/recipes/nemotronh/test_nemotronh.py` around lines 163 - 177,
The parameterized test test_config_container_structure currently includes a
redundant provider_cls parameter because all tuples use MambaModelProvider;
simplify by parameterizing only the recipe functions
(nemotronh_4b_pretrain_config, nemotronh_8b_pretrain_config,
nemotronh_47b_pretrain_config, nemotronh_56b_pretrain_config) and replace the
second assertion to directly assert isinstance(config.model, MambaModelProvider)
while keeping the initial assert isinstance(config, ConfigContainer); update the
`@pytest.mark.parametrize` decorator and the test signature accordingly to remove
provider_cls.
src/megatron/bridge/recipes/nemotronh/nemotron_3_nano.py (1)

211-268: Configuration duplication across pretrain/SFT/PEFT is notable but acceptable.

The SFT and PEFT configs duplicate the MambaModelProvider instantiation with ~40 identical architecture parameters. While this could be extracted into a helper function, this pattern appears consistent with other recipes in the codebase.

♻️ Optional: Extract common architecture parameters
def _nemotron_3_nano_model_kwargs(seq_length: int = 8192) -> dict:
    """Common Nemotron 3 Nano architecture parameters."""
    return dict(
        hybrid_override_pattern="MEMEM*EMEMEM*EMEMEM*EMEMEM*EMEMEM*EMEMEMEM*EMEMEMEME",
        num_layers=52,
        hidden_size=2688,
        mamba_num_heads=64,
        kv_channels=128,
        mamba_state_dim=128,
        # ... remaining common params
        seq_length=seq_length,
    )

# Usage in configs:
cfg.model = MambaModelProvider(
    **_nemotron_3_nano_model_kwargs(seq_length=8192),
    tensor_model_parallel_size=4,
    # ... parallelism params
)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/recipes/nemotronh/nemotron_3_nano.py` around lines 211 -
268, Multiple recipes duplicate the same MambaModelProvider architecture block;
extract those shared parameters into a helper function (e.g.,
_nemotron_3_nano_model_kwargs) that returns a dict of the common keys
(hybrid_override_pattern, num_layers, hidden_size, mamba_num_heads, kv_channels,
mamba_state_dim, ffn_hidden_size, seq_length, etc.), then update cfg.model
assignments in pretrain/SFT/PEFT to call
MambaModelProvider(**_nemotron_3_nano_model_kwargs(seq_length=...), <override or
additional parallelism args>) so only recipe-specific fields (like
tensor_model_parallel_size, pipeline_model_parallel_size, pipeline_dtype,
expert_model_parallel_size) remain inline.
tests/functional_tests/training/test_pretrain_resume.py (1)

47-72: LGTM with optional consolidation opportunity.

The Llama3ModelProvider145M class correctly inherits from GPTModelProvider with Llama-3-style defaults. Note that this class definition is nearly identical to the one in test_sft.py (with minor differences in seq_length and num_layers defaults). Consider extracting a shared test utility class in a future refactor to reduce duplication across functional tests.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/training/test_pretrain_resume.py` around lines 47 -
72, The Llama3ModelProvider145M test class duplicates a nearly identical
definition found in test_sft.py; refactor by extracting the shared defaults into
a single reusable test utility (e.g., a base class or factory function) and have
both Llama3ModelProvider145M and the counterpart in test_sft.py inherit from or
invoke that utility; update the unique overrides (seq_length, num_layers) in
each test-specific subclass or factory call and remove the duplicated field
definitions to centralize maintenance.
tests/functional_tests/training/test_finetune_lora.py (1)

53-80: LGTM!

The Llama3ModelProvider145M class correctly inherits from GPTModelProvider with Llama-3-style defaults. The explicit vocab_size=128256 is appropriate for LoRA finetuning tests that use HuggingFace tokenizers.

This is the third definition of Llama3ModelProvider145M across functional test files (also in test_sft.py and test_pretrain_resume.py). Consider consolidating these into a shared test utility module (e.g., tests/functional_tests/utils.py or a dedicated tests/functional_tests/training/model_providers.py) in a future refactor to reduce maintenance burden.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/training/test_finetune_lora.py` around lines 53 - 80,
Duplicate Llama3ModelProvider145M definitions should be consolidated into a
single shared test utility: create a new test helper module (e.g.,
tests/functional_tests/training/model_providers.py or utils.py), move the class
Llama3ModelProvider145M into that module, update usages in tests that currently
define the class (references in test_sft.py and test_pretrain_resume.py and this
file) to import Llama3ModelProvider145M from the new module, and remove the
duplicate class definitions so tests import the single canonical provider.
tests/functional_tests/training/test_decentralized_pg.py (1)

104-140: High duplication in model config setup across decentralized-PG tests.

A shared builder (with per-test overrides for TP/PP/CP/share-embeddings) would make these tests easier to maintain and less brittle.

Also applies to: 255-290, 410-446, 566-602, 722-758, 878-914

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/training/test_decentralized_pg.py` around lines 104 -
140, The duplicated GPTModelProvider block should be extracted into a shared
builder function (e.g., make_decentralized_model_cfg or build_base_gpt_model)
that returns a GPTModelProvider pre-populated with the common args shown
(normalization, activation_func, gated_linear_unit, position_embedding_type, ...
hidden_size, ffn_hidden_size, num_attention_heads, init_method_std,
layernorm_epsilon, rotary/rope settings, seq_length,
make_vocab_size_divisible_by, vocab_size, num_layers, etc.), and each test
should call that builder and apply per-test overrides for
tensor_model_parallel_size, pipeline_model_parallel_size, context_parallel_size,
sequence_parallel, and share_embeddings_and_output_weights; update all
referenced tests (the blocks at ~104-140, 255-290, 410-446, 566-602, 722-758,
878-914) to use the builder and only change the TP/PP/CP/share-embeddings fields
to reduce duplication and improve maintainability.
src/megatron/bridge/recipes/nemotronh/nemotron_nano_v2.py (1)

38-77: Consider factoring shared MambaModelProvider kwargs into reusable builders/constants.

These six blocks duplicate a large common parameter surface; extracting shared defaults would reduce maintenance overhead and accidental divergence across pretrain/SFT/PEFT variants.

Also applies to: 173-212, 310-349, 436-475, 571-610, 718-757

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/recipes/nemotronh/nemotron_nano_v2.py` around lines 38 -
77, Multiple MambaModelProvider invocations (e.g., the cfg.model =
MambaModelProvider block) duplicate a large common parameter set across
variants; factor those shared kwargs into a reusable constant or builder (for
example a MAMBA_DEFAULTS dict or a build_mamba_kwargs(name, overrides) helper)
and then construct each provider via unpacking the shared defaults plus only the
variant-specific overrides (use the existing symbols like MambaModelProvider and
cfg.model to locate callers and merge with per-variant keys such as num_layers,
seq_length, pipeline_dtype, etc.). Ensure the new helper/constant is
imported/defined once and replace the repeated blocks at the other occurrences
noted so differences remain explicit and minimal.
tests/functional_tests/training/test_pretrain.py (1)

73-108: Consider extracting a shared GPT model-config factory.

These two constructor blocks are mostly duplicated; a small helper would reduce config drift across tests when provider defaults/fields change.

Also applies to: 219-255

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/functional_tests/training/test_pretrain.py` around lines 73 - 108, The
two nearly identical GPTModelProvider constructor blocks (creating model_cfg)
should be extracted into a shared factory to avoid duplication and drift; add a
helper function (e.g., make_gpt_model_cfg or build_gpt_model_config) that
returns a configured GPTModelProvider with the common parameters shown
(normalization, activation_func, gated_linear_unit, position_embedding_type,
add_bias_linear, attention_dropout, hidden_dropout, bias_activation_fusion,
masked_softmax_fusion, persist_layer_norm, bias_dropout_fusion,
apply_rope_fusion, num_query_groups, init_method_std, layernorm_epsilon,
rotary_percent, rope_scaling, rope_scaling_factor,
share_embeddings_and_output_weights, rotary_base, hidden_size, ffn_hidden_size,
num_attention_heads, tensor_model_parallel_size, pipeline_model_parallel_size,
context_parallel_size, sequence_parallel, attention_softmax_in_fp32,
pipeline_dtype, bf16, seq_length, make_vocab_size_divisible_by, vocab_size,
num_layers) and replace the two inline constructor blocks (the model_cfg
assignments at the shown location and the similar block around lines 219–255)
with calls to that helper, allowing optional overrides for seq_length,
num_layers, vocab_size, etc.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/megatron/bridge/models/glm_vl/glm_45v_provider.py`:
- Line 56: The type hint for the variable init_method_std is incorrect: it is
declared as int but initialized with a float (0.02); update the annotation from
int to float for init_method_std in glm_45v_provider.py so the signature matches
the default value (leave the default 0.02 unchanged) — locate the
init_method_std declaration and change its type to float.

In `@src/megatron/bridge/recipes/nemotronh/nemotronh.py`:
- Line 455: Remove the misleading explicit attention_backend=AttnBackend.auto
argument from the MambaModelProvider constructor calls in the three 56B variants
(the places that currently pass attention_backend=AttnBackend.auto) because each
function later sets cfg.model.attention_backend = None; edit those
MambaModelProvider(...) invocations to omit the attention_backend parameter so
the intended None configuration (used by Mamba models) is the effective setting
and will not confuse maintainers.

---

Outside diff comments:
In `@src/megatron/bridge/recipes/kimi/kimi_k2.py`:
- Around line 83-86: The recipe fails at runtime because the KimiBridge
implementation is missing; create a new bridge class named KimiBridge that
follows the pattern of DeepSeekV2Bridge: annotate it with
`@MegatronModelBridge.register_bridge` and implement the required interface
methods, use MLAModelProvider (or the appropriate ModelProvider) to map HF
config fields (ensure vocab_size is exposed via CONFIG_MAPPING/GPTModelProvider
semantics) and register the bridge so
AutoBridge.from_hf_pretrained("moonshotai/Kimi-K2-Instruct") can dispatch
correctly; mirror DeepSeekV2Bridge’s constructors and provider wiring to ensure
cfg.model.vocab_size is populated.

---

Nitpick comments:
In `@src/megatron/bridge/recipes/moonlight/moonlight_16b.py`:
- Around line 413-468: The MLAModelProvider block in moonlight_16b_peft_config
duplicates many architecture params already used in moonlight_16b_sft_config;
extract the shared architecture into a module-level constant (e.g.,
_MOONLIGHT_16B_ARCH) containing the common keys (num_layers, hidden_size,
ffn_hidden_size, num_attention_heads, kv_channels, vocab_size, rotary_*,
layernorm_epsilon, init_method_std, activation_func, gated_linear_unit, etc.),
then update both moonlight_16b_sft_config and moonlight_16b_peft_config to call
MLAModelProvider(**_MOONLIGHT_16B_ARCH, <their specific overrides>) so only
parallelism and config-specific overrides remain in each function.
- Around line 199-255: The constructor for MLAModelProvider sets MoE-related
defaults (moe_grouped_gemm, moe_token_dispatcher_type, moe_permute_fusion) that
are later reassigned on cfg.model (e.g., cfg.model.moe_token_dispatcher_type,
cfg.model.moe_grouped_gemm, cfg.model.moe_permute_fusion); remove the redundant
assignments either from the MLAModelProvider(...) argument list or from the
later reassignment block so all MoE settings are defined in one place—locate
MLAModelProvider(...) in moonlight_16b.py and the subsequent lines where
cfg.model.moe_permute_fusion, cfg.model.moe_token_dispatcher_type, and
cfg.model.moe_grouped_gemm are set, then keep only the preferred location and
delete the duplicates.

In `@src/megatron/bridge/recipes/nemotronh/nemotron_3_nano.py`:
- Around line 211-268: Multiple recipes duplicate the same MambaModelProvider
architecture block; extract those shared parameters into a helper function
(e.g., _nemotron_3_nano_model_kwargs) that returns a dict of the common keys
(hybrid_override_pattern, num_layers, hidden_size, mamba_num_heads, kv_channels,
mamba_state_dim, ffn_hidden_size, seq_length, etc.), then update cfg.model
assignments in pretrain/SFT/PEFT to call
MambaModelProvider(**_nemotron_3_nano_model_kwargs(seq_length=...), <override or
additional parallelism args>) so only recipe-specific fields (like
tensor_model_parallel_size, pipeline_model_parallel_size, pipeline_dtype,
expert_model_parallel_size) remain inline.

In `@src/megatron/bridge/recipes/nemotronh/nemotron_nano_v2.py`:
- Around line 38-77: Multiple MambaModelProvider invocations (e.g., the
cfg.model = MambaModelProvider block) duplicate a large common parameter set
across variants; factor those shared kwargs into a reusable constant or builder
(for example a MAMBA_DEFAULTS dict or a build_mamba_kwargs(name, overrides)
helper) and then construct each provider via unpacking the shared defaults plus
only the variant-specific overrides (use the existing symbols like
MambaModelProvider and cfg.model to locate callers and merge with per-variant
keys such as num_layers, seq_length, pipeline_dtype, etc.). Ensure the new
helper/constant is imported/defined once and replace the repeated blocks at the
other occurrences noted so differences remain explicit and minimal.

In `@tests/functional_tests/data/test_samplers.py`:
- Around line 620-622: Extract the repeated _DummyBridge + GPTModelProvider
pattern into a reusable pytest fixture or module-level helper to avoid
duplication: create a fixture (e.g., mock_llama3_bridge) that defines class
_DummyBridge with method to_megatron_provider returning GPTModelProvider, and
use unittest.mock.patch to patch
"megatron.bridge.recipes.llama.llama3.AutoBridge.from_hf_pretrained" to return
that _DummyBridge, then update the five tests to accept the fixture (e.g., def
test_...(mock_llama3_bridge)) instead of repeating the inline class and patch.

In `@tests/functional_tests/training/test_decentralized_pg.py`:
- Around line 104-140: The duplicated GPTModelProvider block should be extracted
into a shared builder function (e.g., make_decentralized_model_cfg or
build_base_gpt_model) that returns a GPTModelProvider pre-populated with the
common args shown (normalization, activation_func, gated_linear_unit,
position_embedding_type, ... hidden_size, ffn_hidden_size, num_attention_heads,
init_method_std, layernorm_epsilon, rotary/rope settings, seq_length,
make_vocab_size_divisible_by, vocab_size, num_layers, etc.), and each test
should call that builder and apply per-test overrides for
tensor_model_parallel_size, pipeline_model_parallel_size, context_parallel_size,
sequence_parallel, and share_embeddings_and_output_weights; update all
referenced tests (the blocks at ~104-140, 255-290, 410-446, 566-602, 722-758,
878-914) to use the builder and only change the TP/PP/CP/share-embeddings fields
to reduce duplication and improve maintainability.

In `@tests/functional_tests/training/test_finetune_lora.py`:
- Around line 53-80: Duplicate Llama3ModelProvider145M definitions should be
consolidated into a single shared test utility: create a new test helper module
(e.g., tests/functional_tests/training/model_providers.py or utils.py), move the
class Llama3ModelProvider145M into that module, update usages in tests that
currently define the class (references in test_sft.py and
test_pretrain_resume.py and this file) to import Llama3ModelProvider145M from
the new module, and remove the duplicate class definitions so tests import the
single canonical provider.

In `@tests/functional_tests/training/test_pretrain_resume.py`:
- Around line 47-72: The Llama3ModelProvider145M test class duplicates a nearly
identical definition found in test_sft.py; refactor by extracting the shared
defaults into a single reusable test utility (e.g., a base class or factory
function) and have both Llama3ModelProvider145M and the counterpart in
test_sft.py inherit from or invoke that utility; update the unique overrides
(seq_length, num_layers) in each test-specific subclass or factory call and
remove the duplicated field definitions to centralize maintenance.

In `@tests/functional_tests/training/test_pretrain.py`:
- Around line 73-108: The two nearly identical GPTModelProvider constructor
blocks (creating model_cfg) should be extracted into a shared factory to avoid
duplication and drift; add a helper function (e.g., make_gpt_model_cfg or
build_gpt_model_config) that returns a configured GPTModelProvider with the
common parameters shown (normalization, activation_func, gated_linear_unit,
position_embedding_type, add_bias_linear, attention_dropout, hidden_dropout,
bias_activation_fusion, masked_softmax_fusion, persist_layer_norm,
bias_dropout_fusion, apply_rope_fusion, num_query_groups, init_method_std,
layernorm_epsilon, rotary_percent, rope_scaling, rope_scaling_factor,
share_embeddings_and_output_weights, rotary_base, hidden_size, ffn_hidden_size,
num_attention_heads, tensor_model_parallel_size, pipeline_model_parallel_size,
context_parallel_size, sequence_parallel, attention_softmax_in_fp32,
pipeline_dtype, bf16, seq_length, make_vocab_size_divisible_by, vocab_size,
num_layers) and replace the two inline constructor blocks (the model_cfg
assignments at the shown location and the similar block around lines 219–255)
with calls to that helper, allowing optional overrides for seq_length,
num_layers, vocab_size, etc.

In `@tests/unit_tests/recipes/nemotronh/test_nemotronh.py`:
- Around line 163-177: The parameterized test test_config_container_structure
currently includes a redundant provider_cls parameter because all tuples use
MambaModelProvider; simplify by parameterizing only the recipe functions
(nemotronh_4b_pretrain_config, nemotronh_8b_pretrain_config,
nemotronh_47b_pretrain_config, nemotronh_56b_pretrain_config) and replace the
second assertion to directly assert isinstance(config.model, MambaModelProvider)
while keeping the initial assert isinstance(config, ConfigContainer); update the
`@pytest.mark.parametrize` decorator and the test signature accordingly to remove
provider_cls.

In `@tests/unit_tests/recipes/test_moonlight_recipes.py`:
- Line 136: Create a reusable pytest fixture that performs the repeated
monkeypatching of MLAModelProvider to _FakeMoonlightModelProvider16B and replace
the duplicated monkeypatch.setattr(mod, "MLAModelProvider",
_FakeMoonlightModelProvider16B) lines in tests with that fixture; implement a
fixture (e.g., fake_mlamodel_provider or patch_mla_provider) that accepts
monkeypatch, calls monkeypatch.setattr(mod, "MLAModelProvider",
_FakeMoonlightModelProvider16B), mark it module- or session-scoped as
appropriate, and add the fixture name to the tests that previously had the
inline monkeypatch to eliminate duplication while preserving behavior.

In `@tests/unit_tests/training/test_config.py`:
- Around line 79-90: Rename the helper function create_test_deepseek_config to a
neutral name like create_test_mla_config since it now returns MLAModelProvider;
update the function definition and all call sites to the new name (search for
create_test_deepseek_config and replace with create_test_mla_config) and keep
the same parameters/defaults and return MLAModelProvider to preserve behavior.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 383b610 and 0bca980.

📒 Files selected for processing (61)
  • src/megatron/bridge/models/__init__.py
  • src/megatron/bridge/models/conversion/model_bridge.py
  • src/megatron/bridge/models/deepseek/__init__.py
  • src/megatron/bridge/models/deepseek/deepseek_provider.py
  • src/megatron/bridge/models/glm/__init__.py
  • src/megatron/bridge/models/glm/glm45_provider.py
  • src/megatron/bridge/models/glm_vl/glm_45v_provider.py
  • src/megatron/bridge/models/gpt_oss/__init__.py
  • src/megatron/bridge/models/gpt_oss/gpt_oss_provider.py
  • src/megatron/bridge/models/kimi/__init__.py
  • src/megatron/bridge/models/kimi/kimi_provider.py
  • src/megatron/bridge/models/llama/__init__.py
  • src/megatron/bridge/models/llama/llama_provider.py
  • src/megatron/bridge/models/llama_nemotron/__init__.py
  • src/megatron/bridge/models/llama_nemotron/llama_nemotron_bridge.py
  • src/megatron/bridge/models/llama_nemotron/llama_nemotron_provider.py
  • src/megatron/bridge/models/nemotron/__init__.py
  • src/megatron/bridge/models/nemotron/nemotron_provider.py
  • src/megatron/bridge/models/nemotron_vl/__init__.py
  • src/megatron/bridge/models/nemotron_vl/nemotron_vl_bridge.py
  • src/megatron/bridge/models/nemotron_vl/nemotron_vl_provider.py
  • src/megatron/bridge/models/nemotronh/__init__.py
  • src/megatron/bridge/models/nemotronh/nemotron_h_provider.py
  • src/megatron/bridge/recipes/kimi/kimi_k2.py
  • src/megatron/bridge/recipes/moonlight/moonlight_16b.py
  • src/megatron/bridge/recipes/nemotronh/nemotron_3_nano.py
  • src/megatron/bridge/recipes/nemotronh/nemotron_nano_v2.py
  • src/megatron/bridge/recipes/nemotronh/nemotronh.py
  • tests/functional_tests/data/test_samplers.py
  • tests/functional_tests/models/gpt_oss/test_gpt_oss_provider.py
  • tests/functional_tests/training/test_callbacks.py
  • tests/functional_tests/training/test_decentralized_pg.py
  • tests/functional_tests/training/test_finetune_dora.py
  • tests/functional_tests/training/test_finetune_lora.py
  • tests/functional_tests/training/test_inprocess_restart.py
  • tests/functional_tests/training/test_megatron_fsdp.py
  • tests/functional_tests/training/test_nvrx_straggler.py
  • tests/functional_tests/training/test_pretrain.py
  • tests/functional_tests/training/test_pretrain_resume.py
  • tests/functional_tests/training/test_sample_based_training.py
  • tests/functional_tests/training/test_sft.py
  • tests/functional_tests/training/test_tensor_inspect.py
  • tests/unit_tests/models/deepseek/test_deepseek_provider.py
  • tests/unit_tests/models/glm/test_glm45_provider.py
  • tests/unit_tests/models/glm_vl/test_glm_45v_provider.py
  • tests/unit_tests/models/gpt_oss/test_gpt_oss_provider.py
  • tests/unit_tests/models/kimi/test_kimi_provider.py
  • tests/unit_tests/models/llama/test_llama_provider.py
  • tests/unit_tests/models/llama_nemotron/test_llama_nemotron_bridge.py
  • tests/unit_tests/models/nemotron/test_nemotron_provider.py
  • tests/unit_tests/models/nemotronh/test_nemotron_h_provider.py
  • tests/unit_tests/models/test_models_imports.py
  • tests/unit_tests/recipes/kimi/test_kimi_k2.py
  • tests/unit_tests/recipes/nemotronh/test_nemotron_3_nano.py
  • tests/unit_tests/recipes/nemotronh/test_nemotron_nano_v2.py
  • tests/unit_tests/recipes/nemotronh/test_nemotronh.py
  • tests/unit_tests/recipes/test_glm45_recipes.py
  • tests/unit_tests/recipes/test_moonlight_recipes.py
  • tests/unit_tests/recipes/test_run_plugins.py
  • tests/unit_tests/training/test_config.py
  • tests/unit_tests/training/test_log_non_default_values.py
💤 Files with no reviewable changes (22)
  • src/megatron/bridge/models/glm/init.py
  • src/megatron/bridge/models/glm/glm45_provider.py
  • tests/unit_tests/models/deepseek/test_deepseek_provider.py
  • src/megatron/bridge/models/nemotron/init.py
  • tests/unit_tests/models/kimi/test_kimi_provider.py
  • tests/functional_tests/models/gpt_oss/test_gpt_oss_provider.py
  • src/megatron/bridge/models/kimi/kimi_provider.py
  • tests/unit_tests/models/glm/test_glm45_provider.py
  • src/megatron/bridge/models/gpt_oss/gpt_oss_provider.py
  • src/megatron/bridge/models/nemotron_vl/init.py
  • src/megatron/bridge/models/llama_nemotron/init.py
  • tests/unit_tests/models/llama/test_llama_provider.py
  • tests/unit_tests/models/nemotron/test_nemotron_provider.py
  • src/megatron/bridge/models/gpt_oss/init.py
  • tests/unit_tests/models/gpt_oss/test_gpt_oss_provider.py
  • src/megatron/bridge/models/llama/llama_provider.py
  • src/megatron/bridge/models/deepseek/deepseek_provider.py
  • tests/unit_tests/models/nemotronh/test_nemotron_h_provider.py
  • src/megatron/bridge/models/nemotronh/init.py
  • src/megatron/bridge/models/nemotron/nemotron_provider.py
  • src/megatron/bridge/models/nemotronh/nemotron_h_provider.py
  • tests/unit_tests/models/llama_nemotron/test_llama_nemotron_bridge.py

add_bias_linear: bool = False
add_qkv_bias: bool = True
seq_length: int = 131072
init_method_std: int = 0.02

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Type annotation mismatch: init_method_std should be float, not int.

The default value 0.02 is a float, but the type hint declares int.

🐛 Proposed fix
-    init_method_std: int = 0.02
+    init_method_std: float = 0.02
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
init_method_std: int = 0.02
init_method_std: float = 0.02
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/models/glm_vl/glm_45v_provider.py` at line 56, The type
hint for the variable init_method_std is incorrect: it is declared as int but
initialized with a float (0.02); update the annotation from int to float for
init_method_std in glm_45v_provider.py so the signature matches the default
value (leave the default 0.02 unchanged) — locate the init_method_std
declaration and change its type to float.

mamba_num_heads=256,
ffn_hidden_size=32768,
num_attention_heads=64,
attention_backend=AttnBackend.auto,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Get context around the attention_backend assignments and overrides
rg -n -B5 -A5 'attention_backend=AttnBackend\.auto|cfg\.model\.attention_backend = None' \
  src/megatron/bridge/recipes/nemotronh/nemotronh.py | head -200

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 6074


🏁 Script executed:

#!/bin/bash
# Get function signatures and full context around the assignments
# First, find which functions contain these lines
rg -n -B30 'attention_backend=AttnBackend\.auto' src/megatron/bridge/recipes/nemotronh/nemotronh.py | \
  grep -E '(^[0-9]+-def |attention_backend=AttnBackend\.auto)' | head -20

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 366


🏁 Script executed:

#!/bin/bash
# Check if MambaModelProvider has attention_backend parameter
rg -A20 'class MambaModelProvider|def __init__' src/megatron/bridge/models/mimo/mamba_provider.py | head -80

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 160


🏁 Script executed:

#!/bin/bash
# Find MambaModelProvider definition
fd -t f '*.py' -x grep -l 'class MambaModelProvider' {}

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 354


🏁 Script executed:

#!/bin/bash
# Search for attention_backend in MambaModelProvider
rg -l 'MambaModelProvider' --type py | head -10

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 626


🏁 Script executed:

#!/bin/bash
# Check MambaModelProvider's __init__ signature
rg -A30 'class MambaModelProvider' src/megatron/bridge/models/mamba/mamba_provider.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 1319


🏁 Script executed:

#!/bin/bash
# Check what parameters MambaModelProvider accepts
rg 'attention_backend' src/megatron/bridge/models/mamba/mamba_provider.py

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 125


🏁 Script executed:

#!/bin/bash
# Check if Mamba actually uses attention_backend or if it's just a placeholder
# Look at the full context of why attention_backend is set to None
rg -B10 -A5 'cfg.model.attention_backend = None' src/megatron/bridge/recipes/nemotronh/nemotronh.py | \
  grep -A15 'Kernel selections'

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 5400


🏁 Script executed:

#!/bin/bash
# Check if Mamba is truly an attention-free architecture
rg -C5 'attention' src/megatron/bridge/models/mamba/mamba_provider.py | head -40

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 1547


🏁 Script executed:

#!/bin/bash
# Check if there's any code between constructor call and the None assignment that might depend on attention_backend
# Get full function bodies for the three functions
rg -A100 'def nemotronh_56b_pretrain_config' src/megatron/bridge/recipes/nemotronh/nemotronh.py | \
  head -120

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 3447


🏁 Script executed:

#!/bin/bash
# Verify: does EVERY function that sets AttnBackend.auto also override with None?
# Count occurrences of each pattern
echo "=== Functions with AttnBackend.auto ==="
rg -B30 'attention_backend=AttnBackend\.auto' src/megatron/bridge/recipes/nemotronh/nemotronh.py | \
  grep 'def ' | wc -l

echo "=== Functions with cfg.model.attention_backend = None ==="
rg -B50 'cfg\.model\.attention_backend = None' src/megatron/bridge/recipes/nemotronh/nemotronh.py | \
  grep 'def ' | wc -l

echo "=== All function names that set AttnBackend.auto ==="
rg -B30 'attention_backend=AttnBackend\.auto' src/megatron/bridge/recipes/nemotronh/nemotronh.py | \
  grep 'def ' | awk '{print $NF}'

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 286


🏁 Script executed:

#!/bin/bash
# Check: is there any code between AttnBackend.auto and the None override that uses attention_backend?
rg -B1 -A70 'attention_backend=AttnBackend\.auto' src/megatron/bridge/recipes/nemotronh/nemotronh.py | \
  grep -E '(attention_backend|^[0-9]+-.*cfg\.model\.attention_backend = None)' | head -20

Repository: NVIDIA-NeMo/Megatron-Bridge

Length of output: 319


Remove redundant attention_backend=AttnBackend.auto from MambaModelProvider constructors in all three 56B variants.

Lines 455, 982, and 1563 set attention_backend=AttnBackend.auto, but each function immediately overrides it with cfg.model.attention_backend = None (lines 524, 1032, and 1613). Since Mamba models do not use traditional attention, the explicit None assignment is the intended configuration; remove the constructor argument to avoid misleading maintainers about the actual backend choice.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/megatron/bridge/recipes/nemotronh/nemotronh.py` at line 455, Remove the
misleading explicit attention_backend=AttnBackend.auto argument from the
MambaModelProvider constructor calls in the three 56B variants (the places that
currently pass attention_backend=AttnBackend.auto) because each function later
sets cfg.model.attention_backend = None; edit those MambaModelProvider(...)
invocations to omit the attention_backend parameter so the intended None
configuration (used by Mamba models) is the effective setting and will not
confuse maintainers.

@yaoyu-33

Copy link
Copy Markdown
Contributor Author

/ok to test c3afb89

yaoyu-33 and others added 16 commits March 12, 2026 15:25
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
- Remove NemotronNano12Bv2Provider from nemotron_vl/__init__.py
  (was a deprecated alias from deleted nemotron_h_provider.py)
- Remove invalid max_position_embeddings kwarg from kimi and moonlight
  recipes (not a field on MLAModelProvider)
- Update moonlight test to monkeypatch MLAModelProvider instead of
  deleted MoonlightModelProvider16B

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor
DeepSeekV2Bridge and DeepSeekV3Bridge were defined but never imported
in the parent models/__init__.py, so their @register_bridge decorators
never fired and DeepseekV3ForCausalLM was not a supported architecture.

Signed-off-by: Yu Yao <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor
Move yarn_* fields (yarn_rotary_scaling_factor, yarn_original_max_position_embeddings,
yarn_beta_fast, yarn_beta_slow, yarn_mscale, yarn_mscale_all_dim,
yarn_correction_range_round_to_int) from dynamic setattr injection to proper dataclass
fields on GPTModelProvider. This enables models like llama, qwen, gpt-oss, and
ministral3 to toggle YARN configs through the standard provider without needing
model-specific provider subclasses.

- Remove GPTOSSProvider import and re-wrapping hack in gpt_oss_bridge.py
- Pass YARN params directly as provider kwargs in model_bridge.py instead of
  stashing in _yarn_params dict for later setattr
- Fix megatron_to_hf_config to check value is not None instead of hasattr
  (since fields always exist now but default to None)

Signed-off-by: Yu Yao <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor
yarn_* fields are now proper dataclass fields on GPTModelProvider
defaulting to None, so None means 'unset' and should be omitted
from exported rope_scaling dict rather than preserved.

Made-with: Cursor
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
…resets

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor
…n NemotronNano12Bv2VLModelProvider

After PR #2628 renamed the field from hybrid_override_pattern to
hybrid_layer_pattern in MambaModelProvider, NemotronNano12Bv2VLModelProvider
still used the old name as a class default. When finalize() was called,
both the class default (hybrid_override_pattern) and the bridge
CONFIG_MAPPING-set value (hybrid_layer_pattern) were non-None,
triggering an assertion error.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
…rovider

finalize() derives num_layers from hybrid_layer_pattern; having both a
class-level default num_layers=62 and a user-supplied custom pattern
caused finalize() to assert 62==N(pattern) and fail.

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
…d_layer_pattern)

- Replace removed Llama3ModelProvider with GPTModelProvider in
  test_local_checkpointing.py (added from main after provider removal)
- Rename deprecated hybrid_override_pattern to hybrid_layer_pattern in
  all nemotronh recipes (nemotronh.py, nemotron_nano_v2.py,
  nemotron_3_nano.py) to match PR #2628 field rename in MambaModelProvider

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor
…w fixes

- Add num_layers=3 to MEGATRON_NEMOTRON_3_NANO_OVERRIDES so it matches
  the toy hybrid_layer_pattern="M*E" (3 layers vs recipe's 52)
- Fix init_method_std type hint from int to float in glm_45v_provider.py
- Remove redundant attention_backend=AttnBackend.auto from 56B nemotronh
  recipe constructors (each function overrides it to None later)

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor
Signed-off-by: ruit <ruit@nvidia.com>
The base provider_bridge only forced RoPE when rope_type was missing,
so HF configs with rope_type="llama3" (or "default", "longrope", …)
fell back to the provider default learned_absolute.  Fix by setting
position_embedding_type="rope" for every rope_type except "yarn"
(which is already handled by hf_config_to_provider_kwargs).  Also
align the key lookup to check both "type" and "rope_type" keys
consistently.

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor
…xport

Add the missing KimiK2Bridge implementation for Kimi K2 model support
and fix the kimi __init__.py to export KimiK2Bridge (matching the actual
class name) instead of the non-existent KimiBridge.

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor
@yaoyu-33

Copy link
Copy Markdown
Contributor Author

/ok to test c7aa37a

@yaoyu-33 yaoyu-33 mentioned this pull request Mar 12, 2026
5 tasks
The base class provider_bridge now sets position_embedding_type="rope"
for all non-YARN models, but Mamba models don't use position embeddings.
Override it back to "none" in NemotronHBridge.

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor
@yaoyu-33

Copy link
Copy Markdown
Contributor Author

/ok to test 827d325

The test_provider_bridge_mamba_config test was checking
hybrid_layer_pattern without calling finalize(), but this
field is only populated during finalize() from the deprecated
hybrid_override_pattern.

Signed-off-by: Yu Yao <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor
@yaoyu-33

Copy link
Copy Markdown
Contributor Author

/ok to test f932780

@yaoyu-33

Copy link
Copy Markdown
Contributor Author

/ok to test c72bb59

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready-to-merge PR is approved, current, and only waiting for CI to pass before merge

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants