[model, recipe, test] refactor: remove specific model provider dataclasses #2599
/ok to test 0bca980
📝 Walkthrough
This pull request consolidates and refactors the model provider API by removing specialized provider classes across multiple model families (Llama, Llama Nemotron, Nemotron, NemotronH, DeepSeek, GLM, GPT-OSS, Kimi) and replacing them with unified providers (`GPTModelProvider`, `MLAModelProvider`, `MambaModelProvider`). Corresponding recipe and test updates reflect the new provider structure, with bridges becoming the primary public API surface.
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
🚥 Pre-merge checks | ✅ 3 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/megatron/bridge/recipes/kimi/kimi_k2.py (1)
83-86: ⚠️ Potential issue | 🔴 Critical
`KimiBridge` implementation is missing – the recipe will fail at runtime.
The recipe imports `KimiBridge` from `src/megatron/bridge/models/kimi/kimi_bridge.py` (via `kimi/__init__.py`), but this file does not exist in the codebase. When `AutoBridge.from_hf_pretrained("moonshotai/Kimi-K2-Instruct")` is called, the dispatch system will fail to find a registered bridge implementation for the Kimi model architecture, causing a runtime error before `cfg.model.vocab_size` can even be accessed.
The bridge must be implemented similarly to `DeepSeekV2Bridge` (which uses `MLAModelProvider` and `@MegatronModelBridge.register_bridge`) to register with the dispatch system. Once implemented, `vocab_size` will be available from the HF config mapping (`GPTModelProvider` has `vocab_size: Optional[int] = None` and `model_bridge.py` maps it via `CONFIG_MAPPING`).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/recipes/kimi/kimi_k2.py` around lines 83 - 86, The recipe fails at runtime because the KimiBridge implementation is missing; create a new bridge class named KimiBridge that follows the pattern of DeepSeekV2Bridge: annotate it with `@MegatronModelBridge.register_bridge` and implement the required interface methods, use MLAModelProvider (or the appropriate ModelProvider) to map HF config fields (ensure vocab_size is exposed via CONFIG_MAPPING/GPTModelProvider semantics) and register the bridge so AutoBridge.from_hf_pretrained("moonshotai/Kimi-K2-Instruct") can dispatch correctly; mirror DeepSeekV2Bridge’s constructors and provider wiring to ensure cfg.model.vocab_size is populated.
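To illustrate why a missing registration only surfaces at runtime, here is a minimal sketch of decorator-based dispatch. It is not Megatron Bridge's actual code: `register_bridge`, `get_bridge`, `_BRIDGES`, `FakeRegisteredBridge`, and both architecture strings are hypothetical stand-ins chosen for illustration.

```python
from typing import Callable, Dict, Type

# Hypothetical registry: maps an architecture name to a bridge class.
_BRIDGES: Dict[str, Type] = {}


def register_bridge(architecture: str) -> Callable[[Type], Type]:
    """Class decorator that records a bridge under an architecture name."""

    def decorator(cls: Type) -> Type:
        _BRIDGES[architecture] = cls
        return cls

    return decorator


def get_bridge(architecture: str) -> Type:
    """Look up a bridge class; fail loudly if nothing was registered."""
    try:
        return _BRIDGES[architecture]
    except KeyError:
        raise NotImplementedError(
            f"No bridge registered for {architecture!r}"
        ) from None


@register_bridge("RegisteredArch")  # placeholder architecture name
class FakeRegisteredBridge:
    """Registered when its module is imported, so dispatch can find it."""


# Nothing is registered under "UnregisteredArch", so the lookup raises
# before any provider object (and therefore any vocab_size) exists.
```

In a design like this, importing a recipe succeeds and the error appears only when the lookup runs, which is consistent with the failure mode described in the comment.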
🧹 Nitpick comments (12)
tests/functional_tests/data/test_samplers.py (1)
620-622: LGTM! Migration is correct.
Optional: The `_DummyBridge` class with `GPTModelProvider` is duplicated across 5 test methods. Consider extracting it to a module-level helper or pytest fixture if this pattern expands further.
♻️ Optional refactor using pytest fixture
```python
# At module level or in conftest.py
import pytest


@pytest.fixture
def mock_llama3_bridge():
    """Mock AutoBridge to avoid HF downloads, returning GPTModelProvider."""
    from unittest import mock as _mock

    class _DummyBridge:
        def to_megatron_provider(self, load_weights=False):
            from megatron.bridge.models.gpt_provider import GPTModelProvider

            return GPTModelProvider()

    with _mock.patch(
        "megatron.bridge.recipes.llama.llama3.AutoBridge.from_hf_pretrained",
        return_value=_DummyBridge(),
    ):
        yield
```
Then tests can simply use `def test_...(self, mock_llama3_bridge):` without the repeated boilerplate.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/functional_tests/data/test_samplers.py` around lines 620 - 622, Extract the repeated _DummyBridge + GPTModelProvider pattern into a reusable pytest fixture or module-level helper to avoid duplication: create a fixture (e.g., mock_llama3_bridge) that defines class _DummyBridge with method to_megatron_provider returning GPTModelProvider, and use unittest.mock.patch to patch "megatron.bridge.recipes.llama.llama3.AutoBridge.from_hf_pretrained" to return that _DummyBridge, then update the five tests to accept the fixture (e.g., def test_...(mock_llama3_bridge)) instead of repeating the inline class and patch.
tests/unit_tests/recipes/test_moonlight_recipes.py (1)
136-136: Consolidate repeated `MLAModelProvider` monkeypatching into a pytest fixture.
The identical `monkeypatch.setattr(mod, "MLAModelProvider", _FakeMoonlightModelProvider16B)` setup is repeated in many tests. Please move this into a shared fixture (module-level or reusable helper fixture) to reduce duplication and keep setup consistent.
As per coding guidelines, "tests/**/*.py: Use pytest fixtures for common setup in unit tests".
Also applies to: 171-171, 196-196, 222-222, 238-238, 261-261, 284-284, 307-307, 329-329
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit_tests/recipes/test_moonlight_recipes.py` at line 136, Create a reusable pytest fixture that performs the repeated monkeypatching of MLAModelProvider to _FakeMoonlightModelProvider16B and replace the duplicated monkeypatch.setattr(mod, "MLAModelProvider", _FakeMoonlightModelProvider16B) lines in tests with that fixture; implement a fixture (e.g., fake_mlamodel_provider or patch_mla_provider) that accepts monkeypatch, calls monkeypatch.setattr(mod, "MLAModelProvider", _FakeMoonlightModelProvider16B), keep it function-scoped (pytest's built-in monkeypatch fixture is function-scoped), and add the fixture name to the tests that previously had the inline monkeypatch to eliminate duplication while preserving behavior.
src/megatron/bridge/recipes/moonlight/moonlight_16b.py (2)
413-468: Optional: Extract shared architecture parameters to reduce duplication.
The `MLAModelProvider` instantiation in `moonlight_16b_peft_config` duplicates ~40 architecture parameters that are identical to `moonlight_16b_sft_config` (lines 200-241 vs 414-454). Only the parallelism settings differ.
This is acceptable for clarity and explicit configuration, but if maintainability becomes a concern, consider extracting the shared architecture parameters into a module-level constant.
♻️ Example: Extract shared architecture parameters
```python
# Module-level constant for shared architecture
_MOONLIGHT_16B_ARCH = dict(
    num_layers=27,
    hidden_size=2048,
    ffn_hidden_size=11264,
    # ... remaining shared params
)


def moonlight_16b_sft_config() -> ConfigContainer:
    cfg.model = MLAModelProvider(
        **_MOONLIGHT_16B_ARCH,
        # SFT-specific parallelism
        tensor_model_parallel_size=2,
        expert_model_parallel_size=8,
        sequence_parallel=True,
        # ...
    )
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/recipes/moonlight/moonlight_16b.py` around lines 413 - 468, The MLAModelProvider block in moonlight_16b_peft_config duplicates many architecture params already used in moonlight_16b_sft_config; extract the shared architecture into a module-level constant (e.g., _MOONLIGHT_16B_ARCH) containing the common keys (num_layers, hidden_size, ffn_hidden_size, num_attention_heads, kv_channels, vocab_size, rotary_*, layernorm_epsilon, init_method_std, activation_func, gated_linear_unit, etc.), then update both moonlight_16b_sft_config and moonlight_16b_peft_config to call MLAModelProvider(**_MOONLIGHT_16B_ARCH, <their specific overrides>) so only parallelism and config-specific overrides remain in each function.
199-255: Consider removing redundant parameter assignments.
The `MLAModelProvider` constructor sets `moe_token_dispatcher_type="alltoall"` (line 241) and `moe_grouped_gemm=True` (line 240), but these are reassigned later in the function:
- `cfg.model.moe_token_dispatcher_type` at line 284
- `cfg.model.moe_grouped_gemm` at line 308
- `cfg.model.moe_permute_fusion` at lines 274 and 307

While functionally correct, the duplication could lead to drift if only one location is updated.
♻️ Suggested cleanup: remove redundant assignments from constructor or later code
Either remove from the constructor (if they're always overridden by the DeepEP logic):
```diff
     moe_grouped_gemm=True,
-    moe_token_dispatcher_type="alltoall",
     # Parallelism
```
Or consolidate all MoE settings in one place and remove the later redundant assignments.
Note: The Ruff S106 warnings about `moe_token_dispatcher_type` are false positives; this is a Mixture-of-Experts configuration parameter, not a password.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/recipes/moonlight/moonlight_16b.py` around lines 199 - 255, The constructor for MLAModelProvider sets MoE-related defaults (moe_grouped_gemm, moe_token_dispatcher_type, moe_permute_fusion) that are later reassigned on cfg.model (e.g., cfg.model.moe_token_dispatcher_type, cfg.model.moe_grouped_gemm, cfg.model.moe_permute_fusion); remove the redundant assignments either from the MLAModelProvider(...) argument list or from the later reassignment block so all MoE settings are defined in one place; locate MLAModelProvider(...) in moonlight_16b.py and the subsequent lines where cfg.model.moe_permute_fusion, cfg.model.moe_token_dispatcher_type, and cfg.model.moe_grouped_gemm are set, then keep only the preferred location and delete the duplicates.
tests/unit_tests/training/test_config.py (1)
79-90: Consider renaming the helper to reflect the new abstraction.
`create_test_deepseek_config` now returns `MLAModelProvider`. A neutral helper name (e.g., `create_test_mla_config`) would reduce ambiguity as MLA coverage expands.
♻️ Optional clarity refactor
```diff
-def create_test_deepseek_config(**kwargs: Any) -> MLAModelProvider:
-    """Creates an instance of MLAModelProvider for testing."""
+def create_test_mla_config(**kwargs: Any) -> MLAModelProvider:
+    """Creates an instance of MLAModelProvider for testing."""
     defaults = {
         "num_layers": 1,
         "hidden_size": 128,
         "num_attention_heads": 4,
         "seq_length": 512,
         "apply_rope_fusion": False,
     }
     defaults.update(kwargs)
     return MLAModelProvider(**defaults)
+
+# Backward-compatible alias for existing tests.
+create_test_deepseek_config = create_test_mla_config
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit_tests/training/test_config.py` around lines 79 - 90, Rename the helper function create_test_deepseek_config to a neutral name like create_test_mla_config since it now returns MLAModelProvider; update the function definition and all call sites to the new name (search for create_test_deepseek_config and replace with create_test_mla_config) and keep the same parameters/defaults and return MLAModelProvider to preserve behavior.
tests/unit_tests/recipes/nemotronh/test_nemotronh.py (1)
163-177: Consider simplifying the parameterized test.
Since all variants now use `MambaModelProvider`, the `provider_cls` parameter is redundant: every tuple maps to the same class. You could simplify by removing the parameter and using `MambaModelProvider` directly in the assertion.
That said, keeping the current structure is fine if you prefer consistency with other parameterized tests or want flexibility for future divergence.
♻️ Optional simplification
```diff
 @pytest.mark.parametrize(
-    "recipe_fn,provider_cls",
+    "recipe_fn",
     [
-        (nemotronh_4b_pretrain_config, MambaModelProvider),
-        (nemotronh_8b_pretrain_config, MambaModelProvider),
-        (nemotronh_47b_pretrain_config, MambaModelProvider),
-        (nemotronh_56b_pretrain_config, MambaModelProvider),
+        nemotronh_4b_pretrain_config,
+        nemotronh_8b_pretrain_config,
+        nemotronh_47b_pretrain_config,
+        nemotronh_56b_pretrain_config,
     ],
 )
-def test_config_container_structure(self, recipe_fn, provider_cls):
+def test_config_container_structure(self, recipe_fn):
     """Test that all configs return proper ConfigContainer with correct model provider."""
     config = recipe_fn()
     assert isinstance(config, ConfigContainer)
-    assert isinstance(config.model, provider_cls)
+    assert isinstance(config.model, MambaModelProvider)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit_tests/recipes/nemotronh/test_nemotronh.py` around lines 163 - 177, The parameterized test test_config_container_structure currently includes a redundant provider_cls parameter because all tuples use MambaModelProvider; simplify by parameterizing only the recipe functions (nemotronh_4b_pretrain_config, nemotronh_8b_pretrain_config, nemotronh_47b_pretrain_config, nemotronh_56b_pretrain_config) and replace the second assertion to directly assert isinstance(config.model, MambaModelProvider) while keeping the initial assert isinstance(config, ConfigContainer); update the `@pytest.mark.parametrize` decorator and the test signature accordingly to remove provider_cls.
src/megatron/bridge/recipes/nemotronh/nemotron_3_nano.py (1)
211-268: Configuration duplication across pretrain/SFT/PEFT is notable but acceptable.
The SFT and PEFT configs duplicate the `MambaModelProvider` instantiation with ~40 identical architecture parameters. While this could be extracted into a helper function, this pattern appears consistent with other recipes in the codebase.
♻️ Optional: Extract common architecture parameters
```python
def _nemotron_3_nano_model_kwargs(seq_length: int = 8192) -> dict:
    """Common Nemotron 3 Nano architecture parameters."""
    return dict(
        hybrid_override_pattern="MEMEM*EMEMEM*EMEMEM*EMEMEM*EMEMEM*EMEMEMEM*EMEMEMEME",
        num_layers=52,
        hidden_size=2688,
        mamba_num_heads=64,
        kv_channels=128,
        mamba_state_dim=128,
        # ... remaining common params
        seq_length=seq_length,
    )


# Usage in configs:
cfg.model = MambaModelProvider(
    **_nemotron_3_nano_model_kwargs(seq_length=8192),
    tensor_model_parallel_size=4,
    # ... parallelism params
)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/recipes/nemotronh/nemotron_3_nano.py` around lines 211 - 268, Multiple recipes duplicate the same MambaModelProvider architecture block; extract those shared parameters into a helper function (e.g., _nemotron_3_nano_model_kwargs) that returns a dict of the common keys (hybrid_override_pattern, num_layers, hidden_size, mamba_num_heads, kv_channels, mamba_state_dim, ffn_hidden_size, seq_length, etc.), then update cfg.model assignments in pretrain/SFT/PEFT to call MambaModelProvider(**_nemotron_3_nano_model_kwargs(seq_length=...), <override or additional parallelism args>) so only recipe-specific fields (like tensor_model_parallel_size, pipeline_model_parallel_size, pipeline_dtype, expert_model_parallel_size) remain inline.
tests/functional_tests/training/test_pretrain_resume.py (1)
47-72: LGTM with optional consolidation opportunity.
The `Llama3ModelProvider145M` class correctly inherits from `GPTModelProvider` with Llama-3-style defaults. Note that this class definition is nearly identical to the one in `test_sft.py` (with minor differences in `seq_length` and `num_layers` defaults). Consider extracting a shared test utility class in a future refactor to reduce duplication across functional tests.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/functional_tests/training/test_pretrain_resume.py` around lines 47 - 72, The Llama3ModelProvider145M test class duplicates a nearly identical definition found in test_sft.py; refactor by extracting the shared defaults into a single reusable test utility (e.g., a base class or factory function) and have both Llama3ModelProvider145M and the counterpart in test_sft.py inherit from or invoke that utility; update the unique overrides (seq_length, num_layers) in each test-specific subclass or factory call and remove the duplicated field definitions to centralize maintenance.
tests/functional_tests/training/test_finetune_lora.py (1)
53-80: LGTM!
The `Llama3ModelProvider145M` class correctly inherits from `GPTModelProvider` with Llama-3-style defaults. The explicit `vocab_size=128256` is appropriate for LoRA finetuning tests that use HuggingFace tokenizers.
This is the third definition of `Llama3ModelProvider145M` across functional test files (also in `test_sft.py` and `test_pretrain_resume.py`). Consider consolidating these into a shared test utility module (e.g., `tests/functional_tests/utils.py` or a dedicated `tests/functional_tests/training/model_providers.py`) in a future refactor to reduce maintenance burden.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/functional_tests/training/test_finetune_lora.py` around lines 53 - 80, Duplicate Llama3ModelProvider145M definitions should be consolidated into a single shared test utility: create a new test helper module (e.g., tests/functional_tests/training/model_providers.py or utils.py), move the class Llama3ModelProvider145M into that module, update usages in tests that currently define the class (references in test_sft.py and test_pretrain_resume.py and this file) to import Llama3ModelProvider145M from the new module, and remove the duplicate class definitions so tests import the single canonical provider.
tests/functional_tests/training/test_decentralized_pg.py (1)
104-140: High duplication in model config setup across decentralized-PG tests.A shared builder (with per-test overrides for TP/PP/CP/share-embeddings) would make these tests easier to maintain and less brittle.
Also applies to: 255-290, 410-446, 566-602, 722-758, 878-914
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/functional_tests/training/test_decentralized_pg.py` around lines 104 - 140, The duplicated GPTModelProvider block should be extracted into a shared builder function (e.g., make_decentralized_model_cfg or build_base_gpt_model) that returns a GPTModelProvider pre-populated with the common args shown (normalization, activation_func, gated_linear_unit, position_embedding_type, ... hidden_size, ffn_hidden_size, num_attention_heads, init_method_std, layernorm_epsilon, rotary/rope settings, seq_length, make_vocab_size_divisible_by, vocab_size, num_layers, etc.), and each test should call that builder and apply per-test overrides for tensor_model_parallel_size, pipeline_model_parallel_size, context_parallel_size, sequence_parallel, and share_embeddings_and_output_weights; update all referenced tests (the blocks at ~104-140, 255-290, 410-446, 566-602, 722-758, 878-914) to use the builder and only change the TP/PP/CP/share-embeddings fields to reduce duplication and improve maintainability.
src/megatron/bridge/recipes/nemotronh/nemotron_nano_v2.py (1)
38-77: Consider factoring shared MambaModelProvider kwargs into reusable builders/constants.These six blocks duplicate a large common parameter surface; extracting shared defaults would reduce maintenance overhead and accidental divergence across pretrain/SFT/PEFT variants.
Also applies to: 173-212, 310-349, 436-475, 571-610, 718-757
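Before factoring the six blocks, it can help to compute which kwargs are truly shared and which differ per variant. The helper below is a hypothetical sketch, not part of the codebase; `split_shared_kwargs` and the sample values are placeholders, not the real Nemotron Nano v2 settings.

```python
from typing import Any, Dict, List, Tuple


def split_shared_kwargs(
    variants: List[Dict[str, Any]],
) -> Tuple[Dict[str, Any], List[Dict[str, Any]]]:
    """Split per-variant kwargs into (shared defaults, per-variant overrides)."""
    first = variants[0]
    # A key is shared only if every variant has it with an equal value.
    shared = {
        key: value
        for key, value in first.items()
        if all(key in v and v[key] == value for v in variants)
    }
    # Everything else stays with the variant that declared it.
    overrides = [
        {k: v for k, v in variant.items() if k not in shared}
        for variant in variants
    ]
    return shared, overrides


# Placeholder kwargs standing in for two of the duplicated constructor calls.
pretrain_kwargs = {"num_layers": 4, "hidden_size": 64, "seq_length": 8192}
sft_kwargs = {"num_layers": 4, "hidden_size": 64, "seq_length": 2048}

shared, overrides = split_shared_kwargs([pretrain_kwargs, sft_kwargs])
```

Merging `shared` with one entry of `overrides` reproduces the original kwargs for that variant, so the split can be checked mechanically before any recipe is rewritten.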
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/megatron/bridge/recipes/nemotronh/nemotron_nano_v2.py` around lines 38 - 77, Multiple MambaModelProvider invocations (e.g., the cfg.model = MambaModelProvider block) duplicate a large common parameter set across variants; factor those shared kwargs into a reusable constant or builder (for example a MAMBA_DEFAULTS dict or a build_mamba_kwargs(name, overrides) helper) and then construct each provider via unpacking the shared defaults plus only the variant-specific overrides (use the existing symbols like MambaModelProvider and cfg.model to locate callers and merge with per-variant keys such as num_layers, seq_length, pipeline_dtype, etc.). Ensure the new helper/constant is imported/defined once and replace the repeated blocks at the other occurrences noted so differences remain explicit and minimal.
tests/functional_tests/training/test_pretrain.py (1)
73-108: Consider extracting a shared GPT model-config factory.These two constructor blocks are mostly duplicated; a small helper would reduce config drift across tests when provider defaults/fields change.
Also applies to: 219-255
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/functional_tests/training/test_pretrain.py` around lines 73 - 108, The two nearly identical GPTModelProvider constructor blocks (creating model_cfg) should be extracted into a shared factory to avoid duplication and drift; add a helper function (e.g., make_gpt_model_cfg or build_gpt_model_config) that returns a configured GPTModelProvider with the common parameters shown (normalization, activation_func, gated_linear_unit, position_embedding_type, add_bias_linear, attention_dropout, hidden_dropout, bias_activation_fusion, masked_softmax_fusion, persist_layer_norm, bias_dropout_fusion, apply_rope_fusion, num_query_groups, init_method_std, layernorm_epsilon, rotary_percent, rope_scaling, rope_scaling_factor, share_embeddings_and_output_weights, rotary_base, hidden_size, ffn_hidden_size, num_attention_heads, tensor_model_parallel_size, pipeline_model_parallel_size, context_parallel_size, sequence_parallel, attention_softmax_in_fp32, pipeline_dtype, bf16, seq_length, make_vocab_size_divisible_by, vocab_size, num_layers) and replace the two inline constructor blocks (the model_cfg assignments at the shown location and the similar block around lines 219–255) with calls to that helper, allowing optional overrides for seq_length, num_layers, vocab_size, etc.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/megatron/bridge/models/glm_vl/glm_45v_provider.py`:
- Line 56: The type hint for the variable init_method_std is incorrect: it is
declared as int but initialized with a float (0.02); update the annotation from
int to float for init_method_std in glm_45v_provider.py so the signature matches
the default value (leave the default 0.02 unchanged) — locate the
init_method_std declaration and change its type to float.
In `@src/megatron/bridge/recipes/nemotronh/nemotronh.py`:
- Line 455: Remove the misleading explicit attention_backend=AttnBackend.auto
argument from the MambaModelProvider constructor calls in the three 56B variants
(the places that currently pass attention_backend=AttnBackend.auto) because each
function later sets cfg.model.attention_backend = None; edit those
MambaModelProvider(...) invocations to omit the attention_backend parameter so
the intended None configuration (used by Mamba models) is the effective setting
and will not confuse maintainers.
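On the `init_method_std` finding above: Python dataclasses do not enforce annotations at runtime, so an `int` annotation with a `0.02` default passes silently. The sketch below shows one way to spot such mismatches. `FakeProvider` and `mismatched_defaults` are hypothetical names for illustration, and the check only handles plain class annotations.

```python
from dataclasses import MISSING, dataclass, fields
from typing import List


@dataclass
class FakeProvider:
    """Hypothetical stand-in reproducing the flagged mismatch."""

    init_method_std: int = 0.02  # annotated int, but the default is a float
    num_layers: int = 2


def mismatched_defaults(cls) -> List[str]:
    """List fields whose default is not an instance of the annotated class."""
    bad = []
    for f in fields(cls):
        # Skip fields without a default and annotations that are not plain
        # classes (e.g. Optional[int] or string annotations).
        if f.default is MISSING or not isinstance(f.type, type):
            continue
        if not isinstance(f.default, f.type):
            bad.append(f.name)
    return bad


flagged = mismatched_defaults(FakeProvider)
```

A static type checker would typically report the same mismatch; the runtime check is only a lightweight illustration.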
---
Outside diff comments:
In `@src/megatron/bridge/recipes/kimi/kimi_k2.py`:
- Around line 83-86: The recipe fails at runtime because the KimiBridge
implementation is missing; create a new bridge class named KimiBridge that
follows the pattern of DeepSeekV2Bridge: annotate it with
`@MegatronModelBridge.register_bridge` and implement the required interface
methods, use MLAModelProvider (or the appropriate ModelProvider) to map HF
config fields (ensure vocab_size is exposed via CONFIG_MAPPING/GPTModelProvider
semantics) and register the bridge so
AutoBridge.from_hf_pretrained("moonshotai/Kimi-K2-Instruct") can dispatch
correctly; mirror DeepSeekV2Bridge’s constructors and provider wiring to ensure
cfg.model.vocab_size is populated.
---
Nitpick comments:
In `@src/megatron/bridge/recipes/moonlight/moonlight_16b.py`:
- Around line 413-468: The MLAModelProvider block in moonlight_16b_peft_config
duplicates many architecture params already used in moonlight_16b_sft_config;
extract the shared architecture into a module-level constant (e.g.,
_MOONLIGHT_16B_ARCH) containing the common keys (num_layers, hidden_size,
ffn_hidden_size, num_attention_heads, kv_channels, vocab_size, rotary_*,
layernorm_epsilon, init_method_std, activation_func, gated_linear_unit, etc.),
then update both moonlight_16b_sft_config and moonlight_16b_peft_config to call
MLAModelProvider(**_MOONLIGHT_16B_ARCH, <their specific overrides>) so only
parallelism and config-specific overrides remain in each function.
- Around line 199-255: The constructor for MLAModelProvider sets MoE-related
defaults (moe_grouped_gemm, moe_token_dispatcher_type, moe_permute_fusion) that
are later reassigned on cfg.model (e.g., cfg.model.moe_token_dispatcher_type,
cfg.model.moe_grouped_gemm, cfg.model.moe_permute_fusion); remove the redundant
assignments either from the MLAModelProvider(...) argument list or from the
later reassignment block so all MoE settings are defined in one place—locate
MLAModelProvider(...) in moonlight_16b.py and the subsequent lines where
cfg.model.moe_permute_fusion, cfg.model.moe_token_dispatcher_type, and
cfg.model.moe_grouped_gemm are set, then keep only the preferred location and
delete the duplicates.
In `@src/megatron/bridge/recipes/nemotronh/nemotron_3_nano.py`:
- Around line 211-268: Multiple recipes duplicate the same MambaModelProvider
architecture block; extract those shared parameters into a helper function
(e.g., _nemotron_3_nano_model_kwargs) that returns a dict of the common keys
(hybrid_override_pattern, num_layers, hidden_size, mamba_num_heads, kv_channels,
mamba_state_dim, ffn_hidden_size, seq_length, etc.), then update cfg.model
assignments in pretrain/SFT/PEFT to call
MambaModelProvider(**_nemotron_3_nano_model_kwargs(seq_length=...), <override or
additional parallelism args>) so only recipe-specific fields (like
tensor_model_parallel_size, pipeline_model_parallel_size, pipeline_dtype,
expert_model_parallel_size) remain inline.
In `@src/megatron/bridge/recipes/nemotronh/nemotron_nano_v2.py`:
- Around line 38-77: Multiple MambaModelProvider invocations (e.g., the
cfg.model = MambaModelProvider block) duplicate a large common parameter set
across variants; factor those shared kwargs into a reusable constant or builder
(for example a MAMBA_DEFAULTS dict or a build_mamba_kwargs(name, overrides)
helper) and then construct each provider via unpacking the shared defaults plus
only the variant-specific overrides (use the existing symbols like
MambaModelProvider and cfg.model to locate callers and merge with per-variant
keys such as num_layers, seq_length, pipeline_dtype, etc.). Ensure the new
helper/constant is imported/defined once and replace the repeated blocks at the
other occurrences noted so differences remain explicit and minimal.
In `@tests/functional_tests/data/test_samplers.py`:
- Around line 620-622: Extract the repeated _DummyBridge + GPTModelProvider
pattern into a reusable pytest fixture or module-level helper to avoid
duplication: create a fixture (e.g., mock_llama3_bridge) that defines class
_DummyBridge with method to_megatron_provider returning GPTModelProvider, and
use unittest.mock.patch to patch
"megatron.bridge.recipes.llama.llama3.AutoBridge.from_hf_pretrained" to return
that _DummyBridge, then update the five tests to accept the fixture (e.g., def
test_...(mock_llama3_bridge)) instead of repeating the inline class and patch.
In `@tests/functional_tests/training/test_decentralized_pg.py`:
- Around line 104-140: The duplicated GPTModelProvider block should be extracted
into a shared builder function (e.g., make_decentralized_model_cfg or
build_base_gpt_model) that returns a GPTModelProvider pre-populated with the
common args shown (normalization, activation_func, gated_linear_unit,
position_embedding_type, ... hidden_size, ffn_hidden_size, num_attention_heads,
init_method_std, layernorm_epsilon, rotary/rope settings, seq_length,
make_vocab_size_divisible_by, vocab_size, num_layers, etc.), and each test
should call that builder and apply per-test overrides for
tensor_model_parallel_size, pipeline_model_parallel_size, context_parallel_size,
sequence_parallel, and share_embeddings_and_output_weights; update all
referenced tests (the blocks at ~104-140, 255-290, 410-446, 566-602, 722-758,
878-914) to use the builder and only change the TP/PP/CP/share-embeddings fields
to reduce duplication and improve maintainability.
In `@tests/functional_tests/training/test_finetune_lora.py`:
- Around line 53-80: Duplicate Llama3ModelProvider145M definitions should be
consolidated into a single shared test utility: create a new test helper module
(e.g., tests/functional_tests/training/model_providers.py or utils.py), move the
class Llama3ModelProvider145M into that module, update usages in tests that
currently define the class (references in test_sft.py and
test_pretrain_resume.py and this file) to import Llama3ModelProvider145M from
the new module, and remove the duplicate class definitions so tests import the
single canonical provider.
In `@tests/functional_tests/training/test_pretrain_resume.py`:
- Around line 47-72: The Llama3ModelProvider145M test class duplicates a nearly
identical definition found in test_sft.py; refactor by extracting the shared
defaults into a single reusable test utility (e.g., a base class or factory
function) and have both Llama3ModelProvider145M and the counterpart in
test_sft.py inherit from or invoke that utility; update the unique overrides
(seq_length, num_layers) in each test-specific subclass or factory call and
remove the duplicated field definitions to centralize maintenance.
In `@tests/functional_tests/training/test_pretrain.py`:
- Around line 73-108: The two nearly identical GPTModelProvider constructor
blocks (creating model_cfg) should be extracted into a shared factory to avoid
duplication and drift; add a helper function (e.g., make_gpt_model_cfg or
build_gpt_model_config) that returns a configured GPTModelProvider with the
common parameters shown (normalization, activation_func, gated_linear_unit,
position_embedding_type, add_bias_linear, attention_dropout, hidden_dropout,
bias_activation_fusion, masked_softmax_fusion, persist_layer_norm,
bias_dropout_fusion, apply_rope_fusion, num_query_groups, init_method_std,
layernorm_epsilon, rotary_percent, rope_scaling, rope_scaling_factor,
share_embeddings_and_output_weights, rotary_base, hidden_size, ffn_hidden_size,
num_attention_heads, tensor_model_parallel_size, pipeline_model_parallel_size,
context_parallel_size, sequence_parallel, attention_softmax_in_fp32,
pipeline_dtype, bf16, seq_length, make_vocab_size_divisible_by, vocab_size,
num_layers) and replace the two inline constructor blocks (the model_cfg
assignments at the shown location and the similar block around lines 219–255)
with calls to that helper, allowing optional overrides for seq_length,
num_layers, vocab_size, etc.
In `@tests/unit_tests/recipes/nemotronh/test_nemotronh.py`:
- Around line 163-177: The parameterized test test_config_container_structure
currently includes a redundant provider_cls parameter because all tuples use
MambaModelProvider; simplify by parameterizing only the recipe functions
(nemotronh_4b_pretrain_config, nemotronh_8b_pretrain_config,
nemotronh_47b_pretrain_config, nemotronh_56b_pretrain_config) and replace the
second assertion to directly assert isinstance(config.model, MambaModelProvider)
while keeping the initial assert isinstance(config, ConfigContainer); update the
`@pytest.mark.parametrize` decorator and the test signature accordingly to remove
provider_cls.
In `@tests/unit_tests/recipes/test_moonlight_recipes.py`:
- Line 136: Create a reusable pytest fixture that performs the repeated
monkeypatching of MLAModelProvider to _FakeMoonlightModelProvider16B and replace
the duplicated monkeypatch.setattr(mod, "MLAModelProvider",
_FakeMoonlightModelProvider16B) lines in tests with that fixture; implement a
fixture (e.g., fake_mlamodel_provider or patch_mla_provider) that accepts
monkeypatch, calls monkeypatch.setattr(mod, "MLAModelProvider",
_FakeMoonlightModelProvider16B), keep it function-scoped (pytest's built-in
monkeypatch fixture is function-scoped), and add the fixture name to the tests
that previously had the
inline monkeypatch to eliminate duplication while preserving behavior.
In `@tests/unit_tests/training/test_config.py`:
- Around line 79-90: Rename the helper function create_test_deepseek_config to a
neutral name like create_test_mla_config since it now returns MLAModelProvider;
update the function definition and all call sites to the new name (search for
create_test_deepseek_config and replace with create_test_mla_config) and keep
the same parameters/defaults and return MLAModelProvider to preserve behavior.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (61)
- src/megatron/bridge/models/__init__.py
- src/megatron/bridge/models/conversion/model_bridge.py
- src/megatron/bridge/models/deepseek/__init__.py
- src/megatron/bridge/models/deepseek/deepseek_provider.py
- src/megatron/bridge/models/glm/__init__.py
- src/megatron/bridge/models/glm/glm45_provider.py
- src/megatron/bridge/models/glm_vl/glm_45v_provider.py
- src/megatron/bridge/models/gpt_oss/__init__.py
- src/megatron/bridge/models/gpt_oss/gpt_oss_provider.py
- src/megatron/bridge/models/kimi/__init__.py
- src/megatron/bridge/models/kimi/kimi_provider.py
- src/megatron/bridge/models/llama/__init__.py
- src/megatron/bridge/models/llama/llama_provider.py
- src/megatron/bridge/models/llama_nemotron/__init__.py
- src/megatron/bridge/models/llama_nemotron/llama_nemotron_bridge.py
- src/megatron/bridge/models/llama_nemotron/llama_nemotron_provider.py
- src/megatron/bridge/models/nemotron/__init__.py
- src/megatron/bridge/models/nemotron/nemotron_provider.py
- src/megatron/bridge/models/nemotron_vl/__init__.py
- src/megatron/bridge/models/nemotron_vl/nemotron_vl_bridge.py
- src/megatron/bridge/models/nemotron_vl/nemotron_vl_provider.py
- src/megatron/bridge/models/nemotronh/__init__.py
- src/megatron/bridge/models/nemotronh/nemotron_h_provider.py
- src/megatron/bridge/recipes/kimi/kimi_k2.py
- src/megatron/bridge/recipes/moonlight/moonlight_16b.py
- src/megatron/bridge/recipes/nemotronh/nemotron_3_nano.py
- src/megatron/bridge/recipes/nemotronh/nemotron_nano_v2.py
- src/megatron/bridge/recipes/nemotronh/nemotronh.py
- tests/functional_tests/data/test_samplers.py
- tests/functional_tests/models/gpt_oss/test_gpt_oss_provider.py
- tests/functional_tests/training/test_callbacks.py
- tests/functional_tests/training/test_decentralized_pg.py
- tests/functional_tests/training/test_finetune_dora.py
- tests/functional_tests/training/test_finetune_lora.py
- tests/functional_tests/training/test_inprocess_restart.py
- tests/functional_tests/training/test_megatron_fsdp.py
- tests/functional_tests/training/test_nvrx_straggler.py
- tests/functional_tests/training/test_pretrain.py
- tests/functional_tests/training/test_pretrain_resume.py
- tests/functional_tests/training/test_sample_based_training.py
- tests/functional_tests/training/test_sft.py
- tests/functional_tests/training/test_tensor_inspect.py
- tests/unit_tests/models/deepseek/test_deepseek_provider.py
- tests/unit_tests/models/glm/test_glm45_provider.py
- tests/unit_tests/models/glm_vl/test_glm_45v_provider.py
- tests/unit_tests/models/gpt_oss/test_gpt_oss_provider.py
- tests/unit_tests/models/kimi/test_kimi_provider.py
- tests/unit_tests/models/llama/test_llama_provider.py
- tests/unit_tests/models/llama_nemotron/test_llama_nemotron_bridge.py
- tests/unit_tests/models/nemotron/test_nemotron_provider.py
- tests/unit_tests/models/nemotronh/test_nemotron_h_provider.py
- tests/unit_tests/models/test_models_imports.py
- tests/unit_tests/recipes/kimi/test_kimi_k2.py
- tests/unit_tests/recipes/nemotronh/test_nemotron_3_nano.py
- tests/unit_tests/recipes/nemotronh/test_nemotron_nano_v2.py
- tests/unit_tests/recipes/nemotronh/test_nemotronh.py
- tests/unit_tests/recipes/test_glm45_recipes.py
- tests/unit_tests/recipes/test_moonlight_recipes.py
- tests/unit_tests/recipes/test_run_plugins.py
- tests/unit_tests/training/test_config.py
- tests/unit_tests/training/test_log_non_default_values.py
💤 Files with no reviewable changes (22)
- src/megatron/bridge/models/glm/__init__.py
- src/megatron/bridge/models/glm/glm45_provider.py
- tests/unit_tests/models/deepseek/test_deepseek_provider.py
- src/megatron/bridge/models/nemotron/__init__.py
- tests/unit_tests/models/kimi/test_kimi_provider.py
- tests/functional_tests/models/gpt_oss/test_gpt_oss_provider.py
- src/megatron/bridge/models/kimi/kimi_provider.py
- tests/unit_tests/models/glm/test_glm45_provider.py
- src/megatron/bridge/models/gpt_oss/gpt_oss_provider.py
- src/megatron/bridge/models/nemotron_vl/__init__.py
- src/megatron/bridge/models/llama_nemotron/__init__.py
- tests/unit_tests/models/llama/test_llama_provider.py
- tests/unit_tests/models/nemotron/test_nemotron_provider.py
- src/megatron/bridge/models/gpt_oss/__init__.py
- tests/unit_tests/models/gpt_oss/test_gpt_oss_provider.py
- src/megatron/bridge/models/llama/llama_provider.py
- src/megatron/bridge/models/deepseek/deepseek_provider.py
- tests/unit_tests/models/nemotronh/test_nemotron_h_provider.py
- src/megatron/bridge/models/nemotronh/__init__.py
- src/megatron/bridge/models/nemotron/nemotron_provider.py
- src/megatron/bridge/models/nemotronh/nemotron_h_provider.py
- tests/unit_tests/models/llama_nemotron/test_llama_nemotron_bridge.py
| add_bias_linear: bool = False | ||
| add_qkv_bias: bool = True | ||
| seq_length: int = 131072 | ||
| init_method_std: int = 0.02 |
Type annotation mismatch: init_method_std should be float, not int.
The default value 0.02 is a float, but the type hint declares int.
🐛 Proposed fix
- init_method_std: int = 0.02
+ init_method_std: float = 0.02
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/megatron/bridge/models/glm_vl/glm_45v_provider.py` at line 56, The type
hint for the variable init_method_std is incorrect: it is declared as int but
initialized with a float (0.02); update the annotation from int to float for
init_method_std in glm_45v_provider.py so the signature matches the default
value (leave the default 0.02 unchanged) — locate the init_method_std
declaration and change its type to float.
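A minimal sketch of why this mismatch goes unnoticed at runtime: Python dataclasses do not enforce annotations, so only a type checker (or a check like the one below) flags `int = 0.02`. `ToyProvider` and `mismatched_defaults` are hypothetical stand-ins for illustration, not code from glm_45v_provider.py.

```python
from dataclasses import MISSING, dataclass, fields


@dataclass
class ToyProvider:
    """Hypothetical stand-in; field names mirror the reviewed snippet."""

    add_bias_linear: bool = False
    add_qkv_bias: bool = True
    seq_length: int = 131072
    init_method_std: float = 0.02  # annotation matches the float default


def mismatched_defaults(cls):
    """Return names of fields whose default is not an instance of the annotated type."""
    bad = []
    for f in fields(cls):
        # Only plain class annotations with an explicit default are checked here.
        if f.default is MISSING or not isinstance(f.type, type):
            continue
        if not isinstance(f.default, f.type):
            bad.append(f.name)
    return bad


print(mismatched_defaults(ToyProvider))
```

With the original `int` annotation, the same helper would report `init_method_std`, while instantiating the dataclass would still succeed silently.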
| mamba_num_heads=256, | ||
| ffn_hidden_size=32768, | ||
| num_attention_heads=64, | ||
| attention_backend=AttnBackend.auto, |
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Get context around the attention_backend assignments and overrides
rg -n -B5 -A5 'attention_backend=AttnBackend\.auto|cfg\.model\.attention_backend = None' \
src/megatron/bridge/recipes/nemotronh/nemotronh.py | head -200
Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 6074
🏁 Script executed:
#!/bin/bash
# Get function signatures and full context around the assignments
# First, find which functions contain these lines
rg -n -B30 'attention_backend=AttnBackend\.auto' src/megatron/bridge/recipes/nemotronh/nemotronh.py | \
grep -E '(^[0-9]+-def |attention_backend=AttnBackend\.auto)' | head -20
Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 366
🏁 Script executed:
#!/bin/bash
# Check if MambaModelProvider has attention_backend parameter
rg -A20 'class MambaModelProvider|def __init__' src/megatron/bridge/models/mimo/mamba_provider.py | head -80
Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 160
🏁 Script executed:
#!/bin/bash
# Find MambaModelProvider definition
fd -t f '*.py' -x grep -l 'class MambaModelProvider' {}
Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 354
🏁 Script executed:
#!/bin/bash
# Search for attention_backend in MambaModelProvider
rg -l 'MambaModelProvider' --type py | head -10
Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 626
🏁 Script executed:
#!/bin/bash
# Check MambaModelProvider's __init__ signature
rg -A30 'class MambaModelProvider' src/megatron/bridge/models/mamba/mamba_provider.py
Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 1319
🏁 Script executed:
#!/bin/bash
# Check what parameters MambaModelProvider accepts
rg 'attention_backend' src/megatron/bridge/models/mamba/mamba_provider.py
Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 125
🏁 Script executed:
#!/bin/bash
# Check if Mamba actually uses attention_backend or if it's just a placeholder
# Look at the full context of why attention_backend is set to None
rg -B10 -A5 'cfg.model.attention_backend = None' src/megatron/bridge/recipes/nemotronh/nemotronh.py | \
grep -A15 'Kernel selections'
Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 5400
🏁 Script executed:
#!/bin/bash
# Check if Mamba is truly an attention-free architecture
rg -C5 'attention' src/megatron/bridge/models/mamba/mamba_provider.py | head -40
Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 1547
🏁 Script executed:
#!/bin/bash
# Check if there's any code between constructor call and the None assignment that might depend on attention_backend
# Get full function bodies for the three functions
rg -A100 'def nemotronh_56b_pretrain_config' src/megatron/bridge/recipes/nemotronh/nemotronh.py | \
head -120
Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 3447
🏁 Script executed:
#!/bin/bash
# Verify: does EVERY function that sets AttnBackend.auto also override with None?
# Count occurrences of each pattern
echo "=== Functions with AttnBackend.auto ==="
rg -B30 'attention_backend=AttnBackend\.auto' src/megatron/bridge/recipes/nemotronh/nemotronh.py | \
grep 'def ' | wc -l
echo "=== Functions with cfg.model.attention_backend = None ==="
rg -B50 'cfg\.model\.attention_backend = None' src/megatron/bridge/recipes/nemotronh/nemotronh.py | \
grep 'def ' | wc -l
echo "=== All function names that set AttnBackend.auto ==="
rg -B30 'attention_backend=AttnBackend\.auto' src/megatron/bridge/recipes/nemotronh/nemotronh.py | \
grep 'def ' | awk '{print $NF}'
Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 286
🏁 Script executed:
#!/bin/bash
# Check: is there any code between AttnBackend.auto and the None override that uses attention_backend?
rg -B1 -A70 'attention_backend=AttnBackend\.auto' src/megatron/bridge/recipes/nemotronh/nemotronh.py | \
grep -E '(attention_backend|^[0-9]+-.*cfg\.model\.attention_backend = None)' | head -20
Repository: NVIDIA-NeMo/Megatron-Bridge
Length of output: 319
Remove redundant attention_backend=AttnBackend.auto from MambaModelProvider constructors in all three 56B variants.
Lines 455, 982, and 1563 set attention_backend=AttnBackend.auto, but each function immediately overrides it with cfg.model.attention_backend = None (lines 524, 1032, and 1613). Since Mamba models do not use traditional attention, the explicit None assignment is the intended configuration; remove the constructor argument to avoid misleading maintainers about the actual backend choice.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/megatron/bridge/recipes/nemotronh/nemotronh.py` at line 455, Remove the
misleading explicit attention_backend=AttnBackend.auto argument from the
MambaModelProvider constructor calls in the three 56B variants (the places that
currently pass attention_backend=AttnBackend.auto) because each function later
sets cfg.model.attention_backend = None; edit those MambaModelProvider(...)
invocations to omit the attention_backend parameter so the intended None
configuration (used by Mamba models) is the effective setting and will not
confuse maintainers.
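The redundancy described above can be sketched with a toy dataclass. `ToyMambaProvider` is a hypothetical stand-in (its default for `attention_backend` is an assumption, not the real MambaModelProvider default), and the string "auto" stands in for `AttnBackend.auto`. The point is only that a later assignment makes the constructor argument irrelevant.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyMambaProvider:
    """Hypothetical stand-in for MambaModelProvider."""

    num_attention_heads: int = 64
    attention_backend: Optional[str] = "auto"  # assumed default, for illustration only


def build_with_redundant_kwarg() -> ToyMambaProvider:
    # Mirrors the reviewed recipes: the kwarg is passed, then overwritten.
    model = ToyMambaProvider(num_attention_heads=64, attention_backend="auto")
    model.attention_backend = None  # the later override is what takes effect
    return model


def build_without_kwarg() -> ToyMambaProvider:
    # Suggested shape: omit the kwarg and keep only the explicit override.
    model = ToyMambaProvider(num_attention_heads=64)
    model.attention_backend = None
    return model


# Both builders yield the same configuration, so the kwarg carries no information.
print(build_with_redundant_kwarg() == build_without_kwarg())
```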
|
/ok to test c3afb89 |
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
- Remove NemotronNano12Bv2Provider from nemotron_vl/__init__.py (was a deprecated alias from deleted nemotron_h_provider.py)
- Remove invalid max_position_embeddings kwarg from kimi and moonlight recipes (not a field on MLAModelProvider)
- Update moonlight test to monkeypatch MLAModelProvider instead of deleted MoonlightModelProvider16B
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor
DeepSeekV2Bridge and DeepSeekV3Bridge were defined but never imported in the parent models/__init__.py, so their @register_bridge decorators never fired and DeepseekV3ForCausalLM was not a supported architecture. Signed-off-by: Yu Yao <yaoyu.094@gmail.com> Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Made-with: Cursor
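The failure mode in this commit message (a registration decorator that never runs because the defining module is never imported) can be sketched as follows. `REGISTRY`, this simplified `register_bridge`, and `ToyDeepSeekV3Bridge` are hypothetical; the real decorator is a method on MegatronModelBridge with its own signature. A function call stands in for importing the module.

```python
# Hypothetical, simplified registry; not the real MegatronModelBridge dispatch.
REGISTRY = {}


def register_bridge(source):
    """Return a class decorator that records the class under an architecture name."""

    def decorator(cls):
        REGISTRY[source] = cls
        return cls

    return decorator


def import_deepseek_module():
    """Stands in for `import ...deepseek`: the class body only executes here."""

    @register_bridge(source="DeepseekV3ForCausalLM")
    class ToyDeepSeekV3Bridge:
        pass

    return ToyDeepSeekV3Bridge


# Defining the module is not enough; registration is a side effect of executing it.
registered_before = "DeepseekV3ForCausalLM" in REGISTRY
import_deepseek_module()
registered_after = "DeepseekV3ForCausalLM" in REGISTRY
print(registered_before, registered_after)
```

This is why adding the import to the parent package's `__init__.py` is sufficient to make the architecture discoverable.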
Move yarn_* fields (yarn_rotary_scaling_factor, yarn_original_max_position_embeddings, yarn_beta_fast, yarn_beta_slow, yarn_mscale, yarn_mscale_all_dim, yarn_correction_range_round_to_int) from dynamic setattr injection to proper dataclass fields on GPTModelProvider. This enables models like llama, qwen, gpt-oss, and ministral3 to toggle YARN configs through the standard provider without needing model-specific provider subclasses.
- Remove GPTOSSProvider import and re-wrapping hack in gpt_oss_bridge.py
- Pass YARN params directly as provider kwargs in model_bridge.py instead of stashing in _yarn_params dict for later setattr
- Fix megatron_to_hf_config to check value is not None instead of hasattr (since fields always exist now but default to None)
Signed-off-by: Yu Yao <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor
yarn_* fields are now proper dataclass fields on GPTModelProvider defaulting to None, so None means 'unset' and should be omitted from exported rope_scaling dict rather than preserved. Made-with: Cursor Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
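A rough sketch of the `hasattr` versus `is not None` distinction from these two commits. `ToyGPTProvider`, `EXPORT_KEYS`, and `export_rope_scaling` are hypothetical names, and the mapping to exported keys is an assumption for illustration; only the yarn_* field names come from the commit message.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyGPTProvider:
    """Hypothetical stand-in: yarn_* fields always exist and default to None."""

    yarn_rotary_scaling_factor: Optional[float] = None
    yarn_original_max_position_embeddings: Optional[int] = None
    yarn_beta_fast: Optional[float] = None
    yarn_beta_slow: Optional[float] = None


# Assumed mapping from provider field to exported rope_scaling key.
EXPORT_KEYS = {
    "yarn_rotary_scaling_factor": "factor",
    "yarn_original_max_position_embeddings": "original_max_position_embeddings",
    "yarn_beta_fast": "beta_fast",
    "yarn_beta_slow": "beta_slow",
}


def export_rope_scaling(provider: ToyGPTProvider) -> dict:
    """Export only the yarn_* values that were actually set."""
    exported = {}
    for field_name, key in EXPORT_KEYS.items():
        value = getattr(provider, field_name)
        # hasattr() would be True for every field now, so test the value instead.
        if value is not None:
            exported[key] = value
    return exported


print(export_rope_scaling(ToyGPTProvider(yarn_rotary_scaling_factor=40.0)))
```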
…resets Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Made-with: Cursor
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Made-with: Cursor
…n NemotronNano12Bv2VLModelProvider
After PR #2628 renamed the field from hybrid_override_pattern to hybrid_layer_pattern in MambaModelProvider, NemotronNano12Bv2VLModelProvider still used the old name as a class default. When finalize() was called, both the class default (hybrid_override_pattern) and the bridge CONFIG_MAPPING-set value (hybrid_layer_pattern) were non-None, triggering an assertion error.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
…rovider
finalize() derives num_layers from hybrid_layer_pattern; having both a class-level default num_layers=62 and a user-supplied custom pattern caused finalize() to assert 62==N(pattern) and fail.
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
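The conflict described in this commit can be sketched with a toy provider. `ToyHybridProvider` is hypothetical, and counting one layer per pattern character is an assumption (it is consistent with the later commit that pairs "M*E" with 3 layers); the real finalize() logic may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyHybridProvider:
    """Hypothetical stand-in for a hybrid (Mamba-style) provider."""

    hybrid_layer_pattern: Optional[str] = None
    num_layers: Optional[int] = None

    def finalize(self) -> None:
        if self.hybrid_layer_pattern is None:
            return
        # Assumption: one pattern character corresponds to one layer.
        derived = len(self.hybrid_layer_pattern)
        if self.num_layers is not None:
            assert self.num_layers == derived, (
                f"num_layers={self.num_layers} conflicts with pattern length {derived}"
            )
        self.num_layers = derived


# Leaving num_layers unset lets the pattern decide.
ok = ToyHybridProvider(hybrid_layer_pattern="M*E")
ok.finalize()
print(ok.num_layers)

# A fixed default of 62 combined with a custom pattern trips the assertion.
try:
    ToyHybridProvider(hybrid_layer_pattern="M*E", num_layers=62).finalize()
    conflict_detected = False
except AssertionError:
    conflict_detected = True
print(conflict_detected)
```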
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
…d_layer_pattern)
- Replace removed Llama3ModelProvider with GPTModelProvider in test_local_checkpointing.py (added from main after provider removal)
- Rename deprecated hybrid_override_pattern to hybrid_layer_pattern in all nemotronh recipes (nemotronh.py, nemotron_nano_v2.py, nemotron_3_nano.py) to match PR #2628 field rename in MambaModelProvider
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor
…w fixes
- Add num_layers=3 to MEGATRON_NEMOTRON_3_NANO_OVERRIDES so it matches the toy hybrid_layer_pattern="M*E" (3 layers vs recipe's 52)
- Fix init_method_std type hint from int to float in glm_45v_provider.py
- Remove redundant attention_backend=AttnBackend.auto from 56B nemotronh recipe constructors (each function overrides it to None later)
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor
Signed-off-by: ruit <ruit@nvidia.com>
The base provider_bridge only forced RoPE when rope_type was missing, so HF configs with rope_type="llama3" (or "default", "longrope", …) fell back to the provider default learned_absolute. Fix by setting position_embedding_type="rope" for every rope_type except "yarn" (which is already handled by hf_config_to_provider_kwargs). Also align the key lookup to check both "type" and "rope_type" keys consistently. Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Made-with: Cursor
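The selection rule in this commit message might be sketched as below. `position_embedding_kwargs` is a hypothetical helper operating on a plain dict, not the real provider_bridge; it only illustrates checking both keys and treating every non-YARN case as RoPE.

```python
def position_embedding_kwargs(hf_config: dict) -> dict:
    """Return provider kwargs for the position embedding type (toy version)."""
    rope_scaling = hf_config.get("rope_scaling") or {}
    # Check both spellings of the key consistently.
    rope_type = rope_scaling.get("rope_type") or rope_scaling.get("type")
    if rope_type == "yarn":
        # Per the commit message, YARN is handled elsewhere, so nothing is set here.
        return {}
    # Missing, "default", "llama3", "longrope", ... all use RoPE.
    return {"position_embedding_type": "rope"}


for cfg in (
    {},
    {"rope_scaling": {"rope_type": "llama3"}},
    {"rope_scaling": {"type": "yarn"}},
):
    print(cfg, "->", position_embedding_kwargs(cfg))
```

Under the previous behavior described above, the second config would have fallen through to the provider default of learned_absolute.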
…xport
Add the missing KimiK2Bridge implementation for Kimi K2 model support and fix the kimi __init__.py to export KimiK2Bridge (matching the actual class name) instead of the non-existent KimiBridge.
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor
|
/ok to test c7aa37a |
The base class provider_bridge now sets position_embedding_type="rope" for all non-YARN models, but Mamba models don't use position embeddings. Override it back to "none" in NemotronHBridge. Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Made-with: Cursor
|
/ok to test 827d325 |
The test_provider_bridge_mamba_config test was checking hybrid_layer_pattern without calling finalize(), but this field is only populated during finalize() from the deprecated hybrid_override_pattern. Signed-off-by: Yu Yao <yaoyu.094@gmail.com> Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Made-with: Cursor
|
/ok to test f932780 |
|
/ok to test c72bb59 |
Summary
Remove size-specific model provider dataclasses that are no longer needed now that bridges dynamically configure model architectures via GPTModelProvider, MLAModelProvider, and MambaModelProvider.
Deleted provider files (7):
- llama_provider.py: Llama{2,3,31,32,4}ModelProvider* (26 classes)
- nemotron_provider.py: Nemotron{3,4}ModelProvider* (8 classes)
- gpt_oss_provider.py: GPTOSSModelProvider* (4 classes)
- deepseek_provider.py: DeepSeek{V2,V3}ModelProvider* (4 classes)
- kimi_provider.py: KimiK2Provider
- glm45_provider.py: GLM45{Air,Plus}ModelProvider* (5 classes)
- nemotron_h_provider.py: NemotronH{,Nano}ModelProvider* (6 classes)
Refactored providers with cross-dependencies (3):
- llama_nemotron_provider.py: now inherits directly from GPTModelProvider
- glm_45v_provider.py: now inherits directly from GPTModelProvider (inlined GLM MoE defaults)
- nemotron_vl_provider.py: now inherits directly from MambaModelProvider (inlined NemotronH defaults)
Updated recipes to use base providers (GPTModelProvider, MLAModelProvider, MambaModelProvider) with inlined architecture parameters instead of importing deleted specific providers.
Cleaned up tests:
Net: -1,947 lines (~3,527 deleted, ~1,580 added for inlined parameters)
Test plan
GPTModelProvider parameters
Made with Cursor
Summary by CodeRabbit
Release Notes
Refactor
Chores