[model] feat: add GLM-5 (MoE + MLA + DSA) bridge and provider #2913
Merged
Introduce FusedExpertMapping and FusedGatedExpertMapping in param_mapping.py to handle many-to-one / one-to-many expert weight conversions generically. This eliminates duplicated maybe_modify_converted_hf_weight overrides and hf_weights_cache from GPT-OSS, GLM-4.5, GLM-4.5V, and Qwen3-VL bridges (-502 / +307 lines).

Also fixes two pre-existing bugs:
- GLM-4.5 MTP mappings used stale 'transformer_layer' instead of 'mtp_model_layer', causing missing-mapping warnings
- hf_to_megatron_generate_text.py set mtp_num_layers=None, which crashed MTP-enabled models; replaced with m.mtp_process=False

Signed-off-by: Yu Yao <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor
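The many-to-one / one-to-many conversion this commit describes can be illustrated with a minimal pure-Python sketch. It is not the FusedGatedExpertMapping implementation: the function names, the `experts.{e}.gate_proj.weight` key pattern, and the layout (gate rows followed by up rows) are assumptions made only for illustration.

```python
from typing import Dict, List

Matrix = List[List[float]]  # one expert's 2D weight, stored as rows


def fuse_gated_experts(per_expert: Dict[str, Matrix], num_experts: int) -> List[Matrix]:
    """Many-to-one: build a 3D [expert, 2 * rows, cols] tensor from per-expert weights."""
    fused = []
    for e in range(num_experts):
        gate = per_expert[f"experts.{e}.gate_proj.weight"]  # hypothetical key
        up = per_expert[f"experts.{e}.up_proj.weight"]  # hypothetical key
        # Assumed layout: gate rows first, then up rows.
        fused.append([row[:] for row in gate] + [row[:] for row in up])
    return fused


def split_gated_experts(fused: List[Matrix]) -> Dict[str, Matrix]:
    """One-to-many: recover per-expert gate/up weights from the fused tensor."""
    per_expert = {}
    for e, gate_up in enumerate(fused):
        half = len(gate_up) // 2
        per_expert[f"experts.{e}.gate_proj.weight"] = gate_up[:half]
        per_expert[f"experts.{e}.up_proj.weight"] = gate_up[half:]
    return per_expert


# Two toy experts, each with 1x2 gate and up projections.
weights = {
    "experts.0.gate_proj.weight": [[1.0, 2.0]],
    "experts.0.up_proj.weight": [[3.0, 4.0]],
    "experts.1.gate_proj.weight": [[5.0, 6.0]],
    "experts.1.up_proj.weight": [[7.0, 8.0]],
}
fused = fuse_gated_experts(weights, num_experts=2)
roundtrip = split_gated_experts(fused)
```

A round trip through both functions should return the original dictionary, which is the property a bidirectional mapping needs to preserve.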
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
- Remove NemotronNano12Bv2Provider from nemotron_vl/__init__.py (was a deprecated alias from deleted nemotron_h_provider.py)
- Remove invalid max_position_embeddings kwarg from kimi and moonlight recipes (not a field on MLAModelProvider)
- Update moonlight test to monkeypatch MLAModelProvider instead of deleted MoonlightModelProvider16B

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor
Add Megatron Bridge for MiniMaxAI/MiniMax-M2, a sparse MoE model with 256 experts, top-8 sigmoid routing, and expert bias correction.

Includes:
- Bridge with config mapping and per-expert weight conversion (block_sparse_moe prefix, w1/w2/w3 format)
- Partial RoPE support (rotary_dim -> rotary_percent)
- QK layernorm intentionally disabled (full-dim vs per-head mismatch)
- Functional test with toy model for TP/PP/EP parallelism
- Example scripts for conversion, inference, and verification
- compare.py fix: truncate Megatron logits to HF vocab size for proper comparison when Megatron pads vocab for kernel efficiency

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor
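Two of the items in this commit reduce to simple arithmetic, sketched below. The helper names and the toy numbers are hypothetical, and the padding rule (round up to a multiple of a divisor times the TP size) is an assumption about how the padded vocab arises, not a quote from compare.py.

```python
import math
from typing import List


def rotary_percent(rotary_dim: int, head_dim: int) -> float:
    """Fraction of each attention head that receives rotary embeddings."""
    return rotary_dim / head_dim


def padded_vocab_size(vocab_size: int, divisible_by: int, tp_size: int) -> int:
    """Round the vocab up to a multiple of divisible_by * tp_size (assumed rule)."""
    multiple = divisible_by * tp_size
    return math.ceil(vocab_size / multiple) * multiple


def truncate_to_hf_vocab(logits: List[float], hf_vocab_size: int) -> List[float]:
    """Drop the padded tail so both logit vectors cover the same token ids."""
    return logits[:hf_vocab_size]


# Toy values, not the real MiniMax-M2 sizes.
hf_vocab = 1000
megatron_vocab = padded_vocab_size(hf_vocab, divisible_by=128, tp_size=2)
megatron_logits = [0.0] * megatron_vocab
comparable = truncate_to_hf_vocab(megatron_logits, hf_vocab)
```

Without the truncation, an element-wise comparison would either fail on the shape mismatch or count the padding positions as differences.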
…ti-node support for MiniMax-M2

Add custom full-dimension QK normalization (minimax_m2_provider.py) since MiniMax-M2 applies RMSNorm over the entire Q/K projection rather than per-head. The implementation uses sum-of-squares all-reduce across TP ranks and provides sharded_state_dict for distributed checkpointing.

Add on-the-fly FP8 block-wise dequantization in the bridge via maybe_modify_loaded_hf_weight, converting float8_e4m3fn weights to bfloat16 using per-block scale_inv factors during HF->Megatron conversion.

Add multi-node Slurm scripts (slurm_conversion.sh, slurm_inference.sh) for configurations requiring TP*EP*PP > 8 GPUs.

Update verify_toy_model.py to extract real pretrained weights (N layers) from the FP8 model, dequantize to bf16, and verify round-trip accuracy.

Fix dtype mismatch handling in hf_megatron_roundtrip_multi_gpu.py for FP8 source models.

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor
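The sum-of-squares all-reduce mentioned in this commit can be sketched without any distributed runtime. In the sketch below a plain Python `sum` over per-rank values stands in for the all-reduce, and all function names are hypothetical; it only shows why reducing local sums of squares reproduces a full-dimension RMSNorm when the projection is split across TP ranks.

```python
import math
from typing import List


def rmsnorm_full(x: List[float], weight: List[float], eps: float = 1e-6) -> List[float]:
    """Reference: RMSNorm over the whole projection on a single device."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def rmsnorm_tp_sharded(
    x_shards: List[List[float]], weight_shards: List[List[float]], eps: float = 1e-6
) -> List[List[float]]:
    """Same norm when each TP rank only holds a slice of the projection."""
    # Each rank computes the sum of squares of its local slice.
    local_sq = [sum(v * v for v in shard) for shard in x_shards]
    # Stand-in for an all-reduce(SUM) across the TP group.
    global_sq = sum(local_sq)
    full_dim = sum(len(shard) for shard in x_shards)
    inv_rms = 1.0 / math.sqrt(global_sq / full_dim + eps)
    # Each rank scales its own slice with its own slice of the norm weight.
    return [
        [v * inv_rms * w for v, w in zip(shard, w_shard)]
        for shard, w_shard in zip(x_shards, weight_shards)
    ]


x = [1.0, -2.0, 3.0, 0.5]
w = [1.0, 1.0, 0.5, 2.0]
reference = rmsnorm_full(x, w)
sharded = rmsnorm_tp_sharded([x[:2], x[2:]], [w[:2], w[2:]])
flat = [v for shard in sharded for v in shard]
```

Only one scalar per token needs to cross ranks, which is what makes the full-dimension norm affordable under tensor parallelism.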
…x-M2 expert mappings

- Add missing FusedGatedExpertMapping alias (GLMExpertGateUpProjMapping) to glm_moe_mappings.py, fixing ImportError after fused expert refactor
- Remove duplicate local_experts.* mappings from MiniMax-M2 bridge since moe_grouped_gemm=True (only grouped-gemm weight* path needed)

Verified: TP=2, PP=2, EP=2 roundtrip tests pass on cluster with zero mapping warnings.

Signed-off-by: Yu Yao <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor
Remove verify_toy_model.py dev script. Align conversion.sh and inference.sh with GPT-OSS pattern (import + export + roundtrip, multi-checkpoint inference). Rewrite slurm_conversion.sh to sweep parallelism configs (TP,PP,EP) with roundtrip validation. Clean up slurm_inference.sh for consistency. All configs verified on cluster-cw with toy model: TP=2,PP=1,EP=4 | TP=1,PP=2,EP=4 | TP=2,PP=2,EP=2 → EXIT=0 Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Made-with: Cursor
…eanup

- Add SLURM env var auto-population in model_provider.py for srun launches (RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR, MASTER_PORT from SLURM vars)
- Increase NCCL init_process_group timeout to 60 minutes for large MoE models
- Fix ImportError crash in save_artifacts for trust_remote_code models
- Accept SLURM_NTASKS in hf_megatron_roundtrip_multi_gpu.py for srun launches
- Rewrite MiniMax-M2 slurm scripts to use srun-native (ntasks-per-node=8) instead of torch.distributed.run
- Remove single-node conversion.sh/inference.sh (MiniMax-M2 requires multi-node)
- Set verified parallelism defaults: TP=2,EP=8 roundtrip; TP=1,EP=16 inference

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor
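The SLURM auto-population in the first bullet can be sketched as follows. This is not the code in model_provider.py: the function name is hypothetical, and `master_addr` is passed in explicitly because resolving the first hostname of a job normally requires an external `scontrol show hostnames` call, which the sketch leaves out. `SLURM_PROCID`, `SLURM_NTASKS`, and `SLURM_LOCALID` are the standard per-task variables set by srun.

```python
from typing import Dict, MutableMapping, Optional

# torch.distributed variable -> SLURM variable that srun sets for each task
_SLURM_EQUIVALENTS: Dict[str, str] = {
    "RANK": "SLURM_PROCID",
    "WORLD_SIZE": "SLURM_NTASKS",
    "LOCAL_RANK": "SLURM_LOCALID",
}


def populate_dist_env_from_slurm(
    env: MutableMapping[str, str],
    master_addr: Optional[str] = None,
    master_port: str = "29500",
) -> MutableMapping[str, str]:
    """Fill torch.distributed variables from SLURM ones without overwriting."""
    for torch_var, slurm_var in _SLURM_EQUIVALENTS.items():
        if torch_var not in env and slurm_var in env:
            env[torch_var] = env[slurm_var]
    if master_addr is not None:
        # Assumption: the caller has already resolved the rendezvous host.
        env.setdefault("MASTER_ADDR", master_addr)
        env.setdefault("MASTER_PORT", master_port)
    return env


example = populate_dist_env_from_slurm(
    {"SLURM_PROCID": "3", "SLURM_NTASKS": "16", "SLURM_LOCALID": "3"},
    master_addr="node001",
)
preset = populate_dist_env_from_slurm({"RANK": "0", "SLURM_PROCID": "5"})
```

In a real launch the function would be called on `os.environ` before `init_process_group`; values that a launcher such as torch.distributed.run has already set are left untouched.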
…or PR #2628) Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Made-with: Cursor
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
…mappings The refactor in param_mapping.py renamed GLMExpertGateUpProjMapping to FusedGatedExpertMapping but only added GLMExpertDownProjMapping alias in glm_moe_mappings.py. Add the missing alias so existing bridge imports (glm45_bridge.py, glm_45v_bridge.py) continue to work. Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Split multi-name import block into two separate import statements, each with per-line # noqa: F401 comments, to satisfy ruff's import block formatting requirements. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
…ext tests

- Set PROVIDER_CLASS = Qwen3NextModelProvider so super().provider_bridge() instantiates the correct provider (not GPTModelProvider which lacks MLA/hybrid fields like q_lora_rank)
- Add value is not None guard in hf_config_to_provider_kwargs to skip None-valued config fields
- Add null_attr fixture loop in test mocks to suppress Mock() objects for MLA/alternative-expert CONFIG_MAPPING fields

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
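The `value is not None` guard from the second bullet can be illustrated with a small stand-alone sketch. The function below is a hypothetical simplification, not hf_config_to_provider_kwargs itself, and the field names in the mapping are chosen only for the example.

```python
from types import SimpleNamespace
from typing import Any, Dict, Iterable, Tuple


def config_to_provider_kwargs(
    config: Any, mapping: Iterable[Tuple[str, str]]
) -> Dict[str, Any]:
    """Copy mapped config fields into provider kwargs, skipping None values."""
    kwargs: Dict[str, Any] = {}
    for hf_name, provider_name in mapping:
        value = getattr(config, hf_name, None)
        # `is not None` (rather than truthiness) keeps legitimate 0 / False values.
        if value is not None:
            kwargs[provider_name] = value
    return kwargs


# Hypothetical mapping and config for illustration.
mapping = [
    ("hidden_size", "hidden_size"),
    ("q_lora_rank", "q_lora_rank"),
    ("attention_bias", "add_qkv_bias"),
]
config = SimpleNamespace(hidden_size=2048, q_lora_rank=None, attention_bias=False)
kwargs = config_to_provider_kwargs(config, mapping)
```

Skipping None means a provider dataclass keeps its own default for fields the HF config leaves unset, instead of receiving an explicit None.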
Keep once-per-class dtype mismatch warning from HEAD (suppresses duplicate warnings) over main's per-call version. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
… inference
- gpt_oss_bridge: transpose only down_proj (not gate_up_proj) in
maybe_modify_loaded_hf_weight; use _align_expert_weight_to_shape in
GPTOSSMLPGateUpProjMapping.hf_to_megatron for auto-detection of
transposed vs standard expert weight layout across transformers versions
- glm45_bridge: loop over both mtp_model_layer and transformer_layer
prefixes when building MTP param mappings, fixing roundtrip for MTP layers
- generate_text: set mtp_num_layers=0 (not None) to make range(0) a
safe no-op instead of crashing with range(None) for MTP models
Verified on cluster: GPT-OSS BF16 roundtrip ✅, GPT-OSS MXFP4 inference ✅,
GLM-4.5-Air roundtrip ✅, GLM-4.5-Air inference ✅ ("Paris is the capital of France.")
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
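The `range(None)` failure mentioned in the generate_text item is plain Python behavior and can be reproduced in isolation. The helper below is hypothetical and only mimics a loop over MTP layers.

```python
from typing import List, Optional


def mtp_layer_names(mtp_num_layers: Optional[int]) -> List[str]:
    """Enumerate MTP layer names; range() requires an integer argument."""
    return [f"mtp.layers.{i}" for i in range(mtp_num_layers)]


# With 0 the loop body never runs, so the result is an empty list.
no_mtp = mtp_layer_names(0)

# With None, range() raises TypeError before the loop starts.
try:
    mtp_layer_names(None)
    crashed = False
except TypeError:
    crashed = True
```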
- Add `transpose_on_export` flag to `FusedExpertMapping` instead of shape-based transpose detection (avoids silent bugs with square tensors)
- Make `GLMExpertDownProjMapping` a proper subclass with `transpose_on_export=True` always set
- Add `transpose_on_export=True` to down_proj `FusedExpertMapping` instances in Qwen3-VL and Qwen3.5-VL bridges
- Restore inline comments removed from `stream_weights_megatron_to_hf`
- Refactor MTP-disable logic in `hf_to_megatron_generate_text.py` to use a `_disable_mtp` helper (matching vlm script pattern); fix `mtp_num_layers=0` → `None`

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Resolve conflict in qwen35_vl_bridge.py: keep PR's FusedExpertMapping / FusedGatedExpertMapping approach (removing hf_weights_cache and maybe_modify_converted_hf_weight) rather than main's ExpertMLPDownProjMapping / ExpertMLPGateUpProjMapping pattern. The class now inherits from MegatronModelBridge directly (main's change, kept). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
…train The μP functional test (eb93f96) referenced Llama32ModelProvider1B but never defined or imported it, causing an F821 lint failure. Add a local @dataclass subclass of GPTModelProvider with Llama 3.2 1B architecture defaults. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
…rt weight alignment

- generate_text: extract _disable_mtp() helper (mirrors generate_vlm.py pattern) with mtp_num_layers=0 to avoid range(None) crash on MTP-enabled text models
- param_mapping: add transpose_hint param to _align_expert_weight_to_shape (True/False for explicit control, None for auto-detect with square-shape guard) as suggested by reviewer; raise clear error when auto-detect encounters ambiguous square 2D weights

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
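The three-state `transpose_hint` described in the param_mapping item can be sketched on plain nested lists. This is an illustrative reimplementation under assumptions, not `_align_expert_weight_to_shape` itself: the function name, the error messages, and the final shape check are hypothetical.

```python
from typing import List, Optional, Tuple

Matrix = List[List[float]]


def _transpose(weight: Matrix) -> Matrix:
    return [list(col) for col in zip(*weight)]


def align_expert_weight(
    weight: Matrix, target_shape: Tuple[int, int], transpose_hint: Optional[bool] = None
) -> Matrix:
    """Return `weight` in `target_shape`, transposing if told to or if detected."""
    shape = (len(weight), len(weight[0]))
    if transpose_hint is None:
        # Auto-detect: a square matrix looks identical either way, so refuse.
        if shape[0] == shape[1]:
            raise ValueError("square 2D weight is ambiguous; pass transpose_hint")
        transpose_hint = shape != target_shape
    aligned = _transpose(weight) if transpose_hint else weight
    aligned_shape = (len(aligned), len(aligned[0]))
    if aligned_shape != target_shape:
        raise ValueError(f"cannot align shape {shape} to {target_shape}")
    return aligned


rect = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # shape (2, 3)
auto_transposed = align_expert_weight(rect, (3, 2))
auto_unchanged = align_expert_weight(rect, (2, 3))

square = [[1.0, 2.0], [3.0, 4.0]]
explicit = align_expert_weight(square, (2, 2), transpose_hint=True)
try:
    align_expert_weight(square, (2, 2))
    ambiguous_rejected = False
except ValueError:
    ambiguous_rejected = True
```

The square case is the reason for the explicit hint: shape comparison cannot tell a transposed square matrix from an untransposed one, so guessing would corrupt weights silently.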
- param_mapping.py: inline raise ValueError to satisfy line-length formatting
- gpt_oss_bridge.py: remove unused GPTOSSProvider import (F401)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Merge Llama32ModelProvider1B definitions: take main's more complete set of architecture defaults (activation_func, position_embedding_type, bias/fusion flags, rotary_base) and add HEAD-only fields (kv_channels, rope_scaling, rope_scaling_factor) for a full Llama 3.2 1B config. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
…ngs' into yuya/add-minimax-m2-bridge
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
- model_bridge.py: take remote's simpler provider_kwargs guard (value is not None) and unconditional transpose_on_export path
- hf_to_megatron_generate_text.py: take remote's _disable_mtp that uses mtp_num_layers=0 and sets grad_scale_func=None in one place
- Sync hf_megatron_roundtrip_multi_gpu.py and qwen3_vl transformer_config from remote

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Bring submodule pointer in line with origin/main (905c0e38). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
…n merge The file was removed in yuya/refactor-fused-expert-mappings but main still has it. Restore from origin/main to keep this PR scoped to MiniMax-M2 bridge only. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Contributor
Author
/claude review
def _create_glm5_toy_model(model_dir: Path) -> None:
    model_dir.mkdir(parents=True, exist_ok=True)

    # Create GLM 4.5 config from the toy model config using AutoConfig
Contributor
Typo: this is GLM-5, not GLM 4.5.
Suggested change
-    # Create GLM 4.5 config from the toy model config using AutoConfig
+    # Create GLM-5 config from the toy model config using AutoConfig
Comment on lines +201 to +207
src_dir = getattr(self, "model_name_or_path", None)
if src_dir is not None:
    src_file = Path(str(src_dir)) / f"{name}.json"
    if src_file.exists():
        shutil.copy(src_file, Path(save_path) / f"{name}.json")
    else:
        logger.warning(f"Source file {src_file} not found; skipping {name}.")
Contributor
Nit: if model_name_or_path is None (e.g. model was constructed in-memory), the fallback silently does nothing after logging the initial warning about save_pretrained failing. Adding an else branch with a warning would make debugging easier:
Suggested change
  src_dir = getattr(self, "model_name_or_path", None)
  if src_dir is not None:
      src_file = Path(str(src_dir)) / f"{name}.json"
      if src_file.exists():
          shutil.copy(src_file, Path(save_path) / f"{name}.json")
      else:
          logger.warning(f"Source file {src_file} not found; skipping {name}.")
+ else:
+     logger.warning(f"No model_name_or_path set; cannot fall back to file copy for {name}.")
- Revert try/except fallback in hf_pretrained/base.py (not needed in base)
- Refactor provider_bridge to use super() matching DeepSeek V3 pattern
- Combine param_mappings and layer_specific_mappings into single dict
- Remove example scripts (glm5_converted_inference, glm5_orig_inference, glm5_roundtrip)
- Simplify slurm_conversion.sh to match MiniMax-M2 pattern

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Contributor
Author
/ok to test a02455b
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Made-with: Cursor Signed-off-by: Yu Yao <yaoyu.094@gmail.com>
Contributor
Author
/ok to test 5124c61
yaoyu-33 commented Apr 16, 2026
…ax duplicates Move GLM-5 test from models/ to test_groups/models/glm_moe_dsa/ following CI conventions (hardcoded /opt/Megatron-Bridge, standard coverage, autoconfig roundtrip). Add L0 launch script to launch_scripts/active/. Remove duplicate minimax_m2 files (models/minimax_m2/ and root-level launch script) since test_groups/ already has the canonical versions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Contributor
Author
/ok to test 4be9445
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Contributor
Author
/ok to test 867aaf7
…arily" This reverts commit 867aaf7.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rash GLM-5 HF config has num_nextn_predict_layers=1, which creates MTP layers in the megatron model. The bridge has no MTP weight mappings for GLM-5, causing NoneType crashes during weight transfer. Disable MTP for now (not needed for inference) and add defensive None checks in model_bridge. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
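The defensive None check described in this commit can be sketched as a planning step that skips parameters with no mapping instead of dereferencing None. The function and the parameter names below are hypothetical and are not taken from model_bridge.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def plan_weight_transfer(
    param_names: Iterable[str],
    resolve_mapping: Callable[[str], Optional[str]],
) -> Tuple[Dict[str, str], List[str]]:
    """Pair each Megatron param with its HF name; collect unmapped params."""
    planned: Dict[str, str] = {}
    skipped: List[str] = []
    for name in param_names:
        hf_name = resolve_mapping(name)
        if hf_name is None:
            # No mapping registered (e.g. an MTP layer): skip rather than crash.
            skipped.append(name)
            continue
        planned[name] = hf_name
    return planned, skipped


# Hypothetical names: one mapped decoder weight and one unmapped MTP weight.
known = {
    "decoder.layers.0.mlp.linear_fc1.weight": "model.layers.0.mlp.gate_up_proj.weight",
}
planned, skipped = plan_weight_transfer(
    ["decoder.layers.0.mlp.linear_fc1.weight", "mtp.layers.0.enorm.weight"],
    known.get,
)
```

Collecting the skipped names makes it possible to log them once, so a missing mapping shows up as a warning rather than a NoneType attribute error mid-transfer.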
Contributor
Author
/ok to test b40478c
Contributor
@yaoyu-33 do we have a plan to use better DSA kernels, such as the tilelang or cuteDSL versions, or is there a plan for a TE kernel?
Contributor
Contributor
@ISEEKYAN it's on our roadmap for TE; I should add a public issue to track it.
vasunvidia pushed a commit to vasunvidia/Megatron-Bridge that referenced this pull request Jun 10, 2026
…-NeMo#2913) Signed-off-by: Yu Yao <yaoyu.094@gmail.com> Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
What does this PR do?
Takes over #2469 and rebases onto latest main.
Adds support for the GLM-5 model (MoE + MLA + DSA architecture).
Changes
- GLM5Bridge: bidirectional HF ↔ Megatron-Core checkpoint conversion for GlmMoeDsaForCausalLM
- GLM5ModelProvider: config/arch mapping for the glm_moe_dsa architecture
- maybe_modify_loaded_hf_weight / maybe_modify_converted_hf_weight to handle fused expert tensors (3D) introduced in transformers>=5.2.0
- GLM5Bridge and GLM5ModelProvider in the glm_moe_dsa package

Related
Closes #2343
Takes over #2469