
[model] feat: add GLM-5 (MoE + MLA + DSA) bridge and provider #2913

Merged
yaoyu-33 merged 56 commits into main from yuya-add-glm5 on Apr 21, 2026

Conversation

yaoyu-33 commented Mar 20, 2026

Contributor

What does this PR do?

Takes over #2469 and rebases onto latest main.

Adds support for the GLM-5 model (MoE + MLA + DSA architecture).

Changes

  • GLM5Bridge: bidirectional HF ↔ Megatron-Core checkpoint conversion for GlmMoeDsaForCausalLM
  • GLM5ModelProvider: config/arch mapping for the glm_moe_dsa architecture
  • Overrides maybe_modify_loaded_hf_weight / maybe_modify_converted_hf_weight to handle fused expert tensors (3D) introduced in transformers>=5.2.0
  • Exposes GLM5Bridge and GLM5ModelProvider in the glm_moe_dsa package
  • Functional tests for GLM-5 conversion across parallelism configs
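
The fused expert handling listed above can be pictured with a small sketch. It is not the bridge's code: it assumes a fused gate/up tensor laid out as [num_experts, 2 * intermediate, hidden] with the gate rows first, which may not match the actual transformers layout, and it uses nested lists in place of tensors. The names `split_fused_gate_up` and `fuse_gate_up` are hypothetical.

```python
def split_fused_gate_up(fused, intermediate_size):
    """Split a fused [experts][2*intermediate][hidden] nested list into (gate, up) pairs.

    Assumption: gate rows come first, then up rows, inside each expert slice.
    """
    per_expert = []
    for expert_rows in fused:
        if len(expert_rows) != 2 * intermediate_size:
            raise ValueError("expert slice does not hold 2 * intermediate_size rows")
        gate = expert_rows[:intermediate_size]
        up = expert_rows[intermediate_size:]
        per_expert.append((gate, up))
    return per_expert


def fuse_gate_up(per_expert):
    """Inverse direction: stack each expert's gate and up rows back into one slice."""
    return [gate + up for gate, up in per_expert]


# Two experts, intermediate_size=2, hidden=2.
fused = [
    [[1, 2], [3, 4], [5, 6], [7, 8]],
    [[9, 10], [11, 12], [13, 14], [15, 16]],
]
experts = split_fused_gate_up(fused, intermediate_size=2)
roundtrip = fuse_gate_up(experts)
```

Splitting and re-fusing are exact inverses here, which is the property a bidirectional conversion needs for a lossless roundtrip.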

Note: Requires transformers>=5.2.0 locally to use this feature, due to the new fused expert weight layout.
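
A caller could check the version requirement above before attempting a conversion. The following is a minimal sketch using only the standard library; `parse_version` and `meets_minimum` are hypothetical helpers, not part of Megatron-Bridge or transformers, and pre-release suffixes such as `.dev0` are simply ignored.

```python
def parse_version(text):
    """Turn '5.2.0' or '5.2' into a 3-tuple of ints, ignoring non-numeric suffixes."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    while len(parts) < 3:
        parts.append(0)  # pad so '5.2' compares equal to '5.2.0'
    return tuple(parts[:3])


def meets_minimum(installed, minimum="5.2.0"):
    """Return True when the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)
```

In practice the installed string would come from `transformers.__version__`; comparing integer tuples avoids the string-comparison trap where "5.10.0" sorts before "5.2.0".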

Related

Closes #2343
Takes over #2469

GitHub Actions CI

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install?
    • Reviewer: Does the PR have correct import guards for all optional libraries?

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for GLM-5 models with Mixture of Experts (MoE) and DeepSeek Sparse Attention (DSA) capabilities, including HuggingFace-to-Megatron conversion.
  • Updates

    • Reorganized model provider exports for improved API structure, including expanded support for Llama, DeepSeek, Nemotron, and GLM variants.
    • Updated upstream Megatron-LM dependency.
  • Tests

    • Added functional test suite for GLM-5 model conversion validation across multiple parallelism configurations.

yaoyu-33 and others added 30 commits March 6, 2026 09:31
Introduce FusedExpertMapping and FusedGatedExpertMapping in
param_mapping.py to handle many-to-one / one-to-many expert weight
conversions generically. This eliminates duplicated
maybe_modify_converted_hf_weight overrides and hf_weights_cache from
GPT-OSS, GLM-4.5, GLM-4.5V, and Qwen3-VL bridges (-502 / +307 lines).

Also fixes two pre-existing bugs:
- GLM-4.5 MTP mappings used stale 'transformer_layer' instead of
  'mtp_model_layer', causing missing-mapping warnings
- hf_to_megatron_generate_text.py set mtp_num_layers=None which crashed
  MTP-enabled models; replaced with m.mtp_process=False

Signed-off-by: Yu Yao <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
- Remove NemotronNano12Bv2Provider from nemotron_vl/__init__.py
  (was a deprecated alias from deleted nemotron_h_provider.py)
- Remove invalid max_position_embeddings kwarg from kimi and moonlight
  recipes (not a field on MLAModelProvider)
- Update moonlight test to monkeypatch MLAModelProvider instead of
  deleted MoonlightModelProvider16B

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor
Add Megatron Bridge for MiniMaxAI/MiniMax-M2, a sparse MoE model with
256 experts, top-8 sigmoid routing, and expert bias correction.

Includes:
- Bridge with config mapping and per-expert weight conversion
  (block_sparse_moe prefix, w1/w2/w3 format)
- Partial RoPE support (rotary_dim -> rotary_percent)
- QK layernorm intentionally disabled (full-dim vs per-head mismatch)
- Functional test with toy model for TP/PP/EP parallelism
- Example scripts for conversion, inference, and verification
- compare.py fix: truncate Megatron logits to HF vocab size for
  proper comparison when Megatron pads vocab for kernel efficiency

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor
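
The compare.py fix described in this commit (truncating padded logits to the HF vocabulary size) can be sketched as follows. This is an illustration under assumptions, not the script's code: `padded_vocab_size` and `truncate_logits` are hypothetical names, and nested lists stand in for logit tensors.

```python
def padded_vocab_size(vocab_size, multiple):
    """Round the vocabulary size up to the next multiple, as padding for kernels would."""
    return ((vocab_size + multiple - 1) // multiple) * multiple


def truncate_logits(logits, hf_vocab_size):
    """Keep only the first hf_vocab_size entries of each logit row."""
    return [row[:hf_vocab_size] for row in logits]


hf_vocab = 10
padded = padded_vocab_size(hf_vocab, multiple=8)      # padded width of the Megatron-side logits
megatron_logits = [[0.1] * padded, [0.2] * padded]
comparable = truncate_logits(megatron_logits, hf_vocab)
```

After truncation both sides have the same last dimension, so an element-wise comparison no longer includes the padding columns.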
…ti-node support for MiniMax-M2

Add custom full-dimension QK normalization (minimax_m2_provider.py) since
MiniMax-M2 applies RMSNorm over the entire Q/K projection rather than
per-head. The implementation uses sum-of-squares all-reduce across TP
ranks and provides sharded_state_dict for distributed checkpointing.
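
The sum-of-squares all-reduce idea can be checked with a small simulation. This sketch is not the code in minimax_m2_provider.py: it omits the learnable weight, treats each tensor-parallel rank as a plain list shard, and stands in for the all-reduce with a Python `sum`. All function names are hypothetical.

```python
import math


def rmsnorm_full(x, eps=1e-6):
    """Reference RMSNorm over the entire vector (no learnable weight)."""
    mean_sq = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_sq + eps) for v in x]


def rmsnorm_sharded(shards, eps=1e-6):
    """Same normalization when the vector is split across simulated TP ranks."""
    local_sums = [sum(v * v for v in shard) for shard in shards]
    global_sum = sum(local_sums)  # stand-in for an all-reduce(SUM) across ranks
    total_dim = sum(len(shard) for shard in shards)
    scale = 1.0 / math.sqrt(global_sum / total_dim + eps)
    return [[v * scale for v in shard] for shard in shards]


x = [1.0, 2.0, 3.0, 4.0]
full = rmsnorm_full(x)
sharded = rmsnorm_sharded([x[:2], x[2:]])
flat = [v for shard in sharded for v in shard]
```

Each rank only needs one scalar from the others, so the normalization matches the unsharded result without gathering the full projection.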

Add on-the-fly FP8 block-wise dequantization in the bridge via
maybe_modify_loaded_hf_weight, converting float8_e4m3fn weights to
bfloat16 using per-block scale_inv factors during HF->Megatron
conversion.
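
Block-wise dequantization with per-block scale_inv factors can be sketched as below. This assumes dequantization multiplies each element by the scale_inv entry of the block it falls in; the bridge works on float8_e4m3fn tensors, whereas this hypothetical `dequantize_blockwise` uses nested lists of floats and a tiny block size for readability.

```python
def dequantize_blockwise(weight, scale_inv, block_size):
    """Multiply every element by the scale_inv entry of its (row block, column block)."""
    out = []
    for i, row in enumerate(weight):
        block_row = scale_inv[i // block_size]
        out.append([value * block_row[j // block_size] for j, value in enumerate(row)])
    return out


# A 4x4 weight split into 2x2 blocks, so scale_inv is 2x2.
weight = [[1.0] * 4 for _ in range(4)]
scale_inv = [[2.0, 3.0], [4.0, 5.0]]
dequantized = dequantize_blockwise(weight, scale_inv, block_size=2)
```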

Add multi-node Slurm scripts (slurm_conversion.sh, slurm_inference.sh)
for configurations requiring TP*EP*PP > 8 GPUs.

Update verify_toy_model.py to extract real pretrained weights (N layers)
from the FP8 model, dequantize to bf16, and verify round-trip accuracy.

Fix dtype mismatch handling in hf_megatron_roundtrip_multi_gpu.py for
FP8 source models.

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor
…x-M2 expert mappings

- Add missing FusedGatedExpertMapping alias (GLMExpertGateUpProjMapping)
  to glm_moe_mappings.py, fixing ImportError after fused expert refactor
- Remove duplicate local_experts.* mappings from MiniMax-M2 bridge since
  moe_grouped_gemm=True (only grouped-gemm weight* path needed)

Verified: TP=2, PP=2, EP=2 roundtrip tests pass on cluster with zero
mapping warnings.

Signed-off-by: Yu Yao <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor
Remove verify_toy_model.py dev script. Align conversion.sh and
inference.sh with GPT-OSS pattern (import + export + roundtrip,
multi-checkpoint inference). Rewrite slurm_conversion.sh to sweep
parallelism configs (TP,PP,EP) with roundtrip validation. Clean up
slurm_inference.sh for consistency.

All configs verified on cluster-cw with toy model:
  TP=2,PP=1,EP=4 | TP=1,PP=2,EP=4 | TP=2,PP=2,EP=2 → EXIT=0

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor
…eanup

- Add SLURM env var auto-population in model_provider.py for srun launches
  (RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR, MASTER_PORT from SLURM vars)
- Increase NCCL init_process_group timeout to 60 minutes for large MoE models
- Fix ImportError crash in save_artifacts for trust_remote_code models
- Accept SLURM_NTASKS in hf_megatron_roundtrip_multi_gpu.py for srun launches
- Rewrite MiniMax-M2 slurm scripts to use srun-native (ntasks-per-node=8)
  instead of torch.distributed.run
- Remove single-node conversion.sh/inference.sh (MiniMax-M2 requires multi-node)
- Set verified parallelism defaults: TP=2,EP=8 roundtrip; TP=1,EP=16 inference

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor
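
The SLURM auto-population in this commit can be illustrated with a small sketch. It is not the code in model_provider.py: `populate_dist_env` is a hypothetical helper, it works on a plain dict instead of `os.environ`, and it leaves out MASTER_ADDR and MASTER_PORT because how those are derived from SLURM variables is not stated here.

```python
def populate_dist_env(env):
    """Fill torch.distributed-style variables from SLURM ones when they are missing."""
    mapping = {
        "RANK": "SLURM_PROCID",        # global task index
        "WORLD_SIZE": "SLURM_NTASKS",  # total number of tasks
        "LOCAL_RANK": "SLURM_LOCALID", # task index within the node
    }
    result = dict(env)
    for target, source in mapping.items():
        if target not in result and source in result:
            result[target] = result[source]
    return result


srun_env = {"SLURM_PROCID": "3", "SLURM_NTASKS": "16", "SLURM_LOCALID": "3"}
populated = populate_dist_env(srun_env)
preset = populate_dist_env({"RANK": "0", "SLURM_PROCID": "3"})
```

Values that are already set take precedence, so launches through torch.distributed.run would be unaffected by such a helper.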
…or PR #2628)

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
…mappings

The refactor in param_mapping.py renamed GLMExpertGateUpProjMapping to
FusedGatedExpertMapping but only added GLMExpertDownProjMapping alias
in glm_moe_mappings.py. Add the missing alias so existing bridge imports
(glm45_bridge.py, glm_45v_bridge.py) continue to work.

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Split multi-name import block into two separate import statements,
each with per-line # noqa: F401 comments, to satisfy ruff's import
block formatting requirements.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
…ext tests

- Set PROVIDER_CLASS = Qwen3NextModelProvider so super().provider_bridge()
  instantiates the correct provider (not GPTModelProvider which lacks
  MLA/hybrid fields like q_lora_rank)
- Add value is not None guard in hf_config_to_provider_kwargs to skip
  None-valued config fields
- Add null_attr fixture loop in test mocks to suppress Mock() objects
  for MLA/alternative-expert CONFIG_MAPPING fields

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
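
The `value is not None` guard mentioned in this commit can be sketched as follows. This is an illustration, not the body of hf_config_to_provider_kwargs: `config_to_kwargs` and the mapping shown are hypothetical, and a `SimpleNamespace` stands in for an HF config.

```python
from types import SimpleNamespace


def config_to_kwargs(config, mapping):
    """Copy mapped config fields into provider kwargs, skipping missing or None values."""
    kwargs = {}
    for hf_name, provider_name in mapping.items():
        value = getattr(config, hf_name, None)
        if value is not None:
            kwargs[provider_name] = value
    return kwargs


config = SimpleNamespace(hidden_size=1024, q_lora_rank=None)
mapping = {"hidden_size": "hidden_size", "q_lora_rank": "q_lora_rank"}
kwargs = config_to_kwargs(config, mapping)
```

Skipping None means the provider's own defaults apply for fields the HF config does not set, instead of being overwritten with None.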
Keep once-per-class dtype mismatch warning from HEAD (suppresses duplicate
warnings) over main's per-call version.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
… inference

- gpt_oss_bridge: transpose only down_proj (not gate_up_proj) in
  maybe_modify_loaded_hf_weight; use _align_expert_weight_to_shape in
  GPTOSSMLPGateUpProjMapping.hf_to_megatron for auto-detection of
  transposed vs standard expert weight layout across transformers versions
- glm45_bridge: loop over both mtp_model_layer and transformer_layer
  prefixes when building MTP param mappings, fixing roundtrip for MTP layers
- generate_text: set mtp_num_layers=0 (not None) to make range(0) a
  safe no-op instead of crashing with range(None) for MTP models

Verified on cluster: GPT-OSS BF16 roundtrip ✅, GPT-OSS MXFP4 inference ✅,
GLM-4.5-Air roundtrip ✅, GLM-4.5-Air inference ✅ ("Paris is the capital of France.")

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
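
The `range(None)` crash that this commit describes is plain Python behavior and can be reproduced in isolation. The helper `run_mtp_layers` below is hypothetical and only mimics a loop over MTP layers.

```python
def run_mtp_layers(mtp_num_layers):
    """Visit each MTP layer index; with 0 layers the loop body never runs."""
    visited = []
    for layer_idx in range(mtp_num_layers):
        visited.append(layer_idx)
    return visited


no_layers = run_mtp_layers(0)  # safe no-op

try:
    run_mtp_layers(None)
    crashed = False
except TypeError:
    # range() requires an integer, so None fails before the loop starts.
    crashed = True
```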
- Add `transpose_on_export` flag to `FusedExpertMapping` instead of
  shape-based transpose detection (avoids silent bugs with square tensors)
- Make `GLMExpertDownProjMapping` a proper subclass with
  `transpose_on_export=True` always set
- Add `transpose_on_export=True` to down_proj `FusedExpertMapping`
  instances in Qwen3-VL and Qwen3.5-VL bridges
- Restore inline comments removed from `stream_weights_megatron_to_hf`
- Refactor MTP-disable logic in `hf_to_megatron_generate_text.py` to use
  a `_disable_mtp` helper (matching vlm script pattern); fix
  `mtp_num_layers=0` → `None`

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Resolve conflict in qwen35_vl_bridge.py: keep PR's FusedExpertMapping /
FusedGatedExpertMapping approach (removing hf_weights_cache and
maybe_modify_converted_hf_weight) rather than main's ExpertMLPDownProjMapping
/ ExpertMLPGateUpProjMapping pattern. The class now inherits from
MegatronModelBridge directly (main's change, kept).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
…train

The μP functional test (eb93f96) referenced Llama32ModelProvider1B but
never defined or imported it, causing an F821 lint failure. Add a local
@dataclass subclass of GPTModelProvider with Llama 3.2 1B architecture
defaults.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
…rt weight alignment

- generate_text: extract _disable_mtp() helper (mirrors generate_vlm.py pattern)
  with mtp_num_layers=0 to avoid range(None) crash on MTP-enabled text models
- param_mapping: add transpose_hint param to _align_expert_weight_to_shape
  (True/False for explicit control, None for auto-detect with square-shape guard)
  as suggested by reviewer; raise clear error when auto-detect encounters
  ambiguous square 2D weights

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
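
The square-shape ambiguity behind the transpose_hint parameter can be shown with shapes alone. This sketch is not `_align_expert_weight_to_shape`; `needs_transpose` is a hypothetical function that only decides whether a 2D weight should be transposed, following the True/False/None semantics described in the commit message.

```python
def needs_transpose(shape, target_shape, transpose_hint=None):
    """Decide whether a 2D weight of `shape` must be transposed to match `target_shape`."""
    if transpose_hint is not None:
        return transpose_hint  # explicit control wins over detection
    rows, cols = shape
    if rows == cols:
        # Both layouts have the same shape, so detection cannot tell them apart.
        raise ValueError("square 2D weight is ambiguous; pass transpose_hint explicitly")
    if shape == target_shape:
        return False
    if (cols, rows) == target_shape:
        return True
    raise ValueError(f"shape {shape} cannot be aligned to {target_shape}")


try:
    needs_transpose((8, 8), (8, 8))
    ambiguous_error = False
except ValueError:
    ambiguous_error = True
```

Raising on square inputs trades convenience for safety: a silently skipped transpose would produce a checkpoint that loads without error but holds wrong weights.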
- param_mapping.py: inline raise ValueError to satisfy line-length formatting
- gpt_oss_bridge.py: remove unused GPTOSSProvider import (F401)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Merge Llama32ModelProvider1B definitions: take main's more complete set
of architecture defaults (activation_func, position_embedding_type,
bias/fusion flags, rotary_base) and add HEAD-only fields (kv_channels,
rope_scaling, rope_scaling_factor) for a full Llama 3.2 1B config.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
- model_bridge.py: take remote's simpler provider_kwargs guard
  (value is not None) and unconditional transpose_on_export path
- hf_to_megatron_generate_text.py: take remote's _disable_mtp that
  uses mtp_num_layers=0 and sets grad_scale_func=None in one place
- Sync hf_megatron_roundtrip_multi_gpu.py and qwen3_vl transformer_config
  from remote

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Bring submodule pointer in line with origin/main (905c0e38).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
…n merge

The file was removed in yuya/refactor-fused-expert-mappings but main
still has it. Restore from origin/main to keep this PR scoped to
MiniMax-M2 bridge only.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
@yaoyu-33

Contributor Author

/claude review

def _create_glm5_toy_model(model_dir: Path) -> None:
    model_dir.mkdir(parents=True, exist_ok=True)

    # Create GLM 4.5 config from the toy model config using AutoConfig

Contributor


Typo: this is GLM-5, not GLM 4.5.

Suggested change
-     # Create GLM 4.5 config from the toy model config using AutoConfig
+     # Create GLM-5 config from the toy model config using AutoConfig

Comment on lines +201 to +207
src_dir = getattr(self, "model_name_or_path", None)
if src_dir is not None:
    src_file = Path(str(src_dir)) / f"{name}.json"
    if src_file.exists():
        shutil.copy(src_file, Path(save_path) / f"{name}.json")
    else:
        logger.warning(f"Source file {src_file} not found; skipping {name}.")

Contributor


Nit: if model_name_or_path is None (e.g. model was constructed in-memory), the fallback silently does nothing after logging the initial warning about save_pretrained failing. Adding an else branch with a warning would make debugging easier:

Suggested change
- src_dir = getattr(self, "model_name_or_path", None)
- if src_dir is not None:
-     src_file = Path(str(src_dir)) / f"{name}.json"
-     if src_file.exists():
-         shutil.copy(src_file, Path(save_path) / f"{name}.json")
-     else:
-         logger.warning(f"Source file {src_file} not found; skipping {name}.")
+ src_dir = getattr(self, "model_name_or_path", None)
+ if src_dir is not None:
+     src_file = Path(str(src_dir)) / f"{name}.json"
+     if src_file.exists():
+         shutil.copy(src_file, Path(save_path) / f"{name}.json")
+     else:
+         logger.warning(f"Source file {src_file} not found; skipping {name}.")
+ else:
+     logger.warning(f"No model_name_or_path set; cannot fall back to file copy for {name}.")

- Revert try/except fallback in hf_pretrained/base.py (not needed in base)
- Refactor provider_bridge to use super() matching DeepSeek V3 pattern
- Combine param_mappings and layer_specific_mappings into single dict
- Remove example scripts (glm5_converted_inference, glm5_orig_inference,
  glm5_roundtrip)
- Simplify slurm_conversion.sh to match MiniMax-M2 pattern

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
@yaoyu-33

Contributor Author

/ok to test a02455b

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Made-with: Cursor
Signed-off-by: Yu Yao <yaoyu.094@gmail.com>
@yaoyu-33

Contributor Author

/ok to test 5124c61

…ax duplicates

Move GLM-5 test from models/ to test_groups/models/glm_moe_dsa/ following
CI conventions (hardcoded /opt/Megatron-Bridge, standard coverage, autoconfig
roundtrip). Add L0 launch script to launch_scripts/active/.

Remove duplicate minimax_m2 files (models/minimax_m2/ and root-level launch
script) since test_groups/ already has the canonical versions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@yaoyu-33

Contributor Author

/ok to test 4be9445

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@yaoyu-33

Contributor Author

/ok to test 867aaf7

yaoyu-33 and others added 3 commits April 20, 2026 16:53
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rash

GLM-5 HF config has num_nextn_predict_layers=1, which creates MTP layers
in the megatron model. The bridge has no MTP weight mappings for GLM-5,
causing NoneType crashes during weight transfer. Disable MTP for now
(not needed for inference) and add defensive None checks in model_bridge.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@yaoyu-33

Contributor Author

/ok to test b40478c

@yaoyu-33 yaoyu-33 merged commit 50e95c1 into main Apr 21, 2026
56 of 57 checks passed
@yaoyu-33 yaoyu-33 deleted the yuya-add-glm5 branch April 21, 2026 04:46
@ISEEKYAN

Contributor

@yaoyu-33 do we have a plan to use better DSA kernels, such as a tilelang or cuteDSL version, or is there a TE kernel plan?
cc @sbhavani

@sbhavani

Contributor

@ISEEKYAN it's on our roadmap for TE, I should add a public issue to track it

vasunvidia pushed a commit to vasunvidia/Megatron-Bridge that referenced this pull request Jun 10, 2026
…-NeMo#2913)

Signed-off-by: Yu Yao <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>

Labels

  • area:model: Model implementations and HF bridge logic
  • feature: New capabilities, enhancements, or enablement work
  • high-priority

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature]GLM-5 support?

4 participants