[model] feat: Add stepfun-ai/Step-3.5-Flash bridge by shifangx · Pull Request #3525 · NVIDIA-NeMo/Megatron-Bridge

shifangx · 2026-04-26T00:26:43Z

…Bridge support

What does this PR do?

Add support for Step-3.5-Flash, a 196B sparse MoE model (~11B active parameters) with the custom step3p5 architecture.

Architecture features:

45 layers: 3 dense (0-2) + 42 MoE (3-44)
288 routed experts + 1 shared expert, top-8 sigmoid routing with expert bias
Zero-centered RMSNorm + per-head QK RMSNorm
Per-head attention gate (g_proj)
Per-layer RoPE theta (48-element array)
Alternating SWA (Sliding Window Attention)
Specific attention config for SWA, such as num_attention_heads, rotary_base and rotary_percent.
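The routing line in the list above (top-8 of 288 routed experts, sigmoid scores, expert bias) can be sketched in plain Python. This is only an illustration, not the PR's implementation. It assumes the bias affects which experts are selected, while the combine weights come from the unbiased sigmoid scores renormalized over the selected experts; the PR description confirms neither detail. The name `route_token` is hypothetical.

```python
import math
import random

NUM_ROUTED_EXPERTS = 288  # from the PR description
TOP_K = 8                 # from the PR description


def route_token(logits, expert_bias, top_k=TOP_K):
    """Pick top_k experts for one token and return (indices, weights).

    Assumption: the bias shifts selection only; weights use unbiased scores.
    """
    # Sigmoid score per expert (independent per expert, unlike softmax).
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    # Selection scores: sigmoid score plus the per-expert bias.
    biased = [s + b for s, b in zip(scores, expert_bias)]
    chosen = sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)[:top_k]
    # Renormalize the unbiased scores of the chosen experts so they sum to 1.
    total = sum(scores[i] for i in chosen)
    weights = [scores[i] / total for i in chosen]
    return chosen, weights


random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_ROUTED_EXPERTS)]
zero_bias = [0.0] * NUM_ROUTED_EXPERTS
experts, weights = route_token(logits, zero_bias)
```

Under these assumptions, raising one expert's bias makes it more likely to be selected without changing the weight it receives once selected.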

This PR depends on NVIDIA/Megatron-LM#4841 and NVIDIA/Megatron-LM#4473.

Changelog

Add specific line by line info of high level changes in this PR.

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

Related to # (issue)

copy-pr-bot · 2026-04-26T00:26:47Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

yaoyu-33 · 2026-04-27T12:17:48Z

/claude review

yaoyu-33 · 2026-04-27T12:19:51Z

Naming consistency review

The repo's established pattern for "X.Y"-versioned models drops the dot in filenames/symbols and uses an underscore in HF model_type strings. This PR uses p as a decimal separator everywhere, which matches no other model in tree:

| Model | Filenames / classes | HF `model_type` |
| --- | --- | --- |
| Qwen 2.5 | `qwen25_vl_bridge.py`, `Qwen25...` | `"qwen2_5"` / `"qwen2_5_vl"` |
| Qwen 3.5 | `qwen35_vl_bridge.py`, `Qwen35...` | `"qwen3_5"` / `"qwen3_5_moe"` |
| GLM 4.5v | `glm_45v_bridge.py`, `modeling_glm_45v.py` | n/a |
| Kimi K2.5 | `kimi_k25_vl/`, `modeling_kimi_k25_vl.py` | n/a |
| Bailing MoE v2 | `bailing_moe_v2`, `BailingMoeV2...` | `"bailing_moe_v2"` |
| Step 3.5 (this PR) | `step3p5_bridge.py`, `Step3p5...` | `"step3p5"` |

Suggested renames

step3p5_bridge.py → step35_bridge.py
configuration_step3p5.py → configuration_step35.py
modeling_step3p5.py → modeling_step35.py
recipes/step/step3p5.py → recipes/step/step35.py
Classes: Step3p5Bridge, Step3p5Config, Step3p5ForCausalLM, Step3p5Model, Step3p5RotaryEmbedding, Step3p5CausalLMOutputWithPast → drop the p
Recipe fn: step3p5_196b_a11b_pretrain_config → step35_196b_a11b_pretrain_config
Bridge registration model_type="step3p5" → "step3_5" (matching the qwen3_5 style)

Caveat on the HF model_type string: if "step3p5" is what stepfun-ai uses on HuggingFace Hub (the value baked into their published config.json), it has to stay as-is for from_pretrained to dispatch — in that case only the file/class names should change, and it'd be worth a docstring note explaining why the string differs from the rest of the repo.
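The convention behind the suggested renames can be written down as a small sketch. The helper names below are hypothetical and only reproduce the Qwen-style rows of the table (dot dropped in filenames and classes, dot replaced by an underscore in `model_type`); they do not cover the GLM or Kimi spellings, and they do not settle the caveat above about what stepfun-ai publishes in `config.json`.

```python
def file_stem(family: str, version: str) -> str:
    """'step', '3.5' -> 'step35': the dot is dropped, as in qwen35_vl_bridge.py."""
    return f"{family.lower()}{version.replace('.', '')}"


def class_prefix(family: str, version: str) -> str:
    """'step', '3.5' -> 'Step35': the dot is dropped, as in the Qwen35 classes."""
    return f"{family.capitalize()}{version.replace('.', '')}"


def hf_model_type(family: str, version: str) -> str:
    """'step', '3.5' -> 'step3_5': the dot becomes an underscore, as in 'qwen3_5'."""
    return f"{family.lower()}{version.replace('.', '_')}"


bridge_file = f"{file_stem('step', '3.5')}_bridge.py"
bridge_class = f"{class_prefix('step', '3.5')}Bridge"
model_type = hf_model_type("step", "3.5")
```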

Header / boilerplate

configuration_step3p5.py has no copyright header — file starts at from typing import …. Per CLAUDE.md, NVIDIA headers are required on new non-test files.
modeling_step3p5.py carries # Copyright 2025 The LLAMA4 and HuggingFace Inc. team, which suggests a Llama4 template was used. Compare with modeling_bailing_moe_v2.py / modeling_ministral3.py, which start with the standard NVIDIA + upstream-attribution header.
Both files use year 2025; please bump to 2026.
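A header check along these lines could catch all three points before review. This is a rough sketch: the function name is hypothetical, and the test is deliberately loose (the leading comment block must mention "Copyright", "NVIDIA", and the expected year) because the exact header wording required by the repo is not quoted in this thread.

```python
def has_nvidia_header(source: str, year: int = 2026) -> bool:
    """Return True if the leading comment block names NVIDIA and the given year."""
    header_lines = []
    for line in source.splitlines():
        stripped = line.strip()
        if not stripped:
            if header_lines:
                break      # a blank line ends the header block
            continue       # skip blank lines before the header
        if not stripped.startswith("#"):
            break          # first code line ends the search
        header_lines.append(stripped)
    header = " ".join(header_lines)
    return "Copyright" in header and "NVIDIA" in header and str(year) in header


no_header = "from typing import Optional\n"
llama_header = "# Copyright 2025 The LLAMA4 and HuggingFace Inc. team\n\nimport os\n"
nvidia_header = "# Copyright (c) 2026, NVIDIA CORPORATION.\n\nimport os\n"
```

With these inputs, only `nvidia_header` passes; the other two fail for a missing header and a missing NVIDIA attribution respectively.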

Submodule bump

The PR also bumps 3rdparty/Megatron-LM and .gitmodules (commits 49bff34d, 690edcd9, 39828dd8). If that bump isn't strictly required by the Step bridge code, please split it into a separate PR so this one is purely the model addition.
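Whichever PR ends up carrying the bump, a quick check that `.gitmodules` still points at the upstream repository is cheap to run. The sketch below uses only the standard library; the helper names are hypothetical, and the expected URL is the upstream one quoted in the review summary on this PR.

```python
import configparser

# Upstream URL as quoted in the review summary on this PR.
UPSTREAM_URL = "https://github.com/NVIDIA/Megatron-LM.git"


def submodule_urls(gitmodules_text: str) -> dict:
    """Map each submodule path in a .gitmodules file to its URL."""
    # Strip indentation so no key can be read as a continuation line.
    cleaned = "\n".join(line.strip() for line in gitmodules_text.splitlines())
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    parser.read_string(cleaned)
    return {parser[s]["path"]: parser[s]["url"] for s in parser.sections()}


def off_upstream(gitmodules_text: str, expected: str = UPSTREAM_URL) -> list:
    """Return the submodule paths whose URL differs from the expected one."""
    return [p for p, url in submodule_urls(gitmodules_text).items() if url != expected]


sample = (
    '[submodule "3rdparty/Megatron-LM"]\n'
    "\tpath = 3rdparty/Megatron-LM\n"
    "\turl = git@github.com:shifangx/Megatron-LM.git\n"
)
flagged = off_upstream(sample)
```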

Layout

models/step/{__init__.py, <name>_bridge.py, configuration_*.py, modeling_*.py} matches the precedent set by bailing/ and ministral3/. No <name>_provider.py needed since the bridge subclasses GPTModelProvider directly. ✓

claude

Review Summary

This PR has several issues that need to be addressed before it's ready for merge. The main concerns:

Blockers

.gitmodules points to a personal fork — The submodule URL was changed from https://github.com/NVIDIA/Megatron-LM.git to git@github.com:shifangx/Megatron-LM.git. This will break CI and all other contributors. Must be reverted.
self.attention_dropout undefined — Step3p5Attention references self.attention_dropout during training but never initializes it. This is an AttributeError at runtime.
get_input_embeddings breaks HF API — The override in Step3p5Model takes input_ids and returns embeddings output instead of the embedding module. This breaks tie_weights(), get_input_embeddings() calls from Step3p5ForCausalLM, etc.
Missing bridge parameter mappings — The mapping_registry is missing mappings for shared experts, dense-layer MLPs (layers 0–2), attention gate (g_proj), and router bias. Weight conversion will silently drop or fail on these parameters.
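For the last blocker, a coverage check makes dropped parameters visible instead of silent. The sketch below is not the bridge's `mapping_registry` API: it assumes mappings can be summarized as wildcard patterns over HF parameter names, and every parameter name and pattern shown is a hypothetical placeholder (only `g_proj` comes from the PR description).

```python
from fnmatch import fnmatchcase


def unmapped_params(param_names, mapped_patterns):
    """Return the parameter names that no mapping pattern covers."""
    return [
        name
        for name in param_names
        if not any(fnmatchcase(name, pattern) for pattern in mapped_patterns)
    ]


# Hypothetical patterns standing in for what a registry might cover.
patterns = [
    "model.layers.*.self_attn.q_proj.weight",
    "model.layers.*.moe.experts.*.weight",
]

# Hypothetical checkpoint keys; real names come from the HF state dict.
checkpoint_keys = [
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.self_attn.g_proj.weight",
    "model.layers.0.mlp.up_proj.weight",
    "model.layers.3.moe.experts.7.weight",
    "model.layers.3.moe.router_bias",
]

missing = unmapped_params(checkpoint_keys, patterns)
```

Asserting that `missing` is empty in a unit test would turn the gaps listed above into a test failure rather than a conversion-time surprise.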

Cleanup

Multiple copy-paste artifacts from Llama4/Qwen3 (copyright header, docstring examples, comments)
Leftover # breakpoint() debug comments
share_expert_dims / share_expert_dim naming inconsistency in config

Missing

No tests. A new model bridge should have at least unit tests for parameter mapping round-trips and config conversion.
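As a starting point, a config round-trip test can be as small as the sketch below. The field map and both conversion helpers are hypothetical stand-ins, not the bridge's actual `CONFIG_MAPPING` or conversion functions; the values 45 and 8 are the layer count and top-k from the PR description.

```python
# Hypothetical HF -> Megatron field names, for illustration only.
FIELD_MAP = {
    "num_hidden_layers": "num_layers",
    "moe_top_k": "moe_router_topk",
}


def hf_to_megatron(hf_cfg: dict) -> dict:
    """Rename the HF fields that have a known Megatron counterpart."""
    return {FIELD_MAP[k]: v for k, v in hf_cfg.items() if k in FIELD_MAP}


def megatron_to_hf(mg_cfg: dict) -> dict:
    """Invert hf_to_megatron for the mapped fields."""
    inverse = {v: k for k, v in FIELD_MAP.items()}
    return {inverse[k]: v for k, v in mg_cfg.items() if k in inverse}


def test_config_round_trip():
    hf_cfg = {"num_hidden_layers": 45, "moe_top_k": 8}
    assert megatron_to_hf(hf_to_megatron(hf_cfg)) == hf_cfg


test_config_round_trip()
```

The same shape works for parameter-name mappings: convert, convert back, and assert equality.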

yaoyu-33

Inline notes for the naming-consistency review (top-level summary already posted above).

yaoyu-33 · 2026-05-22T00:24:17Z

Addressed the unresolved review feedback in 5e5c2c7681d3cf91ef5dd0c3b0c68ea4d56e0a11:

Moved the model, recipe, example, and test paths to stepfun.
Removed the redundant megatron_to_hf_config() override.
Routed the Stepfun HF config fields through CONFIG_MAPPING / super().provider_bridge() and kept only Stepfun-specific normalization in provider_bridge().
Added functional conversion coverage for TP/PP/EP toy Step-3.5 checkpoints.
Added a 1-node / 8-GPU inference example for Step-3.5-Flash.

Sanitized internal validation evidence:

Unit coverage for Stepfun model/recipe changes: 46 passed.
Functional conversion coverage for Stepfun TP/PP/EP cases: 3 passed.
Real stepfun-ai/Step-3.5-Flash smoke run: local public-model snapshot, 1 node / 8 GPUs, in-memory HF -> Megatron conversion, EP=8, greedy generation completed.
- Prompt: Write one concise sentence about Megatron Bridge.
- Generated: <｜begin▁of▁sentence｜>Write one concise sentence about Megatron Bridge. Write one concise sentence

Posted /ok to test 5e5c2c7681d3cf91ef5dd0c3b0c68ea4d56e0a11 separately.

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

yaoyu-33 · 2026-05-22T00:36:16Z

/ok to test 32d94b8

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

yaoyu-33 · 2026-05-22T00:54:46Z

/ok to test 9ab6119

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Co-authored-by: yaoyu-33 <yaoyu.094@gmail.com> Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>

shifangx changed the title ~~[model] feat: Add stepfun-ai/Step-3.5-Flash bridge~~ [draft] [model] feat: Add stepfun-ai/Step-3.5-Flash bridge Apr 26, 2026

shifangx force-pushed the shifang/step-3.5-flash branch 3 times, most recently from ee5cce6 to 690edcd Compare April 26, 2026 15:18

shifangx changed the title ~~[draft] [model] feat: Add stepfun-ai/Step-3.5-Flash bridge~~ [model] feat: Add stepfun-ai/Step-3.5-Flash bridge Apr 27, 2026

yaoyu-33 added the area:model, feature, high-complexity, and needs-review labels Apr 27, 2026

yaoyu-33 reviewed Apr 27, 2026

Comment thread src/megatron/bridge/models/step/__init__.py Outdated

yaoyu-33 reviewed Apr 27, 2026

Comment thread .gitmodules