[Inference] Add MCore inference examples and model wrappers by cuichenx · Pull Request #3897 · NVIDIA-NeMo/Megatron-Bridge

cuichenx · 2026-05-20T00:15:02Z

Summary

Add Bridge/AutoBridge synchronous offline text generation under examples/inference/text_generation.py.
Add direct MCore-style concurrent async generation and OpenAI-compatible server examples under examples/inference/.
Add launcher scripts and README for the new generic inference examples.
Refactor text-only model inference wrappers to use examples/inference/text_generation.py as the efficient inference entry point.
Keep examples/conversion/hf_to_megatron_generate_text.py as a debugging/parity-forward path rather than the primary inference path.
(Temporary) Update the Megatron-LM submodule pointer to the MCore inference API PR head.

Dependency

Depends on unmerged MCore PR: NVIDIA/Megatron-LM#4697

The new examples import the high-level inference APIs from that PR, including MegatronLLM, MegatronAsyncLLM, and ServeConfig.

Validation

uv run --no-sync pre-commit run --all-files
Runtime checks completed for the generic synchronous, async, and OpenAI-compatible server examples.

Model Wrapper Runtime Notes

Wrapper / model target	Runtime result
`examples/models/gpt_oss/inference.sh`	Passes via legacy static generation with local attention; integrated `text_generation.py --use-legacy-generation --attention-backend local` check produced good chat-template outputs (`4` for `What is 2+2?` and a coherent greeting). Dynamic generation runs but produces repetitive output; TE `unfused` static generation runs but produced empty text.
`examples/models/bailing/inference.sh`	Passes one-node inference for `inclusionAI/Ling-flash-2.0` with `TP=1 EP=8`; generated text: `I'm trying to solve this programming problem`.
`examples/models/falcon_h1/inference.sh`	Passes one-GPU inference via MCore static batching for `tiiuae/Falcon-H1-0.5B-Instruct`; generated text: `Answer: Artificial intelligence`.
`examples/models/glm47/inference.sh`	Blocked for now: the MLA dynamic path is blocked in MCore by FlashMLA/block-size issues tracked in NVIDIA/Megatron-LM#4901; team is investigating.
`examples/models/sarvam/inference.sh`	Passes and produces a coherent short output: `AI is a broad field of computer science that aims to create machines that can`.
`examples/models/glm/glm5/slurm_inference.sh`	Runs on 8 nodes for `zai-org/GLM-5.1` with `TP=2 EP=32` and reaches model load/inference setup, but does not produce generated text; both 100-token and 1-token runs timed out at the 30-minute job limit.
`examples/models/glm47/slurm_inference.sh`	Blocked for now: the full-size path is blocked by the same MLA dynamic inference issue as the single-node path; team is investigating.
`examples/models/minimax/minimax_m2/slurm_inference.sh`	Passes two-node inference for `MiniMaxAI/MiniMax-M2` with `TP=1 EP=16` and `inference_moe_token_dispatcher_type=nccl`; generated text: `AI`.
`examples/models/nemotron/nemotron_3/nano/`	Runs and produces text for `nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16`, but the short sample was weak: `What is machine learning? What is deep`.
`examples/models/nemotron/nemotron_3/super/`	Passes two-node inference for `nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16` with `inference_moe_token_dispatcher_type=nccl`; generated text: `AI is the ability`.

Note: uv run pre-commit run --all-files without --no-sync was not usable in the local environment because dependency resolution requires a platform-specific nvidia-resiliency-ext==0.6.0 wheel that is unavailable there.

Signed-off-by: Chen Cui <chcui@nvidia.com>

copy-pr-bot · 2026-05-20T00:15:05Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Signed-off-by: Chen Cui <chcui@nvidia.com>

…4697 Signed-off-by: Chen Cui <chcui@nvidia.com> # Conflicts: # 3rdparty/Megatron-LM

Signed-off-by: Chen Cui <chcui@nvidia.com>

…eMo#3897) Signed-off-by: Chen Cui <chcui@nvidia.com> Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>

[Inference] Add MCore text generation examples

f695435

Signed-off-by: Chen Cui <chcui@nvidia.com>

[Inference] Route model examples through MCore text generation

28468f9

Signed-off-by: Chen Cui <chcui@nvidia.com>

cuichenx changed the title ~~[Inference] Add MCore high-level inference examples~~ [Inference] Add MCore inference examples and model wrappers May 20, 2026

cuichenx mentioned this pull request May 26, 2026

[NeMo FW 26.06 Release] MBridge v0.5.0 Roadmap #3754

Open

cuichenx added 2 commits May 28, 2026 11:18

[Inference] Add legacy text generation option

5c60f2e

Signed-off-by: Chen Cui <chcui@nvidia.com>

[Inference] Note GLM fallback generation path

b239a0f

Signed-off-by: Chen Cui <chcui@nvidia.com>

cuichenx marked this pull request as ready for review May 29, 2026 17:44

cuichenx added the high-priority label May 29, 2026

copy-pr-bot Bot temporarily deployed to public May 29, 2026 17:45 Inactive

copy-pr-bot Bot had a problem deploying to test May 29, 2026 17:45 Error

Merge remote-tracking branch 'origin/main' into chcui/inference-mcore…

e8e1b64

…4697 Signed-off-by: Chen Cui <chcui@nvidia.com> # Conflicts: # 3rdparty/Megatron-LM

copy-pr-bot Bot had a problem deploying to test May 29, 2026 17:51 Error

copy-pr-bot Bot temporarily deployed to public May 29, 2026 17:52 Inactive

[Inference] Clean up inference README

c022342

Signed-off-by: Chen Cui <chcui@nvidia.com>

copy-pr-bot Bot had a problem deploying to test May 29, 2026 17:57 Error

copy-pr-bot Bot temporarily deployed to public May 29, 2026 17:57 Inactive

copy-pr-bot Bot temporarily deployed to public May 29, 2026 18:05 Inactive

[Inference] Skip missing optional inference checkpoints

41eb2a1

Signed-off-by: Chen Cui <chcui@nvidia.com>

copy-pr-bot Bot temporarily deployed to public May 29, 2026 18:07 Inactive

copy-pr-bot Bot had a problem deploying to test May 29, 2026 18:07 Error

copy-pr-bot Bot temporarily deployed to public May 29, 2026 18:16 Inactive

[Inference] Use static generation for Falcon H1 example

f6a7afe

Signed-off-by: Chen Cui <chcui@nvidia.com>

copy-pr-bot Bot temporarily deployed to public May 29, 2026 18:20 Inactive

copy-pr-bot Bot temporarily deployed to test May 29, 2026 18:20 Inactive

copy-pr-bot Bot temporarily deployed to public May 29, 2026 18:27 Inactive

copy-pr-bot Bot temporarily deployed to public May 29, 2026 18:28 Inactive

copy-pr-bot Bot temporarily deployed to public May 29, 2026 18:48 Inactive

yaoyu-33 added area:model Model implementations and HF bridge logic blocked Work cannot move forward until an external dependency is cleared feature New capabilities, enhancements, or enablement work labels May 29, 2026

cuichenx enabled auto-merge (squash) May 29, 2026 22:20

cuichenx disabled auto-merge May 29, 2026 22:58

[Inference] Move Python inference entrypoints

3eb8d7a

Signed-off-by: Chen Cui <chcui@nvidia.com>

copy-pr-bot Bot temporarily deployed to public May 29, 2026 23:02 Inactive

copy-pr-bot Bot temporarily deployed to test May 29, 2026 23:02 Inactive

cuichenx added the docs-only With great power comes great responsibility. label May 29, 2026

copy-pr-bot Bot temporarily deployed to public May 29, 2026 23:11 Inactive

yaoyu-33 approved these changes May 29, 2026

View reviewed changes

copy-pr-bot Bot temporarily deployed to public May 29, 2026 23:11 Inactive

cuichenx merged commit 58825e5 into main May 29, 2026
24 checks passed

cuichenx deleted the chcui/inference-mcore4697 branch May 29, 2026 23:12

copy-pr-bot Bot temporarily deployed to public May 29, 2026 23:30 Inactive

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Inference] Add MCore inference examples and model wrappers#3897

[Inference] Add MCore inference examples and model wrappers#3897
cuichenx merged 9 commits into
mainfrom
chcui/inference-mcore4697

cuichenx commented May 20, 2026 •

edited

Loading

Uh oh!

copy-pr-bot Bot commented May 20, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

cuichenx commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Dependency

Validation

Model Wrapper Runtime Notes

Uh oh!

copy-pr-bot Bot commented May 20, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

cuichenx commented May 20, 2026 •

edited

Loading