[Inference] Add MCore inference examples and model wrappers #3897

Merged
cuichenx merged 9 commits into main from chcui/inference-mcore4697 on May 29, 2026

@cuichenx cuichenx commented May 20, 2026

Summary

  • Add Bridge/AutoBridge synchronous offline text generation under examples/inference/text_generation.py.
  • Add direct MCore-style concurrent async generation and OpenAI-compatible server examples under examples/inference/.
  • Add launcher scripts and README for the new generic inference examples.
  • Refactor text-only model inference wrappers to use examples/inference/text_generation.py as the efficient inference entry point.
  • Keep examples/conversion/hf_to_megatron_generate_text.py as a debugging/parity-forward path rather than the primary inference path.
  • (Temporary) Update the Megatron-LM submodule pointer to the MCore inference API PR head.
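
The concurrent async generation mentioned in the summary can be illustrated with a bounded-concurrency pattern in plain `asyncio`. This is a minimal sketch, not the code from this PR: `fake_generate`, `generate_all`, and `max_in_flight` are hypothetical names, and `fake_generate` is a stand-in for whatever coroutine the real engine exposes (the MegatronAsyncLLM interface is not shown here, so nothing below assumes its signature).

```python
import asyncio


async def fake_generate(prompt: str) -> str:
    """Hypothetical stand-in for an async engine call; NOT the MCore API."""
    await asyncio.sleep(0.01)  # simulate per-request latency
    return f"echo: {prompt}"


async def generate_all(prompts, generate=fake_generate, max_in_flight=4):
    """Run `generate` over all prompts with at most `max_in_flight` in flight."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def bounded(prompt):
        # Each request waits for a free slot before calling the engine.
        async with semaphore:
            return await generate(prompt)

    # gather returns results in input order, regardless of completion order.
    return await asyncio.gather(*(bounded(p) for p in prompts))


def run_demo():
    """Drive the coroutine from synchronous code."""
    return asyncio.run(generate_all(["What is 2+2?", "Say hello."]))
```

Swapping `fake_generate` for a real engine coroutine would keep the same structure; the semaphore only caps how many requests are submitted at once and may be unnecessary if the engine already batches requests internally.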

Dependency

Depends on unmerged MCore PR: NVIDIA/Megatron-LM#4697

The new examples import the high-level inference APIs from that PR, including MegatronLLM, MegatronAsyncLLM, and ServeConfig.

Validation

  • uv run --no-sync pre-commit run --all-files
  • Runtime checks completed for the generic synchronous, async, and OpenAI-compatible server examples.
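
A runtime check against an OpenAI-compatible server can be sketched with the standard library alone. The snippet below assumes the server exposes the conventional `/v1/completions` route of the OpenAI Completions API; the base URL, port, and model name are placeholders, and the launcher flags of the example server in this PR are not reproduced here.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=32):
    """Build a POST request for an OpenAI-style completions endpoint.

    Assumption: the server follows the OpenAI Completions request schema
    (`model`, `prompt`, `max_tokens`) at `/v1/completions`.
    """
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60.0):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Placeholder host, port, and model name; adjust to the running server.
request = build_completion_request(
    "http://localhost:8000", "placeholder-model", "What is 2+2?"
)
# result = send(request)  # requires a running server
```

Keeping request construction separate from sending makes the payload easy to inspect without a live server.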

Model Wrapper Runtime Notes

| Wrapper / model target | Runtime result |
| --- | --- |
| `examples/models/gpt_oss/inference.sh` | Passes via legacy static generation with local attention; integrated `text_generation.py --use-legacy-generation --attention-backend local` check produced good chat-template outputs (`4` for `What is 2+2?` and a coherent greeting). Dynamic generation runs but produces repetitive output; TE unfused static generation runs but produced empty text. |
| `examples/models/bailing/inference.sh` | Passes one-node inference for `inclusionAI/Ling-flash-2.0` with TP=1 EP=8; generated text: `I'm trying to solve this programming problem.` |
| `examples/models/falcon_h1/inference.sh` | Passes one-GPU inference via MCore static batching for `tiiuae/Falcon-H1-0.5B-Instruct`; generated text: `**Answer:** Artificial intelligence.` |
| `examples/models/glm47/inference.sh` | Blocked for now: the MLA dynamic path is blocked in MCore by FlashMLA/block-size issues tracked in NVIDIA/Megatron-LM#4901; team is investigating. |
| `examples/models/sarvam/inference.sh` | Passes and produces a coherent short output: `AI is a broad field of computer science that aims to create machines that can.` |
| `examples/models/glm/glm5/slurm_inference.sh` | Runs on 8 nodes for `zai-org/GLM-5.1` with TP=2 EP=32 and reaches model load/inference setup, but does not produce generated text; both 100-token and 1-token runs timed out at the 30-minute job limit. |
| `examples/models/glm47/slurm_inference.sh` | Blocked for now: the full-size path is blocked by the same MLA dynamic inference issue as the single-node path; team is investigating. |
| `examples/models/minimax/minimax_m2/slurm_inference.sh` | Passes two-node inference for `MiniMaxAI/MiniMax-M2` with TP=1 EP=16 and `inference_moe_token_dispatcher_type=nccl`; generated text: `AI.` |
| `examples/models/nemotron/nemotron_3/nano/` | Runs and produces text for `nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16`, but the short sample was weak: `What is machine learning? What is deep.` |
| `examples/models/nemotron/nemotron_3/super/` | Passes two-node inference for `nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16` with `inference_moe_token_dispatcher_type=nccl`; generated text: `AI is the ability.` |

Note: uv run pre-commit run --all-files without --no-sync was not usable in the local environment because dependency resolution requires a platform-specific nvidia-resiliency-ext==0.6.0 wheel that is unavailable there.

Signed-off-by: Chen Cui <chcui@nvidia.com>

copy-pr-bot Bot commented May 20, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

Signed-off-by: Chen Cui <chcui@nvidia.com>
@cuichenx cuichenx changed the title [Inference] Add MCore high-level inference examples [Inference] Add MCore inference examples and model wrappers May 20, 2026
cuichenx added 2 commits May 28, 2026 11:18
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
@cuichenx cuichenx marked this pull request as ready for review May 29, 2026 17:44
…4697

Signed-off-by: Chen Cui <chcui@nvidia.com>

# Conflicts:
#	3rdparty/Megatron-LM
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
@yaoyu-33 yaoyu-33 added the area:model, blocked, and feature labels May 29, 2026
@cuichenx cuichenx enabled auto-merge (squash) May 29, 2026 22:20
@cuichenx cuichenx disabled auto-merge May 29, 2026 22:58
Signed-off-by: Chen Cui <chcui@nvidia.com>
@cuichenx cuichenx added the docs-only label May 29, 2026
@cuichenx cuichenx merged commit 58825e5 into main May 29, 2026
24 checks passed
@cuichenx cuichenx deleted the chcui/inference-mcore4697 branch May 29, 2026 23:12
vasunvidia pushed a commit to vasunvidia/Megatron-Bridge that referenced this pull request Jun 10, 2026
…eMo#3897)

Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
Labels

  • area:model: Model implementations and HF bridge logic
  • blocked: Work cannot move forward until an external dependency is cleared
  • docs-only: With great power comes great responsibility.
  • feature: New capabilities, enhancements, or enablement work
  • high-priority
