Inference Optimized MoEs by sidsingh-nvidia · Pull Request #3496 · NVIDIA/Megatron-LM

sidsingh-nvidia · 2026-02-19T19:42:37Z

What does this PR do?

Optimizes MoE decode (and small batch prefill) performance by eliminating host synchronizations, enabling end-to-end CUDA graph capture, and using latency-optimized NVLS collectives.

Motivation

The default MoE layer in Megatron-LM is not well-suited for inference. The AlltoAll dispatcher and GroupedGEMM rely on CPU-resident tensors for token-expert assignments, which breaks CUDA graph capture due to host synchronizations. The current workaround — padding to maximum capacity (routing all tokens to all experts) — enables static shapes but wastes significant compute and communication.

This PR introduces an inference-optimized MoE layer that achieves the best of both worlds: compute/communication-optimal routing with full CUDA graph compatibility. These layers are specially designed to optimize the decode phase, where small batch sizes make host synchronization and kernel launch overhead disproportionately expensive.
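To make the cost of the padding workaround concrete, here is a back-of-the-envelope sketch. The batch size, expert count, and top-k value are illustrative assumptions, not numbers from this PR, and `expert_token_pairs` is a hypothetical helper used only for this example.

```python
from typing import Tuple


def expert_token_pairs(num_tokens: int, num_experts: int, top_k: int) -> Tuple[int, int]:
    """Return (padded_pairs, routed_pairs) of token-expert work items for one MoE layer."""
    # Padding to maximum capacity: every token is sent to every expert.
    padded = num_tokens * num_experts
    # Compute-optimal routing: each token is sent only to its top-k experts.
    routed = num_tokens * top_k
    return padded, routed


# Hypothetical decode step: 8 tokens, 64 experts, top-4 routing.
padded, routed = expert_token_pairs(num_tokens=8, num_experts=64, top_k=4)
waste_factor = padded // routed
print(f"padded={padded} routed={routed} waste_factor={waste_factor}x")
```

Under these assumed numbers the padded path does num_experts / top_k times the expert work of the routed path, and the same ratio applies to the tokens moved during dispatch.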

Optimizations

InferenceGroupedMLP: Expert computation layer that operates directly on GPU-resident token-expert splits, eliminating the host synchronizations that block CUDA graph capture in the default GroupedGEMM path. It uses FlashInfer cutlass_fused_moe (fused permute + GEMM) for CUDA-graphed iterations and torch._grouped_mm with GPU-resident cumsum offsets in eager mode. It inherits from TEGroupedMLP for checkpoint compatibility.
InferenceTopKRouter: Stripped-down router that removes training overhead (z-loss, auxiliary losses, token dropping) and is optimized via @torch.compile().
InferenceCUDAGraphTokenDispatcher: Replaces AlltoAll with AllGather/ReduceScatter for token exchange, keeping all metadata GPU-resident. Supports latency-optimized NVLS collectives on Hopper+ with automatic NCCL fallback.
Fused 3-tensor NVLS all-gather kernel (multimem_all_gather_fused) for routing_map, probs, and hidden_states, using a single kernel launch and a single barrier.
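The AllGather/ReduceScatter dispatch pattern can be illustrated with a small CPU-only simulation. This is a sketch of the general idea, not the code in this PR: hidden states are plain floats, "ranks" are list entries, and the names `reference_moe`, `allgather_reducescatter_moe`, and `toy_expert` are hypothetical. It assumes expert `e` lives on rank `e // experts_per_rank`.

```python
def reference_moe(tokens, routing, probs, expert_fn):
    """Per-token result: probability-weighted sum over that token's selected experts."""
    return [
        sum(p * expert_fn(e, x) for e, p in zip(routing[i], probs[i]))
        for i, x in enumerate(tokens)
    ]


def allgather_reducescatter_moe(tokens_per_rank, routing, probs, experts_per_rank, expert_fn):
    num_ranks = len(tokens_per_rank)
    # AllGather: every rank ends up holding the full token list.
    gathered = [x for rank_tokens in tokens_per_rank for x in rank_tokens]

    # Each rank produces a partial output for ALL tokens using only its local experts.
    partials = []
    for rank in range(num_ranks):
        local = range(rank * experts_per_rank, (rank + 1) * experts_per_rank)
        out = [0.0] * len(gathered)
        for i, x in enumerate(gathered):
            for e, p in zip(routing[i], probs[i]):
                if e in local:
                    out[i] += p * expert_fn(e, x)
        partials.append(out)

    # ReduceScatter: sum the partials across ranks, then each rank keeps its own slice.
    reduced = [sum(column) for column in zip(*partials)]
    result, start = [], 0
    for rank_tokens in tokens_per_rank:
        result.append(reduced[start:start + len(rank_tokens)])
        start += len(rank_tokens)
    return result


def toy_expert(e, x):
    # Stand-in for an expert MLP: expert e scales its input by (e + 1).
    return (e + 1) * x


tokens_per_rank = [[1.0, 2.0], [3.0, 4.0]]          # 2 ranks, 2 tokens each
routing = [[0, 3], [1, 2], [0, 1], [2, 3]]          # top-2 expert ids per global token
probs = [[0.5, 0.5], [0.25, 0.75], [0.5, 0.5], [0.75, 0.25]]

result = allgather_reducescatter_moe(tokens_per_rank, routing, probs, 2, toy_expert)
flat_tokens = [x for rank_tokens in tokens_per_rank for x in rank_tokens]
reference = reference_moe(flat_tokens, routing, probs, toy_expert)
print(result, reference)
```

Every buffer in this pattern has a shape that depends only on the total token count and not on how tokens were routed, which is the property that lets the exchange stay inside a CUDA graph. The trade-off is that each rank receives every token rather than only the tokens routed to its experts.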

Other Minor Changes

InferenceSpecProvider backend for wiring inference-optimized modules into model specs.
MoELayer dynamically swaps between standard and inference dispatchers based on is_inference_cuda_graphed_iteration.
Centralized NVLS eligibility checks (are_tensors_nvls_eligible, is_device_nvls_capable) shared across TP and EP communication paths.
Separate symmetric memory buffer pools for TP and EP (get_global_symmetric_memory_buffer_tp/ep).
Multi-tensor packing in symmetric memory buffers (maybe_get_tensors) with 16-byte alignment.
Kill switch via the --inference-disable-triton-nvls-kernels config flag, which makes the system fall back to NCCL.
Config validation: inference-optimized MoE rejects expert tensor parallelism, capacity-factor routing, and padded routing maps.
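As an illustration of the multi-tensor packing item above, the offset arithmetic for placing several tensors in one buffer at 16-byte boundaries might look like the following. This is a minimal sketch under assumptions: `align_up` and `pack_offsets` are hypothetical helpers, the byte sizes are made up, and the actual maybe_get_tensors implementation may differ.

```python
from typing import List, Tuple

ALIGNMENT = 16  # bytes; the alignment named in the PR description


def align_up(n: int, alignment: int = ALIGNMENT) -> int:
    """Round n up to the next multiple of alignment."""
    return (n + alignment - 1) // alignment * alignment


def pack_offsets(sizes_in_bytes: List[int]) -> Tuple[List[int], int]:
    """Return the aligned start offset of each tensor and the total bytes required."""
    offsets, cursor = [], 0
    for size in sizes_in_bytes:
        cursor = align_up(cursor)   # start each tensor on a 16-byte boundary
        offsets.append(cursor)
        cursor += size
    return offsets, align_up(cursor)


# Hypothetical byte sizes for three tensors sharing one buffer.
offsets, total = pack_offsets([10, 24, 100])
print(offsets, total)
```

A caller would compare `total` against the buffer capacity and use another path when the tensors do not fit. Whether the PR's helper behaves this way is an assumption here.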

How to enable?

Pass these flags:

--transformer-impl inference_optimized \
--moe-router-dtype fp32 # flashinfer only supports fp32 probabilities

⚠️ For major changes (either in lines of code or in its impact), please make sure to first share a design doc with the team. If you're unsure what's the best way to do so, contact the @mcore-oncall.

Contribution process

flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]

Pre-checks

I want this PR in a versioned release and have added the appropriate Milestone (e.g., Core 0.8)
I have added relevant unit tests
I have added relevant functional tests
I have added proper typing to my code (see the Typing guidelines)
I have added relevant documentation
I have run the autoformatter.sh on my PR

Code review

The following process is enforced via the CODEOWNERS file for changes into megatron/core. For changes outside of megatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch

Feel free to message or comment the @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

(Step 1): Add PR label `Expert Review`

(Step 2): Collect the expert reviewers' reviews

Attach the Expert Review label when your PR is ready for review.
GitHub auto-assigns expert reviewers based on your changes. They will get notified and pick up your PR soon.

⚠️ Only proceed to the next step once all reviewers have approved, merge conflicts are resolved, and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

(Step 3): Final Review

Add Final Review label
GitHub auto-assigns final reviewers based on your changes. They will get notified and pick up your PR soon.

(Optional Step 4): Cherry-pick into release branch

If this PR also needs to be merged into core_r* release branches, after this PR has been merged, select Cherry-pick to open a new PR into the release branch.

For MRs into `dev` branch

The proposed review process for `dev` branch is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

Merging your PR

Any member of core-adlr and core-nemo will be able to merge your PR.

yanring

Approve on behalf of Pingtian

svcnvidia-nemo-ci · 2026-03-03T01:21:40Z

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/22603788165

sidsingh-nvidia added 30 commits January 13, 2026 16:58

skeleton of inference moe layer done

0b64ce8

restore

da29281

Merge branch 'main' into siddharth/inference-optimized-moe-layer

75cfef8

match argument signature with training

6e01116

support gpt models like qwen

153265b

make torch grouped gemm work

7915cff

add config restraints for single GPU only and make dtoh and sync a nu…

8dd410d

…ll op

remove requirement for router fusion

b8f5fe5

confirm that this works with nccl all to alls

5063fb2

disable drop and pad for inference optimized, and propogate cuda grap…

297f926

…hed signal downwards

confirm that all-gather dispatch runs within cuda graphs

629dc1f

working

21b9140

replace permute/unpermute kernels with triton

a786eda

minor optimizations

f6ee32c

one round of optimizations

3f7f39d

reduce kernel calls

10da287

symmetric memory AG for hidden states

1983688

nvls all gathers for all three tensors. nvls rs on hidden state

02f315a

full model cg optimizations and bump up max blocks for blackwell

0fac929

Merge branch 'main' into inf-opt-all-gather-dispatcher

df930a2

Merge remote-tracking branch 'origin/main' into inf-opt-all-gather-di…

3606123

…spatcher

fix full model CG for mamba

371043c

remove requirement for moe permute fusion

01cd40f

failed attempt at optimizing router and permute

30d8cf3

tseted with qwen

6cb8a8a

add cutlass kernel

b85e8fe

optimize dummy forwards

98a4d9f

bugfix in inference router

acbc841

latest

986e2a1

return usage characteristics from text gen server

bb8890d

sidsingh-nvidia added 2 commits March 1, 2026 22:46

bugfix

ca75b2b

lint

a61aea5

copy-pr-bot Bot had a problem deploying to test March 2, 2026 07:34 Error

sidsingh-nvidia added 2 commits March 1, 2026 23:41

format and guard properly

4fd23ce

Merge branch 'main' into inf-opt-all-gather-dispatcher

fa25b1b

sidsingh-nvidia requested a review from buptzyb March 2, 2026 07:42

copy-pr-bot Bot temporarily deployed to test March 2, 2026 07:42 Inactive

yanring approved these changes Mar 2, 2026

View reviewed changes

buptzyb approved these changes Mar 2, 2026

View reviewed changes

santhnm2 approved these changes Mar 2, 2026

View reviewed changes

jaredcasper approved these changes Mar 2, 2026

View reviewed changes

Merge branch 'main' into inf-opt-all-gather-dispatcher

5ae4424

copy-pr-bot Bot temporarily deployed to test March 2, 2026 19:32 Inactive

sidsingh-nvidia added 2 commits March 2, 2026 12:16

Merge branch 'main' into inf-opt-all-gather-dispatcher

01ad0b2

fix comments

b1530a6

copy-pr-bot Bot temporarily deployed to test March 2, 2026 20:18 Inactive

Merge branch 'main' into inf-opt-all-gather-dispatcher

a498be3

copy-pr-bot Bot temporarily deployed to test March 2, 2026 22:05 Inactive

shanmugamr1992 approved these changes Mar 2, 2026

View reviewed changes

minor fix in unit test

1b6f066

sidsingh-nvidia enabled auto-merge March 2, 2026 23:41

copy-pr-bot Bot temporarily deployed to test March 2, 2026 23:42 Inactive

sidsingh-nvidia added this pull request to the merge queue Mar 3, 2026

Merged via the queue into NVIDIA:main with commit 7d1c016 Mar 3, 2026
117 checks passed

sidsingh-nvidia deleted the inf-opt-all-gather-dispatcher branch March 3, 2026 02:34

ilml added a commit to ilml/Megatron-LM that referenced this pull request Mar 20, 2026

Add new files from 7d1c016 Inference Optimized MoEs (NVIDIA#3496)

e6e48b7

New files: - megatron/core/transformer/moe/token_dispatcher_inference.py - tests/unit_tests/inference/test_moe_inference.py

yangbofun pushed a commit to xlm-research/Megatron-LM that referenced this pull request May 22, 2026

Inference Optimized MoEs (NVIDIA#3496)

818dd3a

sbhavani mentioned this pull request May 26, 2026

[ROADMAP][2026 Q2] Megatron Core Roadmap #4997

Open

sidsingh-nvidia merged 104 commits into NVIDIA:main from sidsingh-nvidia:inf-opt-all-gather-dispatcher

12 participants
