MoE ECHO: Unlocking Sync-Free, Full CUDA-Graph Support for Dropless MoE via Elastic Cloning by Victarry · Pull Request #2368 · NVIDIA/Megatron-LM

Victarry · 2025-11-24T07:10:28Z

MoE ECHO: Elastic Cloning for Hot Experts 🚀

Contributors (equal contribution, sorted alphabetically): Ahan Huang, Dennis Liu (@Victarry), Nan Zheng (@nanz-nv), Patrick Haft, Qi Zhang (@QiZhangNV), Robin Zhang (@buptzyb), Tong Liu (@Autumn1998), Zijie Yan (@yanring)

Overview 🔍

MoE Echo is a research prototype of a new MoE training paradigm for large-scale distributed training. It aims to make dropless MoE training load-balanced, free of host-side synchronization, and fully CUDA-graph-capturable.

Concretely, MoE Echo aims to:

⚖️ Reduce expert load imbalance across Expert Parallel (EP) ranks.
⏱️ Remove the host-side synchronization caused by dynamic routing in dropless MoE.
📊 Enable CUDA-graph-capturable MoE with minimal compute and memory fragmentation.

Sync-Free MoE ⚡

In token-dropless MoE, the number of tokens sent to each EP rank can vary significantly from step to step. The routing decisions (and thus the per-rank shapes) are produced on the GPU, but the host traditionally needs this shape information to:

launch dispatch/combine and grouped GEMM kernels, and
allocate sufficient memory for these kernels.

Naively, this requires device-to-host copies and host-side synchronization on every step, which both slows down training and makes CUDA graph capture difficult.

To build a sync-free, CUDA-graph-friendly MoE, we:

Pre-allocate GPU buffers and decide kernel launches without waiting for host-visible shapes.
Avoid excessive over-provisioning of buffers, which would otherwise cause:
- Compute fragmentation (wasted compute/communication on padded tokens).
- Memory fragmentation (oversized static buffers).

MoE Echo tackles this by:

Reducing compute fragmentation: GPU kernels consume routing/shape information that stays on device and operate only on the true token volume. For example, HybridEP reads shapes directly from the routing map on GPU.
Reducing memory fragmentation: We reduce load imbalance across EP ranks and manage memory more efficiently inside CUDA graphs so the pre-allocated buffers are better utilized.
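The link between load imbalance and buffer over-provisioning can be made concrete with a small sketch. It assumes, purely for illustration, that each EP rank gets one fixed receive buffer sized as a capacity factor times the average per-rank token count; the function name `static_buffer_report` and this sizing rule are hypothetical and are not taken from the PR's code.

```python
import math


def static_buffer_report(tokens_per_rank, capacity_factor):
    """Size one fixed receive buffer per EP rank and report how well it is used.

    Assumption (illustrative only): the buffer holds `capacity_factor` times
    the average number of tokens received per rank.
    """
    num_ranks = len(tokens_per_rank)
    average = sum(tokens_per_rank) / num_ranks
    buffer_size = math.ceil(capacity_factor * average)
    peak = max(tokens_per_rank)
    # Tokens beyond the buffer would not fit, so cap each rank at buffer_size.
    used = sum(min(t, buffer_size) for t in tokens_per_rank)
    return {
        "buffer_size": buffer_size,
        "imbalance": peak / average,   # 1.0 means perfectly balanced
        "fits": peak <= buffer_size,   # does the busiest rank fit?
        "utilization": used / (buffer_size * num_ranks),
    }


# Same total token count (5000), very different distribution across 4 ranks.
skewed = static_buffer_report([4000, 500, 300, 200], capacity_factor=2.0)
balanced = static_buffer_report([1300, 1250, 1250, 1200], capacity_factor=2.0)
tight = static_buffer_report([1300, 1250, 1250, 1200], capacity_factor=1.5)

for name, report in (("skewed", skewed), ("balanced", balanced), ("tight", tight)):
    print(name, report)
```

In this toy setting the skewed distribution overflows even a 2.0x buffer, while the balanced one still fits at 1.5x and leaves less of the buffer idle. That is the reason better load balance translates directly into smaller static buffers.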

Elastic Cloning for Hot Experts (ECHO)

To further reduce expert load imbalance, MoE Echo introduces elastic cloning for hot experts (ECHO). The key idea is to dynamically clone high-traffic (“hot”) experts onto EP ranks that receive fewer-than-average tokens.

Cloning experts during training is challenging because expert weights and gradients must remain coherent across all clones at every step. This means:

Synchronizing cloned expert parameters and gradients.
Carefully limiting the number of cloned experts to balance extra communication cost against the load-balance benefit.
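The coherence requirement follows from the fact that a gradient summed over all tokens equals the sum of the gradients over any partition of those tokens. The toy example below uses a one-parameter "expert" and a squared-error loss (both assumptions chosen for brevity, not the PR's model) to show that reducing the clone's gradient into the home expert reproduces the gradient of an uncloned expert.

```python
import math


def expert_grad(weight, tokens, targets):
    """Gradient of sum_i (weight * x_i - t_i)**2 with respect to weight."""
    return sum(2.0 * (weight * x - t) * x for x, t in zip(tokens, targets))


weight = 0.5
tokens = [1.0, 2.0, 3.0, 4.0]
targets = [1.0, 1.0, 2.0, 3.0]

# Reference: a single expert processes every token routed to it.
full_grad = expert_grad(weight, tokens, targets)

# Cloned: the home expert keeps two tokens, a clone on another rank takes two.
home_grad = expert_grad(weight, tokens[:2], targets[:2])
clone_grad = expert_grad(weight, tokens[2:], targets[2:])

# Reducing the clone's gradient into the home expert recovers the reference.
reduced_grad = home_grad + clone_grad
print(full_grad, reduced_grad)
```

The same identity only holds if every clone starts the step with identical weights, which is why cloned parameters must be synchronized as well as gradients.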

MoE Echo addresses this with:

An ECHO planner that decides which popular experts to clone and where to place them, given spare expert slots on each EP rank.
An ECHO dispatcher that:
- dispatches tokens to the appropriate cloned experts and their spare slots according to the plan, and
- during backward, handles any necessary re-dispatch when spare slots are shared across layers and then combines/reduces gradients from all cloned copies into the main expert.
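As a rough illustration of what a planner of this kind has to decide, here is a minimal greedy sketch. It is not the PR's planner (the commit log mentions `one_shot_greedy_assignment` and an approximate greedy bin-packing variant, neither of which is reproduced here). The function `plan_echo_clones`, the contiguous expert-to-rank placement, and the "move half the load gap" rule are all assumptions made for the example.

```python
def plan_echo_clones(tokens_per_expert, num_ranks, spare_slots_per_rank):
    """Greedily clone hot experts from the busiest rank to the idlest rank.

    Assumptions (illustrative only): experts are placed contiguously, so
    expert e lives on rank e // experts_per_rank, and the number of experts
    is divisible by num_ranks.
    """
    experts_per_rank = len(tokens_per_expert) // num_ranks
    # share[rank][expert] = number of tokens of `expert` processed on `rank`.
    share = [dict() for _ in range(num_ranks)]
    for expert, tokens in enumerate(tokens_per_expert):
        share[expert // experts_per_rank][expert] = tokens
    free_slots = [spare_slots_per_rank] * num_ranks
    plan = []  # entries: (expert, source_rank, target_rank, tokens_moved)

    while True:
        load = [sum(s.values()) for s in share]
        hot = max(range(num_ranks), key=load.__getitem__)
        candidates = [r for r in range(num_ranks) if r != hot and free_slots[r] > 0]
        if not candidates:
            break  # no spare slot left anywhere else
        cold = min(candidates, key=load.__getitem__)
        expert = max(share[hot], key=share[hot].get)  # hottest expert on hot rank
        moved = min(share[hot][expert], (load[hot] - load[cold]) // 2)
        if moved <= 0 or expert in share[cold]:
            break  # nothing to gain, or the target already holds this expert
        share[hot][expert] -= moved
        share[cold][expert] = moved
        free_slots[cold] -= 1  # each clone occupies one spare slot
        plan.append((expert, hot, cold, moved))

    return plan, [sum(s.values()) for s in share]


# 8 experts on 4 EP ranks, one spare slot per rank, expert 0 is very hot.
tokens_per_expert = [4000, 400, 300, 100, 100, 50, 30, 20]
plan, loads = plan_echo_clones(tokens_per_expert, num_ranks=4, spare_slots_per_rank=1)
print(plan)
print(loads)
```

With these inputs the initial per-rank loads are 4400, 400, 150 and 50 tokens. After three clones of expert 0 the busiest rank handles 1313 tokens against an average of 1250. A real planner must also weigh the cost of copying weights and reducing gradients for each clone, which this sketch ignores.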

Quick Start 🏁

Install Dependencies

HybridEP

git clone https://github.com/deepseek-ai/DeepEP.git
cd DeepEP && git checkout hybrid-ep
TORCH_CUDA_ARCH_LIST="10.0" pip install -e .

Device-initiated grouped GEMM

Note that this kernel is only available for Blackwell GPUs.

git clone https://github.com/QiZhangNV/TransformerEngine.git
cd TransformerEngine && git checkout cutlass_device_grouped_gemm
git submodule update --init --recursive
NVTE_CUDA_ARCHS="100a" NVTE_BUILD_THREADS_PER_JOB=8 NVTE_FRAMEWORK=pytorch pip install --no-cache-dir --no-build-isolation .

Run MoE Echo ▶️

Add the following flags to the command line to enable Echo for your training:

--moe-enable-echo
--moe-num-echo-experts 32 # total number of echo experts
--moe-echo-expert-dispatcher-type hybridep # only hybridep supports sync-free dispatch
--moe-received-token-capacity 2.0 # capacity for the total tokens received on each EP rank (if not set, the synchronous version is used)
--moe-use-device-initiated-grouped-gemm # use device-initiated grouped GEMM (only available for MXFP8 GEMM on Blackwell GPUs)
--fp8-format e4m3
--fp8-recipe mxfp8
--fp8-param-gather
--reuse-grad-buf-for-mxfp8-param-ag
--moe-echo-recompute-expert-dispatch # recompute expert dispatch so that the echo expert buffer can be shared across layers
# Enable CUDA Graph
--enable-cuda-graph
--cuda-graph-scope full_iteration
--te-rng-tracker

Roadmap 🗺️

copy-pr-bot · 2025-11-24T07:10:31Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

- Updated `TransformerConfig` to include new parameters for echo experts and offloading capabilities.
- Introduced `one_shot_greedy_assignment` function for efficient token assignment in offloading planner.
- Added tests for echo experts and offloading planner to ensure functionality and performance.
- Adjusted existing functions to support new metadata handling for echo expert dispatching.

This commit improves load balancing and reduces communication overhead in MoE models.


yuguo-Jack · 2026-01-21T10:19:13Z

Why does the DeepEP backend not support sync-free dispatch?

Victarry added the "dev branch" label (Dev branch related issues and development) Nov 24, 2025

yanring changed the title ~~[Draft] MoE Echo: Elastic Cloning for Hot Experts~~ [Draft] MoE Echo: Unlocking Sync-Free, Full CUDA-Graph Support for Dropless MoE via Elastic Cloning Nov 24, 2025

yanring changed the title ~~[Draft] MoE Echo: Unlocking Sync-Free, Full CUDA-Graph Support for Dropless MoE via Elastic Cloning~~ MoE Echo: Unlocking Sync-Free, Full CUDA-Graph Support for Dropless MoE via Elastic Cloning Nov 24, 2025

yanring changed the title ~~MoE Echo: Unlocking Sync-Free, Full CUDA-Graph Support for Dropless MoE via Elastic Cloning~~ MoE ECHO: Unlocking Sync-Free, Full CUDA-Graph Support for Dropless MoE via Elastic Cloning Nov 24, 2025

Victarry added 3 commits November 27, 2025 03:11

Fix 1F1B overlap + hybridep.

4a0aa9d

Fix recompute.

c12b26c

yanring mentioned this pull request Dec 3, 2025

[ROADMAP][Updated on April 07] Megatron Core MoE Roadmap #1729

Open

48 tasks

Victarry added 2 commits December 21, 2025 22:33

Fix dist ckpt.

25fd58c

Add echo data dump.

843ce7d

Victarry force-pushed the moe_hot_expert_poc branch from 95ee60c to 843ce7d Compare December 24, 2025 06:36

nanz-nv and others added 9 commits December 24, 2025 16:10

Update planner to use approximate greedy binpacking

00f5a43

Delete outdated and unused files

throw exception approximate bin packing is used with more than 1 echo expert slot

3828803

Update to latest hybridep.

31456c0

Increase the weight chunk size

724b6fe

Merge pull request #2 from nanz-nv/moe_hot_expert_poc

1d48d5f

Change defualt offloading algorithm and add torch.compile to gen_offloading_plan.

d8232ad

Fix dist ckpt load shape mismatch.

08c335f

Prevent proprocess of expert dispatch in the backward.

bf233a6

Fix mismatch handle of fc1 expert dispatch and fc2 expert dispatch.

51cf207

Victarry force-pushed the moe_hot_expert_poc branch 3 times, most recently from 44ac439 to cb82c5b Compare January 16, 2026 05:59

Fix missing backward in expert dispatch and make gradient accumulation in the expert dispatch.

78a1fc1

Victarry force-pushed the moe_hot_expert_poc branch from cb82c5b to 78a1fc1 Compare January 16, 2026 06:02

Handle recompute for different cases.

a2b16b8

Victarry mentioned this pull request May 15, 2026

[ROADMAP][2026 Q2] Megatron Core MoE Roadmap #4815

Open

71 tasks


Victarry wants to merge 16 commits into NVIDIA:dev from Victarry:moe_hot_expert_poc


4 participants
