feat(moe): Fine-grained activation offloading #1913

Merged

ko3n1g merged 71 commits into NVIDIA:main from lhb8125:hongbinl/activation_offloading_github_main on Jan 15, 2026

lhb8125 commented Oct 24, 2025 (edited)

Contributor

What does this PR do?

PR for dev branch

Memory capacity is becoming increasingly important with the rise of extremely sparse MoE models such as DeepSeek-V3 and Qwen3-235B. Fine-grained recomputation reduces the memory footprint at the cost of extra compute, while offloading can use host-device bandwidth to achieve nearly zero overhead.

The current CPU offloading strategy in TE (Transformer Engine) works at the layer level: it offloads activations at the granularity of a whole transformer layer, which is too coarse to single out the activations that consume the most memory.

Fine-grained activation offloading instead offloads activations at the granularity of specific modules, so that the amount of offloaded activation memory can be tuned to maximize training throughput.

Compared with the current CPU offloading strategy provided by TE, this PR has several advantages:

Supports PP=1, PP, and VPP;
Supports MoE models;
Lets you manually select the modules with a large memory footprint for offloading;
Works with fine-grained recomputation to reduce total activation memory as much as possible.

How does fine-grained offloading work with fine-grained recomputation?

For modules with minor performance overhead, such as layernorm or moe_act, use recomputation to reduce the memory footprint;
For other modules, use offloading to reduce the memory footprint;
Make sure that offloading and reloading can be overlapped with computation.
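The split described above can be sketched as a small planning function. This is an illustration only, not the PR's implementation: `ModuleProfile`, `transfer_ms`, `plan_activation_strategy`, the 0.1 ms recompute threshold, the 50 GB/s link bandwidth, and every number in the example profiles are assumptions made up for the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModuleProfile:
    """Hypothetical per-module measurements; none of these fields come from the PR."""
    name: str
    activation_mb: float      # activation size kept for the backward pass
    recompute_ms: float       # time to recompute the module's forward pass
    overlap_window_ms: float  # compute time available to hide the copy


def transfer_ms(activation_mb: float, bandwidth_gb_per_s: float) -> float:
    """Time to copy an activation over the host-device link, in milliseconds."""
    return activation_mb / 1024.0 / bandwidth_gb_per_s * 1000.0


def plan_activation_strategy(modules, bandwidth_gb_per_s, max_recompute_ms=0.1):
    """Assign each module to 'recompute', 'offload', or 'keep'."""
    plan = {}
    for m in modules:
        if m.recompute_ms <= max_recompute_ms:
            # Cheap to redo: drop the activation and recompute it in backward.
            plan[m.name] = "recompute"
        elif transfer_ms(m.activation_mb, bandwidth_gb_per_s) <= m.overlap_window_ms:
            # The copy fits under concurrent compute, so it can be hidden.
            plan[m.name] = "offload"
        else:
            # The copy would stall compute: keep the activation on the device.
            plan[m.name] = "keep"
    return plan


# Made-up profiles; only the module names layernorm, moe_act and expert_fc1
# appear in the PR description.
profiles = [
    ModuleProfile("layernorm", 64.0, 0.05, 2.0),
    ModuleProfile("moe_act", 256.0, 0.08, 2.0),
    ModuleProfile("expert_fc1", 512.0, 1.5, 12.0),
    ModuleProfile("hypothetical_large_module", 2048.0, 3.0, 4.0),
]

plan = plan_activation_strategy(profiles, bandwidth_gb_per_s=50.0)
for name, strategy in plan.items():
    print(f"{name}: {strategy}")
```

The third branch is the reason the overlap condition matters: an activation whose copy cannot be hidden behind compute is better left on the device, or handled by recomputation if its cost is acceptable.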

Benchmark

DeepSeek-V3-proxy on H100

Setup

Layer parameters are the same as in the DeepSeek-V3 model
The number of layers is cut to 14
The first 3 dense layers are replaced with 3 MoE layers
TP1PP4EP16VPP1CP1-MBS1GBS512, bf16 training
Offload expert_fc1, moe_act, act_norm and mlp_norm

Results

| Configuration | Throughput (TFlops) | Max reserved memory (MB) |
| --- | --- | --- |
| Baseline | 321 | 74306 |
| Offload expert_fc1, moe_act, layernorm | 315 | 61046 |

DeepSeek-V3 on GB200 (from @hongxiaob)

Setup

Same model structure as DeepSeek-V3, but without MTP
TP1PP8EP32CP1VPP4-MBS1GBS2048, mxfp8
Offload moe_act

Results

| Configuration | Throughput (TFlops) | Max reserved memory (MB) |
| --- | --- | --- |
| Baseline | 945 | 169094 |
| Offload moe_act | 930 | 151054 |
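To put the two result tables on a common scale, the relative throughput cost and memory saving can be computed directly from the reported numbers. The `Run` and `tradeoff` names below are hypothetical helpers for this calculation, not part of the PR; the input values are copied from the tables above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    throughput_tflops: float
    max_reserved_mb: int


def tradeoff(baseline: Run, offloaded: Run) -> dict:
    """Relative throughput cost and memory saving of an offloaded run."""
    drop = baseline.throughput_tflops - offloaded.throughput_tflops
    saved = baseline.max_reserved_mb - offloaded.max_reserved_mb
    return {
        "throughput_drop_pct": 100.0 * drop / baseline.throughput_tflops,
        "memory_saved_mb": saved,
        "memory_saved_pct": 100.0 * saved / baseline.max_reserved_mb,
    }


# Values taken from the two benchmark tables in the PR description.
h100 = tradeoff(Run(321, 74306), Run(315, 61046))
gb200 = tradeoff(Run(945, 169094), Run(930, 151054))

for label, r in (("H100 proxy", h100), ("GB200", gb200)):
    print(
        f"{label}: -{r['throughput_drop_pct']:.1f}% throughput, "
        f"-{r['memory_saved_pct']:.1f}% reserved memory "
        f"({r['memory_saved_mb']} MB)"
    )
```

On these numbers, offloading costs about 1.9% of throughput for 17.8% less reserved memory (13260 MB) on the H100 proxy, and about 1.6% of throughput for 10.7% less reserved memory (18040 MB) on GB200.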

lhb8125 requested review from a team as code owners

October 24, 2025 03:14

lhb8125 self-assigned this

lhb8125 added the Expert Review label

lhb8125 added this to the Core 0.15 milestone

lhb8125 commented Oct 24, 2025

Contributor Author

/ok to test 9520f77

lhb8125 commented Oct 24, 2025

Contributor Author

/ok to test e26d092

lhb8125 commented Jan 12, 2026

Contributor Author

/ok to test 483d87a

lhb8125 mentioned this pull request

[Dev]feat(moe): code refactor for fine grained activation offloading #2905

Merged

6 tasks

lhb8125 commented Jan 12, 2026

Contributor Author

/ok to test 483d87a


Merge branch 'main' into hongbinl/activation_offloading_github_main

a498067

lhb8125 commented Jan 12, 2026

Contributor Author

/ok to test a498067

fanshiqing reviewed

megatron/core/models/gpt/gpt_model.py Outdated

fanshiqing reviewed

megatron/core/pipeline_parallel/utils.py

fanshiqing reviewed

megatron/core/transformer/attention.py Outdated

fanshiqing reviewed

megatron/core/transformer/moe/README.md Outdated

lhb8125 and others added 3 commits

January 13, 2026 17:58


fix doc

2f42e91

Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

code refactor

469cef0

Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

Merge branch 'main' into hongbinl/activation_offloading_github_main

c1fdba4

lhb8125 commented Jan 14, 2026

Contributor Author

/ok to test c1fdba4

lhb8125 and others added 3 commits

January 14, 2026 01:48


remove group_start() calls

b93d212

Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

format

16d4114

Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

Merge branch 'main' into hongbinl/activation_offloading_github_main

lhb8125 commented Jan 14, 2026

Contributor Author

/ok to test 1964268


add comments

Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

lhb8125 commented Jan 14, 2026

Contributor Author

/ok to test 6263630


fix min_offload_size and update golden values

9423c6b

Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

lhb8125 commented Jan 14, 2026

Contributor Author

/ok to test 9423c6b

fanshiqing approved these changes

fanshiqing left a comment

Member

LGTM.

kvareddy approved these changes

lhb8125 and others added 2 commits

January 15, 2026 01:48


rename group_commit

cc28dd7

Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

Merge branch 'main' into hongbinl/activation_offloading_github_main

b7153fa

lhb8125 commented Jan 15, 2026

Contributor Author

/ok to test b7153fa


fix test_mamba_moe_model.py

871bdaf

Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

lhb8125 commented Jan 15, 2026

Contributor Author

/ok to test 871bdaf

ericharper approved these changes

ananthsub mentioned this pull request

[sync] Fine-grained activation offloading NVIDIA-NeMo/Megatron-Bridge#2122

Merged

5 tasks

Reviewers

jaredcasper left review comments

deepakn94 left review comments

fanshiqing approved these changes

kvareddy approved these changes

ericharper approved these changes

sanandaraj5597: awaiting requested review

Skylion007 left review comments

pablo-garay approved these changes

yanring approved these changes

Labels

complexity: high, dev2main: mbridge, Expert Review, Final Review