feat(moe): Fine-grained activation offloading #1913

Merged
ko3n1g merged 71 commits into NVIDIA:main from lhb8125:hongbinl/activation_offloading_github_main
Jan 15, 2026

lhb8125 commented Oct 24, 2025

Contributor

What does this PR do?

PR for dev branch

Memory capacity is becoming increasingly important with the rise of extremely sparse MoE models such as DeepSeek-V3 and Qwen3-235B. Fine-grained recomputation reduces the memory footprint at the cost of extra computation, while offloading can use the host-device bandwidth to achieve nearly zero overhead.

The current CPU offloading strategy from TE works at the layer level: it offloads activations at the granularity of a whole transformer layer, which is too coarse to target the activations that consume the most memory.

Fine-grained activation offloading instead offloads activations at the granularity of specific modules, so that the amount of offloaded activation can be calibrated to maximize training throughput.

Design Doc

Compared with the current CPU offloading strategy provided by TE, this PR has several advantages:

  • supports PP=1/PP/VPP;
  • supports MoE models;
  • lets users manually specify which modules with a large memory footprint to offload;
  • works with fine-grained recomputation to reduce the total activation memory as much as possible.

How does fine-grained offloading work with fine-grained recomputing?

  • For modules that are cheap to recompute, such as layernorm or moe_act, use recomputation to reduce the memory footprint;
  • For other modules, use offloading to reduce the memory footprint;
  • Make sure the offloading/reloading can be overlapped with computation.
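
The split between recomputation and offloading described above can be sketched as a small planning function. This is only an illustration under stated assumptions, not the PR's implementation: `ModuleProfile`, `plan_activation_strategy`, the threshold parameter, and every number in the usage lines are hypothetical. The sketch assumes that each module's activation size and recompute cost have been measured, recomputes the modules whose recompute cost is below a threshold, offloads the rest, and checks whether the resulting copy time fits inside the compute time available for overlap.

```python
from dataclasses import dataclass


@dataclass
class ModuleProfile:
    """Hypothetical per-module measurements for one transformer layer."""
    name: str
    activation_bytes: int   # activation memory kept for the backward pass
    recompute_ms: float     # extra forward time if the module is recomputed


def plan_activation_strategy(modules, recompute_budget_ms,
                             link_gb_per_s, layer_compute_ms):
    """Assign 'recompute' or 'offload' to each module (illustrative only).

    Modules that are cheap to recompute are recomputed; all others are
    offloaded. Returns the plan, the estimated copy time in milliseconds,
    and whether that copy time can hide behind the layer's compute time.
    """
    plan = {}
    offload_bytes = 0
    for m in modules:
        if m.recompute_ms <= recompute_budget_ms:
            plan[m.name] = "recompute"
        else:
            plan[m.name] = "offload"
            offload_bytes += m.activation_bytes
    # bytes / (GB/s * 1e9 bytes per GB) gives seconds; * 1e3 gives ms
    transfer_ms = offload_bytes / (link_gb_per_s * 1e9) * 1e3
    return plan, transfer_ms, transfer_ms <= layer_compute_ms


# Hypothetical profile: sizes and timings are made up for illustration.
modules = [
    ModuleProfile("layernorm", 64 * 2**20, 0.05),
    ModuleProfile("moe_act", 256 * 2**20, 0.10),
    ModuleProfile("expert_fc1", 512 * 2**20, 4.0),
]
plan, transfer_ms, overlapped = plan_activation_strategy(
    modules, recompute_budget_ms=0.2, link_gb_per_s=50.0, layer_compute_ms=20.0
)
print(plan, round(transfer_ms, 3), overlapped)
```

With these made-up inputs, only `expert_fc1` is offloaded, and its estimated copy time stays below the layer compute time, so the transfer could in principle be hidden. In practice the threshold and the set of offloaded modules would be tuned against measured throughput.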

Benchmark

DeepSeek-V3-proxy on H100

Setup
  • Layer parameters are the same as in the DeepSeek-V3 model
  • The number of layers is cut down to 14
  • The first 3 dense layers are replaced with 3 MoE layers
  • TP1PP4EP16VPP1CP1-MBS1GBS512, bf16 training
  • Offload expert_fc1, moe_act, act_norm and mlp_norm
Results
| Configuration | Throughput (TFLOPS) | Max reserved memory (MB) |
| --- | --- | --- |
| Baseline | 321 | 74306 |
| Offload expert_fc1,moe_act,layernorm | 315 | 61046 |

DeepSeek-V3 on GB200 (from @hongxiaob)

Setup
  • Same model structure as DeepSeek-V3, but without MTP
  • TP1PP8EP32CP1VPP4-MBS1GBS2048, mxfp8
  • Offload moe_act
Results
| Configuration | Throughput (TFLOPS) | Max reserved memory (MB) |
| --- | --- | --- |
| Baseline | 945 | 169094 |
| Offload moe_act | 930 | 151054 |
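
To make the trade-off in the two benchmarks easier to compare, the relative changes can be computed directly from the reported numbers. The snippet below is a small arithmetic aid; the helper name `relative_reduction` is introduced here for illustration, and the inputs are copied from the results above.

```python
def relative_reduction(baseline, value):
    """Return how much `value` is below `baseline`, in percent."""
    return (baseline - value) / baseline * 100.0


# (throughput in TFLOPS, max reserved memory in MB), taken from the results above
results = {
    "DeepSeek-V3-proxy on H100": {"baseline": (321, 74306), "offload": (315, 61046)},
    "DeepSeek-V3 on GB200": {"baseline": (945, 169094), "offload": (930, 151054)},
}

summary = {}
for name, r in results.items():
    base_tput, base_mem = r["baseline"]
    off_tput, off_mem = r["offload"]
    tput_loss = round(relative_reduction(base_tput, off_tput), 1)
    mem_saved = round(relative_reduction(base_mem, off_mem), 1)
    summary[name] = (tput_loss, mem_saved)
    print(f"{name}: throughput -{tput_loss}%, max reserved memory -{mem_saved}%")
```

On these numbers, offloading costs about 1.9% of throughput for a 17.8% reduction in max reserved memory on H100, and about 1.6% of throughput for a 10.7% reduction on GB200.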

lhb8125 requested review from a team as code owners October 24, 2025 03:14

copy-pr-bot Bot commented Oct 24, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

lhb8125 self-assigned this Oct 24, 2025
lhb8125 added the Expert Review label Oct 24, 2025
lhb8125 added this to the Core 0.15 milestone Oct 24, 2025
lhb8125 commented Oct 24, 2025

Contributor Author

/ok to test 9520f77

lhb8125 commented Oct 24, 2025

Contributor Author

/ok to test e26d092

lhb8125 commented Jan 12, 2026

Contributor Author

/ok to test 483d87a

lhb8125 commented Jan 12, 2026

Contributor Author

/ok to test 483d87a

lhb8125 commented Jan 12, 2026

Contributor Author

/ok to test a498067

Comment thread megatron/core/models/gpt/gpt_model.py Outdated
Comment thread megatron/core/pipeline_parallel/utils.py
Comment thread megatron/core/transformer/attention.py Outdated
Comment thread megatron/core/transformer/moe/README.md Outdated
lhb8125 and others added 3 commits January 13, 2026 17:58
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
lhb8125 commented Jan 14, 2026

Contributor Author

/ok to test c1fdba4

lhb8125 and others added 3 commits January 14, 2026 01:48
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
lhb8125 commented Jan 14, 2026

Contributor Author

/ok to test 1964268

Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
lhb8125 commented Jan 14, 2026

Contributor Author

/ok to test 6263630

Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
lhb8125 commented Jan 14, 2026

Contributor Author

/ok to test 9423c6b

fanshiqing left a comment

Member

LGTM.

lhb8125 commented Jan 15, 2026

Contributor Author

/ok to test b7153fa

Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
lhb8125 commented Jan 15, 2026

Contributor Author

/ok to test 871bdaf

Labels

  • complexity: high
  • dev2main: mbridge (dev to main: this PR is needed in main for mbridge)
  • Expert Review ([deprecated] Apply this label to indicate that your PR is ready for expert review.)
  • Final Review (PR is in the "final review" stage)
