[Dev] Paged Stashing #2690
Merged
QiZhangNV
reviewed
Dec 19, 2025
vasunvidia
reviewed
Jan 7, 2026
vasunvidia
reviewed
Jan 7, 2026
vasunvidia
reviewed
Jan 7, 2026
vasunvidia
reviewed
Jan 7, 2026
jianyuh
reviewed
Jan 19, 2026
hxbai
reviewed
Mar 24, 2026
cursor Bot
pushed a commit
to AMD-AGI/Primus
that referenced
this pull request
Apr 17, 2026
…dule
This change completes the paged-stashing feature work:
1. Ports NVIDIA/Megatron-LM PR #2690 (Paged Stashing) as a runtime patch so
that the feature can be enabled on top of stock Megatron-LM without
keeping a fork of third_party/Megatron-LM.
- Adds primus/backends/megatron/core/transformer/moe/paged_stash.py, a
copy of the upstream paged_stash.py module that is installed into
megatron.core.transformer.moe.paged_stash at runtime.
- Adds primus/backends/megatron/patches/moe_patches/paged_stash_patches.py,
a before_train patch (gated on --moe_paged_stash) which wires all the
integration points: TransformerConfig fields, FullCudaGraphWrapper
extensions, _HybridEPManager over-budget tracking, MoEFlexTokenDispatcher
check/reset_over_budget helpers, TEGroupedMLP.forward paged-stash
context, pipeline-schedule paged_stash_reset calls,
GPTModel.preprocess_for_paged_stash, and PagedStashRunner injection via
get_forward_backward_func.
2. Resets third_party/Megatron-LM to d3528a2 (Primus main baseline) and
redirects the submodule back to NVIDIA/Megatron-LM upstream, matching the
Primus main branch configuration.
Reference: NVIDIA/Megatron-LM#2690
Co-authored-by: zhenhuang12 <zhenhuang12@users.noreply.github.com>
nanz-nv
added a commit
to vasunvidia/Megatron-LM
that referenced
this pull request
May 19, 2026
Co-authored-by: Qi Zhang <qizhang@nvidia.com> Co-authored-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> Co-authored-by: a <a> Co-authored-by: tongliu <tongliu@nvidia.com>
nanz-nv
added a commit
to vasunvidia/Megatron-LM
that referenced
this pull request
May 20, 2026
Co-authored-by: Qi Zhang <qizhang@nvidia.com> Co-authored-by: Vasudevan Rengasamy <vrengasamy@nvidia.com> Co-authored-by: a <a> Co-authored-by: tongliu <tongliu@nvidia.com>
Main PR: #4247
Main contributors (Equal Contribution, sorted alphabetically): Nan Zheng (@nanz-nv), Vasudevan Rengasamy (@vasunvidia)
Other contributors (sorted alphabetically): Dennis Liu (@Victarry), Hongbin Liu (@lhb8125), Qi Zhang (@QiZhangNV), Robin Zhang (@buptzyb), Tong Liu (@Autumn1998), Zijie Yan (@yanring)
Background
In token-dropless MoE training, the number of tokens received by each expert can vary, which results in dynamically shaped tensors. PyTorch supports dynamically shaped tensors naturally thanks to its eager-mode execution: a tensor is created lazily once its shape is known at run time. Although this works well in eager mode, dynamically shaped tensors pose a challenge for CUDA graphs, because the size of a tensor cannot be adjusted at run time without the intervention of the host. To remove the host sync and enable CUDA graphs, one solution is to oversize the buffers in the expert part. This, however, causes significantly higher memory consumption than the eager-mode baseline, in the form of memory fragmentation.
Idea overview
To address this problem, paged stashing decouples the need for oversized buffers for compute from the need for a properly sized buffer that stores activations for the backward pass. It does so by adding one level of indirection: stashing and restoring. The stash operation copies the activation from the oversized static buffer to a pre-allocated stashing buffer once the forward pass of that module is done, and the restore operation does the reverse during the backward pass.
The key to saving memory is that the stash operation packs the variable-size activations into a contiguous stashing buffer, which reduces memory fragmentation. For simple schedules where activation allocation and deallocation follow a first-in-last-out pattern, stash and restore can be done easily in a bump-allocation manner. To accommodate more complicated schedules, e.g. pipeline parallelism, paging can be used, hence the name paged stashing.
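The bump-allocation case can be illustrated with a minimal, CPU-only sketch. It is not the code from this PR: the class `BumpStash`, its methods, and the use of plain Python lists in place of GPU tensors are all assumptions made for illustration. It shows only how variable-length activations from oversized buffers pack back to back and come out again in first-in-last-out order.

```python
class BumpStash:
    """Hypothetical LIFO stash: packs variable-length chunks into one flat buffer."""

    def __init__(self, capacity):
        self.buffer = [0.0] * capacity  # pre-allocated stashing buffer
        self.top = 0                    # bump pointer: index of the next free slot
        self.sizes = []                 # lengths of the stashed chunks, in stash order

    def stash(self, static_buf, num_valid):
        """Copy the first num_valid entries of an oversized buffer into the stash."""
        if self.top + num_valid > len(self.buffer):
            raise MemoryError("stash buffer over budget")
        self.buffer[self.top:self.top + num_valid] = static_buf[:num_valid]
        self.top += num_valid
        self.sizes.append(num_valid)

    def restore(self, static_buf):
        """Copy the most recently stashed chunk back into an oversized buffer."""
        num_valid = self.sizes.pop()
        self.top -= num_valid
        static_buf[:num_valid] = self.buffer[self.top:self.top + num_valid]
        return num_valid


MAX_TOKENS = 8  # worst-case (oversized) capacity of each static compute buffer

stash = BumpStash(capacity=8)
layer0 = [1.0, 2.0, 3.0] + [0.0] * 5  # only 3 of 8 slots hold real activations
layer1 = [4.0, 5.0] + [0.0] * 6       # only 2 of 8 slots hold real activations

# Forward pass: stash in layer order.
stash.stash(layer0, 3)
stash.stash(layer1, 2)
used_after_forward = stash.top  # 5 slots used instead of 2 * MAX_TOKENS

# Backward pass: restore in reverse layer order.
out = [0.0] * MAX_TOKENS
n1 = stash.restore(out)
restored1 = out[:n1]
n0 = stash.restore(out)
restored0 = out[:n0]
```

In this toy setup the stash holds 5 elements after the forward pass, whereas keeping both oversized buffers alive would hold 16.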
Page management
To accommodate complex scheduling such as that needed in pipeline parallelism, activations are partitioned into pages, and lightweight GPU memory management kernels are in charge of allocating and deallocating pages for stashing. These kernels can be fused with the stash/restore GPU kernels. A freelist, implemented as a circular buffer, is maintained for each type of page.
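The freelist idea can be sketched on the CPU as follows. This is an illustrative assumption, not the GPU kernels from this PR: the class `PagedStash`, its field names, and the single page type are hypothetical, and Python lists stand in for device memory. The point is that a circular-buffer freelist lets pages be released and reused in any order, which a bump allocator cannot do.

```python
class PagedStash:
    """Hypothetical paged stash with one circular-buffer freelist (one page type)."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.pages = [[0.0] * page_size for _ in range(num_pages)]
        self.free_ring = list(range(num_pages))  # circular buffer of free page ids
        self.head = 0              # slot from which the next free page is popped
        self.tail = 0              # slot into which the next released page is pushed
        self.num_free = num_pages  # the ring starts full

    def _alloc_page(self):
        page_id = self.free_ring[self.head]
        self.head = (self.head + 1) % len(self.free_ring)
        self.num_free -= 1
        return page_id

    def _free_page(self, page_id):
        self.free_ring[self.tail] = page_id
        self.tail = (self.tail + 1) % len(self.free_ring)
        self.num_free += 1

    def stash(self, static_buf, num_valid):
        """Split the valid prefix of an oversized buffer into pages; return a handle."""
        pages_needed = -(-num_valid // self.page_size)  # ceiling division
        if pages_needed > self.num_free:
            raise MemoryError("not enough free pages")
        page_ids = []
        for start in range(0, num_valid, self.page_size):
            chunk = static_buf[start:min(start + self.page_size, num_valid)]
            page_id = self._alloc_page()
            self.pages[page_id][:len(chunk)] = chunk
            page_ids.append(page_id)
        return page_ids, num_valid

    def restore(self, handle, static_buf):
        """Copy pages back into an oversized buffer and release them to the freelist."""
        page_ids, num_valid = handle
        for i, page_id in enumerate(page_ids):
            start = i * self.page_size
            n = min(self.page_size, num_valid - start)
            static_buf[start:start + n] = self.pages[page_id][:n]
            self._free_page(page_id)
        return num_valid


pool = PagedStash(num_pages=4, page_size=2)
handle_a = pool.stash([1.0, 2.0, 3.0, 0.0, 0.0, 0.0], 3)  # takes pages 0 and 1
handle_b = pool.stash([4.0, 5.0, 0.0, 0.0, 0.0, 0.0], 2)  # takes page 2

# Restore out of LIFO order: a comes back before b, as a pipeline schedule may require.
out = [0.0] * 6
restored_a = out[:pool.restore(handle_a, out)]

# The released pages are immediately reusable by a later stash.
handle_c = pool.stash([6.0, 7.0, 8.0, 9.0, 10.0, 0.0], 5)  # takes pages 3, 0 and 1
restored_c = out[:pool.restore(handle_c, out)]
restored_b = out[:pool.restore(handle_b, out)]
```

The sketch checks the page budget before allocating so that a failed stash leaves the freelist unchanged. How the actual kernels handle an over-budget condition is not shown here.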
CPU offloading
Paged stashing naturally supports offloading. When the stashing buffer is a pinned CPU tensor, the activation is offloaded to the host memory during forward and is reloaded to the GPU during backward.
Furthermore, the page management system can easily be extended to accommodate partial or on-demand offloading. This feature is currently WIP.
Scheduling
Overlapping stash and restore operations with compute can be implemented by inserting two autograd functions, a pre-scheduler before the expert compute layer and a post-scheduler after it, which schedule the stash and restore operations. The roles of these autograd functions are enumerated below:
Additionally, in the case of pipeline parallelism, this can be used to record the pipeline schedule during the first iteration.
Wait for the restore operation of the current layer to complete. Additionally, in the case of pipeline parallelism, this can be used to record the pipeline schedule during the first iteration.