[Dev][feat] Support CUDA Graph capture offloading modules #3219

Merged: lhb8125 merged 125 commits into NVIDIA:dev from lhb8125:hongbinl/activation_offloading_refactor_cuda_graph on Mar 30, 2026

@lhb8125 lhb8125 commented Feb 3, 2026

What does this PR do?

PR to main branch

This PR enables Fine-Grained Activation Offloading to work seamlessly with Transformer Engine CUDA Graph capture and replay. Previously, these two features were mutually exclusive — CUDA Graph captures a fixed sequence of GPU operations, while activation offloading involves dynamic D2H/H2D memory copies that conflict with graph semantics. This PR resolves the conflict by introducing dedicated CUDA stream/event synchronization and an optional deferred-commit strategy.
Scope: 18 files changed, +944 / -259 lines

Key Changes

1. In-Graph Offload Synchronization (transformer_layer.py)

  • _te_cuda_graph_capture(): When offload_module_in_cuda_graph=True, inserts backward_record() at the sub-graph entry (synchronizes the compute stream with the H2D reload stream during backward) and calls forward_record() at the sub-graph exit (synchronizes the compute stream with the D2H offload stream during forward).
  • _te_cuda_graph_replay(): Supports the delay_offload_until_cuda_graph mode — during replay, enter_replay() / exit_replay() cause offload groups to be enqueued without immediate execution; after replay, flush_delayed_groups() issues the D2H copies during the CPU-idle window between graph launch and subsequent communication.

2. PipelineOffloadManager Extensions (fine_grained_activation_offload.py)

  • Added cuda_graph_stream / cuda_graph_event (external event) dedicated to synchronizing in-graph captured modules with the D2H/H2D offload streams.
  • Deferred offload commit: FineGrainedOffloadingGroupCommitFunction pushes offload groups into a queue when delay_offload=True and the manager is in replay state; flush_delayed_groups() drains the queue in batch during CPU-idle gaps.
  • Warmup hook integration: pre_warmup_hook / post_warmup_hook temporarily disable/enable offloading around TE's warmup phase to avoid state-machine conflicts.
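The deferred-commit and warmup-hook behaviour described above can be illustrated with a small pure-Python model. This is only a sketch: `ToyOffloadManager` and its fields are hypothetical stand-ins, not the real `PipelineOffloadManager`, and "issuing a D2H copy" is modelled as appending to a list. Only the method names `pre_warmup_hook`, `post_warmup_hook`, `enter_replay`, `exit_replay`, and `flush_delayed_groups` come from the PR description.

```python
from collections import deque


class ToyOffloadManager:
    """Hypothetical stand-in that mimics the deferred-commit state machine."""

    def __init__(self, delay_offload: bool):
        self.delay_offload = delay_offload
        self.enabled = True       # toggled by the warmup hooks
        self.in_replay = False    # toggled by enter_replay / exit_replay
        self.delayed = deque()    # groups waiting for flush_delayed_groups
        self.offloaded = []       # groups whose D2H copy was "issued"

    def pre_warmup_hook(self):
        self.enabled = False      # no offloading during warmup

    def post_warmup_hook(self):
        self.enabled = True

    def enter_replay(self):
        self.in_replay = True

    def exit_replay(self):
        self.in_replay = False

    def group_commit(self, name: str):
        if not self.enabled:
            return                # warmup phase: ignore the commit
        if self.delay_offload and self.in_replay:
            self.delayed.append(name)   # enqueue only, no copy yet
        else:
            self.offloaded.append(name)  # immediate offload

    def flush_delayed_groups(self):
        # Drain the queue in FIFO order once the graph has been launched.
        while self.delayed:
            self.offloaded.append(self.delayed.popleft())


mgr = ToyOffloadManager(delay_offload=True)

mgr.pre_warmup_hook()
mgr.group_commit("warmup_group")   # dropped: offloading is disabled
mgr.post_warmup_hook()

mgr.enter_replay()
mgr.group_commit("core_attn")
mgr.group_commit("attn_proj")
mgr.exit_replay()
pending_after_replay = list(mgr.delayed)

mgr.flush_delayed_groups()
```

In this model nothing is offloaded while the manager is in the replay state; the two groups only reach `offloaded` after `flush_delayed_groups()` runs, which mirrors the "enqueue during replay, flush afterwards" ordering in the description.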

3. GraphableMegatronModule Integration (module.py)

When fine_grained_activation_offloading and offload_module_in_cuda_graph are both active, _get_te_cuda_graph_replay_args() injects cuda_graph_stream and cuda_graph_event into TE's replay kwargs, bridging the TE-side synchronization.
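As a rough illustration of that injection step, the sketch below builds a replay-kwargs dictionary and adds the two synchronization objects only when both flags are set. The function `build_replay_kwargs`, the placeholder string values, and the use of `SimpleNamespace` are assumptions made for this example; the real `_get_te_cuda_graph_replay_args()` and the TE call signature may differ.

```python
from types import SimpleNamespace


def build_replay_kwargs(config, manager, base_kwargs=None):
    """Hypothetical helper: add stream/event kwargs only when both flags are on."""
    kwargs = dict(base_kwargs or {})
    if config.fine_grained_activation_offloading and config.offload_module_in_cuda_graph:
        kwargs["cuda_graph_stream"] = manager.cuda_graph_stream
        kwargs["cuda_graph_event"] = manager.cuda_graph_event
    return kwargs


# Placeholder objects; in real code these would be a CUDA stream and event.
manager = SimpleNamespace(cuda_graph_stream="<stream>", cuda_graph_event="<event>")

both_on = SimpleNamespace(
    fine_grained_activation_offloading=True, offload_module_in_cuda_graph=True
)
graph_off = SimpleNamespace(
    fine_grained_activation_offloading=True, offload_module_in_cuda_graph=False
)

with_sync = build_replay_kwargs(both_on, manager, {"is_first_microbatch": True})
without_sync = build_replay_kwargs(graph_off, manager, {"is_first_microbatch": True})
```

The `is_first_microbatch` key is just an arbitrary pre-existing kwarg used to show that existing entries are preserved.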

4. Automatic Offload-in-Graph Detection (_set_offload_modules)

Added the offload_module_in_cuda_graph flag, automatically determined by:

  • CudaGraphScope.attn sub-graph containing offloaded qkv_linear / core_attn / attn_proj
  • CudaGraphScope.mlp sub-graph (dense layers) containing offloaded mlp_norm
  • Incompatible combinations (e.g., attn sub-graph + attn_norm offload) are auto-disabled with warnings
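The detection rules listed above can be sketched as a small function. This is a simplified model, not the code in `_set_offload_modules`: the function name, the string-based scope names, the `is_dense_layer` argument, and the reading of "auto-disabled" as "drop the offending module from the offload set" are all assumptions.

```python
import warnings

# Module names taken from the PR description.
ATTN_GRAPH_MODULES = {"qkv_linear", "core_attn", "attn_proj"}
MLP_GRAPH_MODULES = {"mlp_norm"}


def detect_offload_in_graph(scopes, offload_modules, is_dense_layer=True):
    """Return (offload_module_in_cuda_graph, effective_offload_modules)."""
    modules = set(offload_modules)

    # Assumed handling of the incompatible combination: warn and drop it.
    if "attn" in scopes and "attn_norm" in modules:
        warnings.warn("attn_norm offload is disabled under the attn CUDA Graph scope")
        modules.discard("attn_norm")

    in_graph = False
    if "attn" in scopes and modules & ATTN_GRAPH_MODULES:
        in_graph = True
    if "mlp" in scopes and is_dense_layer and modules & MLP_GRAPH_MODULES:
        in_graph = True
    return in_graph, modules


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    flag, effective = detect_offload_in_graph({"attn"}, ["attn_norm", "core_attn"])
```

Here `flag` is `True` because `core_attn` lives inside the captured attention sub-graph, while `attn_norm` is removed and a single warning is recorded.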

5. New Configuration Options (TransformerConfig)

| Config | Description |
| --- | --- |
| delay_offload_until_cuda_graph | Defer offload commits until after CUDA Graph replay to minimize CPU overhead |
| activation_offload_fraction | Fraction of activations to offload, range [0, 1] |
| delta_offload_bytes_across_pp_ranks | Differential offload bytes across PP ranks |

Validation: CUDA Graph + offload requires cuda_graph_impl="transformer_engine" and cuda_graph_warmup_steps > 0; CudaGraphScope.moe is temporarily unsupported; mutually exclusive with cpu_offloading and mhc recompute.
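A minimal sketch of these validation rules, assuming a toy dataclass rather than the real `TransformerConfig`: the class name, defaults, error messages, and the use of a non-empty `cuda_graph_scope` tuple as the marker for "CUDA Graph is in use" are all hypothetical. The mhc recompute exclusion is not modelled because its flag name is not given in the description.

```python
from dataclasses import dataclass


@dataclass
class ToyOffloadConfig:
    """Hypothetical config; field names follow the PR text, defaults do not."""

    fine_grained_activation_offloading: bool = True
    cuda_graph_impl: str = "transformer_engine"
    cuda_graph_warmup_steps: int = 3
    cuda_graph_scope: tuple = ("attn",)
    cpu_offloading: bool = False
    activation_offload_fraction: float = 1.0

    def validate(self) -> None:
        if not 0.0 <= self.activation_offload_fraction <= 1.0:
            raise ValueError("activation_offload_fraction must be in [0, 1]")
        uses_graph = bool(self.cuda_graph_scope)
        if not (self.fine_grained_activation_offloading and uses_graph):
            return  # the remaining rules only apply to CUDA Graph + offload
        if self.cuda_graph_impl != "transformer_engine":
            raise ValueError('cuda_graph_impl must be "transformer_engine"')
        if self.cuda_graph_warmup_steps <= 0:
            raise ValueError("cuda_graph_warmup_steps must be > 0")
        if "moe" in self.cuda_graph_scope:
            raise ValueError("the moe CUDA Graph scope is not supported yet")
        if self.cpu_offloading:
            raise ValueError("cpu_offloading cannot be combined with this feature")


def error_of(**overrides):
    """Return the validation error message for a config, or None if valid."""
    try:
        ToyOffloadConfig(**overrides).validate()
    except ValueError as exc:
        return str(exc)
    return None
```

For example, `error_of(cuda_graph_warmup_steps=0)` returns the warmup-steps message, while `error_of()` returns `None` for the default (valid) combination.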

6. Code Cleanup

Replaced redundant from ... import FineGrainedActivationOffloadingInterface scattered across methods with a single _get_offloading_interface() helper (@lru_cache), accessed uniformly via self.off_interface.

Execution Flow

| Phase | Behavior |
| --- | --- |
| Warmup | Offloading is temporarily disabled via hooks during TE's multi-step eager warmup before graph construction |
| Capture | backward_record + forward_record link the compute stream with D2H/H2D streams through a shared cuda_graph_event |
| Replay | (Optional deferred mode) group_commit only enqueues; flush_delayed_groups() issues D2H during the CPU-idle window between graph launch and subsequent communication |
| Iteration End | off_interface.reset() in schedules.py and cuda_graphs._finish_capturing ensures clean state |

Tests

  • test_fine_grained_activation_offloading_with_cuda_graph: covers multiple combinations of cuda_graph_scope, offload_modules, activation_offload_fraction, and delay_offload_until_cuda_graph (True/False); validates numerical correctness against baseline logits/gradients and performs peak memory sanity checks. Requires TE >= 2.14.

⚠️ For major changes (either in lines of code or in its impact), please make sure to first share a design doc with the team. If you're unsure what's the best way to do so, contact the @mcore-oncall.

Contribution process

flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]

Pre-checks

  • I want this PR in a versioned release and have added the appropriate Milestone (e.g., Core 0.8)
  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code (see Typing guidelines)
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR

Code review

The following process is enforced via the CODEOWNERS file for changes into megatron/core. For changes outside of megatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch

Feel free to message or comment the @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

(Step 1): Add PR label Expert Review

(Step 2): Collect the expert reviewers' reviews

  1. Attach the Expert Review label when your PR is ready for review.
  2. GitHub auto-assigns expert reviewers based on your changes. They will get notified and pick up your PR soon.

⚠️ Only proceed to the next step once all reviewers have approved, merge conflicts are resolved, and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

(Step 3): Final Review

  1. Add Final Review label
  2. GitHub auto-assigns final reviewers based on your changes. They will get notified and pick up your PR soon.

(Optional Step 4): Cherry-pick into release branch

If this PR also needs to be merged into core_r* release branches, after this PR has been merged, select Cherry-pick to open a new PR into the release branch.

For MRs into `dev` branch

The proposed review process for `dev` branch is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

Merging your PR

Any member of core-adlr and core-nemo will be able to merge your PR.

lhb8125 and others added 30 commits October 29, 2025 02:46
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

lhb8125 commented Mar 12, 2026

/ok to test bb4ac50

lhb8125 and others added 2 commits March 22, 2026 23:20
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

lhb8125 commented Mar 23, 2026

/ok to test 51ba05a

Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

lhb8125 commented Mar 23, 2026

/ok to test 837dba4

lhb8125 commented Mar 24, 2026

@Victarry @NVIDIA/core-nemo Could you give a final review of this PR?

lhb8125 and others added 2 commits March 24, 2026 21:02
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

lhb8125 commented Mar 25, 2026

/ok to test d78b776

Comment thread megatron/core/transformer/multi_latent_attention.py Outdated
Comment thread megatron/core/transformer/transformer_config.py Outdated
Comment thread docs/user-guide/features/fine_grained_activation_offloading.md Outdated
lhb8125 and others added 5 commits March 29, 2026 22:14
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>

lhb8125 commented Mar 30, 2026

/ok to test 0cef457

@hxbai hxbai left a comment

LGTM

@svcnvidia-nemo-ci

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/23732168075

Labels

  • complexity: medium
  • core_dev_r0.16.0: Cherry-pick label for core_dev_r0.16.0 release branch
  • dev branch: Dev branch related issues and development
  • Final Review: PR is in the "final review" stage

6 participants