[Dev][feat] Support CUDA Graph capture offloading modules by lhb8125 · Pull Request #3219 · NVIDIA/Megatron-LM

lhb8125 · 2026-02-03T04:36:38Z

What does this PR do?

This PR enables Fine-Grained Activation Offloading to work together with Transformer Engine CUDA Graph capture and replay. Previously, these two features were mutually exclusive: CUDA Graph captures a fixed sequence of GPU operations, while activation offloading involves dynamic D2H/H2D memory copies that conflict with graph semantics. This PR resolves the conflict by introducing dedicated CUDA stream/event synchronization and an optional deferred-commit strategy.
Scope: 18 files changed, +944 / -259 lines

Key Changes

1. In-Graph Offload Synchronization (`transformer_layer.py`)

_te_cuda_graph_capture(): When offload_module_in_cuda_graph=True, inserts backward_record() at the sub-graph entry (synchronizes the compute stream with the H2D reload stream during backward) and calls forward_record() at the sub-graph exit (synchronizes the compute stream with the D2H offload stream during forward).
_te_cuda_graph_replay(): Supports the delay_offload_until_cuda_graph mode — during replay, enter_replay() / exit_replay() cause offload groups to be enqueued without immediate execution; after replay, flush_delayed_groups() issues the D2H copies during the CPU-idle window between graph launch and subsequent communication.

2. `PipelineOffloadManager` Extensions (`fine_grained_activation_offload.py`)

Added cuda_graph_stream / cuda_graph_event (external event) dedicated to synchronizing in-graph captured modules with the D2H/H2D offload streams.
Deferred offload commit: FineGrainedOffloadingGroupCommitFunction pushes offload groups into a queue when delay_offload=True and the manager is in replay state; flush_delayed_groups() drains the queue in batch during CPU-idle gaps.
Warmup hook integration: pre_warmup_hook / post_warmup_hook temporarily disable/enable offloading around TE's warmup phase to avoid state-machine conflicts.
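
The deferred-commit behaviour described above can be illustrated with a small pure-Python model. This is only a sketch: `DelayedOffloadQueue` is a hypothetical stand-in, not the `PipelineOffloadManager` from this PR, and a plain callable stands in for the real D2H copy. The method names `enter_replay`, `exit_replay`, `group_commit`, and `flush_delayed_groups` mirror the names used in the description, but their signatures here are assumptions.

```python
from collections import deque
from typing import Callable, Deque, List


class DelayedOffloadQueue:
    """Toy model of deferred offload commits (hypothetical, not the PR's class)."""

    def __init__(self, delay_offload: bool) -> None:
        self.delay_offload = delay_offload
        self.in_replay = False
        self._pending: Deque[Callable[[], None]] = deque()

    def enter_replay(self) -> None:
        self.in_replay = True

    def exit_replay(self) -> None:
        self.in_replay = False

    def group_commit(self, issue_d2h: Callable[[], None]) -> None:
        # Defer only when delay mode is on AND a graph replay is in progress;
        # otherwise issue the copy immediately, as in the non-graph path.
        if self.delay_offload and self.in_replay:
            self._pending.append(issue_d2h)
        else:
            issue_d2h()

    def flush_delayed_groups(self) -> int:
        # Drain in FIFO order so groups are issued in the order they were committed.
        flushed = 0
        while self._pending:
            self._pending.popleft()()
            flushed += 1
        return flushed


issued: List[str] = []
manager = DelayedOffloadQueue(delay_offload=True)

manager.enter_replay()
manager.group_commit(lambda: issued.append("group0"))
manager.group_commit(lambda: issued.append("group1"))
manager.exit_replay()
issued_during_replay = list(issued)  # snapshot taken before the flush

num_flushed = manager.flush_delayed_groups()
```

In this model nothing is issued while the replay is in progress, and the flush issues both groups in commit order. The real implementation additionally has to order the copies against the compute stream, which the sketch does not attempt to show.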

3. `GraphableMegatronModule` Integration (`module.py`)

When fine_grained_activation_offloading and offload_module_in_cuda_graph are both active, _get_te_cuda_graph_replay_args() injects cuda_graph_stream and cuda_graph_event into TE's replay kwargs, bridging the TE-side synchronization.

4. Automatic Offload-in-Graph Detection (`_set_offload_modules`)

Added the offload_module_in_cuda_graph flag, automatically determined by:

CudaGraphScope.attn sub-graph containing offloaded qkv_linear / core_attn / attn_proj
CudaGraphScope.mlp sub-graph (dense layers) containing offloaded mlp_norm
Incompatible combinations (e.g., attn sub-graph + attn_norm offload) are auto-disabled with warnings
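
A minimal sketch of how such a detection rule could look is given below. The function `resolve_offload_modules` and both lookup tables are hypothetical; only the module and scope names come from the description. The sketch assumes that "auto-disabled" means the incompatible module is dropped from the offload list, and it omits the dense-layer condition on the `mlp` scope.

```python
import warnings
from typing import Dict, Iterable, List, Set, Tuple

# Assumption: these tables are illustrative and may not match the actual
# logic in _set_offload_modules.
IN_GRAPH_MODULES: Dict[str, Set[str]] = {
    "attn": {"qkv_linear", "core_attn", "attn_proj"},
    "mlp": {"mlp_norm"},
}
INCOMPATIBLE_MODULES: Dict[str, Set[str]] = {
    "attn": {"attn_norm"},
}


def resolve_offload_modules(
    scopes: Iterable[str], offload_modules: Iterable[str]
) -> Tuple[bool, List[str]]:
    """Return (offload_module_in_cuda_graph, offload modules that remain enabled)."""
    scopes = list(scopes)
    kept: List[str] = []
    offload_in_graph = False
    for module in offload_modules:
        if any(module in INCOMPATIBLE_MODULES.get(s, set()) for s in scopes):
            # Assumed behaviour: drop the module and warn instead of failing.
            warnings.warn(f"Disabling offload of {module!r} for scopes {scopes}")
            continue
        kept.append(module)
        if any(module in IN_GRAPH_MODULES.get(s, set()) for s in scopes):
            offload_in_graph = True
    return offload_in_graph, kept
```

For example, under these assumptions `resolve_offload_modules(["attn"], ["core_attn", "attn_norm"])` sets the flag, keeps `core_attn`, and emits one warning for `attn_norm`.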

5. New Configuration Options (`TransformerConfig`)

| Config | Description |
| --- | --- |
| `delay_offload_until_cuda_graph` | Defer offload commits until after CUDA Graph replay to minimize CPU overhead |
| `activation_offload_fraction` | Fraction of activations to offload, range [0, 1] |
| `delta_offload_bytes_across_pp_ranks` | Differential offload bytes across PP ranks |

Validation: CUDA Graph + offload requires `cuda_graph_impl="transformer_engine"` and `cuda_graph_warmup_steps > 0`; `CudaGraphScope.moe` is temporarily unsupported; mutually exclusive with `cpu_offloading` and `mhc` recompute.
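
The validation rules can be sketched as a standalone check. `OffloadGraphConfig` below is a hypothetical subset of fields, not the real `TransformerConfig`; the defaults are arbitrary, scopes are modelled as plain strings rather than `CudaGraphScope` members, and treating `cuda_graph_impl != "none"` as "CUDA Graph is enabled" is an assumption. The `mhc` recompute rule is left out because the description does not say how it is configured.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class OffloadGraphConfig:
    """Hypothetical subset of config fields; defaults are arbitrary."""

    fine_grained_activation_offloading: bool = False
    cuda_graph_impl: str = "none"
    cuda_graph_warmup_steps: int = 0
    cuda_graph_scope: List[str] = field(default_factory=list)
    cpu_offloading: bool = False
    activation_offload_fraction: float = 1.0

    def validate(self) -> None:
        if not 0.0 <= self.activation_offload_fraction <= 1.0:
            raise ValueError("activation_offload_fraction must be in [0, 1]")
        if not self.fine_grained_activation_offloading:
            return
        # Assumed to apply whenever fine-grained offloading is on.
        if self.cpu_offloading:
            raise ValueError("fine-grained offloading excludes cpu_offloading")
        # Assumption: any impl other than "none" means CUDA Graph is in use.
        if self.cuda_graph_impl == "none":
            return
        if self.cuda_graph_impl != "transformer_engine":
            raise ValueError('offload requires cuda_graph_impl="transformer_engine"')
        if self.cuda_graph_warmup_steps <= 0:
            raise ValueError("offload requires cuda_graph_warmup_steps > 0")
        if "moe" in self.cuda_graph_scope:
            raise ValueError("the moe CUDA Graph scope is not supported yet")


config = OffloadGraphConfig(
    fine_grained_activation_offloading=True,
    cuda_graph_impl="transformer_engine",
    cuda_graph_warmup_steps=2,
    cuda_graph_scope=["attn"],
    activation_offload_fraction=0.5,
)
config.validate()  # passes silently for a valid combination
```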

6. Code Cleanup

Replaced redundant from ... import FineGrainedActivationOffloadingInterface scattered across methods with a single _get_offloading_interface() helper (@lru_cache), accessed uniformly via self.off_interface.

Execution Flow

| Phase | Behavior |
| --- | --- |
| Warmup | Offloading is temporarily disabled via hooks during TE's multi-step eager warmup before graph construction |
| Capture | `backward_record` + `forward_record` link the compute stream with D2H/H2D streams through a shared `cuda_graph_event` |
| Replay | (Optional deferred mode) `group_commit` only enqueues; `flush_delayed_groups()` issues D2H during the CPU-idle window between graph launch and subsequent communication |
| Iteration End | `off_interface.reset()` in `schedules.py` and `cuda_graphs._finish_capturing` ensures clean state |

Tests

test_fine_grained_activation_offloading_with_cuda_graph: covers multiple combinations of cuda_graph_scope, offload_modules, activation_offload_fraction, and delay_offload_until_cuda_graph (True/False); validates numerical correctness against baseline logits/gradients and performs peak memory sanity checks. Requires TE >= 2.14.

⚠️ For major changes (either in lines of code or in its impact), please make sure to first share a design doc with the team. If you're unsure what's the best way to do so, contact the @mcore-oncall.

Contribution process

```mermaid
flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]
```

Pre-checks

I want this PR in a versioned release and have added the appropriate Milestone (e.g., Core 0.8)
I have added relevant unit tests
I have added relevant functional tests
I have added proper typing to my code (see the Typing guidelines)
I have added relevant documentation
I have run the autoformatter.sh on my PR

Code review

The following process is enforced via the CODEOWNERS file for changes into megatron/core. For changes outside of megatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch

Feel free to message or mention @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

(Step 1): Add PR label `Expert Review`

(Step 2): Collect the expert reviewers' reviews

Attach the Expert Review label when your PR is ready for review.
GitHub auto-assigns expert reviewers based on your changes. They will get notified and pick up your PR soon.

⚠️ Only proceed to the next step once all reviewers have approved, merge conflicts are resolved, and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

(Step 3): Final Review

Add Final Review label
GitHub auto-assigns final reviewers based on your changes. They will get notified and pick up your PR soon.

(Optional Step 4): Cherry-pick into release branch

If this PR also needs to be merged into core_r* release branches, after this PR has been merged, select Cherry-pick to open a new PR into the release branch.

For MRs into `dev` branch

The proposed review process for `dev` branch is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.