Fix Megatron-FSDP optimizer CPU offload and checkpointing by wplf · Pull Request #4623 · NVIDIA/Megatron-LM

wplf · 2026-05-05T04:48:58Z

Summary

Handle Megatron-FSDP DTensor parameters and gradients by operating on local shards before optimizer CPU offload copies.
Save HybridDeviceOptimizer state for fsdp_dtensor checkpoints as deterministic FSDP DTensors, skipping duplicate master_param entries and attaching local DCP chunk metadata without metadata collectives.
Allow SWiGLU/GDN optimizer checkpoint preprocessing to split flat optimizer DTensors using the corresponding model parameter metadata, and restore loaded DTensor optimizer state back to local tensors.
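The first point above (unwrap to the local shard before any CPU offload copy) can be sketched roughly as follows. This is an illustration only, not the PR's code: `FakeDTensor`, `local_shard`, and `offload_copy` are hypothetical names, and a plain Python list stands in for a tensor so the snippet runs without torch. The only real API it assumes is that a DTensor exposes `to_local()`.

```python
from typing import Any


class FakeDTensor:
    """Stand-in for a DTensor (illustration only, not the torch class)."""

    def __init__(self, shard: list) -> None:
        self._shard = shard

    def to_local(self) -> list:
        # Mirrors the idea of DTensor.to_local(): return this rank's shard.
        return self._shard


def local_shard(value: Any) -> Any:
    """Return the local shard of a DTensor-like object, else the value itself."""
    to_local = getattr(value, "to_local", None)
    return to_local() if callable(to_local) else value


def offload_copy(value: Any) -> list:
    """Hypothetical offload step: unwrap first, then copy the plain shard."""
    # Copying (and, in the real code, pinning) happens on the unwrapped
    # shard, so nothing is dispatched through the DTensor wrapper.
    return list(local_shard(value))
```

Plain values pass through `local_shard` unchanged, so the same offload path would serve both FSDP and non-FSDP parameters.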

Fixes #4910.

Tests

uv run isort megatron/core/optimizer/distrib_optimizer.py
python -m py_compile megatron/core/optimizer/distrib_optimizer.py megatron/core/optimizer/cpu_offloading/hybrid_optimizer.py megatron/core/transformer/fsdp_dtensor_checkpoint.py
PYTHONDONTWRITEBYTECODE=1 PYTHONPATH=. uv run python -m pytest tests/unit_tests/test_optimizer_cpu_offloading.py -q (72 passed)
Local Qwen3.5-VL proxy repro with optimizer_cpu_offload=True, optimizer_offload_fraction=1, overlap_cpu_optimizer_d2h_h2d=True, use_precision_aware_optimizer=True, and fsdp_dtensor optimizer checkpointing: checkpoint saves at iterations 10 and 12 completed successfully.

copy-pr-bot · 2026-05-05T04:49:01Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

wplf · 2026-05-05T04:55:21Z

/ok to test

copy-pr-bot · 2026-05-05T04:55:24Z

/ok to test

@wplf, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

wplf · 2026-05-05T06:02:24Z

/ok to test 5da7d67

cspades

@wplf Could you share some information on what the bug was? Curious what was the root cause and how it is related to DTensors!

Handle Megatron-FSDP DTensor parameters and gradients by operating on local shards before CPU optimizer offload copies. This avoids dispatching pin_memory/is_pinned through DTensor and lets pin_cpu_params control CPU parameter pinning.

cspades

Some important nits: basically, if we can move most of this code into the MFSDP source, it would be much cleaner.

INFO: Could you also add more commentary explaining the lifecycle of the offloaded DTensors here? I just want to make sure this is the right way to implement it.

cspades · 2026-06-05T16:43:49Z

+        if isinstance(self.optimizer, HybridDeviceOptimizer):
+            packed_state = self._pack_hybrid_optimizer_fsdp_state_dict()
+        else:
+            packed_state = {
+                (self._param_name(k) if isinstance(k, torch.Tensor) else k): v
+                for k, v in self.state.items()
+            }


This optimizer isn't just used for FSDP, right? We should check that the parameters are FSDP parameters before doing this.
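One way the suggested guard could look is sketched below. It is not the PR's implementation: `Param`, `is_fsdp_param`, and `pack_state` are hypothetical, and the check assumes that FSDP-managed parameters carry the `megatron_fsdp_slice` and `megatron_fsdp_dist_index` attributes that appear elsewhere in this diff.

```python
class Param:
    """Stand-in for torch.nn.Parameter (illustration only)."""


def is_fsdp_param(param) -> bool:
    # Assumption: only Megatron-FSDP parameters have both attributes.
    return hasattr(param, "megatron_fsdp_slice") and hasattr(
        param, "megatron_fsdp_dist_index"
    )


def pack_state(state: dict, names: dict, pack_fsdp) -> dict:
    """Key optimizer state by parameter name, packing only FSDP parameters."""
    packed = {}
    for key, value in state.items():
        if not isinstance(key, Param):
            packed[key] = value  # non-parameter keys pass through unchanged
        elif is_fsdp_param(key):
            packed[names[key]] = pack_fsdp(key, names[key], value)
        else:
            packed[names[key]] = value  # non-FSDP parameter: rename only
    return packed
```

With this shape, a HybridDeviceOptimizer that holds non-FSDP parameters would fall back to the plain name-keyed path instead of the FSDP-specific packing.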

cspades · 2026-06-05T16:48:04Z

+    def _pack_hybrid_optimizer_fsdp_state(
+        self, param: torch.nn.Parameter, param_name: str, state: dict[str, Any]
+    ) -> dict[str, Any]:
+        """Convert HybridDeviceOptimizer FSDP-local tensor state to DTensors."""


Why not just use preprocess_state_dict_for_uneven_dtensor or update_uneven_dtensor_chunk_metadata instead of re-implementing it here? (Also, we should have all of these types of utilities inside the Megatron-FSDP source directory unless it absolutely needs to be in the training loop / optimizer code!)

cspades · 2026-06-05T16:52:25Z

+                local_tensor = data.to_local()
+            else:
+                assert data.numel() == dist_param.numel(), (
+                    f"DTensor shape mismatch: data.shape={data.shape} vs "
+                    f"dist_param.shape={global_shape}"
+                )
+                local_tensor = data.to_local().view(-1)


These can also be DTensor._local_tensor BTW! Plus more instances of this in this same file.

cspades · 2026-06-05T16:54:26Z

+                    value = make_fsdp_dtensor(
+                        value.data.view(-1),
+                        flat_param,
+                        dist_index=param.megatron_fsdp_dist_index,
+                        is_expert_param=is_expert_param,
+                        run_check=False,
+                        update_uneven_dtensor_chunk_meta=False,
+                    )
+                    self._set_flat_fsdp_dtensor_chunk_metadata(
+                        value, param.megatron_fsdp_slice
+                    )


So TL;DR I assume the packed_state is our offloaded state, and we want them to be DTensors as well?

cspades · 2026-06-05T16:55:11Z

+    def _set_flat_fsdp_dtensor_chunk_metadata(tensor: DTensor, fsdp_slice: slice) -> None:
+        """Attach DCP chunk metadata for a flat FSDP-local shard without collectives."""


We can just refactor the existing function in uneven_dtensor.py to skip the metadata AG update! The less code outside MFSDP, the better!
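For context, the reason no collective is needed for a flat shard is that its chunk offset and size follow directly from the rank-local slice. The sketch below is an assumption-laden illustration: `flat_chunk_metadata` and the `offsets`/`sizes` field names are hypothetical and do not reflect the actual metadata layout used by the PR or by uneven_dtensor.py.

```python
def flat_chunk_metadata(fsdp_slice: slice, global_numel: int) -> dict:
    """Derive 1-D chunk metadata for a flat local shard from its slice alone."""
    # slice.indices() clamps the bounds to the flattened parameter length.
    start, stop, step = fsdp_slice.indices(global_numel)
    if step != 1:
        raise ValueError("expected a contiguous slice of the flat parameter")
    # Every rank can compute this locally, so no gather of metadata is required.
    return {"offsets": (start,), "sizes": (max(stop - start, 0),)}
```

Whether this logic lives in optimizer code or behind a flag on the existing MFSDP utility is the design question raised in the comment above.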

wplf changed the title ~~Fix optimizer CPU offload for DTensor params~~ Fix optimizer CPU offload for megatron-fsdp dtensor param May 5, 2026

wplf added the module: megatron-fsdp label May 5, 2026

wplf self-assigned this May 5, 2026

wplf marked this pull request as ready for review May 5, 2026 04:50

wplf requested review from a team as code owners May 5, 2026 04:50

svcnvidia-nemo-ci added the complexity: low label May 5, 2026

copy-pr-bot Bot temporarily deployed to test May 5, 2026 06:03 Inactive

yaox12 requested a review from shjwudp May 6, 2026 01:38

cspades reviewed May 6, 2026

Comment thread megatron/core/optimizer/cpu_offloading/hybrid_optimizer.py Outdated

Wohox mentioned this pull request May 20, 2026

Fix layer-wise distributed optimizer MXFP8 wgrad failure by padding per-param start #4889

Closed

5 tasks

conver334 mentioned this pull request May 21, 2026

Megatron-FSDP optimizer checkpoint hangs in save with optimizer_cpu_offload=True #4910

Open

wplf changed the title ~~Fix optimizer CPU offload for megatron-fsdp dtensor param~~ Fix Megatron-FSDP optimizer CPU offload and checkpointing May 21, 2026

wplf added 3 commits May 20, 2026 23:41

Fix optimizer CPU offload for DTensor params

c327b79

Handle Megatron-FSDP DTensor parameters and gradients by operating on local shards before CPU optimizer offload copies. This avoids dispatching pin_memory/is_pinned through DTensor and lets pin_cpu_params control CPU parameter pinning.

Fix FSDP DTensor checkpointing with optimizer offload

39496d4

Use local DTensor storage in hybrid optimizer

cdaac89

wplf force-pushed the jinliang/fix-mfsdp-optimizer-offload branch from eff7bfb to cdaac89 Compare May 21, 2026 06:44

conver334 mentioned this pull request May 26, 2026

Fix Megatron-FSDP GDN optimizer checkpoint metadata for FSDP DTensor wplf/Megatron-LM#2

Open

wplf force-pushed the jinliang/fix-mfsdp-optimizer-offload branch from cdaac89 to 2f2442b Compare May 27, 2026 03:49

Merge branch 'dev' into jinliang/fix-mfsdp-optimizer-offload

21ba0b9

wplf force-pushed the jinliang/fix-mfsdp-optimizer-offload branch from 2f2442b to 21ba0b9 Compare May 27, 2026 03:54

cspades reviewed Jun 5, 2026

wplf wants to merge 4 commits into NVIDIA:dev from wplf:jinliang/fix-mfsdp-optimizer-offload


3 participants
