Megatron-FSDP optimizer checkpoint hangs in save with optimizer_cpu_offload=True

## Describe the bug

  @NVIDIA/mcore-oncall

  Megatron-FSDP `fsdp_dtensor` checkpoint save hangs/fails when saving the optimizer state with `optimizer_cpu_offload=True`.

  The run finishes the first training step and starts checkpoint saving, but hangs during FSDP DTensor checkpoint preprocessing for the optimizer state. Eventually the NCCL watchdog aborts an `ALLGATHER` issued inside `validate_uneven_dtensor()`.

  The same setup saves a model-only checkpoint successfully, and the issue is not observed when optimizer CPU offload is disabled. This suggests the problem is specific to the optimizer state dict produced under `HybridDeviceOptimizer` / optimizer CPU offload and its interaction with FSDP DTensor checkpoint preprocessing, especially the SwiGLU/GDN split path.

## Steps/Code to reproduce bug
This is the verl reproduction; for a Megatron-LM-only reproduction, see the following comment: https://github.com/NVIDIA/Megatron-LM/issues/4910#issuecomment-4504981844


  Environment/config used:

  - Megatron-LM: latest `dev` branch, with PR [#4623](https://github.com/NVIDIA/Megatron-LM/pull/4623) cherry-picked
  - Megatron-Bridge: latest `main` https://github.com/NVIDIA-NeMo/Megatron-Bridge
  - Launcher: [verl SFT script](https://github.com/verl-project/verl/pull/6352)
    - Model: Qwen3.5-35B-A3B
    - 4 nodes x 8 GPUs = 32 GPUs
    - TP=2, PP=1, CP=2, EP=8, ETP=1
    - Megatron-FSDP enabled
    - checkpoint contents: `["model", "optimizer"]`

  Important overrides:

  ```bash
  +engine.override_ddp_config.megatron_fsdp_use_decoupled_grad=True
  +engine.override_ddp_config.overlap_grad_reduce=True
  +engine.override_ddp_config.overlap_param_gather=True

  +optim.override_optimizer_config.optimizer_cpu_offload=True
  +optim.override_optimizer_config.optimizer_offload_fraction=1
  +optim.override_optimizer_config.overlap_cpu_optimizer_d2h_h2d=True
  +optim.override_optimizer_config.use_precision_aware_optimizer=True

  trainer.total_training_steps=1
  trainer.save_freq=-1
  checkpoint.save_contents='["model","optimizer"]'
  checkpoint.load_contents='["model","optimizer"]'
```

  Key traceback excerpt:

```text
  [rank12]:   File "/root/mfsdp-intergration/verl/verl/utils/checkpoint/megatron_checkpoint_manager.py", line 471, in _save_megatron_fsdp_checkpoint
  [rank12]:     state_dict = bridge_preprocess_fsdp_dtensor_state_dict(self.transformer_config, state_dict, checkpoint_model)
  [rank12]:   File "/root/mfsdp-intergration/Megatron-Bridge/src/megatron/bridge/training/checkpointing.py", line 1642, in preprocess_fsdp_dtensor_state_dict
  [rank12]:     model_state_dict, optimizer_state_dict = handle_swiglu_in_state_dict(
  [rank12]:   File "/root/mfsdp-intergration/Megatron-LM/megatron/core/transformer/fsdp_dtensor_checkpoint.py", line 403, in handle_swiglu_in_state_dict
  [rank12]:     weight_w, weight_v = split_swiglu_linear_fc1(
  [rank12]:   File "/root/mfsdp-intergration/Megatron-LM/megatron/core/transformer/fsdp_dtensor_checkpoint.py", line 342, in split_swiglu_linear_fc1
  [rank12]:     weight_w = make_fsdp_dtensor(
  [rank12]:   File "/root/mfsdp-intergration/Megatron-LM/megatron/core/distributed/fsdp/src/megatron_fsdp/param_and_grad_buffer.py", line 4706, in make_fsdp_dtensor
  [rank12]:     validate_uneven_dtensor(fsdp_tensor)
  [rank12]:   File "/root/mfsdp-intergration/Megatron-LM/megatron/core/distributed/fsdp/src/megatron_fsdp/uneven_dtensor.py", line 158, in validate_uneven_dtensor
  [rank12]:     chunk_meta = gather_and_compute_chunk_metadata(dtensor)
  [rank12]:   File "/root/mfsdp-intergration/Megatron-LM/megatron/core/distributed/fsdp/src/megatron_fsdp/uneven_dtensor.py", line 85, in gather_and_compute_chunk_metadata
  [rank12]:     _update_offsets_and_cumulative_shape(mesh_dim, offsets, cumulative_shape)
  [rank12]:   File "/root/mfsdp-intergration/Megatron-LM/megatron/core/distributed/fsdp/src/megatron_fsdp/uneven_dtensor.py", line 51, in _update_offsets_and_cumulative_shape
  [rank12]:     dist.all_gather_object(global_shapes, cumulative_shape, group=shard_group)
  [rank12]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1.
```
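The traceback stops in a collective on the shard group, so one thing worth checking is whether every rank walks the same optimizer state entries before `handle_swiglu_in_state_dict` runs. Below is a minimal diagnostic sketch, not part of Megatron-LM or Megatron-Bridge: `summarize_state`, `key_fingerprint`, and `device_counts` are hypothetical helper names, and the sketch assumes the optimizer state dict is a nested `dict` whose leaves expose `.device` and `.shape` the way tensors do. It uses stand-in objects so it runs without `torch`.

```python
import hashlib
from collections import Counter
from types import SimpleNamespace


def summarize_state(state, prefix=""):
    """Flatten a nested dict into {dotted key: (type name, device type, shape)}."""
    summary = {}
    for key, value in state.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            summary.update(summarize_state(value, prefix=f"{name}."))
            continue
        device = getattr(value, "device", None)
        shape = getattr(value, "shape", None)
        summary[name] = (
            type(value).__name__,
            # Keep only the device type ("cuda:3" -> "cuda") so ranks are comparable.
            None if device is None else str(device).split(":")[0],
            None if shape is None else tuple(shape),
        )
    return summary


def key_fingerprint(summary):
    """Hash key names and leaf types in iteration order (shapes and devices excluded)."""
    digest = hashlib.sha256()
    for name, (type_name, _device, _shape) in summary.items():
        digest.update(f"{name}:{type_name}\n".encode())
    return digest.hexdigest()[:16]


def device_counts(summary):
    """Count leaves per device type, e.g. how many entries live on CPU vs. GPU."""
    return Counter(device for _type, device, _shape in summary.values())


# Stand-in objects for illustration only; real entries would be tensors or DTensors.
demo_state = {
    "state": {0: {"exp_avg": SimpleNamespace(device="cpu", shape=(4, 2))}},
    "step": 3,
}
demo_summary = summarize_state(demo_state)
print(key_fingerprint(demo_summary), dict(device_counts(demo_summary)))
```

Logging the fingerprint and device counts on every rank just before preprocessing would show whether ranks in the same group disagree on the key sequence, or whether offloaded entries arrive on an unexpected device. A mismatch would be consistent with some ranks entering the `all_gather_object` call while others never reach it, but this is only a hypothesis.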

  NCCL watchdog excerpt:

```text
  [rank5]:[E519 02:14:37.325135912 ProcessGroupNCCL.cpp:688] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=26523, OpType=ALLGATHER, NumelIn=1, NumelOut=4, Timeout(ms)=1800000) ran for 1800010 milliseconds before timing out.
  [rank5]:[E519 02:14:37.325225845 ProcessGroupNCCL.cpp:2277] [PG ID 16 PG GUID 222(EXPERT_DATA_PARALLEL_GROUP) Rank 0] failure detected by watchdog at work sequence id: 26523 PG status: last enqueued work: 26523, last completed work: 26522
  [rank10]:[E519 02:14:37.316179584 ProcessGroupNCCL.cpp:688] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=26523, OpType=ALLGATHER, NumelIn=1, NumelOut=4, Timeout(ms)=1800000) ran for 1800010 milliseconds before timing out.
```
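To see which ranks timed out on which collective across the full attached log, the watchdog lines can be grouped by operation and sequence number. This is a small sketch that assumes the log lines follow the format shown in the excerpt above; `stuck_collectives` is a hypothetical helper name, not an existing tool.

```python
import re
from collections import defaultdict

# Matches the "Watchdog caught collective operation timeout" lines shown above.
WATCHDOG_RE = re.compile(
    r"\[rank(?P<rank>\d+)\]:.*?Watchdog caught collective operation timeout: "
    r"WorkNCCL\(SeqNum=(?P<seq>\d+), OpType=(?P<op>\w+)"
)


def stuck_collectives(lines):
    """Map (OpType, SeqNum) to the sorted global ranks whose watchdog timed out on it."""
    stuck = defaultdict(set)
    for line in lines:
        match = WATCHDOG_RE.search(line)
        if match:
            stuck[(match["op"], int(match["seq"]))].add(int(match["rank"]))
    return {key: sorted(ranks) for key, ranks in stuck.items()}


# Usage against the attached log (file name taken from the attachment in this issue):
# with open("q35-mfsdp-bug.txt") as log:
#     for (op, seq), ranks in stuck_collectives(log).items():
#         print(op, seq, ranks)
```

Ranks that never appear for a given sequence number are candidates for having skipped that collective or being blocked elsewhere.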

## Expected behavior

  Optimizer checkpoint save should complete successfully with Megatron-FSDP `fsdp_dtensor` checkpointing when `optimizer_cpu_offload=True`.

## Additional context

  A model-only checkpoint with the same Megatron-FSDP configuration succeeds. The hang only appears when saving optimizer state as well:

```text
  checkpoint.save_contents='["model"]'      # passes
  checkpoint.save_contents='["model","optimizer"]'  # hangs/fails
```

[q35-mfsdp-bug.txt](https://github.com/user-attachments/files/28081685/q35-mfsdp-bug.txt)
