Describe the bug
@NVIDIA/mcore-oncall
Megatron-FSDP fsdp_dtensor checkpoint save hangs/fails when saving the optimizer state with optimizer_cpu_offload=True.
The run finishes the first training step and starts checkpoint saving, but hangs during FSDP DTensor checkpoint preprocessing for the optimizer state. Eventually the NCCL watchdog aborts an ALLGATHER inside validate_uneven_dtensor().
The same setup can save a model-only checkpoint successfully, and the issue is not observed when optimizer CPU offload is disabled. This suggests the problem is specific to the optimizer state dict produced under HybridDeviceOptimizer / optimizer CPU offload and its interaction with FSDP DTensor checkpoint preprocessing, especially the SwiGLU/GDN split path.
Steps/Code to reproduce bug
This is the VeRL reproduction; for the Megatron-LM reproduction, see the following comment: #4910 (comment)
Environment/config used: see the list at the end of this report.
Important overrides:
+engine.override_ddp_config.megatron_fsdp_use_decoupled_grad=True
+engine.override_ddp_config.overlap_grad_reduce=True
+engine.override_ddp_config.overlap_param_gather=True
+optim.override_optimizer_config.optimizer_cpu_offload=True
+optim.override_optimizer_config.optimizer_offload_fraction=1
+optim.override_optimizer_config.overlap_cpu_optimizer_d2h_h2d=True
+optim.override_optimizer_config.use_precision_aware_optimizer=True
trainer.total_training_steps=1
trainer.save_freq=-1
checkpoint.save_contents='["model","optimizer"]'
checkpoint.load_contents='["model","optimizer"]'
Key traceback excerpt:
[rank12]: File "/root/mfsdp-intergration/verl/verl/utils/checkpoint/megatron_checkpoint_manager.py", line 471, in _save_megatron_fsdp_checkpoint
[rank12]: state_dict = bridge_preprocess_fsdp_dtensor_state_dict(self.transformer_config, state_dict, checkpoint_model)
[rank12]: File "/root/mfsdp-intergration/Megatron-Bridge/src/megatron/bridge/training/checkpointing.py", line 1642, in preprocess_fsdp_dtensor_state_dict
[rank12]: model_state_dict, optimizer_state_dict = handle_swiglu_in_state_dict(
[rank12]: File "/root/mfsdp-intergration/Megatron-LM/megatron/core/transformer/fsdp_dtensor_checkpoint.py", line 403, in handle_swiglu_in_state_dict
[rank12]: weight_w, weight_v = split_swiglu_linear_fc1(
[rank12]: File "/root/mfsdp-intergration/Megatron-LM/megatron/core/transformer/fsdp_dtensor_checkpoint.py", line 342, in split_swiglu_linear_fc1
[rank12]: weight_w = make_fsdp_dtensor(
[rank12]: File "/root/mfsdp-intergration/Megatron-LM/megatron/core/distributed/fsdp/src/megatron_fsdp/param_and_grad_buffer.py", line 4706, in make_fsdp_dtensor
[rank12]: validate_uneven_dtensor(fsdp_tensor)
[rank12]: File "/root/mfsdp-intergration/Megatron-LM/megatron/core/distributed/fsdp/src/megatron_fsdp/uneven_dtensor.py", line 158, in validate_uneven_dtensor
[rank12]: chunk_meta = gather_and_compute_chunk_metadata(dtensor)
[rank12]: File "/root/mfsdp-intergration/Megatron-LM/megatron/core/distributed/fsdp/src/megatron_fsdp/uneven_dtensor.py", line 85, in gather_and_compute_chunk_metadata
[rank12]: _update_offsets_and_cumulative_shape(mesh_dim, offsets, cumulative_shape)
[rank12]: File "/root/mfsdp-intergration/Megatron-LM/megatron/core/distributed/fsdp/src/megatron_fsdp/uneven_dtensor.py", line 51, in _update_offsets_and_cumulative_shape
[rank12]: dist.all_gather_object(global_shapes, cumulative_shape, group=shard_group)
[rank12]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1.
NCCL watchdog excerpt:
[rank5]:[E519 02:14:37.325135912 ProcessGroupNCCL.cpp:688] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=26523, OpType=ALLGATHER, NumelIn=1, NumelOut=4, Timeout(ms)=1800000) ran for 1800010 milliseconds before timing out.
[rank5]:[E519 02:14:37.325225845 ProcessGroupNCCL.cpp:2277] [PG ID 16 PG GUID 222(EXPERT_DATA_PARALLEL_GROUP) Rank 0] failure detected by watchdog at work sequence id: 26523 PG status: last enqueued work: 26523, last completed work: 26522
[rank10]:[E519 02:14:37.316179584 ProcessGroupNCCL.cpp:688] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=26523, OpType=ALLGATHER, NumelIn=1, NumelOut=4, Timeout(ms)=1800000) ran for 1800010 milliseconds before timing out.
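The watchdog lines above show an ALLGATHER that was enqueued but never completed; NumelIn=1, NumelOut=4 is consistent with a 4-rank group. One common way to get this signature is for the ranks in a group to issue a different number of collectives, for example if the per-rank optimizer state dicts differ in which entries reach the split path. That cause is an assumption here, not something the logs confirm. The sketch below simulates the pattern with threads, using a `threading.Barrier` as a stand-in for the collective. `simulate_ranks` and the key names are hypothetical, and no Megatron or PyTorch API is used.

```python
import threading


def simulate_ranks(world_size, keys_per_rank, timeout=1.0):
    """Run one thread per simulated rank; each key triggers one 'collective'.

    A threading.Barrier stands in for the collective: it only completes
    when every rank in the group arrives, much like an all-gather.
    """
    barrier = threading.Barrier(world_size)
    outcome = {}

    def run(rank):
        try:
            for _key in keys_per_rank[rank]:
                # Blocks until all ranks arrive, or breaks after `timeout`.
                barrier.wait(timeout=timeout)
            outcome[rank] = "ok"
        except threading.BrokenBarrierError:
            # Plays the role of the watchdog timeout in the real run.
            outcome[rank] = "timeout"

    threads = [threading.Thread(target=run, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcome


if __name__ == "__main__":
    # Every rank processes the same two (hypothetical) entries.
    matched = [["fc1.weight", "fc1.exp_avg"]] * 4
    print(simulate_ranks(4, matched))

    # The last rank is missing one entry and so skips one collective.
    skewed = matched[:3] + [["fc1.weight"]]
    print(simulate_ranks(4, skewed))
```

In the skewed case the rank with fewer entries returns normally while the others time out on the extra call. If the real failure has this shape, logging the keys each rank passes to make_fsdp_dtensor() would show which rank diverges.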
Expected behavior
Optimizer checkpoint save should complete successfully with Megatron-FSDP fsdp_dtensor checkpointing when optimizer_cpu_offload=True.
Additional context
A model-only checkpoint with the same Megatron-FSDP configuration succeeds. The hang only appears when saving optimizer state as well:
checkpoint.save_contents='["model"]' # passes
checkpoint.save_contents='["model","optimizer"]' # hangs/fails
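Since the model-only save passes and the failure needs the optimizer state, one way to narrow it down is to compare the optimizer state dict with and without CPU offload before preprocessing runs. The following is a minimal sketch, assuming the state dict is a nested mapping whose leaves expose `device` and `shape` attributes, as tensors do. `summarize_leaves`, `diff_summaries`, and `FakeTensor` are hypothetical names used only for illustration; they are not part of Megatron-LM or VeRL.

```python
from collections.abc import Mapping


def summarize_leaves(state, prefix=""):
    """Flatten a nested mapping into {dotted_key: (type_name, device, shape)}."""
    summary = {}
    for key, value in state.items():
        path = f"{prefix}{key}"
        if isinstance(value, Mapping):
            summary.update(summarize_leaves(value, prefix=f"{path}."))
        else:
            device = getattr(value, "device", None)
            shape = getattr(value, "shape", None)
            summary[path] = (
                type(value).__name__,
                str(device) if device is not None else None,
                tuple(shape) if shape is not None else None,
            )
    return summary


def diff_summaries(a, b):
    """Return sorted keys that are missing on one side or differ in type/device/shape."""
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


class FakeTensor:
    """Hypothetical stand-in for a tensor, so the sketch runs without PyTorch."""

    def __init__(self, device, shape):
        self.device = device
        self.shape = shape


if __name__ == "__main__":
    on_gpu = {"state": {0: {"exp_avg": FakeTensor("cuda:0", (4, 2))}}}
    offloaded = {"state": {0: {"exp_avg": FakeTensor("cpu", (4, 2))}}}
    print(diff_summaries(summarize_leaves(on_gpu), summarize_leaves(offloaded)))
```

Running the same summary on every rank and comparing the results would also show whether ranks disagree about which entries exist, which matters for any code path that issues one collective per entry.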
q35-mfsdp-bug.txt
Environment/config used:
- dev branch, with PR #4623 cherry-picked
- main https://github.com/NVIDIA-NeMo/Megatron-Bridge
- Model: Qwen3.5-35B-A3B
- 4 nodes x 8 GPUs = 32 GPUs
- TP=2, PP=1, CP=2, EP=8, ETP=1
- Megatron-FSDP enabled
- checkpoint contents: ["model", "optimizer"]