Megatron-FSDP optimizer checkpoint hangs in save with optimizer_cpu_offload=True #4910

@conver334

Describe the bug

@NVIDIA/mcore-oncall

Megatron-FSDP fsdp_dtensor checkpoint save hangs/fails when saving the optimizer state with optimizer_cpu_offload=True.

The run finishes the first training step and starts checkpoint saving, but then hangs during FSDP DTensor checkpoint preprocessing of the optimizer state. Eventually the NCCL watchdog aborts an ALLGATHER issued inside validate_uneven_dtensor().

The same setup saves a model-only checkpoint successfully, and the issue is not observed when optimizer CPU offload is disabled. This suggests the problem is specific to the optimizer state dict produced under HybridDeviceOptimizer / optimizer CPU offload and its interaction with FSDP DTensor checkpoint preprocessing, especially the SwiGLU/GDN split path.
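One way to test this hypothesis is to check, just before preprocessing, whether every rank in the shard group holds the same optimizer state keys in the same order. The sketch below is only a diagnostic idea, not a confirmed root cause: it assumes that a rank-dependent key set would make ranks enter the collective in validate_uneven_dtensor() a different number of times. The helper names find_key_mismatches and gather_keys are hypothetical; dist.all_gather_object and dist.get_world_size are the standard torch.distributed calls.

```python
from typing import Dict, List, Optional, Sequence


def find_key_mismatches(per_rank_keys: Sequence[Sequence[str]]) -> Dict[int, dict]:
    """Compare every rank's key list against rank 0 (hypothetical helper).

    Returns {rank: {"missing": [...], "extra": [...], "reordered": bool}}
    for each rank that differs from rank 0. An empty dict means all ranks agree.
    """
    reference = list(per_rank_keys[0])
    ref_set = set(reference)
    report: Dict[int, dict] = {}
    for rank, keys in enumerate(per_rank_keys):
        keys = list(keys)
        key_set = set(keys)
        missing = sorted(ref_set - key_set)  # keys rank 0 has, this rank lacks
        extra = sorted(key_set - ref_set)    # keys this rank has, rank 0 lacks
        # Same key set but a different sequence (order or multiplicity).
        reordered = not missing and not extra and keys != reference
        if missing or extra or reordered:
            report[rank] = {"missing": missing, "extra": extra, "reordered": reordered}
    return report


def gather_keys(local_keys: Sequence[str], group: Optional[object] = None) -> List[List[str]]:
    """Collect each rank's key list. Must be called by every rank in `group`."""
    import torch.distributed as dist  # imported lazily so the pure helper runs without torch

    gathered: List[Optional[List[str]]] = [None] * dist.get_world_size(group=group)
    dist.all_gather_object(gathered, list(local_keys), group=group)
    return gathered  # type: ignore[return-value]
```

In a real run, each rank would pass its flattened optimizer state keys to gather_keys() at a point that all ranks are known to reach, then log the result of find_key_mismatches() before the preprocessing call.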

Steps/Code to reproduce bug

This is the VeRL reproduction; see the Megatron-LM reproduction in the following comment: #4910 (comment)

Environment/config used:

Important overrides:

+engine.override_ddp_config.megatron_fsdp_use_decoupled_grad=True
+engine.override_ddp_config.overlap_grad_reduce=True
+engine.override_ddp_config.overlap_param_gather=True

+optim.override_optimizer_config.optimizer_cpu_offload=True
+optim.override_optimizer_config.optimizer_offload_fraction=1
+optim.override_optimizer_config.overlap_cpu_optimizer_d2h_h2d=True
+optim.override_optimizer_config.use_precision_aware_optimizer=True

trainer.total_training_steps=1
trainer.save_freq=-1
checkpoint.save_contents='["model","optimizer"]'
checkpoint.load_contents='["model","optimizer"]'

Key traceback excerpt:

  [rank12]:   File "/root/mfsdp-intergration/verl/verl/utils/checkpoint/megatron_checkpoint_manager.py", line 471, in _save_megatron_fsdp_checkpoint
  [rank12]:     state_dict = bridge_preprocess_fsdp_dtensor_state_dict(self.transformer_config, state_dict, checkpoint_model)
  [rank12]:   File "/root/mfsdp-intergration/Megatron-Bridge/src/megatron/bridge/training/checkpointing.py", line 1642, in preprocess_fsdp_dtensor_state_dict
  [rank12]:     model_state_dict, optimizer_state_dict = handle_swiglu_in_state_dict(
  [rank12]:   File "/root/mfsdp-intergration/Megatron-LM/megatron/core/transformer/fsdp_dtensor_checkpoint.py", line 403, in handle_swiglu_in_state_dict
  [rank12]:     weight_w, weight_v = split_swiglu_linear_fc1(
  [rank12]:   File "/root/mfsdp-intergration/Megatron-LM/megatron/core/transformer/fsdp_dtensor_checkpoint.py", line 342, in split_swiglu_linear_fc1
  [rank12]:     weight_w = make_fsdp_dtensor(
  [rank12]:   File "/root/mfsdp-intergration/Megatron-LM/megatron/core/distributed/fsdp/src/megatron_fsdp/param_and_grad_buffer.py", line 4706, in make_fsdp_dtensor
  [rank12]:     validate_uneven_dtensor(fsdp_tensor)
  [rank12]:   File "/root/mfsdp-intergration/Megatron-LM/megatron/core/distributed/fsdp/src/megatron_fsdp/uneven_dtensor.py", line 158, in validate_uneven_dtensor
  [rank12]:     chunk_meta = gather_and_compute_chunk_metadata(dtensor)
  [rank12]:   File "/root/mfsdp-intergration/Megatron-LM/megatron/core/distributed/fsdp/src/megatron_fsdp/uneven_dtensor.py", line 85, in gather_and_compute_chunk_metadata
  [rank12]:     _update_offsets_and_cumulative_shape(mesh_dim, offsets, cumulative_shape)
  [rank12]:   File "/root/mfsdp-intergration/Megatron-LM/megatron/core/distributed/fsdp/src/megatron_fsdp/uneven_dtensor.py", line 51, in _update_offsets_and_cumulative_shape
  [rank12]:     dist.all_gather_object(global_shapes, cumulative_shape, group=shard_group)
  [rank12]: torch.distributed.DistBackendError: NCCL communicator was aborted on rank 1.

NCCL watchdog excerpt:

  [rank5]:[E519 02:14:37.325135912 ProcessGroupNCCL.cpp:688] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=26523, OpType=ALLGATHER, NumelIn=1, NumelOut=4, Timeout(ms)=1800000) ran for 1800010 milliseconds before timing out.
  [rank5]:[E519 02:14:37.325225845 ProcessGroupNCCL.cpp:2277] [PG ID 16 PG GUID 222(EXPERT_DATA_PARALLEL_GROUP) Rank 0] failure detected by watchdog at work sequence id: 26523 PG status: last enqueued work: 26523, last completed work: 26522
  [rank10]:[E519 02:14:37.316179584 ProcessGroupNCCL.cpp:688] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=26523, OpType=ALLGATHER, NumelIn=1, NumelOut=4, Timeout(ms)=1800000) ran for 1800010 milliseconds before timing out.
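When many ranks emit watchdog lines, it helps to group them by collective to see which ranks were stuck on the same operation. The following is a minimal sketch that parses lines shaped like the excerpt above; the regular expression is written against these sample lines only, so other PyTorch versions may format the message differently. The function names are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

# Pattern derived from the watchdog excerpt in this issue (an assumption,
# not a documented log format).
_TIMEOUT_RE = re.compile(
    r"\[rank(?P<rank>\d+)\]:.*?"
    r"WorkNCCL\(SeqNum=(?P<seq>\d+), OpType=(?P<op>\w+), "
    r"NumelIn=(?P<numel_in>\d+), NumelOut=(?P<numel_out>\d+), "
    r"Timeout\(ms\)=(?P<timeout_ms>\d+)\) "
    r"ran for (?P<elapsed_ms>\d+) milliseconds"
)


def parse_watchdog_timeout(line: str) -> Optional[dict]:
    """Extract the fields of one watchdog timeout line, or None if it does not match."""
    m = _TIMEOUT_RE.search(line)
    if m is None:
        return None
    return {
        "rank": int(m["rank"]),            # global rank from the "[rankN]:" prefix
        "seq": int(m["seq"]),
        "op": m["op"],
        "numel_in": int(m["numel_in"]),
        "numel_out": int(m["numel_out"]),
        "timeout_min": int(m["timeout_ms"]) / 60000,
        "elapsed_ms": int(m["elapsed_ms"]),
    }


def stuck_collectives(lines: Iterable[str]) -> Dict[Tuple[int, str], List[int]]:
    """Map (SeqNum, OpType) to the sorted global ranks that timed out on it."""
    groups: Dict[Tuple[int, str], List[int]] = defaultdict(list)
    for line in lines:
        info = parse_watchdog_timeout(line)
        if info is not None:
            groups[(info["seq"], info["op"])].append(info["rank"])
    return {key: sorted(ranks) for key, ranks in groups.items()}
```

Applied to the two timeout lines above, this groups global ranks 5 and 10 under the same ALLGATHER with SeqNum 26523 and reports the configured timeout as 30 minutes.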

Expected behavior

Optimizer checkpoint save should complete successfully with Megatron-FSDP fsdp_dtensor checkpointing when optimizer_cpu_offload=True.

Additional context

A model-only checkpoint with the same Megatron-FSDP configuration succeeds. The hang only appears when saving optimizer state as well:

  checkpoint.save_contents='["model"]'      # passes
  checkpoint.save_contents='["model","optimizer"]'  # hangs/fails
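To narrow the failure down, the two switches that matter here (optimizer CPU offload and the checkpoint contents) can be varied independently. The sketch below only generates the override strings for such a comparison; it does not launch anything. The override keys are copied from this report, while the helper names and the optimizer_cpu_offload=False form are assumptions.

```python
import json
from typing import List, Sequence


def build_overrides(save_contents: Sequence[str], cpu_offload: bool) -> List[str]:
    """Return the two override strings that differ between comparison runs."""
    # Compact JSON matches the quoting used in the report: ["model","optimizer"]
    contents = json.dumps(list(save_contents), separators=(",", ":"))
    return [
        f"+optim.override_optimizer_config.optimizer_cpu_offload={cpu_offload}",
        f"checkpoint.save_contents='{contents}'",
    ]


def comparison_matrix() -> List[List[str]]:
    """All four combinations of offload on/off and model-only vs. model+optimizer."""
    return [
        build_overrides(contents, offload)
        for offload in (True, False)
        for contents in (["model"], ["model", "optimizer"])
    ]


if __name__ == "__main__":
    for overrides in comparison_matrix():
        print(" ".join(overrides))
```

Of these four combinations, the report covers offload enabled with model-only contents (passes) and offload enabled with model plus optimizer (hangs), and states that the issue is not observed with offload disabled.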

q35-mfsdp-bug.txt
