
feat: online load/save huggingface model weights for Megatron-FSDP #1910

Merged: yaoyu-33 merged 64 commits into NVIDIA-NeMo:main from conver334:mfsdp_to_hf on Apr 20, 2026



@conver334 (Contributor) commented Jan 12, 2026

What does this PR do?

Adds support for loading and saving Hugging Face model weights online for Megatron-FSDP models.

API

See examples/conversion/hf_fsdp_roundtrip.py for detailed usage.

# hf_model_id, ddp_config, and save_path are defined by the caller.
from megatron.bridge import AutoBridge

# Option 1: init a Megatron-FSDP model with random weights, then load HF weights
bridge = AutoBridge.from_hf_pretrained(hf_model_id)
model_provider = bridge.to_megatron_provider(load_weights=False)
megatron_model = model_provider.provide_distributed_model(
    ddp_config=ddp_config,
    use_megatron_fsdp=True,
    use_torch_fsdp2=False,
    overlap_param_gather_with_optimizer_step=False,
    data_parallel_random_init=False,
)
bridge.load_hf_weights(megatron_model)

# Option 2: init the GPT model with HF weights, then wrap it with Megatron-FSDP
bridge = AutoBridge.from_hf_pretrained(hf_model_id)
model_provider = bridge.to_megatron_provider(load_weights=True)
megatron_model = model_provider.provide_distributed_model(
    ddp_config=ddp_config,
    use_megatron_fsdp=True,
    use_torch_fsdp2=False,
    overlap_param_gather_with_optimizer_step=False,
    data_parallel_random_init=False,
)

# Export HF weights
bridge.save_hf_pretrained(megatron_model, save_path)
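
To make the load/export round trip above more concrete, here is a toy, framework-free sketch of the general idea behind loading a full weight into per-rank shards and gathering it back for export. It assumes an even flat split with a shorter last shard; the helper names (`shard_bounds`, `load_local_shard`, `gather_full_weight`) are hypothetical, and this is not the PR's implementation, which is not shown in this conversation.

```python
from typing import List, Tuple


def shard_bounds(numel: int, rank: int, world_size: int) -> Tuple[int, int]:
    """Return the [start, end) slice of a flat parameter owned by `rank`."""
    per_rank = -(-numel // world_size)  # ceiling division
    start = min(rank * per_rank, numel)
    end = min(start + per_rank, numel)
    return start, end


def load_local_shard(full_weight: List[float], rank: int, world_size: int) -> List[float]:
    """Copy only this rank's slice of a full (HF-style) flat weight."""
    start, end = shard_bounds(len(full_weight), rank, world_size)
    return full_weight[start:end]


def gather_full_weight(shards: List[List[float]]) -> List[float]:
    """Reassemble the full weight from per-rank shards, in rank order."""
    full: List[float] = []
    for shard in shards:
        full.extend(shard)
    return full


# Simulate 3 ranks with a weight whose size is not divisible by 3.
hf_weight = [0.1 * i for i in range(10)]
world_size = 3
shards = [load_local_shard(hf_weight, r, world_size) for r in range(world_size)]
roundtrip = gather_full_weight(shards)
```

In this toy setup each simulated rank only ever holds its own slice, and concatenating the slices in rank order recovers the original weight, which is the property a round-trip script would check.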

Test

srun -A coreai_dlalgo_mcore -p interactive --time=01:00:00 --gpus-per-node=8 --container-image="nvcr.io#nvidia/nemo:25.11" --container-mounts=/lustre:/lustre -J coreai_dlalgo_mcore:qwen3_vl --pty bash

export MEGATRON_BRIDGE_PATH=/your_path/Megatron-Bridge
export MEGATRON_LM_PATH=/lustre/your_path/Megatron-LM
export HF_HOME=/your_path/models
unset CUDA_DEVICE_MAX_CONNECTIONS
export PYTHONPATH=${MEGATRON_BRIDGE_PATH}/src:${MEGATRON_LM_PATH}:${PYTHONPATH}
cd ${MEGATRON_BRIDGE_PATH}/

python -m torch.distributed.run --nproc_per_node=8 examples/conversion/hf_fsdp_roundtrip.py --hf-model-id Qwen/Qwen3-VL-30B-A3B-Instruct --ep 8
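
The launch line above can also be assembled programmatically, for example when sweeping over models or parallel sizes. The sketch below is only an illustration: `build_roundtrip_command` is a hypothetical helper, and the check that `--ep` evenly divides the number of processes is an assumption about this configuration (no other parallelism in use), not something stated in the PR.

```python
from typing import List


def build_roundtrip_command(hf_model_id: str, nproc_per_node: int, ep: int) -> List[str]:
    """Build the argv list for the round-trip example shown above."""
    if nproc_per_node <= 0 or ep <= 0:
        raise ValueError("nproc_per_node and ep must be positive")
    # Assumption: the expert-parallel size must evenly divide the process count.
    if nproc_per_node % ep != 0:
        raise ValueError(f"ep={ep} does not divide nproc_per_node={nproc_per_node}")
    return [
        "python", "-m", "torch.distributed.run",
        f"--nproc_per_node={nproc_per_node}",
        "examples/conversion/hf_fsdp_roundtrip.py",
        "--hf-model-id", hf_model_id,
        "--ep", str(ep),
    ]


cmd = build_roundtrip_command("Qwen/Qwen3-VL-30B-A3B-Instruct", nproc_per_node=8, ep=8)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the Megatron-Bridge checkout after the environment variables above are set.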

Summary by CodeRabbit

  • New Features

    • Added example scripts for round-trip model conversion between Hugging Face and Megatron FSDP.
    • Added text generation example workflow with distributed training setup.
    • Added support for Fully Sharded Data Parallel Megatron models.
    • Added example YAML configurations for Qwen3-VL model training.
  • Improvements

    • Enhanced weight verification with shape validation and improved dtype handling.
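
As an illustration of what weight verification with shape validation and dtype handling can look like, here is a minimal framework-free sketch. The `Weight` container, the `verify_weights` helper, and the per-dtype tolerances are all hypothetical and chosen for the example; the PR's actual verification code is not shown in this conversation.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Weight:
    shape: Tuple[int, ...]
    dtype: str         # e.g. "float32" or "bfloat16"
    data: List[float]  # flattened values


# Hypothetical absolute tolerances per dtype (illustrative values only).
TOLERANCE = {"float32": 1e-6, "float16": 1e-3, "bfloat16": 1e-2}


def verify_weights(reference: Dict[str, Weight], exported: Dict[str, Weight]) -> List[str]:
    """Return a list of human-readable problems; an empty list means a match."""
    problems: List[str] = []
    for name, ref in reference.items():
        out = exported.get(name)
        if out is None:
            problems.append(f"{name}: missing from export")
            continue
        # Validate shapes first so value comparison never runs on mismatched data.
        if ref.shape != out.shape:
            problems.append(f"{name}: shape {out.shape} != {ref.shape}")
            continue
        # When dtypes differ, compare with the looser of the two tolerances.
        tol = max(TOLERANCE.get(ref.dtype, 1e-6), TOLERANCE.get(out.dtype, 1e-6))
        if any(not math.isclose(a, b, rel_tol=0.0, abs_tol=tol)
               for a, b in zip(ref.data, out.data)):
            problems.append(f"{name}: values differ beyond tolerance {tol}")
    for name in sorted(exported.keys() - reference.keys()):
        problems.append(f"{name}: unexpected in export")
    return problems


reference = {
    "w": Weight((2, 2), "float32", [1.0, 2.0, 3.0, 4.0]),
    "b": Weight((2,), "float32", [0.5, 0.25]),
}
exported = {
    "w": Weight((2, 2), "bfloat16", [1.0, 2.0, 3.0, 4.005]),
    "b": Weight((4,), "float32", [0.5, 0.25, 0.0, 0.0]),
}
problems = verify_weights(reference, exported)
```

Here `w` passes because the difference is within the looser bfloat16 tolerance, while `b` is reported for its shape mismatch before any values are compared.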


copy-pr-bot (Bot) commented Jan 12, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@BoxiangW

Contributor

/ok to test 20d9b52

@conver334 force-pushed the mfsdp_to_hf branch 2 times, most recently from b2a5ee5 to 49244f5, on January 16, 2026 07:26
@conver334 changed the title from "Draft conversion of fsdp_dtensor and HF" to "Online load/save huggingface model weights for Megatron-FSDP" on Jan 19, 2026
@conver334 changed the title from "Online load/save huggingface model weights for Megatron-FSDP" to "feat: online load/save huggingface model weights for Megatron-FSDP" on Jan 19, 2026
Signed-off-by: conver334 <conver334@gmail.com>
@BoxiangW

Contributor

/ok to test fe1934e

Signed-off-by: Boxiang Wang <boxiangw@nvidia.com>
@BoxiangW

Contributor

/ok to test dfccfd5

Signed-off-by: conver334 <conver334@gmail.com>
@yaoyu-33

Contributor

@conver334 These changes need CI tests; for this one we will need both functional and unit tests. cc @BoxiangW

rhmukundan and others added 9 commits April 10, 2026 15:36
…eMo#3252)

Signed-off-by: Raghav Hrishikeshan Mukundan <rmukundan@nvidia.com>
Signed-off-by: Parth Mannan <pmannan@nvidia.com>
Signed-off-by: Changlong <changlyu@amazon.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Co-authored-by: Yan Bai <baiyan1996@icloud.com>
Co-authored-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com>
Co-authored-by: yaoyu-33 <yaoyu.094@gmail.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…Mo#3249)

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…crash (NVIDIA-NeMo#3254)

Co-authored-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: zhangyuekai <zhangyuekai@foxmail.com>
Signed-off-by: root <root@h20-2.cm.cluster>
Signed-off-by: root <zhangyuekai@foxmail.com>
Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
@yaoyu-33

Contributor

/ok to test 49e23d3

conver334 and others added 4 commits April 13, 2026 00:51
Signed-off-by: conver334 <conver334@gmail.com>
@yaoyu-33

Contributor

/ok to test ffa9649

Signed-off-by: conver334 <conver334@gmail.com>
@yaoyu-33

Contributor

/ok to test 1f3e872

yaoyu-33 previously approved these changes Apr 16, 2026
Signed-off-by: conver334 <conver334@gmail.com>
@yaoyu-33

Contributor

/ok to test ae8295c

@yaoyu-33

Contributor

/ok to test 525921a

@yaoyu-33

Contributor

Deferring the coverage fix to the next PR.


Labels

ready-to-merge: PR is approved, current, and only waiting for CI to pass before merge
