feat: online load/save huggingface model weights for Megatron-FSDP by conver334 · Pull Request #1910 · NVIDIA-NeMo/Megatron-Bridge

conver334 · 2026-01-12T15:25:12Z

What does this PR do?

Adds support for loading Hugging Face model weights into a Megatron-FSDP model, and saving them back out, online (directly on the sharded, distributed model).

API

See examples/conversion/hf_fsdp_roundtrip.py for detailed usage.

from megatron.bridge import AutoBridge

# hf_model_id, ddp_config, and save_path are supplied by the caller.

# Option 1: initialize a random Megatron-FSDP model, then load the HF weights
bridge = AutoBridge.from_hf_pretrained(hf_model_id)
model_provider = bridge.to_megatron_provider(load_weights=False)
megatron_model = model_provider.provide_distributed_model(
    ddp_config=ddp_config,
    use_megatron_fsdp=True,
    use_torch_fsdp2=False,
    overlap_param_gather_with_optimizer_step=False,
    data_parallel_random_init=False,
)
bridge.load_hf_weights(megatron_model)

# Option 2: initialize the GPT model with HF weights, then wrap it with Megatron-FSDP
bridge = AutoBridge.from_hf_pretrained(hf_model_id)
model_provider = bridge.to_megatron_provider(load_weights=True)
megatron_model = model_provider.provide_distributed_model(
    ddp_config=ddp_config,
    use_megatron_fsdp=True,
    use_torch_fsdp2=False,
    overlap_param_gather_with_optimizer_step=False,
    data_parallel_random_init=False,
)

# Export HF weights
bridge.save_hf_pretrained(megatron_model, save_path)

Test

srun -A coreai_dlalgo_mcore -p interactive --time=01:00:00 --gpus-per-node=8 --container-image="nvcr.io#nvidia/nemo:25.11" --container-mounts=/lustre:/lustre -J coreai_dlalgo_mcore:qwen3_vl --pty bash

export MEGATRON_BRIDGE_PATH=/your_path/Megatron-Bridge
export MEGATRON_LM_PATH=/lustre/your_path/Megatron-LM
export HF_HOME=/your_path/models
unset CUDA_DEVICE_MAX_CONNECTIONS
export PYTHONPATH=${MEGATRON_BRIDGE_PATH}/src:${MEGATRON_LM_PATH}:${PYTHONPATH}
cd ${MEGATRON_BRIDGE_PATH}/

python -m torch.distributed.run --nproc_per_node=8 examples/conversion/hf_fsdp_roundtrip.py --hf-model-id Qwen/Qwen3-VL-30B-A3B-Instruct --ep 8
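The command above launches 8 ranks with `--ep 8`. As a rough illustration of how a launcher script might parse these two flags and sanity-check them before building the model, here is a minimal sketch. It is not the code in `hf_fsdp_roundtrip.py`: the function names `parse_args` and `check_expert_parallel` are hypothetical, and the divisibility rule is a simplification that ignores tensor, pipeline, and context parallelism.

```python
import argparse


def parse_args(argv=None):
    """Parse the two flags used in the test command (hypothetical sketch)."""
    parser = argparse.ArgumentParser(description="HF <-> Megatron-FSDP round trip (sketch)")
    parser.add_argument("--hf-model-id", required=True, help="Hugging Face model ID or local path")
    parser.add_argument("--ep", type=int, default=1, help="expert-parallel size")
    return parser.parse_args(argv)


def check_expert_parallel(world_size: int, ep: int) -> int:
    """Return how many expert-parallel groups fit into world_size ranks.

    Assumption: only expert parallelism is configured, so the world size
    must be a multiple of the expert-parallel size.
    """
    if ep < 1:
        raise ValueError(f"--ep must be positive, got {ep}")
    if world_size % ep != 0:
        raise ValueError(f"world size {world_size} is not divisible by --ep {ep}")
    return world_size // ep


if __name__ == "__main__":
    args = parse_args(["--hf-model-id", "Qwen/Qwen3-VL-30B-A3B-Instruct", "--ep", "8"])
    # 8 ranks with --ep 8 form a single expert-parallel group.
    print(check_expert_parallel(world_size=8, ep=args.ep))
```

Failing early on a mismatched `--ep` is cheaper than discovering it after the HF checkpoint has been downloaded and sharded.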

Summary by CodeRabbit

New Features
- Added example scripts for round-trip model conversion between Hugging Face and Megatron-FSDP.
- Added text generation example workflow with distributed training setup.
- Added support for Fully Sharded Data Parallel Megatron models.
- Added example YAML configurations for Qwen3-VL model training.

Improvements
- Enhanced weight verification with shape validation and improved dtype handling.
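A round-trip weight check with shape validation could look roughly like the following. This is a minimal sketch under stated assumptions, not the verification code from this PR: `verify_roundtrip` and the `values_close` callback are hypothetical names, and both inputs are assumed to be mappings from parameter name to any object exposing `.shape` and `.dtype` (as torch tensors do).

```python
from typing import Any, Callable, Mapping


def verify_roundtrip(
    reference: Mapping[str, Any],
    exported: Mapping[str, Any],
    values_close: Callable[[Any, Any], bool],
) -> list[str]:
    """Return human-readable problems; an empty list means the weights match."""
    problems = []
    # Names present on only one side usually point to a gap in the name mapping.
    for name in sorted(set(reference) ^ set(exported)):
        side = "reference" if name in reference else "exported"
        problems.append(f"{name}: only present in {side} weights")
    for name in sorted(set(reference) & set(exported)):
        ref, out = reference[name], exported[name]
        # Validate shapes first so the value comparison never runs on
        # mismatched (and possibly silently broadcast) tensors.
        if tuple(ref.shape) != tuple(out.shape):
            problems.append(f"{name}: shape {tuple(ref.shape)} != {tuple(out.shape)}")
            continue
        # Report dtype drift, but still compare values; values_close is
        # expected to cast both sides to a common dtype before comparing.
        if ref.dtype != out.dtype:
            problems.append(f"{name}: dtype {ref.dtype} != {out.dtype}")
        if not values_close(ref, out):
            problems.append(f"{name}: values differ")
    return problems
```

With torch tensors, `values_close` could be a small wrapper that casts both arguments to float32 and calls `torch.allclose`; the tolerance to use depends on the training dtype and is left to the caller.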


copy-pr-bot · 2026-01-12T15:25:16Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

BoxiangW · 2026-01-13T21:15:39Z

/ok to test 20d9b52

Signed-off-by: conver334 <conver334@gmail.com>

BoxiangW · 2026-01-23T20:07:15Z

/ok to test fe1934e

Signed-off-by: Boxiang Wang <boxiangw@nvidia.com>

BoxiangW · 2026-01-23T21:14:23Z

/ok to test dfccfd5

Signed-off-by: conver334 <conver334@gmail.com>

yaoyu-33 · 2026-01-28T17:25:28Z

@conver334 These changes need CI tests; for this one we will need both functional and unit tests. cc @BoxiangW

Signed-off-by: Changlong <changlyu@amazon.com> Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Co-authored-by: Yan Bai <baiyan1996@icloud.com> Co-authored-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com> Co-authored-by: yaoyu-33 <yaoyu.094@gmail.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

yaoyu-33 · 2026-04-10T22:36:42Z

/ok to test 49e23d3

Signed-off-by: conver334 <conver334@gmail.com>

yaoyu-33 · 2026-04-13T16:30:53Z

/ok to test ffa9649

Signed-off-by: conver334 <conver334@gmail.com>

yaoyu-33 · 2026-04-16T16:49:28Z

/ok to test 1f3e872

Signed-off-by: conver334 <conver334@gmail.com>

yaoyu-33 · 2026-04-17T15:22:41Z

/ok to test ae8295c

yaoyu-33 · 2026-04-20T01:10:04Z

/ok to test 525921a

yaoyu-33 · 2026-04-20T03:10:50Z

Deferring the coverage fix to the next PR.

github-actions Bot added the community-request label Jan 12, 2026

BoxiangW requested a review from yaoyu-33 January 13, 2026 21:15

copy-pr-bot Bot temporarily deployed to nemo-ci January 13, 2026 21:16 Inactive

conver334 force-pushed the mfsdp_to_hf branch 2 times, most recently from b2a5ee5 to 49244f5 Compare January 16, 2026 07:26

conver334 changed the title ~~Draft conversion of fsdp_dtensor and HF~~ Online load/save huggingface model weights for Megatron-FSDP Jan 19, 2026

conver334 changed the title ~~Online load/save huggingface model weights for Megatron-FSDP~~ feat: online load/save huggingface model weights for Megatron-FSDP Jan 19, 2026

conver334 added 6 commits January 19, 2026 03:24

draft of mfsdp to hf

98554ef

Signed-off-by: conver334 <conver334@gmail.com>

Temporary test for fsdp -> hf

8de5507

Signed-off-by: conver334 <conver334@gmail.com>

add qwen3_vl_8b config

eb4b9e6

Signed-off-by: conver334 <conver334@gmail.com>

fix qwen3 vl moe bug

23defea

Signed-off-by: conver334 <conver334@gmail.com>

Round-trip conversion between Hugging Face and Megatron FSDP.

30c892a

Signed-off-by: conver334 <conver334@gmail.com>

load hf model

dfd956f

Signed-off-by: conver334 <conver334@gmail.com>

conver334 force-pushed the mfsdp_to_hf branch from 193f50c to dfd956f Compare January 19, 2026 11:24

Moonlight-16B-A3B-Instruct inv_freq fix

fe1934e

Signed-off-by: conver334 <conver334@gmail.com>

copy-pr-bot Bot had a problem deploying to nemo-ci January 23, 2026 20:07 Error

Lint

dfccfd5

Signed-off-by: Boxiang Wang <boxiangw@nvidia.com>

copy-pr-bot Bot temporarily deployed to nemo-ci January 23, 2026 21:14 Inactive

copy-pr-bot Bot temporarily deployed to test January 23, 2026 21:15 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci January 23, 2026 21:24 Inactive

copy-pr-bot Bot had a problem deploying to nemo-ci January 23, 2026 23:30 Failure

conver334 added 3 commits January 28, 2026 02:00

add text generate file

deceba6

Signed-off-by: conver334 <conver334@gmail.com>

reverse qwen3 and deepseek change

0c1dc81

Signed-off-by: conver334 <conver334@gmail.com>

merge

908c7ef

Signed-off-by: conver334 <conver334@gmail.com>

Merge branch 'main' into mfsdp_to_hf

831676a

rhmukundan and others added 9 commits April 10, 2026 15:36

perf: Set micro_batch_size=2 for qwen3 235B B300 V2 configs (NVIDIA-N…

3b79556

…eMo#3252) Signed-off-by: Raghav Hrishikeshan Mukundan <rmukundan@nvidia.com>

Add Hybrid CP arg to initialize_model_parallel (NVIDIA-NeMo#3259)

815f159

Signed-off-by: Parth Mannan <pmannan@nvidia.com>

[doc] chore: add Bailing MoE V2 news and fix chronological order (NVI…

d04801c

…DIA-NeMo#3260) Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

[perf] fix: guard cuda_graph_scope validation against None (NVIDIA-Ne…

43d3691

…Mo#3249) Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

chore(beep boop 🤖): Bump (main, mcore-main) (2026-04-10)

b13939e

[data] fix: enable WAN sequence packing by default to fix rope_utils …

831d824

…crash (NVIDIA-NeMo#3254) Co-authored-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

Support Qwen3-ASR Megatron Bridge (NVIDIA-NeMo#2836)

de2aff3

Signed-off-by: zhangyuekai <zhangyuekai@foxmail.com> Signed-off-by: root <root@h20-2.cm.cluster> Signed-off-by: root <zhangyuekai@foxmail.com> Signed-off-by: Yuekai Zhang <zhangyuekai@foxmail.com>

[test] feat: add active launch script for FSDP converter functional test

49e23d3

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

yxs mentioned this pull request Apr 10, 2026

[megatron] feat: support Megatron-FSDP mode for Megatron backend verl-project/verl#5423

Merged

8 tasks

conver334 and others added 4 commits April 13, 2026 00:51

move megatron-fsdp tests into examples/conversion/fsdp

4b22b9d

Signed-off-by: conver334 <conver334@gmail.com>

remove duplicate file

5ed265b

Signed-off-by: conver334 <conver334@gmail.com>

Merge remote-tracking branch 'origin/main' into mfsdp_to_hf

1994fba

Signed-off-by: conver334 <conver334@gmail.com>

Merge branch 'main' into mfsdp_to_hf

ffa9649

conver334 added 2 commits April 16, 2026 00:58

update test_hf_fsdp_conversion.py to fix is_torch_fx_available error

74006d0

Signed-off-by: conver334 <conver334@gmail.com>

Merge remote-tracking branch 'origin/main' into mfsdp_to_hf

1f3e872

Signed-off-by: conver334 <conver334@gmail.com>

yaoyu-33 previously approved these changes Apr 16, 2026

conver334 added 2 commits April 17, 2026 00:54

fix CI warning

8ca54b3

Signed-off-by: conver334 <conver334@gmail.com>

Merge remote-tracking branch 'origin/main' into mfsdp_to_hf

ae8295c

Signed-off-by: conver334 <conver334@gmail.com>

yaoyu-33 approved these changes Apr 17, 2026

Merge branch 'main' into mfsdp_to_hf

525921a

yaoyu-33 mentioned this pull request Apr 20, 2026

Revert: online load/save huggingface model weights for Megatron-FSDP (#1910) #3418

Merged

1 task

conver334 mentioned this pull request Apr 24, 2026

feat: HuggingFace ↔ Megatron-FSDP weight conversion #3512

Merged

5 tasks

cuichenx mentioned this pull request May 8, 2026

[NeMo FW 26.06 Release] MBridge v0.5.0 Roadmap #3754

Open

yaoyu-33 merged 64 commits into NVIDIA-NeMo:main from conver334:mfsdp_to_hf

15 participants
