fix(mfsdp): skip tokenizer save in convert_checkpoints_fsdp.py by wplf · Pull Request #3987 · NVIDIA-NeMo/Megatron-Bridge

wplf · 2026-05-27T03:23:05Z

examples/conversion/convert_checkpoints.py does not bundle the tokenizer with the converted Megatron checkpoint; the mfsdp variant should match. Drop the hf_tokenizer_path / hf_tokenizer_kwargs arguments from the save_native_megatron_model call (and the unused upstream kwargs collection block). The HuggingFace source ID is still recorded in run_config.yaml, so downstream consumers can fetch the tokenizer separately via huggingface_hub when needed.

This also sidesteps an AttributeError raised at the build_tokenizer -> _set_padded_vocab_size -> vocab_size_with_padding chain when Bridge's dataclass-based TokenizerConfig is passed to Megatron-LM's build_tokenizer (which expects argparse-style attrs make_vocab_size_divisible_by / tensor_model_parallel_size / rank).

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Changelog

Add specific line by line info of high level changes in this PR.

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. A Nvidia developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items you can still open "Draft" PR.

Additional Information

Related to # (issue)

copy-pr-bot · 2026-05-27T03:23:08Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

wplf · 2026-05-27T03:23:51Z

/ok to test ee3daa8

Document the HF -> Megatron-FSDP DTensor conversion path needed before pretraining from pretrained weights: setup (clone Bridge, pin its 3rdparty/Megatron-LM submodule to this branch), the `torchrun convert_checkpoints_fsdp.py import` command with EP=8 default topology, expected output layout, and the open Bridge dependency (NVIDIA-NeMo/Megatron-Bridge#3987) to skip the post-save tokenizer build that otherwise crashes on this branch.

conver334 · 2026-05-29T03:07:13Z

LGTM, thanks!

`examples/conversion/convert_checkpoints.py` does not bundle the tokenizer with the converted Megatron checkpoint; the mfsdp variant should match. Drop the `hf_tokenizer_path` / `hf_tokenizer_kwargs` arguments from the `save_native_megatron_model` call (and the unused upstream kwargs collection block). The HuggingFace source ID is still recorded in `run_config.yaml`, so downstream consumers can fetch the tokenizer separately via `huggingface_hub` when needed. This also sidesteps an `AttributeError` raised at the `build_tokenizer -> _set_padded_vocab_size -> vocab_size_with_padding` chain when Bridge's dataclass-based `TokenizerConfig` is passed to Megatron-LM's `build_tokenizer` (which expects argparse-style attrs `make_vocab_size_divisible_by` / `tensor_model_parallel_size` / `rank`). Signed-off-by: Jinliang Li <jinliangl@nvidia.com>

wplf · 2026-05-29T09:44:26Z

/ok to test a558274

…A-NeMo#3987) Signed-off-by: Jinliang Li <jinliangl@nvidia.com> Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>

wplf requested a review from conver334 May 27, 2026 03:23

copy-pr-bot Bot temporarily deployed to public May 27, 2026 03:24 Inactive

copy-pr-bot Bot temporarily deployed to test May 27, 2026 03:24 Inactive

wplf mentioned this pull request May 27, 2026

[dev] [5/5] Qwen3.5 support: Qwen3.5-VL training example NVIDIA/Megatron-LM#4751

Merged

4 tasks

copy-pr-bot Bot temporarily deployed to public May 27, 2026 03:32 Inactive

yaoyu-33 added area:ckpt Checkpoint conversion, loading, export, and save paths bug Something isn't working needs-review PR is ready for code review and waiting on a reviewer labels May 27, 2026

conver334 approved these changes May 29, 2026

View reviewed changes

wplf force-pushed the fix/mfsdp-skip-tokenizer-save branch from b06c38b to a558274 Compare May 29, 2026 09:43

copy-pr-bot Bot temporarily deployed to public May 29, 2026 09:44 Inactive

copy-pr-bot Bot temporarily deployed to test May 29, 2026 09:45 Inactive

wplf enabled auto-merge (squash) May 29, 2026 09:48

copy-pr-bot Bot temporarily deployed to public May 29, 2026 09:52 Inactive

wplf merged commit 4c3c081 into NVIDIA-NeMo:main May 29, 2026
96 of 98 checks passed

wplf deleted the fix/mfsdp-skip-tokenizer-save branch June 4, 2026 12:13

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(mfsdp): skip tokenizer save in convert_checkpoints_fsdp.py#3987

fix(mfsdp): skip tokenizer save in convert_checkpoints_fsdp.py#3987
wplf merged 1 commit into
NVIDIA-NeMo:mainfrom
wplf:fix/mfsdp-skip-tokenizer-save

wplf commented May 27, 2026

Uh oh!

copy-pr-bot Bot commented May 27, 2026

Uh oh!

wplf commented May 27, 2026

Uh oh!

conver334 commented May 29, 2026

Uh oh!

wplf commented May 29, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

wplf commented May 27, 2026

What does this PR do ?

Changelog

GitHub Actions CI

Before your PR is "Ready for review"

Additional Information

Uh oh!

copy-pr-bot Bot commented May 27, 2026

Uh oh!

wplf commented May 27, 2026

Uh oh!

conver334 commented May 29, 2026

Uh oh!

wplf commented May 29, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants