Skip to content

fix(mfsdp): skip tokenizer save in convert_checkpoints_fsdp.py#3987

Merged
wplf merged 1 commit into
NVIDIA-NeMo:mainfrom
wplf:fix/mfsdp-skip-tokenizer-save
May 29, 2026
Merged

fix(mfsdp): skip tokenizer save in convert_checkpoints_fsdp.py#3987
wplf merged 1 commit into
NVIDIA-NeMo:mainfrom
wplf:fix/mfsdp-skip-tokenizer-save

Conversation

@wplf

@wplf wplf commented May 27, 2026

Copy link
Copy Markdown
Contributor

examples/conversion/convert_checkpoints.py does not bundle the tokenizer with the converted Megatron checkpoint; the mfsdp variant should match. Drop the hf_tokenizer_path / hf_tokenizer_kwargs arguments from the save_native_megatron_model call (and the unused upstream kwargs collection block). The HuggingFace source ID is still recorded in run_config.yaml, so downstream consumers can fetch the tokenizer separately via huggingface_hub when needed.

This also sidesteps an AttributeError raised at the build_tokenizer -> _set_padded_vocab_size -> vocab_size_with_padding chain when Bridge's dataclass-based TokenizerConfig is passed to Megatron-LM's build_tokenizer (which expects argparse-style attrs make_vocab_size_divisible_by / tensor_model_parallel_size / rank).

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Changelog

  • Add specific line by line info of high level changes in this PR.

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. A Nvidia developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items you can still open "Draft" PR.

Additional Information

  • Related to # (issue)

@wplf wplf requested a review from conver334 May 27, 2026 03:23
@copy-pr-bot

copy-pr-bot Bot commented May 27, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@wplf

wplf commented May 27, 2026

Copy link
Copy Markdown
Contributor Author

/ok to test ee3daa8

wplf added a commit to wplf/Megatron-LM that referenced this pull request May 27, 2026
Document the HF -> Megatron-FSDP DTensor conversion path needed before
pretraining from pretrained weights: setup (clone Bridge, pin its
3rdparty/Megatron-LM submodule to this branch), the `torchrun
convert_checkpoints_fsdp.py import` command with EP=8 default topology,
expected output layout, and the open Bridge dependency
(NVIDIA-NeMo/Megatron-Bridge#3987) to skip the post-save tokenizer
build that otherwise crashes on this branch.
@yaoyu-33 yaoyu-33 added area:ckpt Checkpoint conversion, loading, export, and save paths bug Something isn't working needs-review PR is ready for code review and waiting on a reviewer labels May 27, 2026
wplf added a commit to wplf/Megatron-LM that referenced this pull request May 28, 2026
Document the HF -> Megatron-FSDP DTensor conversion path needed before
pretraining from pretrained weights: setup (clone Bridge, pin its
3rdparty/Megatron-LM submodule to this branch), the `torchrun
convert_checkpoints_fsdp.py import` command with EP=8 default topology,
expected output layout, and the open Bridge dependency
(NVIDIA-NeMo/Megatron-Bridge#3987) to skip the post-save tokenizer
build that otherwise crashes on this branch.
@conver334

Copy link
Copy Markdown
Contributor

LGTM, thanks!

`examples/conversion/convert_checkpoints.py` does not bundle the
tokenizer with the converted Megatron checkpoint; the mfsdp variant
should match. Drop the `hf_tokenizer_path` / `hf_tokenizer_kwargs`
arguments from the `save_native_megatron_model` call (and the
unused upstream kwargs collection block). The HuggingFace source ID
is still recorded in `run_config.yaml`, so downstream consumers can
fetch the tokenizer separately via `huggingface_hub` when needed.

This also sidesteps an `AttributeError` raised at the
`build_tokenizer -> _set_padded_vocab_size -> vocab_size_with_padding`
chain when Bridge's dataclass-based `TokenizerConfig` is passed to
Megatron-LM's `build_tokenizer` (which expects argparse-style attrs
`make_vocab_size_divisible_by` / `tensor_model_parallel_size` / `rank`).

Signed-off-by: Jinliang Li <jinliangl@nvidia.com>
@wplf wplf force-pushed the fix/mfsdp-skip-tokenizer-save branch from b06c38b to a558274 Compare May 29, 2026 09:43
@wplf

wplf commented May 29, 2026

Copy link
Copy Markdown
Contributor Author

/ok to test a558274

@wplf wplf enabled auto-merge (squash) May 29, 2026 09:48
@wplf wplf merged commit 4c3c081 into NVIDIA-NeMo:main May 29, 2026
96 of 98 checks passed
@wplf wplf deleted the fix/mfsdp-skip-tokenizer-save branch June 4, 2026 12:13
vasunvidia pushed a commit to vasunvidia/Megatron-Bridge that referenced this pull request Jun 10, 2026
…A-NeMo#3987)

Signed-off-by: Jinliang Li <jinliangl@nvidia.com>
Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:ckpt Checkpoint conversion, loading, export, and save paths bug Something isn't working needs-review PR is ready for code review and waiting on a reviewer

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants