fix(mfsdp): skip tokenizer save in convert_checkpoints_fsdp.py #3987
Merged
Contributor
Author
/ok to test ee3daa8
wplf
added a commit
to wplf/Megatron-LM
that referenced
this pull request
May 27, 2026
Document the HF -> Megatron-FSDP DTensor conversion path needed before pretraining from pretrained weights: setup (clone Bridge, pin its 3rdparty/Megatron-LM submodule to this branch), the `torchrun convert_checkpoints_fsdp.py import` command with EP=8 default topology, expected output layout, and the open Bridge dependency (NVIDIA-NeMo/Megatron-Bridge#3987) to skip the post-save tokenizer build that otherwise crashes on this branch.
wplf
added a commit
to wplf/Megatron-LM
that referenced
this pull request
May 28, 2026
Document the HF -> Megatron-FSDP DTensor conversion path needed before pretraining from pretrained weights: setup (clone Bridge, pin its 3rdparty/Megatron-LM submodule to this branch), the `torchrun convert_checkpoints_fsdp.py import` command with EP=8 default topology, expected output layout, and the open Bridge dependency (NVIDIA-NeMo/Megatron-Bridge#3987) to skip the post-save tokenizer build that otherwise crashes on this branch.
Contributor
LGTM, thanks!
conver334
approved these changes
May 29, 2026
`examples/conversion/convert_checkpoints.py` does not bundle the tokenizer with the converted Megatron checkpoint; the mfsdp variant should match. Drop the `hf_tokenizer_path` / `hf_tokenizer_kwargs` arguments from the `save_native_megatron_model` call (and the unused upstream kwargs collection block). The HuggingFace source ID is still recorded in `run_config.yaml`, so downstream consumers can fetch the tokenizer separately via `huggingface_hub` when needed. This also sidesteps an `AttributeError` raised at the `build_tokenizer -> _set_padded_vocab_size -> vocab_size_with_padding` chain when Bridge's dataclass-based `TokenizerConfig` is passed to Megatron-LM's `build_tokenizer` (which expects argparse-style attrs `make_vocab_size_divisible_by` / `tensor_model_parallel_size` / `rank`). Signed-off-by: Jinliang Li <jinliangl@nvidia.com>
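The failure mode described in the commit message above can be reproduced in isolation. The sketch below is not the Megatron-LM or Bridge code: `vocab_size_with_padding` is a simplified stand-in that only assumes the helper reads argparse-style attributes, and `TokenizerConfigSketch` with its fields is a hypothetical dataclass invented for illustration. It shows why a dataclass config lacking `make_vocab_size_divisible_by` / `tensor_model_parallel_size` would raise `AttributeError` when passed where an argparse-style namespace is expected.

```python
import math
from argparse import Namespace
from dataclasses import dataclass


def vocab_size_with_padding(orig_vocab_size, args):
    """Simplified stand-in (assumption): pads the vocab size to a multiple
    derived from argparse-style attributes on `args`."""
    multiple = args.make_vocab_size_divisible_by * args.tensor_model_parallel_size
    return int(math.ceil(orig_vocab_size / multiple)) * multiple


@dataclass
class TokenizerConfigSketch:
    """Hypothetical dataclass-based config; field names are illustrative only.
    It deliberately lacks the argparse-style attributes read above."""
    tokenizer_type: str = "HuggingFaceTokenizer"
    tokenizer_model: str = "some/hf-model-id"


# An argparse-style namespace carries the attributes the helper expects.
argparse_style = Namespace(
    make_vocab_size_divisible_by=128,
    tensor_model_parallel_size=2,
    rank=0,
)
padded = vocab_size_with_padding(50257, argparse_style)

# The dataclass config does not, so the attribute lookup fails.
try:
    vocab_size_with_padding(50257, TokenizerConfigSketch())
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(padded)
print(error_message)
```

Not building the tokenizer at save time, as this change does, avoids the mismatched call entirely rather than adapting one config style to the other.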
b06c38b to
a558274
Contributor
Author
/ok to test a558274
vasunvidia
pushed a commit
to vasunvidia/Megatron-Bridge
that referenced
this pull request
Jun 10, 2026
…A-NeMo#3987) Signed-off-by: Jinliang Li <jinliangl@nvidia.com> Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
`examples/conversion/convert_checkpoints.py` does not bundle the tokenizer with the converted Megatron checkpoint; the mfsdp variant should match. Drop the `hf_tokenizer_path` / `hf_tokenizer_kwargs` arguments from the `save_native_megatron_model` call (and the unused upstream kwargs collection block). The HuggingFace source ID is still recorded in `run_config.yaml`, so downstream consumers can fetch the tokenizer separately via `huggingface_hub` when needed.
This also sidesteps an `AttributeError` raised at the `build_tokenizer -> _set_padded_vocab_size -> vocab_size_with_padding` chain when Bridge's dataclass-based `TokenizerConfig` is passed to Megatron-LM's `build_tokenizer` (which expects argparse-style attrs `make_vocab_size_divisible_by` / `tensor_model_parallel_size` / `rank`).
What does this PR do ?
Add a one line overview of what this PR aims to accomplish.
Changelog
GitHub Actions CI
See the CI section in the Contributing doc for how to trigger the CI. A Nvidia developer will need to approve and trigger the CI for external contributors.
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items you can still open "Draft" PR.
Additional Information