Skip to content

upstream MCore tokenizers config#3451

Merged
dimapihtar merged 29 commits into
mainfrom
upstream_tokenizer_config
May 21, 2026
Merged

upstream MCore tokenizers config#3451
dimapihtar merged 29 commits into
mainfrom
upstream_tokenizer_config

Conversation

@dimapihtar

Copy link
Copy Markdown
Contributor

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Changelog

  • Add specific line by line info of high level changes in this PR.

GitHub Actions CI

See the CI sectionin the Contributing doc for how to trigger the CI. A Nvidia developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items you can still open "Draft" PR.

Additional Information

  • Related to # (issue)

Signed-off-by: dimapihtar <dpykhtar@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented Apr 21, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: dimapihtar <dpykhtar@nvidia.com>
@yaoyu-33 yaoyu-33 added feature New capabilities, enhancements, or enablement work area:training Training loop, callbacks, and runtime integration labels Apr 23, 2026
Signed-off-by: dimapihtar <dpykhtar@nvidia.com>
Signed-off-by: dimapihtar <dpykhtar@nvidia.com>
Signed-off-by: dimapihtar <dpykhtar@nvidia.com>
Signed-off-by: dimapihtar <dpykhtar@nvidia.com>
@dimapihtar dimapihtar marked this pull request as ready for review May 6, 2026 12:40
Comment thread src/megatron/bridge/training/tokenizers/tokenizer.py
Comment thread src/megatron/bridge/training/config.py Outdated
Comment thread src/megatron/bridge/training/config.py Outdated
@claude

claude Bot commented May 6, 2026

Copy link
Copy Markdown
Contributor

test

Signed-off-by: dimapihtar <dpykhtar@nvidia.com>
@claude

claude Bot commented May 6, 2026

Copy link
Copy Markdown
Contributor

Review: upstream MCore tokenizers config

Critical: Deleted module still imported in 16+ files

The PR deletes src/megatron/bridge/training/tokenizers/config.py but does NOT update any of the files that import TokenizerConfig from it. This will cause ImportError across the codebase at import time.

Affected files include: src/megatron/bridge/training/config.py:58, src/megatron/bridge/training/checkpointing.py:69, src/megatron/bridge/training/model_load_save.py:503, tests/unit_tests/training/test_tokenizer.py:20, tests/unit_tests/training/mlm_compat/test_arguments.py:29, 6 files under tests/functional_tests, and 3 files under examples/models/megatron_mimo.

config.py line 58 imports from the deleted module and defines a new TokenizerConfig at line 998. The import crashes before the new class is reached.

Bug: Missing import in tokenizer.py - build_tokenizer references TokenizerConfig in its signature (line 11) but the import was removed. NameError at runtime.

Bug: Wrong dict key for sp_tokenizer_kwargs - Line 1051 uses .get("tokenizer_sentencepiece_legacy", False) but users pass {"legacy": True}. The legacy value is never read.

Bug: None safety on kwargs dicts - hf_tokenizer_kwargs and sp_tokenizer_kwargs allow None but post_init calls .get() unconditionally.

Missing test coverage - No tests for new TokenizerConfig.post_init sync logic.

Suggested test cases - No perf tests impacted. Needed: test_tokenizer_config_post_init_hf_defaults, test_tokenizer_config_post_init_hf_custom, test_tokenizer_config_post_init_sp_legacy, test_tokenizer_config_post_init_none_kwargs, test_tokenizer_config_deprecation_warnings

dimapihtar added 3 commits May 6, 2026 05:52
Signed-off-by: dimapihtar <dpykhtar@nvidia.com>
Signed-off-by: dimapihtar <dpykhtar@nvidia.com>
Signed-off-by: dimapihtar <dpykhtar@nvidia.com>
@dimapihtar

Copy link
Copy Markdown
Contributor Author

/ok to test 89bceac

Signed-off-by: dimapihtar <dpykhtar@nvidia.com>
@dimapihtar

Copy link
Copy Markdown
Contributor Author

/ok to test fd43b69

Signed-off-by: dimapihtar <dpykhtar@nvidia.com>
@dimapihtar

Copy link
Copy Markdown
Contributor Author

/ok to test 00f5a06

Signed-off-by: dimapihtar <dpykhtar@nvidia.com>
@dimapihtar

Copy link
Copy Markdown
Contributor Author

/ok to test 94a6672

Signed-off-by: dimapihtar <dpykhtar@nvidia.com>
Signed-off-by: dimapihtar <dpykhtar@nvidia.com>
Signed-off-by: dimapihtar <dpykhtar@nvidia.com>
@dimapihtar

Copy link
Copy Markdown
Contributor Author

/ok to test 95e7153

Comment thread src/megatron/bridge/training/tokenizers/config.py
Comment thread src/megatron/bridge/training/tokenizers/tokenizer.py Outdated
Comment thread src/megatron/bridge/training/tokenizers/tokenizer.py Outdated
dimapihtar and others added 2 commits May 21, 2026 05:42
Signed-off-by: dimapihtar <dpykhtar@nvidia.com>
@dimapihtar

Copy link
Copy Markdown
Contributor Author

/ok to test fda1b2c

@maanug-nv maanug-nv left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resolved concerns offline

@dimapihtar dimapihtar merged commit 580c3ea into main May 21, 2026
103 of 104 checks passed
@dimapihtar dimapihtar deleted the upstream_tokenizer_config branch May 21, 2026 15:35
vasunvidia pushed a commit to vasunvidia/Megatron-Bridge that referenced this pull request Jun 10, 2026
Signed-off-by: dimapihtar <dpykhtar@nvidia.com>
Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:training Training loop, callbacks, and runtime integration feature New capabilities, enhancements, or enablement work waiting-on-customer Waiting on the original author to respond

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants