upstream MCore tokenizers config#3451
Conversation
Signed-off-by: dimapihtar <dpykhtar@nvidia.com>
Signed-off-by: dimapihtar <dpykhtar@nvidia.com>
Signed-off-by: dimapihtar <dpykhtar@nvidia.com>
Signed-off-by: dimapihtar <dpykhtar@nvidia.com>
Signed-off-by: dimapihtar <dpykhtar@nvidia.com>
Signed-off-by: dimapihtar <dpykhtar@nvidia.com>
|
test |
Signed-off-by: dimapihtar <dpykhtar@nvidia.com>
Review: upstream MCore tokenizers configCritical: Deleted module still imported in 16+ files The PR deletes src/megatron/bridge/training/tokenizers/config.py but does NOT update any of the files that import TokenizerConfig from it. This will cause ImportError across the codebase at import time. Affected files include: src/megatron/bridge/training/config.py:58, src/megatron/bridge/training/checkpointing.py:69, src/megatron/bridge/training/model_load_save.py:503, tests/unit_tests/training/test_tokenizer.py:20, tests/unit_tests/training/mlm_compat/test_arguments.py:29, 6 files under tests/functional_tests, and 3 files under examples/models/megatron_mimo. config.py line 58 imports from the deleted module and defines a new TokenizerConfig at line 998. The import crashes before the new class is reached. Bug: Missing import in tokenizer.py - build_tokenizer references TokenizerConfig in its signature (line 11) but the import was removed. NameError at runtime. Bug: Wrong dict key for sp_tokenizer_kwargs - Line 1051 uses .get("tokenizer_sentencepiece_legacy", False) but users pass {"legacy": True}. The legacy value is never read. Bug: None safety on kwargs dicts - hf_tokenizer_kwargs and sp_tokenizer_kwargs allow None but post_init calls .get() unconditionally. Missing test coverage - No tests for new TokenizerConfig.post_init sync logic. Suggested test cases - No perf tests impacted. Needed: test_tokenizer_config_post_init_hf_defaults, test_tokenizer_config_post_init_hf_custom, test_tokenizer_config_post_init_sp_legacy, test_tokenizer_config_post_init_none_kwargs, test_tokenizer_config_deprecation_warnings |
Signed-off-by: dimapihtar <dpykhtar@nvidia.com>
Signed-off-by: dimapihtar <dpykhtar@nvidia.com>
|
/ok to test 89bceac |
Signed-off-by: dimapihtar <dpykhtar@nvidia.com>
|
/ok to test fd43b69 |
Signed-off-by: dimapihtar <dpykhtar@nvidia.com>
|
/ok to test 00f5a06 |
Signed-off-by: dimapihtar <dpykhtar@nvidia.com>
|
/ok to test 94a6672 |
Signed-off-by: dimapihtar <dpykhtar@nvidia.com>
Signed-off-by: dimapihtar <dpykhtar@nvidia.com>
|
/ok to test 95e7153 |
|
/ok to test fda1b2c |
maanug-nv
left a comment
There was a problem hiding this comment.
resolved concerns offline
Signed-off-by: dimapihtar <dpykhtar@nvidia.com> Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
What does this PR do ?
Add a one line overview of what this PR aims to accomplish.
Changelog
GitHub Actions CI
See the CI sectionin the Contributing doc for how to trigger the CI. A Nvidia developer will need to approve and trigger the CI for external contributors.
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items you can still open "Draft" PR.
Additional Information