convert tokenizer args to config#4406
Conversation
Signed-off-by: dimapihtar <dpykhtar@nvidia.com>
|
/ok to test 9daf469 |
|
/claude review |
| tokenizer_metadata: str = None | ||
| """Path to the tokenizer metadata file in json format.""" | ||
|
|
||
| tokenizer_special_tokens: list[str] = field(default_factory=list) |
There was a problem hiding this comment.
Bug: The default changed from None (old argparse default=None) to [] (empty list via default_factory=list). This breaks the TikTokenizer code path.
In megatron/core/tokenizers/text/libraries/tiktoken_tokenizer.py:101-102:
if special_tokens is None:
special_tokens = SPECIAL_TOKENS.copy()With the old None default, omitting --tokenizer-special-tokens would pass None to TikTokenTokenizer, triggering the fallback to SPECIAL_TOKENS. With [], the is None check is False, so special_tokens stays as [], and the assertion at line 115-117 fails:
assert set(SPECIAL_TOKENS) <= set(special_tokens)The default should remain None to preserve existing behavior:
| tokenizer_special_tokens: list[str] = field(default_factory=list) | |
| tokenizer_special_tokens: Optional[list[str]] = None |
Signed-off-by: dimapihtar <dpykhtar@nvidia.com>
Signed-off-by: dimapihtar <dpykhtar@nvidia.com>
|
/ok to test 7c406cf |
|
/ok to test 50c45f0 |
Signed-off-by: dimapihtar <dpykhtar@nvidia.com>
|
/ok to test 415d5e5 |
maanug-nv
left a comment
There was a problem hiding this comment.
left 1 nit. lgtm overall, approving to unblock
Signed-off-by: dimapihtar <dpykhtar@nvidia.com>
|
/ok to test c6847e7 |
|
/ok to test c1c4923 |
Signed-off-by: dimapihtar <dpykhtar@nvidia.com>
|
/ok to test 44b28cb |
|
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25401532431 |
Upstream commit 'convert tokenizer args to config' (NVIDIA#4406) introduced megatron/training/config/container.py and instantiate_utils.py which both import omegaconf. argument_utils -> config -> container -> omegaconf is now in the import chain reached by megatron/training/arguments.py, so every entrypoint needs it. Marker bumped v3 -> v4 to force reinstall on first post-sync run.
What does this PR do ?
Converts tokenizer arguments into the config format (dataclass).
Contribution process
Pre-checks
Code review
Feel free to message or comment the @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!
All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.
Step 1: Mark PR as "Ready for Review"
.github/CODEOWNERS.Final Review might get declined if these requirements are not fulfilled.
Step 2: Final Review
For PRs that change
megatron/core, once all expert reviewers have approved, theFinal Reviewlabel is applied automatically and final reviewers are assigned.For PRs outside
megatron/core, this step is skipped.Step 3: Approved
Once all required reviewers have approved, the
Approvedlabel is applied automatically.Merge
Any member of mcore-engineers will be able to merge your PR.
For MRs into `dev` branch
The proposed review process for `dev` branch is under active discussion.MRs are mergable after one approval by either
eharper@nvidia.comorzijiey@nvidia.com.