convert tokenizer args to config by dimapihtar · Pull Request #4406 · NVIDIA/Megatron-LM

dimapihtar · 2026-04-21T13:26:12Z

What does this PR do?

Converts the tokenizer command-line arguments into a dataclass-based config.
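To illustrate the general idea, here is a minimal sketch of driving argparse flags from a dataclass. It is not the PR's implementation: `TokenizerConfigSketch`, `add_dataclass_args`, and `config_from_args` are hypothetical names, and only the two field names shown in the review diff (`tokenizer_metadata`, `tokenizer_special_tokens`) are taken from this thread.

```python
import argparse
from dataclasses import dataclass, fields
from typing import Optional, get_args, get_origin


@dataclass
class TokenizerConfigSketch:
    """Hypothetical stand-in for a tokenizer config; not the class added by the PR."""

    tokenizer_metadata: Optional[str] = None
    # None (rather than []) lets downstream code tell "flag omitted" from "empty list".
    tokenizer_special_tokens: Optional[list[str]] = None


def _is_list_field(tp) -> bool:
    """True for list[...] and Optional[list[...]] annotations."""
    return any(get_origin(c) is list for c in (tp,) + get_args(tp))


def add_dataclass_args(parser: argparse.ArgumentParser, config_cls) -> argparse.ArgumentParser:
    """Register one --flag per dataclass field, reusing the field's default."""
    for f in fields(config_cls):
        flag = "--" + f.name.replace("_", "-")
        if _is_list_field(f.type):
            parser.add_argument(flag, nargs="*", default=f.default)
        else:
            parser.add_argument(flag, type=str, default=f.default)
    return parser


def config_from_args(args: argparse.Namespace, config_cls):
    """Build the config object from the parsed namespace."""
    return config_cls(**{f.name: getattr(args, f.name) for f in fields(config_cls)})
```

With this shape the dataclass is the single source of truth for names and defaults, so a change to a field default also changes what an omitted flag produces.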

⚠️ For major changes (either in lines of code or in its impact), please make sure to first share a design doc with the team. If you're unsure what's the best way to do so, contact the @mcore-oncall.

Contribution process

Pre-checks

I have added relevant unit tests
I have added relevant functional tests
I have added proper typing to my code (see Typing guidelines)
I have added relevant documentation
I have run the autoformatter.sh on my PR

Code review

Feel free to message or comment the @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.

Step 1: Mark PR as "Ready for Review"

When your PR is ready, click Ready for Review.
An oncall reviewer is auto-assigned and expert reviewers are notified based on your changes.
- Some PRs may jump straight to step 2. This is determined by .github/CODEOWNERS.

⚠️ Only mark as ready once merge-conflicts are resolved and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

Step 2: Final Review

For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned.

For PRs outside megatron/core, this step is skipped.

Step 3: Approved

Once all required reviewers have approved, the Approved label is applied automatically.

Merge

Any member of mcore-engineers will be able to merge your PR.

For MRs into `dev` branch

The proposed review process for `dev` branch is under active discussion.

MRs are mergeable after one approval from either eharper@nvidia.com or zijiey@nvidia.com.

Signed-off-by: dimapihtar <dpykhtar@nvidia.com>

copy-pr-bot · 2026-04-21T13:26:16Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

dimapihtar · 2026-04-21T13:29:00Z

/ok to test 9daf469

maanug-nv · 2026-04-23T17:09:15Z

/claude review

claude · 2026-04-23T17:11:36Z

+    tokenizer_metadata: str = None
+    """Path to the tokenizer metadata file in json format."""
+
+    tokenizer_special_tokens: list[str] = field(default_factory=list)


Bug: The default changed from None (the old argparse default=None) to [] (an empty list via default_factory=list). This breaks the TikTokenTokenizer code path.

In megatron/core/tokenizers/text/libraries/tiktoken_tokenizer.py:101-102:

if special_tokens is None:
    special_tokens = SPECIAL_TOKENS.copy()

With the old None default, omitting --tokenizer-special-tokens passed None to TikTokenTokenizer, which triggered the fallback to SPECIAL_TOKENS. With [], the `is None` check evaluates to False, so special_tokens stays [] and the assertion at lines 115-117 fails:

assert set(SPECIAL_TOKENS) <= set(special_tokens)

The default should remain None to preserve existing behavior:

Suggested change

-    tokenizer_special_tokens: list[str] = field(default_factory=list)
+    tokenizer_special_tokens: Optional[list[str]] = None
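The failure mode described in this comment can be reproduced in isolation. The sketch below copies the two quoted statements (the `is None` fallback and the subset assertion) into a hypothetical helper, `resolve_special_tokens`; the token strings in `SPECIAL_TOKENS` are placeholders, not the values defined in tiktoken_tokenizer.py.

```python
from typing import Optional

# Placeholder defaults for illustration only; the real list lives in tiktoken_tokenizer.py.
SPECIAL_TOKENS = ["<unk>", "<s>", "</s>"]


def resolve_special_tokens(special_tokens: Optional[list[str]]) -> list[str]:
    """Hypothetical helper mirroring the fallback and assertion quoted in the review."""
    # Only None triggers the fallback; an empty list is treated as a caller-supplied value.
    if special_tokens is None:
        special_tokens = SPECIAL_TOKENS.copy()
    # Every default special token must be present in the final list.
    assert set(SPECIAL_TOKENS) <= set(special_tokens)
    return special_tokens


# Old behaviour: flag omitted -> None -> defaults are used.
resolve_special_tokens(None)

# Behaviour with default_factory=list: flag omitted -> [] -> assertion fails.
try:
    resolve_special_tokens([])
except AssertionError:
    print("empty list skipped the fallback and failed the subset check")
```

Changing the check to `if not special_tokens:` would also accept an empty list, but that alters the tokenizer rather than the config, so restoring the None default is the smaller change.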

claude

One bug found: the tokenizer_special_tokens default changed from None to [], which breaks the TikTokenTokenizer fallback to the default special tokens. See the inline comment for details.

Signed-off-by: dimapihtar <dpykhtar@nvidia.com>

dimapihtar · 2026-04-24T17:45:38Z

/ok to test 7c406cf

dimapihtar · 2026-04-24T17:46:41Z

/ok to test 50c45f0

Signed-off-by: dimapihtar <dpykhtar@nvidia.com>

dimapihtar · 2026-04-24T17:49:53Z

/ok to test 415d5e5

maanug-nv

left 1 nit. lgtm overall, approving to unblock

Signed-off-by: dimapihtar <dpykhtar@nvidia.com>

dimapihtar · 2026-04-29T23:57:22Z

/ok to test c6847e7

dimapihtar · 2026-04-30T00:30:53Z

/ok to test c1c4923

Signed-off-by: dimapihtar <dpykhtar@nvidia.com>

dimapihtar · 2026-05-05T13:29:19Z

/ok to test 44b28cb

svcnvidia-nemo-ci · 2026-05-05T20:49:50Z

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25401532431

Upstream commit 'convert tokenizer args to config' (NVIDIA#4406) introduced megatron/training/config/container.py and instantiate_utils.py which both import omegaconf. argument_utils -> config -> container -> omegaconf is now in the import chain reached by megatron/training/arguments.py, so every entrypoint needs it. Marker bumped v3 -> v4 to force reinstall on first post-sync run.
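Because the new dependency sits on the import chain of every entrypoint, a missing install shows up as an ImportError at startup. As a minimal sketch (the helper name `missing_modules` is hypothetical and not part of either repository), an environment check could test for the module before launching anything:

```python
import importlib.util


def missing_modules(required: list[str]) -> list[str]:
    """Return the top-level module names from `required` that cannot be found."""
    return [name for name in required if importlib.util.find_spec(name) is None]


# omegaconf is the dependency named in the commit message above.
missing = missing_modules(["omegaconf"])
if missing:
    print("missing modules:", ", ".join(missing))
```

`importlib.util.find_spec` only locates the module without importing it, so the check has no import side effects for top-level names.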

convert tokenizer args to config

9daf469

Signed-off-by: dimapihtar <dpykhtar@nvidia.com>

dimapihtar marked this pull request as ready for review April 21, 2026 13:28

dimapihtar requested review from a team as code owners April 21, 2026 13:28

dimapihtar added the complexity: low and Expert Review [deprecated] labels Apr 21, 2026

svcnvidia-nemo-ci requested a review from a team April 21, 2026 13:28

svcnvidia-nemo-ci added this to the Core 0.16 milestone Apr 21, 2026

copy-pr-bot Bot temporarily deployed to test April 21, 2026 13:30 Inactive

svcnvidia-nemo-ci added complexity: medium and removed complexity: low labels Apr 21, 2026

dimapihtar requested a review from maanug-nv April 22, 2026 17:27

claude Bot reviewed Apr 23, 2026

dimapihtar and others added 3 commits April 24, 2026 19:38

Merge branch 'main' into tokenizers_config

49a2508

add new arguments

f3aa626

Signed-off-by: dimapihtar <dpykhtar@nvidia.com>

fix varaibles

0c89084

Signed-off-by: dimapihtar <dpykhtar@nvidia.com>

dimapihtar requested review from a team as code owners April 24, 2026 17:40

fix type

7c406cf

Signed-off-by: dimapihtar <dpykhtar@nvidia.com>

Merge branch 'main' into tokenizers_config

50c45f0

dimapihtar added the Run functional tests label Apr 24, 2026

fix code style

415d5e5

Signed-off-by: dimapihtar <dpykhtar@nvidia.com>

maanug-nv approved these changes Apr 29, 2026

dimapihtar added the Final Review label Apr 29, 2026

remove comment

c6847e7

Signed-off-by: dimapihtar <dpykhtar@nvidia.com>

svcnvidia-nemo-ci removed the Final Review label Apr 29, 2026

copy-pr-bot Bot temporarily deployed to test April 29, 2026 23:58 Inactive

Merge branch 'main' into tokenizers_config

c1c4923

copy-pr-bot Bot temporarily deployed to test April 30, 2026 00:32 Inactive

dimapihtar requested review from ananthsub and ericharper May 4, 2026 20:15

remove deprecated tokenizers flags

44b28cb

Signed-off-by: dimapihtar <dpykhtar@nvidia.com>

dimapihtar requested a review from a team as a code owner May 4, 2026 21:19

ericharper approved these changes May 4, 2026

dimapihtar requested a review from tdene May 5, 2026 13:28

copy-pr-bot Bot temporarily deployed to test May 5, 2026 13:30 Inactive

tdene approved these changes May 5, 2026

dimapihtar added the Final Review label May 5, 2026

svcnvidia-nemo-ci removed the Final Review label May 5, 2026

dimapihtar removed the request for review from ananthsub May 5, 2026 14:41

jaredcasper approved these changes May 5, 2026

svcnvidia-nemo-ci added the Approved label May 5, 2026

dimapihtar added this pull request to the merge queue May 5, 2026

Merged via the queue into NVIDIA:main with commit 40d024b May 5, 2026
352 of 358 checks passed

dimapihtar deleted the tokenizers_config branch May 5, 2026 22:01

cuichenx mentioned this pull request May 26, 2026

[NeMo FW 26.06 Release] MBridge v0.5.0 Roadmap NVIDIA-NeMo/Megatron-Bridge#3754

Open
