Skip to content

convert tokenizer args to config#4406

Merged
dimapihtar merged 13 commits into
NVIDIA:mainfrom
dimapihtar:tokenizers_config
May 5, 2026
Merged

convert tokenizer args to config#4406
dimapihtar merged 13 commits into
NVIDIA:mainfrom
dimapihtar:tokenizers_config

Conversation

@dimapihtar

@dimapihtar dimapihtar commented Apr 21, 2026

Copy link
Copy Markdown
Contributor

What does this PR do ?

Converts tokenizer arguments into the config format (dataclass).

⚠️ For major changes (either in lines of code or in its impact), please make sure to first share a design doc with the team. If you're unsure what's the best way to do so, contact the @mcore-oncall.

Contribution process

Pre-checks

  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code Typing guidelines
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR

Code review

Feel free to message or comment the @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.

Step 1: Mark PR as "Ready for Review"

  1. When your PR is ready, click Ready for Review.
  2. An oncall reviewer is auto-assigned and expert reviewers are notified based on your changes.
    • Some PRs may jump straight to step 2. This is determined by .github/CODEOWNERS.

⚠️ Only mark as ready once merge-conflicts are resolved and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

Step 2: Final Review

For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned.

For PRs outside megatron/core, this step is skipped.

Step 3: Approved

Once all required reviewers have approved, the Approved label is applied automatically.

Merge

Any member of mcore-engineers will be able to merge your PR.

For MRs into `dev` branch The proposed review process for `dev` branch is under active discussion.

MRs are mergable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

Signed-off-by: dimapihtar <dpykhtar@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented Apr 21, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@dimapihtar dimapihtar marked this pull request as ready for review April 21, 2026 13:28
@dimapihtar dimapihtar requested review from a team as code owners April 21, 2026 13:28
@dimapihtar dimapihtar added complexity: low Expert Review [deprecated] Apply this label to indicate that your PR is ready for expert review. labels Apr 21, 2026
@svcnvidia-nemo-ci svcnvidia-nemo-ci requested a review from a team April 21, 2026 13:28
@dimapihtar

Copy link
Copy Markdown
Contributor Author

/ok to test 9daf469

@maanug-nv

Copy link
Copy Markdown
Contributor

/claude review

tokenizer_metadata: str = None
"""Path to the tokenizer metadata file in json format."""

tokenizer_special_tokens: list[str] = field(default_factory=list)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: The default changed from None (old argparse default=None) to [] (empty list via default_factory=list). This breaks the TikTokenizer code path.

In megatron/core/tokenizers/text/libraries/tiktoken_tokenizer.py:101-102:

if special_tokens is None:
    special_tokens = SPECIAL_TOKENS.copy()

With the old None default, omitting --tokenizer-special-tokens would pass None to TikTokenTokenizer, triggering the fallback to SPECIAL_TOKENS. With [], the is None check is False, so special_tokens stays as [], and the assertion at line 115-117 fails:

assert set(SPECIAL_TOKENS) <= set(special_tokens)

The default should remain None to preserve existing behavior:

Suggested change
tokenizer_special_tokens: list[str] = field(default_factory=list)
tokenizer_special_tokens: Optional[list[str]] = None

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One bug found: the tokenizer_special_tokens default changed from None to [], which silently breaks the TikTokenizer fallback to default special tokens. See inline comment for details.

dimapihtar and others added 3 commits April 24, 2026 19:38
Signed-off-by: dimapihtar <dpykhtar@nvidia.com>
Signed-off-by: dimapihtar <dpykhtar@nvidia.com>
@dimapihtar dimapihtar requested review from a team as code owners April 24, 2026 17:40
Signed-off-by: dimapihtar <dpykhtar@nvidia.com>
@dimapihtar

Copy link
Copy Markdown
Contributor Author

/ok to test 7c406cf

@dimapihtar

Copy link
Copy Markdown
Contributor Author

/ok to test 50c45f0

Signed-off-by: dimapihtar <dpykhtar@nvidia.com>
@dimapihtar

Copy link
Copy Markdown
Contributor Author

/ok to test 415d5e5

@maanug-nv maanug-nv left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left 1 nit. lgtm overall, approving to unblock

@dimapihtar dimapihtar added the Final Review PR is in the "final review" stage label Apr 29, 2026
Signed-off-by: dimapihtar <dpykhtar@nvidia.com>
@svcnvidia-nemo-ci svcnvidia-nemo-ci removed the Final Review PR is in the "final review" stage label Apr 29, 2026
@dimapihtar

Copy link
Copy Markdown
Contributor Author

/ok to test c6847e7

@dimapihtar

Copy link
Copy Markdown
Contributor Author

/ok to test c1c4923

Signed-off-by: dimapihtar <dpykhtar@nvidia.com>
@dimapihtar dimapihtar requested a review from a team as a code owner May 4, 2026 21:19
@dimapihtar dimapihtar requested a review from tdene May 5, 2026 13:28
@dimapihtar

Copy link
Copy Markdown
Contributor Author

/ok to test 44b28cb

@dimapihtar dimapihtar added the Final Review PR is in the "final review" stage label May 5, 2026
@svcnvidia-nemo-ci svcnvidia-nemo-ci removed the Final Review PR is in the "final review" stage label May 5, 2026
@dimapihtar dimapihtar removed the request for review from ananthsub May 5, 2026 14:41
@svcnvidia-nemo-ci svcnvidia-nemo-ci added the Approved All necessary approvals have been made label May 5, 2026
@dimapihtar dimapihtar added this pull request to the merge queue May 5, 2026
@svcnvidia-nemo-ci

Copy link
Copy Markdown

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/25401532431

Merged via the queue into NVIDIA:main with commit 40d024b May 5, 2026
352 of 358 checks passed
@dimapihtar dimapihtar deleted the tokenizers_config branch May 5, 2026 22:01
ischlag added a commit to ischlag/megatron-lm-research-baseline that referenced this pull request May 8, 2026
Upstream commit 'convert tokenizer args to config' (NVIDIA#4406) introduced
megatron/training/config/container.py and instantiate_utils.py which
both import omegaconf. argument_utils -> config -> container -> omegaconf
is now in the import chain reached by megatron/training/arguments.py,
so every entrypoint needs it.

Marker bumped v3 -> v4 to force reinstall on first post-sync run.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Approved All necessary approvals have been made complexity: medium Expert Review [deprecated] Apply this label to indicate that your PR is ready for expert review. Run functional tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants