fix: check find() return value before adding offset in try_fix_tokenizer#4923
Merged
danielhanchen merged 3 commits intoApr 9, 2026
Merged
Conversation
Contributor
There was a problem hiding this comment.
Code Review
This pull request fixes a logic error in the try_fix_tokenizer function within unsloth/tokenizer_utils.py. The previous implementation incorrectly checked for the existence of a substring after adding the length of the search string to the find result, which would fail to catch cases where the substring was not found. The update correctly validates the find result before calculating the start position. I have no feedback to provide as there were no review comments.
2 tasks
The `str.find()` result was checked for -1 only after adding `len(find_text)`, turning the guard into dead code. When the substring is absent, `start` becomes `len(find_text) - 1` (a positive number), so the `if start == -1: continue` never triggers and the subsequent slice extracts garbage from the tokenizer string. Split the find and offset into two steps so the -1 check works correctly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Skip loop iteration early when token_id is None to avoid constructing
a find_text that can never match valid JSON
- Guard end = tokenizer_string.find('",', start) against -1 to prevent
silent garbage extraction from malformed tokenizer strings
590cc21 to
736cb59
Compare
for more information, see https://pre-commit.ci
Member
|
@codex review |
Member
|
Thanks so much! |
3 tasks
pull Bot
pushed a commit
to ShinnChow/unsloth
that referenced
this pull request
May 25, 2026
…eholders (unslothai#5763) * fix(chat_templates): check find() return value before slicing on placeholders Two places in `construct_chat_template()` use `str.find()` for sentinel placeholders (`{INPUT}` / `{OUTPUT}`) without checking the -1 return: 1. The `except:` fallback (around line 2464) computes `chat_template[chat_template.find("{OUTPUT}") + len("{OUTPUT}"):]`. If the template has no `{OUTPUT}` marker, `find()` returns -1 and the slice starts at offset 7 (`-1 + len("{OUTPUT}")`), producing garbage that's then `re.escape`-d and fed back into the template-recovery regex. The user sees a confusing `IndexError` on `response_part = response_part[0]` instead of the real problem. 2. The final trim before returning (`input_part[:input_part.find("{INPUT}")]` and the matching `{OUTPUT}` line) silently drops the last character when the placeholder is missing — `find()` returns -1, and `[:-1]` slices everything except the last character, returning a corrupted template prefix to the caller. Replace both with an explicit `-1` check that raises a clear `RuntimeError` naming the missing placeholder, matching the existing guard pattern from unslothai#4923 (`try_fix_tokenizer`). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(chat_templates): also guard {INPUT} and fallback regex/separator paths Builds on the {OUTPUT} / final-trim guards in this branch by closing the three remaining ways the except-block fallback in construct_chat_template() can still raise a confusing IndexError or AttributeError on malformed templates: 1. Validate both {INPUT} and {OUTPUT} before deriving `ending`. The regex two lines later (`{INPUT} + ending + ...`) still produced an empty list and crashed on `response_part[0]` if {INPUT} was missing. 2. Guard the regex no-match case. Some templates contain both placeholders but not in a recoverable two-example shape, in which case `re.findall` returns an empty list and `[0]` raises. 3. Initialize `found = None` before the separator-search loop and raise if the loop never sets it. Previously, if the first iteration's `re.finditer` was empty the loop broke without binding `found`, and `found.group(1)` raised AttributeError on the stale int left over from the outer rfind loop. Rephrase the final-trim error messages from internal variable names ("input_part") to user-facing wording ("instruction section") and include a bounded (200-char) excerpt of the offending content so the error is debuggable without being unbounded. Add tests/python/test_construct_chat_template_validation.py covering each failure mode with a fake tokenizer (no HF_TOKEN, no model download, CPU-only). * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: Daniel Han <danielhanchen@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
rsd-darshan
pushed a commit
to rsd-darshan/unsloth
that referenced
this pull request
Jun 3, 2026
…eholders (unslothai#5763) * fix(chat_templates): check find() return value before slicing on placeholders Two places in `construct_chat_template()` use `str.find()` for sentinel placeholders (`{INPUT}` / `{OUTPUT}`) without checking the -1 return: 1. The `except:` fallback (around line 2464) computes `chat_template[chat_template.find("{OUTPUT}") + len("{OUTPUT}"):]`. If the template has no `{OUTPUT}` marker, `find()` returns -1 and the slice starts at offset 7 (`-1 + len("{OUTPUT}")`), producing garbage that's then `re.escape`-d and fed back into the template-recovery regex. The user sees a confusing `IndexError` on `response_part = response_part[0]` instead of the real problem. 2. The final trim before returning (`input_part[:input_part.find("{INPUT}")]` and the matching `{OUTPUT}` line) silently drops the last character when the placeholder is missing — `find()` returns -1, and `[:-1]` slices everything except the last character, returning a corrupted template prefix to the caller. Replace both with an explicit `-1` check that raises a clear `RuntimeError` naming the missing placeholder, matching the existing guard pattern from unslothai#4923 (`try_fix_tokenizer`). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(chat_templates): also guard {INPUT} and fallback regex/separator paths Builds on the {OUTPUT} / final-trim guards in this branch by closing the three remaining ways the except-block fallback in construct_chat_template() can still raise a confusing IndexError or AttributeError on malformed templates: 1. Validate both {INPUT} and {OUTPUT} before deriving `ending`. The regex two lines later (`{INPUT} + ending + ...`) still produced an empty list and crashed on `response_part[0]` if {INPUT} was missing. 2. Guard the regex no-match case. Some templates contain both placeholders but not in a recoverable two-example shape, in which case `re.findall` returns an empty list and `[0]` raises. 3. Initialize `found = None` before the separator-search loop and raise if the loop never sets it. Previously, if the first iteration's `re.finditer` was empty the loop broke without binding `found`, and `found.group(1)` raised AttributeError on the stale int left over from the outer rfind loop. Rephrase the final-trim error messages from internal variable names ("input_part") to user-facing wording ("instruction section") and include a bounded (200-char) excerpt of the offending content so the error is debuggable without being unbounded. Add tests/python/test_construct_chat_template_validation.py covering each failure mode with a fake tokenizer (no HF_TOKEN, no model download, CPU-only). * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: Daniel Han <danielhanchen@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
try_fix_tokenizer(),str.find()returns-1when the substring is not found, but the return value is checked after addinglen(find_text). This meansstartbecomeslen(find_text) - 1(a positive number), soif start == -1: continueis dead code that never triggers.tokenizer_string[start:end]at the wrong position, extracting garbage text and potentially corrupting the tokenizer JSON via the subsequentstr.replace().find()call and the offset addition into two steps, checking for-1before computing the actual start position.Before (bug)
After (fix)
Test plan
str.find()returning-1now correctly triggerscontinuefind()succeeds,startis computed identically to the original code