Skip to content

fix: check find() return value before adding offset in try_fix_tokenizer#4923

Merged
danielhanchen merged 3 commits into
unslothai:mainfrom
Ricardo-M-L:fix/tokenizer-find-check
Apr 9, 2026
Merged

fix: check find() return value before adding offset in try_fix_tokenizer#4923
danielhanchen merged 3 commits into
unslothai:mainfrom
Ricardo-M-L:fix/tokenizer-find-check

Conversation

@Ricardo-M-L

Copy link
Copy Markdown
Contributor

Summary

  • Bug: In try_fix_tokenizer(), str.find() returns -1 when the substring is not found, but the return value is checked after adding len(find_text). This means start becomes len(find_text) - 1 (a positive number), so if start == -1: continue is dead code that never triggers.
  • When the guard fails to trigger, the code slices tokenizer_string[start:end] at the wrong position, extracting garbage text and potentially corrupting the tokenizer JSON via the subsequent str.replace().
  • Fix: Split the find() call and the offset addition into two steps, checking for -1 before computing the actual start position.

Before (bug)

find_text = f'"id":{token_id},"content":"'
start = tokenizer_string.find(find_text) + len(find_text)
if start == -1:  # dead code: start >= len(find_text) - 1
    continue

After (fix)

find_text = f'"id":{token_id},"content":"'
find_pos = tokenizer_string.find(find_text)
if find_pos == -1:
    continue
start = find_pos + len(find_text)

Test plan

  • Verified that str.find() returning -1 now correctly triggers continue
  • Verified that when find() succeeds, start is computed identically to the original code

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request fixes a logic error in the try_fix_tokenizer function within unsloth/tokenizer_utils.py. The previous implementation incorrectly checked for the existence of a substring after adding the length of the search string to the find result, which would fail to catch cases where the substring was not found. The update correctly validates the find result before calculating the start position. I have no feedback to provide as there were no review comments.

Ricardo-M-L and others added 2 commits April 9, 2026 12:46
The `str.find()` result was checked for -1 only after adding
`len(find_text)`, turning the guard into dead code. When the substring
is absent, `start` becomes `len(find_text) - 1` (a positive number),
so the `if start == -1: continue` never triggers and the subsequent
slice extracts garbage from the tokenizer string.

Split the find and offset into two steps so the -1 check works correctly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Skip loop iteration early when token_id is None to avoid constructing
  a find_text that can never match valid JSON
- Guard end = tokenizer_string.find('",', start) against -1 to prevent
  silent garbage extraction from malformed tokenizer strings
@danielhanchen danielhanchen force-pushed the fix/tokenizer-find-check branch from 590cc21 to 736cb59 Compare April 9, 2026 12:46
@pre-commit-ci pre-commit-ci Bot requested a review from rolandtannous as a code owner April 9, 2026 12:47
@danielhanchen

Copy link
Copy Markdown
Member

@codex review

@danielhanchen

Copy link
Copy Markdown
Member

Thanks so much!

@danielhanchen danielhanchen merged commit d5525e8 into unslothai:main Apr 9, 2026
1 check passed
pull Bot pushed a commit to ShinnChow/unsloth that referenced this pull request May 25, 2026
…eholders (unslothai#5763)

* fix(chat_templates): check find() return value before slicing on placeholders

Two places in `construct_chat_template()` use `str.find()` for sentinel
placeholders (`{INPUT}` / `{OUTPUT}`) without checking the -1 return:

1. The `except:` fallback (around line 2464) computes
   `chat_template[chat_template.find("{OUTPUT}") + len("{OUTPUT}"):]`.
   If the template has no `{OUTPUT}` marker, `find()` returns -1 and the
   slice starts at offset 7 (`-1 + len("{OUTPUT}")`), producing garbage
   that's then `re.escape`-d and fed back into the template-recovery
   regex. The user sees a confusing `IndexError` on
   `response_part = response_part[0]` instead of the real problem.

2. The final trim before returning (`input_part[:input_part.find("{INPUT}")]`
   and the matching `{OUTPUT}` line) silently drops the last character
   when the placeholder is missing — `find()` returns -1, and `[:-1]`
   slices everything except the last character, returning a corrupted
   template prefix to the caller.

Replace both with an explicit `-1` check that raises a clear
`RuntimeError` naming the missing placeholder, matching the existing
guard pattern from unslothai#4923 (`try_fix_tokenizer`).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(chat_templates): also guard {INPUT} and fallback regex/separator paths

Builds on the {OUTPUT} / final-trim guards in this branch by closing
the three remaining ways the except-block fallback in
construct_chat_template() can still raise a confusing IndexError or
AttributeError on malformed templates:

1. Validate both {INPUT} and {OUTPUT} before deriving `ending`. The
   regex two lines later (`{INPUT} + ending + ...`) still produced an
   empty list and crashed on `response_part[0]` if {INPUT} was missing.
2. Guard the regex no-match case. Some templates contain both
   placeholders but not in a recoverable two-example shape, in which
   case `re.findall` returns an empty list and `[0]` raises.
3. Initialize `found = None` before the separator-search loop and
   raise if the loop never sets it. Previously, if the first
   iteration's `re.finditer` was empty the loop broke without binding
   `found`, and `found.group(1)` raised AttributeError on the stale
   int left over from the outer rfind loop.

Rephrase the final-trim error messages from internal variable names
("input_part") to user-facing wording ("instruction section") and
include a bounded (200-char) excerpt of the offending content so the
error is debuggable without being unbounded.

Add tests/python/test_construct_chat_template_validation.py covering
each failure mode with a fake tokenizer (no HF_TOKEN, no model
download, CPU-only).

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
rsd-darshan pushed a commit to rsd-darshan/unsloth that referenced this pull request Jun 3, 2026
…eholders (unslothai#5763)

* fix(chat_templates): check find() return value before slicing on placeholders

Two places in `construct_chat_template()` use `str.find()` for sentinel
placeholders (`{INPUT}` / `{OUTPUT}`) without checking the -1 return:

1. The `except:` fallback (around line 2464) computes
   `chat_template[chat_template.find("{OUTPUT}") + len("{OUTPUT}"):]`.
   If the template has no `{OUTPUT}` marker, `find()` returns -1 and the
   slice starts at offset 7 (`-1 + len("{OUTPUT}")`), producing garbage
   that's then `re.escape`-d and fed back into the template-recovery
   regex. The user sees a confusing `IndexError` on
   `response_part = response_part[0]` instead of the real problem.

2. The final trim before returning (`input_part[:input_part.find("{INPUT}")]`
   and the matching `{OUTPUT}` line) silently drops the last character
   when the placeholder is missing — `find()` returns -1, and `[:-1]`
   slices everything except the last character, returning a corrupted
   template prefix to the caller.

Replace both with an explicit `-1` check that raises a clear
`RuntimeError` naming the missing placeholder, matching the existing
guard pattern from unslothai#4923 (`try_fix_tokenizer`).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(chat_templates): also guard {INPUT} and fallback regex/separator paths

Builds on the {OUTPUT} / final-trim guards in this branch by closing
the three remaining ways the except-block fallback in
construct_chat_template() can still raise a confusing IndexError or
AttributeError on malformed templates:

1. Validate both {INPUT} and {OUTPUT} before deriving `ending`. The
   regex two lines later (`{INPUT} + ending + ...`) still produced an
   empty list and crashed on `response_part[0]` if {INPUT} was missing.
2. Guard the regex no-match case. Some templates contain both
   placeholders but not in a recoverable two-example shape, in which
   case `re.findall` returns an empty list and `[0]` raises.
3. Initialize `found = None` before the separator-search loop and
   raise if the loop never sets it. Previously, if the first
   iteration's `re.finditer` was empty the loop broke without binding
   `found`, and `found.group(1)` raised AttributeError on the stale
   int left over from the outer rfind loop.

Rephrase the final-trim error messages from internal variable names
("input_part") to user-facing wording ("instruction section") and
include a bounded (200-char) excerpt of the offending content so the
error is debuggable without being unbounded.

Add tests/python/test_construct_chat_template_validation.py covering
each failure mode with a fake tokenizer (no HF_TOKEN, no model
download, CPU-only).

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: Daniel Han <danielhanchen@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants