fix: check find() return value before adding offset in try_fix_tokenizer by Ricardo-M-L · Pull Request #4923 · unslothai/unsloth

Ricardo-M-L · 2026-04-08T15:40:04Z

Summary

Bug: In try_fix_tokenizer(), str.find() returns -1 when the substring is not found, but the return value is checked after adding len(find_text). This means start becomes len(find_text) - 1 (a positive number), so if start == -1: continue is dead code that never triggers.
When the guard fails to trigger, the code slices tokenizer_string[start:end] at the wrong position, extracting garbage text and potentially corrupting the tokenizer JSON via the subsequent str.replace().
Fix: Split the find() call and the offset addition into two steps, checking for -1 before computing the actual start position.

Before (bug)

find_text = f'"id":{token_id},"content":"'
start = tokenizer_string.find(find_text) + len(find_text)
if start == -1:  # dead code: start >= len(find_text) - 1
    continue

After (fix)

find_text = f'"id":{token_id},"content":"'
find_pos = tokenizer_string.find(find_text)
if find_pos == -1:
    continue
start = find_pos + len(find_text)

Test plan

Verified that str.find() returning -1 now correctly triggers continue
Verified that when find() succeeds, start is computed identically to the original code

gemini-code-assist

Code Review

This pull request fixes a logic error in the try_fix_tokenizer function within unsloth/tokenizer_utils.py. The previous implementation incorrectly checked for the existence of a substring after adding the length of the search string to the find result, which would fail to catch cases where the substring was not found. The update correctly validates the find result before calculating the start position. I have no feedback to provide as there were no review comments.

The `str.find()` result was checked for -1 only after adding `len(find_text)`, turning the guard into dead code. When the substring is absent, `start` becomes `len(find_text) - 1` (a positive number), so the `if start == -1: continue` never triggers and the subsequent slice extracts garbage from the tokenizer string. Split the find and offset into two steps so the -1 check works correctly. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Skip loop iteration early when token_id is None to avoid constructing a find_text that can never match valid JSON - Guard end = tokenizer_string.find('",', start) against -1 to prevent silent garbage extraction from malformed tokenizer strings

for more information, see https://pre-commit.ci

danielhanchen · 2026-04-09T13:00:53Z

@codex review

danielhanchen · 2026-04-09T13:15:43Z

Thanks so much!

…eholders (unslothai#5763) * fix(chat_templates): check find() return value before slicing on placeholders Two places in `construct_chat_template()` use `str.find()` for sentinel placeholders (`{INPUT}` / `{OUTPUT}`) without checking the -1 return: 1. The `except:` fallback (around line 2464) computes `chat_template[chat_template.find("{OUTPUT}") + len("{OUTPUT}"):]`. If the template has no `{OUTPUT}` marker, `find()` returns -1 and the slice starts at offset 7 (`-1 + len("{OUTPUT}")`), producing garbage that's then `re.escape`-d and fed back into the template-recovery regex. The user sees a confusing `IndexError` on `response_part = response_part[0]` instead of the real problem. 2. The final trim before returning (`input_part[:input_part.find("{INPUT}")]` and the matching `{OUTPUT}` line) silently drops the last character when the placeholder is missing — `find()` returns -1, and `[:-1]` slices everything except the last character, returning a corrupted template prefix to the caller. Replace both with an explicit `-1` check that raises a clear `RuntimeError` naming the missing placeholder, matching the existing guard pattern from unslothai#4923 (`try_fix_tokenizer`). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(chat_templates): also guard {INPUT} and fallback regex/separator paths Builds on the {OUTPUT} / final-trim guards in this branch by closing the three remaining ways the except-block fallback in construct_chat_template() can still raise a confusing IndexError or AttributeError on malformed templates: 1. Validate both {INPUT} and {OUTPUT} before deriving `ending`. The regex two lines later (`{INPUT} + ending + ...`) still produced an empty list and crashed on `response_part[0]` if {INPUT} was missing. 2. Guard the regex no-match case. Some templates contain both placeholders but not in a recoverable two-example shape, in which case `re.findall` returns an empty list and `[0]` raises. 3. Initialize `found = None` before the separator-search loop and raise if the loop never sets it. Previously, if the first iteration's `re.finditer` was empty the loop broke without binding `found`, and `found.group(1)` raised AttributeError on the stale int left over from the outer rfind loop. Rephrase the final-trim error messages from internal variable names ("input_part") to user-facing wording ("instruction section") and include a bounded (200-char) excerpt of the offending content so the error is debuggable without being unbounded. Add tests/python/test_construct_chat_template_validation.py covering each failure mode with a fake tokenizer (no HF_TOKEN, no model download, CPU-only). * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: Daniel Han <danielhanchen@gmail.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

Ricardo-M-L requested review from danielhanchen and mmathew23 as code owners April 8, 2026 15:40

gemini-code-assist Bot reviewed Apr 8, 2026

View reviewed changes

danielhanchen mentioned this pull request Apr 9, 2026

fix: check find() return value before adding offset in try_fix_tokenizer unslothai/unsloth-staging-1#36

Closed

2 tasks

Ricardo-M-L and others added 2 commits April 9, 2026 12:46

danielhanchen force-pushed the fix/tokenizer-find-check branch from 590cc21 to 736cb59 Compare April 9, 2026 12:46

[pre-commit.ci] auto fixes from pre-commit.com hooks

0706001

for more information, see https://pre-commit.ci

pre-commit-ci Bot requested a review from rolandtannous as a code owner April 9, 2026 12:47

danielhanchen merged commit d5525e8 into unslothai:main Apr 9, 2026
1 check passed

Ricardo-M-L mentioned this pull request May 25, 2026

fix(chat_templates): check find() return value before slicing on placeholders #5763

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: check find() return value before adding offset in try_fix_tokenizer#4923

fix: check find() return value before adding offset in try_fix_tokenizer#4923
danielhanchen merged 3 commits into
unslothai:mainfrom
Ricardo-M-L:fix/tokenizer-find-check

Ricardo-M-L commented Apr 8, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

danielhanchen commented Apr 9, 2026

Uh oh!

danielhanchen commented Apr 9, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

Ricardo-M-L commented Apr 8, 2026

Summary

Before (bug)

After (fix)

Test plan

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

danielhanchen commented Apr 9, 2026

Uh oh!

danielhanchen commented Apr 9, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants