fix: Gemma 4 MoE multi-turn tool call corruption#1665
Merged
Conversation
Strip stray `<|tool_call>` / `<tool_call|>` tokens from assistant message content before feeding history back to `apply_chat_template`. These tokens only belong in `tool_calls`/`tool_responses`; any occurrence in `content` is a streaming leak artifact. Left in place, the template embeds them as real special tokens, producing unbalanced open/close counts that corrupt the model's context on subsequent turns.
ToolCallStreamFilter flushed a split close marker (e.g. <tool_call| then >) immediately because _partial_suffix_len did not detect it as a partial prefix worth holding. Add stray-close markers to the hold logic so both halves reassemble before the strip check fires.
Owner
|
Thanks for tracking this down. The Gemma 4 multi-turn repro lines up with the leak through stored assistant content, and this fixes the path I care about for #617. I found one small scope issue in the streaming filter for non-Gemma XML close markers, so I will merge this and fold that into a follow-up on main. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
👋 First time contribution for me! Please let me know if you'd like to see stuff changed on this changeset, I'm happy to work with you all. Been loving oMLX for weeks now, figured it was time to give back.
Problem domain
Multi-turn tool-calling conversations with Gemma 4 26B A4B were breaking on the second and later turns. Stray protocol tokens (
<|tool_call>,<tool_call|>) were leaking from the streaming layer into the assistant message content. When that message was getting fed back throughapply_chat_template, the tokens got treated as real special tokens, producing unbalanced delimiters that corrupt the model's context for all subsequent turns.This PR fixes that by identifying and stripping stray closing markers in
tool_calling.py, and sanitizes thecontentin Gemma 4 runs before re-rendering conversation history.Root cause(s)
ToolCallStreamFilterdidn't strip stray close markers.<tool_call|>emitted outside a matched open/close pair passed through unfiltered<tool_call|then>), the buffer would also flush each half immediately rather than holding them for reassemblyextract_gemma4_messagesdidn't sanitize content before re-rendering history.How to reproduce
<tool_call|>embedded literally in the rendered prompt.Edge cases considered
<tool_call|>split as<tool_call| + >across twofeed()calls</tool_call>in prose: The strip is scoped to the bare stray tokens only (no leading slash), so legitimate closing tags in prose are untouchedIssues to close
This ought to finally close #617, which I was getting bit by. It might also close #1465 and/or #1410, but I'm not sure about that; those users should try re-testing if/when this makes it into
main.