
grammar : fix grammar trigger crash when token extends beyond trigger pattern #19503

Open

EliasOenal wants to merge 1 commit into ggml-org:master from EliasOenal:fix-grammar-trigger-crash

Conversation

@EliasOenal

This fixes a crash that occurs when the model emits a tool call in which a single token a) completes the grammar trigger pattern but b) also contains extra, invalid text beyond it. For example: the buffer ends in <function, matching a prefix of the trigger pattern <function=. If the next token arrives as =list, the = completes the trigger, while the trailing list turns it into an invalid tool call within the same token (for this example we assume no tool named list is available). I believe some models have per-tool-name triggers for their tool calls and are not susceptible to the bug, but the Qwen3 models and a few others are.
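The failure mode can be sketched in a few lines of Python. This is a simplified, hypothetical stand-in for llama.cpp's lazy-grammar trigger logic, not the actual implementation: it only shows how one token can simultaneously complete the trigger and carry text that the grammar must immediately validate.

```python
# Hypothetical simulation of the lazy-grammar trigger overrun (not llama.cpp code).
TRIGGER = "<tool_call>\n<function="
ALLOWED_TOOLS = ("bash", "search")  # tool names the grammar accepts

def accept_token(buffer, token):
    """Append a token; if the trigger just completed inside this token,
    return the text that spilled past it (which the grammar must now parse)."""
    new_buffer = buffer + token
    if TRIGGER not in buffer and TRIGGER in new_buffer:
        spill = new_buffer.split(TRIGGER, 1)[1]
        return new_buffer, spill
    return new_buffer, None

# Buffer already holds a prefix of the trigger; the next token is "=list".
buf, spill = accept_token("<tool_call>\n<function", "=list")
print(spill)  # "list": completes the trigger AND names a non-existent tool
print(any(spill.startswith(t) for t in ALLOWED_TOOLS))  # False
```

Because the spilled text matches no allowed tool name, the grammar stack is emptied in the same accept step that armed the grammar, which is the condition the unpatched code failed to handle.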

  • This PR fixes the main issue in src/llama-grammar.cpp by letting calls to hallucinated tools fail gracefully. The resulting output will most likely reach the client as an invalid tool call, but that is what the model generated.

  • As a complementary improvement, it adds basic exception handling to tools/server/server-context.cpp, which was previously missing and caused llama-server to crash.
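The second bullet describes a per-request containment pattern. A minimal sketch of that pattern, translated to Python for illustration (the real change is in C++ server code, and the function name here is hypothetical):

```python
# Sketch only: catch a grammar failure per request and return it as an
# error response, instead of letting the exception terminate the server.
def handle_completion(generate):
    try:
        return {"content": generate()}
    except RuntimeError as e:
        # e.g. "Unexpected empty grammar stack after accepting piece: =list"
        return {"error": f"Grammar error: {e}"}
```

The key property is that a failing request produces an error payload while the server process (and its slot) stays usable for subsequent requests.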

Resolves #19353 and resolves #19304.

Minimal reproducer:

#!/usr/bin/env python3
"""Reproducer: grammar crash when a token completes a trigger AND contains
text beyond it (e.g. token "=list" completing trigger "<function=").
Requires a running llama-server with a Qwen3 model (e.g. Qwen3-4B)."""
import requests, sys, time

URL = "http://127.0.0.1:8080"
PAYLOAD = {
    "prompt": """<|im_start|>system
You must use the tool to answer.
<tools><function><name>list</name><description>List files</description>
<parameters><parameter><name>path</name><type>string</type></parameter>
<required>["path"]</required></parameters></function></tools>
Reply format: <tool_call>\n<function=name>\n<parameter=p>v</parameter>\n</function>\n</tool_call><|im_end|>
<|im_start|>user
List /tmp<|im_end|>
<|im_start|>assistant
""",
    # grammar only allows specific tool names — "list" is intentionally missing
    "grammar": 'root ::= "<tool_call>\\n<function=" ("bash"|"search") ">" [^<]* "</function>\\n</tool_call>"',
    "grammar_lazy": True,
    "grammar_triggers": [{"type": 2, "value": r"<tool_call>\n<function="}],
    "n_predict": 256, "temperature": 0.7,
}

for i in range(1, 6):
    try:
        r = requests.post(f"{URL}/completion", json=PAYLOAD, timeout=60)
        print(f"{i}: {r.json().get('content', '')[:100]}")
    except requests.exceptions.ConnectionError:
        print(f"{i}: SERVER CRASHED"); sys.exit(1)
print("No crash after 5 attempts.")

repro_min.py

Copilot AI review requested due to automatic review settings February 11, 2026 06:07
Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@dstolpmann

This resolves #19304 for me. Thank you!

@tamascode

@EliasOenal

The PR fixes the crash — llama-server no longer terminates when the grammar issue occurs.

However, the underlying grammar failure still happens intermittently during tool-calling with Qwen3 Coder Next. The request now fails gracefully with:

Grammar error: Unexpected empty grammar stack after accepting piece: =read

Observed behavior:

  • prompt processes normally
  • generation starts
  • model emits a tool-call piece (=read)
  • grammar stack becomes empty
  • request returns an error
  • server stays alive and releases the slot correctly

So this appears to be:

  • crash fixed
  • grammar/tool-call mismatch still present

Environment:

  • llama-server (/v1/chat/completions)
  • Qwen3 Coder chat format
  • tool calling / grammar-constrained decoding enabled
  • large prompt (~43k tokens)

@EliasOenal
Author

@tamascode I was looking into hallucinated invalid tool calls as well, but I didn't want to introduce major changes to the llama.cpp codebase in my first PR. Thus I focused on fixing the crash, which should be a strict improvement.

To me it seems to be a deliberate decision to only activate the "lazy grammar" path after the tool name has been completed. The issue is that these Qwen3 models occasionally hallucinate calls to invalid tools. At least for me, with my fix applied and OpenCode, the failed tool call informs the model, which usually gets it right on the second try.

I believe it would be possible to trigger the grammar earlier, to force models to emit only calls to valid tools, but that does not seem to be the design goal of "lazy grammar". I am happy to look into this further if I could get some guidance on what would be the best fit for the project.

@ngxson I know #18675 may also address the current llama-server crash, but it seems like a larger undertaking with a potentially longer timeline. Given that Qwen3 Coder Next is a very popular model and many people are facing crashes: do you think it makes sense to merge this targeted fix to make the model work, or would you prefer to wait for the autoparser to be merged instead?

@aldehir
Collaborator

aldehir commented Feb 15, 2026

The pattern should trigger before a tool name is generated, to ensure the grammar constrains model output to valid tool calls. The fix here is too invasive.

I'd rather roll out a separate custom Qwen 3 Coder Next parser with the proper trigger rules until the autoparser PR is merged. It could also be as simple as changing the pattern to look for <function instead of <function=, if = is not part of the same token.
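In terms of the reproducer's payload fields, the suggested pattern change would be a one-character adjustment. This is a hypothetical variant, not a change from the PR, and whether it helps depends on how the tokenizer splits the text around =:

```python
# Hypothetical variant of the reproducer's payload: trigger on "<function"
# (without the "="), so the grammar is already constraining output when
# the "=<toolname>" token is sampled. The grammar rule itself is unchanged.
PAYLOAD_VARIANT = {
    "grammar": 'root ::= "<tool_call>\\n<function=" ("bash"|"search") ">" [^<]* "</function>\\n</tool_call>"',
    "grammar_lazy": True,
    "grammar_triggers": [{"type": 2, "value": r"<tool_call>\n<function"}],
}
```

The activation point moves one character earlier, ahead of the =, so a token like =list would be sampled under grammar constraints rather than after the fact.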

@EliasOenal
Copy link
Author

Triggering on <function would only reliably fix this if function never tokenizes with anything appended, and I'm not convinced that is guaranteed. The current crash is a denial-of-service issue: any user can take the server down by steering the model to emit the right token sequence (see the reproducer). It also seemed to me that there were additional code paths that may throw; llama-server simply wasn't handling the exceptions at all.


Development

Successfully merging this pull request may close these issues:

  • Eval bug: crash in llama_grammar_accept_token
  • Eval bug: llama.cpp crashes when running Qwen Next 80B Coder

5 participants