
Autoparser - complete refactoring of parser architecture #18675

Merged
pwilkin merged 2 commits into ggml-org:master from pwilkin:autoparser
Mar 6, 2026

Conversation

@pwilkin
Collaborator

@pwilkin pwilkin commented Jan 7, 2026

This is a huge endeavor that I promised back when I applied for maintaining the parser code. The legacy parser code was hard to maintain and buggy and supporting new models with it was really annoying. There was a worthwhile contribution by @hksdpc255 to add some XML toolcalling abstractions, but that was still just a patch on an open wound.

Thanks to @aldehir and his PEG parser, I managed to create an autoparser mechanism, using all the currently supported templates, their parsers and test cases as a base. The idea is simple: most models' syntax follows the general pattern of:

<reasoning_markers> <reasoning_content> <end_of_reasoning_markers> <content_markers> <main_content> <end_of_content_markers> <tool_call_markers> ( <json> | <function marker> <args json> | <function marker> <args marker> <value json> ) <end_of_tool_call_marker>

Of course, some elements might not be present in a given template, but that's the general structure. Since this is a pretty finite structure, it's possible to determine the relevant elements by differential analysis - similar to how Minja already does capability detection, but more fine-grained, because by comparing various template outputs, we get to actually extract the relevant markers.
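As an illustration, here is a hedged sketch of how a single model turn decomposes into that generic structure. The `<think>`/`<tool_call>` markers are illustrative (Qwen-style), not taken from any particular template in this PR, and the regex is only a toy stand-in - the actual implementation builds a PEG parser, not a regex:

```python
# Toy decomposition of one model turn into the generic structure:
# <reasoning_markers> <reasoning_content> <end_of_reasoning_markers>
# <main_content> <tool_call_markers> <json> <end_of_tool_call_marker>
# Marker strings here are illustrative (Qwen-style), not from any real template.
import re

output = (
    "<think>The user wants the weather.</think>"
    "I'll check that for you."
    '<tool_call>{"name": "get_weather", "arguments": {"city": "Oslo"}}</tool_call>'
)

pattern = re.compile(
    r"(?:<think>(?P<reasoning>.*?)</think>)?"       # optional reasoning block
    r"(?P<content>.*?)"                             # main content
    r"(?:<tool_call>(?P<tool>.*?)</tool_call>)?$",  # optional tool call (JSON payload)
    re.S,
)
m = pattern.match(output)
print(m.group("reasoning"))  # The user wants the weather.
print(m.group("content"))    # I'll check that for you.
```

Each section is optional, which matches the observation above that some elements might not be present in a given template.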

Some models will obviously not get handled so easily. However, in the course of implementing the mechanism, only two models remained that needed their own separate parsers: Ministral and GPT-OSS - the former not because of its complexity, but because of the need to rewrite the message structure passed to the template. GPT-OSS is a different beast since it supports arbitrarily many interleaved blocks, so it doesn't fit into the scheme I mentioned above (but its parser has been rewritten to PEG as well).

This is currently anchored on Minja and uses its capability detection, but since the differential analysis already does its own capability detection, I fully expect to throw that part out and base this on @ngxson's #18462 instead.

Obsoletes #18353 (sorry @ochafik - I know you put a lot of work into that).

Old parsers, tests and all supporting code are thrown out; templates got new PEG-parser based testcases, all of which now also test streaming behavior. I have tested this extensively on agentic coding (mostly with OpenCode) to ensure that this actually works. My wish to refactor the parser code was mostly caused by my prior experience with agentic coding on llama.cpp, which was extremely buggy with a lot of models; this is an attempt to remedy that. Hopefully, having one unified codebase with a largely reduced line-of-code count will make it easier to fix any potential errors.

This also means that there is no longer a need to provide support for new models' specific templates unless they have some odd constructs - they should be supported out of the box. There's a new tool called debug-template-parser that you can point at any Jinja template file or GGUF model with an embedded Jinja template and have it spit out the details of the generated autoparser + tool-calling grammar.

Oh, important note: all Minja polyfills have been disabled. Working templates are now required. While I see why, a year and a half ago, having proof-of-concept code that supported tool calling on models that didn't natively have it might've been useful, right now supporting that makes it harder to properly support current and actually used models. Therefore, a functional template with tool calling is required if someone wants tool calling.

I want to ask everyone from the community who can to test this. I will keep this branch current with master. I tried to test this as much as I could, but I'm just one person doing this after work, so obviously my testing abilities were limited. I will keep this as a draft until I've gathered enough feedback and testing data.

To not clutter the main repository's issue tracker, please report bugs either (a) in this thread or (b) in my issue tracker https://github.com/pwilkin/llama.cpp/issues

AI DISCLOSURE: Gemini Pro 3, Flash 3, Opus 4.5 and GLM 4.7 would like to admit that a human element did at some points interfere in the coding process, being so bold as to even throw most of the code out at some point and demand it be rewritten from scratch. The human also tinkered with the code massively, removing a lot of our beautiful comments and some code fragments that they claimed were useless. They had no problems, however, in using us to do all the annoying marker arithmetic. Therefore, we disavow any claim to this code and cede the responsibility onto the human.

@hksdpc255
Contributor

Does this mean we don’t need to write a parser anymore, and it will be automatically generated from the chat template?

@pwilkin
Collaborator Author

pwilkin commented Jan 8, 2026

Does this mean we don’t need to write a parser anymore, and it will be automatically generated from the chat template?

Yup, that's the gist of it.

@hksdpc255
Contributor

This feels almost magical. How does it work? Does it detect common patterns in the rendered template output? What happens if the chat template requires additional arguments?

@pwilkin
Collaborator Author

pwilkin commented Jan 8, 2026

This feels almost magical. How does it work? Does it detect common patterns in the rendered template output? What happens if the chat template requires additional arguments?

Yeah, it does differential analysis - it prepares different inputs to the template and then tests the outputs. For example, by using the same function signature with a different name you can identify where the function name goes, by using the same function with one and two parameters you can identify how parameters are passed, and so on.

The nice thing is, I managed to squish it to just 2k lines of code (1k for analysis and 1k for helpers), so it's not even that bloated.

As for custom inputs - I assume standard inputs here, and that's what most template makers try to adhere to anyway. If not, you end up with a custom handler like for Ministral - but as a followup I want to separate handlers from parsers (since passing extra params is much easier than handling an entire template from scratch) or even add autodetection for common custom keywords (we're going to have to support "reasoning" in addition to "reasoning_content" at some point because vLLM is moving to that).
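To make the differential-analysis idea concrete, here is a minimal sketch in Python. The real implementation is C++ and drives a Jinja engine; `render` below is a hypothetical stand-in for rendering a chat template with one tool call. Two probe renders that differ only in the tool name are diffed, and the common prefix and suffix are the candidate markers:

```python
# Hedged sketch of differential marker extraction: render the same
# template twice with probe values that differ in exactly one slot,
# then take the common prefix/suffix to recover the surrounding markers.
def render(tool_name: str) -> str:
    # Hypothetical stand-in for a template render with one tool call.
    return f'<tool_call>\n{{"name": "{tool_name}", "arguments": {{}}}}\n</tool_call>'

def common_prefix(a: str, b: str) -> str:
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    return a[:i]

def common_suffix(a: str, b: str) -> str:
    i = 0
    while i < min(len(a), len(b)) and a[-1 - i] == b[-1 - i]:
        i += 1
    return a[len(a) - i:]

# Probe names share no characters at any position, so the diff is clean.
out_a = render("aaaa")
out_b = render("bbbb")
prefix = common_prefix(out_a, out_b)  # marker text before the function name
suffix = common_suffix(out_a, out_b)  # marker text after the function name
print(repr(prefix))  # '<tool_call>\n{"name": "'
print(repr(suffix))  # '", "arguments": {}}\n</tool_call>'
```

The same trick with one vs. two parameters, or with vs. without reasoning content, isolates the other markers in the generic structure.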

@pwilkin pwilkin force-pushed the autoparser branch 2 times, most recently from dc7dd03 to 5519998 Compare January 8, 2026 14:53
@hksdpc255
Contributor

hksdpc255 commented Jan 9, 2026

This approach does not seem to work well for models like Kimi-K2-Thinking, which may generate tool calls inside the thinking block, while the chat template itself automatically closes the thinking block correctly. In other words, the model’s behavior does not seem to be fully aligned with the assumptions made by the chat template. Is that understanding correct? I noticed that you have removed all parsers.

Additionally, I am planning to add a new custom parser for MiroThinker. Its official chat template does not accurately reflect the rendering logic actually used in their benchmarks. Is there a recommended starting point for implementing such a parser for the new parsing architecture?

@pwilkin
Collaborator Author

pwilkin commented Jan 9, 2026

I've heard of those mysterious tool calls inside thinking blocks for K2-Thinking, but I've yet to know if they are an actual thing or if they are just an artifact of low quantization. To be honest, outside of the native provider, I haven't seen K2-Thinking implemented anywhere in a working fashion. The Chutes version that I tested quite a few times bugs out on tool calling extremely often.

I'm really skeptical of modifying anything based on hearsay and things "floating around". I remember the discussion here about interleaved thinking and I myself was convinced that meant models could have multiple <think> blocks until @aldehir pointed out that it's all a big misunderstanding and "interleaved thinking" is just the model having multiple message['assistant']['reasoning_content'] blocks next to message['assistant']['tool_call'] blocks. If I really see a working solution with open-sourced code anywhere that really demonstrates support for those thinking blocks, then sure, we can consider a special parser for K2-Thinking.

As for the Mirocode, I guess you're talking about adapting the Python code-based stuff they showed (the one that uses separate tags for MCP servers and code calling)? You can see how custom parsers are defined in chat.cpp. Not much has changed, except that since we use the PEG parser there are no longer dedicated parse() and init() functions; the entire parser is defined in the init. I'll probably separate the parsers into dedicated files soon.

@pwilkin pwilkin force-pushed the autoparser branch 2 times, most recently from 420f7bf to 9ea502a Compare January 13, 2026 16:23
@pwilkin pwilkin force-pushed the autoparser branch 2 times, most recently from a963e86 to 3594bd5 Compare January 16, 2026 23:13
@pwilkin pwilkin marked this pull request as ready for review January 17, 2026 17:31
@pwilkin
Collaborator Author

pwilkin commented Jan 17, 2026

All right, I've reached the "all tests passed" phase for test-chat, so I'm moving this officially out of draft. Will still test in practice but want to get all structural / architectural etc. issues out of the way in the meantime.

@github-actions github-actions bot added the jinja parser Issues related to the jinja parser label Jan 17, 2026
@DylanSchell

Glad that there is still some real intelligence involved. :D Hope its derailing didn't take too much time from you.

Nah, it's fine. I might add a case for end-of-generation as a terminating marker everywhere; not sure, though, whether having a response without content (just reasoning_content) would make much sense here either - it's a model error.

Just as a layman observer, could this be related to the issue where anthropic -> openai conversion is dropping reasoning blocks? #20090

@pwilkin pwilkin force-pushed the autoparser branch 2 times, most recently from d21ec53 to c8f7024 Compare March 5, 2026 22:57
@pwilkin
Collaborator Author

pwilkin commented Mar 5, 2026

Just as a layman observer, could this be related to the issue where anthropic -> openai conversion is dropping reasoning blocks? #20090

Most certainly could.

@aldehir
Collaborator

aldehir left a comment

Looks like the recent changes impacted my ability to cleanly rebase on top. I'll just open a new PR on master. My changes are more stylistic than functional, so there's no urgency.

visorcraft added a commit to visorcraft/llama.cpp that referenced this pull request Mar 6, 2026
@pwilkin pwilkin merged commit 566059a into ggml-org:master Mar 6, 2026
81 checks passed
@koush

koush commented Mar 6, 2026

I've heard of those mysterious tool calls inside thinking blocks for K2-Thinking, but I've yet to know if they are an actual thing or if they are just an artifact of low quantization. To be honest, outside of the native provider, I haven't seen K2-Thinking implemented anywhere in a working fashion. The Chutes version that I tested quite a few times bugs out on tool calling extremely often.

Kimi K2 thinking is fixed in vllm as of a few days ago. Can confirm since I'm the one that fixed it. I also use opencode. vllm-project/vllm#33646

@pwilkin
Collaborator Author

pwilkin commented Mar 6, 2026

@koush oh, nice :) yeah, I implemented the tool-call-in-thinking-block once I saw it for myself in this branch as well :)

@pwilkin
Collaborator Author

pwilkin commented Mar 6, 2026

@l0nedigit could you provide some more details? (like what was the prompt?)

@l0nedigit

l0nedigit commented Mar 6, 2026

Eh, I deleted it. Roo Code was giving me an error stating it was unable to communicate with the API. So I tried a curl command:

curl http://localhost:8090/v1/chat/completions   -H "Content-Type: application/json"   -d '{
    "model": "qwen3.5-27b-q8",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Say hello in one sentence."}
    ],
    "max_tokens": 50,
    "temperature": 0.7
  }'

That resulted in the error I pasted. When I removed temp/max_tokens, there was a response. Just now, for some reason, there is a very long response time due to a lot of thinking for saying hello in one sentence. Prior to that there was no thinking, though I thought the 27b was dense and a non-thinker by default 🤷 FWIW, Roo Code still replies back with "Unexpected API Response: The language model did not provide any assistant messages. This may indicate an issue with the API or the model's output."

TL;DR - comment deleted because of user error.

@pwilkin
Collaborator Author

pwilkin commented Mar 6, 2026

I'll test it more with Roo just to be sure.

@gtrak

gtrak commented Mar 6, 2026

I saw an error like that as well specifically on a non-streaming call, and I worked around it by disabling thinking. Streaming responses have been fine. Also qwen3.5 27b.

@l0nedigit

Hey thanks! @gtrak and @pwilkin

@l0nedigit

@gtrak have a beer on me this weekend ok? Thanks for the pro tip. Enabling streaming so far has produced better results.

@gtrak

gtrak commented Mar 6, 2026

I've been using this branch and 27b exclusively for opencode subagents over the last week and thinking has been great. That model at q4_k_s gives me 40 tok/s on a 4090 and it's doing all my code generation better than I expected. The only issue I have is an occasional crash and restart of llama.cpp if I try to use it in parallel, but I think there's another PR floating around for that problem.

@Galunid
Contributor

Galunid commented Mar 7, 2026

@pwilkin Hi, after this got merged, I'm getting the following error:

{"error":{"code":500,"message":"Failed to parse input at pos 0: math,physics","type":"server_error"}}
[58831] srv  update_slots: all slots are idle
[58831] srv          stop: all tasks already finished, no need to cancel
[58831] que    start_loop: waiting for new tasks
[58831] srv    operator(): got exception: {"error":{"code":500,"message":"Failed to parse input at pos 0: math,physics","type":"server_error"}}
[58831] srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 500
[58831] srv  log_server_r: request:  {"messages":[{"role":"system","content":"I am currently working on tagging dataset. I should tag all the problems accordingly using only tags in this list:\n   math - for when problems need specific mathematic solution\n   physics - when problem is a physics one\n   literature - for literature related problems\n\nI should also pay attention to the way I format the tags. My response should be only tags from the first category separated using comma.\nI should also pay attention to the way tags are desribed and apply them only when they match the correct criteria.\nI should also pay attention to the way tags are described and make sure all the relevant tags are applied based on their descriptions.\nI should not provide additional reasoning next to tag. I should provide only the correct list of tags as strings"},{"role":"user","content":"The game of NIM\n\nDetermine the best strategy for each player in the following two-player game. There\nare three piles, each of which contains some number of coins. Players alternate turns,\neach turn consisting of removing any (non-zero) number of coins from a single pile.\nThe goal is to be the person to remove the last coin(s)."}],"model":"Qwen3-1.7B-Q8_0.gguf","response_format":{"type":"json_schema","json_schema":{"schema":{"$defs":{"TagsEnum":{"enum":["math","physics","literature","<class '__main__.TagsEnum.Config'>"],"title":"TagsEnum","type":"string"}},"additionalProperties":true,"properties":{"tags":{"items":{"$ref":"#/$defs/TagsEnum"},"title":"Tags","type":"array"}},"required":["tags"],"title":"Problem","type":"object"},"name":"Problem","strict":true}},"stream":false}
[58831] srv  log_server_r: response: {"error":{"code":500,"message":"Failed to parse input at pos 0: math,physics","type":"server_error"}}
srv  server_http_: received response headers
srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 500
srv  log_server_r: request:  {"messages":[{"role":"system","content":"I am currently working on tagging dataset. I should tag all the problems accordingly using only tags in this list:\n   math - for when problems need specific mathematic solution\n   physics - when problem is a physics one\n   literature - for literature related problems\n\nI should also pay attention to the way I format the tags. My response should be only tags from the first category separated using comma.\nI should also pay attention to the way tags are desribed and apply them only when they match the correct criteria.\nI should also pay attention to the way tags are described and make sure all the relevant tags are applied based on their descriptions.\nI should not provide additional reasoning next to tag. I should provide only the correct list of tags as strings"},{"role":"user","content":"The game of NIM\n\nDetermine the best strategy for each player in the following two-player game. There\nare three piles, each of which contains some number of coins. Players alternate turns,\neach turn consisting of removing any (non-zero) number of coins from a single pile.\nThe goal is to be the person to remove the last coin(s)."}],"model":"Qwen3-1.7B-Q8_0.gguf","response_format":{"type":"json_schema","json_schema":{"schema":{"$defs":{"TagsEnum":{"enum":["math","physics","literature","<class '__main__.TagsEnum.Config'>"],"title":"TagsEnum","type":"string"}},"additionalProperties":true,"properties":{"tags":{"items":{"$ref":"#/$defs/TagsEnum"},"title":"Tags","type":"array"}},"required":["tags"],"title":"Problem","type":"object"},"name":"Problem","strict":true}},"stream":false}
srv  log_server_r: response: 
srv    operator(): client request thread ended
srv    operator(): http: streamed chunk: {"error":{"code":500,"message":"Failed to parse input at pos 0: math,physics","type":"server_error"}}
srv    operator(): http: stream ended

when running server with json schema. I bisected and the issue was introduced in 566059a.

You should be able to reproduce with information in #20178 (that issue is unrelated, but it has all the scripts I used when I run into this).

@ZUIcat

ZUIcat commented Mar 7, 2026

Hello, after merging this branch, I also encountered an error like Failed to parse input at pos. Please see below for the full details. I am using the Qwen3-Coder model. Do you need any further information?

srv    operator(): got exception: {"error":{"code":500,"message":"Failed to parse input at pos 0: <tool_call>\n<function=list_directory>\n<parameter=path>\nAssets\n</parameter>\n<parameter=recursive>\n</parameter>\n</function>\n</tool_call>","type":"server_error"}}

@fairydreaming
Collaborator

Investigation on #20193 (Failed to parse input at pos ...) brought me here, likely this PR is the culprit.

@stduhpf
Contributor

stduhpf commented Mar 7, 2026

Since this PR, I'm getting "No parser definition detected, assuming pure content parser." spammed in my terminal (with Qwen3.5 27B), for each generated token (at least with the server).

localai-bot pushed a commit to localai-bot/LocalAI that referenced this pull request Mar 7, 2026
…PR #18675)

This update brings the new autoparser architecture from llama.cpp PR #18675,
which completely refactors the chat template parsing and tool calling support.

Key changes in llama.cpp:
- Removed legacy parser files (chat-parser.cpp, chat-parser-xml-toolcall.cpp)
- Added new autoparser infrastructure (chat-auto-parser-*.cpp, chat-diff-analyzer.cpp, chat-peg-parser.cpp)
- Improved tool calling support with automatic template detection
- Better handling of reasoning/thinking content in model outputs

API compatibility:
- common_chat_templates_apply() - compatible
- common_chat_templates_support_enable_thinking() - compatible
- common_chat_templates_inputs - compatible
- common_chat_msg - compatible

Impact on LocalAI:
- No code changes required in grpc-server.cpp
- No changes to prepare.sh or CMakeLists.txt
- Users will benefit from improved tool calling without configuration changes
- Better support for models with reasoning/thinking capabilities

This is a transparent upgrade - existing configurations and tool calling
workflows continue to work, but with improved reliability and broader
model support.

Refs: ggml-org/llama.cpp#18675