Name and Version
llama-cli --version
version: 8744 (d7ff074)
built with Clang 19.1.5 for Windows x86_64
Operating systems
Windows
Which llama.cpp modules do you know to be affected?
llama-server
Command line
Problem description & steps to reproduce
Token generation speed drops from 98t/s to 70t/s with the new reasoning budget template changes in llama.cpp.
The last good build is: https://github.com/ggml-org/llama.cpp/releases/tag/b8742
The build where the performance degradation starts is: https://github.com/ggml-org/llama.cpp/releases/tag/b8744
I've been using the Windows x64 Vulkan build.
Verification:
I also tested the same build with a different template, passed via the --chat-template parameter, and performance goes back up to 98t/s. However, that is not a fix, because all templates are broken for me with any of these newer builds.
Performance on other models is fine.
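To put a number on the regression, here is a minimal sketch that turns the two throughput figures above into a relative slowdown and a per-token latency. The helper names are hypothetical and are not part of llama.cpp; the only inputs are the 98t/s and 70t/s measurements reported in this issue.

```python
def slowdown_percent(baseline_tps: float, regressed_tps: float) -> float:
    """Relative throughput drop, as a percentage of the baseline."""
    if baseline_tps <= 0:
        raise ValueError("baseline_tps must be positive")
    return (baseline_tps - regressed_tps) / baseline_tps * 100.0


def per_token_ms(tps: float) -> float:
    """Average time spent per generated token, in milliseconds."""
    if tps <= 0:
        raise ValueError("tps must be positive")
    return 1000.0 / tps


if __name__ == "__main__":
    good, bad = 98.0, 70.0  # t/s on b8742 and b8744, as reported above
    print(f"slowdown: {slowdown_percent(good, bad):.1f}%")
    print(f"extra time per token: {per_token_ms(bad) - per_token_ms(good):.2f} ms")
```

With these figures the drop works out to roughly 29% of the baseline throughput.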
What should happen:
- You should be able to use the Windows x64 Vulkan build (or other builds, if those are affected too) and the default template should not unnecessarily degrade performance.
- Alternative templates shouldn't cause Gemma 4 to spit out its own template in responses. I didn't identify which commit started that issue, but if it weren't a problem, switching templates would at least have been a temporary way to avoid the performance issue.
First Bad Commit
d7ff074
Relevant log output
Logs