
Misc. bug: Gemma 4 template changes causing degraded inference speed #21784

@CMay

Name and Version

llama-cli --version
version: 8744 (d7ff074)
built with Clang 19.1.5 for Windows x86_64

Operating systems

Windows

Which llama.cpp modules do you know to be affected?

llama-server

Command line

Problem description & steps to reproduce

Token generation speed for Gemma 4 drops from 98 t/s to 70 t/s after the new reasoning budget template changes in llama.cpp.

The last good build is: https://github.com/ggml-org/llama.cpp/releases/tag/b8742

The build where the performance degradation starts is: https://github.com/ggml-org/llama.cpp/releases/tag/b8744

I've been using the Windows x64 Vulkan build.
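To put the regression in relative terms, here is a minimal sketch that computes the throughput loss from the two figures reported above. The function name `slowdown_percent` is only illustrative and is not part of llama.cpp.

```python
def slowdown_percent(baseline_tps: float, regressed_tps: float) -> float:
    """Return the relative throughput loss as a percentage of the baseline."""
    if baseline_tps <= 0:
        raise ValueError("baseline_tps must be positive")
    return (baseline_tps - regressed_tps) / baseline_tps * 100.0


# Figures reported in this issue: b8742 (98 t/s) vs. b8744 (70 t/s).
print(f"{slowdown_percent(98.0, 70.0):.1f}% slower")  # prints "28.6% slower"
```

So the default template on b8744 costs a bit more than a quarter of the generation speed seen on b8742.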

Verification:

I also tested the same build with the --chat-template parameter to select a different template, and performance goes back up to 98 t/s. However, that is not a fix, because all templates are broken for me on any of these newer builds.

Performance on other models is fine.
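For anyone who wants to repeat this comparison, the sketch below shows one way to turn per-run timing data into tokens per second and flag runs that fall behind a baseline. It assumes each run yields a timings dictionary with `predicted_n` (generated tokens) and `predicted_ms` (generation time in milliseconds), which is how I understand llama-server reports timings in its responses; treat those field names as an assumption and adjust them to whatever your build returns. The sample values are hypothetical, chosen only to match the rates quoted in this issue, and the helper names are illustrative.

```python
def tokens_per_second(timings: dict) -> float:
    """Compute generation throughput from a timings dictionary.

    Assumes 'predicted_n' (token count) and 'predicted_ms' (milliseconds).
    """
    elapsed_ms = timings["predicted_ms"]
    if elapsed_ms <= 0:
        raise ValueError("predicted_ms must be positive")
    return timings["predicted_n"] / (elapsed_ms / 1000.0)


def compare_runs(runs: dict, baseline: str, tolerance: float = 0.05) -> dict:
    """Map each run label to True if it is within `tolerance` of the baseline speed."""
    floor = tokens_per_second(runs[baseline]) * (1.0 - tolerance)
    return {label: tokens_per_second(t) >= floor for label, t in runs.items()}


# Hypothetical timings that reproduce the rates quoted in this issue.
runs = {
    "b8742 default template": {"predicted_n": 980, "predicted_ms": 10000.0},
    "b8744 default template": {"predicted_n": 700, "predicted_ms": 10000.0},
    "b8744 --chat-template override": {"predicted_n": 980, "predicted_ms": 10000.0},
}

for label, ok in compare_runs(runs, baseline="b8742 default template").items():
    print(f"{label}: {'ok' if ok else 'regressed'}")
```

With these sample numbers, only the b8744 run with the default template is flagged, which mirrors what I observed.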

What should happen:

  • You should be able to use the Windows x64 Vulkan build (or other builds, if those are affected too) and the default template should not unnecessarily degrade performance.

  • Alternative templates shouldn't cause Gemma 4 to spit out its own template in responses. I didn't identify which commit started that issue, but if it weren't a problem, switching templates would at least have been an option to avoid the performance issue temporarily.

First Bad Commit

d7ff074

Relevant log output

