Eval bug: Gemma4 models produce gibberish at some point with -nkvo #21726

@charnet3d

Name and Version

llama-cli --version
ggml_cuda_init: found 3 CUDA devices (Total VRAM: 26622 MiB):
Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes, VRAM: 10239 MiB
Device 1: NVIDIA GeForce RTX 3070, compute capability 8.6, VMM: yes, VRAM: 8191 MiB
Device 2: NVIDIA GeForce RTX 3060 Ti, compute capability 8.6, VMM: yes, VRAM: 8191 MiB
load_backend: loaded CUDA backend from D:\Programs_Portable\llama-bad\ggml-cuda.dll
load_backend: loaded RPC backend from D:\Programs_Portable\llama-bad\ggml-rpc.dll
load_backend: loaded CPU backend from D:\Programs_Portable\llama-bad\ggml-cpu-haswell.dll
version: 8702 (c5ce4bc)
built with Clang 19.1.5 for Windows x86_64

Operating systems

Windows

GGML backends

CUDA

Hardware

Xeon E5-2696 v3 + 128 GB RAM
RTX 3080 10GB at PCIe 3.0 x16
RTX 3070 8GB at PCIe 3.0 x8
RTX 3060 Ti 8GB at PCIe 3.0 x8

Models

https://huggingface.co/lmstudio-community/gemma-4-E4B-it-GGUF/blob/main/gemma-4-E4B-it-Q4_K_M.gguf
https://huggingface.co/unsloth/gemma-4-E4B-it-GGUF/blob/main/gemma-4-E4B-it-UD-Q6_K_XL.gguf
https://huggingface.co/unsloth/gemma-4-E4B-it-GGUF/blob/main/gemma-4-E4B-it-UD-Q8_K_XL.gguf
https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF/blob/main/gemma-4-26B-A4B-it-UD-Q5_K_S.gguf
https://huggingface.co/unsloth/gemma-4-31B-it-GGUF/blob/main/gemma-4-31B-it-UD-Q4_K_XL.gguf

Problem description & steps to reproduce

With Gemma4 models and -nkvo enabled, I get garbled output after a certain number of tokens (around 230). This started with release b8702 ("CUDA: make cuda graphs props check faster").
I tested both the CUDA 12 and CUDA 13 builds.
I also checked multiple versions and quants of Gemma4: mainly the unsloth ones I use, plus a lmstudio-community version from 8 days ago (to rule out the unsloth model updates of April 9 as the source of the regression).

I also tested the chat template that Google updated on April 9, together with the latest build b8744, to check whether this solves the issue. It doesn't (logs below).
https://huggingface.co/google/gemma-4-31B-it/blob/main/chat_template.jinja

First Bad Commit

c5ce4bc
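
For anyone who wants to re-check the first bad release, the search over release builds can be sketched as a binary search. This is a minimal sketch, not the procedure used for this report: `first_bad` and the `is_bad` predicate are hypothetical names, and `is_bad` is assumed to run the reproduction command for a given build and decide whether the output is garbled. Since the runs use `--temp 1`, outputs differ between runs, so the predicate has to judge garbling rather than compare text exactly.

```python
from typing import Callable, Sequence


def first_bad(builds: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest build for which is_bad() is true.

    Assumptions (hypothetical helper, not part of llama.cpp):
    - builds are listed in release order,
    - builds[0] is known good and builds[-1] is known bad,
    - once a build is bad, every later build is bad as well.
    """
    lo, hi = 0, len(builds) - 1  # invariant: builds[lo] is good, builds[hi] is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(builds[mid]):
            hi = mid  # the regression is at mid or earlier
        else:
            lo = mid  # the regression is after mid
    return builds[hi]


# Stand-in predicate for illustration only: it encodes the results reported
# in this issue (b8701 OK, b8702 and later bad) instead of running anything.
def reported_result(build: str) -> bool:
    return int(build[1:]) >= 8702
```

With the stand-in predicate, searching the range b8690 to b8744 needs about six checks instead of one per release.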

Relevant log output

I used llama-completion to reproduce and generated a set of output files comparing the versions I tested. The command lines I ran are below; the corresponding output files are attached.

This is the OK run using b8701:
llama-completion -m D:/lmstudio_models/unsloth/gemma-4-E4B-it-GGUF/gemma-4-E4B-it-UD-Q8_K_XL.gguf --temp 1 --top-k 64 -np 1 --device cuda0 --jinja -ngl all -nkvo -fit off -c 32768 -p "How to learn spanish very fast ?" -n 600 -no-cnv > b8701_ok.txt 2>&1

b8701_ok.txt

This is b8702, the first release with the issue:
D:\Programs_Portable\llama-bad\llama-completion -m D:/lmstudio_models/unsloth/gemma-4-E4B-it-GGUF/gemma-4-E4B-it-UD-Q8_K_XL.gguf --temp 1 --top-k 64 -np 1 --device cuda0 --jinja -ngl all -nkvo -fit off -c 32768 -p "How to learn spanish very fast ?" -n 600 -no-cnv > b8702_bad.txt 2>&1

b8702_bad.txt

This is the same release (b8702); it works with -kvo instead of -nkvo:
D:\Programs_Portable\llama-bad\llama-completion -m D:/lmstudio_models/unsloth/gemma-4-E4B-it-GGUF/gemma-4-E4B-it-UD-Q8_K_XL.gguf --temp 1 --top-k 64 -np 1 --device cuda0 --jinja -ngl all -kvo -fit off -c 32768 -p "How to learn spanish very fast ?" -n 600 -no-cnv > b8702_kvo_ok.txt 2>&1

b8702_kvo_ok.txt

This is b8744, the latest release at the time of writing, which still has the issue:
D:\Programs_Portable\llama-bad-44\llama-completion -m D:/lmstudio_models/unsloth/gemma-4-E4B-it-GGUF/gemma-4-E4B-it-UD-Q8_K_XL.gguf --temp 1 --top-k 64 -np 1 --device cuda0 --jinja -ngl all -nkvo -fit off -c 32768 -p "How to learn spanish very fast ?" -n 600 -no-cnv > b8744_bad.txt 2>&1

b8744_bad.txt

Here I tested the latest chat template file that Google updated, in case it fixed the issue, but the issue still reproduces:
D:\Programs_Portable\llama-bad-44\llama-completion -m D:/lmstudio_models/unsloth/gemma-4-E4B-it-GGUF/gemma-4-E4B-it-UD-Q8_K_XL.gguf --temp 1 --top-k 64 -np 1 --device cuda0 --jinja -ngl all -nkvo -fit off -c 32768 -p "How to learn spanish very fast ?" -n 600 -no-cnv --chat-template-file D:/lmstudio_models/original_gemma4_template.jinja > b8744_latest_template_bad.txt 2>&1

b8744_latest_template_bad.txt
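
The five runs above differ only in the binary path, the -nkvo/-kvo switch, and the optional chat template file, so the matrix can be scripted. The following is a rough sketch under that assumption: `build_command` and `run_case` are hypothetical helper names, and every flag is copied from the commands above rather than introduced here.

```python
import subprocess
from typing import List, Optional

# Flags copied verbatim from the commands in this report.
COMMON_FLAGS = ["--temp", "1", "--top-k", "64", "-np", "1",
                "--device", "cuda0", "--jinja", "-ngl", "all"]
TAIL_FLAGS = ["-fit", "off", "-c", "32768",
              "-p", "How to learn spanish very fast ?",
              "-n", "600", "-no-cnv"]


def build_command(binary: str, model: str, kv_offload: bool = False,
                  chat_template_file: Optional[str] = None) -> List[str]:
    """Assemble one llama-completion command line as an argument list.

    kv_offload=False selects -nkvo (the failing configuration),
    kv_offload=True selects -kvo (the working one).
    """
    cmd = [binary, "-m", model, *COMMON_FLAGS]
    cmd.append("-kvo" if kv_offload else "-nkvo")
    cmd.extend(TAIL_FLAGS)
    if chat_template_file is not None:
        cmd.extend(["--chat-template-file", chat_template_file])
    return cmd


def run_case(cmd: List[str], log_path: str) -> int:
    """Run one case and write stdout and stderr to log_path (like `> file 2>&1`)."""
    with open(log_path, "w", encoding="utf-8", errors="replace") as log:
        return subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT).returncode
```

For example, `run_case(build_command(r"D:\Programs_Portable\llama-bad\llama-completion", model_path), "b8702_bad.txt")` would correspond to the b8702 run, where `model_path` stands for the model file used above.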

Labels: bug
