Eval bug: Gemma4 models produce gibberish at some point with -nkvo #21726

### Name and Version

llama-cli --version
ggml_cuda_init: found 3 CUDA devices (Total VRAM: 26622 MiB):
  Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes, VRAM: 10239 MiB
  Device 1: NVIDIA GeForce RTX 3070, compute capability 8.6, VMM: yes, VRAM: 8191 MiB
  Device 2: NVIDIA GeForce RTX 3060 Ti, compute capability 8.6, VMM: yes, VRAM: 8191 MiB
load_backend: loaded CUDA backend from D:\Programs_Portable\llama-bad\ggml-cuda.dll
load_backend: loaded RPC backend from D:\Programs_Portable\llama-bad\ggml-rpc.dll
load_backend: loaded CPU backend from D:\Programs_Portable\llama-bad\ggml-cpu-haswell.dll
version: 8702 (c5ce4bc22)
built with Clang 19.1.5 for Windows x86_64

### Operating systems

Windows

### GGML backends

CUDA

### Hardware

Xeon E5-2696 v3 + 128 GB RAM
RTX 3080 10GB at PCIe 3.0 x16
RTX 3070 8GB at PCIe 3.0 x8
RTX 3060 Ti 8GB at PCIe 3.0 x8

### Models

https://huggingface.co/lmstudio-community/gemma-4-E4B-it-GGUF/blob/main/gemma-4-E4B-it-Q4_K_M.gguf
https://huggingface.co/unsloth/gemma-4-E4B-it-GGUF/blob/main/gemma-4-E4B-it-UD-Q6_K_XL.gguf
https://huggingface.co/unsloth/gemma-4-E4B-it-GGUF/blob/main/gemma-4-E4B-it-UD-Q8_K_XL.gguf
https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF/blob/main/gemma-4-26B-A4B-it-UD-Q5_K_S.gguf
https://huggingface.co/unsloth/gemma-4-31B-it-GGUF/blob/main/gemma-4-31B-it-UD-Q4_K_XL.gguf

### Problem description & steps to reproduce

With Gemma4 models and -nkvo enabled, the output becomes garbled after a certain number of tokens (around 230). This started with release b8702, "CUDA: make cuda graphs props check faster".
I tested both the CUDA 12 and CUDA 13 builds.
I also checked multiple versions and quants of Gemma4: the main ones I use from unsloth, plus a version from lmstudio-community from 8 days ago (to make sure the models unsloth updated on April 9 didn't introduce the regression).

I also tested the latest build, b8744, with the chat template Google updated on April 9, to check whether that resolves the issue; it does not (logs below).
https://huggingface.co/google/gemma-4-31B-it/blob/main/chat_template.jinja

### First Bad Commit

c5ce4bc

### Relevant log output

I used llama-completion to reproduce the issue and generated one output file per version I tested. The command lines I ran are listed below, and the corresponding output files are attached:

<details>
<summary>This is the OK run using b8701</summary>

```console
llama-completion -m D:/lmstudio_models/unsloth/gemma-4-E4B-it-GGUF/gemma-4-E4B-it-UD-Q8_K_XL.gguf --temp 1 --top-k 64 -np 1 --device cuda0 --jinja -ngl all -nkvo -fit off -c 32768 -p "How to learn spanish very fast ?" -n 600 -no-cnv > b8701_ok.txt 2>&1
```

</details>

[b8701_ok.txt](https://github.com/user-attachments/files/26631574/b8701_ok.txt)

<details>
<summary>This is b8702, the first release that shows the issue</summary>

```console
D:\Programs_Portable\llama-bad\llama-completion -m D:/lmstudio_models/unsloth/gemma-4-E4B-it-GGUF/gemma-4-E4B-it-UD-Q8_K_XL.gguf --temp 1 --top-k 64 -np 1 --device cuda0 --jinja -ngl all -nkvo -fit off -c 32768 -p "How to learn spanish very fast ?" -n 600 -no-cnv > b8702_bad.txt 2>&1
```

</details>

[b8702_bad.txt](https://github.com/user-attachments/files/26631578/b8702_bad.txt)

<details>
<summary>This is the same release (b8702); it has the issue with -nkvo but works with -kvo</summary>

```console
D:\Programs_Portable\llama-bad\llama-completion -m D:/lmstudio_models/unsloth/gemma-4-E4B-it-GGUF/gemma-4-E4B-it-UD-Q8_K_XL.gguf --temp 1 --top-k 64 -np 1 --device cuda0 --jinja -ngl all -kvo -fit off -c 32768 -p "How to learn spanish very fast ?" -n 600 -no-cnv > b8702_kvo_ok.txt 2>&1
```

</details>

[b8702_kvo_ok.txt](https://github.com/user-attachments/files/26631575/b8702_kvo_ok.txt)

<details>
<summary>This is b8744, the latest release at the time of writing, which still has the issue</summary>

```console
D:\Programs_Portable\llama-bad-44\llama-completion -m D:/lmstudio_models/unsloth/gemma-4-E4B-it-GGUF/gemma-4-E4B-it-UD-Q8_K_XL.gguf --temp 1 --top-k 64 -np 1 --device cuda0 --jinja -ngl all -nkvo -fit off -c 32768 -p "How to learn spanish very fast ?" -n 600 -no-cnv > b8744_bad.txt 2>&1
```

</details>

[b8744_bad.txt](https://github.com/user-attachments/files/26631576/b8744_bad.txt)

<details>
<summary>Here I tested b8744 with the latest chat template file from Google, in case it fixed the issue; the problem still reproduces</summary>

```console
D:\Programs_Portable\llama-bad-44\llama-completion -m D:/lmstudio_models/unsloth/gemma-4-E4B-it-GGUF/gemma-4-E4B-it-UD-Q8_K_XL.gguf --temp 1 --top-k 64 -np 1 --device cuda0 --jinja -ngl all -nkvo -fit off -c 32768 -p "How to learn spanish very fast ?" -n 600 -no-cnv --chat-template-file D:/lmstudio_models/original_gemma4_template.jinja > b8744_latest_template_bad.txt 2>&1
```

</details>

[b8744_latest_template_bad.txt](https://github.com/user-attachments/files/26631577/b8744_latest_template_bad.txt)
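
For anyone who wants to re-run the same comparison, here is a minimal Python sketch that rebuilds the commands above as argument lists, so the only things that change between runs are the executable, the KV offload flag and, optionally, the template. The helper names `repro_command` and `run_to_file` are hypothetical. The optional `--seed` argument is my addition and was not part of the runs above; even with a fixed seed, output is not guaranteed to match token for token across builds.

```python
import subprocess

# Model path and prompt copied from the commands in this report.
MODEL = "D:/lmstudio_models/unsloth/gemma-4-E4B-it-GGUF/gemma-4-E4B-it-UD-Q8_K_XL.gguf"
PROMPT = "How to learn spanish very fast ?"


def repro_command(exe, kv_flag="-nkvo", seed=None, template=None):
    """Build the llama-completion argument list for one run.

    exe      -- path to the llama-completion binary of the build under test
    kv_flag  -- "-nkvo" (shows the issue) or "-kvo" (works)
    seed     -- optional; NOT used in the original runs (assumption)
    template -- optional path passed to --chat-template-file
    """
    cmd = [
        exe, "-m", MODEL,
        "--temp", "1", "--top-k", "64", "-np", "1",
        "--device", "cuda0", "--jinja", "-ngl", "all",
        kv_flag, "-fit", "off", "-c", "32768",
        "-p", PROMPT, "-n", "600", "-no-cnv",
    ]
    if seed is not None:
        cmd += ["--seed", str(seed)]
    if template is not None:
        cmd += ["--chat-template-file", template]
    return cmd


def run_to_file(cmd, out_path):
    """Run one command, writing stdout and stderr to out_path (like `> file 2>&1`)."""
    with open(out_path, "w", encoding="utf-8", errors="replace") as f:
        return subprocess.run(cmd, stdout=f, stderr=subprocess.STDOUT).returncode
```

For example, `run_to_file(repro_command(r"D:\Programs_Portable\llama-bad\llama-completion"), "b8702_bad.txt")` corresponds to the b8702 run above, and passing `kv_flag="-kvo"` gives the working variant.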

