Name and Version
llama-cli --version
ggml_cuda_init: found 3 CUDA devices (Total VRAM: 26622 MiB):
Device 0: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes, VRAM: 10239 MiB
Device 1: NVIDIA GeForce RTX 3070, compute capability 8.6, VMM: yes, VRAM: 8191 MiB
Device 2: NVIDIA GeForce RTX 3060 Ti, compute capability 8.6, VMM: yes, VRAM: 8191 MiB
load_backend: loaded CUDA backend from D:\Programs_Portable\llama-bad\ggml-cuda.dll
load_backend: loaded RPC backend from D:\Programs_Portable\llama-bad\ggml-rpc.dll
load_backend: loaded CPU backend from D:\Programs_Portable\llama-bad\ggml-cpu-haswell.dll
version: 8702 (c5ce4bc)
built with Clang 19.1.5 for Windows x86_64
Operating systems
Windows
GGML backends
CUDA
Hardware
Xeon E5-2696 v3 + 128 GB RAM
RTX 3080 10GB at PCIe 3.0 x16
RTX 3070 8GB at PCIe 3.0 x8
RTX 3060 Ti 8GB at PCIe 3.0 x8
Models
https://huggingface.co/lmstudio-community/gemma-4-E4B-it-GGUF/blob/main/gemma-4-E4B-it-Q4_K_M.gguf
https://huggingface.co/unsloth/gemma-4-E4B-it-GGUF/blob/main/gemma-4-E4B-it-UD-Q6_K_XL.gguf
https://huggingface.co/unsloth/gemma-4-E4B-it-GGUF/blob/main/gemma-4-E4B-it-UD-Q8_K_XL.gguf
https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF/blob/main/gemma-4-26B-A4B-it-UD-Q5_K_S.gguf
https://huggingface.co/unsloth/gemma-4-31B-it-GGUF/blob/main/gemma-4-31B-it-UD-Q4_K_XL.gguf
Problem description & steps to reproduce
With Gemma4 models and -nkvo enabled, the output becomes garbled after roughly 230 tokens. The regression first appears in release b8702 ("CUDA: make cuda graphs props check faster").
I tested both the CUDA 12 and CUDA 13 builds.
I also checked several Gemma4 models and quants: mainly the unsloth ones I normally use, plus an lmstudio-community version uploaded 8 days ago, to rule out the possibility that the unsloth model updates from April 9 introduced the regression.
I also tested the latest build, b8744, with the chat template that Google updated on April 9 to check whether this fixes the issue; it does not (logs below).
https://huggingface.co/google/gemma-4-31B-it/blob/main/chat_template.jinja
First Bad Commit
c5ce4bc
Relevant log output
I used llama-completion to reproduce the issue and generated output files that compare the versions I tested. The command lines I ran are listed below; the corresponding output files are attached.
Working run with b8701:
llama-completion -m D:/lmstudio_models/unsloth/gemma-4-E4B-it-GGUF/gemma-4-E4B-it-UD-Q8_K_XL.gguf --temp 1 --top-k 64 -np 1 --device cuda0 --jinja -ngl all -nkvo -fit off -c 32768 -p "How to learn spanish very fast ?" -n 600 -no-cnv > b8701_ok.txt 2>&1
b8701_ok.txt
First release with the issue, b8702:
D:\Programs_Portable\llama-bad\llama-completion -m D:/lmstudio_models/unsloth/gemma-4-E4B-it-GGUF/gemma-4-E4B-it-UD-Q8_K_XL.gguf --temp 1 --top-k 64 -np 1 --device cuda0 --jinja -ngl all -nkvo -fit off -c 32768 -p "How to learn spanish very fast ?" -n 600 -no-cnv > b8702_bad.txt 2>&1
b8702_bad.txt
Same release (b8702); the issue does not occur with -kvo instead of -nkvo:
D:\Programs_Portable\llama-bad\llama-completion -m D:/lmstudio_models/unsloth/gemma-4-E4B-it-GGUF/gemma-4-E4B-it-UD-Q8_K_XL.gguf --temp 1 --top-k 64 -np 1 --device cuda0 --jinja -ngl all -kvo -fit off -c 32768 -p "How to learn spanish very fast ?" -n 600 -no-cnv > b8702_kvo_ok.txt 2>&1
b8702_kvo_ok.txt
Latest release at the time of writing (b8744), which still has the issue:
D:\Programs_Portable\llama-bad-44\llama-completion -m D:/lmstudio_models/unsloth/gemma-4-E4B-it-GGUF/gemma-4-E4B-it-UD-Q8_K_XL.gguf --temp 1 --top-k 64 -np 1 --device cuda0 --jinja -ngl all -nkvo -fit off -c 32768 -p "How to learn spanish very fast ?" -n 600 -no-cnv > b8744_bad.txt 2>&1
b8744_bad.txt
Same build (b8744) with the latest chat template file from Google, in case it fixed the issue; it still reproduces:
D:\Programs_Portable\llama-bad-44\llama-completion -m D:/lmstudio_models/unsloth/gemma-4-E4B-it-GGUF/gemma-4-E4B-it-UD-Q8_K_XL.gguf --temp 1 --top-k 64 -np 1 --device cuda0 --jinja -ngl all -nkvo -fit off -c 32768 -p "How to learn spanish very fast ?" -n 600 -no-cnv --chat-template-file D:/lmstudio_models/original_gemma4_template.jinja > b8744_latest_template_bad.txt 2>&1
b8744_latest_template_bad.txt
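For anyone who wants to repeat the comparison, here is a minimal Python sketch that rebuilds the first four command lines above and redirects each run to its output file. It uses only the flags, paths, and file names already shown in this report; the helper names `build_command` and `run_case` and the `CASES` table are hypothetical, and the build directories are the ones on my machine, so adjust them to your own layout.

```python
import subprocess

# Model and prompt exactly as used in the commands of this report.
MODEL = "D:/lmstudio_models/unsloth/gemma-4-E4B-it-GGUF/gemma-4-E4B-it-UD-Q8_K_XL.gguf"
PROMPT = "How to learn spanish very fast ?"

# (executable, KV-offload flag, output file) for the first four runs above.
# The paths are local to the reporter's machine (assumption: change as needed).
CASES = [
    ("llama-completion", "-nkvo", "b8701_ok.txt"),
    (r"D:\Programs_Portable\llama-bad\llama-completion", "-nkvo", "b8702_bad.txt"),
    (r"D:\Programs_Portable\llama-bad\llama-completion", "-kvo", "b8702_kvo_ok.txt"),
    (r"D:\Programs_Portable\llama-bad-44\llama-completion", "-nkvo", "b8744_bad.txt"),
]


def build_command(exe, kv_flag, template=None):
    """Return the argument list for one run; only kv_flag and template vary."""
    cmd = [
        exe, "-m", MODEL, "--temp", "1", "--top-k", "64", "-np", "1",
        "--device", "cuda0", "--jinja", "-ngl", "all", kv_flag,
        "-fit", "off", "-c", "32768", "-p", PROMPT, "-n", "600", "-no-cnv",
    ]
    if template is not None:
        cmd += ["--chat-template-file", template]
    return cmd


def run_case(exe, kv_flag, out_file, template=None):
    """Run one case, sending stdout and stderr to out_file (like '> file 2>&1')."""
    with open(out_file, "wb") as fh:
        result = subprocess.run(
            build_command(exe, kv_flag, template),
            stdout=fh,
            stderr=subprocess.STDOUT,
        )
    return result.returncode


if __name__ == "__main__":
    # Dry run: print the commands only. Call run_case(*case) to execute them.
    for exe, kv_flag, out_file in CASES:
        print(" ".join(build_command(exe, kv_flag)), ">", out_file)
```

The chat template run can be reproduced the same way by passing `template="D:/lmstudio_models/original_gemma4_template.jinja"` to `run_case` with the b8744 executable.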