Name and Version
llama-cli --version
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 24124 MiB):
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
version: 8647 (b069b10)
built with GNU 15.1.0 for Linux x86_64
Operating systems
Linux
GGML backends
CUDA
Hardware
Intel i7 12th gen + RTX3090
Models
Gemma 4 31B IT
Problem description & steps to reproduce
From testing, it appears that Gemma 4's "Final Logit Softcapping" might not be applied during inference. This probably results in overly confident predictions and insensitivity to temperature.
How to observe how final logit softcapping works:
- Download the HF version of Gemma 4 from Google
- Edit config.json: change final_logit_softcapping from the default value of 30.0 to a low value like 20.0 or 15.0
- Perform inference via Transformers
- Observe how the model quickly becomes incoherent, as if temperature was very high.
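For reference, a minimal sketch of why lowering the cap acts like raising the temperature. It assumes the softcapping form used by the Gemma family in Transformers, `cap * tanh(logits / cap)`, and uses made-up logit values chosen only for illustration; it is not taken from either codebase.

```python
import math

def softcap(logits, cap):
    # Squash each logit smoothly into the open interval (-cap, cap).
    return [cap * math.tanh(x / cap) for x in logits]

def softmax(logits):
    # Numerically stable softmax over a plain list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw logits for five candidate tokens (illustrative only).
raw_logits = [40.0, 25.0, 10.0, 0.0, -10.0]

for cap in (30.0, 20.0, 15.0):
    probs = softmax(softcap(raw_logits, cap))
    print(f"cap={cap:>4}: top-1 probability = {probs[0]:.4f}")
```

A smaller cap compresses the gaps between the largest logits, so the distribution flattens and sampling becomes noisier. If the cap is applied at inference, changing it should therefore visibly change the outputs.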
How to observe that the same setting doesn't appear to have any effect in llama.cpp:
- Try to override it in llama-server CLI settings, e.g. --override-kv gemma4.final_logit_softcapping=float:15.0 ⇒ no change in outputs
- Try to override it by setting the corresponding key in the GGUF file to a low value, e.g. with this code ⇒ no change
- Try to edit the key in the original HF config.json file, then convert to GGUF ⇒ no change
Conclusion: llama.cpp doesn't properly implement Gemma's final logit softcapping.
This issue might affect Gemma 2 and 3 as well (which also use logit softcapping), although I haven't tested this in depth.
First Bad Commit
No response
Relevant log output
No relevant outputs to report here. With a very low final_logit_softcapping value, generally speaking, model outputs should be completely incoherent.