Name and Version
$ ./llama-server --version
version: 7855 (2eee6c8)
built with GNU 12.4.0 for Linux x86_64
Build flags: -DGGML_NATIVE=ON -DGGML_LTO=ON -DGGML_CUDA=ON -DGGML_CUDA_F16=ON -DGGML_CUDA_FA_ALL_QUANTS=ON
Operating systems
Linux
GGML backends
CPU, CUDA
Hardware
Ryzen 5950x + 4060ti (16GB)
Models
GLM 4.6 (https://huggingface.co/unsloth/GLM-4.6-GGUF/tree/main/UD-Q5_K_XL)
Problem description & steps to reproduce
Since #19105 was merged, GLM 4.6 prefill is computed on the CPU when -nkvo (--no-kv-offload) is specified. I launch the server with these flags:
llama-server --parallel 1 --no-direct-io --mmap --port "${PORT}" -fa on -b 4096 -ub 4096 -m "..../GLM-4.6-UD-Q5_K_XL-00001-of-00006.gguf" --threads 16 --temp 0.0 --cache-type-k f16 --cache-type-v q8_0 -c 96000 -ngl 99 --no-kv-offload --n-cpu-moe 99
The new behavior does reduce VRAM usage as intended (it saves ~750 MB of VRAM with my ~32 GB KV cache), but prefill is significantly slower than it was before the change.
I am looking for a way to keep the performance-critical computation on the GPU (restoring the previous speed) while still keeping the bulk of the KV cache on the CPU. Is there a way to achieve this with existing flags, or could the offloading logic be adjusted to allow this?
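For reference, the size of the KV cache that --no-kv-offload keeps in system RAM can be estimated from the cache types and context size in the command above. This is only a rough sketch, assuming a standard attention layout in which each layer stores n_kv_head * head_dim elements per token for K and the same for V. The function name and the example dimensions are hypothetical placeholders, not GLM 4.6's actual values, and the real llama.cpp allocation may differ (padding, extra buffers).

```python
# Bytes per element for the cache types used in the command above.
# f16 stores 2 bytes per element; q8_0 packs 32 int8 values plus one
# f16 scale into a 34-byte block, i.e. 34/32 bytes per element.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34.0 / 32.0}


def kv_cache_bytes(n_ctx, n_layer, n_kv_head, head_dim,
                   type_k="f16", type_v="q8_0"):
    """Estimate KV cache size in bytes (hypothetical helper, not a llama.cpp API)."""
    elems_per_token_per_layer = n_kv_head * head_dim
    k_bytes = elems_per_token_per_layer * BYTES_PER_ELEMENT[type_k]
    v_bytes = elems_per_token_per_layer * BYTES_PER_ELEMENT[type_v]
    return int(n_ctx * n_layer * (k_bytes + v_bytes))


if __name__ == "__main__":
    # Placeholder dimensions for illustration only; read the real values
    # from the model's GGUF metadata before relying on the result.
    size = kv_cache_bytes(n_ctx=96000, n_layer=32, n_kv_head=8, head_dim=128)
    print(f"estimated KV cache: {size / 2**30:.2f} GiB")
```

Comparing this estimate with the observed VRAM delta helps show how much of the cache actually moved off the GPU versus how much compute moved with it.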
First Bad Commit
8f80d1b
Relevant log output