Eval bug: Performance regression: GLM 4.6 prefill with -nkvo #19158

### Name and Version

$ ./llama-server --version
version: 7855 (2eee6c866)
built with GNU 12.4.0 for Linux x86_64

Build flags: `-DGGML_NATIVE=ON -DGGML_LTO=ON -DGGML_CUDA=ON -DGGML_CUDA_F16=ON -DGGML_CUDA_FA_ALL_QUANTS=ON`

### Operating systems

Linux

### GGML backends

CPU, CUDA

### Hardware

Ryzen 9 5950X + RTX 4060 Ti (16GB)

### Models

GLM 4.6 (https://huggingface.co/unsloth/GLM-4.6-GGUF/tree/main/UD-Q5_K_XL)

### Problem description & steps to reproduce

I noticed that after the merge of #19105, GLM 4.6 prefill is now computed on the CPU when `-nkvo` (`--no-kv-offload`) is specified with these flags:

`llama-server --parallel 1 --no-direct-io --mmap --port "${PORT}" -fa on -b 4096 -ub 4096 -m "..../GLM-4.6-UD-Q5_K_XL-00001-of-00006.gguf" --threads 16 --temp 0.0 --cache-type-k f16 --cache-type-v q8_0 -c 96000 -ngl 99 --no-kv-offload --n-cpu-moe 99`

While this offloading correctly reduces VRAM usage (saving ~750MB VRAM on my ~32GB KV setup), it causes a significant performance regression in prefill speed compared to the previous behavior.

I am looking for a way to keep the performance-critical computation on the GPU (restoring the previous speed) while still keeping the bulk of the KV cache on the CPU. Is there a way to achieve this with existing flags, or could the offloading logic be adjusted to allow this?
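For anyone who wants to sanity-check the KV cache footprint for a given `-c`, `--cache-type-k` and `--cache-type-v` combination, here is a rough sketch of the arithmetic. It assumes a plain per-layer K and V cache of `n_ctx * n_head_kv * head_dim` elements each, with f16 at 2 bytes per element and q8_0 at 34 bytes per block of 32 elements. The function names are hypothetical, and the model shape values in the usage part are placeholders that should be replaced with the values from the GGUF metadata; this is not how llama.cpp itself reports the size.

```python
# Rough KV cache size estimate (a sketch, not llama.cpp's own accounting).
# (bytes per block, elements per block) for the two cache types used in this issue.
GGML_TYPE_LAYOUT = {
    "f16": (2, 1),     # 2 bytes per element
    "q8_0": (34, 32),  # 32 int8 values plus one fp16 scale per block
}


def tensor_bytes(n_elements, ggml_type):
    """Size in bytes of n_elements stored as ggml_type."""
    block_bytes, block_elems = GGML_TYPE_LAYOUT[ggml_type]
    if n_elements % block_elems != 0:
        raise ValueError("element count must be a multiple of the block size")
    return n_elements // block_elems * block_bytes


def kv_cache_bytes(n_ctx, n_layer, n_head_kv, head_dim, type_k, type_v):
    """Total bytes for the K and V caches across all layers."""
    per_side = n_ctx * n_layer * n_head_kv * head_dim
    return tensor_bytes(per_side, type_k) + tensor_bytes(per_side, type_v)


if __name__ == "__main__":
    # Placeholder model shape (assumption): read the real values from the GGUF metadata.
    total = kv_cache_bytes(
        n_ctx=96000, n_layer=92, n_head_kv=8, head_dim=128,
        type_k="f16", type_v="q8_0",
    )
    print(f"estimated KV cache: {total / 2**30:.1f} GiB")
```

Comparing this estimate with the VRAM and host memory numbers in the server log should make it clear how much of the cache actually ends up on each side.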

### First Bad Commit

8f80d1b

### Relevant log output

`
`
