Eval bug: Performance regression: GLM 4.6 prefill with -nkvo #19158

@LifesLight

Name and Version

$ ./llama-server --version
version: 7855 (2eee6c8)
built with GNU 12.4.0 for Linux x86_64

Build flags: -DGGML_NATIVE=ON -DGGML_LTO=ON -DGGML_CUDA=ON -DGGML_CUDA_F16=ON -DGGML_CUDA_FA_ALL_QUANTS=ON

Operating systems

Linux

GGML backends

CPU, CUDA

Hardware

Ryzen 9 5950X + RTX 4060 Ti (16 GB)

Models

GLM 4.6 (https://huggingface.co/unsloth/GLM-4.6-GGUF/tree/main/UD-Q5_K_XL)

Problem description & steps to reproduce

After #19105 was merged, the GLM 4.6 prefill is computed on the CPU when -nkvo (--no-kv-offload) is specified. I see this with the following command:

llama-server --parallel 1 --no-direct-io --mmap --port "${PORT}" -fa on -b 4096 -ub 4096 -m "..../GLM-4.6-UD-Q5_K_XL-00001-of-00006.gguf" --threads 16 --temp 0.0 --cache-type-k f16 --cache-type-v q8_0 -c 96000 -ngl 99 --no-kv-offload --n-cpu-moe 99

The new behavior does reduce VRAM usage as intended (it saves ~750 MB of VRAM on my ~32 GB KV setup), but prefill is significantly slower than it was before the change.
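For anyone who wants to sanity-check the KV cache size for a given configuration, here is a rough sketch of the arithmetic. The bytes-per-element values follow from the cache types in the command above (f16 is 2 bytes; q8_0 stores 32 int8 values plus one f16 scale in 34 bytes). The layer count, KV head count and head dimension passed at the bottom are assumed placeholder values, not figures taken from this report; read the real ones from the GGUF metadata. The helper name kv_cache_bytes is made up for this example and is not part of llama.cpp.

```python
# Bytes per stored element for the two cache types used in the command above.
# f16: 2 bytes. q8_0: blocks of 32 int8 values plus one f16 scale = 34 bytes per 32 elements.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, type_k, type_v):
    """Estimate the total K + V cache size in bytes (hypothetical helper)."""
    elems_per_token_per_layer = n_kv_heads * head_dim
    k_bytes = elems_per_token_per_layer * BYTES_PER_ELEMENT[type_k]
    v_bytes = elems_per_token_per_layer * BYTES_PER_ELEMENT[type_v]
    return int(n_layers * n_ctx * (k_bytes + v_bytes))


if __name__ == "__main__":
    # ASSUMED model dimensions, for illustration only; check the GGUF metadata.
    size = kv_cache_bytes(
        n_layers=92,
        n_kv_heads=8,
        head_dim=128,
        n_ctx=96000,          # matches -c 96000 from the command above
        type_k="f16",         # matches --cache-type-k f16
        type_v="q8_0",        # matches --cache-type-v q8_0
    )
    print(f"estimated KV cache: {size / 2**30:.1f} GiB")
```

This only counts the K and V tensors themselves. Compute buffers and any extra layers the runtime allocates are not included, so the number reported by llama-server at startup may differ.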

I am looking for a way to keep the performance-critical computation on the GPU (restoring the previous speed) while still keeping the bulk of the KV cache on the CPU. Is there a way to achieve this with existing flags, or could the offloading logic be adjusted to allow this?

First Bad Commit

8f80d1b

Relevant log output
