Name and Version
version: 7836 (0c21677)
built with MSVC 19.50.35721.0 for Windows AMD64
Operating systems
Windows
Which llama.cpp modules do you know to be affected?
llama-server
Command line
llama-server --no-mmap --host ... --port ... --jinja --temp 0.7 --top-p 1.0 --min-p 0.01 -nkvo GLM-4.7-Flash-UD-Q8_K_XL.gguf
Problem description & steps to reproduce
Running the GLM-4.7-Flash model (GLM-4.7-Flash-UD-Q8_K_XL.gguf) with -nkvo triggers a fatal error at ggml\src\ggml-cuda\fattn.cu:453 during the warmup run:
sched_reserve: reserve took 43.27 ms, sched copies = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
C:\B\llama.cpp\ggml\src\ggml-cuda\fattn.cu:453: fatal error
First Bad Commit
The bug is recent (it appeared within roughly the last day), but I haven't bisected it.
Relevant log output
-fa is not specified on the command line, so I assume it defaults to auto. I'm not sure why it's hitting the BEST_FATTN_KERNEL_NONE case:
    void ggml_cuda_flash_attn_ext(ggml_backend_cuda_context & ctx, ggml_tensor * dst) {
        ggml_cuda_set_device(ctx.device);
        switch (ggml_cuda_get_best_fattn_kernel(ggml_cuda_get_device(), dst)) {
            case BEST_FATTN_KERNEL_NONE:
                GGML_ABORT("fatal error");
            // ... (remaining cases not shown in this excerpt)