Name and Version
version: 9525 (ad1b88ca0)
built with MSVC 19.44.35221.0 for Windows AMD64
Operating systems
Windows
GGML backends
CPU, CUDA
Hardware
AMD Ryzen 9 5900X + NVIDIA GeForce RTX 3090 (24 GB)
Models
Qwen3.6-35B-A3B-UD-IQ4_XS.gguf
Problem description & steps to reproduce
llama-server crashes with an access violation on a chat completion request with Qwen3.6-35B-A3B. The model is partially offloaded to CPU.
llama-server -m Qwen3.6-35B-A3B-UD-IQ4_XS.gguf
Steps:
- Start the server with the command above.
- Send a chat request with a short prompt, e.g.
{"messages":[{"role":"user","content":"Hi"}]}. It completes normally.
- Send a chat request with a longer prompt (a few hundred tokens). The server crashes.
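The two requests in the steps above could be scripted roughly as follows. This is only a sketch for reproduction, not the exact client used for this report. It assumes the server is listening on its default address (127.0.0.1:8080) and is reached through the OpenAI-compatible /v1/chat/completions endpoint. The helper names (build_payload, make_long_prompt, send) are hypothetical, and the filler text is a rough stand-in for "a few hundred tokens", since word count only approximates token count.

```python
import json
import urllib.request

# Assumption: llama-server runs on its default host/port and serves the
# OpenAI-compatible chat endpoint; change this if the server was started
# with different options.
SERVER_URL = "http://127.0.0.1:8080/v1/chat/completions"


def build_payload(prompt: str) -> dict:
    """Return a chat-completion request body with a single user message."""
    return {"messages": [{"role": "user", "content": prompt}]}


def make_long_prompt(n_words: int = 400) -> str:
    """Return filler text of n_words words (hypothetical helper).

    The word count is only a rough proxy for the prompt's token count.
    """
    return " ".join(f"word{i}" for i in range(n_words))


def send(payload: dict, url: str = SERVER_URL, timeout: float = 120.0) -> dict:
    """POST the payload as JSON and return the decoded JSON response."""
    data = json.dumps(payload).encode("utf-8")
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

With the server running, send(build_payload("Hi")) corresponds to the short request, and send(build_payload(make_long_prompt())) corresponds to the longer one. If the server crashes on the second call, the client side would be expected to see a connection error rather than a JSON response.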
First Bad Commit
7acb4e8 - hparams : refactor hparams.n_layer (#24060)
The parent commit 3ecfb15 does not crash.
Relevant log output
Log: llama-server-lv4.log
The following is a stack trace from a custom crash handler I use on Windows.
Crash / stack trace
=== CRASH (unhandled exception) ===
Exception code: 0xC0000005
Exception address: 0x00007FFCD1A60BD3
Stack trace:
#0 0x00007ffcd1a60bd3 _NLG_Return2+0x5a3
#1 0x00007ffc506f34c6 ggml_vec_cpy_f32+0x56 (ggml/src/ggml-cpu/vec.h:119)
#2 0x00007ffc506e3f2b ggml_compute_forward_set_f32+0x2fb (ggml/src/ggml-cpu/ops.cpp:4599)
#3 0x00007ffc506399ad ggml_graph_compute_thread+0xdd (ggml/src/ggml-cpu/ggml-cpu.c:3062)
#4 0x00007ffcbe501801 vcomp_fork+0x2d1
#5 0x00007ffcbe5017c2 vcomp_fork+0x292
#6 0x00007ffcbe509041 vcomp_atomic_div_r8+0xb81
#7 0x00007ffcbe5016e1 vcomp_fork+0x1b1
#8 0x00007ffc5063974d ggml_graph_compute+0x19d (ggml/src/ggml-cpu/ggml-cpu.c:3333)
#9 0x00007ffc5063c23e ggml_backend_cpu_graph_compute+0xbe (ggml/src/ggml-cpu/ggml-cpu.cpp:191)
#10 0x00007ffc6dd96eeb ggml_backend_sched_compute_splits+0x58b (ggml/src/ggml-backend.cpp:1678)
#11 0x00007ffc2a7ece14 llama_context::graph_compute+0xa4 (src/llama-context.cpp:2334)
#12 0x00007ffc2a7f0ec6 llama_context::process_ubatch+0xf6 (src/llama-context.cpp:1317)
#13 0x00007ffc2a7ea47b llama_context::decode+0x68b (src/llama-context.cpp:1795)
#14 0x00007ffc2a7f4cfb llama_decode+0xb (src/llama-context.cpp:3933)
#15 0x00007ffbc9030d9a server_context_impl::update_slots+0x3d5a (tools/server/server-context.cpp:3186)
#16 0x00007ffbc90c94ed server_queue::start_loop+0x65d (tools/server/server-queue.cpp:166)
#17 0x00007ffbc8f09e0c llama_server+0x346c (tools/server/server.cpp:354)
#18 0x00007ff77cd32008 __scrt_common_main_seh+0x10c
#19 0x00007ffceb82e957 BaseThreadInitThunk+0x17
#20 0x00007ffceda8427c RtlUserThreadStart+0x2c
From minidump:
ExceptionCode: c0000005 (Access violation)
Attempt to write to address 000000205df9b0a0