Eval bug: llama-server crashes with Qwen3.6-35B-A3B #24223

@aldehir

Name and Version

version: 9525 (ad1b88ca0)
built with MSVC 19.44.35221.0 for Windows AMD64

Operating systems

Windows

GGML backends

CPU, CUDA

Hardware

AMD Ryzen 9 5900X + NVIDIA GeForce RTX 3090 (24 GB)

Models

Qwen3.6-35B-A3B-UD-IQ4_XS.gguf

Problem description & steps to reproduce

llama-server crashes with an access violation while processing a chat completion request with Qwen3.6-35B-A3B. The model is partially offloaded to CPU.

llama-server -m Qwen3.6-35B-A3B-UD-IQ4_XS.gguf

Steps:

  1. Start the server with the command above.
  2. Send a chat request with a short prompt, e.g. {"messages":[{"role":"user","content":"Hi"}]}. This completes normally.
  3. Send a chat request with a longer prompt (a few hundred tokens). The server crashes.
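
For anyone who wants to script steps 2 and 3, here is a rough Python sketch. It is not part of the original report and rests on a few assumptions: the server is listening on http://127.0.0.1:8080, it is reached through the OpenAI-compatible /v1/chat/completions endpoint, and the filler text produced by the hypothetical helper `make_long_prompt` is long enough to land in the "few hundred tokens" range (the exact count depends on the tokenizer, and the threshold that triggers the crash is not known).

```python
import json
import urllib.request

# Assumption: llama-server is reachable at this address. Change it if the
# server was started with a different host or port.
BASE_URL = "http://127.0.0.1:8080"


def build_chat_payload(prompt: str) -> dict:
    """Build a request body with the same shape as the one in step 2."""
    return {"messages": [{"role": "user", "content": prompt}]}


def make_long_prompt(n_sentences: int = 40) -> str:
    """Hypothetical helper: numbered filler sentences, roughly a few hundred tokens."""
    sentence = "The quick brown fox jumps over the lazy dog."
    return " ".join(f"{i}. {sentence}" for i in range(1, n_sentences + 1))


def send_chat(payload: dict, base_url: str = BASE_URL, timeout: float = 120.0) -> dict:
    """POST the payload to the chat completions endpoint and decode the JSON reply."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def reproduce() -> None:
    """Run step 2 (short prompt), then step 3 (longer prompt)."""
    # Step 2: expected to complete normally.
    reply = send_chat(build_chat_payload("Hi"))
    print(reply["choices"][0]["message"]["content"])
    # Step 3: on an affected build the server process dies, so this call is
    # expected to fail with a connection error instead of returning.
    send_chat(build_chat_payload(make_long_prompt()))
```

Calling `reproduce()` with the server from step 1 running should print a short reply and then fail on the second request if the build is affected. Increase `n_sentences` if the longer prompt does not trigger the crash.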

First Bad Commit

7acb4e8 - hparams : refactor hparams.n_layer (#24060)

The parent commit 3ecfb15 does not crash.

Relevant log output

Log: llama-server-lv4.log

The following is a stack trace from a custom crash handler I use on Windows.

Crash / stack trace
=== CRASH (unhandled exception) ===
Exception code:    0xC0000005
Exception address: 0x00007FFCD1A60BD3

Stack trace:
  #0   0x00007ffcd1a60bd3 _NLG_Return2+0x5a3
  #1   0x00007ffc506f34c6 ggml_vec_cpy_f32+0x56 (ggml/src/ggml-cpu/vec.h:119)
  #2   0x00007ffc506e3f2b ggml_compute_forward_set_f32+0x2fb (ggml/src/ggml-cpu/ops.cpp:4599)
  #3   0x00007ffc506399ad ggml_graph_compute_thread+0xdd (ggml/src/ggml-cpu/ggml-cpu.c:3062)
  #4   0x00007ffcbe501801 vcomp_fork+0x2d1
  #5   0x00007ffcbe5017c2 vcomp_fork+0x292
  #6   0x00007ffcbe509041 vcomp_atomic_div_r8+0xb81
  #7   0x00007ffcbe5016e1 vcomp_fork+0x1b1
  #8   0x00007ffc5063974d ggml_graph_compute+0x19d (ggml/src/ggml-cpu/ggml-cpu.c:3333)
  #9   0x00007ffc5063c23e ggml_backend_cpu_graph_compute+0xbe (ggml/src/ggml-cpu/ggml-cpu.cpp:191)
  #10  0x00007ffc6dd96eeb ggml_backend_sched_compute_splits+0x58b (ggml/src/ggml-backend.cpp:1678)
  #11  0x00007ffc2a7ece14 llama_context::graph_compute+0xa4 (src/llama-context.cpp:2334)
  #12  0x00007ffc2a7f0ec6 llama_context::process_ubatch+0xf6 (src/llama-context.cpp:1317)
  #13  0x00007ffc2a7ea47b llama_context::decode+0x68b (src/llama-context.cpp:1795)
  #14  0x00007ffc2a7f4cfb llama_decode+0xb (src/llama-context.cpp:3933)
  #15  0x00007ffbc9030d9a server_context_impl::update_slots+0x3d5a (tools/server/server-context.cpp:3186)
  #16  0x00007ffbc90c94ed server_queue::start_loop+0x65d (tools/server/server-queue.cpp:166)
  #17  0x00007ffbc8f09e0c llama_server+0x346c (tools/server/server.cpp:354)
  #18  0x00007ff77cd32008 __scrt_common_main_seh+0x10c
  #19  0x00007ffceb82e957 BaseThreadInitThunk+0x17
  #20  0x00007ffceda8427c RtlUserThreadStart+0x2c

From minidump:

ExceptionCode: c0000005 (Access violation)
Attempt to write to address 000000205df9b0a0
