
Eval bug: Vulkan throws vk::DeviceLostError on Qwen3.5 35B A3B #20462

@itterative

Description


Name and Version

$ llama-cli --version
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: KHR_coopmat
ggml_vulkan: 1 = Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: KHR_coopmat
version: 8322 (557fe2d)
built with GNU 13.3.0 for Linux x86_64

Operating systems

Linux

GGML backends

Vulkan

Hardware

Ryzen 7 5800X + 2x Radeon RX 7900 XTX

Models

Qwen3.5 35B A3B (quantized to q8_0 using llama-quantize)

Problem description & steps to reproduce

I'm consistently getting a vk::DeviceLostError when running Qwen3.5 with a large context.

Usually I'd use ROCm, but I tried Vulkan after PR #20334 was merged and noticed that the pp/tg throughput (t/s) is much better than on ROCm.

So far I've only seen this happen after context checkpoints are invalidated. Since llama-bench doesn't crash, I suspect the two are related.

I ran llama-bench a few times in a row with a 32k context, but I can't reproduce the issue with it. I also ran the same prompts a few times on builds from before and after the PR commit, and the issue only appears after the PR.

I also ran the master branch on a different system with a 9070 and it has no issues there, so this looks to be a multi-GPU problem.
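For reference, a minimal llama-server invocation along these lines triggers the crash. This is an illustrative sketch, not my exact command: the model path, -ngl/-ub/-b values, and KV cache types are taken from the llama-bench runs and log output below, while the context size matches the n_ctx_slot = 96000 shown in the logs; the port is a placeholder.

```shell
# Hypothetical reproduction command (flags reconstructed from the logs below).
# -ctk/-ctv q8_0 are the quantized KV cache settings mentioned for llama-server;
# -c 96000 matches n_ctx_slot in the server log; --port is arbitrary.
llama-server \
  -m ~/Models/LLM/Qwen-Qwen3.5-35B-A3B-q8_0.gguf \
  -ngl 99 -ub 1024 -b 2048 \
  -c 96000 -ctk q8_0 -ctv q8_0 \
  --port 8080
```

The crash only shows up after long-context requests once checkpoints start being invalidated, so a single short request is unlikely to reproduce it.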

First Bad Commit

40c550d

Relevant log output

llama-bench
$ llama-bench -m ~/Models/LLM/Qwen-Qwen3.5-35B-A3B-q8_0.gguf -ub 1024 -b 2048 -p 32768
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: KHR_coopmat
ggml_vulkan: 1 = Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | n_ubatch |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q8_0         |  34.36 GiB |    34.66 B | Vulkan     |  99 |     1024 |         pp32768 |       2286.84 ± 1.97 |
| qwen35moe 35B.A3B Q8_0         |  34.36 GiB |    34.66 B | Vulkan     |  99 |     1024 |           tg128 |         95.51 ± 0.09 |

build: 557fe2d91 (8322)

$ llama-bench -m ~/Models/LLM/Qwen-Qwen3.5-35B-A3B-q8_0.gguf -ub 1024 -b 2048 -p 32768
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: KHR_coopmat
ggml_vulkan: 1 = Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | n_ubatch |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q8_0         |  34.36 GiB |    34.66 B | Vulkan     |  99 |     1024 |         pp32768 |       2282.06 ± 2.97 |
| qwen35moe 35B.A3B Q8_0         |  34.36 GiB |    34.66 B | Vulkan     |  99 |     1024 |           tg128 |         95.55 ± 0.12 |

build: 557fe2d91 (8322)

$ llama-bench -m ~/Models/LLM/Qwen-Qwen3.5-35B-A3B-q8_0.gguf -ub 1024 -b 2048 -p 32768
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: KHR_coopmat
ggml_vulkan: 1 = Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | n_ubatch |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q8_0         |  34.36 GiB |    34.66 B | Vulkan     |  99 |     1024 |         pp32768 |       2281.24 ± 2.37 |
| qwen35moe 35B.A3B Q8_0         |  34.36 GiB |    34.66 B | Vulkan     |  99 |     1024 |           tg128 |         95.52 ± 0.14 |

build: 557fe2d91 (8322)

# when running with -ctk q8_0 and -ctv q8_0 (same settings as llama-server), llama-bench doesn't run
$ llama-bench -m ~/Models/LLM/Qwen-Qwen3.5-35B-A3B-q8_0.gguf -ctk q8_0 -ctv q8_0 -ub 1024 -b 2048 -p 32768
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: KHR_coopmat
ggml_vulkan: 1 = Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl | n_ubatch | type_k | type_v |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -----: | -----: | --------------: | -------------------: |
main: error: failed to create context with model '/home/sd/Models/LLM/Qwen-Qwen3.5-35B-A3B-q8_0.gguf'
Logs
llama-server[3416]: [50053] radv/amdgpu: The CS has been cancelled because the context is lost. This context is guilty of a hard recovery.
llama-server[3416]: [50053] [New LWP 3850]
llama-server[3416]: [50053] [New LWP 3849]
llama-server[3416]: [50053] [New LWP 3643]
llama-server[3416]: [50053] [New LWP 3642]
llama-server[3416]: [50053] [New LWP 3641]
llama-server[3416]: [50053] [New LWP 3640]
llama-server[3416]: [50053] [New LWP 3639]
llama-server[3416]: [50053] [New LWP 3638]
llama-server[3416]: [50053] [New LWP 3637]
llama-server[3416]: [50053] [New LWP 3636]
llama-server[3416]: [50053] [New LWP 3635]
llama-server[3416]: [50053] [New LWP 3634]
llama-server[3416]: [50053] [New LWP 3633]
llama-server[3416]: [50053] [New LWP 3632]
llama-server[3416]: [50053] [New LWP 3631]
llama-server[3416]: [50053] [New LWP 3630]
llama-server[3416]: [50053] [New LWP 3629]
llama-server[3416]: [50053] [New LWP 3628]
llama-server[3416]: [50053] [New LWP 3625]
llama-server[3416]: [50053] [New LWP 3624]
llama-server[3416]: [50053] [New LWP 3623]
llama-server[3416]: [50053] [New LWP 3622]
llama-server[3416]: [50053] warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/libvulkan_gfxstream.so
llama-server[3416]: [50053] warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/libvulkan_virtio.so
llama-server[3416]: [50053] warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/libvulkan_asahi.so
llama-server[3416]: [50053] warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/libvulkan_nouveau.so
llama-server[3416]: [50053] warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/libvulkan_lvp.so
llama-server[3416]: [50053] warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/libtinfo.so.6
llama-server[3416]: [50053] warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/libvulkan_intel.so
llama-server[3416]: [50053] warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/libvulkan_radeon.so
llama-server[3416]: [50053] warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/libvulkan_intel_hasvk.so
llama-server[3416]: [50053] warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/libVkLayer_MESA_device_select.so
llama-server[3416]: [50053] [Thread debugging using libthread_db enabled]
llama-server[3416]: [50053] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
llama-server[3416]: [50053] 0x00007d8068b10813 in __GI___wait4 (pid=4748, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
llama-server[3416]: [50053] warning: 30        ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
llama-server[3416]: [50053] #0  0x00007d8068b10813 in __GI___wait4 (pid=4748, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
llama-server[3416]: [50053] 30        in ../sysdeps/unix/sysv/linux/wait4.c
llama-server[3416]: [50053] #1  0x00007d80695977e3 in ggml_print_backtrace () from /home/sd/repos/llama.cpp/build/bin/libggml-base.so.0
llama-server[3416]: [50053] #2  0x00007d80695aa82f in ggml_uncaught_exception() () from /home/sd/repos/llama.cpp/build/bin/libggml-base.so.0
llama-server[3416]: [50053] #3  0x00007d8068ebb0da in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
llama-server[3416]: [50053] #4  0x00007d8068ea5a55 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
llama-server[3416]: [50053] #5  0x00007d8068ebb391 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
llama-server[3416]: [50053] #6  0x00007d806607c5f1 in ggml_vk_submit(std::shared_ptr<vk_context_struct>&, vk::Fence) [clone .cold] () from /home/sd/repos/llama.cpp/build/bin/libggml-vulkan.so.0
llama-server[3416]: [50053] #7  0x00007d806616e6db in ggml_vk_build_graph(ggml_backend_vk_context*, ggml_cgraph*, int, ggml_tensor*, int, bool, bool, bool) [clone .isra.0] () from /home/sd/repos/llama.cpp/build/bin/libggml-vulkan.so.0
llama-server[3416]: [50053] #8  0x00007d806616f38b in ggml_backend_vk_graph_compute(ggml_backend*, ggml_cgraph*) () from /home/sd/repos/llama.cpp/build/bin/libggml-vulkan.so.0
llama-server[3416]: [50053] #9  0x00007d80695b449c in ggml_backend_sched_graph_compute_async () from /home/sd/repos/llama.cpp/build/bin/libggml-base.so.0
llama-server[3416]: [50053] #10 0x00007d80692c5f01 in llama_context::graph_compute(ggml_cgraph*, bool) () from /home/sd/repos/llama.cpp/build/bin/libllama.so.0
llama-server[3416]: [50053] #11 0x00007d80692c8014 in llama_context::process_ubatch(llama_ubatch const&, llm_graph_type, llama_memory_context_i*, ggml_status&) () from /home/sd/repos/llama.cpp/build/bin/libllama.so.0
llama-server[3416]: [50053] #12 0x00007d80692cf236 in llama_context::decode(llama_batch const&) () from /home/sd/repos/llama.cpp/build/bin/libllama.so.0
llama-server[3416]: [50053] #13 0x00007d80692d0ccf in llama_decode () from /home/sd/repos/llama.cpp/build/bin/libllama.so.0
llama-server[3416]: [50053] #14 0x0000629ca9c8f072 in server_context_impl::update_slots() ()
llama-server[3416]: [50053] #15 0x0000629ca9cdc85e in server_queue::start_loop(long) ()
llama-server[3416]: [50053] #16 0x0000629ca9be96b9 in main ()
llama-server[3416]: [50053] [Inferior 1 (process 3618) detached]
llama-server[3416]: [50053] terminate called after throwing an instance of 'vk::DeviceLostError'
llama-server[3416]: [50053]   what():  vk::Queue::submit: ErrorDeviceLost


And here is a different crash, this time with a GPUVM fault:

llama-server[3607]: srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
llama-server[3607]: [49321] slot print_timing: id  0 | task 892 |
llama-server[3607]: [49321] prompt eval time =   16423.55 ms /  2940 tokens (    5.59 ms per token,   179.01 tokens per second)
llama-server[3607]: [49321]        eval time =     926.15 ms /    78 tokens (   11.87 ms per token,    84.22 tokens per second)
llama-server[3607]: [49321]       total time =   17349.70 ms /  3018 tokens
llama-server[3607]: [49321] slot      release: id  0 | task 892 | stop processing: n_tokens = 17960, truncated = 0
llama-server[3607]: [49321] srv  update_slots: all slots are idle
llama-server[3607]: srv  proxy_reques: proxying request to model qwen3-5-35b-a3b on port 49321
llama-server[3607]: [49321] srv  params_from_: Chat format: peg-native
llama-server[3607]: [49321] slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.997 (> 0.100 thold), f_keep = 1.000
llama-server[3607]: [49321] slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
llama-server[3607]: [49321] slot launch_slot_: id  0 | task 973 | processing task, is_child = 0
llama-server[3607]: [49321] slot update_slots: id  0 | task 973 | new prompt, n_ctx_slot = 96000, n_keep = 0, task.n_tokens = 18011
llama-server[3607]: [49321] slot update_slots: id  0 | task 973 | cache reuse is not supported - ignoring n_cache_reuse = 256
llama-server[3607]: [49321] slot update_slots: id  0 | task 973 | n_tokens = 17960, memory_seq_rm [17960, end)
llama-server[3607]: [49321] slot update_slots: id  0 | task 973 | prompt processing progress, n_tokens = 18007, batch.n_tokens = 47, progress = 0.999778
llama-server[3607]: [49321] slot update_slots: id  0 | task 973 | created context checkpoint 6 of 64 (pos_min = 17959, pos_max = 17959, n_tokens = 17960, size = 62.813 MiB)
llama-server[3607]: [49321] slot update_slots: id  0 | task 973 | n_tokens = 18007, memory_seq_rm [18007, end)
llama-server[3607]: [49321] slot init_sampler: id  0 | task 973 | init sampler, took 2.05 ms, tokens: text = 18011, total = 18011
llama-server[3607]: [49321] slot update_slots: id  0 | task 973 | prompt processing done, n_tokens = 18011, batch.n_tokens = 4
llama-server[3607]: [49321] radv/amdgpu: The CS has been cancelled because the context is lost. This context is guilty of a hard recovery.
llama-server[3607]: [49321] radv: GPUVM fault detected at address 0x80010007b000.
llama-server[3607]: [49321] GCVM_L2_PROTECTION_FAULT_STATUS: 0x400a10
llama-server[3607]: [49321]          CLIENT_ID: (CPC) 0x5
llama-server[3607]: [49321]          MORE_FAULTS: 0
llama-server[3607]: [49321]          WALKER_ERROR: 0
llama-server[3607]: [49321]          PERMISSION_FAULTS: 1
llama-server[3607]: [49321]          MAPPING_ERROR: 0
llama-server[3607]: [49321]          RW: 0
llama-server[3607]: [49321] [New LWP 4113]
llama-server[3607]: [49321] [New LWP 4112]
llama-server[3607]: [49321] [New LWP 3902]
llama-server[3607]: [49321] [New LWP 3901]
llama-server[3607]: [49321] [New LWP 3900]
llama-server[3607]: [49321] [New LWP 3899]
llama-server[3607]: [49321] [New LWP 3898]
llama-server[3607]: [49321] [New LWP 3897]
llama-server[3607]: [49321] [New LWP 3896]
llama-server[3607]: [49321] [New LWP 3895]
llama-server[3607]: [49321] [New LWP 3894]
llama-server[3607]: [49321] [New LWP 3893]
llama-server[3607]: [49321] [New LWP 3892]
llama-server[3607]: [49321] [New LWP 3891]
llama-server[3607]: [49321] [New LWP 3890]
llama-server[3607]: [49321] [New LWP 3889]
llama-server[3607]: [49321] [New LWP 3888]
llama-server[3607]: [49321] [New LWP 3887]
llama-server[3607]: [49321] [New LWP 3854]
llama-server[3607]: [49321] [New LWP 3853]
llama-server[3607]: [49321] [New LWP 3816]
llama-server[3607]: [49321] [New LWP 3815]
llama-server[3607]: [49321] warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/libvulkan_gfxstream.so
llama-server[3607]: [49321] warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/libvulkan_virtio.so
llama-server[3607]: [49321] warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/libvulkan_asahi.so
llama-server[3607]: [49321] warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/libvulkan_nouveau.so
llama-server[3607]: [49321] warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/libvulkan_lvp.so
llama-server[3607]: [49321] warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/libtinfo.so.6
llama-server[3607]: [49321] warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/libvulkan_intel.so
llama-server[3607]: [49321] warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/libvulkan_radeon.so
llama-server[3607]: [49321] warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/libvulkan_intel_hasvk.so
llama-server[3607]: [49321] warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/libVkLayer_MESA_device_select.so
llama-server[3607]: [49321] [Thread debugging using libthread_db enabled]
llama-server[3607]: [49321] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
llama-server[3607]: [49321] 0x00007485fbd10813 in __GI___wait4 (pid=4408, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
llama-server[3607]: [49321] warning: 30        ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
llama-server[3607]: [49321] #0  0x00007485fbd10813 in __GI___wait4 (pid=4408, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
llama-server[3607]: [49321] 30        in ../sysdeps/unix/sysv/linux/wait4.c
llama-server[3607]: [49321] #1  0x00007485fc7db7e3 in ggml_print_backtrace () from /home/sd/repos/llama.cpp/build/bin/libggml-base.so.0
llama-server[3607]: [49321] #2  0x00007485fc7ee82f in ggml_uncaught_exception() () from /home/sd/repos/llama.cpp/build/bin/libggml-base.so.0
llama-server[3607]: [49321] #3  0x00007485fc0bb0da in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
llama-server[3607]: [49321] #4  0x00007485fc0a5a55 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
llama-server[3607]: [49321] #5  0x00007485fc0bb391 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
llama-server[3607]: [49321] #6  0x00007485f927c5f1 in ggml_vk_submit(std::shared_ptr<vk_context_struct>&, vk::Fence) [clone .cold] () from /home/sd/repos/llama.cpp/build/bin/libggml-vulkan.so.0
llama-server[3607]: [49321] #7  0x00007485f936e6db in ggml_vk_build_graph(ggml_backend_vk_context*, ggml_cgraph*, int, ggml_tensor*, int, bool, bool, bool) [clone .isra.0] () from /home/sd/repos/llama.cpp/build/bin/libggml-vulkan.so.0
llama-server[3607]: [49321] #8  0x00007485f936f38b in ggml_backend_vk_graph_compute(ggml_backend*, ggml_cgraph*) () from /home/sd/repos/llama.cpp/build/bin/libggml-vulkan.so.0
llama-server[3607]: [49321] #9  0x00007485fc7f849c in ggml_backend_sched_graph_compute_async () from /home/sd/repos/llama.cpp/build/bin/libggml-base.so.0
llama-server[3607]: [49321] #10 0x00007485fc4c5f01 in llama_context::graph_compute(ggml_cgraph*, bool) () from /home/sd/repos/llama.cpp/build/bin/libllama.so.0
llama-server[3607]: [49321] #11 0x00007485fc4c8014 in llama_context::process_ubatch(llama_ubatch const&, llm_graph_type, llama_memory_context_i*, ggml_status&) () from /home/sd/repos/llama.cpp/build/bin/libllama.so.0
llama-server[3607]: [49321] #12 0x00007485fc4cf236 in llama_context::decode(llama_batch const&) () from /home/sd/repos/llama.cpp/build/bin/libllama.so.0
llama-server[3607]: [49321] #13 0x00007485fc4d0ccf in llama_decode () from /home/sd/repos/llama.cpp/build/bin/libllama.so.0
llama-server[3607]: [49321] #14 0x00005b93a7e8e072 in server_context_impl::update_slots() ()
llama-server[3607]: [49321] #15 0x00005b93a7edb85e in server_queue::start_loop(long) ()
llama-server[3607]: [49321] #16 0x00005b93a7de86b9 in main ()
llama-server[3607]: [49321] [Inferior 1 (process 3808) detached]
llama-server[3607]: [49321] terminate called after throwing an instance of 'vk::DeviceLostError'
llama-server[3607]: [49321]   what():  vk::Queue::submit: ErrorDeviceLost

I've attached the full logs of a larger run (they should include the settings I'm using):
vulkan-issue-full.txt
