Description
Name and Version
$ llama-cli --version
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: KHR_coopmat
ggml_vulkan: 1 = Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: KHR_coopmat
version: 8322 (557fe2d)
built with GNU 13.3.0 for Linux x86_64
Operating systems
Linux
GGML backends
Vulkan
Hardware
Ryzen 7 5800X + 2x Radeon RX 7900 XTX
Models
Qwen3.5 35B A3B (quantized to q8_0 using llama-quantize)
Problem description & steps to reproduce
I consistently get a vk::DeviceLostError when running Qwen3.5 with a large context.
Usually I'd use ROCm, but I tried Vulkan after PR #20334 was merged and noticed that the pp/tg throughput is much better than on ROCm.
So far I've only seen this happen after context checkpoints are invalidated; since llama-bench doesn't crash, I suspect the two are related.
I ran llama-bench several times in a row with a 32k context and couldn't reproduce the issue with it. I also ran the same prompts a few times on builds before and after the PR commit, and the crash only occurs after the PR.
Running the master branch on a different system with a single 9070 shows no issues there, so this looks like a multi-GPU problem.
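For reference, a rough sketch of the llama-server invocation, reconstructed from the log output below (the q8_0 K/V cache types, the 96000 context size and the batch sizes appear in the logs; the exact set of flags I used is in the attached full log, so treat this as an approximation, not the literal command):

```shell
# Approximate reproduction command (flags inferred from the logs below)
llama-server \
  -m ~/Models/LLM/Qwen-Qwen3.5-35B-A3B-q8_0.gguf \
  -ngl 99 \
  -c 96000 \
  -ub 1024 -b 2048 \
  -ctk q8_0 -ctv q8_0
```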
First Bad Commit
Relevant log output
llama-bench
$ llama-bench -m ~/Models/LLM/Qwen-Qwen3.5-35B-A3B-q8_0.gguf -ub 1024 -b 2048 -p 32768
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: KHR_coopmat
ggml_vulkan: 1 = Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: KHR_coopmat
| model | size | params | backend | ngl | n_ubatch | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q8_0 | 34.36 GiB | 34.66 B | Vulkan | 99 | 1024 | pp32768 | 2286.84 ± 1.97 |
| qwen35moe 35B.A3B Q8_0 | 34.36 GiB | 34.66 B | Vulkan | 99 | 1024 | tg128 | 95.51 ± 0.09 |
build: 557fe2d91 (8322)
$ llama-bench -m ~/Models/LLM/Qwen-Qwen3.5-35B-A3B-q8_0.gguf -ub 1024 -b 2048 -p 32768
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: KHR_coopmat
ggml_vulkan: 1 = Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: KHR_coopmat
| model | size | params | backend | ngl | n_ubatch | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q8_0 | 34.36 GiB | 34.66 B | Vulkan | 99 | 1024 | pp32768 | 2282.06 ± 2.97 |
| qwen35moe 35B.A3B Q8_0 | 34.36 GiB | 34.66 B | Vulkan | 99 | 1024 | tg128 | 95.55 ± 0.12 |
build: 557fe2d91 (8322)
$ llama-bench -m ~/Models/LLM/Qwen-Qwen3.5-35B-A3B-q8_0.gguf -ub 1024 -b 2048 -p 32768
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: KHR_coopmat
ggml_vulkan: 1 = Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: KHR_coopmat
| model | size | params | backend | ngl | n_ubatch | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q8_0 | 34.36 GiB | 34.66 B | Vulkan | 99 | 1024 | pp32768 | 2281.24 ± 2.37 |
| qwen35moe 35B.A3B Q8_0 | 34.36 GiB | 34.66 B | Vulkan | 99 | 1024 | tg128 | 95.52 ± 0.14 |
build: 557fe2d91 (8322)
# When running with -ctk q8_0 and -ctv q8_0 (the same settings as llama-server), llama-bench fails to create a context
$ llama-bench -m ~/Models/LLM/Qwen-Qwen3.5-35B-A3B-q8_0.gguf -ctk q8_0 -ctv q8_0 -ub 1024 -b 2048 -p 32768
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: KHR_coopmat
ggml_vulkan: 1 = Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: KHR_coopmat
| model | size | params | backend | ngl | n_ubatch | type_k | type_v | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -----: | -----: | --------------: | -------------------: |
main: error: failed to create context with model '/home/sd/Models/LLM/Qwen-Qwen3.5-35B-A3B-q8_0.gguf'
Logs
llama-server[3416]: [50053] radv/amdgpu: The CS has been cancelled because the context is lost. This context is guilty of a hard recovery.
llama-server[3416]: [50053] [New LWP 3850]
llama-server[3416]: [50053] [New LWP 3849]
llama-server[3416]: [50053] [New LWP 3643]
llama-server[3416]: [50053] [New LWP 3642]
llama-server[3416]: [50053] [New LWP 3641]
llama-server[3416]: [50053] [New LWP 3640]
llama-server[3416]: [50053] [New LWP 3639]
llama-server[3416]: [50053] [New LWP 3638]
llama-server[3416]: [50053] [New LWP 3637]
llama-server[3416]: [50053] [New LWP 3636]
llama-server[3416]: [50053] [New LWP 3635]
llama-server[3416]: [50053] [New LWP 3634]
llama-server[3416]: [50053] [New LWP 3633]
llama-server[3416]: [50053] [New LWP 3632]
llama-server[3416]: [50053] [New LWP 3631]
llama-server[3416]: [50053] [New LWP 3630]
llama-server[3416]: [50053] [New LWP 3629]
llama-server[3416]: [50053] [New LWP 3628]
llama-server[3416]: [50053] [New LWP 3625]
llama-server[3416]: [50053] [New LWP 3624]
llama-server[3416]: [50053] [New LWP 3623]
llama-server[3416]: [50053] [New LWP 3622]
llama-server[3416]: [50053] warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/libvulkan_gfxstream.so
llama-server[3416]: [50053] warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/libvulkan_virtio.so
llama-server[3416]: [50053] warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/libvulkan_asahi.so
llama-server[3416]: [50053] warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/libvulkan_nouveau.so
llama-server[3416]: [50053] warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/libvulkan_lvp.so
llama-server[3416]: [50053] warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/libtinfo.so.6
llama-server[3416]: [50053] warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/libvulkan_intel.so
llama-server[3416]: [50053] warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/libvulkan_radeon.so
llama-server[3416]: [50053] warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/libvulkan_intel_hasvk.so
llama-server[3416]: [50053] warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/libVkLayer_MESA_device_select.so
llama-server[3416]: [50053] [Thread debugging using libthread_db enabled]
llama-server[3416]: [50053] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
llama-server[3416]: [50053] 0x00007d8068b10813 in __GI___wait4 (pid=4748, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
llama-server[3416]: [50053] warning: 30 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
llama-server[3416]: [50053] #0 0x00007d8068b10813 in __GI___wait4 (pid=4748, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
llama-server[3416]: [50053] 30 in ../sysdeps/unix/sysv/linux/wait4.c
llama-server[3416]: [50053] #1 0x00007d80695977e3 in ggml_print_backtrace () from /home/sd/repos/llama.cpp/build/bin/libggml-base.so.0
llama-server[3416]: [50053] #2 0x00007d80695aa82f in ggml_uncaught_exception() () from /home/sd/repos/llama.cpp/build/bin/libggml-base.so.0
llama-server[3416]: [50053] #3 0x00007d8068ebb0da in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
llama-server[3416]: [50053] #4 0x00007d8068ea5a55 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
llama-server[3416]: [50053] #5 0x00007d8068ebb391 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
llama-server[3416]: [50053] #6 0x00007d806607c5f1 in ggml_vk_submit(std::shared_ptr<vk_context_struct>&, vk::Fence) [clone .cold] () from /home/sd/repos/llama.cpp/build/bin/libggml-vulkan.so.0
llama-server[3416]: [50053] #7 0x00007d806616e6db in ggml_vk_build_graph(ggml_backend_vk_context*, ggml_cgraph*, int, ggml_tensor*, int, bool, bool, bool) [clone .isra.0] () from /home/sd/repos/llama.cpp/build/bin/libggml-vulkan.so.0
llama-server[3416]: [50053] #8 0x00007d806616f38b in ggml_backend_vk_graph_compute(ggml_backend*, ggml_cgraph*) () from /home/sd/repos/llama.cpp/build/bin/libggml-vulkan.so.0
llama-server[3416]: [50053] #9 0x00007d80695b449c in ggml_backend_sched_graph_compute_async () from /home/sd/repos/llama.cpp/build/bin/libggml-base.so.0
llama-server[3416]: [50053] #10 0x00007d80692c5f01 in llama_context::graph_compute(ggml_cgraph*, bool) () from /home/sd/repos/llama.cpp/build/bin/libllama.so.0
llama-server[3416]: [50053] #11 0x00007d80692c8014 in llama_context::process_ubatch(llama_ubatch const&, llm_graph_type, llama_memory_context_i*, ggml_status&) () from /home/sd/repos/llama.cpp/build/bin/libllama.so.0
llama-server[3416]: [50053] #12 0x00007d80692cf236 in llama_context::decode(llama_batch const&) () from /home/sd/repos/llama.cpp/build/bin/libllama.so.0
llama-server[3416]: [50053] #13 0x00007d80692d0ccf in llama_decode () from /home/sd/repos/llama.cpp/build/bin/libllama.so.0
llama-server[3416]: [50053] #14 0x0000629ca9c8f072 in server_context_impl::update_slots() ()
llama-server[3416]: [50053] #15 0x0000629ca9cdc85e in server_queue::start_loop(long) ()
llama-server[3416]: [50053] #16 0x0000629ca9be96b9 in main ()
llama-server[3416]: [50053] [Inferior 1 (process 3618) detached]
llama-server[3416]: [50053] terminate called after throwing an instance of 'vk::DeviceLostError'
llama-server[3416]: [50053] what(): vk::Queue::submit: ErrorDeviceLost
And a different crash, this one with a GPUVM fault:
llama-server[3607]: srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
llama-server[3607]: [49321] slot print_timing: id 0 | task 892 |
llama-server[3607]: [49321] prompt eval time = 16423.55 ms / 2940 tokens ( 5.59 ms per token, 179.01 tokens per second)
llama-server[3607]: [49321] eval time = 926.15 ms / 78 tokens ( 11.87 ms per token, 84.22 tokens per second)
llama-server[3607]: [49321] total time = 17349.70 ms / 3018 tokens
llama-server[3607]: [49321] slot release: id 0 | task 892 | stop processing: n_tokens = 17960, truncated = 0
llama-server[3607]: [49321] srv update_slots: all slots are idle
llama-server[3607]: srv proxy_reques: proxying request to model qwen3-5-35b-a3b on port 49321
llama-server[3607]: [49321] srv params_from_: Chat format: peg-native
llama-server[3607]: [49321] slot get_availabl: id 0 | task -1 | selected slot by LCP similarity, sim_best = 0.997 (> 0.100 thold), f_keep = 1.000
llama-server[3607]: [49321] slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
llama-server[3607]: [49321] slot launch_slot_: id 0 | task 973 | processing task, is_child = 0
llama-server[3607]: [49321] slot update_slots: id 0 | task 973 | new prompt, n_ctx_slot = 96000, n_keep = 0, task.n_tokens = 18011
llama-server[3607]: [49321] slot update_slots: id 0 | task 973 | cache reuse is not supported - ignoring n_cache_reuse = 256
llama-server[3607]: [49321] slot update_slots: id 0 | task 973 | n_tokens = 17960, memory_seq_rm [17960, end)
llama-server[3607]: [49321] slot update_slots: id 0 | task 973 | prompt processing progress, n_tokens = 18007, batch.n_tokens = 47, progress = 0.999778
llama-server[3607]: [49321] slot update_slots: id 0 | task 973 | created context checkpoint 6 of 64 (pos_min = 17959, pos_max = 17959, n_tokens = 17960, size = 62.813 MiB)
llama-server[3607]: [49321] slot update_slots: id 0 | task 973 | n_tokens = 18007, memory_seq_rm [18007, end)
llama-server[3607]: [49321] slot init_sampler: id 0 | task 973 | init sampler, took 2.05 ms, tokens: text = 18011, total = 18011
llama-server[3607]: [49321] slot update_slots: id 0 | task 973 | prompt processing done, n_tokens = 18011, batch.n_tokens = 4
llama-server[3607]: [49321] radv/amdgpu: The CS has been cancelled because the context is lost. This context is guilty of a hard recovery.
llama-server[3607]: [49321] radv: GPUVM fault detected at address 0x80010007b000.
llama-server[3607]: [49321] GCVM_L2_PROTECTION_FAULT_STATUS: 0x400a10
llama-server[3607]: [49321] CLIENT_ID: (CPC) 0x5
llama-server[3607]: [49321] MORE_FAULTS: 0
llama-server[3607]: [49321] WALKER_ERROR: 0
llama-server[3607]: [49321] PERMISSION_FAULTS: 1
llama-server[3607]: [49321] MAPPING_ERROR: 0
llama-server[3607]: [49321] RW: 0
llama-server[3607]: [49321] [New LWP 4113]
llama-server[3607]: [49321] [New LWP 4112]
llama-server[3607]: [49321] [New LWP 3902]
llama-server[3607]: [49321] [New LWP 3901]
llama-server[3607]: [49321] [New LWP 3900]
llama-server[3607]: [49321] [New LWP 3899]
llama-server[3607]: [49321] [New LWP 3898]
llama-server[3607]: [49321] [New LWP 3897]
llama-server[3607]: [49321] [New LWP 3896]
llama-server[3607]: [49321] [New LWP 3895]
llama-server[3607]: [49321] [New LWP 3894]
llama-server[3607]: [49321] [New LWP 3893]
llama-server[3607]: [49321] [New LWP 3892]
llama-server[3607]: [49321] [New LWP 3891]
llama-server[3607]: [49321] [New LWP 3890]
llama-server[3607]: [49321] [New LWP 3889]
llama-server[3607]: [49321] [New LWP 3888]
llama-server[3607]: [49321] [New LWP 3887]
llama-server[3607]: [49321] [New LWP 3854]
llama-server[3607]: [49321] [New LWP 3853]
llama-server[3607]: [49321] [New LWP 3816]
llama-server[3607]: [49321] [New LWP 3815]
llama-server[3607]: [49321] warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/libvulkan_gfxstream.so
llama-server[3607]: [49321] warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/libvulkan_virtio.so
llama-server[3607]: [49321] warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/libvulkan_asahi.so
llama-server[3607]: [49321] warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/libvulkan_nouveau.so
llama-server[3607]: [49321] warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/libvulkan_lvp.so
llama-server[3607]: [49321] warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/libtinfo.so.6
llama-server[3607]: [49321] warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/libvulkan_intel.so
llama-server[3607]: [49321] warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/libvulkan_radeon.so
llama-server[3607]: [49321] warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/libvulkan_intel_hasvk.so
llama-server[3607]: [49321] warning: could not find '.gnu_debugaltlink' file for /lib/x86_64-linux-gnu/libVkLayer_MESA_device_select.so
llama-server[3607]: [49321] [Thread debugging using libthread_db enabled]
llama-server[3607]: [49321] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
llama-server[3607]: [49321] 0x00007485fbd10813 in __GI___wait4 (pid=4408, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
llama-server[3607]: [49321] warning: 30 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
llama-server[3607]: [49321] #0 0x00007485fbd10813 in __GI___wait4 (pid=4408, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
llama-server[3607]: [49321] 30 in ../sysdeps/unix/sysv/linux/wait4.c
llama-server[3607]: [49321] #1 0x00007485fc7db7e3 in ggml_print_backtrace () from /home/sd/repos/llama.cpp/build/bin/libggml-base.so.0
llama-server[3607]: [49321] #2 0x00007485fc7ee82f in ggml_uncaught_exception() () from /home/sd/repos/llama.cpp/build/bin/libggml-base.so.0
llama-server[3607]: [49321] #3 0x00007485fc0bb0da in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
llama-server[3607]: [49321] #4 0x00007485fc0a5a55 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
llama-server[3607]: [49321] #5 0x00007485fc0bb391 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
llama-server[3607]: [49321] #6 0x00007485f927c5f1 in ggml_vk_submit(std::shared_ptr<vk_context_struct>&, vk::Fence) [clone .cold] () from /home/sd/repos/llama.cpp/build/bin/libggml-vulkan.so.0
llama-server[3607]: [49321] #7 0x00007485f936e6db in ggml_vk_build_graph(ggml_backend_vk_context*, ggml_cgraph*, int, ggml_tensor*, int, bool, bool, bool) [clone .isra.0] () from /home/sd/repos/llama.cpp/build/bin/libggml-vulkan.so.0
llama-server[3607]: [49321] #8 0x00007485f936f38b in ggml_backend_vk_graph_compute(ggml_backend*, ggml_cgraph*) () from /home/sd/repos/llama.cpp/build/bin/libggml-vulkan.so.0
llama-server[3607]: [49321] #9 0x00007485fc7f849c in ggml_backend_sched_graph_compute_async () from /home/sd/repos/llama.cpp/build/bin/libggml-base.so.0
llama-server[3607]: [49321] #10 0x00007485fc4c5f01 in llama_context::graph_compute(ggml_cgraph*, bool) () from /home/sd/repos/llama.cpp/build/bin/libllama.so.0
llama-server[3607]: [49321] #11 0x00007485fc4c8014 in llama_context::process_ubatch(llama_ubatch const&, llm_graph_type, llama_memory_context_i*, ggml_status&) () from /home/sd/repos/llama.cpp/build/bin/libllama.so.0
llama-server[3607]: [49321] #12 0x00007485fc4cf236 in llama_context::decode(llama_batch const&) () from /home/sd/repos/llama.cpp/build/bin/libllama.so.0
llama-server[3607]: [49321] #13 0x00007485fc4d0ccf in llama_decode () from /home/sd/repos/llama.cpp/build/bin/libllama.so.0
llama-server[3607]: [49321] #14 0x00005b93a7e8e072 in server_context_impl::update_slots() ()
llama-server[3607]: [49321] #15 0x00005b93a7edb85e in server_queue::start_loop(long) ()
llama-server[3607]: [49321] #16 0x00005b93a7de86b9 in main ()
llama-server[3607]: [49321] [Inferior 1 (process 3808) detached]
llama-server[3607]: [49321] terminate called after throwing an instance of 'vk::DeviceLostError'
llama-server[3607]: [49321] what(): vk::Queue::submit: ErrorDeviceLost
I've attached the full logs of a longer run (they should include the settings I'm using):
vulkan-issue-full.txt