Deadlock on AMD/Mesa/vk #4686
Description
I wrote a library whose unit tests perform various wgpu operations. Those tests sometimes hang in what looks like a deadlock inside wgpu.
Repro steps
Run cargo t -p zaru-image (cargo test) at this commit to reproduce the hang: https://github.com/SludgePhD/Zaru/commit/ac29836b0528a2e50c63c2a7ff68eb09b33a6cf3
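The hang is intermittent, so it can help to make a stuck test fail instead of blocking the whole cargo test run forever, and then rerun the suite in a loop. The following is a minimal sketch of that idea. It assumes the test body can be moved onto its own thread. run_with_timeout is a hypothetical helper and is not part of wgpu or Zaru.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Hypothetical test helper: runs `body` on its own thread and returns an
/// error if it has not finished within `timeout`, instead of blocking forever.
/// A truly deadlocked thread cannot be killed, so it is leaked; the process
/// still exits once the main thread returns.
fn run_with_timeout<F, R>(timeout: Duration, body: F) -> Result<R, String>
where
    F: FnOnce() -> R + Send + 'static,
    R: Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::Builder::new()
        // A recognizable name makes the worker easy to find in `info threads`.
        .name("timeout-worker".into())
        .spawn(move || {
            // If we already timed out, the receiver is gone; ignore the error.
            let _ = tx.send(body());
        })
        .map_err(|e| format!("failed to spawn worker: {e}"))?;

    match rx.recv_timeout(timeout) {
        Ok(value) => Ok(value),
        Err(RecvTimeoutError::Timeout) => {
            Err(format!("possible hang: still running after {timeout:?}"))
        }
        // The sender was dropped without sending, i.e. `body` panicked.
        Err(RecvTimeoutError::Disconnected) => Err("body panicked".to_string()),
    }
}
```

Inside a #[test], wrap the test body in a call such as run_with_timeout(Duration::from_secs(60), || { ... }).unwrap(). The closure must be Send + 'static, so any state it captures must satisfy those bounds as well.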
Extra materials
I tried parking_lot's deadlock detection feature, but it turns out that it does not support RwLocks.
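The deadlock detector does not cover RwLocks. For locks that your own code owns, one crude, std-only alternative is to probe them with a bounded try_write loop and report "possible deadlock" when the deadline passes. This is only a sketch. write_with_deadline is a hypothetical helper, and it uses std::sync::RwLock rather than parking_lot. It also cannot reach wgpu's internal registry locks from user code.

```rust
use std::sync::{RwLock, RwLockWriteGuard, TryLockError};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical diagnostic: tries to take `lock` for writing by polling
/// `try_write` until `timeout` elapses. Returns `None` if the lock stayed held
/// (by readers or a writer) the whole time, so the caller can log it instead
/// of blocking forever.
fn write_with_deadline<T>(lock: &RwLock<T>, timeout: Duration) -> Option<RwLockWriteGuard<'_, T>> {
    let deadline = Instant::now() + timeout;
    loop {
        match lock.try_write() {
            Ok(guard) => return Some(guard),
            // A panicking holder poisoned the lock; the data is still reachable.
            Err(TryLockError::Poisoned(poisoned)) => return Some(poisoned.into_inner()),
            Err(TryLockError::WouldBlock) => {
                if Instant::now() >= deadline {
                    return None;
                }
                // Back off briefly rather than spinning at full speed.
                thread::sleep(Duration::from_millis(1));
            }
        }
    }
}
```

Polling with a short sleep trades some latency for not needing any lock-internals support. That makes it usable where a detector like parking_lot's does not apply.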
GDB output below.
Thread state when the deadlock happens:
(gdb) info threads
Id Target Id Frame
* 1 Thread 0x7ffff7c8ccc0 (LWP 11329) "zaru_image-5a5c" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
2 Thread 0x7ffff7c8b6c0 (LWP 11331) "blend::tests::b" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
3 Thread 0x7ffff7a8a6c0 (LWP 11332) "draw::tests::te" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
4 Thread 0x7ffff78896c0 (LWP 11333) "draw::tests::te" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
6 Thread 0x7ffff73c66c0 (LWP 11335) "image::tests::c" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
7 Thread 0x7ffff71c56c0 (LWP 11336) "image::tests::d" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
22 Thread 0x7fffddfff6c0 (LWP 11351) "shader::compute" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
25 Thread 0x7fffdd9fc6c0 (LWP 11354) "shader::compute" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
27 Thread 0x7ffff6fc46c0 (LWP 11356) "view::tests::vi" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
28 Thread 0x7fffcd5ff6c0 (LWP 11357) "zaru_im:disk$0" 0x00007ffff7d174ae in __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7ffff009f150) at futex-internal.c:57
29 Thread 0x7fff8f3fd6c0 (LWP 11358) "blend::tests::b" 0x00007ffff7d174ae in __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7ffff00a091c) at futex-internal.c:57
The stacks of most of these threads look like this. Some show a read lock instead of a write lock, and many are for other resource types rather than command encoder creation:
Thread 2 (Thread 0x7ffff7c8b6c0 (LWP 11331) "blend::tests::b"):
#0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1 0x0000555555c12eb4 in parking_lot::raw_rwlock::RawRwLock::lock_exclusive_slow::h7957d3e95355ce44 ()
#2 0x0000555555a41116 in wgpu_core::registry::FutureId<I,T>::assign::h2cbde308a5113f46 ()
#3 0x00005555559293a2 in wgpu_core::device::global::<impl wgpu_core::global::Global<G>>::device_create_command_encoder::hd4fbe81984d0da62 ()
#4 0x00005555559d674f in <wgpu::backend::direct::Context as wgpu::context::Context>::device_create_command_encoder::ha410a29667457a46 ()
#5 0x00005555559df130 in <T as wgpu::context::DynContext>::device_create_command_encoder::hd39f52a0286d846f ()
#6 0x0000555555a63273 in wgpu::Device::create_command_encoder::hb9a94e62fccd0e4e ()
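Only three threads look significantly different: the one running the test harness, and the following two. One is a Vulkan validation layer queue thread and the other is a Mesa util_queue worker. Both are blocked in pthread_cond_wait, apparently waiting for new work: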
Thread 29 (Thread 0x7fff8f3fd6c0 (LWP 11358) "blend::tests::b"):
#0 0x00007ffff7d174ae in __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7ffff00a091c) at futex-internal.c:57
#1 __futex_abstimed_wait_common (futex_word=futex_word@entry=0x7ffff00a091c, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0, cancel=cancel@entry=true) at futex-internal.c:87
#2 0x00007ffff7d1752f in __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7ffff00a091c, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at futex-internal.c:139
#3 0x00007ffff7d19d40 in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7ffff00a08c8, cond=0x7ffff00a08f0) at pthread_cond_wait.c:503
#4 ___pthread_cond_wait (cond=0x7ffff00a08f0, mutex=0x7ffff00a08c8) at pthread_cond_wait.c:618
#5 0x00007ffff5ed9e11 in __gthread_cond_wait (__mutex=<optimized out>, __cond=0x7ffff00a08f0) at /usr/src/debug/gcc/gcc-build/x86_64-pc-linux-gnu/libstdc++-v3/include/x86_64-pc-linux-gnu/bits/gthr-default.h:865
#6 std::__condvar::wait (__m=..., this=0x7ffff00a08f0) at /usr/src/debug/gcc/gcc-build/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/std_mutex.h:171
#7 std::condition_variable::wait (this=this@entry=0x7ffff00a08f0, __lock=...) at /usr/src/debug/gcc/gcc/libstdc++-v3/src/c++11/condition_variable.cc:41
#8 0x00007fffcee28575 in QUEUE_STATE::NextSubmission (this=this@entry=0x7ffff00a0790) at /usr/src/debug/vulkan-validation-layers/Vulkan-ValidationLayers-vulkan-sdk-1.3.268.0/layers/state_tracker/queue_state.cpp:164
#9 0x00007fffcee2a0b8 in QUEUE_STATE::ThreadFunc (this=0x7ffff00a0790) at /usr/src/debug/vulkan-validation-layers/Vulkan-ValidationLayers-vulkan-sdk-1.3.268.0/layers/state_tracker/queue_state.cpp:200
#10 0x00007ffff5ee1943 in std::execute_native_thread_routine (__p=0x7ffff10602b0) at /usr/src/debug/gcc/gcc/libstdc++-v3/src/c++11/thread.cc:104
#11 0x00007ffff7d1a9eb in start_thread (arg=<optimized out>) at pthread_create.c:444
#12 0x00007ffff7d9e7cc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
Thread 28 (Thread 0x7fffcd5ff6c0 (LWP 11357) "zaru_im:disk$0"):
#0 0x00007ffff7d174ae in __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7ffff009f150) at futex-internal.c:57
#1 __futex_abstimed_wait_common (futex_word=futex_word@entry=0x7ffff009f150, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0, cancel=cancel@entry=true) at futex-internal.c:87
#2 0x00007ffff7d1752f in __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7ffff009f150, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at futex-internal.c:139
#3 0x00007ffff7d19d40 in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7ffff009f100, cond=0x7ffff009f128) at pthread_cond_wait.c:503
#4 ___pthread_cond_wait (cond=0x7ffff009f128, mutex=0x7ffff009f100) at pthread_cond_wait.c:618
#5 0x00007fffdd0162fc in cnd_wait () at ../mesa-23.2.1/src/c11/impl/threads_posix.c:135
#6 util_queue_thread_func () at ../mesa-23.2.1/src/util/u_queue.c:290
#7 0x00007fffdd03861c in impl_thrd_routine () at ../mesa-23.2.1/src/c11/impl/threads_posix.c:67
#8 0x00007ffff7d1a9eb in start_thread (arg=<optimized out>) at pthread_create.c:444
#9 0x00007ffff7d9e7cc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
Platform
Arch Linux, wgpu 0.18, Mesa 23.2.1-arch1.2, Radeon RX 6700 XT