Deadlock on AMD/Mesa/vk #4686

@SludgePhD

Description

I wrote a library whose unit tests perform wgpu operations, and those tests sometimes hang in what looks like a deadlock inside wgpu.

Repro steps
Run cargo t -p zaru-image on this commit to reproduce: https://github.com/SludgePhD/Zaru/commit/ac29836b0528a2e50c63c2a7ff68eb09b33a6cf3

Extra materials
I tried parking_lot's deadlock detection feature, but it turns out that it does not support RW locks.
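
Without a detector that covers RW locks, one stopgap for an intermittent hang like this is to run each test body under a timeout, so a stuck test is reported instead of blocking the whole run. The following is a minimal sketch using only the standard library; `run_with_timeout` is a hypothetical helper name, not part of wgpu, parking_lot, or Zaru.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical helper: runs `body` on a worker thread and waits at most
/// `timeout` for its result.
///
/// Returns `Some(result)` if the body finished in time. Returns `None` if the
/// timeout elapsed or the worker panicked before sending a result.
/// A timed-out worker cannot be cancelled; it is left running, detached.
pub fn run_with_timeout<T, F>(timeout: Duration, body: F) -> Option<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The send fails only if the receiver was already dropped after a
        // timeout, in which case nobody is waiting for the result.
        let _ = tx.send(body());
    });
    rx.recv_timeout(timeout).ok()
}
```

A test could wrap its GPU work in `run_with_timeout(Duration::from_secs(30), || { ... })` and fail on `None`. This only makes the hang visible; it does not say which lock is involved.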

GDB output below.

Thread state when the deadlock happens:

(gdb) info threads
  Id   Target Id                                           Frame 
* 1    Thread 0x7ffff7c8ccc0 (LWP 11329) "zaru_image-5a5c" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  2    Thread 0x7ffff7c8b6c0 (LWP 11331) "blend::tests::b" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  3    Thread 0x7ffff7a8a6c0 (LWP 11332) "draw::tests::te" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  4    Thread 0x7ffff78896c0 (LWP 11333) "draw::tests::te" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  6    Thread 0x7ffff73c66c0 (LWP 11335) "image::tests::c" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  7    Thread 0x7ffff71c56c0 (LWP 11336) "image::tests::d" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  22   Thread 0x7fffddfff6c0 (LWP 11351) "shader::compute" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  25   Thread 0x7fffdd9fc6c0 (LWP 11354) "shader::compute" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  27   Thread 0x7ffff6fc46c0 (LWP 11356) "view::tests::vi" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
  28   Thread 0x7fffcd5ff6c0 (LWP 11357) "zaru_im:disk$0"  0x00007ffff7d174ae in __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7ffff009f150) at futex-internal.c:57
  29   Thread 0x7fff8f3fd6c0 (LWP 11358) "blend::tests::b" 0x00007ffff7d174ae in __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7ffff00a091c) at futex-internal.c:57

The stacks on most of these threads look like this (though sometimes with a read lock instead of a write lock, and often while creating some other kind of resource rather than a command encoder):

Thread 2 (Thread 0x7ffff7c8b6c0 (LWP 11331) "blend::tests::b"):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x0000555555c12eb4 in parking_lot::raw_rwlock::RawRwLock::lock_exclusive_slow::h7957d3e95355ce44 ()
#2  0x0000555555a41116 in wgpu_core::registry::FutureId<I,T>::assign::h2cbde308a5113f46 ()
#3  0x00005555559293a2 in wgpu_core::device::global::<impl wgpu_core::global::Global<G>>::device_create_command_encoder::hd4fbe81984d0da62 ()
#4  0x00005555559d674f in <wgpu::backend::direct::Context as wgpu::context::Context>::device_create_command_encoder::ha410a29667457a46 ()
#5  0x00005555559df130 in <T as wgpu::context::DynContext>::device_create_command_encoder::hd39f52a0286d846f ()
#6  0x0000555555a63273 in wgpu::Device::create_command_encoder::hb9a94e62fccd0e4e ()
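
For context on the stack above: an exclusive (write) lock on an RW lock cannot be acquired while any other guard on the same lock is still alive, so a thread in lock_exclusive_slow stays parked until every such guard is dropped. The sketch below illustrates that rule with `std::sync::RwLock` and the non-blocking `try_write`, so it never hangs. It is not wgpu's registry code, the names `write_would_block` and `demo` are hypothetical, and it makes no claim about which guard is outstanding in this report.

```rust
use std::sync::{RwLock, TryLockError};

/// Returns true if a write lock on `lock` cannot be taken right now because
/// another guard (read or write) is still alive.
pub fn write_would_block<T>(lock: &RwLock<T>) -> bool {
    // If `try_write` succeeds, the temporary guard is dropped immediately.
    matches!(lock.try_write(), Err(TryLockError::WouldBlock))
}

/// Hypothetical stand-in for a shared registry guarded by an RW lock.
/// Returns (blocked while a read guard is held, blocked after it is dropped).
pub fn demo() -> (bool, bool) {
    let registry = RwLock::new(Vec::<u32>::new());

    // Hold a read guard, as a thread reading the registry would.
    let reader = registry.read().unwrap();
    let blocked_while_read_held = write_would_block(&registry);

    // Once the read guard is gone, the write lock becomes available.
    drop(reader);
    let blocked_after_release = write_would_block(&registry);

    (blocked_while_read_held, blocked_after_release)
}
```

With a blocking `write()` in place of `try_write()`, the first check would wait for as long as the read guard lives, which is the shape of the state the threads above are in.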

The only threads that look significantly different are the one running the test harness and the following two Mesa/Vulkan-related threads:

Thread 29 (Thread 0x7fff8f3fd6c0 (LWP 11358) "blend::tests::b"):
#0  0x00007ffff7d174ae in __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7ffff00a091c) at futex-internal.c:57
#1  __futex_abstimed_wait_common (futex_word=futex_word@entry=0x7ffff00a091c, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0, cancel=cancel@entry=true) at futex-internal.c:87
#2  0x00007ffff7d1752f in __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7ffff00a091c, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at futex-internal.c:139
#3  0x00007ffff7d19d40 in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7ffff00a08c8, cond=0x7ffff00a08f0) at pthread_cond_wait.c:503
#4  ___pthread_cond_wait (cond=0x7ffff00a08f0, mutex=0x7ffff00a08c8) at pthread_cond_wait.c:618
#5  0x00007ffff5ed9e11 in __gthread_cond_wait (__mutex=<optimized out>, __cond=0x7ffff00a08f0) at /usr/src/debug/gcc/gcc-build/x86_64-pc-linux-gnu/libstdc++-v3/include/x86_64-pc-linux-gnu/bits/gthr-default.h:865
#6  std::__condvar::wait (__m=..., this=0x7ffff00a08f0) at /usr/src/debug/gcc/gcc-build/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/std_mutex.h:171
#7  std::condition_variable::wait (this=this@entry=0x7ffff00a08f0, __lock=...) at /usr/src/debug/gcc/gcc/libstdc++-v3/src/c++11/condition_variable.cc:41
#8  0x00007fffcee28575 in QUEUE_STATE::NextSubmission (this=this@entry=0x7ffff00a0790) at /usr/src/debug/vulkan-validation-layers/Vulkan-ValidationLayers-vulkan-sdk-1.3.268.0/layers/state_tracker/queue_state.cpp:164
#9  0x00007fffcee2a0b8 in QUEUE_STATE::ThreadFunc (this=0x7ffff00a0790) at /usr/src/debug/vulkan-validation-layers/Vulkan-ValidationLayers-vulkan-sdk-1.3.268.0/layers/state_tracker/queue_state.cpp:200
#10 0x00007ffff5ee1943 in std::execute_native_thread_routine (__p=0x7ffff10602b0) at /usr/src/debug/gcc/gcc/libstdc++-v3/src/c++11/thread.cc:104
#11 0x00007ffff7d1a9eb in start_thread (arg=<optimized out>) at pthread_create.c:444
#12 0x00007ffff7d9e7cc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Thread 28 (Thread 0x7fffcd5ff6c0 (LWP 11357) "zaru_im:disk$0"):
#0  0x00007ffff7d174ae in __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7ffff009f150) at futex-internal.c:57
#1  __futex_abstimed_wait_common (futex_word=futex_word@entry=0x7ffff009f150, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0, cancel=cancel@entry=true) at futex-internal.c:87
#2  0x00007ffff7d1752f in __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7ffff009f150, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at futex-internal.c:139
#3  0x00007ffff7d19d40 in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7ffff009f100, cond=0x7ffff009f128) at pthread_cond_wait.c:503
#4  ___pthread_cond_wait (cond=0x7ffff009f128, mutex=0x7ffff009f100) at pthread_cond_wait.c:618
#5  0x00007fffdd0162fc in cnd_wait () at ../mesa-23.2.1/src/c11/impl/threads_posix.c:135
#6  util_queue_thread_func () at ../mesa-23.2.1/src/util/u_queue.c:290
#7  0x00007fffdd03861c in impl_thrd_routine () at ../mesa-23.2.1/src/c11/impl/threads_posix.c:67
#8  0x00007ffff7d1a9eb in start_thread (arg=<optimized out>) at pthread_create.c:444
#9  0x00007ffff7d9e7cc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

Platform
Arch Linux, wgpu 0.18, Mesa 23.2.1-arch1.2, Radeon RX 6700 XT
