Deadlock in tgen (liblapack) #1788
Description
On my local machine, tgen deadlocks at startup when run under shadow. It starts several threads, all of which appear to be looping on sched_yield and rdtsc; maybe some kind of spin lock? Here's a stack trace from one of the sched_yield calls:
(gdb) bt
<snip>
#14 0x00007ffff7fb8d88 in sched_yield (a=140737336181520, b=0, c=0, d=0, e=0, f=140737104509472)
at /home/jnewsome/projects/shadow/dev/src/lib/libc_preload/syscall_wrappers.c:786
#15 0x00007ffff56f0286 in ?? () from /lib/x86_64-linux-gnu/liblapack.so.3
#16 0x00007ffff7458609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#17 0x00007ffff79ac293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
Experimentally, changing shadow's sched_yield to advance simulated time eventually lets tgen make progress, but it seems strange that tgen has this spin-wait at all. It appears to be at the start of some liblapack worker thread, but I'm not familiar with liblapack, and I don't see where tgen invokes it.
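For illustration, here is a minimal sketch of the kind of spin-wait that would behave this way. It is not taken from liblapack; work_ready, read_ticks and spin_wait_for_work are hypothetical names, and the assumption is that the worker polls a flag, yields between polls, and uses the timestamp counter to decide when to stop spinning. If neither rdtsc nor sched_yield advances time in the simulator, the timeout branch is never reached and the loop spins forever.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>
#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h> /* __rdtsc(), GCC/Clang on x86 */
#endif

/* Hypothetical flag that a worker thread polls; not a liblapack symbol. */
static atomic_int work_ready;

/* Read a monotonically increasing tick count: the TSC on x86,
 * otherwise nanoseconds from the monotonic clock. */
static uint64_t read_ticks(void) {
#if defined(__x86_64__) || defined(__i386__)
    return __rdtsc();
#else
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
#endif
}

/* Spin until work_ready is set or max_ticks have elapsed.
 * Returns 1 if work arrived, 0 on timeout. The timeout can only
 * fire if read_ticks() keeps increasing while the loop runs. */
static int spin_wait_for_work(uint64_t max_ticks) {
    uint64_t start = read_ticks();
    while (!atomic_load(&work_ready)) {
        if (read_ticks() - start > max_ticks) {
            return 0; /* a real pool would likely block on a condvar here */
        }
        sched_yield(); /* give other threads a chance to run */
    }
    return 1;
}
```

On real hardware the counter keeps ticking, so a loop like this always terminates. Under a simulator that controls time, something in the loop has to move the clock, which would match the observation that advancing time in sched_yield unblocks tgen.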
If I attach gdb to tgen when it starts, and set a breakpoint on clone, it looks like the thread is created through some global initializer:
(gdb) bt
#0 clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:50
#1 0x00007ffff74602ec in create_thread (pd=pd@entry=0x7fffe9218700, attr=attr@entry=0x7fffffffea60, stopped_start=stopped_start@entry=0x7fffffffea5e, stackaddr=stackaddr@entry=0x7fffe91f9b40, thread_ran=thread_ran@entry=0x7fffffffea5f)
at ../sysdeps/unix/sysv/linux/createthread.c:101
#2 0x00007ffff7461e10 in __pthread_create_2_1 (newthread=<optimized out>, attr=<optimized out>, start_routine=<optimized out>, arg=<optimized out>) at pthread_create.c:817
#3 0x00007ffff56f959c in blas_thread_init () from /lib/x86_64-linux-gnu/liblapack.so.3
#4 0x00007ffff4fcd083 in gotoblas_init () from /lib/x86_64-linux-gnu/liblapack.so.3
#5 0x00007ffff7fe0b8a in ?? () from /lib64/ld-linux-x86-64.so.2
#6 0x00007ffff7fe0c91 in ?? () from /lib64/ld-linux-x86-64.so.2
#7 0x00007ffff7fd013a in ?? () from /lib64/ld-linux-x86-64.so.2
#8 0x0000000000000002 in ?? ()
#9 0x00007fffffffed92 in ?? ()
#10 0x00007fffffffedb3 in ?? ()
#11 0x0000000000000000 in ?? ()
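Frames #3 through #7 are consistent with a library constructor that the dynamic linker runs at load time, which would explain why nothing in tgen's own code calls into liblapack. A minimal sketch of that mechanism follows, assuming a GCC/Clang-style constructor attribute; pool_init, worker_main and pool_join are hypothetical names, not liblapack symbols.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static pthread_t worker;
static int worker_created;        /* nonzero once pthread_create succeeded */
static atomic_int worker_started; /* set by the worker when it runs */

/* Thread start function; a real pool would enter its wait loop here. */
static void *worker_main(void *arg) {
    (void)arg;
    atomic_store(&worker_started, 1);
    return NULL;
}

/* Runs automatically when the object is loaded, before main().
 * No caller in the application source is needed. */
__attribute__((constructor))
static void pool_init(void) {
    worker_created = (pthread_create(&worker, NULL, worker_main, NULL) == 0);
}

/* Wait for the worker to finish. Returns 0 on success,
 * -1 if no worker was created or it was already joined. */
static int pool_join(void) {
    if (!worker_created) {
        return -1;
    }
    worker_created = 0;
    return pthread_join(worker, NULL);
}
```

Because the thread is spawned from the loader, it exists before main() starts, so any wait loop in its start function is already running by the time the application begins.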
I suppose the next step is to look at the liblapack source to find the thread's start function, which should contain the wait loop.
Btw I'm not sure why this only happens on my machine and not in the CI. I'm running Ubuntu 20.04, same as the tor test in the CI:
$ lsb_release --all
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.3 LTS
Release: 20.04
Codename: focal