Deadlock in tgen (liblapack) #1788

@sporksmith

On my local machine, tgen deadlocks at startup when run under shadow. It starts several threads, which all appear to be looping on sched_yield and rdtsc; maybe some kind of spin lock? Here's a stack trace from one of the sched_yield calls:

(gdb) bt
<snip>
#14 0x00007ffff7fb8d88 in sched_yield (a=140737336181520, b=0, c=0, d=0, e=0, f=140737104509472)
    at /home/jnewsome/projects/shadow/dev/src/lib/libc_preload/syscall_wrappers.c:786
#15 0x00007ffff56f0286 in ?? () from /lib/x86_64-linux-gnu/liblapack.so.3
#16 0x00007ffff7458609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#17 0x00007ffff79ac293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Experimentally, changing sched_yield to move simulated time forward eventually lets tgen make progress, but it seems strange that tgen has this spin-wait. The loop appears to be at the start of some liblapack worker thread, but I'm not familiar with liblapack, and I don't see where tgen invokes it.
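To make the hypothesis concrete, here is a minimal sketch of a spin-wait that only exits once a timestamp counter shows that enough time has passed. It is not taken from liblapack; `sim_clock`, `sim_yield`, `sim_rdtsc`, and `spin_until_timeout` are hypothetical names, and the clock is a stand-in for whatever time the simulator reports. If yielding never advances time, the loop cannot observe the timeout, which would be consistent with the behavior described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical simulated clock (assumption, not shadow's real data structure). */
typedef struct {
    uint64_t now;           /* current simulated time, in arbitrary ticks */
    uint64_t yield_advance; /* ticks added per yield; 0 models frozen time */
} sim_clock;

/* Stand-in for sched_yield as seen by the simulated process. */
static void sim_yield(sim_clock *clk) { clk->now += clk->yield_advance; }

/* Stand-in for rdtsc as seen by the simulated process. */
static uint64_t sim_rdtsc(const sim_clock *clk) { return clk->now; }

/*
 * Spin, yielding each iteration, until `timeout` ticks have elapsed.
 * `max_iters` bounds the loop so the sketch terminates even when time is
 * frozen; a real spin-wait would have no such bound.
 * Returns true if the timeout was observed, false if the bound was hit.
 */
static bool spin_until_timeout(sim_clock *clk, uint64_t timeout, uint64_t max_iters) {
    assert(clk != NULL);
    uint64_t start = sim_rdtsc(clk);
    for (uint64_t i = 0; i < max_iters; i++) {
        if (sim_rdtsc(clk) - start >= timeout) {
            return true;
        }
        sim_yield(clk);
    }
    return false;
}
```

With `yield_advance` set to 0 the function gives up after `max_iters` iterations; with any positive value it returns once the clock has advanced by `timeout` ticks.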

If I attach gdb to tgen when it starts and set a breakpoint on clone, it looks like the thread is created by a global initializer (gotoblas_init, called from the dynamic loader):

(gdb) bt
#0  clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:50
#1  0x00007ffff74602ec in create_thread (pd=pd@entry=0x7fffe9218700, attr=attr@entry=0x7fffffffea60, stopped_start=stopped_start@entry=0x7fffffffea5e, stackaddr=stackaddr@entry=0x7fffe91f9b40, thread_ran=thread_ran@entry=0x7fffffffea5f)
    at ../sysdeps/unix/sysv/linux/createthread.c:101
#2  0x00007ffff7461e10 in __pthread_create_2_1 (newthread=<optimized out>, attr=<optimized out>, start_routine=<optimized out>, arg=<optimized out>) at pthread_create.c:817
#3  0x00007ffff56f959c in blas_thread_init () from /lib/x86_64-linux-gnu/liblapack.so.3
#4  0x00007ffff4fcd083 in gotoblas_init () from /lib/x86_64-linux-gnu/liblapack.so.3
#5  0x00007ffff7fe0b8a in ?? () from /lib64/ld-linux-x86-64.so.2
#6  0x00007ffff7fe0c91 in ?? () from /lib64/ld-linux-x86-64.so.2
#7  0x00007ffff7fd013a in ?? () from /lib64/ld-linux-x86-64.so.2
#8  0x0000000000000002 in ?? ()
#9  0x00007fffffffed92 in ?? ()
#10 0x00007fffffffedb3 in ?? ()
#11 0x0000000000000000 in ?? ()
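For reference, this is roughly how a shared library can run code before main, which would explain why nothing in tgen itself appears in the backtrace. It is a sketch using the GCC/Clang `constructor` attribute, not liblapack's actual code; `fake_blas_init`, `workers_started`, and `initializer_ran` are hypothetical names.

```c
/* Hypothetical flag recording that the initializer ran. */
static int workers_started = 0;

/*
 * GCC/Clang extension: the dynamic loader calls this when the object is
 * loaded, before main() starts. The backtrace above suggests that
 * gotoblas_init is invoked in a similar way.
 */
__attribute__((constructor))
static void fake_blas_init(void) {
    /* The real library calls pthread_create() at this point (see frame #2);
     * this sketch only records that the initializer ran. */
    workers_started = 1;
}

/* Returns 1 if the load-time initializer has already run. */
static int initializer_ran(void) { return workers_started; }
```

Calling `initializer_ran()` as the first statement of main already returns 1, so any thread started from such an initializer exists before the program's own code runs.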

I suppose the next step is to look at the liblapack source to find the thread's start function, which should contain the wait loop.

Btw I'm not sure why this only happens on my machine and not in the CI. I'm running Ubuntu 20.04, same as the tor test in the CI:

$ lsb_release --all
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 20.04.3 LTS
Release:	20.04
Codename:	focal

Labels: Type: Bug