globalContext() deadlock if Context is not initialized before libtorch (variable hooks) is loaded #9784

@ezyang

Description

Steps to reproduce:

  1. Write a patch that eliminates globalContext() initialization from the static initializers of libcaffe2.so. Here is one sample branch: https://github.com/ezyang/pytorch/tree/issue/deadlocks
  2. Build and run

It deadlocks in the following trace:

(gdb) bt
#0  0x00007ffff78eaec9 in syscall () from /lib64/libc.so.6
#1  0x00007fffcbf4c57e in __cxxabiv1::__cxa_guard_acquire (g=0x7fffdf70fb08 <guard variable for at::globalContext()::globalContext_>) at /opt/conda/conda-bld/compilers_linux-64_1520532893746/work/.build/src/gcc-7.2.0/libstdc++-v3/libsupc++/guard.cc:307
#2  0x00007fffddd812d0 in at::globalContext () at ../aten/src/ATen/Context.cpp:41
#3  0x00007fffccef089b in torch::autograd::register_variable_type_for (baseType=0x5555566f78b0) at ../torch/csrc/autograd/generated/VariableType.cpp:171
#4  0x00007fffcce46588 in torch::autograd::VariableHooks::registerVariableTypeFor (this=0x5555566f7920, context=0x7fffdf70fb20 <at::globalContext()::globalContext_>, backend=at::Backend::CPU, scalar_type=at::ScalarType::Byte) at ../torch/csrc/autograd/aten_variable_hooks.cpp:21
#5  0x00007fffddfb3e43 in at::Type::registerCPU (context=0x7fffdf70fb20 <at::globalContext()::globalContext_>) at aten/src/ATen/Type.cpp:40
#6  0x00007fffddd8120f in at::Context::Context (this=0x7fffdf70fb20 <at::globalContext()::globalContext_>) at ../aten/src/ATen/Context.cpp:37
#7  0x00007fffddd812eb in at::globalContext () at ../aten/src/ATen/Context.cpp:41
#8  0x00007fffcd04c2f5 in torch::autograd::VariableTypeRegistry::VariableTypeRegistry (this=0x7fffcdce60e8 <torch::autograd::registry>) at ../torch/csrc/autograd/generated/VariableType.cpp:176
#9  0x00007fffcd0491ef in __static_initialization_and_destruction_0 (__initialize_p=1, __priority=65535) at ../torch/csrc/autograd/generated/VariableType.cpp:188
#10 0x00007fffcd049225 in _GLOBAL__sub_I_VariableType.cpp(void) () at ../torch/csrc/autograd/generated/VariableType.cpp:31075
#11 0x00007ffff7deab03 in _dl_init_internal () from /lib64/ld-linux-x86-64.so.2
#12 0x00007ffff7def6de in dl_open_worker () from /lib64/ld-linux-x86-64.so.2
#13 0x00007ffff7dea914 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#14 0x00007ffff7deeccb in _dl_open () from /lib64/ld-linux-x86-64.so.2
#15 0x00007ffff75eefbb in dlopen_doit () from /lib64/libdl.so.2
#16 0x00007ffff7dea914 in _dl_catch_error () from /lib64/ld-linux-x86-64.so.2
#17 0x00007ffff75ef5bd in _dlerror_run () from /lib64/libdl.so.2
#18 0x00007ffff75ef051 in dlopen@@GLIBC_2.2.5 () from /lib64/libdl.so.2

So there is a circular call of globalContext(), but ONLY if the variable registration static initializer is called before the Context static initializer. Ugh.
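
To make the cycle concrete: in the trace, frame #4 is already handed the Context under construction (context=0x7fffdf70fb20), but frame #3 calls globalContext() again on the same thread while the function-local static's guard is still held, which is where it hangs. Below is a minimal sketch of one way to avoid that re-entry, assuming the hook can simply use the Context it receives. Every name here (Context, Hook, register_variable_types, global_context) is a hypothetical stand-in, not the real ATen or autograd code, and this is not necessarily the fix that was adopted.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins for illustration only; not the real ATen types.
struct Context;
using Hook = void (*)(Context&);

struct Context {
  std::vector<int> registered_types;

  explicit Context(Hook hook) {
    // This runs while the function-local static in global_context() is
    // still being initialized, so nothing called from here may call
    // global_context() again.
    if (hook != nullptr) {
      hook(*this);
    }
  }
};

// Safe hook: registers against the Context it was handed.
inline void register_variable_types(Context& ctx) {
  ctx.registered_types.push_back(0);  // 0 is a placeholder type id
}

// Unsafe variant, left commented out on purpose. Re-entering the
// initialization of a function-local static is undefined behavior in C++;
// in the trace above it shows up as a hang in __cxa_guard_acquire.
//
// inline void register_variable_types_reentrant(Context&) {
//   global_context().registered_types.push_back(0);
// }

inline Context& global_context() {
  static Context ctx(&register_variable_types);
  return ctx;
}
```

Calling global_context() from any static initializer then constructs the Context and runs the hook exactly once, regardless of which initializer gets there first, because the hook never needs the guard a second time.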
