Merge more changes from into the forked repo #5

Merged
imaginary-person merged 27 commits into imaginary-person:master from pytorch:master
Jan 22, 2021

Conversation

@imaginary-person
Owner

No description provided.

malfet and others added 27 commits January 21, 2021 19:57
Summary:
Based on quickwritereader's comment: #50439 (comment)
Those builtins were added in gcc-8 or newer

Fixes #50439

Pull Request resolved: #50640

Reviewed By: walterddr

Differential Revision: D25934384

Pulled By: malfet

fbshipit-source-id: b5dcfcf644ab92a78279c4dca5dbffbb8d8aae0c
…#50769)

Summary:
Pull Request resolved: #50769

A couple of these lines were newly added in the last couple of days, but they're not necessary anymore.
This PR removes them and also adds an assertion to make sure we don't add any more.
ghstack-source-id: 120133715

Test Plan: waitforsandcastle

Reviewed By: bhosmer

Differential Revision: D25961316

fbshipit-source-id: e2befc5b6215b42decb2acedcacfb50734857e2f
…der import.h (#50795)

Summary:
Pull Request resolved: #50795

There's [a post](https://fb.workplace.com/groups/2148543255442743/permalink/2583012411995823/) about a customer having to pass in `-Wno-global-constructors` to disable warnings related to calling constructors for global objects. This is related to the initialization of `default_extra_files_mobile` in `import.h`.

It requires end users to pass in the compiler flag, since the definition is now in code (.cpp files) that they will be compiling.

In addition, it makes the API for `_load_for_mobile` non-re-entrant (i.e. it cannot be safely used concurrently from multiple threads without the caller taking a mutex/lock) if the `extra_files_mobile` argument is not explicitly passed in.

Instead, a better option would be to create different overloads; one which requires all 3 parameters, and one that can work with 1-2. This solves the problem without creating a static variable.

ghstack-source-id: 120127083

Test Plan: Build Lite Interpreter and sandcastle.

Reviewed By: raziel

Differential Revision: D25968216

fbshipit-source-id: fbd80dfcafb8ef7231aca301445c4a2ca9a08995
Summary:
This not only specifies the data types of these NaNs, but also indicates
that the function isn't signaling anything unusual.

Pull Request resolved: #50412

Reviewed By: mrshenli

Differential Revision: D25899828

Pulled By: ezyang

fbshipit-source-id: a8ded10954ad08cba3098aa473c6b77f2e03dc93
Summary: Pull Request resolved: #50878

Test Plan: Imported from OSS

Reviewed By: SplitInfinity, eellison

Differential Revision: D26009183

Pulled By: ansley

fbshipit-source-id: 300913ea634d9a0e5b00deb831154ef126ad4180
Summary: Pull Request resolved: #50876

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D26002640

Pulled By: ansley

fbshipit-source-id: 4de8a63ef227ae3d46fab231f739c8472289ca4d
Summary:
Pull Request resolved: #49897

Resend of #49201

Test Plan: see 49201

Reviewed By: malfet

Differential Revision: D25717102

Pulled By: ilia-cher

fbshipit-source-id: 5e794a7f5fe160ca64ac9d190c4fd3e8f1e443e6
Summary:
Fixes #48520.

cc albanD (This is a clean retry PR #49807)

Pull Request resolved: #50886

Reviewed By: ejguan

Differential Revision: D26007435

Pulled By: albanD

fbshipit-source-id: 88fe91b40dea6f72e093e6301f0f04fcc842d2f0
Summary:
ROCm 3.5 replaced hcc with hip-clang and deprecated HIP_HCC_FLAGS.
HIP_CLANG_FLAGS should be used moving forward. HIP_HCC_FLAGS will
be removed soon.

Pull Request resolved: #50917

Reviewed By: ejguan

Differential Revision: D26008094

Pulled By: walterddr

fbshipit-source-id: cfec4f96fbd9bd338834a841c37267f6a4703cab
#50739)

Summary:
Pull Request resolved: #50739

This does not turn on batched grad testing for autogenerated NewModuleTest
tests and CriterionTest tests. Those are coming later.

Test Plan: - run tests

Reviewed By: ejguan

Differential Revision: D25997677

Pulled By: zou3519

fbshipit-source-id: b4b2d68e0f99c3d573faf237e1e531d0b3fced40
Summary:
Pull Request resolved: #50838

Similar to `torch.save` and `torch.jit.save`, accept an IO-like object
instead of just a file.
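
A minimal sketch of the "path or IO-like object" pattern, using only the standard library. The helper name `save_obj` and the use of `pickle` are assumptions made for illustration; this is not the function changed by this PR.

```python
import io
import os
import pickle


def save_obj(obj, f):
    """Serialize `obj` to `f`, which is a path or a writable binary file-like object."""
    if isinstance(f, (str, os.PathLike)):
        # A path was given: open the file ourselves and close it when done.
        with open(f, "wb") as fh:
            pickle.dump(obj, fh)
    else:
        # A file-like object was given: write to it and leave closing to the caller.
        pickle.dump(obj, f)


# Usage with an in-memory buffer instead of a file on disk.
buffer = io.BytesIO()
save_obj({"weights": [1.0, 2.0]}, buffer)
buffer.seek(0)
restored = pickle.load(buffer)
```

Accepting a file-like object lets callers write to memory, sockets, or archives without a temporary file.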

Test Plan: Imported from OSS

Reviewed By: nikithamalgifb

Differential Revision: D25982719

Pulled By: suo

fbshipit-source-id: 42f3665932bbaa6897215002d116df6338edae50
Summary: Pull Request resolved: #50850

Test Plan: Imported from OSS

Reviewed By: Chillee

Differential Revision: D25985085

Pulled By: ZolotukhinM

fbshipit-source-id: e51709423c2c12b37b449a9d7bb22be04cda7ef1
Summary:
Pull Request resolved: #50288

torch::deploy will bundle the objects contained in libtorch-python together with frozenpython into a shared library.  Therefore, the libtorch-python objs can't bring with them a dependency on system python.

Buck TARGETS are added throughout the caffe2 tree to make available objects or headers that will be needed by torch::deploy but would have brought unsuitable dependencies if accessed using existing targets.

CMakeLists are modified to separate a torch-python-objs object library which lets torch::deploy compile these objs with the same compile flags as libtorch_python used, but without some of the link-time dependencies such as python.

CudaIPCTypes is moved from libtorch_python to libtorch_cuda because it is really not a python binding, and it statically registers a cuda_ipc_callback which would be duplicated if included in each copy of torch::deploy.

Test Plan: no new functionality, just ensure existing tests continue to pass

Reviewed By: malfet

Differential Revision: D25850785

fbshipit-source-id: b0b81c050cbee04e9de96888f8a09d29238a9db8
Summary:
Pull Request resolved: #50279

This allows different sample inputs to have different behavior for the same
operator. For example, `div(..., rounding_mode='true')` will promote but other
rounding modes don't. The current boolean flag is too restrictive to allow this.
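
A rough sketch of why a per-sample setting is more expressive than a single per-operator boolean. The `SampleInput` dataclass, its `promotes_to_float` field, and `py_div` are hypothetical stand-ins written for this example; they are not the test-suite classes changed by this PR.

```python
import math
from dataclasses import dataclass, field


@dataclass
class SampleInput:
    """Hypothetical sample: arguments plus the behavior expected for this sample."""
    args: tuple
    kwargs: dict = field(default_factory=dict)
    promotes_to_float: bool = False  # per-sample, not per-operator


def py_div(a, b, rounding_mode="true"):
    """Plain-Python reference for integer division with a rounding mode."""
    if rounding_mode == "true":
        return a / b              # int inputs promote to float
    if rounding_mode == "floor":
        return a // b             # stays int
    if rounding_mode == "trunc":
        return math.trunc(a / b)  # stays int
    raise ValueError(f"unknown rounding_mode: {rounding_mode}")


samples = [
    SampleInput((7, 2), {"rounding_mode": "true"}, promotes_to_float=True),
    SampleInput((7, 2), {"rounding_mode": "floor"}),
    SampleInput((7, 2), {"rounding_mode": "trunc"}),
]


def expected_type(sample, input_type=int):
    """Result type a test would expect for this particular sample."""
    return float if sample.promotes_to_float else input_type
```

With one boolean for the whole operator, at least one of these three samples would be checked against the wrong expectation.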

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D25950011

Pulled By: mruberry

fbshipit-source-id: 7e82b82bedc626b2b6970d92d5b25676183ec384
Summary:
This is benchmarking tooling for sparse tensors. To implement it, we extended the `benchmarking util` PR [https://github.com/pytorch/pytorch/issues/38338](https://github.com/pytorch/pytorch/pull/38338) to sparse tensors. To extend the proposed utility library, the **FuzzedTensor** class was extended by creating the new **FuzzedSparseTensor** class. In addition, two new operator classes were added: `UnaryOpSparseFuzzer` and `BinaryOpSparseFuzzer`.

The class `FuzzedSparseTensor` adds new input parameters to the constructor:
1. `sparse_dim`: The number of sparse dimensions in a sparse tensor.
2. `nnz`:   Number of non-zero elements in the sparse tensor.
3. `density`: The density of the sparse tensor.
4. `coalesced`: Whether the sparse tensor is coalesced; the sparse tensor format permits both coalesced and uncoalesced tensors.

and removes `probability_contiguous`, `max_allocation_bytes`, `roll_parameter`, `tensor_constructor`, as they are dense-tensor-related parameters.
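
To make `nnz`, `density`, and `coalesced` concrete, here is a small pure-Python sketch of a COO-style sparse tensor. The helpers `coalesce` and `density` are illustrative names, and the relation density = nnz / total elements is an assumption; `FuzzedSparseTensor` may define these differently.

```python
from collections import defaultdict


def coalesce(indices, values):
    """Merge duplicate COO indices by summing their values; return sorted entries."""
    merged = defaultdict(float)
    for idx, val in zip(indices, values):
        merged[idx] += val
    ordered = sorted(merged)
    return ordered, [merged[idx] for idx in ordered]


def density(nnz, shape):
    """Fraction of specified elements, assuming density = nnz / total elements."""
    total = 1
    for dim in shape:
        total *= dim
    return nnz / total


# An uncoalesced 2-D COO tensor of shape (2, 3): index (0, 1) appears twice.
indices = [(0, 1), (1, 2), (0, 1)]
values = [1.0, 4.0, 2.0]
co_indices, co_values = coalesce(indices, values)
```

After coalescing, the duplicate entries at `(0, 1)` collapse into one entry holding their sum, so the number of stored entries drops from 3 to 2.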

In addition, I've extended the `torch.utils.benchmark.examples` to work with the new classes `FuzzedSparseTensor`, `UnaryOpSparseFuzzer` and `BinaryOpSparseFuzzer`.

Hopefully, this tooling and these examples will help to make other benchmarks in other PRs. Looking forward to your thoughts and feedback. cc robieta, mruberry,  ngimel

Pull Request resolved: #48397

Reviewed By: ejguan

Differential Revision: D26008137

Pulled By: mruberry

fbshipit-source-id: 2f37811c7c3eaa3494a0f2500e519267f2186dfb
Summary: Pull Request resolved: #50883

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D26003682

Pulled By: anjali411

fbshipit-source-id: f02967d2d236d740cd8647891f732f1d63098d3e
Summary:
Pull Request resolved: #50593

There is no equivalent of torch.FloatTensor or torch.cuda.FloatTensor for complex
types, so `get_gpu_type` and `get_cpu_type` are broken for complex tensors.

Also found a few places that explicitly cast inputs to floating point types,
which would drop the imaginary component before running the test.
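
The effect of such a cast can be shown with plain Python complex numbers. This is only an analogy: `to_float_list` and `magnitudes` are hypothetical helpers, not code from the test suite.

```python
def to_float_list(xs):
    """Mimic an explicit cast to a real floating-point type: keep only the real part."""
    return [complex(x).real for x in xs]


def magnitudes(xs):
    """The operation under test: element-wise absolute value."""
    return [abs(x) for x in xs]


inputs = [3 + 4j, 1 - 1j]
expected = magnitudes(inputs)                    # uses the full complex values
after_cast = magnitudes(to_float_list(inputs))   # imaginary parts already dropped
```

A test that casts first exercises the operator on different data than intended, so it can pass without ever touching the complex code path.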

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D25954050

Pulled By: mruberry

fbshipit-source-id: 1fa8e5af233aa095c839d5e2f860564baaf92aef
Summary:
Pull Request resolved: #50736

Exposes tanh and sigmoid to other backends

Test Plan: buck test caffe2/test/cpp/tensorexpr:tensorexpr -- "ATen.fast"

Reviewed By: bertmaher

Differential Revision: D25884911

fbshipit-source-id: f9a5286450331f60935cfd40bb23f4a4f4c1d087
…ort.h (#50832)

Summary:
Pull Request resolved: #50832

Please see the previous diff in this stack for the motivation to do so. This makes the same change but for the non-mobile codebase.
ghstack-source-id: 120184012

Test Plan: Sandcastle + Build

Reviewed By: raziel, iseeyuan

Differential Revision: D25979986

fbshipit-source-id: 7708f4f6a50cb16d7a23651e5655144d277d0a4f
Summary: Pull Request resolved: #50912

Test Plan: Sandcastle tests

Reviewed By: ansley

Differential Revision: D26001948

fbshipit-source-id: 3bfe6a8283a2b1882ed472f836ae1b6e720e519f
Summary: Pull Request resolved: #50583

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D26010501

Pulled By: ansley

fbshipit-source-id: 947121af7e57c16c96f849fbbb3fa83e97d003b2
Summary:
Fixes #50792

Fixes `count_nonzero` for tensors with requires_grad and also includes a test.

Pull Request resolved: #50866

Reviewed By: ejguan

Differential Revision: D25996202

Pulled By: albanD

fbshipit-source-id: 61f2d7d62dd04e574a65ad03ef3a358b141fbae7
Summary:
Fixes #49100

Pull Request resolved: #49904

Reviewed By: ezyang, mrshenli

Differential Revision: D25956761

Pulled By: mruberry

fbshipit-source-id: 86a59289d50825a0ebbd7c358b483c8d8039ffa6
Summary: Pull Request resolved: #50951

Test Plan: Imported from OSS

Reviewed By: fmassa

Differential Revision: D26021488

Pulled By: IvanKobzarev

fbshipit-source-id: 6d295762bb1160a3ed8bafac08e03e1eeb07d688
Summary:
Pull Request resolved: #50594

Fixes #50234

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D25987316

Pulled By: anjali411

fbshipit-source-id: c298b771fe52b267a86938e886ea402badecfe3e
Summary:
Fixes #49541

Reference: #24507

Pull Request resolved: #49732

Reviewed By: ejguan

Differential Revision: D25991438

Pulled By: ngimel

fbshipit-source-id: a43bd0bfe043d8e32a6cadbbf736a0eaa697e7ec
Summary:
Can't think of a reason not to .gitignore the test-reports folder. This can be helpful when:
1. running `python test/test*.py` from the repository root directory, since it creates the folder at the root.
2. the CI test report path generated by `torch/testing/_internal/common_utils.py` creates the folder in the same path where the test Python file is located.

Creating a PR to make sure CI is happy. This is also needed by #50923

Pull Request resolved: #50952

Reviewed By: samestep

Differential Revision: D26022436

Pulled By: walterddr

fbshipit-source-id: 83e6296de802bd1754b802b8c70502c317f078c9
@imaginary-person imaginary-person merged commit 2db291e into imaginary-person:master Jan 22, 2021
imaginary-person pushed a commit that referenced this pull request May 26, 2021
Summary: Added more statistics info for static runtime

Test Plan:
caffe2/benchmarks/static_runtime:static_runtime_cpptest

Expected output example:

Static runtime ms per iter: 0.939483. Iters per second: 1064.41
Node #0: 0.195671 ms/iter, %wide_offset.1 : Tensor = aten::add(%wide.1, %self._mu, %4)
Node #1: 0.169457 ms/iter, %wide_normalized.1 : Tensor = aten::mul(%wide_offset.1, %self._sigma)
Node #2: 0.118218 ms/iter, %wide_preproc.1 : Tensor = aten::clamp(%wide_normalized.1, %5, %6)
Node #3: 0.038814 ms/iter, %user_emb_t.1 : Tensor = aten::transpose(%user_emb.1, %4, %7)
Node #4: 0.0860747 ms/iter, %dp_unflatten.1 : Tensor = aten::bmm(%ad_emb_packed.1, %user_emb_t.1)
Node #5: 0.0102666 ms/iter, %31 : Tensor = static_runtime::flatten_copy(%dp_unflatten.1, %4, %8)
Node #6: 0.000476333 ms/iter, %19 : Tensor[] = prim::ListConstruct(%31, %wide_preproc.1)
Node #7: 0.0707332 ms/iter, %input.1 : Tensor = aten::cat(%19, %4)
Node #8: 0.123695 ms/iter, %fc1.1 : Tensor = aten::addmm(%self._fc_b, %input.1, %29, %4, %4)
Node #9: 0.0309244 ms/iter, %23 : Tensor = aten::sigmoid(%fc1.1)
Node #10: 0.0046297 ms/iter, %24 : (Tensor) = prim::TupleConstruct(%23)
Time per node type:
       0.195671 ms.    23.0483%. aten::add (1 nodes)
       0.169457 ms.    19.9605%. aten::mul (1 nodes, out variant)
       0.123695 ms.    14.5702%. aten::addmm (1 nodes, out variant)
       0.118218 ms.     13.925%. aten::clamp (1 nodes, out variant)
      0.0860747 ms.    10.1388%. aten::bmm (1 nodes, out variant)
      0.0707332 ms.    8.33175%. aten::cat (1 nodes, out variant)
       0.038814 ms.    4.57195%. aten::transpose (1 nodes)
      0.0309244 ms.    3.64263%. aten::sigmoid (1 nodes, out variant)
      0.0102666 ms.    1.20932%. static_runtime::flatten_copy (1 nodes, out variant)
      0.0046297 ms.   0.545338%. prim::TupleConstruct (1 nodes, out variant)
    0.000476333 ms.  0.0561079%. prim::ListConstruct (1 nodes, out variant)
       0.848959 ms. in Total
StaticRuntime setup time: 0.018925 ms
Memory allocation time: 0.019808 ms
Memory deallocation time: 0.0120445 ms
Outputs deallocation time: 0.0864947 ms
Total memory managed: 19328 bytes
Total number of reused tensors: 3
Total number of 'out' variant nodes/total number of nodes: 9/11 (81.8182%)

Reviewed By: hlu1

Differential Revision: D28553029

fbshipit-source-id: 55e7eab50b4b475ae219896100bdf4f6678875a4
imaginary-person pushed a commit that referenced this pull request Jul 2, 2021
Summary:
Pull Request resolved: pytorch#60987

We were seeing deadlocks as follows during shutdown:

```
Thread 1 (LWP 2432101):
#0  0x00007efca470190b in __pause_nocancel () from /lib64/libc.so.6
#1  0x00007efca49de485 in __pthread_mutex_lock_full () from /lib64/libpthread.so.0
#2  0x00007ef91d4c42c6 in __cuda_CallJitEntryPoint () from /lib64/libnvidia-ptxjitcompiler.so.1
#3  0x00007efc651ac8f1 in ?? () from /lib64/libcuda.so
#4  0x00007efc651aee03 in ?? () from /lib64/libcuda.so
#5  0x00007efc64f76b84 in ?? () from /lib64/libcuda.so
#6  0x00007efc64f77f5d in ?? () from /lib64/libcuda.so
#7  0x00007efc64eac858 in ?? () from /lib64/libcuda.so
#8  0x00007efc64eacfbc in ?? () from /lib64/libcuda.so
#9  0x00007efc7810a924 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#10 0x00007efc780fa2be in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#11 0x00007efc78111044 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#12 0x00007efc7811580a in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#13 0x00007efc78115aa4 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#14 0x00007efc781079ec in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#15 0x00007efc780e6a7a in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#16 0x00007efc7811cfa5 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#17 0x00007efc777ea98c in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#18 0x00007efc777ebd80 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#19 0x00007efc777ea2c9 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#20 0x00007efc778c2e2d in cublasDestroy_v2 () from /usr/local/cuda/lib64/libcublas.so.11
#21 0x00007efc51a3fb56 in std::_Sp_counted_ptr_inplace<at::cuda::(anonymous namespace)::DeviceThreadHandlePool<cublasContext*, &at::cuda::(anonymous namespace)::createCublasHandle, &at::cuda::(anonymous namespace)::destroyCublasHandle>, std::allocator<at::cuda::(anonymous namespace)::DeviceThreadHandlePool<cublasContext*, &at::cuda::(anonymous namespace)::createCublasHandle, &at::cuda::(anonymous namespace)::destroyCublasHandle> >, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /data/users/pritam/pytorch/torch/lib/libtorch_cuda.so
#22 0x00007efc51a3fc5f in std::shared_ptr<at::cuda::(anonymous namespace)::DeviceThreadHandlePool<cublasContext*, &at::cuda::(anonymous namespace)::createCublasHandle, &at::cuda::(anonymous namespace)::destroyCublasHandle> >::~shared_ptr() () from /data/users/pritam/pytorch/torch/lib/libtorch_cuda.so
#23 0x00007efca4648b0c in __run_exit_handlers () from /lib64/libc.so.6
#24 0x00007efca4648c40 in exit () from /lib64/libc.so.6
#25 0x0000558c8852e5f9 in Py_Exit (sts=0) at /tmp/build/80754af9/python_1614362349910/work/Python/pylifecycle.c:2292
#26 0x0000558c8852e6a7 in handle_system_exit () at /tmp/build/80754af9/python_1614362349910/work/Python/pythonrun.c:636
#27 0x0000558c8852e742 in PyErr_PrintEx (set_sys_last_vars=<optimized out>, set_sys_last_vars=<optimized out>) at /tmp/build/80754af9/python_1614362349910/work/Python/pythonrun.c:646
#28 0x0000558c88540dd6 in PyRun_SimpleStringFlags (command=0x7efca4dc9050 "from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=9, pipe_handle=13)\n", flags=0x7ffe3a986110) at /tmp/build/80754af9/python_1614362349910/work/Python/pythonrun.c:457
#29 0x0000558c88540ead in pymain_run_command (cf=0x7ffe3a986110, command=<optimized out>) at /tmp/build/80754af9/python_1614362349910/work/Modules/main.c:420
#30 pymain_run_python (pymain=0x7ffe3a986220) at /tmp/build/80754af9/python_1614362349910/work/Modules/main.c:2907
#31 pymain_main (pymain=0x7ffe3a986220) at /tmp/build/80754af9/python_1614362349910/work/Modules/main.c:3460
#32 0x0000558c8854122c in _Py_UnixMain (argc=<optimized out>, argv=<optimized out>) at /tmp/build/80754af9/python_1614362349910/work/Modules/main.c:3495
#33 0x00007efca4632493 in __libc_start_main () from /lib64/libc.so.6
#34 0x0000558c884e5e90 in _start () at ../sysdeps/x86_64/elf/start.S:103
```

This was likely caused by a static singleton that wasn't leaky. This change follows
the guidance in https://isocpp.org/wiki/faq/ctors#construct-on-first-use-v2 and
uses a leaky singleton instead.
ghstack-source-id: 132847448

Test Plan: Verified locally.

Reviewed By: malfet

Differential Revision: D29468866

fbshipit-source-id: 89250594c5cd2643417b1da584c658b742dc5a5c
imaginary-person pushed a commit that referenced this pull request Jul 20, 2021
Summary:
Pull Request resolved: pytorch#61588

As part of debugging pytorch#60290,
we discovered the following deadlock:

```
Thread 79 (Thread 0x7f52ff7fe700 (LWP 205437)):
#0  pthread_cond_timedwait@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
#1  0x0000564880199152 in PyCOND_TIMEDWAIT (cond=0x564880346080 <gil_cond>, mut=0x564880346100 <gil_mutex>, us=5000) at /home/builder/ktietz/cos6/ci_cos6/python_1622833237666/work/Python/condvar.h:103
#2  take_gil (tstate=0x7f5254005ef0) at /home/builder/ktietz/cos6/ci_cos6/python_1622833237666/work/Python/ceval_gil.h:224
#3  0x0000564880217b62 in PyEval_AcquireThread (tstate=0x7f5254005ef0) at /home/builder/ktietz/cos6/ci_cos6/python_1622833237666/work/Python/ceval.c:278
#4  0x00007f557d54aabd in pybind11::gil_scoped_acquire::gil_scoped_acquire() () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so
#5  0x00007f557da7792f in (anonymous namespace)::concrete_decref_fn(c10::impl::PyInterpreter const*, _object*) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so
#6  0x00007f5560dadba6 in c10::TensorImpl::release_resources() () from /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so
#7  0x00007f5574c885bc in std::_Sp_counted_ptr_inplace<torch::distributed::autograd::DistAutogradContext, std::allocator<torch::distributed::autograd::DistAutogradContext>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#8  0x00007f5574c815e9 in std::__detail::_Hashtable_alloc<std::allocator<std::__detail::_Hash_node<std::pair<long const, std::shared_ptr<torch::distributed::autograd::DistAutogradContext> >, false> > >::_M_deallocate_node(std::__detail::_Hash_node<std::pair<long const, std::shared_ptr<torch::distributed::autograd::DistAutogradContext> >, false>*) [clone .isra.325] () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#9  0x00007f5574c81bf1 in torch::distributed::autograd::DistAutogradContainer::eraseContextIdAndReset(torch::distributed::autograd::DistAutogradContainer::ContextsShard&, long) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#10 0x00007f5574c86e83 in torch::distributed::autograd::DistAutogradContainer::releaseContextIfPresent(long) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#11 0x00007f5574cc6395 in torch::distributed::rpc::RequestCallbackNoPython::processCleanupAutogradContextReq(torch::distributed::rpc::RpcCommandBase&) const () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#12 0x00007f5574cccf15 in torch::distributed::rpc::RequestCallbackNoPython::processRpc(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::vector<c10::Stream, std::allocator<c10::Stream> >) const () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so

Thread 72 (Thread 0x7f53077fe700 (LWP 205412)):
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f55bc62adbd in __GI___pthread_mutex_lock (mutex=0x564884396440) at ../nptl/pthread_mutex_lock.c:80
#2  0x00007f5574c82a2f in torch::distributed::autograd::DistAutogradContainer::retrieveContext(long) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#3  0x00007f557de9bb2f in pybind11::cpp_function::initialize<torch::distributed::autograd::(anonymous namespace)::dist_autograd_init(_object*, _object*)::{lambda(long)#11}, pybind11::dict, long, pybind11::name, pybind11::scope, pybind11::sibling, char [931], pybind11::arg>(torch::distributed::autograd::(anonymous namespace)::dist_autograd_init(_object*, _object*)::{lambda(long)#11}&&, pybind11::dict (*)(long), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, char const (&) [931], pybind11::arg const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so

```

Basically, Thread 72 holds the GIL and tries to acquire the lock for
DistAutogradContainer to perform a lookup on a map. On the other hand,
Thread 79 holds the lock on DistAutogradContainer to remove a Tensor and, as
part of the TensorImpl destructor, concrete_decref_fn is called, which waits for
the GIL. As a result, we have a deadlock.

To fix this issue, I've ensured we release the GIL when we call `retrieveContext`
and acquire it later when needed.
ghstack-source-id: 133493659
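
The lock-ordering fix described above can be sketched with two ordinary Python locks. Everything here is a stand-in: `gil` plays the role of the interpreter lock, `container_lock` plays the role of the DistAutogradContainer mutex, and the function names only mirror the C++ ones; this is not the actual change.

```python
import threading

gil = threading.Lock()             # stand-in for the interpreter lock
container_lock = threading.Lock()  # stand-in for the container mutex
contexts = {1: "ctx-1"}


def retrieve_context(context_id):
    """Caller holds `gil`. Drop it before waiting on the container lock."""
    gil.release()
    try:
        with container_lock:
            value = contexts.get(context_id)
    finally:
        gil.acquire()  # re-acquire only after the container lock is released
    return value


def release_context(context_id):
    """Holds the container lock while cleanup needs `gil`, like the destructor path."""
    with container_lock:
        ctx = contexts.pop(context_id, None)
        with gil:  # cleanup that requires the interpreter lock
            return ctx


def lookup_worker(results):
    with gil:
        results.append(retrieve_context(1))


results = []
t1 = threading.Thread(target=lookup_worker, args=(results,))
t2 = threading.Thread(target=release_context, args=(2,))
t1.start()
t2.start()
t1.join(timeout=5)
t2.join(timeout=5)
```

Because the lookup path never waits on `container_lock` while holding `gil`, the circular wait between the two threads cannot form.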

Test Plan: waitforbuildbot

Reviewed By: mrshenli

Differential Revision: D29682624

fbshipit-source-id: f68a1fb39040ca0447a26e456a97bce64af6b79c