Get changes from main repo#9

Merged
imaginary-person merged 31 commits into enable_min_max_half_types from master on Jan 27, 2021
Conversation

@imaginary-person
Owner

Get changes from main repo

mattip and others added 30 commits January 26, 2021 16:19
Summary:
Fixes pytorch#3307

Previously, `self.grad` was not ~~cloned~~ deepcopied to the returned tensor in `deepcopy`. Added a test and an implementation.

Pull Request resolved: pytorch#50663

Reviewed By: heitorschueroff

Differential Revision: D26074811

Pulled By: albanD

fbshipit-source-id: 536dad36415f1d03714b4ce57453f406ad802b8c
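The behavior being fixed can be illustrated with plain Python. A minimal sketch using a hypothetical `FakeTensor` stand-in (not PyTorch's actual `Tensor`): `copy.deepcopy` recurses into attributes, so the copy's `grad` becomes an independent object instead of being shared with the original.

```python
import copy

class FakeTensor:
    """Hypothetical stand-in for a tensor carrying a .grad attribute."""
    def __init__(self, data, grad=None):
        self.data = list(data)
        self.grad = grad

t = FakeTensor([1.0, 2.0], grad=FakeTensor([0.1, 0.2]))
t2 = copy.deepcopy(t)

# deepcopy recursed into .grad, so mutating the copy's grad
# leaves the original's grad untouched
t2.grad.data[0] = 99.0
print(t.grad.data[0])   # still 0.1
```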
Summary:
In order to enable FC int8 quantization in P2C2, we are trying to run the caffe2 op Int8FCPackWeight in the model transformation pipeline.

The net is being generated from the python side, and passed back into C++ and run here: https://fburl.com/diffusion/3zt1mp03,  with these dependencies included: https://fburl.com/diffusion/rdjtdtcf

However, when the net is executed, it errors out with:
```
Cannot create operator of type 'Int8FCPackWeight' on the device 'CPU'
```

This diff attempts to fix this issue.

Test Plan:
To reproduce, run this test without the fix:
```
buck test //aiplatform/modelstore/transformation/tests:pyper_to_caffe2_dispatcher_test
```

Reviewed By: jspark1105

Differential Revision: D25965167

fbshipit-source-id: a7414669abb8731177c14e8792de58f400970732
Summary:
as in title

resolves D25791248 (pytorch@069602e)

Test Plan: buck test //caffe2/aten:vitals

Reviewed By: EscapeZero, malfet

Differential Revision: D26090442

fbshipit-source-id: 07937f246ec0a6eb338d21208ada61758237ae42
Summary:
Fixes pytorch#50378.

Additionally, this has some minor fixes:
 - [x] Fix mean for half-cauchy to return `inf` instead of `nan`.
 - [x] Fix constraints/support for the relaxed categorical distribution.
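Why `inf` rather than `nan` is the right mean for the half-Cauchy: the mean integral diverges, growing like the log of the truncation point. A quick numeric check in plain Python (the helper names below are illustrative, not PyTorch API):

```python
import math

def half_cauchy_pdf(x, scale=1.0):
    # density of |X| where X ~ Cauchy(0, scale), supported on x >= 0
    return 2.0 / (math.pi * scale * (1.0 + (x / scale) ** 2))

def truncated_mean(upper, steps=100000):
    # left Riemann sum of x * pdf(x) over [0, upper]
    h = upper / steps
    return sum(i * h * half_cauchy_pdf(i * h) * h for i in range(steps))

# the partial integral equals log(1 + upper**2) / pi, which is unbounded,
# so the mean is +inf (a well-defined divergent integral, not nan)
print(truncated_mean(1e2), truncated_mean(1e4), truncated_mean(1e6))
```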

Pull Request resolved: pytorch#51053

Reviewed By: heitorschueroff

Differential Revision: D26077966

Pulled By: neerajprad

fbshipit-source-id: ca0213baa9bbdbc661aebbb901ab5e7fded38a5f
Summary: Pull Request resolved: pytorch#50884

Test Plan: Imported from OSS

Reviewed By: heitorschueroff

Differential Revision: D26086963

fbshipit-source-id: f103f7f529d63d701c4f17862e30eafbab7d0c68
Summary:
On Ampere GPUs, matmuls are computed with TF32 by default when the dtype is `torch.float`: https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices, which reduces the precision of the results. However, linear algebra usually needs higher precision, so many tests in `test_linalg.py` fail on Ampere GPUs because of precision issues.

To fix this issue:
- Most linear algebra methods, except for matmuls, should add `NoTF32Guard`
- Expected results in unit tests should be computed with numpy matmuls instead of PyTorch CUDA ones
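The second point can be sketched with numpy alone (shapes and tolerances here are arbitrary): computing the reference matmul in float64 via numpy keeps the expected value independent of whatever TF32 mode the library under test is using.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 64)).astype(np.float32)
b = rng.standard_normal((64, 64)).astype(np.float32)

# reference result computed in float64 by numpy, unaffected by TF32
expected = (a.astype(np.float64) @ b.astype(np.float64)).astype(np.float32)

# a true float32 matmul matches this reference to float32 tolerance;
# a TF32 matmul would only match to roughly 3 decimal digits
np.testing.assert_allclose(a @ b, expected, rtol=1e-4, atol=1e-4)
print("reference check passed")
```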

Pull Request resolved: pytorch#50453

Reviewed By: glaringlee

Differential Revision: D26023005

Pulled By: ngimel

fbshipit-source-id: f0ea533494fee322b07925565b57e3b0db2570c5
Summary:
Fixes #{issue number}
This is not really a new issue, just a proposed minor fix to pytorch#50640 (now closed), which was itself a fix for pytorch#50439.

That fix added inlining for vec_signed (and others) but in one case the return was accidentally omitted.  This results in a build error:
```
                 from ../aten/src/ATen/cpu/vec256/vec256.h:19,
                 from aten/src/ATen/native/cpu/FillKernel.cpp.VSX.cpp:3:
../aten/src/ATen/cpu/vec256/vsx/vsx_helpers.h: In function ‘vint32 vec_signed(const vfloat32&)’:
../aten/src/ATen/cpu/vec256/vsx/vsx_helpers.h:33:1: error: no return statement in function returning non-void [-Werror=return-type]
```

I've confirmed that the error disappears after this one-line fix.  (Note: There is another issue encountered later in the build unrelated to this particular fix, as I noted in a separate comment in the original issue.  I'm trying to make some sense of that one, but in any event it would be a subject for another issue/PR).

Pull Request resolved: pytorch#51116

Reviewed By: heitorschueroff

Differential Revision: D26078213

Pulled By: malfet

fbshipit-source-id: 59b2ee19138fa1b8d8ec1d35ca4a5ef0a67bc123
Summary:
Pull Request resolved: pytorch#51162

It's unused.
ghstack-source-id: 120427120

Test Plan: CI

Reviewed By: bhosmer

Differential Revision: D25859010

fbshipit-source-id: 7bb21312843debaedaa6a969727c171b2bb0e6b2
Summary:
Fixes pytorch#51105 by adding back the `import pycuda.autoinit`.

Pull Request resolved: pytorch#51106

Reviewed By: mingzhe09088

Differential Revision: D26086808

Pulled By: heitorschueroff

fbshipit-source-id: 88d98796c87a44cedaa1f6666e9f71a424293641
Summary:
A tiny PR to update the links in the left-hand navbar under Libraries. The canonical links for vision and text are `https://pytorch.org/vision/stable` and `https://pytorch.org/text/stable`, respectively. The links without `/stable` work via a redirect, but pointing at the canonical URLs is cleaner.

Pull Request resolved: pytorch#51103

Reviewed By: izdeby

Differential Revision: D26079760

Pulled By: heitorschueroff

fbshipit-source-id: df1fa64d7895831f4e6242445bae02c1faa5e4dc
Summary:
Add const to static variable inside `__host__ __device__` function.

Pull Request resolved: pytorch#50970

Reviewed By: izdeby

Differential Revision: D26081478

Pulled By: heitorschueroff

fbshipit-source-id: 77cf145f7e0570359aa00aec4c8b82c950815f81
Summary:
Pull Request resolved: pytorch#51081

Pull Request resolved: pytorch#51001

fix tests in TestQuantizeJitOps

Test Plan:
Imported from OSS
python test/test_quantization.py

Reviewed By: raghuramank100

Differential Revision: D26038759

Pulled By: lyoka

fbshipit-source-id: 0977ba7b8b26a9f654f20f5c698a7a20ec078c35
Summary: Pull Request resolved: pytorch#51129

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D26094947

Pulled By: anjali411

fbshipit-source-id: 4e1cdf8915a8c6a86ac3462685cdce881e1bcffa
Summary:
Pull Request resolved: pytorch#50915

Fixes pytorch#50584
Add a vectorize flag to torch.autograd.functional.jacobian and
torch.autograd.functional.hessian (default: False). Under the hood, the
vectorize flag uses vmap as the backend to compute the jacobian and
hessian, respectively, providing speedups to users.
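The idea behind the vectorize flag can be illustrated with a toy numpy example (a conceptual sketch, not the actual vmap machinery): each Jacobian row is one vector-Jacobian product, and instead of looping over basis vectors you can feed them all at once as a batch.

```python
import numpy as np

W = np.arange(12.0).reshape(3, 4)   # f(x) = W @ x, so the Jacobian is W

def vjp(v):
    """Vector-Jacobian product of f(x) = W @ x with cotangent v."""
    return v @ W

# vectorize=False, conceptually: one VJP call per output row
jac_loop = np.stack([vjp(e) for e in np.eye(3)])

# vectorize=True, conceptually: all basis vectors in a single batched call
jac_batched = np.eye(3) @ W

print(np.array_equal(jac_loop, jac_batched), np.array_equal(jac_batched, W))
```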

Test Plan:
- I updated all of the jacobian and hessian tests to also use vectorized=True.
- I added some simple sanity-check tests that compare, e.g., jacobian with vectorized=False vs. jacobian with vectorized=True.
- The mechanism for vectorized=True goes through batched gradient computation. We have separate tests for those (see other PRs in this stack).

Reviewed By: heitorschueroff

Differential Revision: D26057674

Pulled By: zou3519

fbshipit-source-id: a8ae7ca0d2028ffb478abd1b377f5b49ee39e4a1
Summary:
Fixes pytorch#50330

- Encapsulate the `make html` call and capture its stdout/stderr with a `tee` command
- If the build fails, print the `WARNING:` lines of the build log and finish with a message

I tried it out on my branch, but did not write a test.
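A minimal Python analog of that wrapper (the actual change is a shell script using `tee`; `run_and_capture` is a hypothetical name): run the build, echo its output as it arrives, and on failure surface only the `WARNING:` lines.

```python
import subprocess
import sys

def run_and_capture(cmd):
    """Run cmd, echo its combined stdout/stderr, return (returncode, log)."""
    proc = subprocess.run(cmd, stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT, text=True)
    print(proc.stdout, end="")          # echo everything, like `tee`
    return proc.returncode, proc.stdout

# stand-in for `make html`: a child process that emits one warning
code, log = run_and_capture([sys.executable, "-c",
    "print('WARNING: undefined label'); print('build finished')"])

if code != 0:
    # if the build failed, print only the WARNING: lines from the log
    for line in log.splitlines():
        if line.startswith("WARNING:"):
            print(line)
```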

Pull Request resolved: pytorch#50356

Reviewed By: ezyang

Differential Revision: D26101762

Pulled By: brianjo

fbshipit-source-id: ba2b704d3244ef5139ca9026c5250537bf45734f
Summary:
Fixes pytorch#49824

## Background

When creating a view of a view, there was a possibility that the new view would be less restrictive than the previous view, incorrectly sidestepping the error that should be thrown when using in-place operations on the new view.

The fix addresses this by propagating `CreationMeta` from the previous view to the new view. Currently, the old view's `creation_meta` is only propagated when the new view's `creation_meta == CreationMeta::DEFAULT`. This ensures that the new view is never less restrictive than the previous view with respect to allowing in-place operations.
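The propagation rule itself is small; a sketch in Python (the enum values approximate the C++ `CreationMeta` names; this is an illustration, not the real autograd code):

```python
from enum import Enum

class CreationMeta(Enum):
    DEFAULT = 0        # no restriction on in-place operations
    MULTI_OUTPUT = 1   # restrictive: view came from a multi-output op
    NO_GRAD_MODE = 2   # restrictive: view created under no_grad

def propagate_creation_meta(prev_meta, new_meta):
    # inherit the parent's restriction only when the new view has none of
    # its own, so a view of a view is never less restrictive than its parent
    return prev_meta if new_meta is CreationMeta.DEFAULT else new_meta

# a default view of a restricted view stays restricted
assert propagate_creation_meta(
    CreationMeta.MULTI_OUTPUT, CreationMeta.DEFAULT) is CreationMeta.MULTI_OUTPUT
# an already-restricted new view keeps its own restriction
assert propagate_creation_meta(
    CreationMeta.DEFAULT, CreationMeta.NO_GRAD_MODE) is CreationMeta.NO_GRAD_MODE
```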

Pull Request resolved: pytorch#51061

Test Plan:
```
python test/test_autograd.py TestAutogradDeviceTypeCPU.test_inplace_view_of_multiple_output_view_cpu
python test/test_autograd.py TestAutogradDeviceTypeCUDA.test_inplace_view_of_multiple_output_view_cuda
python test/test_autograd.py TestAutogradDeviceTypeCPU.test_inplace_multiple_output_view_of_view_cpu
python test/test_autograd.py TestAutogradDeviceTypeCUDA.test_inplace_multiple_output_view_of_view_cuda
```

Reviewed By: heitorschueroff

Differential Revision: D26076434

Pulled By: jbschlosser

fbshipit-source-id: c47f0ddcef9b8449427b671aff9ad08edca70fcd
Summary: Pull Request resolved: pytorch#51148

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D26102025

Pulled By: anjali411

fbshipit-source-id: b1b6fd12fda03c4520a3c3200226edf352496188
Summary:
Followup of pytorch#50927

Pull Request resolved: pytorch#51045

Reviewed By: mruberry

Differential Revision: D26089204

Pulled By: ngimel

fbshipit-source-id: 77291dd83fba32d6f80a8540910b112a1d85a892
Summary:
Pull Request resolved: pytorch#51168

Adds types to function I/O for numeric suite.  This is for readability
and static type checking with mypy.
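As a generic illustration of the kind of annotation being added (the function below is hypothetical, not numeric suite code), typed function I/O lets mypy check callers at analysis time:

```python
from typing import Dict, List

def group_weights(names: List[str]) -> Dict[str, List[str]]:
    """Group weight names by their top-level module prefix."""
    groups: Dict[str, List[str]] = {}
    for name in names:
        prefix = name.split(".", 1)[0]
        groups.setdefault(prefix, []).append(name)
    return groups

# mypy can now flag e.g. group_weights(42) or treating the result as a list
print(group_weights(["fc1.weight", "fc1.bias", "conv.weight"]))
# {'fc1': ['fc1.weight', 'fc1.bias'], 'conv': ['conv.weight']}
```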

Test Plan:
```
mypy torch/quantization/
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D26092454

fbshipit-source-id: d37cf61e4d9604f4bc550b392f55fb59165f7624
Summary: Pull Request resolved: pytorch#50901

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D26083289

Pulled By: mruberry

fbshipit-source-id: 7e14ff37bba46dd456e0bc0aa9c4e0a632d0734c
Summary:
Mostly replace `global Foo` with `make_global(Foo)`.
The only real fix is generating the Subscript annotation, which is a follow-up to pytorch#48676.

Fixes pytorch#49617

Pull Request resolved: pytorch#51182

Reviewed By: gmagogsfm

Differential Revision: D26095244

Pulled By: malfet

fbshipit-source-id: 0e043d9a2cf43fff71dfbb341f708cd7af87c39a
Summary:
Pull Request resolved: pytorch#51079

Added support for functional conv2d + relu; conv1d and conv3d will be added in a future PR

Test Plan:
python test/test_quantization.py TestQuantizeFxOps.test_functional_conv

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D26089964

fbshipit-source-id: 8703de17de1469f7076651c386c83fb5922a56eb
Summary:
Pull Request resolved: pytorch#50760

The SHM transport uses shared-memory-backed ringbuffers to transfer small payloads between processes on the same machine.

It was disabled in v1.6 due to a CMake mishap but we've since realized that it also doesn't work that well in docker and other setups. Enabling it here to see whether CircleCI fails.
ghstack-source-id: 120470890

Test Plan: Exported three times to CircleCI with tests consistently passing

Reviewed By: mrshenli

Differential Revision: D23814828

fbshipit-source-id: f355cb6515776debad536924de4f4d3fbb05a874
Summary:
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Fixes #{issue number}

Pull Request resolved: pytorch#50633

Reviewed By: samestep

Differential Revision: D26083492

Pulled By: seemethere

fbshipit-source-id: c133671b9cf5074539133ee79fca5c680793a85d
Summary:
Pull Request resolved: pytorch#51155

This PR added support for quantizing functional conv1d, conv3d,  conv1d_relu and conv3d_relu

Test Plan:
python test/test_quantization.py TestQuantizeFxOps.test_functional_conv

Imported from OSS

Reviewed By: vkuzo

Differential Revision: D26089965

fbshipit-source-id: 4aea507d05b744807e993f6d3711ab308fb7591b
Summary:
Reference: pytorch#42515

Pull Request resolved: pytorch#50140

Reviewed By: mrshenli

Differential Revision: D25951094

Pulled By: mruberry

fbshipit-source-id: e53f1dbddff889710f05d43dbc9587382d3decb0
Summary:
Pull Request resolved: pytorch#44859

TensorPipe's `set_device_map` option was applied during the forward
pass. However, if we ran the backward pass for the graph we would not
automatically pick up the reverse device mapping.

As a result, users had to specify both forward and backward device mapping
which is very tedious to do.

In this PR, I've added this functionality such that TensorPipe automatically
picks up the reverse device mapping during the backward pass. This is done by
storing the appropriate device mapping in the "recv" autograd function for
distributed autograd.
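The relationship between the two mappings is just an inversion; a sketch (illustrative only; the real change stores this mapping in the "recv" autograd function):

```python
# forward map: caller device -> callee device, as given to set_device_map
forward_map = {0: 1, 1: 2}

# the reverse mapping picked up automatically for the backward pass
reverse_map = {dst: src for src, dst in forward_map.items()}

print(reverse_map)  # {1: 0, 2: 1}
# a gradient for a tensor sent from caller device 0 flows back to device 0
assert reverse_map[forward_map[0]] == 0
```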

#Closes: pytorch#44170
ghstack-source-id: 119950842

Test Plan:
1) waitforbuildbot
2) Unit test added.

Reviewed By: mrshenli

Differential Revision: D23751975

fbshipit-source-id: 2717d0ef5bde3db029a6172d98aad95734d52140
…onv1d and conv3d

Test Plan: revert-hammer

Differential Revision:
D26089965 (pytorch@dd1a97b)

Original commit changeset: 4aea507d05b7

fbshipit-source-id: f54184cafb9dd07858683489d8bd147474e7e4b3
Summary:
Pull Request resolved: pytorch#51180

This fast path still did a refcount bump because it copied the inner intrusive_ptr to the stack. Now it's moved.
ghstack-source-id: 120460258

Test Plan:
1) profile empty benchmark & inspect assembly to verify move
2) run framework overhead benchmarks

Reviewed By: bhosmer

Differential Revision: D26094951

fbshipit-source-id: b2e09f9ad885cb633402885ca1e61a370723f6b8
Summary:
Fixes pytorch#49542

Pull Request resolved: pytorch#50039

Reviewed By: heitorschueroff

Differential Revision: D26096247

Pulled By: ngimel

fbshipit-source-id: ec1810d3412e0d7ab6b950265a3123519ad886c1
Get latest changes from the main repo
@imaginary-person imaginary-person merged this pull request into enable_min_max_half_types Jan 27, 2021
imaginary-person pushed a commit that referenced this pull request May 26, 2021
Summary: added more statistics for the static runtime

Test Plan:
caffe2/benchmarks/static_runtime:static_runtime_cpptest

Expected output example:

```
Static runtime ms per iter: 0.939483. Iters per second: 1064.41
Node #0: 0.195671 ms/iter, %wide_offset.1 : Tensor = aten::add(%wide.1, %self._mu, %4)
Node #1: 0.169457 ms/iter, %wide_normalized.1 : Tensor = aten::mul(%wide_offset.1, %self._sigma)
Node #2: 0.118218 ms/iter, %wide_preproc.1 : Tensor = aten::clamp(%wide_normalized.1, %5, %6)
Node #3: 0.038814 ms/iter, %user_emb_t.1 : Tensor = aten::transpose(%user_emb.1, %4, %7)
Node #4: 0.0860747 ms/iter, %dp_unflatten.1 : Tensor = aten::bmm(%ad_emb_packed.1, %user_emb_t.1)
Node #5: 0.0102666 ms/iter, %31 : Tensor = static_runtime::flatten_copy(%dp_unflatten.1, %4, %8)
Node #6: 0.000476333 ms/iter, %19 : Tensor[] = prim::ListConstruct(%31, %wide_preproc.1)
Node #7: 0.0707332 ms/iter, %input.1 : Tensor = aten::cat(%19, %4)
Node #8: 0.123695 ms/iter, %fc1.1 : Tensor = aten::addmm(%self._fc_b, %input.1, %29, %4, %4)
Node #9: 0.0309244 ms/iter, %23 : Tensor = aten::sigmoid(%fc1.1)
Node #10: 0.0046297 ms/iter, %24 : (Tensor) = prim::TupleConstruct(%23)
Time per node type:
       0.195671 ms.    23.0483%. aten::add (1 nodes)
       0.169457 ms.    19.9605%. aten::mul (1 nodes, out variant)
       0.123695 ms.    14.5702%. aten::addmm (1 nodes, out variant)
       0.118218 ms.     13.925%. aten::clamp (1 nodes, out variant)
      0.0860747 ms.    10.1388%. aten::bmm (1 nodes, out variant)
      0.0707332 ms.    8.33175%. aten::cat (1 nodes, out variant)
       0.038814 ms.    4.57195%. aten::transpose (1 nodes)
      0.0309244 ms.    3.64263%. aten::sigmoid (1 nodes, out variant)
      0.0102666 ms.    1.20932%. static_runtime::flatten_copy (1 nodes, out variant)
      0.0046297 ms.   0.545338%. prim::TupleConstruct (1 nodes, out variant)
    0.000476333 ms.  0.0561079%. prim::ListConstruct (1 nodes, out variant)
       0.848959 ms. in Total
StaticRuntime setup time: 0.018925 ms
Memory allocation time: 0.019808 ms
Memory deallocation time: 0.0120445 ms
Outputs deallocation time: 0.0864947 ms
Total memory managed: 19328 bytes
Total number of reused tensors: 3
Total number of 'out' variant nodes/total number of nodes: 9/11 (81.8182%)
```

Reviewed By: hlu1

Differential Revision: D28553029

fbshipit-source-id: 55e7eab50b4b475ae219896100bdf4f6678875a4
imaginary-person pushed a commit that referenced this pull request Jul 2, 2021
Summary:
Pull Request resolved: pytorch#60987

We were seeing deadlocks as follows during shutdown:

```
Thread 1 (LWP 2432101):
#0  0x00007efca470190b in __pause_nocancel () from /lib64/libc.so.6
#1  0x00007efca49de485 in __pthread_mutex_lock_full () from /lib64/libpthread.so.0
#2  0x00007ef91d4c42c6 in __cuda_CallJitEntryPoint () from /lib64/libnvidia-ptxjitcompiler.so.1
#3  0x00007efc651ac8f1 in ?? () from /lib64/libcuda.so
#4  0x00007efc651aee03 in ?? () from /lib64/libcuda.so
#5  0x00007efc64f76b84 in ?? () from /lib64/libcuda.so
#6  0x00007efc64f77f5d in ?? () from /lib64/libcuda.so
#7  0x00007efc64eac858 in ?? () from /lib64/libcuda.so
#8  0x00007efc64eacfbc in ?? () from /lib64/libcuda.so
#9  0x00007efc7810a924 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#10 0x00007efc780fa2be in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#11 0x00007efc78111044 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#12 0x00007efc7811580a in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#13 0x00007efc78115aa4 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#14 0x00007efc781079ec in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#15 0x00007efc780e6a7a in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#16 0x00007efc7811cfa5 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#17 0x00007efc777ea98c in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#18 0x00007efc777ebd80 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#19 0x00007efc777ea2c9 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#20 0x00007efc778c2e2d in cublasDestroy_v2 () from /usr/local/cuda/lib64/libcublas.so.11
#21 0x00007efc51a3fb56 in std::_Sp_counted_ptr_inplace<at::cuda::(anonymous namespace)::DeviceThreadHandlePool<cublasContext*, &at::cuda::(anonymous namespace)::createCublasHandle, &at::cuda::(anonymous namespace)::destroyCublasHandle>, std::allocator<at::cuda::(anonymous namespace)::DeviceThreadHandlePool<cublasContext*, &at::cuda::(anonymous namespace)::createCublasHandle, &at::cuda::(anonymous namespace)::destroyCublasHandle> >, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /data/users/pritam/pytorch/torch/lib/libtorch_cuda.so
#22 0x00007efc51a3fc5f in std::shared_ptr<at::cuda::(anonymous namespace)::DeviceThreadHandlePool<cublasContext*, &at::cuda::(anonymous namespace)::createCublasHandle, &at::cuda::(anonymous namespace)::destroyCublasHandle> >::~shared_ptr() () from /data/users/pritam/pytorch/torch/lib/libtorch_cuda.so
#23 0x00007efca4648b0c in __run_exit_handlers () from /lib64/libc.so.6
#24 0x00007efca4648c40 in exit () from /lib64/libc.so.6
#25 0x0000558c8852e5f9 in Py_Exit (sts=0) at /tmp/build/80754af9/python_1614362349910/work/Python/pylifecycle.c:2292
#26 0x0000558c8852e6a7 in handle_system_exit () at /tmp/build/80754af9/python_1614362349910/work/Python/pythonrun.c:636
#27 0x0000558c8852e742 in PyErr_PrintEx (set_sys_last_vars=<optimized out>, set_sys_last_vars=<optimized out>) at /tmp/build/80754af9/python_1614362349910/work/Python/pythonrun.c:646
#28 0x0000558c88540dd6 in PyRun_SimpleStringFlags (command=0x7efca4dc9050 "from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=9, pipe_handle=13)\n", flags=0x7ffe3a986110) at /tmp/build/80754af9/python_1614362349910/work/Python/pythonrun.c:457
#29 0x0000558c88540ead in pymain_run_command (cf=0x7ffe3a986110, command=<optimized out>) at /tmp/build/80754af9/python_1614362349910/work/Modules/main.c:420
#30 pymain_run_python (pymain=0x7ffe3a986220) at /tmp/build/80754af9/python_1614362349910/work/Modules/main.c:2907
#31 pymain_main (pymain=0x7ffe3a986220) at /tmp/build/80754af9/python_1614362349910/work/Modules/main.c:3460
#32 0x0000558c8854122c in _Py_UnixMain (argc=<optimized out>, argv=<optimized out>) at /tmp/build/80754af9/python_1614362349910/work/Modules/main.c:3495
#33 0x00007efca4632493 in __libc_start_main () from /lib64/libc.so.6
#34 0x0000558c884e5e90 in _start () at ../sysdeps/x86_64/elf/start.S:103
```

This was likely caused by a static singleton that wasn't leaky. Following
the guidance in https://isocpp.org/wiki/faq/ctors#construct-on-first-use-v2,
this change uses a leaky singleton instead.
ghstack-source-id: 132847448

Test Plan: Verified locally.

Reviewed By: malfet

Differential Revision: D29468866

fbshipit-source-id: 89250594c5cd2643417b1da584c658b742dc5a5c
imaginary-person pushed a commit that referenced this pull request Jul 20, 2021
Summary:
Pull Request resolved: pytorch#61588

As part of debugging pytorch#60290,
we discovered the following deadlock:

```
Thread 79 (Thread 0x7f52ff7fe700 (LWP 205437)):
#0  pthread_cond_timedwait@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
#1  0x0000564880199152 in PyCOND_TIMEDWAIT (cond=0x564880346080 <gil_cond>, mut=0x564880346100 <gil_mutex>, us=5000) at /home/builder/ktietz/cos6/ci_cos6/python_1622833237666/work/Python/condvar.h:103
#2  take_gil (tstate=0x7f5254005ef0) at /home/builder/ktietz/cos6/ci_cos6/python_1622833237666/work/Python/ceval_gil.h:224
#3  0x0000564880217b62 in PyEval_AcquireThread (tstate=0x7f5254005ef0) at /home/builder/ktietz/cos6/ci_cos6/python_1622833237666/work/Python/ceval.c:278
#4  0x00007f557d54aabd in pybind11::gil_scoped_acquire::gil_scoped_acquire() () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so
#5  0x00007f557da7792f in (anonymous namespace)::concrete_decref_fn(c10::impl::PyInterpreter const*, _object*) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so
#6  0x00007f5560dadba6 in c10::TensorImpl::release_resources() () from /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so
#7  0x00007f5574c885bc in std::_Sp_counted_ptr_inplace<torch::distributed::autograd::DistAutogradContext, std::allocator<torch::distributed::autograd::DistAutogradContext>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#8  0x00007f5574c815e9 in std::__detail::_Hashtable_alloc<std::allocator<std::__detail::_Hash_node<std::pair<long const, std::shared_ptr<torch::distributed::autograd::DistAutogradContext> >, false> > >::_M_deallocate_node(std::__detail::_Hash_node<std::pair<long const, std::shared_ptr<torch::distributed::autograd::DistAutogradContext> >, false>*) [clone .isra.325] () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#9  0x00007f5574c81bf1 in torch::distributed::autograd::DistAutogradContainer::eraseContextIdAndReset(torch::distributed::autograd::DistAutogradContainer::ContextsShard&, long) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#10 0x00007f5574c86e83 in torch::distributed::autograd::DistAutogradContainer::releaseContextIfPresent(long) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#11 0x00007f5574cc6395 in torch::distributed::rpc::RequestCallbackNoPython::processCleanupAutogradContextReq(torch::distributed::rpc::RpcCommandBase&) const () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#12 0x00007f5574cccf15 in torch::distributed::rpc::RequestCallbackNoPython::processRpc(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::vector<c10::Stream, std::allocator<c10::Stream> >) const () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so

Thread 72 (Thread 0x7f53077fe700 (LWP 205412)):
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f55bc62adbd in __GI___pthread_mutex_lock (mutex=0x564884396440) at ../nptl/pthread_mutex_lock.c:80
#2  0x00007f5574c82a2f in torch::distributed::autograd::DistAutogradContainer::retrieveContext(long) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#3  0x00007f557de9bb2f in pybind11::cpp_function::initialize<torch::distributed::autograd::(anonymous namespace)::dist_autograd_init(_object*, _object*)::{lambda(long)#11}, pybind11::dict, long, pybind11::name, pybind11::scope, pybind11::sibling, char [931], pybind11::arg>(torch::distributed::autograd::(anonymous namespace)::dist_autograd_init(_object*, _object*)::{lambda(long)#11}&&, pybind11::dict (*)(long), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, char const (&) [931], pybind11::arg const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so

```

Basically Thread 72, holds GIL and tries to acquire the lock for
DistAutogradContainer to perform a lookup on a map. On the other hand,
Thread 79 holds the lock on DistAutogradContainer to remove a Tensor and as
part of TensorImpl destructor, concrete_decref_fn is called which waits for
GIL. As a result, we have a deadlock.

To fix this issue, I've ensured we release GIL when we call `retrieveContext`
and acquire it later when needed.
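The shape of the fix can be sketched with plain threading locks (hypothetical stand-ins, not the actual RPC code): the retrieving thread gives up the GIL before taking the container lock, which breaks the hold-and-wait cycle.

```python
import threading

gil = threading.Lock()             # stand-in for the Python GIL
container_lock = threading.Lock()  # stand-in for DistAutogradContainer's mutex
done = []

def erase_context():
    # like Thread 79: holds the container lock, then needs the GIL
    with container_lock:
        with gil:
            done.append("erase")

def retrieve_context_fixed():
    # like the fixed Thread 72: never holds the GIL while waiting for the
    # container lock, so no lock-order inversion is possible
    with container_lock:
        pass
    with gil:
        done.append("retrieve")

t1 = threading.Thread(target=erase_context)
t2 = threading.Thread(target=retrieve_context_fixed)
t1.start(); t2.start()
t1.join(timeout=10); t2.join(timeout=10)
print(sorted(done))  # both threads finish: no deadlock
```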
ghstack-source-id: 133493659

Test Plan: waitforbuildbot

Reviewed By: mrshenli

Differential Revision: D29682624

fbshipit-source-id: f68a1fb39040ca0447a26e456a97bce64af6b79c