
Get latest code from main repo#7

Merged
imaginary-person merged 52 commits into imaginary-person:master from pytorch:master
Jan 27, 2021
Conversation

@imaginary-person
Owner

Get latest code from main repo

jfix71 and others added 30 commits January 23, 2021 17:20
Summary:
Pull Request resolved: pytorch/glow#5257

- Add RescaleQuantized parallelization support to graph opts' parallelization code
- On NNPI, mirror Rescale parallelization for FC/Relus that come before it
- Sink Reshapes below Quantize and ConvertTo
- Remove unnecessary ConvertTo when following a Dequantize (i.e. just change the elem kind of the Dequantize instead)

Test Plan: Added unit tests

Reviewed By: hyuen, mjanderson09

Differential Revision: D25947824

fbshipit-source-id: 771abd36a1bc7270bf1f901d1ec6cb6d78e9fd1f
…8436)

Summary:
This PR adds cusolver `gesvdj` and `gesvdjBatched` to the backend of `torch.svd`.

I've tested the performance using CUDA 11.1 on a 2070, V100, and A100. The cusolver `gesvdj` and `gesvdjBatched` routines outperform MAGMA in all square-matrix cases, so the cusolver backend will replace the MAGMA backend when available.

When both matrix dimensions are no greater than 32, `gesvdjBatched` is used. Otherwise, `gesvdj` is used.
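
The size rule above can be sketched as a small helper. This is only an illustration of the stated threshold: `choose_svd_kernel` and its `threshold` parameter are hypothetical names, not PyTorch functions, and the real C++ dispatch may check further conditions that this summary does not mention.

```python
def choose_svd_kernel(rows: int, cols: int, threshold: int = 32) -> str:
    """Return the cuSOLVER routine name suggested by the size rule (hypothetical helper)."""
    if rows <= 0 or cols <= 0:
        raise ValueError("matrix dimensions must be positive")
    # Both dimensions no greater than the threshold: use the batched Jacobi routine.
    if rows <= threshold and cols <= threshold:
        return "gesvdjBatched"
    # Otherwise fall back to the non-batched Jacobi routine.
    return "gesvdj"


for shape in [(16, 16), (32, 32), (33, 8), (512, 512)]:
    print(shape, "->", choose_svd_kernel(*shape))
```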

Detailed benchmark is available at https://github.com/xwang233/code-snippet/tree/master/svd.

Some relevant code and discussions
- https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/linalg/svd_op_gpu.cu.cc
- https://github.com/google/jax/blob/master/jaxlib/cusolver.cc
- cupy/cupy#3174
- tensorflow/tensorflow#13603
- https://www.nvidia.com/en-us/on-demand/session/gtcsiliconvalley2019-s9226/

See also #42666 #47953

Close #50516

Pull Request resolved: #48436

Reviewed By: ejguan

Differential Revision: D25977046

Pulled By: heitorschueroff

fbshipit-source-id: c27e705cd29b6fd7c8ac674c1f9f490fa26ee1bf
Summary:
Do it by removing extraneous header dependencies.
None of the at::vec256 primitives depend on the notion of a Tensor, so none of the headers that vec256 depends on should include <ATen/Tensor.h>.

Implicitly test this by removing the c10 and tensor dependencies when building `vec256_test_all_types`.
Split affine_quantizer into affine_quantizer_base (which contains methods operating on raw types) and affine_quantizer (which contains Tensor-specific methods).

Fixes #50567

Pull Request resolved: #50708

Reviewed By: walterddr

Differential Revision: D25949168

Pulled By: malfet

fbshipit-source-id: c3323be7252865a52c7d94026a5a39b494e44efb
Summary:
Now we can remove `_th_orgqr`!

Compared to the original TH-based `orgqr`, complex (#33152) and batched inputs are now supported.
CUDA support will be added in a follow-up PR.

Closes #24747

Ref. #49421, #42666

Pull Request resolved: #50502

Reviewed By: mrshenli

Differential Revision: D25953300

Pulled By: mruberry

fbshipit-source-id: f52a74e1c8f51b5e24f7b461430ca8fc96e4d149
…ke CLANGFORMAT`

Reviewed By: zertosh

Differential Revision: D26043955

fbshipit-source-id: 0a5740a82bdd3ac7bd1665a325ff7fe79488ccea
…0501)

Summary:
Follow up to #50435

I have confirmed this works by running
```
pytest test_ops.py -k test_fn_gradgrad_fft
```
both normally and with `PYTORCH_TEST_WITH_SLOW=1 PYTORCH_TEST_SKIP_FAST=1`. In the first case all tests are skipped; in the second they all run as they should.
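
The gating described here can be sketched as follows. This is a simplified model under the assumption that a test marked slow runs only when `PYTORCH_TEST_WITH_SLOW=1`, and that `PYTORCH_TEST_SKIP_FAST=1` skips tests not marked slow; `should_run` is a hypothetical helper, not part of the PyTorch test utilities.

```python
def should_run(is_slow_test: bool, env: dict) -> bool:
    """Decide whether a test runs, given environment flags (hypothetical helper)."""
    with_slow = env.get("PYTORCH_TEST_WITH_SLOW", "0") == "1"
    skip_fast = env.get("PYTORCH_TEST_SKIP_FAST", "0") == "1"
    if is_slow_test:
        # Slow tests are opt-in.
        return with_slow
    # Fast tests run unless explicitly skipped.
    return not skip_fast


slow_only = {"PYTORCH_TEST_WITH_SLOW": "1", "PYTORCH_TEST_SKIP_FAST": "1"}
print(should_run(True, {}))         # slow test, default environment
print(should_run(True, slow_only))  # slow test, slow-only environment
```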

Pull Request resolved: #50501

Reviewed By: ezyang

Differential Revision: D25956416

Pulled By: mruberry

fbshipit-source-id: c896a8cec5f19b8ffb9b168835f3743b6986dad7
Summary:
Related to #50483.

All tests except the ONNX, detectron, and release notes tests are moved to use common_utils.run_tests() to ensure CI reports XML correctly.

Pull Request resolved: #50923

Reviewed By: samestep

Differential Revision: D26027621

Pulled By: walterddr

fbshipit-source-id: b04c03f10d1fe96181b720c4c3868e86e4c6281a
Summary:
Pull Request resolved: #50817

Replace some longs with int64_t.  Thanks Tom Heaven for contributing
this patch.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D25975915

Pulled By: ezyang

fbshipit-source-id: c1061a85f80ad17fa4fb313da797bc6d5ba203c2
Summary:
Fixes #47117

Pull Request resolved: #51031

Reviewed By: bdhirsh

Differential Revision: D26047498

Pulled By: albanD

fbshipit-source-id: dd0a7d9f97c0f6469b3050d2e3b4473f1bee3820
…n for upsample_nearest2d" (#50794)

Summary:
Pull Request resolved: #50794

Original commit changeset: b4a7948088c0

There are some subtle extra tweaks on top of the original. I can unbundle them, but I've opted to keep it with the port because it's the easiest way to make sure the changes are exercised.

* There's a bugfix in the codegen to test if a dispatch key is structured *before* short circuiting because the dispatch key was missing in the table. This accounts for mixed structured-nonstructured situations where the dispatch table is present, but the relevant structured key isn't (because the dispatch table only exists to register, e.g., QuantizedCPU)
* Dispatch tables for functions which delegate to structured kernels don't have Math entries generated for them.
* It's now illegal to specify a structured dispatch key in a delegated structured kernel (it will be ignored!). `add` is now fixed to follow this.
* There are some extra sanity checks for NativeFunctions validation
* Finally, unlike the original PR, I switched the .vec variant of upsample_nearest2d to also be DefaultBackend, bringing it in line with upsample_nearest1d.
ghstack-source-id: 120038038

Test Plan:
```
buck test mode/dev //coreai/tiefenrausch:python_tests -- --exact 'coreai/tiefenrausch:python_tests - test_can_run_local_async_inference_cpu (coreai.tiefenrausch.tests.python_test.TiefenrauschPY)' --run-disabled
```

Reviewed By: ngimel

Differential Revision: D25962873

fbshipit-source-id: d29a9c97f15151db3066ae5efe7a0701e6dc05a3
Summary:
This simplifies our handling and allows passing CompilationUnits from Python to C++ defined functions via PyBind easily.

Discussed on Slack with SplitInfinity

Pull Request resolved: #50614

Reviewed By: anjali411

Differential Revision: D25938005

Pulled By: SplitInfinity

fbshipit-source-id: 94aadf0c063ddfef7ca9ea17bfa998d8e7b367ad
Summary:
The test is flaky on ROCM when deadline is set to 1 second. This is affecting builds as it is failing randomly.
Disabling for now.

Signed-off-by: Arindam Roy <rarindam@gmail.com>

Pull Request resolved: #50964

Reviewed By: houseroad

Differential Revision: D26049370

Pulled By: BIT-silence

fbshipit-source-id: 22337590a8896ad75f1281e56fbbeae897f5c3b2
Summary:
Updates the docstrings to state that `jvp` and `vjp` both return the primal `func_output` first as part of the return tuple,
in line with the docstrings of [hvp](https://github.com/niklasschmitz/pytorch/blob/c620572a3477143df33128318dc0d7d10fab811d/torch/autograd/functional.py#L671) and [vhp](https://github.com/niklasschmitz/pytorch/blob/c620572a3477143df33128318dc0d7d10fab811d/torch/autograd/functional.py#L583).

Pull Request resolved: #51035

Reviewed By: bdhirsh

Differential Revision: D26047693

Pulled By: albanD

fbshipit-source-id: 5f2957a858826b4c1884590b6be7a8bed0791efd
… {functional relu/module relu} (#50975)

Summary: Pull Request resolved: #50975

Test Plan: Imported from OSS

Reviewed By: supriyar

Differential Revision: D26032532

fbshipit-source-id: a084fb4fd711ad52b2da1c6378cbcc2b352976c6
Summary:
- Resolves ngimel's review comments in #49109
- Move `ConvolutionArgs` from `ConvShared.h` to `Conv_v7.cpp`, because cuDNN v8 uses different descriptors and therefore will not share the same `ConvolutionArgs`.
- Refactor the `ConvolutionParams` (the hash key for benchmark):
  - Remove `input_stride`
  - Add `input_dim`
  - Add `memory_format`
- Make `repro_from_args` take `ConvolutionParams` instead of `ConvolutionArgs` as its argument so that it can be shared between v7 and v8
- Rename some `layout` to `memory_format`. `layout` should be sparse/strided and `memory_format` should be contiguous/channels_last. They are different things.

Pull Request resolved: #50827

Reviewed By: bdhirsh

Differential Revision: D26048274

Pulled By: ezyang

fbshipit-source-id: f71aa02d90ffa581c17ab05b171759904b311517
Summary:
This is a follow up PR of #48493.

Fixes #48492

Pull Request resolved: #50824

Reviewed By: bdhirsh

Differential Revision: D26050736

Pulled By: ezyang

fbshipit-source-id: 049605fd271cff28c8b6e300c163e9df3b3ea23b
…0957)

Summary:
Pull Request resolved: #50957

MAGMA has an off-by-one error in its batched Cholesky implementation that causes illegal memory accesses for certain inputs. The workaround implemented in this PR is to pad the input to MAGMA with 1 extra element.

**Benchmark**
Ran the script below both before and after this PR and got similar results.

*Script*
```
import torch
from torch.utils import benchmark

DTYPE = torch.float32
BATCHSIZE = 512 * 512
MATRIXSIZE = 16

a = torch.eye(MATRIXSIZE, device='cuda', dtype=DTYPE)

t0 = benchmark.Timer(
    stmt='torch.cholesky(a)',
    globals={'a': a},
    label='Single'
)

t1 = benchmark.Timer(
    stmt='torch.cholesky(a)',
    globals={'a': a.expand(BATCHSIZE, -1, -1)},
    label='Batched'
)

print(t0.timeit(100))
print(t1.timeit(100))
```

*Results before*
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7faf9bc63400>
Single
  2.08 ms
  1 measurement, 100 runs , 1 thread
<torch.utils.benchmark.utils.common.Measurement object at 0x7faf9bc63400>
Batched
  7.68 ms
  1 measurement, 100 runs , 1 thread
```

*Results after*
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7faf9bc63400>
Single
  2.10 ms
  1 measurement, 100 runs , 1 thread
<torch.utils.benchmark.utils.common.Measurement object at 0x7faf9bc63400>
Batched
  7.56 ms
  1 measurement, 100 runs , 1 thread
```

Fixes #41394, #26996, #48996

See also #42666, #26789

TODO
 ---
- [x] Benchmark to check for perf regressions

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D26050978

Pulled By: heitorschueroff

fbshipit-source-id: 7a5ba7e34c9d74b58568b2a0c631cc6d7ba63f86
Summary:
Pull Request resolved: #50791

Add a dedicated pipeline parallelism doc page explaining the APIs and
the overall value of the module.
ghstack-source-id: 120257168

Test Plan:
1) View locally
2) waitforbuildbot

Reviewed By: rohan-varma

Differential Revision: D25967981

fbshipit-source-id: b607b788703173a5fa4e3526471140506171632b
…50868)

Summary:
Pull Request resolved: #50868

Ensures that `FakeQuantize` respects device affinity when loading from
state_dict, and knows how to resize scale and zero_point values
(which is necessary for FQ classes wrapping per channel observers).

This is the same as #44537, but for
`FakeQuantize`.

Test Plan:
```
python test/test_quantization.py TestObserver.test_state_dict_respects_device_affinity
```

Imported from OSS

Reviewed By: jerryzh168

Differential Revision: D25991570

fbshipit-source-id: 1193a6cd350bddabd625aafa0682e2e101223bb1
Summary:
**BC-breaking note:**

torch.svd() added support for complex inputs in PyTorch 1.7, but was not documented as doing so. The complex "V" tensor returned was actually the complex conjugate of what's expected. This PR fixes the discrepancy.

This will silently break all users of torch.svd() with complex inputs.

**Original PR Summary:**

This PR resolves #45821.

The problem was that when introducing the support of complex inputs for `torch.svd` it was overlooked that LAPACK/MAGMA returns the conjugate transpose of V matrix, not just the transpose of V. So `torch.svd` was silently returning U, S, V.conj() instead of U, S, V.
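
A small pure-Python check makes the discrepancy concrete. It is only a sketch: the 2x2 matrices are hand-picked for illustration, and `matmul`, `conj_transpose`, `conj`, and `close` are hypothetical helpers rather than PyTorch calls. With `A = U S V^H`, substituting `V.conj()` for `V` no longer reconstructs `A` when `V` has non-real entries.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def conj_transpose(m):
    """Return the conjugate transpose (Hermitian adjoint) of m."""
    return [[m[i][j].conjugate() for i in range(len(m))]
            for j in range(len(m[0]))]


def conj(m):
    """Return the elementwise complex conjugate of m."""
    return [[x.conjugate() for x in row] for row in m]


def close(a, b, tol=1e-12):
    """Elementwise comparison within a tolerance."""
    return all(abs(x - y) < tol for ra, rb in zip(a, b) for x, y in zip(ra, rb))


r = 1 / math.sqrt(2)
U = [[1 + 0j, 0j], [0j, 1 + 0j]]          # identity, trivially unitary
S = [[2 + 0j, 0j], [0j, 1 + 0j]]          # singular values on the diagonal
V = [[r + 0j, r * 1j], [r * 1j, r + 0j]]  # unitary with complex entries

A = matmul(matmul(U, S), conj_transpose(V))  # A = U S V^H by construction

correct = close(matmul(matmul(U, S), conj_transpose(V)), A)
conjugated = close(matmul(matmul(U, S), conj_transpose(conj(V))), A)
print(correct, conjugated)
```

Code that compensated for the old behavior by conjugating `V` itself would therefore need to drop that step after this change.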

Behavior of `torch.linalg.pinv`, `torch.pinverse` and `torch.linalg.svd` (they depend on `torch.svd`) is not changed in this PR.

Pull Request resolved: #51012

Reviewed By: bdhirsh

Differential Revision: D26047593

Pulled By: albanD

fbshipit-source-id: d1e08dbc3aab9ce1150a95806ef3b5da98b5d3ca
Summary:
Pull Request resolved: #50458

libinterpreter.so contains a frozen python distribution including
torch-python bindings.

Freezing refers to serializing bytecode of python standard library modules as
well as the torch python library and embedding them in the library code.  This
library can then be dlopened multiple times in one process context, each
interpreter having its own python state and GIL.  In addition, each python
environment is sealed off from the filesystem and can only import the frozen
modules included in the distribution.

This change relies on newly added frozenpython, a cpython 3.8.6 fork built for this purpose.  Frozenpython provides libpython3.8-frozen.a which
contains frozen bytecode and object code for the python standard library.

Building on top of frozen python, the frozen torch-python bindings are added in
this diff, providing each embedded interpreter with a copy of the torch
bindings.  Each interpreter is intended to share one instance of libtorch and
the underlying tensor libraries.

Known issues

- Autograd is not expected to work with the embedded interpreter currently, as it manages
its own python interactions and needs to coordinate with the duplicated python
states in each of the interpreters.
- Distributed and cuda stuff is disabled in libinterpreter.so build, needs to be revisited
- __file__ is not supported in the context of embedded python since there are no
files for the underlying library modules.
using __file__
- __version__ is not properly supported in the embedded torch-python, just a
workaround for now

Test Plan: tested locally and on CI with cmake and buck builds running torch::deploy interpreter_test

Reviewed By: ailzhang

Differential Revision: D25850783

fbshipit-source-id: a4656377caff25b73913daae7ae2f88bcab8fd88
Summary:
Pull Request resolved: #50622

1. Define a DDPLoggingData struct that is the placeholder for all the ddp related logging fields
2. Put the DDPLoggingData struct in the C10 directory so that it can be easily imported by c10 and torch files
3. Expose get_ddp_logging_data() method in python so that users can get the logging data and dump in their applications
4. Unit tests verify that the logging data can be set and retrieved as expected
5. Follow-ups will add more logging fields such as perf stats, internal states, env variables, etc.
ghstack-source-id: 120275870

Test Plan: unit tests

Reviewed By: SciPioneer

Differential Revision: D25930527

fbshipit-source-id: 290c200161019c58e28eed9a5a2a7a8153113f99
Summary:
Fixes #50496
Fixes #34859
Fixes #21596

This fixes many bugs involving `TransformedDistribution` and `ComposeTransform` when the component transforms changed their event shapes. Part of the fix is to introduce an `IndependentTransform` analogous to `distributions.Independent` and `constraints.independent`, and to introduce methods `Transform.forward_shape()` and `.inverse_shape()`. I have followed fehiepsi's suggestion and replaced `.input_event_dim` -> `.domain.event_dim` and `.output_event_dim` -> `.codomain.event_dim`. This allows us to deprecate `.event_dim` as an attribute.

## Summary of changes

- Fixes `TransformedDistribution` and `ComposeTransform` shape errors.
- Fixes a behavior bug in `LogisticNormal`.
- Fixes `kl_divergence(TransformedDistribution, TransformedDistribution)`
- Adds methods `Transform.forward_shape()`, `.inverse_shape()` which are required for correct shape computations in `TransformedDistribution` and `ComposeTransform`.
- Adds an `IndependentTransform`.
- Adds a `ReshapeTransform` which is invaluable in testing shape logic in `ComposeTransform` and `TransformedDistribution` and which will be used by stefanwebb flowtorch.
- Fixes incorrect default values in `constraints.dependent.event_dim`.
- Documents the `.event_dim` and `.is_discrete` attributes.
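
To illustrate why shape methods are needed when a transform changes the event shape, here is a minimal pure-Python sketch. `ReshapeSketch` and the two `compose_*` functions are hypothetical stand-ins written for this example; they mimic the idea of `forward_shape()` / `inverse_shape()` but are not the `torch.distributions` implementation.

```python
from math import prod


class ReshapeSketch:
    """Toy transform that reshapes the rightmost dimensions (hypothetical)."""

    def __init__(self, in_shape, out_shape):
        if prod(in_shape) != prod(out_shape):
            raise ValueError("in_shape and out_shape must have equal size")
        self.in_shape = tuple(in_shape)
        self.out_shape = tuple(out_shape)

    @staticmethod
    def _swap(shape, old, new):
        # Replace the trailing `old` dimensions with `new`, keeping batch dims.
        cut = len(shape) - len(old)
        if cut < 0 or shape[cut:] != old:
            raise ValueError(f"shape {shape} does not end with {old}")
        return shape[:cut] + new

    def forward_shape(self, shape):
        return self._swap(tuple(shape), self.in_shape, self.out_shape)

    def inverse_shape(self, shape):
        return self._swap(tuple(shape), self.out_shape, self.in_shape)


def compose_forward_shape(parts, shape):
    """Thread a shape through each part in order."""
    for part in parts:
        shape = part.forward_shape(shape)
    return shape


def compose_inverse_shape(parts, shape):
    """Thread a shape backwards through the parts."""
    for part in reversed(parts):
        shape = part.inverse_shape(shape)
    return shape


parts = [ReshapeSketch((6,), (2, 3)), ReshapeSketch((2, 3), (3, 2))]
print(compose_forward_shape(parts, (5, 6)))
print(compose_inverse_shape(parts, (5, 3, 2)))
```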

## Changes planned for follow-up PRs

- Memoize `constraints.dependent_property` as we do with `lazy_property`, since we now consult those properties much more often.

## Tested
- [x] added a test for `Dist.support` vs `Dist(**params).support` to ensure static and dynamic attributes agree.
- [x] refactoring is covered by existing tests
- [x] add test cases for `ReshapeTransform`
- [x] add a test for `TransformedDistribution` on a wide grid of input shapes
- [x] added a regression test for #34859

cc fehiepsi feynmanliang stefanwebb

Pull Request resolved: #50581

Reviewed By: ezyang, glaringlee, jpchen

Differential Revision: D26024247

Pulled By: neerajprad

fbshipit-source-id: f0b9a296f780ff49659b132409e11a29985dde9b
Summary:
Pull Request resolved: #51043

This PR makes `fast_nvcc` stop at failing commands, rather than continuing on to run commands that would otherwise run after those commands. It is still possible for `fast_nvcc` to run more commands than `nvcc` would run if there's no dependency between them, but this should still help to reduce noise from failing `fast_nvcc` runs.
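
The behavior described above can be modeled with a short sketch: a failed command prevents its dependents from running, while unrelated commands may still run. `run_with_dependencies`, the command names, and the `deps` mapping are all hypothetical and do not reflect how `fast_nvcc` is actually implemented.

```python
def run_with_dependencies(commands, deps, run):
    """Run commands in order, skipping any whose dependency did not succeed.

    commands: names listed so that dependencies come before dependents (assumed).
    deps: mapping from a name to the set of names it depends on.
    run: callable taking a name and returning True on success.
    """
    status = {}
    for name in commands:
        # Skip when any dependency failed, was skipped, or has not run.
        if any(status.get(dep) != "ok" for dep in deps.get(name, ())):
            status[name] = "skipped"
            continue
        status[name] = "ok" if run(name) else "failed"
    return status


commands = ["preprocess", "compile", "link", "unrelated"]
deps = {"compile": {"preprocess"}, "link": {"compile"}}
result = run_with_dependencies(commands, deps, run=lambda name: name != "compile")
print(result)
```

In this toy run, `link` is skipped because `compile` failed, but `unrelated` still executes, which mirrors the note that independent commands can still run.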

Test Plan: Unfortunately the test suite for this script is FB-internal. It would probably be a good idea to move it into the PyTorch GitHub repo, but I'm not entirely sure how to do so, since I don't believe we currently have a good place to put tests for things in `tools`.

Reviewed By: malfet

Differential Revision: D26007788

fbshipit-source-id: 8fe1e7d020a29d32d08fe55fb59229af5cdfbcaa
Summary:
Pull Request resolved: #51051

Disable input pointer caching on iOS. We are seeing some issues with this on some iOS devices.

Test Plan:
FB:
Test this in of IG with BT effect.

Reviewed By: IvanKobzarev, AshkanAliabadi

Differential Revision: D25984429

fbshipit-source-id: f6ceef606994b22de9cdd9752115b3481cd7bd96
Summary:
See above.

Pull Request resolved: #51046

Reviewed By: ZolotukhinM

Differential Revision: D26053419

Pulled By: Chillee

fbshipit-source-id: 9cc2dc434239a1ad77d30a1e5c0a9592be4944dc
Summary:
Related to issue #42666

Pull Request resolved: #49168

Reviewed By: mrshenli

Differential Revision: D25954027

Pulled By: mruberry

fbshipit-source-id: e429f9587efff5e638bfd0e4de864c06f41c63b1
…e first K iterations (#50973)

Summary:
Pull Request resolved: #50973

This can extend the original PowerSGD method to a hybrid approach: vanilla allreduce + PowerSGD. This can help further improve the accuracy, at the cost of a lower speedup.
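
The hybrid schedule can be sketched in plain Python. This is a toy model only: `HybridHook`, `start_compress_iter`, and the rounding-based `toy_compressed` stand-in are hypothetical, and no actual PowerSGD low-rank compression or distributed communication is performed.

```python
def allreduce_mean(worker_grads):
    """Exact elementwise average across workers (stand-in for vanilla allreduce)."""
    count = len(worker_grads)
    return [sum(values) / count for values in zip(*worker_grads)]


def toy_compressed(worker_grads):
    """Lossy stand-in for a compressed allreduce; NOT PowerSGD itself."""
    return [round(value, 1) for value in allreduce_mean(worker_grads)]


class HybridHook:
    """Use exact allreduce for the first K calls, then the compressed path."""

    def __init__(self, start_compress_iter, compressed_allreduce):
        self.start_compress_iter = start_compress_iter  # K (hypothetical name)
        self.compressed_allreduce = compressed_allreduce
        self.iteration = 0
        self.history = []

    def __call__(self, worker_grads):
        use_exact = self.iteration < self.start_compress_iter
        self.iteration += 1
        if use_exact:
            self.history.append("allreduce")
            return allreduce_mean(worker_grads)
        self.history.append("compressed")
        return self.compressed_allreduce(worker_grads)


hook = HybridHook(start_compress_iter=2, compressed_allreduce=toy_compressed)
grads = [[1.0, 2.0], [3.0, 4.0]]
outputs = [hook(grads) for _ in range(3)]
print(hook.history)
```

A larger K keeps more early iterations exact, which is the accuracy-for-speed trade-off the summary describes.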

Also add more comments on the fields in `PowerSGDState`.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120257202

Test Plan:
buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl

buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook

Reviewed By: rohan-varma

Differential Revision: D26031478

fbshipit-source-id: d72e70bb28ba018f53223c2a4345306980b3084e
Summary:
Pull Request resolved: #50974

Typo fixes.

Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202
ghstack-source-id: 120257221

Test Plan: N/A

Reviewed By: rohan-varma

Differential Revision: D26031679

fbshipit-source-id: 9d049b50419a3e40e53f7f1275a441e31b87717b
…#50854)

Summary: Pull Request resolved: #50854

Test Plan: Imported from OSS

Reviewed By: bhosmer

Differential Revision: D26008542

Pulled By: ailzhang

fbshipit-source-id: e9c0aa97ac2537ff612f5faf348fcb613da09479
Mike Ruberry and others added 22 commits January 26, 2021 02:07
…eter

Test Plan: revert-hammer

Differential Revision:
D25850783 (3192f9e)

Original commit changeset: a4656377caff

fbshipit-source-id: 1c7133627da28fb12848da7a9a46de6d3b2b67c6
Summary:
Pull Request resolved: #50744

This PR adds a `check_batched_grad=True` option to CriterionTest and
turns it on by default for all CriterionTest-generated tests

Test Plan: - run tests

Reviewed By: ejguan

Differential Revision: D25997676

Pulled By: zou3519

fbshipit-source-id: cc730731e6fae2bddc01bc93800fd0e3de28b32d
Summary:
Closes #40702, Fixes #40690

Currently wip. But I would appreciate some feedback. Functions should be double-differentiable.

Contrary to https://github.com/pytorch/pytorch/blob/b35cdc5200af963e410c0a25400fd07f30b89bca/torch/nn/parallel/_functions.py
This PR generates a list of tensors instead of aggregating the received data in a single tensor. Is this behavior correct?

Thanks!

Pull Request resolved: #40762

Reviewed By: glaringlee

Differential Revision: D24758889

Pulled By: mrshenli

fbshipit-source-id: 79285fb4b791cae3d248f34e2aadb11c9ab10cce
Summary:
Removed skipCUDAIfRocm to re-enable tests for
the ROCM platform.

Initially, only 4799 cases were being run.
Out of those, 882 cases were being skipped.
After removing skipCUDAIfRocm from two places
in test_ops.py, more than 8000 cases are now
being executed, out of which only 282 cases
are being skipped, which are FFT-related tests.

Signed-off-by: Arindam Roy <rarindam@gmail.com>

Fixes #{issue number}

Pull Request resolved: #50500

Reviewed By: albanD

Differential Revision: D25920303

Pulled By: mrshenli

fbshipit-source-id: b2d17b7e2d1de4f9fdd6f1660fb4cad5841edaa0
Summary:
This is an automated pull request to update the first-party submodule for [pytorch/tensorpipe](https://github.com/pytorch/tensorpipe).

New submodule commit: pytorch/tensorpipe@f463e0e

Pull Request resolved: #50946

Test Plan: Ensure that CI jobs succeed on GitHub before landing.

Reviewed By: lw

Differential Revision: D26018916

fbshipit-source-id: dc8aaa98d4e002e972d5c6783f2351c29f7db239
Summary:
This fixes the following flaky test on machines with GPUs of different architectures:
```
_________________________________________________________________________________________________________________ TestCppExtensionJIT.test_jit_cuda_archflags __________________________________________________________________________________________________________________

self = <test_cpp_extensions_jit.TestCppExtensionJIT testMethod=test_jit_cuda_archflags>

    unittest.skipIf(not TEST_CUDA, "CUDA not found")
    unittest.skipIf(TEST_ROCM, "disabled on rocm")
    def test_jit_cuda_archflags(self):
        # Test a number of combinations:
        #   - the default for the machine we're testing on
        #   - Separators, can be ';' (most common) or ' '
        #   - Architecture names
        #   - With/without '+PTX'

        capability = torch.cuda.get_device_capability()
        # expected values is length-2 tuple: (list of ELF, list of PTX)
        # note: there should not be more than one PTX value
        archflags = {
            '': (['{}{}'.format(capability[0], capability[1])], None),
            "Maxwell+Tegra;6.1": (['53', '61'], None),
            "Pascal 3.5": (['35', '60', '61'], None),
            "Volta": (['70'], ['70']),
        }
        if int(torch.version.cuda.split('.')[0]) >= 10:
            # CUDA 9 only supports compute capability <= 7.2
            archflags["7.5+PTX"] = (['75'], ['75'])
            archflags["5.0;6.0+PTX;7.0;7.5"] = (['50', '60', '70', '75'], ['60'])

        for flags, expected in archflags.items():
>           self._run_jit_cuda_archflags(flags, expected)

test_cpp_extensions_jit.py:198:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
test_cpp_extensions_jit.py:158: in _run_jit_cuda_archflags
    _check_cuobjdump_output(expected[0])
test_cpp_extensions_jit.py:134: in _check_cuobjdump_output
    self.assertEqual(actual_arches, expected_arches,
../../.local/lib/python3.9/site-packages/torch/testing/_internal/common_utils.py:1211: in assertEqual
    super().assertEqual(len(x), len(y), msg=self._get_assert_msg(msg, debug_msg=debug_msg))
E   AssertionError: 2 != 1 : Attempted to compare the lengths of [iterable] types: Expected: 2; Actual: 1.
E   Flags: ,  Actual: ['sm_75', 'sm_86'],  Expected: ['sm_86']
E   Stderr:
E   Output: ELF file    1: cudaext_archflags.1.sm_75.cubin
E   ELF file    2: cudaext_archflags.2.sm_86.cubin

```

Pull Request resolved: #50405

Reviewed By: albanD

Differential Revision: D25920200

Pulled By: mrshenli

fbshipit-source-id: 1042a984142108f954a283407334d39e3ec328ce
Summary:
`ResolutionCallback` returns `py::object` (i.e. `Any`) rather than `py::function` (i.e. `Callable`)

Discovered while debugging test failures after updating pybind11

This also makes resolution code slightly faster, as it eliminates casts from object to function and back for every `py::object obj = rcb_(name);` statement.

Pull Request resolved: #51089

Reviewed By: jamesr66a

Differential Revision: D26069295

Pulled By: malfet

fbshipit-source-id: 6876caf9b4653c8dc8e568aefb6778895decea05
)

Summary:
Closes #50513 by resolving all four checkboxes. If this PR is merged, I will also modify one or both of the following wiki pages to add instructions on how to use this `mypy` wrapper for VS Code editor integration:

- [Guide for adding type annotations to PyTorch](https://github.com/pytorch/pytorch/wiki/Guide-for-adding-type-annotations-to-PyTorch)
- [Lint as you type](https://github.com/pytorch/pytorch/wiki/Lint-as-you-type)

Pull Request resolved: #50826

Test Plan:
Unit tests for globbing function:
```
python test/test_testing.py TestMypyWrapper -v
```

Manual checks:

- Uninstall `mypy` and run `python test/test_type_hints.py` to verify that it still works when `mypy` is absent.
- Reinstall `mypy` and run `python test/test_type_hints.py` to verify that this didn't break the `TestTypeHints` suite.
- Run `python test/test_type_hints.py` again (should finish quickly) to verify that this didn't break `mypy` caching.
- Run `torch/testing/_internal/mypy_wrapper.py` on a few Python files in this repo to verify that it doesn't give any additional warnings when the `TestTypeHints` suite passes. Some examples (compare with the behavior of just running `mypy` on these files):
  ```sh
  torch/testing/_internal/mypy_wrapper.py $PWD/README.md
  torch/testing/_internal/mypy_wrapper.py $PWD/tools/fast_nvcc/fast_nvcc.py
  torch/testing/_internal/mypy_wrapper.py $PWD/test/test_type_hints.py
  torch/testing/_internal/mypy_wrapper.py $PWD/torch/random.py
  torch/testing/_internal/mypy_wrapper.py $PWD/torch/testing/_internal/mypy_wrapper.py
  ```
- Remove type hints from `torch.testing._internal.mypy_wrapper` and verify that running `mypy_wrapper.py` on that file gives type errors.
- Remove the path to `mypy_wrapper.py` from the `files` setting in `mypy-strict.ini` and verify that running it again on itself no longer gives type errors.
- Add `test/test_type_hints.py` to the `files` setting in `mypy-strict.ini` and verify that running the `mypy` wrapper on it again now gives type errors.
- Change a return type in `torch/random.py` and verify that running the `mypy` wrapper on it again now gives type errors.
- Add the suggested JSON from the docstring of `torch.testing._internal.mypy_wrapper.main` to your `.vscode/settings.json` and verify that VS Code gives the same results (inline, while editing any Python file in the repo) as running the `mypy` wrapper on the command line, in all the above cases.

Reviewed By: walterddr

Differential Revision: D26049052

Pulled By: samestep

fbshipit-source-id: 0b35162fc78976452b5ea20d4ab63937b3c7695d
Summary:
Pull Request resolved: #50630

Add a warning log to the distributed optimizer, to warn users that the optimizer
is created without TorchScript support.

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D25932777

Pulled By: wanchaol

fbshipit-source-id: 8db3b98bdd27fc04c5a3b8d910b028c0c37f138d
Summary:
Fixes #{issue number}

Pull Request resolved: #50442

Reviewed By: bdhirsh

Differential Revision: D26044981

Pulled By: mruberry

fbshipit-source-id: 65c42f2c1de8d24e4852a1b5bd8f4b1735b2230e
Summary: Pull Request resolved: #50976

Test Plan: Imported from OSS

Reviewed By: supriyar

Differential Revision: D26032531

fbshipit-source-id: 9725bab8f70ac79652e7bf9f94376917438d60e0
Test Plan: revert-hammer

Differential Revision:
D26018916 (5f297cc)

Original commit changeset: dc8aaa98d4e0

fbshipit-source-id: cd81a7950c7141e0711faabf03292098a8cf14d3
Test Plan:
buck test //caffe2/test:test_fx_experimental
buck test //glow/fb/fx_nnpi_importer:test_importer

Reviewed By: jfix71

Differential Revision: D25675618

fbshipit-source-id: 55636bb2d3d6102b400f2044118a450906954083
Summary:
In Python-3.9 and above `inspect.getsource` of a local class does not work if it was marked as default, see https://bugs.python.org/issue42666 #49617
Work around this by defining a `make_global` function that programmatically accomplishes the same thing.
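
A rough sketch of the idea follows. The signature of the PR's `make_global` is not shown in this summary, so the version below is an assumption: it takes an explicit namespace dictionary and registers each object under its `__name__`. Passing `globals()` as the namespace would register the objects at module scope.

```python
def make_global(namespace, *objects):
    """Register each object in `namespace` under its own name (assumed signature)."""
    for obj in objects:
        namespace[obj.__name__] = obj


def build_local_class():
    # A class defined inside a function is local to that function.
    class LocalPoint:
        def __init__(self, x):
            self.x = x

    return LocalPoint


module_namespace = {}  # stand-in for a module's globals()
local_cls = build_local_class()
make_global(module_namespace, local_cls)
print(sorted(module_namespace))
```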

Partially addresses issue raised in #49617

Pull Request resolved: #51088

Reviewed By: gmagogsfm

Differential Revision: D26069189

Pulled By: malfet

fbshipit-source-id: 7cf14b88ae5d2b95d2b0fd852717a9202b86356e
Summary:
Pull Request resolved: #51113

toTensor() on an lvalue IValue returns a reference; no need to copy.
ghstack-source-id: 120317233

Test Plan:
fitsships

Compared `perf stat` results before/after (was on top of a diff stack
so don't take baseline as where master is)

Before:
```
         74,178.77 msec task-clock                #    0.999 CPUs utilized            ( +-  0.31% )
            17,125      context-switches          #    0.231 K/sec                    ( +-  3.41% )
                 3      cpu-migrations            #    0.000 K/sec
           109,535      page-faults               #    0.001 M/sec                    ( +-  1.04% )
   146,803,364,372      cycles                    #    1.979 GHz                      ( +-  0.30% )  (50.03%)
   277,726,600,254      instructions              #    1.89  insn per cycle           ( +-  0.02% )  (50.03%)
    43,299,659,815      branches                  #  583.720 M/sec                    ( +-  0.03% )  (50.03%)
       130,504,094      branch-misses             #    0.30% of all branches          ( +-  1.14% )  (50.03%)
```

After:
```
         72,695.01 msec task-clock                #    0.999 CPUs utilized            ( +-  1.18% )
            15,994      context-switches          #    0.220 K/sec                    ( +-  5.21% )
                 3      cpu-migrations            #    0.000 K/sec
           107,743      page-faults               #    0.001 M/sec                    ( +-  1.55% )
   145,647,684,269      cycles                    #    2.004 GHz                      ( +-  0.30% )  (50.05%)
   277,341,084,993      instructions              #    1.90  insn per cycle           ( +-  0.02% )  (50.04%)
    43,200,717,263      branches                  #  594.273 M/sec                    ( +-  0.02% )  (50.05%)
       143,873,086      branch-misses             #    0.33% of all branches          ( +-  0.59% )  (50.05%)
```

Looks like a 0.7% cycles win (barely outside the noise) and a 0.1%
instructions win.

Reviewed By: hlu1

Differential Revision: D26051766

fbshipit-source-id: 05f8d71d8120d79f7cd80aca747dfc537bf7d382
Summary:
Pull Request resolved: #51047

If the environment variable `TORCH_VITAL` is set to a non-empty string, the vitals are dumped at program end.

The API is very similar to Google's logging.

Test Plan: buck test //caffe2/aten:vitals

Reviewed By: bitfort

Differential Revision: D25791248

fbshipit-source-id: 0b40da7d22c31d2c4b2094f0dcb1229a35338ac2
Summary:
Update the pybind repo to include the `gil_scoped_acquire::disarm()` methods.
In python_engine, allocate scoped_acquire as a unique_ptr and leak it if the engine is finalizing, for Python 3.9+.

Fixes #50014 and #50893

Pull Request resolved: #50998

Reviewed By: ezyang

Differential Revision: D26038314

Pulled By: malfet

fbshipit-source-id: 035411e22825e8fdcf1348fed36da0bc33e16f60
Summary: Adding a set of benchmarks for key operators

Test Plan:
buck build mode/opt -c 'fbcode.caffe2_gpu_type=none' caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench
OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 numactl -C 3 ./buck-out/gen/caffe2/benchmarks/cpp/tensorexpr/tensorexpr_bench

Reviewed By: ZolotukhinM

Differential Revision: D25981260

fbshipit-source-id: 17681fc1527f43ccf9bcc80704415653a627b396
Summary:
Pull Request resolved: #51093

Operator-level benchmarks comparing eager-mode PyTorch to
NNC-generated fused kernels.  We wouldn't normally see these ops in isolation,
but the benchmark points out where NNC is falling short (or doing well).

I threw in a composed hardswish for fun, because it's my favorite activation
function.
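
For reference, hardswish is commonly defined as `x * relu6(x + 3) / 6`, and a "composed" version builds it from simpler elementwise operations rather than one dedicated op. The sketch below uses plain Python floats to show the arithmetic only; it is not the benchmark's code, and the helper names are illustrative.

```python
def relu6(x: float) -> float:
    """Clamp x to the range [0, 6]."""
    return min(max(x, 0.0), 6.0)


def hardswish(x: float) -> float:
    """hardswish(x) = x * relu6(x + 3) / 6, composed from add, clamp, mul, div."""
    return x * relu6(x + 3.0) / 6.0


# Below -3 the output is zero, above +3 it equals the input,
# and in between it follows a smooth quadratic ramp.
for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(x, hardswish(x))
```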

Notably, it exposes a bug in our build process that's preventing vectorization
from using `sleef`, so we're using scalar calls to libm with predictably lousy
performance.  Fix incoming.

This benchmark is similar to the pure NNC approach in `microbenchmarks.py`, but
will include the overhead of dispatching the fused kernel through TorchScript.
ghstack-source-id: 120403675

Test Plan:
```
op                        eager        nnc    speedup
hardswish                 0.187      0.051       3.70
hardswish                 0.052      0.052       1.00
sigmoid                   0.148      1.177       0.13
reciprocal                0.049      0.050       0.98
neg                       0.038      0.037       1.02
relu                      0.037      0.036       1.03
isnan                     0.119      0.020       5.86
log                       0.082      1.330       0.06
log10                     0.148      1.848       0.08
log1p                     0.204      1.413       0.14
log2                      0.285      1.167       0.24
exp                       0.063      1.123       0.06
expm1                     0.402      1.417       0.28
erf                       0.167      0.852       0.20
erfc                      0.181      1.098       0.16
cos                       0.124      0.793       0.16
sin                       0.126      0.838       0.15
tan                       0.285      1.777       0.16
acos                      0.144      1.358       0.11
asin                      0.126      1.193       0.11
cosh                      0.384      1.761       0.22
sinh                      0.390      2.279       0.17
atan                      0.240      1.564       0.15
tanh                      0.320      2.259       0.14
sqrt                      0.043      0.069       0.63
rsqrt                     0.118      0.117       1.01
abs                       0.038      0.037       1.03
ceil                      0.038      0.038       1.01
floor                     0.039      0.039       1.00
round                     0.039      0.292       0.13
trunc                     0.040      0.036       1.12
lgamma                    2.045      2.721       0.75
```
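
The `speedup` column appears to be the eager time divided by the NNC time, so values below 1.0 mark operators where the fused kernel is slower. A small Python sketch over a few rows copied from the table illustrates this; the recomputed ratios differ slightly from the printed ones, presumably because the printed timings are rounded.

```python
# (op, eager, nnc, printed speedup) for a few rows of the table above.
rows = [
    ("sigmoid", 0.148, 1.177, 0.13),
    ("relu", 0.037, 0.036, 1.03),
    ("isnan", 0.119, 0.020, 5.86),
    ("log", 0.082, 1.330, 0.06),
    ("sqrt", 0.043, 0.069, 0.63),
]


def speedup(eager: float, nnc: float) -> float:
    """Ratio of eager time to NNC time; above 1.0 means NNC is faster."""
    return eager / nnc


# Operators where the NNC kernel is slower than eager mode.
slower_with_nnc = [op for op, eager, nnc, _ in rows if speedup(eager, nnc) < 1.0]

for op, eager, nnc, printed in rows:
    print(f"{op:10s} recomputed={speedup(eager, nnc):.2f} printed={printed:.2f}")
print("slower with NNC:", slower_with_nnc)
```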

Reviewed By: zheng-xq

Differential Revision: D26069791

fbshipit-source-id: 236e7287ba1b3f67fdcb938949a92bbbdfa13dba

Summary:
Fixes #50695.

Rather than maintaining a LICENSE_BUNDLED.txt by hand, this builds it from the subrepos.

I ~copied and adapted the sdist handling from Numpy~ added a separate file, so the LICENSE.txt file of the repo remains in pristine condition and the GitHub website still recognizes it. If we modify the file, the website will no longer recognize the license.

This is not enough, since the license in the ~wheel~ wheel and sdist is not modified. Numpy has a [separate step](https://github.com/MacPython/numpy-wheels/blob/master/patch_code.sh) when preparing the wheel to concatenate the licenses. I am not sure where/if the [conda-forge numpy-feedstock](https://github.com/conda-forge/numpy-feedstock/) also fixes up the license.

~Should~ I ~commit~ committed the artifact to the repo and ~add~ added a test that regenerating the file produces the same content.

Edit: now the file is part of the repo.

Edit: rework the mention of sdist. After this is merged another PR is needed to make the sdist and wheel ship the proper merged license.

Pull Request resolved: #50745

Reviewed By: seemethere, heitorschueroff

Differential Revision: D26074974

Pulled By: walterddr

fbshipit-source-id: bacd5d6870e9dbb419a31a3e3d2fdde286ff2c94
Test Plan: revert-hammer

Differential Revision:
D25675618 (c8a24eb)

Original commit changeset: 55636bb2d3d6

fbshipit-source-id: 7b196f7c32830061eca9c89bbcb346cdd66a211e
Summary:
Introduced by D25981260 (f08464f)

Pull Request resolved: #51157

Reviewed By: bwasti

Differential Revision: D26090008

Pulled By: malfet

fbshipit-source-id: b63f1bb1683c7261902de7eaab24a05a5159ce7e
@imaginary-person imaginary-person merged commit 657946d into imaginary-person:master Jan 27, 2021
imaginary-person pushed a commit that referenced this pull request May 26, 2021
Summary: Added more statistics for static runtime

Test Plan:
caffe2/benchmarks/static_runtime:static_runtime_cpptest

Expected output example:

Static runtime ms per iter: 0.939483. Iters per second: 1064.41
Node #0: 0.195671 ms/iter, %wide_offset.1 : Tensor = aten::add(%wide.1, %self._mu, %4)
Node #1: 0.169457 ms/iter, %wide_normalized.1 : Tensor = aten::mul(%wide_offset.1, %self._sigma)
Node #2: 0.118218 ms/iter, %wide_preproc.1 : Tensor = aten::clamp(%wide_normalized.1, %5, %6)
Node #3: 0.038814 ms/iter, %user_emb_t.1 : Tensor = aten::transpose(%user_emb.1, %4, %7)
Node #4: 0.0860747 ms/iter, %dp_unflatten.1 : Tensor = aten::bmm(%ad_emb_packed.1, %user_emb_t.1)
Node #5: 0.0102666 ms/iter, %31 : Tensor = static_runtime::flatten_copy(%dp_unflatten.1, %4, %8)
Node #6: 0.000476333 ms/iter, %19 : Tensor[] = prim::ListConstruct(%31, %wide_preproc.1)
Node #7: 0.0707332 ms/iter, %input.1 : Tensor = aten::cat(%19, %4)
Node #8: 0.123695 ms/iter, %fc1.1 : Tensor = aten::addmm(%self._fc_b, %input.1, %29, %4, %4)
Node #9: 0.0309244 ms/iter, %23 : Tensor = aten::sigmoid(%fc1.1)
Node #10: 0.0046297 ms/iter, %24 : (Tensor) = prim::TupleConstruct(%23)
Time per node type:
       0.195671 ms.    23.0483%. aten::add (1 nodes)
       0.169457 ms.    19.9605%. aten::mul (1 nodes, out variant)
       0.123695 ms.    14.5702%. aten::addmm (1 nodes, out variant)
       0.118218 ms.     13.925%. aten::clamp (1 nodes, out variant)
      0.0860747 ms.    10.1388%. aten::bmm (1 nodes, out variant)
      0.0707332 ms.    8.33175%. aten::cat (1 nodes, out variant)
       0.038814 ms.    4.57195%. aten::transpose (1 nodes)
      0.0309244 ms.    3.64263%. aten::sigmoid (1 nodes, out variant)
      0.0102666 ms.    1.20932%. static_runtime::flatten_copy (1 nodes, out variant)
      0.0046297 ms.   0.545338%. prim::TupleConstruct (1 nodes, out variant)
    0.000476333 ms.  0.0561079%. prim::ListConstruct (1 nodes, out variant)
       0.848959 ms. in Total
StaticRuntime setup time: 0.018925 ms
Memory allocation time: 0.019808 ms
Memory deallocation time: 0.0120445 ms
Outputs deallocation time: 0.0864947 ms
Total memory managed: 19328 bytes
Total number of reused tensors: 3
Total number of 'out' variant nodes/total number of nodes: 9/11 (81.8182%)
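
The "Time per node type" percentages are consistent with each type's time divided by the summed per-node time. A Python sketch that recomputes the summary from the per-node lines above (the aggregation shown here is an assumption about how the report is built, not the benchmark's code):

```python
from collections import defaultdict

# (op, ms/iter) for each node, copied from the per-node lines above.
node_times = [
    ("aten::add", 0.195671),
    ("aten::mul", 0.169457),
    ("aten::clamp", 0.118218),
    ("aten::transpose", 0.038814),
    ("aten::bmm", 0.0860747),
    ("static_runtime::flatten_copy", 0.0102666),
    ("prim::ListConstruct", 0.000476333),
    ("aten::cat", 0.0707332),
    ("aten::addmm", 0.123695),
    ("aten::sigmoid", 0.0309244),
    ("prim::TupleConstruct", 0.0046297),
]

# Sum the time per operator type, then express each as a share of the total.
time_per_type = defaultdict(float)
for op, ms in node_times:
    time_per_type[op] += ms
total_ms = sum(time_per_type.values())

for op, ms in sorted(time_per_type.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{ms:>12.6g} ms. {100.0 * ms / total_ms:>9.6g}%. {op}")
print(f"{total_ms:.6g} ms. in Total")

# 9 of the 11 nodes are reported as 'out' variants in the summary above.
out_variant_pct = 100.0 * 9 / 11
print(f"{out_variant_pct:.6g}%")
```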

Reviewed By: hlu1

Differential Revision: D28553029

fbshipit-source-id: 55e7eab50b4b475ae219896100bdf4f6678875a4
imaginary-person pushed a commit that referenced this pull request Jul 2, 2021
Summary:
Pull Request resolved: pytorch#60987

We were seeing deadlocks as follows during shutdown:

```
Thread 1 (LWP 2432101):
#0  0x00007efca470190b in __pause_nocancel () from /lib64/libc.so.6
#1  0x00007efca49de485 in __pthread_mutex_lock_full () from /lib64/libpthread.so.0
#2  0x00007ef91d4c42c6 in __cuda_CallJitEntryPoint () from /lib64/libnvidia-ptxjitcompiler.so.1
#3  0x00007efc651ac8f1 in ?? () from /lib64/libcuda.so
#4  0x00007efc651aee03 in ?? () from /lib64/libcuda.so
#5  0x00007efc64f76b84 in ?? () from /lib64/libcuda.so
#6  0x00007efc64f77f5d in ?? () from /lib64/libcuda.so
#7  0x00007efc64eac858 in ?? () from /lib64/libcuda.so
#8  0x00007efc64eacfbc in ?? () from /lib64/libcuda.so
#9  0x00007efc7810a924 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#10 0x00007efc780fa2be in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#11 0x00007efc78111044 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#12 0x00007efc7811580a in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#13 0x00007efc78115aa4 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#14 0x00007efc781079ec in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#15 0x00007efc780e6a7a in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#16 0x00007efc7811cfa5 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#17 0x00007efc777ea98c in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#18 0x00007efc777ebd80 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#19 0x00007efc777ea2c9 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#20 0x00007efc778c2e2d in cublasDestroy_v2 () from /usr/local/cuda/lib64/libcublas.so.11
#21 0x00007efc51a3fb56 in std::_Sp_counted_ptr_inplace<at::cuda::(anonymous namespace)::DeviceThreadHandlePool<cublasContext*, &at::cuda::(anonymous namespace)::createCublasHandle, &at::cuda::(anonymous namespace)::destroyCublasHandle>, std::allocator<at::cuda::(anonymous namespace)::DeviceThreadHandlePool<cublasContext*, &at::cuda::(anonymous namespace)::createCublasHandle, &at::cuda::(anonymous namespace)::destroyCublasHandle> >, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /data/users/pritam/pytorch/torch/lib/libtorch_cuda.so
#22 0x00007efc51a3fc5f in std::shared_ptr<at::cuda::(anonymous namespace)::DeviceThreadHandlePool<cublasContext*, &at::cuda::(anonymous namespace)::createCublasHandle, &at::cuda::(anonymous namespace)::destroyCublasHandle> >::~shared_ptr() () from /data/users/pritam/pytorch/torch/lib/libtorch_cuda.so
#23 0x00007efca4648b0c in __run_exit_handlers () from /lib64/libc.so.6
#24 0x00007efca4648c40 in exit () from /lib64/libc.so.6
#25 0x0000558c8852e5f9 in Py_Exit (sts=0) at /tmp/build/80754af9/python_1614362349910/work/Python/pylifecycle.c:2292
#26 0x0000558c8852e6a7 in handle_system_exit () at /tmp/build/80754af9/python_1614362349910/work/Python/pythonrun.c:636
#27 0x0000558c8852e742 in PyErr_PrintEx (set_sys_last_vars=<optimized out>, set_sys_last_vars=<optimized out>) at /tmp/build/80754af9/python_1614362349910/work/Python/pythonrun.c:646
#28 0x0000558c88540dd6 in PyRun_SimpleStringFlags (command=0x7efca4dc9050 "from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=9, pipe_handle=13)\n", flags=0x7ffe3a986110) at /tmp/build/80754af9/python_1614362349910/work/Python/pythonrun.c:457
#29 0x0000558c88540ead in pymain_run_command (cf=0x7ffe3a986110, command=<optimized out>) at /tmp/build/80754af9/python_1614362349910/work/Modules/main.c:420
#30 pymain_run_python (pymain=0x7ffe3a986220) at /tmp/build/80754af9/python_1614362349910/work/Modules/main.c:2907
#31 pymain_main (pymain=0x7ffe3a986220) at /tmp/build/80754af9/python_1614362349910/work/Modules/main.c:3460
#32 0x0000558c8854122c in _Py_UnixMain (argc=<optimized out>, argv=<optimized out>) at /tmp/build/80754af9/python_1614362349910/work/Modules/main.c:3495
#33 0x00007efca4632493 in __libc_start_main () from /lib64/libc.so.6
#34 0x0000558c884e5e90 in _start () at ../sysdeps/x86_64/elf/start.S:103
```

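This was likely caused by a static singleton that wasn't leaky: its destructor
ran from the exit handlers and called `cublasDestroy_v2` (frames #20 to #23
above). This change follows the guidance in
https://isocpp.org/wiki/faq/ctors#construct-on-first-use-v2 and uses a leaky
singleton instead.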
ghstack-source-id: 132847448

Test Plan: Verified locally.

Reviewed By: malfet

Differential Revision: D29468866

fbshipit-source-id: 89250594c5cd2643417b1da584c658b742dc5a5c
imaginary-person pushed a commit that referenced this pull request Jul 20, 2021
Summary:
Pull Request resolved: pytorch#61588

As part of debugging pytorch#60290,
we discovered the following deadlock:

```
Thread 79 (Thread 0x7f52ff7fe700 (LWP 205437)):
#0  pthread_cond_timedwait@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
#1  0x0000564880199152 in PyCOND_TIMEDWAIT (cond=0x564880346080 <gil_cond>, mut=0x564880346100 <gil_mutex>, us=5000) at /home/builder/ktietz/cos6/ci_cos6/python_1622833237666/work/Python/condvar.h:103
#2  take_gil (tstate=0x7f5254005ef0) at /home/builder/ktietz/cos6/ci_cos6/python_1622833237666/work/Python/ceval_gil.h:224
#3  0x0000564880217b62 in PyEval_AcquireThread (tstate=0x7f5254005ef0) at /home/builder/ktietz/cos6/ci_cos6/python_1622833237666/work/Python/ceval.c:278
#4  0x00007f557d54aabd in pybind11::gil_scoped_acquire::gil_scoped_acquire() () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so
#5  0x00007f557da7792f in (anonymous namespace)::concrete_decref_fn(c10::impl::PyInterpreter const*, _object*) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so
#6  0x00007f5560dadba6 in c10::TensorImpl::release_resources() () from /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so
#7  0x00007f5574c885bc in std::_Sp_counted_ptr_inplace<torch::distributed::autograd::DistAutogradContext, std::allocator<torch::distributed::autograd::DistAutogradContext>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#8  0x00007f5574c815e9 in std::__detail::_Hashtable_alloc<std::allocator<std::__detail::_Hash_node<std::pair<long const, std::shared_ptr<torch::distributed::autograd::DistAutogradContext> >, false> > >::_M_deallocate_node(std::__detail::_Hash_node<std::pair<long const, std::shared_ptr<torch::distributed::autograd::DistAutogradContext> >, false>*) [clone .isra.325] () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#9  0x00007f5574c81bf1 in torch::distributed::autograd::DistAutogradContainer::eraseContextIdAndReset(torch::distributed::autograd::DistAutogradContainer::ContextsShard&, long) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#10 0x00007f5574c86e83 in torch::distributed::autograd::DistAutogradContainer::releaseContextIfPresent(long) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#11 0x00007f5574cc6395 in torch::distributed::rpc::RequestCallbackNoPython::processCleanupAutogradContextReq(torch::distributed::rpc::RpcCommandBase&) const () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#12 0x00007f5574cccf15 in torch::distributed::rpc::RequestCallbackNoPython::processRpc(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::vector<c10::Stream, std::allocator<c10::Stream> >) const () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so

Thread 72 (Thread 0x7f53077fe700 (LWP 205412)):
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f55bc62adbd in __GI___pthread_mutex_lock (mutex=0x564884396440) at ../nptl/pthread_mutex_lock.c:80
#2  0x00007f5574c82a2f in torch::distributed::autograd::DistAutogradContainer::retrieveContext(long) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#3  0x00007f557de9bb2f in pybind11::cpp_function::initialize<torch::distributed::autograd::(anonymous namespace)::dist_autograd_init(_object*, _object*)::{lambda(long)#11}, pybind11::dict, long, pybind11::name, pybind11::scope, pybind11::sibling, char [931], pybind11::arg>(torch::distributed::autograd::(anonymous namespace)::dist_autograd_init(_object*, _object*)::{lambda(long)#11}&&, pybind11::dict (*)(long), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, char const (&) [931], pybind11::arg const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so

```

Basically, Thread 72 holds the GIL and tries to acquire the lock for
DistAutogradContainer to perform a lookup on a map. Meanwhile, Thread 79
holds the lock on DistAutogradContainer to remove a Tensor and, as part of
the TensorImpl destructor, concrete_decref_fn is called, which waits for the
GIL. As a result, we have a deadlock.

To fix this issue, I've ensured we release the GIL when we call
`retrieveContext` and acquire it later when needed.
ghstack-source-id: 133493659
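
The lock-ordering problem and the fix can be modelled with two ordinary locks. This is a Python sketch of the idea only: `gil` and `container_lock` are stand-ins for the real GIL and the DistAutogradContainer lock, and the function names are illustrative rather than the actual C++ code.

```python
import threading

gil = threading.Lock()             # stand-in for the Python GIL
container_lock = threading.Lock()  # stand-in for the DistAutogradContainer lock
contexts = {1: "ctx-1"}            # stand-in for the context map


def retrieve_context(context_id):
    """Lookup path (Thread 72 in the trace); called with `gil` held."""
    gil.release()                  # the fix: drop the "GIL" before blocking
    try:
        with container_lock:
            ctx = contexts.get(context_id)
    finally:
        gil.acquire()              # take it back once the lookup is done
    return ctx


def release_context(context_id):
    """Cleanup path (Thread 79): holds the container lock, then needs the 'GIL'."""
    with container_lock:
        ctx = contexts.pop(context_id, None)
        with gil:                  # stand-in for concrete_decref_fn waiting for the GIL
            del ctx


def run_demo(timeout: float = 5.0) -> bool:
    """Run both paths concurrently; True means neither thread got stuck."""
    def lookup():
        with gil:                  # the lookup thread starts out holding the "GIL"
            retrieve_context(1)

    threads = [
        threading.Thread(target=lookup, daemon=True),
        threading.Thread(target=release_context, args=(1,), daemon=True),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout)
    return not any(t.is_alive() for t in threads)
```

Because the lookup path never holds both locks at once, the circular wait seen in the two backtraces cannot form in this model.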

Test Plan: waitforbuildbot

Reviewed By: mrshenli

Differential Revision: D29682624

fbshipit-source-id: f68a1fb39040ca0447a26e456a97bce64af6b79c