Get latest code from main repo by imaginary-person · Pull Request #7 · imaginary-person/pytorch-1

imaginary-person · 2021-01-27T00:18:54Z

Get latest code from main repo

Summary: Pull Request resolved: pytorch/glow#5257 - Add RescaleQuantized parallelization support to graph opts' parallelization code - On NNPI, mirror Rescale parallelization for FC/Relus that come before it - Sink Reshapes below Quantize and ConvertTo - Remove unnecessary ConvertTo when following a Dequantize (i.e. just change the elem kind of the Dequantize instead) Test Plan: Added unit tests Reviewed By: hyuen, mjanderson09 Differential Revision: D25947824 fbshipit-source-id: 771abd36a1bc7270bf1f901d1ec6cb6d78e9fd1f

…8436) Summary: This PR adds cusolver `gesvdj` and `gesvdjBatched` to the backend of `torch.svd`. I've tested the performance using cuda 11.1 on 2070, V100, and A100. The cusolver gesvdj and gesvdjBatched performances are better than magma in all square matrix cases. So cusolver backend will replace magma backend when available. When both matrix dimensions are no greater than 32, `gesvdjBatched` is used. Otherwise, `gesvdj` is used. Detailed benchmark is available at https://github.com/xwang233/code-snippet/tree/master/svd. Some relevant code and discussions - https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/linalg/svd_op_gpu.cu.cc - https://github.com/google/jax/blob/master/jaxlib/cusolver.cc - cupy/cupy#3174 - tensorflow/tensorflow#13603 - https://www.nvidia.com/en-us/on-demand/session/gtcsiliconvalley2019-s9226/ See also #42666 #47953 Close #50516 Pull Request resolved: #48436 Reviewed By: ejguan Differential Revision: D25977046 Pulled By: heitorschueroff fbshipit-source-id: c27e705cd29b6fd7c8ac674c1f9f490fa26ee1bf
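The size-based choice between the two cuSOLVER routines described above can be sketched as a small dispatch function. This is only an illustration of the rule as stated in the summary: `choose_svd_kernel` and `BATCHED_MAX_DIM` are hypothetical names, and the real selection happens in C++ and also depends on whether cuSOLVER is available.

```python
# Hypothetical helper: encodes only the size threshold quoted in the summary.
BATCHED_MAX_DIM = 32


def choose_svd_kernel(rows: int, cols: int) -> str:
    """Return the cuSOLVER routine name the summary associates with a rows x cols matrix."""
    if rows <= BATCHED_MAX_DIM and cols <= BATCHED_MAX_DIM:
        # Both dimensions are "no greater than 32".
        return "gesvdjBatched"
    return "gesvdj"


if __name__ == "__main__":
    for shape in [(16, 16), (32, 32), (32, 64), (512, 512)]:
        print(shape, "->", choose_svd_kernel(*shape))
```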

Summary: Do it by removing extraneous header dependencies. None of the at::vec256 primitives depend on the notion of Tensor, so none of the headers that vec256 depends on should include <ATen/Tensor.h>. Implicitly test this by removing the c10 and tensor dependencies when building `vec256_test_all_types`. Split affine_quantizer into affine_quantizer_base (which contains methods operating on raw types) and affine_quantizer (which contains Tensor-specific methods). Fixes #50567 Pull Request resolved: #50708 Reviewed By: walterddr Differential Revision: D25949168 Pulled By: malfet fbshipit-source-id: c3323be7252865a52c7d94026a5a39b494e44efb

Summary: Now we can remove `_th_orgqr`! Compared to the original TH-based `orgqr`, complex (#33152) and batched inputs are now supported. CUDA support will be added in a follow-up PR. Closes #24747 Ref. #49421, #42666 Pull Request resolved: #50502 Reviewed By: mrshenli Differential Revision: D25953300 Pulled By: mruberry fbshipit-source-id: f52a74e1c8f51b5e24f7b461430ca8fc96e4d149

…ke CLANGFORMAT` Reviewed By: zertosh Differential Revision: D26043955 fbshipit-source-id: 0a5740a82bdd3ac7bd1665a325ff7fe79488ccea

…0501) Summary: Follow up to #50435. I have confirmed this works by running ``` pytest test_ops.py -k test_fn_gradgrad_fft ``` normally and with `PYTORCH_TEST_WITH_SLOW=1 PYTORCH_TEST_SKIP_FAST=1`. In the first case all tests are skipped; in the second they all run, as they should. Pull Request resolved: #50501 Reviewed By: ezyang Differential Revision: D25956416 Pulled By: mruberry fbshipit-source-id: c896a8cec5f19b8ffb9b168835f3743b6986dad7
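The behaviour reported above (slow tests skipped by default, run when both flags are set) can be modelled with a small predicate. This is a sketch under assumptions: `should_run` is a hypothetical helper, and comparing the variables against the string `"1"` is my reading of how the flags are interpreted, not something the summary states.

```python
def should_run(is_slow_test: bool, env: dict) -> bool:
    """Decide whether a test runs, given the two environment flags (assumed semantics)."""
    with_slow = env.get("PYTORCH_TEST_WITH_SLOW", "0") == "1"
    skip_fast = env.get("PYTORCH_TEST_SKIP_FAST", "0") == "1"
    if is_slow_test:
        # Slow tests only run when explicitly requested.
        return with_slow
    # Fast tests run unless they are explicitly skipped.
    return not skip_fast


if __name__ == "__main__":
    both = {"PYTORCH_TEST_WITH_SLOW": "1", "PYTORCH_TEST_SKIP_FAST": "1"}
    print("slow test, default env:", should_run(True, {}))
    print("slow test, both flags: ", should_run(True, both))
```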

Summary: Related to #50483. All tests except the ONNX, detectron and release notes tests are moved to use common_utils.run_tests() to ensure CI reports XML correctly. Pull Request resolved: #50923 Reviewed By: samestep Differential Revision: D26027621 Pulled By: walterddr fbshipit-source-id: b04c03f10d1fe96181b720c4c3868e86e4c6281a

Summary: Pull Request resolved: #50817 Replace some longs with int64_t. Thanks Tom Heaven for contributing this patch. Signed-off-by: Edward Z. Yang <ezyang@fb.com> Test Plan: Imported from OSS Reviewed By: bdhirsh Differential Revision: D25975915 Pulled By: ezyang fbshipit-source-id: c1061a85f80ad17fa4fb313da797bc6d5ba203c2

Summary: Fixes #47117 Pull Request resolved: #51031 Reviewed By: bdhirsh Differential Revision: D26047498 Pulled By: albanD fbshipit-source-id: dd0a7d9f97c0f6469b3050d2e3b4473f1bee3820

…n for upsample_nearest2d" (#50794) Summary: Pull Request resolved: #50794 Original commit changeset: b4a7948088c0 There are some subtle extra tweaks on top of the original. I can unbundle them, but I've opted to keep them with the port because it's the easiest way to make sure the changes are exercised.
* There's a bugfix in the codegen to test if a dispatch key is structured *before* short circuiting because the dispatch key was missing in the table. This accounts for mixed structured-nonstructured situations where the dispatch table is present, but the relevant structured key isn't (because the dispatch table only exists to register, e.g., QuantizedCPU).
* Dispatch tables for functions which delegate to structured kernels don't have Math entries generated for them.
* It's now illegal to specify a structured dispatch key in a delegated structured kernel (it will be ignored!); `add` is now fixed to follow this.
* There are some extra sanity checks for NativeFunctions validation.
* Finally, unlike the original PR, I switched the .vec variant of upsample_nearest2d to also be DefaultBackend, bringing it in line with upsample_nearest1d.
ghstack-source-id: 120038038 Test Plan: ``` buck test mode/dev //coreai/tiefenrausch:python_tests -- --exact 'coreai/tiefenrausch:python_tests - test_can_run_local_async_inference_cpu (coreai.tiefenrausch.tests.python_test.TiefenrauschPY)' --run-disabled ``` Reviewed By: ngimel Differential Revision: D25962873 fbshipit-source-id: d29a9c97f15151db3066ae5efe7a0701e6dc05a3

Summary: This simplifies our handling and allows passing CompilationUnits from Python to C++ defined functions via PyBind easily. Discussed on Slack with SplitInfinity Pull Request resolved: #50614 Reviewed By: anjali411 Differential Revision: D25938005 Pulled By: SplitInfinity fbshipit-source-id: 94aadf0c063ddfef7ca9ea17bfa998d8e7b367ad

Summary: The test is flaky on ROCM when deadline is set to 1 second. This is affecting builds as it is failing randomly. Disabling for now. Signed-off-by: Arindam Roy <rarindam@gmail.com> Pull Request resolved: #50964 Reviewed By: houseroad Differential Revision: D26049370 Pulled By: BIT-silence fbshipit-source-id: 22337590a8896ad75f1281e56fbbeae897f5c3b2

Summary: Updates the docstrings, that `jvp` and `vjp` both return the primal `func_output` first as part of the return tuple, in line with the docstrings of [hvp](https://github.com/niklasschmitz/pytorch/blob/c620572a3477143df33128318dc0d7d10fab811d/torch/autograd/functional.py#L671) and [vhp](https://github.com/niklasschmitz/pytorch/blob/c620572a3477143df33128318dc0d7d10fab811d/torch/autograd/functional.py#L583). Pull Request resolved: #51035 Reviewed By: bdhirsh Differential Revision: D26047693 Pulled By: albanD fbshipit-source-id: 5f2957a858826b4c1884590b6be7a8bed0791efd

… {functional relu/module relu} (#50975) Summary: Pull Request resolved: #50975 Test Plan: Imported from OSS Reviewed By: supriyar Differential Revision: D26032532 fbshipit-source-id: a084fb4fd711ad52b2da1c6378cbcc2b352976c6

Summary: - Resolves ngimel's review comments in #49109 - Move `ConvolutionArgs` from `ConvShared.h` to `Conv_v7.cpp`, because cuDNN v8 uses different descriptors therefore will not share the same `ConvolutionArgs`. - Refactor the `ConvolutionParams` (the hash key for benchmark): - Remove `input_stride` - Add `input_dim` - Add `memory_format` - Make `repro_from_args` to take `ConvolutionParams` instead of `ConvolutionArgs` as arguments so that it can be shared for v7 and v8 - Rename some `layout` to `memory_format`. `layout` should be sparse/strided and `memory_format` should be contiguous/channels_last. They are different things. Pull Request resolved: #50827 Reviewed By: bdhirsh Differential Revision: D26048274 Pulled By: ezyang fbshipit-source-id: f71aa02d90ffa581c17ab05b171759904b311517

Summary: This is a follow up PR of #48493. Fixes #48492 Pull Request resolved: #50824 Reviewed By: bdhirsh Differential Revision: D26050736 Pulled By: ezyang fbshipit-source-id: 049605fd271cff28c8b6e300c163e9df3b3ea23b

…0957) Summary: Pull Request resolved: #50957 MAGMA has an off-by-one error in their batched cholesky implementation which is causing illegal memory access for certain inputs. The workaround implemented in this PR is to pad the input to MAGMA with 1 extra element.
**Benchmark** Ran the script below for both before and after my PR and got similar results.
*Script*
```
import torch
from torch.utils import benchmark

DTYPE = torch.float32
BATCHSIZE = 512 * 512
MATRIXSIZE = 16

a = torch.eye(MATRIXSIZE, device='cuda', dtype=DTYPE)

t0 = benchmark.Timer(
    stmt='torch.cholesky(a)',
    globals={'a': a},
    label='Single'
)

t1 = benchmark.Timer(
    stmt='torch.cholesky(a)',
    globals={'a': a.expand(BATCHSIZE, -1, -1)},
    label='Batched'
)

print(t0.timeit(100))
print(t1.timeit(100))
```
*Results before*
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7faf9bc63400>
Single
  2.08 ms
  1 measurement, 100 runs , 1 thread
<torch.utils.benchmark.utils.common.Measurement object at 0x7faf9bc63400>
Batched
  7.68 ms
  1 measurement, 100 runs , 1 thread
```
*Results after*
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7faf9bc63400>
Single
  2.10 ms
  1 measurement, 100 runs , 1 thread
<torch.utils.benchmark.utils.common.Measurement object at 0x7faf9bc63400>
Batched
  7.56 ms
  1 measurement, 100 runs , 1 thread
```
Fixes #41394, #26996, #48996 See also #42666, #26789
TODO --- - [x] Benchmark to check for perf regressions
Test Plan: Imported from OSS Reviewed By: bdhirsh Differential Revision: D26050978 Pulled By: heitorschueroff fbshipit-source-id: 7a5ba7e34c9d74b58568b2a0c631cc6d7ba63f86

Summary: Pull Request resolved: #50791 Add a dedicated pipeline parallelism doc page explaining the APIs and the overall value of the module. ghstack-source-id: 120257168 Test Plan: 1) View locally 2) waitforbuildbot Reviewed By: rohan-varma Differential Revision: D25967981 fbshipit-source-id: b607b788703173a5fa4e3526471140506171632b

…50868) Summary: Pull Request resolved: #50868 Ensures that `FakeQuantize` respects device affinity when loading from state_dict, and knows how to resize scale and zero_point values (which is necessary for FQ classes wrapping per channel observers). This is same as #44537, but for `FakeQuantize`. Test Plan: ``` python test/test_quantization.py TestObserver.test_state_dict_respects_device_affinity ``` Imported from OSS Reviewed By: jerryzh168 Differential Revision: D25991570 fbshipit-source-id: 1193a6cd350bddabd625aafa0682e2e101223bb1

Summary: **BC-breaking note:** torch.svd() added support for complex inputs in PyTorch 1.7, but was not documented as doing so. The complex "V" tensor returned was actually the complex conjugate of what's expected. This PR fixes the discrepancy. This will silently break all users of torch.svd() with complex inputs. **Original PR Summary:** This PR resolves #45821. The problem was that when introducing the support of complex inputs for `torch.svd` it was overlooked that LAPACK/MAGMA returns the conjugate transpose of V matrix, not just the transpose of V. So `torch.svd` was silently returning U, S, V.conj() instead of U, S, V. Behavior of `torch.linalg.pinv`, `torch.pinverse` and `torch.linalg.svd` (they depend on `torch.svd`) is not changed in this PR. Pull Request resolved: #51012 Reviewed By: bdhirsh Differential Revision: D26047593 Pulled By: albanD fbshipit-source-id: d1e08dbc3aab9ce1150a95806ef3b5da98b5d3ca
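To see why returning `V.conj()` instead of `V` matters, here is a tiny pure-Python check on a 2x2 complex example, using the convention A = U diag(S) Vᴴ. The matrices and helper names are made up for illustration and do not come from the PR; the point is only that swapping `V` for its conjugate no longer reconstructs the same matrix.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def conj(m):
    """Element-wise complex conjugate."""
    return [[x.conjugate() for x in row] for row in m]


def conj_transpose(m):
    """Conjugate transpose (Hermitian adjoint)."""
    return [[m[i][j].conjugate() for i in range(len(m))]
            for j in range(len(m[0]))]


# Illustrative factors: U is the identity, S is diagonal, V is unitary and complex.
U = [[1 + 0j, 0j], [0j, 1 + 0j]]
S = [[2 + 0j, 0j], [0j, 1 + 0j]]
V = [[1j, 0j], [0j, 1 + 0j]]

A = matmul(matmul(U, S), conj_transpose(V))              # uses the correct V
A_wrong = matmul(matmul(U, S), conj_transpose(conj(V)))  # uses V.conj() by mistake

print("A       =", A)
print("A_wrong =", A_wrong)
```

For real-valued inputs the two results coincide, which is consistent with the discrepancy only affecting complex inputs.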

Summary: Pull Request resolved: #50458 libinterpreter.so contains a frozen python distribution including torch-python bindings. Freezing refers to serializing bytecode of python standard library modules as well as the torch python library and embedding them in the library code. This library can then be dlopened multiple times in one process context, each interpreter having its own python state and GIL. In addition, each python environment is sealed off from the filesystem and can only import the frozen modules included in the distribution. This change relies on newly added frozenpython, a cpython 3.8.6 fork built for this purpose. Frozenpython provides libpython3.8-frozen.a which contains frozen bytecode and object code for the python standard library. Building on top of frozen python, the frozen torch-python bindings are added in this diff, providing each embedded interpreter with a copy of the torch bindings. Each interpreter is intended to share one instance of libtorch and the underlying tensor libraries. Known issues - Autograd is not expected to work with the embedded interpreter currently, as it manages its own python interactions and needs to coordinate with the duplicated python states in each of the interpreters. - Distributed and cuda stuff is disabled in libinterpreter.so build, needs to be revisited - __file__ is not supported in the context of embedded python since there are no files for the underlying library modules. using __file__ - __version__ is not properly supported in the embedded torch-python, just a workaround for now Test Plan: tested locally and on CI with cmake and buck builds running torch::deploy interpreter_test Reviewed By: ailzhang Differential Revision: D25850783 fbshipit-source-id: a4656377caff25b73913daae7ae2f88bcab8fd88

Summary: Pull Request resolved: #50622 1. Define a DDPLoggingData struct that is the placeholder for all the DDP-related logging fields 2. Put the DDPLoggingData struct in the C10 directory so that it can be easily imported by c10 and torch files 3. Expose a get_ddp_logging_data() method in Python so that users can get the logging data and dump it in their applications 4. Unit tests verify that the logging data can be set and retrieved as expected 5. Follow-ups will add more logging fields such as perf stats, internal states, env variables, etc. ghstack-source-id: 120275870 Test Plan: unit tests Reviewed By: SciPioneer Differential Revision: D25930527 fbshipit-source-id: 290c200161019c58e28eed9a5a2a7a8153113f99

Summary: Fixes #50496 Fixes #34859 Fixes #21596 This fixes many bugs involving `TransformedDistribution` and `ComposeTransform` when the component transforms changed their event shapes. Part of the fix is to introduce an `IndependentTransform` analogous to `distributions.Independent` and `constraints.independent`, and to introduce methods `Transform.forward_shape()` and `.inverse_shape()`. I have followed fehiepsi's suggestion and replaced `.input_event_dim` -> `.domain.event_dim` and `.output_event_dim` -> `.codomain.event_dim`. This allows us to deprecate `.event_dim` as an attribute. ## Summary of changes - Fixes `TransformDistribution` and `ComposeTransform` shape errors. - Fixes a behavior bug in `LogisticNormal`. - Fixes `kl_divergence(TransformedDistribution, TransformedDistribution)` - Adds methods `Transform.forward_shape()`, `.inverse_shape()` which are required for correct shape computations in `TransformedDistribution` and `ComposeTransform`. - Adds an `IndependentTransform`. - Adds a `ReshapeTransform` which is invaluable in testing shape logic in `ComposeTransform` and `TransformedDistribution` and which will be used by stefanwebb flowtorch. - Fixes incorrect default values in `constraints.dependent.event_dim`. - Documents the `.event_dim` and `.is_discrete` attributes. ## Changes planned for follow-up PRs - Memoize `constraints.dependent_property` as we do with `lazy_property`, since we now consult those properties much more often. ## Tested - [x] added a test for `Dist.support` vs `Dist(**params).support` to ensure static and dynamic attributes agree. 
- [x] refactoring is covered by existing tests - [x] add test cases for `ReshapeTransform` - [x] add a test for `TransformedDistribution` on a wide grid of input shapes - [x] added a regression test for #34859 cc fehiepsi feynmanliang stefanwebb Pull Request resolved: #50581 Reviewed By: ezyang, glaringlee, jpchen Differential Revision: D26024247 Pulled By: neerajprad fbshipit-source-id: f0b9a296f780ff49659b132409e11a29985dde9b
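The role of `forward_shape()` and `inverse_shape()` can be illustrated with a pure-Python sketch of the shape bookkeeping a reshape-style transform needs. `ReshapeShapeSketch` is a hypothetical stand-in written for this note, not the `torch.distributions` implementation: it only maps shapes, leaving leading batch dimensions untouched and swapping the trailing event shape.

```python
import math


class ReshapeShapeSketch:
    """Hypothetical shape-only model of a transform that reshapes the event part of a tensor."""

    def __init__(self, in_shape, out_shape):
        self.in_shape = tuple(in_shape)
        self.out_shape = tuple(out_shape)
        if math.prod(self.in_shape) != math.prod(self.out_shape):
            raise ValueError("in_shape and out_shape must have the same number of elements")

    @staticmethod
    def _swap_tail(shape, old_tail, new_tail):
        shape = tuple(shape)
        cut = len(shape) - len(old_tail)
        if cut < 0 or shape[cut:] != old_tail:
            raise ValueError(f"shape {shape} does not end with {old_tail}")
        # Batch dimensions pass through; only the event dimensions change.
        return shape[:cut] + new_tail

    def forward_shape(self, shape):
        return self._swap_tail(shape, self.in_shape, self.out_shape)

    def inverse_shape(self, shape):
        return self._swap_tail(shape, self.out_shape, self.in_shape)


if __name__ == "__main__":
    t = ReshapeShapeSketch((2, 3), (6,))
    print(t.forward_shape((5, 2, 3)))
    print(t.inverse_shape((5, 6)))
```

Composing several such objects and threading a shape through `forward_shape` in order is the kind of computation a composed transform has to do when its components change event shapes.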

Summary: Pull Request resolved: #51043 This PR makes `fast_nvcc` stop at failing commands, rather than continuing on to run commands that would otherwise run after those commands. It is still possible for `fast_nvcc` to run more commands than `nvcc` would run if there's no dependency between them, but this should still help to reduce noise from failing `fast_nvcc` runs. Test Plan: Unfortunately the test suite for this script is FB-internal. It would probably be a good idea to move it into the PyTorch GitHub repo, but I'm not entirely sure how to do so, since I don't believe we currently have a good place to put tests for things in `tools`. Reviewed By: malfet Differential Revision: D26007788 fbshipit-source-id: 8fe1e7d020a29d32d08fe55fb59229af5cdfbcaa
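The "stop at failing commands" behaviour can be sketched as a dependency-aware runner: a command is skipped if anything it depends on failed or was skipped, while independent commands still run. This is a simplified sequential model, not the actual `fast_nvcc` code; `run_with_dependencies` and the command names are hypothetical.

```python
def run_with_dependencies(commands, deps, run):
    """Run commands in order, skipping any whose prerequisites did not succeed.

    commands: names in an order where prerequisites come first (assumed).
    deps:     mapping from a name to the names it depends on.
    run:      callable taking a name and returning True on success.
    """
    status = {}
    for name in commands:
        if any(status[d] != "ok" for d in deps.get(name, [])):
            status[name] = "skipped"
            continue
        status[name] = "ok" if run(name) else "failed"
    return status


if __name__ == "__main__":
    # Hypothetical pipeline: the device compile step fails.
    commands = ["preprocess", "compile_device", "compile_host", "link"]
    deps = {
        "compile_device": ["preprocess"],
        "link": ["compile_device", "compile_host"],
    }
    result = run_with_dependencies(commands, deps, lambda n: n != "compile_device")
    for name in commands:
        print(f"{name}: {result[name]}")
```

In this model `compile_host` still runs after the failure because nothing ties it to `compile_device`, which mirrors the caveat that more commands may run than under plain `nvcc`.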

Summary: Pull Request resolved: #51051 Disable input pointer caching on ios. We are seeing some issues with this on some ios devices. Test Plan: FB: Test this in of IG with BT effect. Reviewed By: IvanKobzarev, AshkanAliabadi Differential Revision: D25984429 fbshipit-source-id: f6ceef606994b22de9cdd9752115b3481cd7bd96

Summary: See above. Pull Request resolved: #51046 Reviewed By: ZolotukhinM Differential Revision: D26053419 Pulled By: Chillee fbshipit-source-id: 9cc2dc434239a1ad77d30a1e5c0a9592be4944dc

Summary: Related to issue #42666 Pull Request resolved: #49168 Reviewed By: mrshenli Differential Revision: D25954027 Pulled By: mruberry fbshipit-source-id: e429f9587efff5e638bfd0e4de864c06f41c63b1

…e first K iterations (#50973) Summary: Pull Request resolved: #50973 This can extend the original PowerSGD method to a hybrid approach: vanilla allreduce + PowerSGD. This can help further improve the accuracy, at the cost of a lower speedup. Also add more comments on the fields in `PowerSGDState`. Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202 ghstack-source-id: 120257202 Test Plan: buck test mode/dev-nosan caffe2/test/distributed:c10d -- test_powerSGD_ddp_comm_hook_nccl buck test mode/dev-nosan caffe2/test/distributed:distributed_nccl_fork -- test_DistributedDataParallel_powerSGD_ddp_comm_hook Reviewed By: rohan-varma Differential Revision: D26031478 fbshipit-source-id: d72e70bb28ba018f53223c2a4345306980b3084e
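One way to picture the hybrid scheme is a per-iteration switch between the two synchronization methods. The sketch below assumes that plain allreduce is used for the first K iterations and PowerSGD afterwards; that reading, the function name `pick_gradient_sync`, and the parameter `k` are my assumptions for illustration, not the hook's actual interface.

```python
def pick_gradient_sync(iteration: int, k: int) -> str:
    """Return which method a hybrid scheme would use at a given iteration (assumed rule)."""
    if iteration < k:
        # Early iterations: uncompressed gradients, better accuracy, no speedup.
        return "allreduce"
    # Later iterations: compressed gradients via PowerSGD.
    return "powerSGD"


if __name__ == "__main__":
    k = 3
    print([pick_gradient_sync(i, k) for i in range(6)])
```

A larger `k` spends more iterations on uncompressed communication, which is the accuracy-versus-speedup trade-off the summary mentions.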

Summary: Pull Request resolved: #50974 Typo fixes. Original PR issue: Investigate Applying PowerSGD to Communication Hook for Gradient Compression #47202 ghstack-source-id: 120257221 Test Plan: N/A Reviewed By: rohan-varma Differential Revision: D26031679 fbshipit-source-id: 9d049b50419a3e40e53f7f1275a441e31b87717b

…#50854) Summary: Pull Request resolved: #50854 Test Plan: Imported from OSS Reviewed By: bhosmer Differential Revision: D26008542 Pulled By: ailzhang fbshipit-source-id: e9c0aa97ac2537ff612f5faf348fcb613da09479

…eter Test Plan: revert-hammer Differential Revision: D25850783 (3192f9e) Original commit changeset: a4656377caff fbshipit-source-id: 1c7133627da28fb12848da7a9a46de6d3b2b67c6

Summary: Pull Request resolved: #50744 This PR adds a `check_batched_grad=True` option to CriterionTest and turns it on by default for all CriterionTest-generated tests Test Plan: - run tests Reviewed By: ejguan Differential Revision: D25997676 Pulled By: zou3519 fbshipit-source-id: cc730731e6fae2bddc01bc93800fd0e3de28b32d

Summary: Closes #40702, Fixes #40690 Currently wip. But I would appreciate some feedback. Functions should be double-differentiable. Contrary to https://github.com/pytorch/pytorch/blob/b35cdc5200af963e410c0a25400fd07f30b89bca/torch/nn/parallel/_functions.py This PR generates list of tensors instead of aggregating the received data in a single tensor. Is this behavior correct? Thanks! Pull Request resolved: #40762 Reviewed By: glaringlee Differential Revision: D24758889 Pulled By: mrshenli fbshipit-source-id: 79285fb4b791cae3d248f34e2aadb11c9ab10cce

Summary: Removed skipCUDAIfRocm to re-enable tests for the ROCm platform. Initially, only 4799 cases were being run. Out of those, 882 cases were being skipped. After removing skipCUDAIfRocm from two places in test_ops.py, more than 8000 cases are now being executed, out of which only 282 cases are being skipped; these are FFT-related tests. Signed-off-by: Arindam Roy <rarindam@gmail.com> Fixes #{issue number} Pull Request resolved: #50500 Reviewed By: albanD Differential Revision: D25920303 Pulled By: mrshenli fbshipit-source-id: b2d17b7e2d1de4f9fdd6f1660fb4cad5841edaa0

Summary: This is an automated pull request to update the first-party submodule for [pytorch/tensorpipe](https://github.com/pytorch/tensorpipe). New submodule commit: pytorch/tensorpipe@f463e0e Pull Request resolved: #50946 Test Plan: Ensure that CI jobs succeed on GitHub before landing. Reviewed By: lw Differential Revision: D26018916 fbshipit-source-id: dc8aaa98d4e002e972d5c6783f2351c29f7db239

Summary: This fixes the following flaky test on machine with gpus of different arch: ``` _________________________________________________________________________________________________________________ TestCppExtensionJIT.test_jit_cuda_archflags __________________________________________________________________________________________________________________ self = <test_cpp_extensions_jit.TestCppExtensionJIT testMethod=test_jit_cuda_archflags> unittest.skipIf(not TEST_CUDA, "CUDA not found") unittest.skipIf(TEST_ROCM, "disabled on rocm") def test_jit_cuda_archflags(self): # Test a number of combinations: # - the default for the machine we're testing on # - Separators, can be ';' (most common) or ' ' # - Architecture names # - With/without '+PTX' capability = torch.cuda.get_device_capability() # expected values is length-2 tuple: (list of ELF, list of PTX) # note: there should not be more than one PTX value archflags = { '': (['{}{}'.format(capability[0], capability[1])], None), "Maxwell+Tegra;6.1": (['53', '61'], None), "Pascal 3.5": (['35', '60', '61'], None), "Volta": (['70'], ['70']), } if int(torch.version.cuda.split('.')[0]) >= 10: # CUDA 9 only supports compute capability <= 7.2 archflags["7.5+PTX"] = (['75'], ['75']) archflags["5.0;6.0+PTX;7.0;7.5"] = (['50', '60', '70', '75'], ['60']) for flags, expected in archflags.items(): > self._run_jit_cuda_archflags(flags, expected) test_cpp_extensions_jit.py:198: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ test_cpp_extensions_jit.py:158: in _run_jit_cuda_archflags _check_cuobjdump_output(expected[0]) test_cpp_extensions_jit.py:134: in _check_cuobjdump_output self.assertEqual(actual_arches, expected_arches, 
../../.local/lib/python3.9/site-packages/torch/testing/_internal/common_utils.py:1211: in assertEqual super().assertEqual(len(x), len(y), msg=self._get_assert_msg(msg, debug_msg=debug_msg)) E AssertionError: 2 != 1 : Attempted to compare the lengths of [iterable] types: Expected: 2; Actual: 1. E Flags: , Actual: ['sm_75', 'sm_86'], Expected: ['sm_86'] E Stderr: E Output: ELF file 1: cudaext_archflags.1.sm_75.cubin E ELF file 2: cudaext_archflags.2.sm_86.cubin ``` Pull Request resolved: #50405 Reviewed By: albanD Differential Revision: D25920200 Pulled By: mrshenli fbshipit-source-id: 1042a984142108f954a283407334d39e3ec328ce

Summary: `ResolutionCallback` returns `py::object` (i.e. `Any`) rather than `py::function` (i.e. `Callable`) Discovered while debugging test failures after updating pybind11 This also makes resolution code slightly faster, as it eliminates casts from object to function and back for every `py::object obj = rcb_(name);` statement. Pull Request resolved: #51089 Reviewed By: jamesr66a Differential Revision: D26069295 Pulled By: malfet fbshipit-source-id: 6876caf9b4653c8dc8e568aefb6778895decea05

) Summary: Closes #50513 by resolving all four checkboxes. If this PR is merged, I will also modify one or both of the following wiki pages to add instructions on how to use this `mypy` wrapper for VS Code editor integration: - [Guide for adding type annotations to PyTorch](https://github.com/pytorch/pytorch/wiki/Guide-for-adding-type-annotations-to-PyTorch) - [Lint as you type](https://github.com/pytorch/pytorch/wiki/Lint-as-you-type) Pull Request resolved: #50826 Test Plan: Unit tests for globbing function: ``` python test/test_testing.py TestMypyWrapper -v ``` Manual checks: - Uninstall `mypy` and run `python test/test_type_hints.py` to verify that it still works when `mypy` is absent. - Reinstall `mypy` and run `python test/test_type_hints.py` to verify that this didn't break the `TestTypeHints` suite. - Run `python test/test_type_hints.py` again (should finish quickly) to verify that this didn't break `mypy` caching. - Run `torch/testing/_internal/mypy_wrapper.py` on a few Python files in this repo to verify that it doesn't give any additional warnings when the `TestTypeHints` suite passes. Some examples (compare with the behavior of just running `mypy` on these files): ```sh torch/testing/_internal/mypy_wrapper.py $PWD/README.md torch/testing/_internal/mypy_wrapper.py $PWD/tools/fast_nvcc/fast_nvcc.py torch/testing/_internal/mypy_wrapper.py $PWD/test/test_type_hints.py torch/testing/_internal/mypy_wrapper.py $PWD/torch/random.py torch/testing/_internal/mypy_wrapper.py $PWD/torch/testing/_internal/mypy_wrapper.py ``` - Remove type hints from `torch.testing._internal.mypy_wrapper` and verify that running `mypy_wrapper.py` on that file gives type errors. - Remove the path to `mypy_wrapper.py` from the `files` setting in `mypy-strict.ini` and verify that running it again on itself no longer gives type errors. - Add `test/test_type_hints.py` to the `files` setting in `mypy-strict.ini` and verify that running the `mypy` wrapper on it again now gives type errors. 
- Change a return type in `torch/random.py` and verify that running the `mypy` wrapper on it again now gives type errors. - Add the suggested JSON from the docstring of `torch.testing._internal.mypy_wrapper.main` to your `.vscode/settings.json` and verify that VS Code gives the same results (inline, while editing any Python file in the repo) as running the `mypy` wrapper on the command line, in all the above cases. Reviewed By: walterddr Differential Revision: D26049052 Pulled By: samestep fbshipit-source-id: 0b35162fc78976452b5ea20d4ab63937b3c7695d

Summary: Pull Request resolved: #50630 Add a warning log to the distributed optimizer to warn the user when the optimizer is created without TorchScript support. Test Plan: Imported from OSS Reviewed By: rohan-varma Differential Revision: D25932777 Pulled By: wanchaol fbshipit-source-id: 8db3b98bdd27fc04c5a3b8d910b028c0c37f138d

Summary: Fixes #{issue number} Pull Request resolved: #50442 Reviewed By: bdhirsh Differential Revision: D26044981 Pulled By: mruberry fbshipit-source-id: 65c42f2c1de8d24e4852a1b5bd8f4b1735b2230e

Summary: Pull Request resolved: #50976 Test Plan: Imported from OSS Reviewed By: supriyar Differential Revision: D26032531 fbshipit-source-id: 9725bab8f70ac79652e7bf9f94376917438d60e0

Test Plan: revert-hammer Differential Revision: D26018916 (5f297cc) Original commit changeset: dc8aaa98d4e0 fbshipit-source-id: cd81a7950c7141e0711faabf03292098a8cf14d3

Test Plan: buck test //caffe2/test:test_fx_experimental buck test //glow/fb/fx_nnpi_importer:test_importer Reviewed By: jfix71 Differential Revision: D25675618 fbshipit-source-id: 55636bb2d3d6102b400f2044118a450906954083

Summary: In Python-3.9 and above `inspect.getsource` of a local class does not work if it was marked as default, see https://bugs.python.org/issue42666 #49617 Workaround by defining `make_global` function that programmatically accomplishes the same Partially addresses issue raised in #49617 Pull Request resolved: #51088 Reviewed By: gmagogsfm Differential Revision: D26069189 Pulled By: malfet fbshipit-source-id: 7cf14b88ae5d2b95d2b0fd852717a9202b86356e

Summary: Pull Request resolved: #51113 toTensor() on an lvalue IValue returns a reference; no need to copy. ghstack-source-id: 120317233 Test Plan: fitsships Compared `perf stat` results before/after (was on top of a diff stack so don't take baseline as where master is) Before: ``` 74,178.77 msec task-clock # 0.999 CPUs utilized ( +- 0.31% ) 17,125 context-switches # 0.231 K/sec ( +- 3.41% ) 3 cpu-migrations # 0.000 K/sec 109,535 page-faults # 0.001 M/sec ( +- 1.04% ) 146,803,364,372 cycles # 1.979 GHz ( +- 0.30% ) (50.03%) 277,726,600,254 instructions # 1.89 insn per cycle ( +- 0.02% ) (50.03%) 43,299,659,815 branches # 583.720 M/sec ( +- 0.03% ) (50.03%) 130,504,094 branch-misses # 0.30% of all branches ( +- 1.14% ) (50.03%) ``` After: ``` 72,695.01 msec task-clock # 0.999 CPUs utilized ( +- 1.18% ) 15,994 context-switches # 0.220 K/sec ( +- 5.21% ) 3 cpu-migrations # 0.000 K/sec 107,743 page-faults # 0.001 M/sec ( +- 1.55% ) 145,647,684,269 cycles # 2.004 GHz ( +- 0.30% ) (50.05%) 277,341,084,993 instructions # 1.90 insn per cycle ( +- 0.02% ) (50.04%) 43,200,717,263 branches # 594.273 M/sec ( +- 0.02% ) (50.05%) 143,873,086 branch-misses # 0.33% of all branches ( +- 0.59% ) (50.05%) ``` Looks like an 0.7% cycles win (barely outside the noise) and an 0.1% instructions win. Reviewed By: hlu1 Differential Revision: D26051766 fbshipit-source-id: 05f8d71d8120d79f7cd80aca747dfc537bf7d382
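The quoted wins can be rechecked directly from the `perf stat` counters above. The snippet below is just that arithmetic; `percent_reduction` is a helper name introduced here.

```python
# Counters copied from the before/after perf stat output above.
cycles_before = 146_803_364_372
cycles_after = 145_647_684_269
insns_before = 277_726_600_254
insns_after = 277_341_084_993


def percent_reduction(before: int, after: int) -> float:
    """Relative decrease from before to after, in percent."""
    return 100.0 * (before - after) / before


cycles_win = percent_reduction(cycles_before, cycles_after)
insns_win = percent_reduction(insns_before, insns_after)
print(f"cycles: {cycles_win:.2f}%  instructions: {insns_win:.2f}%")
```

Both values come out slightly above the rounded-down figures quoted in the summary, and the cycles difference is still small relative to the reported run-to-run variation.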

Summary: Pull Request resolved: #51047 If the environment variable `TORCH_VITAL` is set to a non-zero length string, the vitals are dumped at program end. The API is very similar to Google's logging. Test Plan: buck test //caffe2/aten:vitals Reviewed By: bitfort Differential Revision: D25791248 fbshipit-source-id: 0b40da7d22c31d2c4b2094f0dcb1229a35338ac2
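The enabling condition is "set to a non-zero length string", which is worth spelling out because it differs from a typical truthiness check. A minimal Python model of that condition follows; `vitals_enabled` is a hypothetical helper, and the real check lives in C++.

```python
import os


def vitals_enabled(environ=None) -> bool:
    """True when TORCH_VITAL is present and non-empty, per the rule stated above."""
    if environ is None:
        environ = os.environ
    return len(environ.get("TORCH_VITAL", "")) > 0


if __name__ == "__main__":
    for value in (None, "", "1", "0"):
        env = {} if value is None else {"TORCH_VITAL": value}
        print(repr(value), "->", vitals_enabled(env))
```

Under this rule `TORCH_VITAL=0` still enables the dump, since `"0"` is a non-empty string.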

Summary: Update pybind repo to include `gil_scoped_acquire::disarm()` methods In python_engine allocate scoped_acquire as unique_ptr and leak it if engine is finalizing for Python-3.9+ Fixes #50014 and #50893 Pull Request resolved: #50998 Reviewed By: ezyang Differential Revision: D26038314 Pulled By: malfet fbshipit-source-id: 035411e22825e8fdcf1348fed36da0bc33e16f60

Summary: Adding a set of benchmarks for key operators Test Plan: buck build mode/opt -c 'fbcode.caffe2_gpu_type=none' caffe2/benchmarks/cpp/tensorexpr:tensorexpr_bench OMP_NUM_THREADS=1 MKL_NUM_THREADS=1 numactl -C 3 ./buck-out/gen/caffe2/benchmarks/cpp/tensorexpr/tensorexpr_bench Reviewed By: ZolotukhinM Differential Revision: D25981260 fbshipit-source-id: 17681fc1527f43ccf9bcc80704415653a627b396

Summary: Pull Request resolved: #51093 Operator level benchmarks comparing eager-mode PyTorch to NNC-generated fused kernels. We wouldn't normally see these in isolation, but it points out where NNC is falling short (or doing well). I threw in a composed hardswish for fun, because it's my favorite activation function. Notably, it exposes a bug in our build process that's preventing vectorization from using `sleef`, so we're using scalar calls to libm with predictably lousy performance. Fix incoming. This benchmark is similar to the pure NNC approach in `microbenchmarks.py`, but will include the overhead of dispatching the fused kernel through TorchScript. ghstack-source-id: 120403675 Test Plan:
```
op          eager   nnc     speedup
hardswish   0.187   0.051   3.70
hardswish   0.052   0.052   1.00
sigmoid     0.148   1.177   0.13
reciprocal  0.049   0.050   0.98
neg         0.038   0.037   1.02
relu        0.037   0.036   1.03
isnan       0.119   0.020   5.86
log         0.082   1.330   0.06
log10       0.148   1.848   0.08
log1p       0.204   1.413   0.14
log2        0.285   1.167   0.24
exp         0.063   1.123   0.06
expm1       0.402   1.417   0.28
erf         0.167   0.852   0.20
erfc        0.181   1.098   0.16
cos         0.124   0.793   0.16
sin         0.126   0.838   0.15
tan         0.285   1.777   0.16
acos        0.144   1.358   0.11
asin        0.126   1.193   0.11
cosh        0.384   1.761   0.22
sinh        0.390   2.279   0.17
atan        0.240   1.564   0.15
tanh        0.320   2.259   0.14
sqrt        0.043   0.069   0.63
rsqrt       0.118   0.117   1.01
abs         0.038   0.037   1.03
ceil        0.038   0.038   1.01
floor       0.039   0.039   1.00
round       0.039   0.292   0.13
trunc       0.040   0.036   1.12
lgamma      2.045   2.721   0.75
```
Reviewed By: zheng-xq Differential Revision: D26069791 fbshipit-source-id: 236e7287ba1b3f67fdcb938949a92bbbdfa13dba

Summary: Fixes #50695. Rather than maintain a LICENSE_BUNDLED.txt by hand, this builds it out of the subrepos. I ~copied and adapted the sdist handling from Numpy~ added a separate file, so the LICENSE.txt file of the repo remains in pristine condition and the GitHub website still recognizes it. If we modify the file, the website will no longer recognize the license. This is not enough, since the license in the ~wheel~ wheel and sdist is not modified. Numpy has a [separate step](https://github.com/MacPython/numpy-wheels/blob/master/patch_code.sh) when preparing the wheel to concatenate the licenses. I am not sure where/if the [conda-forge numpy-feedstock](https://github.com/conda-forge/numpy-feedstock/) also fixes up the license. ~Should~ I ~commit~ committed the artifact to the repo and ~add~ added a test that reproducing the file is consistent. Edit: now the file is part of the repo. Edit: rework the mention of sdist. After this is merged another PR is needed to make the sdist and wheel ship the proper merged license. Pull Request resolved: #50745 Reviewed By: seemethere, heitorschueroff Differential Revision: D26074974 Pulled By: walterddr fbshipit-source-id: bacd5d6870e9dbb419a31a3e3d2fdde286ff2c94

Summary: added more statistic info for static runtime

Test Plan: caffe2/benchmarks/static_runtime:static_runtime_cpptest

Expected output example:

Static runtime ms per iter: 0.939483. Iters per second: 1064.41
Node #0: 0.195671 ms/iter, %wide_offset.1 : Tensor = aten::add(%wide.1, %self._mu, %4)
Node #1: 0.169457 ms/iter, %wide_normalized.1 : Tensor = aten::mul(%wide_offset.1, %self._sigma)
Node #2: 0.118218 ms/iter, %wide_preproc.1 : Tensor = aten::clamp(%wide_normalized.1, %5, %6)
Node #3: 0.038814 ms/iter, %user_emb_t.1 : Tensor = aten::transpose(%user_emb.1, %4, %7)
Node #4: 0.0860747 ms/iter, %dp_unflatten.1 : Tensor = aten::bmm(%ad_emb_packed.1, %user_emb_t.1)
Node #5: 0.0102666 ms/iter, %31 : Tensor = static_runtime::flatten_copy(%dp_unflatten.1, %4, %8)
Node #6: 0.000476333 ms/iter, %19 : Tensor[] = prim::ListConstruct(%31, %wide_preproc.1)
Node #7: 0.0707332 ms/iter, %input.1 : Tensor = aten::cat(%19, %4)
Node #8: 0.123695 ms/iter, %fc1.1 : Tensor = aten::addmm(%self._fc_b, %input.1, %29, %4, %4)
Node #9: 0.0309244 ms/iter, %23 : Tensor = aten::sigmoid(%fc1.1)
Node #10: 0.0046297 ms/iter, %24 : (Tensor) = prim::TupleConstruct(%23)

Time per node type:
0.195671 ms. 23.0483%. aten::add (1 nodes)
0.169457 ms. 19.9605%. aten::mul (1 nodes, out variant)
0.123695 ms. 14.5702%. aten::addmm (1 nodes, out variant)
0.118218 ms. 13.925%. aten::clamp (1 nodes, out variant)
0.0860747 ms. 10.1388%. aten::bmm (1 nodes, out variant)
0.0707332 ms. 8.33175%. aten::cat (1 nodes, out variant)
0.038814 ms. 4.57195%. aten::transpose (1 nodes)
0.0309244 ms. 3.64263%. aten::sigmoid (1 nodes, out variant)
0.0102666 ms. 1.20932%. static_runtime::flatten_copy (1 nodes, out variant)
0.0046297 ms. 0.545338%. prim::TupleConstruct (1 nodes, out variant)
0.000476333 ms. 0.0561079%. prim::ListConstruct (1 nodes, out variant)
0.848959 ms. in Total

StaticRuntime setup time: 0.018925 ms
Memory allocation time: 0.019808 ms
Memory deallocation time: 0.0120445 ms
Outputs deallocation time: 0.0864947 ms
Total memory managed: 19328 bytes
Total number of reused tensors: 3
Total number of 'out' variant nodes/total number of nodes: 9/11 (81.8182%)

Reviewed By: hlu1 Differential Revision: D28553029 fbshipit-source-id: 55e7eab50b4b475ae219896100bdf4f6678875a4
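The percentages in the "Time per node type" listing appear to be each type's time divided by the summed total (for example, 0.195671 / 0.848959 is roughly 23.05%). A minimal sketch of that arithmetic follows; `percent_of_total` is a hypothetical helper written for illustration, not the static runtime's own code.

```cpp
#include <string>
#include <utility>
#include <vector>

using TimedEntry = std::pair<std::string, double>;

// Returns each entry's share of the summed time, in percent.
// An empty or all-zero input yields 0% shares instead of dividing by zero.
std::vector<TimedEntry> percent_of_total(const std::vector<TimedEntry>& ms_per_type) {
    double total = 0.0;
    for (const auto& entry : ms_per_type) {
        total += entry.second;
    }
    std::vector<TimedEntry> shares;
    shares.reserve(ms_per_type.size());
    for (const auto& entry : ms_per_type) {
        const double pct = total > 0.0 ? 100.0 * entry.second / total : 0.0;
        shares.emplace_back(entry.first, pct);
    }
    return shares;
}
```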

Summary: Pull Request resolved: pytorch#60987 We were seeing deadlocks as follows during shutdown:
```
Thread 1 (LWP 2432101):
#0 0x00007efca470190b in __pause_nocancel () from /lib64/libc.so.6
#1 0x00007efca49de485 in __pthread_mutex_lock_full () from /lib64/libpthread.so.0
#2 0x00007ef91d4c42c6 in __cuda_CallJitEntryPoint () from /lib64/libnvidia-ptxjitcompiler.so.1
#3 0x00007efc651ac8f1 in ?? () from /lib64/libcuda.so
#4 0x00007efc651aee03 in ?? () from /lib64/libcuda.so
#5 0x00007efc64f76b84 in ?? () from /lib64/libcuda.so
#6 0x00007efc64f77f5d in ?? () from /lib64/libcuda.so
#7 0x00007efc64eac858 in ?? () from /lib64/libcuda.so
#8 0x00007efc64eacfbc in ?? () from /lib64/libcuda.so
#9 0x00007efc7810a924 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#10 0x00007efc780fa2be in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#11 0x00007efc78111044 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#12 0x00007efc7811580a in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#13 0x00007efc78115aa4 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#14 0x00007efc781079ec in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#15 0x00007efc780e6a7a in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#16 0x00007efc7811cfa5 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#17 0x00007efc777ea98c in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#18 0x00007efc777ebd80 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#19 0x00007efc777ea2c9 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#20 0x00007efc778c2e2d in cublasDestroy_v2 () from /usr/local/cuda/lib64/libcublas.so.11
#21 0x00007efc51a3fb56 in std::_Sp_counted_ptr_inplace<at::cuda::(anonymous namespace)::DeviceThreadHandlePool<cublasContext*, &at::cuda::(anonymous namespace)::createCublasHandle, &at::cuda::(anonymous namespace)::destroyCublasHandle>, std::allocator<at::cuda::(anonymous namespace)::DeviceThreadHandlePool<cublasContext*, &at::cuda::(anonymous namespace)::createCublasHandle, &at::cuda::(anonymous namespace)::destroyCublasHandle> >, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /data/users/pritam/pytorch/torch/lib/libtorch_cuda.so
#22 0x00007efc51a3fc5f in std::shared_ptr<at::cuda::(anonymous namespace)::DeviceThreadHandlePool<cublasContext*, &at::cuda::(anonymous namespace)::createCublasHandle, &at::cuda::(anonymous namespace)::destroyCublasHandle> >::~shared_ptr() () from /data/users/pritam/pytorch/torch/lib/libtorch_cuda.so
#23 0x00007efca4648b0c in __run_exit_handlers () from /lib64/libc.so.6
#24 0x00007efca4648c40 in exit () from /lib64/libc.so.6
#25 0x0000558c8852e5f9 in Py_Exit (sts=0) at /tmp/build/80754af9/python_1614362349910/work/Python/pylifecycle.c:2292
#26 0x0000558c8852e6a7 in handle_system_exit () at /tmp/build/80754af9/python_1614362349910/work/Python/pythonrun.c:636
#27 0x0000558c8852e742 in PyErr_PrintEx (set_sys_last_vars=<optimized out>, set_sys_last_vars=<optimized out>) at /tmp/build/80754af9/python_1614362349910/work/Python/pythonrun.c:646
#28 0x0000558c88540dd6 in PyRun_SimpleStringFlags (command=0x7efca4dc9050 "from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=9, pipe_handle=13)\n", flags=0x7ffe3a986110) at /tmp/build/80754af9/python_1614362349910/work/Python/pythonrun.c:457
#29 0x0000558c88540ead in pymain_run_command (cf=0x7ffe3a986110, command=<optimized out>) at /tmp/build/80754af9/python_1614362349910/work/Modules/main.c:420
#30 pymain_run_python (pymain=0x7ffe3a986220) at /tmp/build/80754af9/python_1614362349910/work/Modules/main.c:2907
#31 pymain_main (pymain=0x7ffe3a986220) at /tmp/build/80754af9/python_1614362349910/work/Modules/main.c:3460
#32 0x0000558c8854122c in _Py_UnixMain (argc=<optimized out>, argv=<optimized out>) at /tmp/build/80754af9/python_1614362349910/work/Modules/main.c:3495
#33 0x00007efca4632493 in __libc_start_main () from /lib64/libc.so.6
#34 0x0000558c884e5e90 in _start () at ../sysdeps/x86_64/elf/start.S:103
```
This was likely caused by a static singleton that wasn't leaky. Following the guidance in https://isocpp.org/wiki/faq/ctors#construct-on-first-use-v2, this change uses a leaky singleton instead. ghstack-source-id: 132847448 Test Plan: Verified locally. Reviewed By: malfet Differential Revision: D29468866 fbshipit-source-id: 89250594c5cd2643417b1da584c658b742dc5a5c
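For readers unfamiliar with the "leaky singleton" (construct on first use) idiom referenced above, a minimal sketch might look like the following. `HandlePool` and `leaky_pool` are hypothetical names standing in for the cuBLAS handle pool; this is not the code from the PR.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical resource pool standing in for the cuBLAS handle pool.
class HandlePool {
public:
    // Hands out a new integer "handle" and records it.
    int acquire() {
        handles_.push_back(next_id_);
        return next_id_++;
    }
    std::size_t size() const { return handles_.size(); }

private:
    std::vector<int> handles_;
    int next_id_ = 0;
};

// Construct on first use, never destroy. The static pointer is initialized
// exactly once (thread-safe since C++11), and the pointed-to object is
// intentionally leaked, so no destructor runs from exit handlers at shutdown.
HandlePool& leaky_pool() {
    static HandlePool* pool = new HandlePool();
    return *pool;
}
```

The trade-off is that the pool's destructor never runs, which is acceptable only when the operating system reclaims the resources at process exit anyway.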

Summary: Pull Request resolved: pytorch#61588 As part of debugging pytorch#60290, we discovered the following deadlock:
```
Thread 79 (Thread 0x7f52ff7fe700 (LWP 205437)):
#0 pthread_cond_timedwait@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
#1 0x0000564880199152 in PyCOND_TIMEDWAIT (cond=0x564880346080 <gil_cond>, mut=0x564880346100 <gil_mutex>, us=5000) at /home/builder/ktietz/cos6/ci_cos6/python_1622833237666/work/Python/condvar.h:103
#2 take_gil (tstate=0x7f5254005ef0) at /home/builder/ktietz/cos6/ci_cos6/python_1622833237666/work/Python/ceval_gil.h:224
#3 0x0000564880217b62 in PyEval_AcquireThread (tstate=0x7f5254005ef0) at /home/builder/ktietz/cos6/ci_cos6/python_1622833237666/work/Python/ceval.c:278
#4 0x00007f557d54aabd in pybind11::gil_scoped_acquire::gil_scoped_acquire() () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so
#5 0x00007f557da7792f in (anonymous namespace)::concrete_decref_fn(c10::impl::PyInterpreter const*, _object*) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so
#6 0x00007f5560dadba6 in c10::TensorImpl::release_resources() () from /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so
#7 0x00007f5574c885bc in std::_Sp_counted_ptr_inplace<torch::distributed::autograd::DistAutogradContext, std::allocator<torch::distributed::autograd::DistAutogradContext>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#8 0x00007f5574c815e9 in std::__detail::_Hashtable_alloc<std::allocator<std::__detail::_Hash_node<std::pair<long const, std::shared_ptr<torch::distributed::autograd::DistAutogradContext> >, false> > >::_M_deallocate_node(std::__detail::_Hash_node<std::pair<long const, std::shared_ptr<torch::distributed::autograd::DistAutogradContext> >, false>*) [clone .isra.325] () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#9 0x00007f5574c81bf1 in torch::distributed::autograd::DistAutogradContainer::eraseContextIdAndReset(torch::distributed::autograd::DistAutogradContainer::ContextsShard&, long) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#10 0x00007f5574c86e83 in torch::distributed::autograd::DistAutogradContainer::releaseContextIfPresent(long) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#11 0x00007f5574cc6395 in torch::distributed::rpc::RequestCallbackNoPython::processCleanupAutogradContextReq(torch::distributed::rpc::RpcCommandBase&) const () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#12 0x00007f5574cccf15 in torch::distributed::rpc::RequestCallbackNoPython::processRpc(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::vector<c10::Stream, std::allocator<c10::Stream> >) const () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so

Thread 72 (Thread 0x7f53077fe700 (LWP 205412)):
#0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1 0x00007f55bc62adbd in __GI___pthread_mutex_lock (mutex=0x564884396440) at ../nptl/pthread_mutex_lock.c:80
#2 0x00007f5574c82a2f in torch::distributed::autograd::DistAutogradContainer::retrieveContext(long) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#3 0x00007f557de9bb2f in pybind11::cpp_function::initialize<torch::distributed::autograd::(anonymous namespace)::dist_autograd_init(_object*, _object*)::{lambda(long)#11}, pybind11::dict, long, pybind11::name, pybind11::scope, pybind11::sibling, char [931], pybind11::arg>(torch::distributed::autograd::(anonymous namespace)::dist_autograd_init(_object*, _object*)::{lambda(long)#11}&&, pybind11::dict (*)(long), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, char const (&) [931], pybind11::arg const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so
```
Basically, Thread 72 holds the GIL and tries to acquire the lock for DistAutogradContainer to perform a lookup on a map. Thread 79, on the other hand, holds the lock on DistAutogradContainer to remove a Tensor, and as part of the TensorImpl destructor, concrete_decref_fn is called, which waits for the GIL. As a result, we have a deadlock. To fix this issue, I've ensured we release the GIL when we call `retrieveContext` and acquire it later when needed. ghstack-source-id: 133493659 Test Plan: waitforbuildbot Reviewed By: mrshenli Differential Revision: D29682624 fbshipit-source-id: f68a1fb39040ca0447a26e456a97bce64af6b79c
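The lock-order inversion and its fix can be sketched with plain `std::mutex` objects. Everything here is a stand-in chosen for illustration: `gil_stand_in` plays the role of the GIL, `container_mutex` the DistAutogradContainer lock, and `ScopedUnlock`, `retrieve_context`, and `erase_context` are hypothetical names, not PyTorch or pybind11 APIs.

```cpp
#include <map>
#include <mutex>

std::mutex gil_stand_in;       // hypothetical stand-in for the Python GIL
std::mutex container_mutex;    // stand-in for the DistAutogradContainer lock
std::map<long, int> contexts;  // stand-in for the context map

// RAII helper that unlocks an already-held mutex for a scope and re-locks it
// on exit, mirroring what a "release the GIL" guard does.
class ScopedUnlock {
public:
    explicit ScopedUnlock(std::mutex& m) : m_(m) { m_.unlock(); }
    ~ScopedUnlock() { m_.lock(); }
    ScopedUnlock(const ScopedUnlock&) = delete;
    ScopedUnlock& operator=(const ScopedUnlock&) = delete;

private:
    std::mutex& m_;
};

// Lookup path. The caller must hold gil_stand_in. It is dropped before the
// container lock is taken, so the two locks are never held together here.
int retrieve_context(long id) {
    int value = -1;
    {
        ScopedUnlock no_gil(gil_stand_in);
        std::lock_guard<std::mutex> guard(container_mutex);
        auto it = contexts.find(id);
        if (it != contexts.end()) {
            value = it->second;
        }
    }  // container lock is released first, then gil_stand_in is re-acquired
    return value;
}

// Erase path. Holds the container lock and then needs the GIL stand-in,
// as the tensor destructor did in the trace above.
void erase_context(long id) {
    std::lock_guard<std::mutex> guard(container_mutex);
    std::lock_guard<std::mutex> gil(gil_stand_in);
    contexts.erase(id);
}
```

Because the lookup path no longer waits for the container lock while holding the GIL stand-in, the erase path can always make progress, which breaks the cycle shown in the two traces.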

jfix71 and others added 30 commits January 23, 2021 17:20

[AutoAccept][Codemod][FBSourceClangFormatLinter] Daily `arc lint --ta…

5a5bca8

…ke CLANGFORMAT` Reviewed By: zertosh Differential Revision: D26043955 fbshipit-source-id: 0a5740a82bdd3ac7bd1665a325ff7fe79488ccea

Clarify wording around overrides subclasses. (#51031)

f7b339d

Summary: Fixes #47117 Pull Request resolved: #51031 Reviewed By: bdhirsh Differential Revision: D26047498 Pulled By: albanD fbshipit-source-id: dd0a7d9f97c0f6469b3050d2e3b4473f1bee3820

Add type annotations to torch.overrides (#50824)

9dfbfe9

Summary: This is a follow up PR of #48493. Fixes #48492 Pull Request resolved: #50824 Reviewed By: bdhirsh Differential Revision: D26050736 Pulled By: ezyang fbshipit-source-id: 049605fd271cff28c8b6e300c163e9df3b3ea23b

Added cuda bindings for NNC (#51046)

502ca01

Summary: See above. Pull Request resolved: #51046 Reviewed By: ZolotukhinM Differential Revision: D26053419 Pulled By: Chillee fbshipit-source-id: 9cc2dc434239a1ad77d30a1e5c0a9592be4944dc

Add torch.eig complex forward (CPU, CUDA) (#49168)

880f007

Summary: Related to issue #42666 Pull Request resolved: #49168 Reviewed By: mrshenli Differential Revision: D25954027 Pulled By: mruberry fbshipit-source-id: e429f9587efff5e638bfd0e4de864c06f41c63b1

Mike Ruberry and others added 22 commits January 26, 2021 02:07

Revert D25850783: Add torch::deploy, an embedded torch-python interpr…

e843974

…eter Test Plan: revert-hammer Differential Revision: D25850783 (3192f9e) Original commit changeset: a4656377caff fbshipit-source-id: 1c7133627da28fb12848da7a9a46de6d3b2b67c6

Enable BFloat support for gemms on arch other than ampere (#50442)

b822aba

Summary: Fixes #{issue number} Pull Request resolved: #50442 Reviewed By: bdhirsh Differential Revision: D26044981 Pulled By: mruberry fbshipit-source-id: 65c42f2c1de8d24e4852a1b5bd8f4b1735b2230e

[quant][graphmode][fx] cleanup linear module test case (#50976)

afa79a4

Summary: Pull Request resolved: #50976 Test Plan: Imported from OSS Reviewed By: supriyar Differential Revision: D26032531 fbshipit-source-id: 9725bab8f70ac79652e7bf9f94376917438d60e0

Revert D26018916: [pytorch][PR] Automated submodule update: tensorpipe

81ae8ed

Test Plan: revert-hammer Differential Revision: D26018916 (5f297cc) Original commit changeset: dc8aaa98d4e0 fbshipit-source-id: cd81a7950c7141e0711faabf03292098a8cf14d3

Move AcceleratedGraphModule out of graph_manipulation.

c8a24eb

Test Plan: buck test //caffe2/test:test_fx_experimental buck test //glow/fb/fx_nnpi_importer:test_importer Reviewed By: jfix71 Differential Revision: D25675618 fbshipit-source-id: 55636bb2d3d6102b400f2044118a450906954083

Revert D25675618: Move AcceleratedGraphModule out of graph_manipulation.

5748410

Test Plan: revert-hammer Differential Revision: D25675618 (c8a24eb) Original commit changeset: 55636bb2d3d6 fbshipit-source-id: 7b196f7c32830061eca9c89bbcb346cdd66a211e

Delete tabs from bench_approx.cpp (#51157)

97ea95d

Summary: Introduced by D25981260 (f08464f) Pull Request resolved: #51157 Reviewed By: bwasti Differential Revision: D26090008 Pulled By: malfet fbshipit-source-id: b63f1bb1683c7261902de7eaab24a05a5159ce7e

imaginary-person merged commit 657946d into imaginary-person:master Jan 27, 2021

imaginary-person merged 52 commits into imaginary-person:master from pytorch:master

imaginary-person commented Jan 27, 2021

20 participants
