Merge more changes into forked repo #6

Merged

imaginary-person merged 25 commits into imaginary-person:master on Jan 23, 2021
Conversation
Summary: Pull Request resolved: #49505

I have a problem, which is that static runtime needs a way to bypass dispatch and call into kernels directly. Previously, it used `native::` bindings to do this, but these bindings no longer exist for structured kernels! Enter `at::cpu`: a namespace of exactly `at::`-compatible functions that assume all of their arguments are CPU and non-autograd! The header looks like this:

```
namespace at {
namespace cpu {

CAFFE2_API Tensor & add_out(Tensor & out, const Tensor & self, const Tensor & other, Scalar alpha=1);
CAFFE2_API Tensor add(const Tensor & self, const Tensor & other, Scalar alpha=1);
CAFFE2_API Tensor & add_(Tensor & self, const Tensor & other, Scalar alpha=1);
CAFFE2_API Tensor & upsample_nearest1d_out(Tensor & out, const Tensor & self, IntArrayRef output_size, c10::optional<double> scales=c10::nullopt);
CAFFE2_API Tensor upsample_nearest1d(const Tensor & self, IntArrayRef output_size, c10::optional<double> scales=c10::nullopt);
CAFFE2_API Tensor & upsample_nearest1d_backward_out(Tensor & grad_input, const Tensor & grad_output, IntArrayRef output_size, IntArrayRef input_size, c10::optional<double> scales=c10::nullopt);
CAFFE2_API Tensor upsample_nearest1d_backward(const Tensor & grad_output, IntArrayRef output_size, IntArrayRef input_size, c10::optional<double> scales=c10::nullopt);

}}
```

This slows down static runtime because these are not the "allow resize of nonzero tensor" variant bindings (unlike the ones I had manually written). We can restore this: it's a matter of adding codegen smarts to do it, but I haven't done that just yet since it's marginally more complicated. In principle, non-structured kernels could get this treatment too. But, like an evil mastermind, I'm withholding it from this patch, as an extra carrot to get people to migrate to structured, muahahahaha.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS Reviewed By: smessmer Differential Revision: D25616105 Pulled By: ezyang fbshipit-source-id: 84955ae09d0b373ca1ed05e0e4e0074a18d1a0b5
Summary: We added this option in #48248, but it would be good to document it somewhere as well, hence adding it to this contributing doc. Pull Request resolved: #50861 Reviewed By: mrshenli Differential Revision: D26014505 Pulled By: rohan-varma fbshipit-source-id: c1321679f01dd52038131ff571362ad36884510a
Summary: Closes #50513 by resolving the first three checkboxes. If this PR is merged, I will also modify one or both of the following wiki pages to add instructions on how to use this `mypy` wrapper for VS Code editor integration:

- [Guide for adding type annotations to PyTorch](https://github.com/pytorch/pytorch/wiki/Guide-for-adding-type-annotations-to-PyTorch)
- [Lint as you type](https://github.com/pytorch/pytorch/wiki/Lint-as-you-type)

The test plan below is fairly manual, so let me know if I should add more automated tests to this PR.

Pull Request resolved: #50826

Test Plan: Unit tests for globbing function:

```
python test/test_testing.py TestMypyWrapper -v
```

Manual checks:

- Uninstall `mypy` and run `python test/test_type_hints.py` to verify that it still works when `mypy` is absent.
- Reinstall `mypy` and run `python test/test_type_hints.py` to verify that this didn't break the `TestTypeHints` suite.
- Run `python test/test_type_hints.py` again (should finish quickly) to verify that this didn't break `mypy` caching.
- Run `torch/testing/_internal/mypy_wrapper.py` on a few Python files in this repo to verify that it doesn't give any additional warnings when the `TestTypeHints` suite passes. Some examples (compare with the behavior of just running `mypy` on these files):

  ```sh
  torch/testing/_internal/mypy_wrapper.py README.md
  torch/testing/_internal/mypy_wrapper.py tools/fast_nvcc/fast_nvcc.py
  torch/testing/_internal/mypy_wrapper.py test/test_type_hints.py
  torch/testing/_internal/mypy_wrapper.py torch/random.py
  torch/testing/_internal/mypy_wrapper.py torch/testing/_internal/mypy_wrapper.py
  ```

- Remove type hints from `torch.testing._internal.mypy_wrapper` and verify that running `mypy_wrapper.py` on that file gives type errors.
- Remove the path to `mypy_wrapper.py` from the `files` setting in `mypy-strict.ini` and verify that running it again on itself no longer gives type errors.
- Add `test/test_type_hints.py` to the `files` setting in `mypy-strict.ini` and verify that running the `mypy` wrapper on it again now gives type errors.
- Remove type hints from `torch/random.py` and verify that running the `mypy` wrapper on it again now gives type errors.
- Add the suggested JSON from the docstring of `torch.testing._internal.mypy_wrapper.main` to your `.vscode/settings.json` and verify that VS Code gives the same results (inline, while editing any Python file in the repo) as running the `mypy` wrapper on the command line, in all the above cases.

Reviewed By: glaringlee, walterddr Differential Revision: D25977352 Pulled By: samestep fbshipit-source-id: 4b3a5e8a9071fcad65a19f193bf3dc7dc3ba1b96
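For context, here is a minimal sketch of the kind of per-file config matching such a wrapper performs (the helper name and example globs are hypothetical, not the wrapper's actual implementation): given the `files` globs from each ini file, decide which mypy config(s) apply to the file being edited.

```python
# Hypothetical sketch, not the real mypy_wrapper implementation: decide
# which mypy configs cover a given file based on their `files` globs.
import fnmatch
from typing import Dict, List

# Assumed example data standing in for the `files` settings of each config.
CONFIG_FILES: Dict[str, List[str]] = {
    "mypy.ini": ["torch/*.py", "test/test_type_hints.py"],
    "mypy-strict.ini": ["torch/testing/_internal/mypy_wrapper.py", "tools/*.py"],
}

def configs_for(path: str) -> List[str]:
    """Return the configs whose `files` globs match `path`.

    Note that fnmatch's `*` matches across `/`, unlike shell globs.
    """
    return [
        cfg for cfg, globs in CONFIG_FILES.items()
        if any(fnmatch.fnmatch(path, g) for g in globs)
    ]

print(configs_for("torch/random.py"))               # ['mypy.ini']
print(configs_for("tools/fast_nvcc/fast_nvcc.py"))  # ['mypy-strict.ini']
```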
Summary: This PR adds a simple debugging helper which exports the AliasDb state as a [GraphViz](http://www.graphviz.org/) graph definition. The generated files can be viewed with any Graphviz viewer (including online based, for example http://viz-js.com) Usage: 1. Call `AliasDb::dumpToGraphvizFile()` from a debugger. Using gdb for example: `call aliasDb_->dumpToGraphvizFile("alias.dot")` 2. Add explicit calls to `AliasDb::dumpToGraphvizFile()`, which returns `true` if it succeeds. An example output file is attached: [example.zip](https://github.com/pytorch/pytorch/files/5805840/example.zip) Pull Request resolved: #50452 Reviewed By: ngimel Differential Revision: D25980222 Pulled By: eellison fbshipit-source-id: 47805a0a81ce73c6ba859340d37b9a806f9000d5
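To render the dump outside a viewer, one option is the third-party `graphviz` Python package (assuming it and the Graphviz binaries are installed; this is not part of the PR):

```python
# Render an AliasDb dump produced by AliasDb::dumpToGraphvizFile("alias.dot").
# Assumes `pip install graphviz` and the Graphviz `dot` binary on PATH.
import graphviz

with open("alias.dot") as f:
    src = graphviz.Source(f.read())
src.render("alias", format="png", cleanup=True)  # writes alias.png
```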
Summary:
Used the Caffe2 Swish implementation to implement the operator. Will need
to resolve the error introduced below.
```
test_quantized_swish_2D (tests.operators.testQuantizedSilu.TestSiLU) ... input:
(tensor([[-6.0000, -5.9961, -5.9922, ..., -5.7734, -5.7695, -5.7656],
[-5.7617, -5.7539, -5.7500, ..., -5.5352, -5.5312, -5.5234],
[-5.5195, -5.5156, -5.5117, ..., -5.2930, -5.2891, -5.2852],
...,
[ 5.2852, 5.2891, 5.2930, ..., 5.5117, 5.5156, 5.5195],
[ 5.5234, 5.5312, 5.5352, ..., 5.7500, 5.7539, 5.7617],
[ 5.7656, 5.7695, 5.7734, ..., 5.9922, 5.9961, 6.0000]]),)
base_res:
tensor([[-0.0148, -0.0149, -0.0149, ..., -0.0179, -0.0180, -0.0180],
[-0.0181, -0.0182, -0.0182, ..., -0.0218, -0.0218, -0.0220],
[-0.0220, -0.0221, -0.0222, ..., -0.0265, -0.0266, -0.0266],
...,
[ 5.2585, 5.2625, 5.2665, ..., 5.4895, 5.4935, 5.4975],
[ 5.5015, 5.5094, 5.5134, ..., 5.7318, 5.7357, 5.7437],
[ 5.7476, 5.7516, 5.7555, ..., 5.9773, 5.9812, 5.9852]])
tnco_res:
tensor([[-0.0148, -0.0149, -0.0149, ..., -0.0179, -0.0180, -0.0180],
[-0.0181, -0.0182, -0.0182, ..., -0.0218, -0.0218, -0.0220],
[-0.0220, -0.0221, -0.0222, ..., -0.0265, -0.0265, -0.0266],
...,
[ 5.2578, 5.2617, 5.2656, ..., 5.4922, 5.4922, 5.4961],
[ 5.5000, 5.5078, 5.5156, ..., 5.7305, 5.7383, 5.7422],
[ 5.7461, 5.7500, 5.7539, ..., 5.9766, 5.9805, 5.9844]])
nnpi_res:
tensor([[-0.0148, -0.0149, -0.0149, ..., -0.0179, -0.0180, -0.0180],
[-0.0181, -0.0182, -0.0182, ..., -0.0218, -0.0218, -0.0220],
[-0.0220, -0.0221, -0.0222, ..., -0.0265, -0.0266, -0.0266],
...,
[ 5.2585, 5.2625, 5.2665, ..., 5.4895, 5.4935, 5.4975],
[ 5.5015, 5.5094, 5.5134, ..., 5.7318, 5.7357, 5.7437],
[ 5.7476, 5.7516, 5.7555, ..., 5.9773, 5.9812, 5.9852]])
diff:
tensor([[4.1956e-06, 9.8441e-07, 6.0154e-06, ..., 4.2785e-06, 7.6480e-06,
1.0842e-05],
[1.3988e-06, 4.1034e-06, 6.5863e-06, ..., 5.3961e-06, 2.9635e-06,
1.0209e-05],
[1.2219e-06, 7.9758e-06, 1.7386e-05, ..., 3.0547e-07, 2.2141e-05,
1.4316e-05],
...,
[7.0286e-04, 7.8678e-04, 8.7023e-04, ..., 2.6422e-03, 1.3347e-03,
1.4052e-03],
[1.4753e-03, 1.6141e-03, 2.2225e-03, ..., 1.2884e-03, 2.5592e-03,
1.4634e-03],
[1.5216e-03, 1.5793e-03, 1.6365e-03, ..., 6.9284e-04, 7.4100e-04,
7.8964e-04]])
nnpi traced graph:
graph(%self : __torch__.tests.operators.testQuantizedSilu.SiLUModel,
%x : Float(*, *, requires_grad=0, device=cpu)):
%3 : None = prim::Constant()
%4 : bool = prim::Constant[value=0]() # /data/users/kaus/fbsource/fbcode/buck-out/dev/gen/glow/fb/torch_glow/custom_nnpi_ops/testQuantizedSilu#binary,link-tree/tests/operators/testQuantizedSilu.py:13:0
%5 : Device = prim::Constant[value="cpu"]() # /data/users/kaus/fbsource/fbcode/buck-out/dev/gen/glow/fb/torch_glow/custom_nnpi_ops/testQuantizedSilu#binary,link-tree/tests/operators/testQuantizedSilu.py:13:0
%6 : int = prim::Constant[value=0]() # /data/users/kaus/fbsource/fbcode/buck-out/dev/gen/glow/fb/torch_glow/custom_nnpi_ops/testQuantizedSilu#binary,link-tree/tests/operators/testQuantizedSilu.py:13:0
%7 : int = prim::Constant[value=6]() # /data/users/kaus/fbsource/fbcode/buck-out/dev/gen/glow/fb/torch_glow/custom_nnpi_ops/testQuantizedSilu#binary,link-tree/tests/operators/testQuantizedSilu.py:13:0
%8 : Float(*, *, requires_grad=0, device=cpu) = aten::zeros_like(%x, %7, %6, %5, %4, %3) # /data/users/kaus/fbsource/fbcode/buck-out/dev/gen/glow/fb/torch_glow/custom_nnpi_ops/testQuantizedSilu#binary,link-tree/tests/operators/testQuantizedSilu.py:13:0
%input : Float(*, *, requires_grad=0, device=cpu) = glow::FusionGroup_0(%x, %8)
%10 : Tensor = aten::silu(%input) # /data/users/kaus/fbsource/fbcode/buck-out/dev/gen/glow/fb/torch_glow/custom_nnpi_ops/testQuantizedSilu#binary,link-tree/torch/nn/functional.py:1804:0
return (%10)
with glow::FusionGroup_0 = graph(%0 : Float(*, *, requires_grad=0, device=cpu),
%1 : Float(*, *, requires_grad=0, device=cpu)):
%2 : int = prim::Constant[value=1]()
%input : Float(*, *, requires_grad=0, device=cpu) = aten::add(%0, %1, %2) # /data/users/kaus/fbsource/fbcode/buck-out/dev/gen/glow/fb/torch_glow/custom_nnpi_ops/testQuantizedSilu#binary,link-tree/tests/operators/testQuantizedSilu.py:13:0
%4 : int = prim::Constant[value=1]() # /data/users/kaus/fbsource/fbcode/buck-out/dev/gen/glow/fb/torch_glow/custom_nnpi_ops/testQuantizedSilu#binary,link-tree/tests/operators/testQuantizedSilu.py:13:0
return (%input)
tnco traced graph:
graph(%self : __torch__.tests.operators.testQuantizedSilu.___torch_mangle_0.SiLUModel,
%x : Float(*, *, requires_grad=0, device=cpu)):
%2 : int = prim::Constant[value=1]() # /data/users/kaus/fbsource/fbcode/buck-out/dev/gen/glow/fb/torch_glow/custom_nnpi_ops/testQuantizedSilu#binary,link-tree/tests/operators/testQuantizedSilu.py:13:0
%3 : None = prim::Constant()
%4 : bool = prim::Constant[value=0]() # /data/users/kaus/fbsource/fbcode/buck-out/dev/gen/glow/fb/torch_glow/custom_nnpi_ops/testQuantizedSilu#binary,link-tree/tests/operators/testQuantizedSilu.py:13:0
%5 : Device = prim::Constant[value="cpu"]() # /data/users/kaus/fbsource/fbcode/buck-out/dev/gen/glow/fb/torch_glow/custom_nnpi_ops/testQuantizedSilu#binary,link-tree/tests/operators/testQuantizedSilu.py:13:0
%6 : int = prim::Constant[value=0]() # /data/users/kaus/fbsource/fbcode/buck-out/dev/gen/glow/fb/torch_glow/custom_nnpi_ops/testQuantizedSilu#binary,link-tree/tests/operators/testQuantizedSilu.py:13:0
%7 : int = prim::Constant[value=6]() # /data/users/kaus/fbsource/fbcode/buck-out/dev/gen/glow/fb/torch_glow/custom_nnpi_ops/testQuantizedSilu#binary,link-tree/tests/operators/testQuantizedSilu.py:13:0
%8 : Float(*, *, requires_grad=0, device=cpu) = aten::zeros_like(%x, %7, %6, %5, %4, %3) # /data/users/kaus/fbsource/fbcode/buck-out/dev/gen/glow/fb/torch_glow/custom_nnpi_ops/testQuantizedSilu#binary,link-tree/tests/operators/testQuantizedSilu.py:13:0
%12 : Tensor = fakeNNPI::addFP16(%x, %8, %2)
%11 : Tensor = fakeNNPI::siluFP16(%12)
return (%11)
FAIL
======================================================================
FAIL: test_quantized_swish_2D (tests.operators.testQuantizedSilu.TestSiLU)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/data/users/kaus/fbsource/fbcode/buck-out/dev/gen/glow/fb/torch_glow/custom_nnpi_ops/testQuantizedSilu#binary,link-tree/tests/operators/testQuantizedSilu.py", line 26, in test_quantized_swish_2D
validate_nnpi_model(model, (x,), expected_ops, [])
File "/data/users/kaus/fbsource/fbcode/buck-out/dev/gen/glow/fb/torch_glow/custom_nnpi_ops/testQuantizedSilu#binary,link-tree/tests/utils.py", line 73, in validate_nnpi_model
assert is_equal
AssertionError
```
Test Plan: Run the test with `buck test mode/dev //glow/fb/torch_glow/custom_nnpi_ops:testQuantizedSilu`
Reviewed By: hyuen
Differential Revision: D25981369
fbshipit-source-id: dd0f3686b3cbf6fc575c959c7661125ecbf0b0db
Summary: Pull Request resolved: #50949 When converting an RPC Message into Python objects, we were not using a CUDAFuture for the chained Future. As a result, the streams are not synchronized when calling `rpc_async(...).wait()`. This commit uses the `Future::then` API to create the chained Future, which creates a CUDAFuture if the existing Future is a CUDA one. Fixes #50881. Fixes #50839. Test Plan: Imported from OSS Reviewed By: pritamdamania87 Differential Revision: D26020458 Pulled By: mrshenli fbshipit-source-id: 25195fbc10b99f4c401ec3ed7a382128464b5f08
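For reference, `then` chaining is also exposed in Python via `torch.futures.Future`; a minimal CPU-only sketch of the chaining behavior (the CUDA-stream handling fixed here lives inside the C++ `Future::then` and is not visible at this level):

```python
# Chaining with Future.then: the callback receives the completed future, and
# its return value becomes the chained future's value. On the C++ side,
# then() is what now preserves CUDA-awareness for rpc_async results.
import torch

fut = torch.futures.Future()
chained = fut.then(lambda f: f.value() + 1)
fut.set_result(torch.tensor(1))
print(chained.wait())  # tensor(2)
```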
Summary: Pull Request resolved: #47667 Test Plan: Imported from OSS Reviewed By: anjali411, ngimel Differential Revision: D25255572 Pulled By: Krovatkin fbshipit-source-id: d0152c9ef5b1994e27be9888bcb123dca3ecd88f
…rt (#50793) Summary: This contains some improvements and refactoring to how patching is done in `torch.fx.symbolic_trace`.

1) Functions from `math.*` are now supported without needing to call `torch.fx.wrap()`. `wrap()` actually errors on some of these functions because they are written in C and don't have `__code__`, requiring use of the string version. `math` usage is relatively common; for example, [BERT uses math.sqrt here](https://github.com/pytorch/benchmark/blob/6f79061bd145eeaa9b4a75847939901fd245ddf9/torchbenchmark/models/BERT_pytorch/bert_pytorch/model/attention/single.py#L16). Both `math.sqrt()` and `from math import sqrt` (copying to the module namespace) are supported. When modules are called, FX now searches the module's global scope to find methods to patch. A sketch of the newly supported pattern is shown after this list.

2) [Guarded behind `env FX_PATCH_GETITEM=1`] Fixes a failed trace of [PositionalEmbedding from BERT](https://github.com/pytorch/benchmark/blob/6f79061bd145eeaa9b4a75847939901fd245ddf9/torchbenchmark/models/BERT_pytorch/bert_pytorch/model/embedding/position.py#L24), which failed to trace with the error `TypeError: slice indices must be integers or None or have an __index__ method` (a Proxy() is getting passed into `Tensor.__getitem__`). See #50710 for why this is disabled by default.

3) Support for automatically wrapping methods that may have been copied to a different module scope via an import like `from foo import wrapped_function`. This isn't exposed in `torch.fx.wrap` either, but is used to implement the `math.*` support.

Pull Request resolved: #50793 Test Plan: Added unittests to check each feature Reviewed By: jamesr66a Differential Revision: D25999788 Pulled By: jansel fbshipit-source-id: f1ce11a69b7d97f26c9e2741c6acf9c513a84467
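A minimal sketch of item 1, modeled on the BERT attention pattern referenced above (not code from the PR):

```python
# Sketch of the math.* support: math.sqrt is applied to a traced value
# without any torch.fx.wrap() call.
import math
import torch
import torch.fx

class ScaledScores(torch.nn.Module):
    def forward(self, query: torch.Tensor, scores: torch.Tensor) -> torch.Tensor:
        # query.size(-1) is a Proxy during tracing; math.sqrt on it is now
        # intercepted by FX's patching instead of raising.
        return scores / math.sqrt(query.size(-1))

gm = torch.fx.symbolic_trace(ScaledScores())
print(gm.code)  # the generated forward contains a call to math.sqrt
```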
Summary: Also, get rid of the MSVC-specific `_USE_MATH_DEFINES`. Test at compile time that `c10::pi<double> == M_PI`. Pull Request resolved: #50819 Reviewed By: albanD Differential Revision: D25976330 Pulled By: malfet fbshipit-source-id: 8f3ddfd58a5aa4bd382da64ad6ecc679706d1284
Summary: Pull Request resolved: #50740 This PR adds a `check_batched_grad=True` option to NewModuleTest-generated NN tests. Test Plan: - run tests (`pytest test/test_nn.py -v -rf`) Reviewed By: ejguan Differential Revision: D25997679 Pulled By: zou3519 fbshipit-source-id: b75e73d7e86fd3af9bad6efed7127b36551587b3
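The same flag exists on the standalone `torch.autograd.gradcheck` API; a minimal example of the extra check these generated tests now run:

```python
# gradcheck with check_batched_grad=True additionally verifies gradients
# computed through vmap-style batching, which is what the
# NewModuleTest-generated NN tests now exercise.
import torch

x = torch.randn(3, dtype=torch.double, requires_grad=True)
assert torch.autograd.gradcheck(torch.sin, (x,), check_batched_grad=True)
```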
…50916) Summary: `philox_engine_inputs()` is deprecated. Callers should refactor to use `philox_cuda_state()`, and AFAIK all call sites in ATen have already been refactored, but in the meantime, on behalf of other consumers (i.e. extensions, and possibly some lingering call sites in the JIT), `philox_engine_inputs` should handle the increment the same way `philox_cuda_state` does. Pull Request resolved: #50916 Reviewed By: mrshenli Differential Revision: D26022618 Pulled By: ngimel fbshipit-source-id: 17178ad099ddc17d3596b9508ae4dce729b44f57
…r-friendly wrapper Test Plan: revert-hammer Differential Revision: D25977352 (73dffc8) Original commit changeset: 4b3a5e8a9071 fbshipit-source-id: a0383ea4158f54be6f128b9ddb2cd12fc3a3ea53
Summary: Pull Request resolved: #50966 Test Plan: Imported from OSS Reviewed By: suo Differential Revision: D26029101 Pulled By: jamesr66a fbshipit-source-id: 4374771be74d0a4d05fdd29107be5357130c2a76
Summary: Because it is shorter, faster, and does not have the TF32 issue. Benchmark: https://github.com/zasdfgbnm/things/blob/master/2021Q1/kron.ipynb Pull Request resolved: #50927 Reviewed By: glaringlee Differential Revision: D26022385 Pulled By: ngimel fbshipit-source-id: 513c9e9138c35c70d3a475a8407728af21321dae
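For reference, a quick numerical sanity check of `torch.kron` against its definition (unrelated to the benchmark above):

```python
# torch.kron computes the Kronecker product; check it against the definition
# kron(A, B)[i*p + k, j*q + l] = A[i, j] * B[k, l] via broadcasting.
import torch

A = torch.tensor([[1., 2.], [3., 4.]])
B = torch.tensor([[0., 1.], [1., 0.]])
K = torch.kron(A, B)
# Broadcast (2,1,2,1) * (1,2,1,2) -> (2,2,2,2), then flatten to (4,4).
ref = (A[:, None, :, None] * B[None, :, None, :]).reshape(4, 4)
assert torch.equal(K, ref)
```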
…t the extra_files map. (#50932) Summary: Pull Request resolved: #50932 After the change to split `_load_for_mobile()` into multiple methods, one which takes in the `extra_files` map and one which doesn't, we can change the implementation of the `deserialize()` method to use different overloads as well. Suggested by raziel on D25968216 (bb909d2). ghstack-source-id: 120185089 Test Plan: Build/Sandcastle. Reviewed By: JacobSzwejbka Differential Revision: D26014084 fbshipit-source-id: 914142137346a6246def1acf38a3204dd4c4f52f
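For context, the Python-level analogue of the extra-files map (the `_load_for_mobile()` overloads themselves are C++ APIs; this only illustrates the concept via `torch.jit`):

```python
# Save a scripted module with an extra file, then read it back through the
# _extra_files map; this mirrors the extra_files overloads on the mobile side.
import torch

m = torch.jit.script(torch.nn.Linear(2, 2))
torch.jit.save(m, "model.pt", _extra_files={"metadata.json": '{"v": 1}'})

extra = {"metadata.json": ""}        # keys to fetch are pre-populated
torch.jit.load("model.pt", _extra_files=extra)
print(extra["metadata.json"])        # populated with the saved contents
```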
Summary: Pull Request resolved: #50859

Test Plan: Unit test:

```
buck test //caffe2/test:torch
```

Benchmark:

```
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 numactl -m 0 -C 13 \
  ./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench \
  --scripted_model=/home/hlu/ads/adindexer/adindexer_ctr_mobilefeed/pt/merge_v2/traced_precomputation.pt \
  --pt_inputs=/home/hlu/ads/adindexer/adindexer_ctr_mobilefeed/pt/merge_v2/container_precomputation_bs20.pt \
  --iters=10000 --warmup_iters=10000 --num_threads=1 --pt_enable_static_runtime=true \
  --pt_cleanup_activations=true --pt_enable_out_variant=true --do_profile=true
```

Reduces the total time spent on flatten from 1.22% to 0.97% (a net 0.25% reduction).

```
Before: Static runtime ms per iter: 0.0725054. Iters per second: 13792.1
        0.000857179 ms. 1.21862%. aten::flatten (1 nodes)
After:  Static runtime ms per iter: 0.0720371. Iters per second: 13881.7
        0.000686155 ms. 0.97151%. aten::flatten (1 nodes)
```

Reviewed By: ajyu Differential Revision: D25986759 fbshipit-source-id: dc0f542c56a688d331d349845b78084577970476
Summary: Pull Request resolved: #50851 Improves upon the previous unittest to ensure allreduce_hook results in the same gradients as vanilla allreduce in DDP. ghstack-source-id: 120229103 Test Plan: buck build mode/dev-nosan //caffe2/test/distributed:distributed_nccl_fork --keep-going BACKEND=nccl WORLD_SIZE=2 ~/fbcode/buck-out/dev/gen/caffe2/test/distributed/distributed_nccl_fork#binary.par -r test_ddp_hook_parity Reviewed By: SciPioneer Differential Revision: D25963654 fbshipit-source-id: d55eee0aee9cf1da52aa0c4ba1066718aa8fd9a4
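A minimal sketch of the configuration under test (assumes an initialized NCCL process group and one GPU per rank; a simplified stand-in for the actual test harness, not the test itself):

```python
# DDP with allreduce_hook registered: its gradients should match vanilla
# DDP allreduce, which is what test_ddp_hook_parity asserts. Assumes
# init_process_group("nccl") has already run.
import torch
from torch.distributed.algorithms.ddp_comm_hooks import default_hooks
from torch.nn.parallel import DistributedDataParallel as DDP

def build_hooked_model(local_rank: int) -> DDP:
    model = DDP(
        torch.nn.Linear(10, 10).cuda(local_rank),
        device_ids=[local_rank],
    )
    # state=None makes the hook fall back to the default process group.
    model.register_comm_hook(state=None, hook=default_hooks.allreduce_hook)
    return model
```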
Summary: Pull Request resolved: #50624 Add a TorchScript-compatible Adam functional optimizer to the distributed optimizer. Test Plan: Imported from OSS Reviewed By: rohan-varma Differential Revision: D25932770 Pulled By: wanchaol fbshipit-source-id: cab3f1164c76186969c284a2c52481b79bbb7190
Summary: Pull Request resolved: #50618 Test Plan: Imported from OSS Reviewed By: rohan-varma Differential Revision: D25932778 Pulled By: wanchaol fbshipit-source-id: 8df3567b477bc5ba3556b8c5294cd3da5db963ad
Summary: Pull Request resolved: #50623 Add a TorchScript-compatible Adadelta functional optimizer to the distributed optimizer. Test Plan: Imported from OSS Reviewed By: rohan-varma Differential Revision: D25932772 Pulled By: wanchaol fbshipit-source-id: d59b04e5f0b6bab7e0d1c5f68e66249a65958e0b
Summary: Pull Request resolved: #50619 Add a TorchScript-compatible RMSprop functional optimizer to the distributed optimizer. Test Plan: Imported from OSS Reviewed By: rohan-varma Differential Revision: D25932775 Pulled By: wanchaol fbshipit-source-id: bd4854f9f95a740e02a1bebe24f780488460ba4d
Summary: Pull Request resolved: #50620 Add a TorchScript-compatible AdamW functional optimizer to the distributed optimizer. Test Plan: Imported from OSS Reviewed By: rohan-varma Differential Revision: D25932774 Pulled By: wanchaol fbshipit-source-id: 64eb4aeaa3cab208d0ebbec7c4d91a9d43951947
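These functional variants are what `DistributedOptimizer` can dispatch to when a TorchScript-compatible implementation is registered for the given optimizer class; a minimal sketch (assumes `torch.distributed.rpc` is initialized and `params_rrefs` holds RRefs to remote parameters, setup not shown):

```python
# Sketch: DistributedOptimizer constructed with optim.Adam, which can be
# mapped internally to the functional Adam added here. The RPC setup and
# params_rrefs are assumed, not shown.
import torch.optim as optim
from torch.distributed.optim import DistributedOptimizer

def make_dist_optimizer(params_rrefs):
    return DistributedOptimizer(optim.Adam, params_rrefs, lr=0.05)
```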
Summary: Pull Request resolved: #50380 Test Plan: Imported from OSS Reviewed By: ezyang Differential Revision: D25949361 Pulled By: anjali411 fbshipit-source-id: 9910bc5b532c9bf3add530221d643b2c41c62d01
Summary: Pull Request resolved: #50321 The quantization team reported that when two empty tensors are replicated among ranks, the two empty tensors start to share storage after resizing. The root cause is that `unflatten_dense_tensor` unflattened the empty tensor as a view of the flat tensor, which thus shared storage with other tensors. This PR avoids unflattening the empty tensor as a view of the flat tensor, so that empty tensors will not share storage with other tensors. Test Plan: unit test Reviewed By: pritamdamania87 Differential Revision: D25859503 fbshipit-source-id: 5b760b31af6ed2b66bb22954cba8d1514f389cca
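A small illustration of the view/storage mechanics involved, using internal `torch._utils` helpers (internal APIs, subject to change; this shows the sharing relationship, not the DDP replication path itself):

```python
# Unflattened views share the flat buffer's storage; the check below prints
# whether each view does. After this PR, the empty tensor should no longer
# share the flat tensor's storage. Uses internal torch._utils helpers.
import torch
from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors

tensors = [torch.zeros(2), torch.zeros(0), torch.zeros(3)]
flat = _flatten_dense_tensors(tensors)
views = _unflatten_dense_tensors(flat, tensors)

for v in views:
    print(v.shape, v.storage().data_ptr() == flat.storage().data_ptr())
```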
imaginary-person pushed a commit that referenced this pull request on May 26, 2021
Summary: Added more statistic info for static runtime.

Test Plan: caffe2/benchmarks/static_runtime:static_runtime_cpptest

Expected output example:

```
Static runtime ms per iter: 0.939483. Iters per second: 1064.41
Node #0: 0.195671 ms/iter, %wide_offset.1 : Tensor = aten::add(%wide.1, %self._mu, %4)
Node #1: 0.169457 ms/iter, %wide_normalized.1 : Tensor = aten::mul(%wide_offset.1, %self._sigma)
Node #2: 0.118218 ms/iter, %wide_preproc.1 : Tensor = aten::clamp(%wide_normalized.1, %5, %6)
Node #3: 0.038814 ms/iter, %user_emb_t.1 : Tensor = aten::transpose(%user_emb.1, %4, %7)
Node #4: 0.0860747 ms/iter, %dp_unflatten.1 : Tensor = aten::bmm(%ad_emb_packed.1, %user_emb_t.1)
Node #5: 0.0102666 ms/iter, %31 : Tensor = static_runtime::flatten_copy(%dp_unflatten.1, %4, %8)
Node #6: 0.000476333 ms/iter, %19 : Tensor[] = prim::ListConstruct(%31, %wide_preproc.1)
Node #7: 0.0707332 ms/iter, %input.1 : Tensor = aten::cat(%19, %4)
Node #8: 0.123695 ms/iter, %fc1.1 : Tensor = aten::addmm(%self._fc_b, %input.1, %29, %4, %4)
Node #9: 0.0309244 ms/iter, %23 : Tensor = aten::sigmoid(%fc1.1)
Node #10: 0.0046297 ms/iter, %24 : (Tensor) = prim::TupleConstruct(%23)
Time per node type:
0.195671 ms. 23.0483%. aten::add (1 nodes)
0.169457 ms. 19.9605%. aten::mul (1 nodes, out variant)
0.123695 ms. 14.5702%. aten::addmm (1 nodes, out variant)
0.118218 ms. 13.925%. aten::clamp (1 nodes, out variant)
0.0860747 ms. 10.1388%. aten::bmm (1 nodes, out variant)
0.0707332 ms. 8.33175%. aten::cat (1 nodes, out variant)
0.038814 ms. 4.57195%. aten::transpose (1 nodes)
0.0309244 ms. 3.64263%. aten::sigmoid (1 nodes, out variant)
0.0102666 ms. 1.20932%. static_runtime::flatten_copy (1 nodes, out variant)
0.0046297 ms. 0.545338%. prim::TupleConstruct (1 nodes, out variant)
0.000476333 ms. 0.0561079%. prim::ListConstruct (1 nodes, out variant)
0.848959 ms. in Total
StaticRuntime setup time: 0.018925 ms
Memory allocation time: 0.019808 ms
Memory deallocation time: 0.0120445 ms
Outputs deallocation time: 0.0864947 ms
Total memory managed: 19328 bytes
Total number of reused tensors: 3
Total number of 'out' variant nodes/total number of nodes: 9/11 (81.8182%)
```

Reviewed By: hlu1 Differential Revision: D28553029 fbshipit-source-id: 55e7eab50b4b475ae219896100bdf4f6678875a4
imaginary-person pushed a commit that referenced this pull request on Jul 2, 2021
Summary: Pull Request resolved: pytorch#60987 We were seeing deadlocks as follows during shutdown:

```
Thread 1 (LWP 2432101):
#0  0x00007efca470190b in __pause_nocancel () from /lib64/libc.so.6
#1  0x00007efca49de485 in __pthread_mutex_lock_full () from /lib64/libpthread.so.0
#2  0x00007ef91d4c42c6 in __cuda_CallJitEntryPoint () from /lib64/libnvidia-ptxjitcompiler.so.1
#3  0x00007efc651ac8f1 in ?? () from /lib64/libcuda.so
#4  0x00007efc651aee03 in ?? () from /lib64/libcuda.so
#5  0x00007efc64f76b84 in ?? () from /lib64/libcuda.so
#6  0x00007efc64f77f5d in ?? () from /lib64/libcuda.so
#7  0x00007efc64eac858 in ?? () from /lib64/libcuda.so
#8  0x00007efc64eacfbc in ?? () from /lib64/libcuda.so
#9  0x00007efc7810a924 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#10 0x00007efc780fa2be in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#11 0x00007efc78111044 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#12 0x00007efc7811580a in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#13 0x00007efc78115aa4 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#14 0x00007efc781079ec in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#15 0x00007efc780e6a7a in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#16 0x00007efc7811cfa5 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#17 0x00007efc777ea98c in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#18 0x00007efc777ebd80 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#19 0x00007efc777ea2c9 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#20 0x00007efc778c2e2d in cublasDestroy_v2 () from /usr/local/cuda/lib64/libcublas.so.11
#21 0x00007efc51a3fb56 in std::_Sp_counted_ptr_inplace<at::cuda::(anonymous namespace)::DeviceThreadHandlePool<cublasContext*, &at::cuda::(anonymous namespace)::createCublasHandle, &at::cuda::(anonymous namespace)::destroyCublasHandle>, std::allocator<at::cuda::(anonymous namespace)::DeviceThreadHandlePool<cublasContext*, &at::cuda::(anonymous namespace)::createCublasHandle, &at::cuda::(anonymous namespace)::destroyCublasHandle> >, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /data/users/pritam/pytorch/torch/lib/libtorch_cuda.so
#22 0x00007efc51a3fc5f in std::shared_ptr<at::cuda::(anonymous namespace)::DeviceThreadHandlePool<cublasContext*, &at::cuda::(anonymous namespace)::createCublasHandle, &at::cuda::(anonymous namespace)::destroyCublasHandle> >::~shared_ptr() () from /data/users/pritam/pytorch/torch/lib/libtorch_cuda.so
#23 0x00007efca4648b0c in __run_exit_handlers () from /lib64/libc.so.6
#24 0x00007efca4648c40 in exit () from /lib64/libc.so.6
#25 0x0000558c8852e5f9 in Py_Exit (sts=0) at /tmp/build/80754af9/python_1614362349910/work/Python/pylifecycle.c:2292
#26 0x0000558c8852e6a7 in handle_system_exit () at /tmp/build/80754af9/python_1614362349910/work/Python/pythonrun.c:636
#27 0x0000558c8852e742 in PyErr_PrintEx (set_sys_last_vars=<optimized out>, set_sys_last_vars=<optimized out>) at /tmp/build/80754af9/python_1614362349910/work/Python/pythonrun.c:646
#28 0x0000558c88540dd6 in PyRun_SimpleStringFlags (command=0x7efca4dc9050 "from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=9, pipe_handle=13)\n", flags=0x7ffe3a986110) at /tmp/build/80754af9/python_1614362349910/work/Python/pythonrun.c:457
#29 0x0000558c88540ead in pymain_run_command (cf=0x7ffe3a986110, command=<optimized out>) at /tmp/build/80754af9/python_1614362349910/work/Modules/main.c:420
#30 pymain_run_python (pymain=0x7ffe3a986220) at /tmp/build/80754af9/python_1614362349910/work/Modules/main.c:2907
#31 pymain_main (pymain=0x7ffe3a986220) at /tmp/build/80754af9/python_1614362349910/work/Modules/main.c:3460
#32 0x0000558c8854122c in _Py_UnixMain (argc=<optimized out>, argv=<optimized out>) at /tmp/build/80754af9/python_1614362349910/work/Modules/main.c:3495
#33 0x00007efca4632493 in __libc_start_main () from /lib64/libc.so.6
#34 0x0000558c884e5e90 in _start () at ../sysdeps/x86_64/elf/start.S:103
```

This was likely caused by a static singleton that wasn't leaky. Following the guidance in https://isocpp.org/wiki/faq/ctors#construct-on-first-use-v2 to use a leaky singleton instead. ghstack-source-id: 132847448 Test Plan: Verified locally. Reviewed By: malfet Differential Revision: D29468866 fbshipit-source-id: 89250594c5cd2643417b1da584c658b742dc5a5c
imaginary-person pushed a commit that referenced this pull request on Jul 20, 2021
Summary: Pull Request resolved: pytorch#61588 As part of debugging pytorch#60290, we discovered the following deadlock:

```
Thread 79 (Thread 0x7f52ff7fe700 (LWP 205437)):
#0  pthread_cond_timedwait@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
#1  0x0000564880199152 in PyCOND_TIMEDWAIT (cond=0x564880346080 <gil_cond>, mut=0x564880346100 <gil_mutex>, us=5000) at /home/builder/ktietz/cos6/ci_cos6/python_1622833237666/work/Python/condvar.h:103
#2  take_gil (tstate=0x7f5254005ef0) at /home/builder/ktietz/cos6/ci_cos6/python_1622833237666/work/Python/ceval_gil.h:224
#3  0x0000564880217b62 in PyEval_AcquireThread (tstate=0x7f5254005ef0) at /home/builder/ktietz/cos6/ci_cos6/python_1622833237666/work/Python/ceval.c:278
#4  0x00007f557d54aabd in pybind11::gil_scoped_acquire::gil_scoped_acquire() () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so
#5  0x00007f557da7792f in (anonymous namespace)::concrete_decref_fn(c10::impl::PyInterpreter const*, _object*) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so
#6  0x00007f5560dadba6 in c10::TensorImpl::release_resources() () from /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so
#7  0x00007f5574c885bc in std::_Sp_counted_ptr_inplace<torch::distributed::autograd::DistAutogradContext, std::allocator<torch::distributed::autograd::DistAutogradContext>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#8  0x00007f5574c815e9 in std::__detail::_Hashtable_alloc<std::allocator<std::__detail::_Hash_node<std::pair<long const, std::shared_ptr<torch::distributed::autograd::DistAutogradContext> >, false> > >::_M_deallocate_node(std::__detail::_Hash_node<std::pair<long const, std::shared_ptr<torch::distributed::autograd::DistAutogradContext> >, false>*) [clone .isra.325] () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#9  0x00007f5574c81bf1 in torch::distributed::autograd::DistAutogradContainer::eraseContextIdAndReset(torch::distributed::autograd::DistAutogradContainer::ContextsShard&, long) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#10 0x00007f5574c86e83 in torch::distributed::autograd::DistAutogradContainer::releaseContextIfPresent(long) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#11 0x00007f5574cc6395 in torch::distributed::rpc::RequestCallbackNoPython::processCleanupAutogradContextReq(torch::distributed::rpc::RpcCommandBase&) const () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#12 0x00007f5574cccf15 in torch::distributed::rpc::RequestCallbackNoPython::processRpc(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::vector<c10::Stream, std::allocator<c10::Stream> >) const () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so

Thread 72 (Thread 0x7f53077fe700 (LWP 205412)):
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f55bc62adbd in __GI___pthread_mutex_lock (mutex=0x564884396440) at ../nptl/pthread_mutex_lock.c:80
#2  0x00007f5574c82a2f in torch::distributed::autograd::DistAutogradContainer::retrieveContext(long) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#3  0x00007f557de9bb2f in pybind11::cpp_function::initialize<torch::distributed::autograd::(anonymous namespace)::dist_autograd_init(_object*, _object*)::{lambda(long)#11}, pybind11::dict, long, pybind11::name, pybind11::scope, pybind11::sibling, char [931], pybind11::arg>(torch::distributed::autograd::(anonymous namespace)::dist_autograd_init(_object*, _object*)::{lambda(long)#11}&&, pybind11::dict (*)(long), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, char const (&) [931], pybind11::arg const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so
```

Basically, Thread 72 holds the GIL and tries to acquire the lock for DistAutogradContainer to perform a lookup on a map. On the other hand, Thread 79 holds the lock on DistAutogradContainer to remove a Tensor, and as part of the TensorImpl destructor, `concrete_decref_fn` is called, which waits for the GIL. As a result, we have a deadlock. To fix this issue, I've ensured we release the GIL when we call `retrieveContext` and acquire it later when needed. ghstack-source-id: 133493659 Test Plan: waitforbuildbot Reviewed By: mrshenli Differential Revision: D29682624 fbshipit-source-id: f68a1fb39040ca0447a26e456a97bce64af6b79c
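The fix is the standard lock-ordering remedy: don't hold the GIL while blocking on the container lock. A Python `threading` analogy of the pattern (the real change releases the GIL in C++ via pybind11; this is illustrative only):

```python
# Analogy for the fix, not the actual pybind11 code: never block on the
# container lock while holding the "GIL" lock, since another thread may hold
# the container lock while waiting for the GIL.
import threading

gil = threading.Lock()
container_lock = threading.Lock()

def retrieve_context_fixed():
    # Pattern after the fix: drop the GIL (pybind11::gil_scoped_release in
    # the real code) before taking the container lock, reacquire afterwards.
    gil.release()
    try:
        with container_lock:
            result = "context"  # map lookup happens here in the real code
    finally:
        gil.acquire()
    return result

gil.acquire()                    # caller holds the GIL, as pybind11 would
print(retrieve_context_fixed())
gil.release()
```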