
Merge more changes into forked repo#6

Merged
imaginary-person merged 25 commits into imaginary-person:master from pytorch:master
Jan 23, 2021

Conversation

@imaginary-person
Owner

Merge more changes into forked repo

ezyang and others added 25 commits January 22, 2021 13:11
Summary:
Pull Request resolved: #49505

The problem: static runtime needs a way to bypass
dispatch and call into kernels directly.  Previously, it used
native:: bindings to do this, but those bindings no longer exist
for structured kernels!  Enter at::cpu: a namespace of exactly
at::-compatible functions that assume all of their arguments are
CPU and non-autograd!  The header looks like this:

```
namespace at {
namespace cpu {

CAFFE2_API Tensor & add_out(Tensor & out, const Tensor & self, const Tensor & other, Scalar alpha=1);
CAFFE2_API Tensor add(const Tensor & self, const Tensor & other, Scalar alpha=1);
CAFFE2_API Tensor & add_(Tensor & self, const Tensor & other, Scalar alpha=1);
CAFFE2_API Tensor & upsample_nearest1d_out(Tensor & out, const Tensor & self, IntArrayRef output_size, c10::optional<double> scales=c10::nullopt);
CAFFE2_API Tensor upsample_nearest1d(const Tensor & self, IntArrayRef output_size, c10::optional<double> scales=c10::nullopt);
CAFFE2_API Tensor & upsample_nearest1d_backward_out(Tensor & grad_input, const Tensor & grad_output, IntArrayRef output_size, IntArrayRef input_size, c10::optional<double> scales=c10::nullopt);
CAFFE2_API Tensor upsample_nearest1d_backward(const Tensor & grad_output, IntArrayRef output_size, IntArrayRef input_size, c10::optional<double> scales=c10::nullopt);

}}
```

This slows down static runtime because these are not the "allow
resize of nonzero tensor" variant bindings (unlike the ones I had manually
written).  We can restore this: it's a matter of adding codegen smarts to
do it, but I haven't done so just yet since it's marginally more
complicated.

In principle, non-structured kernels could get this treatment too.
But, like an evil mastermind, I'm withholding it from this patch, as an extra
carrot to get people to migrate to structured muahahahaha.
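The motivation for bypassing dispatch can be sketched generically (this is not PyTorch's actual dispatcher; all names below are hypothetical): a dispatched call pays for a table lookup on every invocation, while a device-specific binding like `at::cpu::add` calls the kernel directly when the caller already knows its arguments are CPU.

```python
# Hypothetical sketch of dispatch vs. a direct device-specific binding.
# Not PyTorch's real dispatcher; names are illustrative only.

def cpu_add(a, b):
    # "Kernel" that assumes both inputs are CPU lists of floats.
    return [x + y for x, y in zip(a, b)]

DISPATCH_TABLE = {"cpu": cpu_add}

def dispatched_add(device, a, b):
    # Generic entry point: one table lookup per call.
    return DISPATCH_TABLE[device](a, b)

# Dispatched path (conceptually, what a generic at:: function does):
print(dispatched_add("cpu", [1.0, 2.0], [3.0, 4.0]))  # [4.0, 6.0]

# Direct path (what an at::cpu-style binding offers static runtime):
# skip the lookup when all arguments are known to be CPU.
print(cpu_add([1.0, 2.0], [3.0, 4.0]))  # [4.0, 6.0]
```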

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: smessmer

Differential Revision: D25616105

Pulled By: ezyang

fbshipit-source-id: 84955ae09d0b373ca1ed05e0e4e0074a18d1a0b5
Summary:
We added this option in #48248, but it would be good to document it somewhere as well, hence adding it to this contributing doc.

Pull Request resolved: #50861

Reviewed By: mrshenli

Differential Revision: D26014505

Pulled By: rohan-varma

fbshipit-source-id: c1321679f01dd52038131ff571362ad36884510a
Summary:
Closes #50513 by resolving the first three checkboxes. If this PR is merged, I will also modify one or both of the following wiki pages to add instructions on how to use this `mypy` wrapper for VS Code editor integration:

- [Guide for adding type annotations to PyTorch](https://github.com/pytorch/pytorch/wiki/Guide-for-adding-type-annotations-to-PyTorch)
- [Lint as you type](https://github.com/pytorch/pytorch/wiki/Lint-as-you-type)

The test plan below is fairly manual, so let me know if I should add more automated tests to this PR.

Pull Request resolved: #50826

Test Plan:
Unit tests for globbing function:
```
python test/test_testing.py TestMypyWrapper -v
```

Manual checks:

- Uninstall `mypy` and run `python test/test_type_hints.py` to verify that it still works when `mypy` is absent.
- Reinstall `mypy` and run `python test/test_type_hints.py` to verify that this didn't break the `TestTypeHints` suite.
- Run `python test/test_type_hints.py` again (should finish quickly) to verify that this didn't break `mypy` caching.
- Run `torch/testing/_internal/mypy_wrapper.py` on a few Python files in this repo to verify that it doesn't give any additional warnings when the `TestTypeHints` suite passes. Some examples (compare with the behavior of just running `mypy` on these files):
  ```sh
  torch/testing/_internal/mypy_wrapper.py README.md
  torch/testing/_internal/mypy_wrapper.py tools/fast_nvcc/fast_nvcc.py
  torch/testing/_internal/mypy_wrapper.py test/test_type_hints.py
  torch/testing/_internal/mypy_wrapper.py torch/random.py
  torch/testing/_internal/mypy_wrapper.py torch/testing/_internal/mypy_wrapper.py
  ```
- Remove type hints from `torch.testing._internal.mypy_wrapper` and verify that running `mypy_wrapper.py` on that file gives type errors.
- Remove the path to `mypy_wrapper.py` from the `files` setting in `mypy-strict.ini` and verify that running it again on itself no longer gives type errors.
- Add `test/test_type_hints.py` to the `files` setting in `mypy-strict.ini` and verify that running the `mypy` wrapper on it again now gives type errors.
- Remove type hints from `torch/random.py` and verify that running the `mypy` wrapper on it again now gives type errors.
- Add the suggested JSON from the docstring of `torch.testing._internal.mypy_wrapper.main` to your `.vscode/settings.json` and verify that VS Code gives the same results (inline, while editing any Python file in the repo) as running the `mypy` wrapper on the command line, in all the above cases.
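The core of such a wrapper is deciding, per file, which mypy config covers it. A rough sketch of that globbing logic (hypothetical names, not the actual `mypy_wrapper` implementation; patterns may be globs or directory prefixes):

```python
import fnmatch

def configs_covering(path, config_files):
    """Return names of configs whose `files` globs match `path`.

    `config_files` maps a config name to the glob patterns listed in
    its `files` setting.  Hypothetical sketch, not the real wrapper.
    """
    matched = []
    for name, patterns in config_files.items():
        for pat in patterns:
            # A pattern can be a glob or a directory prefix.
            if fnmatch.fnmatch(path, pat) or path.startswith(pat.rstrip("/") + "/"):
                matched.append(name)
                break
    return matched

configs = {
    "mypy.ini": ["torch/*.py", "test/"],
    "mypy-strict.ini": ["tools/fast_nvcc/*.py"],
}
print(configs_covering("torch/random.py", configs))  # ['mypy.ini']
print(configs_covering("README.md", configs))        # []
```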

Reviewed By: glaringlee, walterddr

Differential Revision: D25977352

Pulled By: samestep

fbshipit-source-id: 4b3a5e8a9071fcad65a19f193bf3dc7dc3ba1b96
Summary:
This PR adds a simple debugging helper which exports the AliasDb state as a [GraphViz](http://www.graphviz.org/) graph definition. The generated files can be viewed with any Graphviz viewer (including online based, for example http://viz-js.com)

Usage:

1. Call `AliasDb::dumpToGraphvizFile()` from a debugger. Using gdb for example:
`call aliasDb_->dumpToGraphvizFile("alias.dot")`

2. Add explicit calls to `AliasDb::dumpToGraphvizFile()`, which returns `true` if it succeeds.

An example output file is attached: [example.zip](https://github.com/pytorch/pytorch/files/5805840/example.zip)
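For readers unfamiliar with the format, a Graphviz file is plain text: a dump like the one `dumpToGraphvizFile()` produces boils down to emitting nodes and edges in DOT syntax. A minimal hand-rolled sketch (not the actual AliasDb output format):

```python
def to_dot(edges, name="alias_db"):
    # Render a list of (src, dst, label) edges as a Graphviz DOT string,
    # viewable in any Graphviz viewer (e.g. http://viz-js.com).
    lines = [f"digraph {name} {{"]
    for src, dst, label in edges:
        lines.append(f'  "{src}" -> "{dst}" [label="{label}"];')
    lines.append("}")
    return "\n".join(lines)

dot = to_dot([("%x", "WILDCARD", "points-to"), ("%out", "%x", "alias")])
print(dot)
```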

Pull Request resolved: #50452

Reviewed By: ngimel

Differential Revision: D25980222

Pulled By: eellison

fbshipit-source-id: 47805a0a81ce73c6ba859340d37b9a806f9000d5
Summary:
Used the Caffe2 Swish implementation to implement the operator. The
error introduced (shown below) will still need to be resolved.
```
test_quantized_swish_2D (tests.operators.testQuantizedSilu.TestSiLU) ... input:
 (tensor([[-6.0000, -5.9961, -5.9922,  ..., -5.7734, -5.7695, -5.7656],
        [-5.7617, -5.7539, -5.7500,  ..., -5.5352, -5.5312, -5.5234],
        [-5.5195, -5.5156, -5.5117,  ..., -5.2930, -5.2891, -5.2852],
        ...,
        [ 5.2852,  5.2891,  5.2930,  ...,  5.5117,  5.5156,  5.5195],
        [ 5.5234,  5.5312,  5.5352,  ...,  5.7500,  5.7539,  5.7617],
        [ 5.7656,  5.7695,  5.7734,  ...,  5.9922,  5.9961,  6.0000]]),)
base_res:
 tensor([[-0.0148, -0.0149, -0.0149,  ..., -0.0179, -0.0180, -0.0180],
        [-0.0181, -0.0182, -0.0182,  ..., -0.0218, -0.0218, -0.0220],
        [-0.0220, -0.0221, -0.0222,  ..., -0.0265, -0.0266, -0.0266],
        ...,
        [ 5.2585,  5.2625,  5.2665,  ...,  5.4895,  5.4935,  5.4975],
        [ 5.5015,  5.5094,  5.5134,  ...,  5.7318,  5.7357,  5.7437],
        [ 5.7476,  5.7516,  5.7555,  ...,  5.9773,  5.9812,  5.9852]])
tnco_res:
 tensor([[-0.0148, -0.0149, -0.0149,  ..., -0.0179, -0.0180, -0.0180],
        [-0.0181, -0.0182, -0.0182,  ..., -0.0218, -0.0218, -0.0220],
        [-0.0220, -0.0221, -0.0222,  ..., -0.0265, -0.0265, -0.0266],
        ...,
        [ 5.2578,  5.2617,  5.2656,  ...,  5.4922,  5.4922,  5.4961],
        [ 5.5000,  5.5078,  5.5156,  ...,  5.7305,  5.7383,  5.7422],
        [ 5.7461,  5.7500,  5.7539,  ...,  5.9766,  5.9805,  5.9844]])
nnpi_res:
 tensor([[-0.0148, -0.0149, -0.0149,  ..., -0.0179, -0.0180, -0.0180],
        [-0.0181, -0.0182, -0.0182,  ..., -0.0218, -0.0218, -0.0220],
        [-0.0220, -0.0221, -0.0222,  ..., -0.0265, -0.0266, -0.0266],
        ...,
        [ 5.2585,  5.2625,  5.2665,  ...,  5.4895,  5.4935,  5.4975],
        [ 5.5015,  5.5094,  5.5134,  ...,  5.7318,  5.7357,  5.7437],
        [ 5.7476,  5.7516,  5.7555,  ...,  5.9773,  5.9812,  5.9852]])
diff:
 tensor([[4.1956e-06, 9.8441e-07, 6.0154e-06,  ..., 4.2785e-06, 7.6480e-06,
         1.0842e-05],
        [1.3988e-06, 4.1034e-06, 6.5863e-06,  ..., 5.3961e-06, 2.9635e-06,
         1.0209e-05],
        [1.2219e-06, 7.9758e-06, 1.7386e-05,  ..., 3.0547e-07, 2.2141e-05,
         1.4316e-05],
        ...,
        [7.0286e-04, 7.8678e-04, 8.7023e-04,  ..., 2.6422e-03, 1.3347e-03,
         1.4052e-03],
        [1.4753e-03, 1.6141e-03, 2.2225e-03,  ..., 1.2884e-03, 2.5592e-03,
         1.4634e-03],
        [1.5216e-03, 1.5793e-03, 1.6365e-03,  ..., 6.9284e-04, 7.4100e-04,
         7.8964e-04]])
nnpi traced graph:
 graph(%self : __torch__.tests.operators.testQuantizedSilu.SiLUModel,
      %x : Float(*, *, requires_grad=0, device=cpu)):
  %3 : None = prim::Constant()
  %4 : bool = prim::Constant[value=0]() # /data/users/kaus/fbsource/fbcode/buck-out/dev/gen/glow/fb/torch_glow/custom_nnpi_ops/testQuantizedSilu#binary,link-tree/tests/operators/testQuantizedSilu.py:13:0
  %5 : Device = prim::Constant[value="cpu"]() # /data/users/kaus/fbsource/fbcode/buck-out/dev/gen/glow/fb/torch_glow/custom_nnpi_ops/testQuantizedSilu#binary,link-tree/tests/operators/testQuantizedSilu.py:13:0
  %6 : int = prim::Constant[value=0]() # /data/users/kaus/fbsource/fbcode/buck-out/dev/gen/glow/fb/torch_glow/custom_nnpi_ops/testQuantizedSilu#binary,link-tree/tests/operators/testQuantizedSilu.py:13:0
  %7 : int = prim::Constant[value=6]() # /data/users/kaus/fbsource/fbcode/buck-out/dev/gen/glow/fb/torch_glow/custom_nnpi_ops/testQuantizedSilu#binary,link-tree/tests/operators/testQuantizedSilu.py:13:0
  %8 : Float(*, *, requires_grad=0, device=cpu) = aten::zeros_like(%x, %7, %6, %5, %4, %3) # /data/users/kaus/fbsource/fbcode/buck-out/dev/gen/glow/fb/torch_glow/custom_nnpi_ops/testQuantizedSilu#binary,link-tree/tests/operators/testQuantizedSilu.py:13:0
  %input : Float(*, *, requires_grad=0, device=cpu) = glow::FusionGroup_0(%x, %8)
  %10 : Tensor = aten::silu(%input) # /data/users/kaus/fbsource/fbcode/buck-out/dev/gen/glow/fb/torch_glow/custom_nnpi_ops/testQuantizedSilu#binary,link-tree/torch/nn/functional.py:1804:0
  return (%10)
with glow::FusionGroup_0 = graph(%0 : Float(*, *, requires_grad=0, device=cpu),
      %1 : Float(*, *, requires_grad=0, device=cpu)):
  %2 : int = prim::Constant[value=1]()
  %input : Float(*, *, requires_grad=0, device=cpu) = aten::add(%0, %1, %2) # /data/users/kaus/fbsource/fbcode/buck-out/dev/gen/glow/fb/torch_glow/custom_nnpi_ops/testQuantizedSilu#binary,link-tree/tests/operators/testQuantizedSilu.py:13:0
  %4 : int = prim::Constant[value=1]() # /data/users/kaus/fbsource/fbcode/buck-out/dev/gen/glow/fb/torch_glow/custom_nnpi_ops/testQuantizedSilu#binary,link-tree/tests/operators/testQuantizedSilu.py:13:0
  return (%input)

tnco traced graph:
 graph(%self : __torch__.tests.operators.testQuantizedSilu.___torch_mangle_0.SiLUModel,
      %x : Float(*, *, requires_grad=0, device=cpu)):
  %2 : int = prim::Constant[value=1]() # /data/users/kaus/fbsource/fbcode/buck-out/dev/gen/glow/fb/torch_glow/custom_nnpi_ops/testQuantizedSilu#binary,link-tree/tests/operators/testQuantizedSilu.py:13:0
  %3 : None = prim::Constant()
  %4 : bool = prim::Constant[value=0]() # /data/users/kaus/fbsource/fbcode/buck-out/dev/gen/glow/fb/torch_glow/custom_nnpi_ops/testQuantizedSilu#binary,link-tree/tests/operators/testQuantizedSilu.py:13:0
  %5 : Device = prim::Constant[value="cpu"]() # /data/users/kaus/fbsource/fbcode/buck-out/dev/gen/glow/fb/torch_glow/custom_nnpi_ops/testQuantizedSilu#binary,link-tree/tests/operators/testQuantizedSilu.py:13:0
  %6 : int = prim::Constant[value=0]() # /data/users/kaus/fbsource/fbcode/buck-out/dev/gen/glow/fb/torch_glow/custom_nnpi_ops/testQuantizedSilu#binary,link-tree/tests/operators/testQuantizedSilu.py:13:0
  %7 : int = prim::Constant[value=6]() # /data/users/kaus/fbsource/fbcode/buck-out/dev/gen/glow/fb/torch_glow/custom_nnpi_ops/testQuantizedSilu#binary,link-tree/tests/operators/testQuantizedSilu.py:13:0
  %8 : Float(*, *, requires_grad=0, device=cpu) = aten::zeros_like(%x, %7, %6, %5, %4, %3) # /data/users/kaus/fbsource/fbcode/buck-out/dev/gen/glow/fb/torch_glow/custom_nnpi_ops/testQuantizedSilu#binary,link-tree/tests/operators/testQuantizedSilu.py:13:0
  %12 : Tensor = fakeNNPI::addFP16(%x, %8, %2)
  %11 : Tensor = fakeNNPI::siluFP16(%12)
  return (%11)

FAIL

======================================================================
FAIL: test_quantized_swish_2D (tests.operators.testQuantizedSilu.TestSiLU)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/users/kaus/fbsource/fbcode/buck-out/dev/gen/glow/fb/torch_glow/custom_nnpi_ops/testQuantizedSilu#binary,link-tree/tests/operators/testQuantizedSilu.py", line 26, in test_quantized_swish_2D
    validate_nnpi_model(model, (x,), expected_ops, [])
  File "/data/users/kaus/fbsource/fbcode/buck-out/dev/gen/glow/fb/torch_glow/custom_nnpi_ops/testQuantizedSilu#binary,link-tree/tests/utils.py", line 73, in validate_nnpi_model
    assert is_equal
AssertionError
```

Test Plan:
Run the test with `buck test mode/dev //glow/fb/torch_glow/custom_nnpi_ops:testQuantizedSilu`

Reviewed By: hyuen

Differential Revision: D25981369

fbshipit-source-id: dd0f3686b3cbf6fc575c959c7661125ecbf0b0db
Summary:
Pull Request resolved: #50949

When converting an RPC Message into Python objects, we were not using
a CUDAFuture for the chained Future. As a result, the streams are
not synchronized when calling `rpc_async(...).wait()`. This commit
uses the `Future::then` API to create the chained Future, which
creates a CUDAFuture if the existing Future is a CUDA one.

fixes #50881
fixes #50839
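The bug pattern can be sketched abstractly: if `then` constructs a plain base-class Future instead of asking the existing Future to create one of its own kind, device-specific behavior (here, CUDA stream synchronization) is lost down the chain. A minimal Python analogue (hypothetical classes, not the C++ `Future::then` API):

```python
class Future:
    def _new_instance(self):
        # Subclasses inherit this so chaining preserves their type.
        return type(self)()

    def then(self, fn):
        # Correct: let the existing future create the chained one,
        # so a CUDA future chains into another CUDA future.
        chained = self._new_instance()
        chained.fn = fn
        return chained

class CUDAFuture(Future):
    def wait(self):
        # Would synchronize CUDA streams before returning (sketch only).
        pass

chained = CUDAFuture().then(lambda v: v)
print(type(chained).__name__)  # CUDAFuture
```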

Test Plan: Imported from OSS

Reviewed By: pritamdamania87

Differential Revision: D26020458

Pulled By: mrshenli

fbshipit-source-id: 25195fbc10b99f4c401ec3ed7a382128464b5f08
Summary: Pull Request resolved: #47667

Test Plan: Imported from OSS

Reviewed By: anjali411, ngimel

Differential Revision: D25255572

Pulled By: Krovatkin

fbshipit-source-id: d0152c9ef5b1994e27be9888bcb123dca3ecd88f
…rt (#50793)

Summary:
This contains some improvements and refactoring to how patching is done in `torch.fx.symbolic_trace`.

1) Functions from `math.*` are now supported without needing to call `torch.fx.wrap()`.  `wrap()` actually errors on some of these functions because they are written in C and don't have `__code__`, requiring use of the string version.  `math` usage is relatively common; for example, [BERT uses math.sqrt here](https://github.com/pytorch/benchmark/blob/6f79061bd145eeaa9b4a75847939901fd245ddf9/torchbenchmark/models/BERT_pytorch/bert_pytorch/model/attention/single.py#L16).  Both `math.sqrt()` and `from math import sqrt` (copying to module namespace) are supported.  When modules are called, FX now searches the module's global scope to find methods to patch.

2) [Guarded behind `env FX_PATCH_GETITEM=1`] Fixes a failed trace of [PositionalEmbedding from BERT](https://github.com/pytorch/benchmark/blob/6f79061bd145eeaa9b4a75847939901fd245ddf9/torchbenchmark/models/BERT_pytorch/bert_pytorch/model/embedding/position.py#L24), which failed to trace with the error `TypeError: slice indices must be integers or None or have an __index__ method` (a Proxy() is getting passed into `Tensor.__getitem__`).  See #50710 for why this is disabled by default.

3) Support for automatically wrapping methods that may have been copied to a different module scope via an import like `from foo import wrapped_function`.  This also isn't exposed in `torch.fx.wrap`, but is used to implement `math.*` support.
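The mechanism behind (1) and (3), finding a function in a module's global scope and swapping in a wrapper for the duration of a trace, can be sketched with plain Python (hypothetical names; the real implementation lives in `torch.fx.symbolic_trace`):

```python
import math

def trace_with_patch(module_globals, name, record):
    """Temporarily replace `name` in a module namespace with a recording
    wrapper, mimicking how a tracer patches math.* during a trace."""
    original = module_globals[name]
    def wrapper(*args, **kwargs):
        record.append((name, args))
        return original(*args, **kwargs)
    module_globals[name] = wrapper
    return original

# A "model module" whose namespace did `from math import sqrt`.
ns = {"sqrt": math.sqrt}
calls = []
orig = trace_with_patch(ns, "sqrt", calls)
result = ns["sqrt"](16.0)   # goes through the recording wrapper
ns["sqrt"] = orig           # restore after tracing
print(result, calls)        # 4.0 [('sqrt', (16.0,))]
```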

Pull Request resolved: #50793

Test Plan: Added unittests to check each feature

Reviewed By: jamesr66a

Differential Revision: D25999788

Pulled By: jansel

fbshipit-source-id: f1ce11a69b7d97f26c9e2741c6acf9c513a84467
Summary:
Also, get rid of the MSVC-specific `_USE_MATH_DEFINES`.

Test at compile time that `c10::pi<double> == M_PI`.

Pull Request resolved: #50819

Reviewed By: albanD

Differential Revision: D25976330

Pulled By: malfet

fbshipit-source-id: 8f3ddfd58a5aa4bd382da64ad6ecc679706d1284
Summary:
Pull Request resolved: #50740

This PR adds a `check_batched_grad=True` option to
NewModuleTest-generated NN tests.

Test Plan: - run tests (`pytest test/test_nn.py -v -rf`)

Reviewed By: ejguan

Differential Revision: D25997679

Pulled By: zou3519

fbshipit-source-id: b75e73d7e86fd3af9bad6efed7127b36551587b3
…50916)

Summary:
`philox_engine_inputs()` is deprecated.  Callers should refactor to use `philox_cuda_state()`, and afaik all call sites in aten have already been refactored, but in the meantime on behalf of other consumers (ie extensions, possibly some lingering call sites in jit), `philox_engine_inputs` should handle the increment the same way `philox_cuda_state` does.
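The fix follows a common deprecation-shim pattern: the deprecated entry point delegates to (or at least mirrors) the new one, so internal state advances identically no matter which API a caller uses. A generic Python sketch with hypothetical names (not the actual CUDA generator API):

```python
import warnings

class GeneratorState:
    """Toy RNG state: the deprecated and new APIs must increment identically."""
    def __init__(self):
        self.offset = 0

    def philox_state(self, increment):
        # New-style API: reserve `increment` counters (rounded up to a
        # multiple of 4 here, purely for illustration).
        increment = (increment + 3) // 4 * 4
        seed_offset = self.offset
        self.offset += increment
        return (seed_offset, increment)

    def philox_engine_inputs(self, increment):
        # Deprecated API: delegate so the increment behavior stays in sync.
        warnings.warn("use philox_state() instead", DeprecationWarning)
        return self.philox_state(increment)

g = GeneratorState()
print(g.philox_engine_inputs(10))  # (0, 12)
print(g.philox_state(10))          # (12, 12)
```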

Pull Request resolved: #50916

Reviewed By: mrshenli

Differential Revision: D26022618

Pulled By: ngimel

fbshipit-source-id: 17178ad099ddc17d3596b9508ae4dce729b44f57
…r-friendly wrapper

Test Plan: revert-hammer

Differential Revision:
D25977352 (73dffc8)

Original commit changeset: 4b3a5e8a9071

fbshipit-source-id: a0383ea4158f54be6f128b9ddb2cd12fc3a3ea53
Summary: Pull Request resolved: #50966

Test Plan: Imported from OSS

Reviewed By: suo

Differential Revision: D26029101

Pulled By: jamesr66a

fbshipit-source-id: 4374771be74d0a4d05fdd29107be5357130c2a76
Summary:
Because it is shorter, faster, and does not have the TF32 issue.

Benchmark: https://github.com/zasdfgbnm/things/blob/master/2021Q1/kron.ipynb

Pull Request resolved: #50927

Reviewed By: glaringlee

Differential Revision: D26022385

Pulled By: ngimel

fbshipit-source-id: 513c9e9138c35c70d3a475a8407728af21321dae
…t the extra_files map. (#50932)

Summary:
Pull Request resolved: #50932

After the change to split `_load_for_mobile()` into multiple methods, one which takes in the `extra_files` map, and one which doesn't, we can change the implementation of the `deserialize()` method with different overloads as well. Suggested by raziel on D25968216 (bb909d2).

ghstack-source-id: 120185089

Test Plan: Build/Sandcastle.

Reviewed By: JacobSzwejbka

Differential Revision: D26014084

fbshipit-source-id: 914142137346a6246def1acf38a3204dd4c4f52f
Summary:
Fixes a typo in `ScriptModule`'s docstring and converts it to the raw format (`r"""...`).

Fixes #48634
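The raw-string conversion matters because backslash escape sequences inside a plain docstring are interpreted by Python, which can silently mangle LaTeX-style markup. A small illustration:

```python
def plain():
    """Norm: \neq is rendered with a real newline here."""

def raw():
    r"""Norm: \neq stays literal in a raw docstring."""

print("\n" in plain.__doc__)   # True: \n became a newline character
print("\\neq" in raw.__doc__)  # True: the backslash is preserved
```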

Pull Request resolved: #48608

Reviewed By: anjali411

Differential Revision: D25242022

Pulled By: gmagogsfm

fbshipit-source-id: 5199868af999c6c360c7dd5e2813659f1028acab
Summary: Pull Request resolved: #50859

Test Plan:
Unit test:
```
buck test //caffe2/test:torch
```
Benchmark:
```
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 numactl -m 0 -C 13 \
./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench \
--scripted_model=/home/hlu/ads/adindexer/adindexer_ctr_mobilefeed/pt/merge_v2/traced_precomputation.pt \
--pt_inputs=/home/hlu/ads/adindexer/adindexer_ctr_mobilefeed/pt/merge_v2/container_precomputation_bs20.pt \
--iters=10000 --warmup_iters=10000 --num_threads=1 --pt_enable_static_runtime=true \
--pt_cleanup_activations=true --pt_enable_out_variant=true --do_profile=true
```

Reduces the total time spent on flatten from 1.22% to 0.97% (net 0.25% reduction).
```
Before:

Static runtime ms per iter: 0.0725054. Iters per second: 13792.1
    0.000857179 ms.    1.21862%. aten::flatten (1 nodes)

After:

Static runtime ms per iter: 0.0720371. Iters per second: 13881.7
    0.000686155 ms.    0.97151%. aten::flatten (1 nodes)
```

Reviewed By: ajyu

Differential Revision: D25986759

fbshipit-source-id: dc0f542c56a688d331d349845b78084577970476
)

Summary:
Pull Request resolved: #50851

Improves upon the previous unittest to ensure allreduce_hook results in the same gradients as vanilla allreduce in DDP.

ghstack-source-id: 120229103

Test Plan:
buck build mode/dev-nosan //caffe2/test/distributed:distributed_nccl_fork --keep-going
BACKEND=nccl WORLD_SIZE=2 ~/fbcode/buck-out/dev/gen/caffe2/test/distributed/distributed_nccl_fork#binary.par -r test_ddp_hook_parity

Reviewed By: SciPioneer

Differential Revision: D25963654

fbshipit-source-id: d55eee0aee9cf1da52aa0c4ba1066718aa8fd9a4
Summary:
Pull Request resolved: #50624

Add TorchScript compatible Adam functional optimizer to distributed optimizer

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D25932770

Pulled By: wanchaol

fbshipit-source-id: cab3f1164c76186969c284a2c52481b79bbb7190
Summary: Pull Request resolved: #50618

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D25932778

Pulled By: wanchaol

fbshipit-source-id: 8df3567b477bc5ba3556b8c5294cd3da5db963ad
Summary:
Pull Request resolved: #50623

Add TorchScript compatible Adadelta functional optimizer to distributed optimizer

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D25932772

Pulled By: wanchaol

fbshipit-source-id: d59b04e5f0b6bab7e0d1c5f68e66249a65958e0b
Summary:
Pull Request resolved: #50619

Add TorchScript compatible RMSprop functional optimizer to distributed optimizer

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D25932775

Pulled By: wanchaol

fbshipit-source-id: bd4854f9f95a740e02a1bebe24f780488460ba4d
Summary:
Pull Request resolved: #50620

Add TorchScript compatible AdamW functional optimizer to distributed optimizer

Test Plan: Imported from OSS

Reviewed By: rohan-varma

Differential Revision: D25932774

Pulled By: wanchaol

fbshipit-source-id: 64eb4aeaa3cab208d0ebbec7c4d91a9d43951947
Summary: Pull Request resolved: #50380

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D25949361

Pulled By: anjali411

fbshipit-source-id: 9910bc5b532c9bf3add530221d643b2c41c62d01
Summary:
Pull Request resolved: #50321

The quantization team reported that when two empty tensors are replicated among ranks, the two start to share storage after resizing.

The root cause is that `unflatten_dense_tensors` unflattened each empty tensor as a view of the flat tensor, so it shared storage with the other tensors.

This PR avoids unflattening empty tensors as views of the flat tensor, so that an empty tensor will not share storage with other tensors.
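The underlying hazard can be demonstrated with plain Python `memoryview`s: a zero-length slice of a flat buffer is still a view into that buffer, whereas a freshly allocated empty buffer is independent. This mirrors the fix of giving empty tensors their own storage instead of a view:

```python
flat = memoryview(bytearray(8))

# "Unflatten" by slicing: even a zero-length slice is a view of `flat`.
empty_view = flat[4:4]
print(empty_view.obj is flat.obj)  # True: shares the flat storage

# Fix: give the empty tensor its own (empty) storage instead of a view.
empty_fresh = memoryview(bytearray(0))
print(empty_fresh.obj is flat.obj)  # False: independent storage
```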

Test Plan: unit test

Reviewed By: pritamdamania87

Differential Revision: D25859503

fbshipit-source-id: 5b760b31af6ed2b66bb22954cba8d1514f389cca
@imaginary-person imaginary-person merged commit f71556e into imaginary-person:master Jan 23, 2021
imaginary-person pushed a commit that referenced this pull request May 26, 2021
Summary: added more statistics for static runtime

Test Plan:
caffe2/benchmarks/static_runtime:static_runtime_cpptest

Expected output example:

Static runtime ms per iter: 0.939483. Iters per second: 1064.41
Node #0: 0.195671 ms/iter, %wide_offset.1 : Tensor = aten::add(%wide.1, %self._mu, %4)
Node #1: 0.169457 ms/iter, %wide_normalized.1 : Tensor = aten::mul(%wide_offset.1, %self._sigma)
Node #2: 0.118218 ms/iter, %wide_preproc.1 : Tensor = aten::clamp(%wide_normalized.1, %5, %6)
Node #3: 0.038814 ms/iter, %user_emb_t.1 : Tensor = aten::transpose(%user_emb.1, %4, %7)
Node #4: 0.0860747 ms/iter, %dp_unflatten.1 : Tensor = aten::bmm(%ad_emb_packed.1, %user_emb_t.1)
Node #5: 0.0102666 ms/iter, %31 : Tensor = static_runtime::flatten_copy(%dp_unflatten.1, %4, %8)
Node #6: 0.000476333 ms/iter, %19 : Tensor[] = prim::ListConstruct(%31, %wide_preproc.1)
Node #7: 0.0707332 ms/iter, %input.1 : Tensor = aten::cat(%19, %4)
Node #8: 0.123695 ms/iter, %fc1.1 : Tensor = aten::addmm(%self._fc_b, %input.1, %29, %4, %4)
Node #9: 0.0309244 ms/iter, %23 : Tensor = aten::sigmoid(%fc1.1)
Node #10: 0.0046297 ms/iter, %24 : (Tensor) = prim::TupleConstruct(%23)
Time per node type:
       0.195671 ms.    23.0483%. aten::add (1 nodes)
       0.169457 ms.    19.9605%. aten::mul (1 nodes, out variant)
       0.123695 ms.    14.5702%. aten::addmm (1 nodes, out variant)
       0.118218 ms.     13.925%. aten::clamp (1 nodes, out variant)
      0.0860747 ms.    10.1388%. aten::bmm (1 nodes, out variant)
      0.0707332 ms.    8.33175%. aten::cat (1 nodes, out variant)
       0.038814 ms.    4.57195%. aten::transpose (1 nodes)
      0.0309244 ms.    3.64263%. aten::sigmoid (1 nodes, out variant)
      0.0102666 ms.    1.20932%. static_runtime::flatten_copy (1 nodes, out variant)
      0.0046297 ms.   0.545338%. prim::TupleConstruct (1 nodes, out variant)
    0.000476333 ms.  0.0561079%. prim::ListConstruct (1 nodes, out variant)
       0.848959 ms. in Total
StaticRuntime setup time: 0.018925 ms
Memory allocation time: 0.019808 ms
Memory deallocation time: 0.0120445 ms
Outputs deallocation time: 0.0864947 ms
Total memory managed: 19328 bytes
Total number of reused tensors: 3
Total number of 'out' variant nodes/total number of nodes: 9/11 (81.8182%)

Reviewed By: hlu1

Differential Revision: D28553029

fbshipit-source-id: 55e7eab50b4b475ae219896100bdf4f6678875a4
imaginary-person pushed a commit that referenced this pull request Jul 2, 2021
Summary:
Pull Request resolved: pytorch#60987

We were seeing deadlocks as follows during shutdown:

```
Thread 1 (LWP 2432101):
#0  0x00007efca470190b in __pause_nocancel () from /lib64/libc.so.6
#1  0x00007efca49de485 in __pthread_mutex_lock_full () from /lib64/libpthread.so.0
#2  0x00007ef91d4c42c6 in __cuda_CallJitEntryPoint () from /lib64/libnvidia-ptxjitcompiler.so.1
#3  0x00007efc651ac8f1 in ?? () from /lib64/libcuda.so
#4  0x00007efc651aee03 in ?? () from /lib64/libcuda.so
#5  0x00007efc64f76b84 in ?? () from /lib64/libcuda.so
#6  0x00007efc64f77f5d in ?? () from /lib64/libcuda.so
#7  0x00007efc64eac858 in ?? () from /lib64/libcuda.so
#8  0x00007efc64eacfbc in ?? () from /lib64/libcuda.so
#9  0x00007efc7810a924 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#10 0x00007efc780fa2be in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#11 0x00007efc78111044 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#12 0x00007efc7811580a in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#13 0x00007efc78115aa4 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#14 0x00007efc781079ec in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#15 0x00007efc780e6a7a in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#16 0x00007efc7811cfa5 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#17 0x00007efc777ea98c in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#18 0x00007efc777ebd80 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#19 0x00007efc777ea2c9 in ?? () from /usr/local/cuda/lib64/libcublas.so.11
#20 0x00007efc778c2e2d in cublasDestroy_v2 () from /usr/local/cuda/lib64/libcublas.so.11
#21 0x00007efc51a3fb56 in std::_Sp_counted_ptr_inplace<at::cuda::(anonymous namespace)::DeviceThreadHandlePool<cublasContext*, &at::cuda::(anonymous namespace)::createCublasHandle, &at::cuda::(anonymous namespace)::destroyCublasHandle>, std::allocator<at::cuda::(anonymous namespace)::DeviceThreadHandlePool<cublasContext*, &at::cuda::(anonymous namespace)::createCublasHandle, &at::cuda::(anonymous namespace)::destroyCublasHandle> >, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /data/users/pritam/pytorch/torch/lib/libtorch_cuda.so
#22 0x00007efc51a3fc5f in std::shared_ptr<at::cuda::(anonymous namespace)::DeviceThreadHandlePool<cublasContext*, &at::cuda::(anonymous namespace)::createCublasHandle, &at::cuda::(anonymous namespace)::destroyCublasHandle> >::~shared_ptr() () from /data/users/pritam/pytorch/torch/lib/libtorch_cuda.so
#23 0x00007efca4648b0c in __run_exit_handlers () from /lib64/libc.so.6
#24 0x00007efca4648c40 in exit () from /lib64/libc.so.6
#25 0x0000558c8852e5f9 in Py_Exit (sts=0) at /tmp/build/80754af9/python_1614362349910/work/Python/pylifecycle.c:2292
#26 0x0000558c8852e6a7 in handle_system_exit () at /tmp/build/80754af9/python_1614362349910/work/Python/pythonrun.c:636
#27 0x0000558c8852e742 in PyErr_PrintEx (set_sys_last_vars=<optimized out>, set_sys_last_vars=<optimized out>) at /tmp/build/80754af9/python_1614362349910/work/Python/pythonrun.c:646
#28 0x0000558c88540dd6 in PyRun_SimpleStringFlags (command=0x7efca4dc9050 "from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=9, pipe_handle=13)\n", flags=0x7ffe3a986110) at /tmp/build/80754af9/python_1614362349910/work/Python/pythonrun.c:457
#29 0x0000558c88540ead in pymain_run_command (cf=0x7ffe3a986110, command=<optimized out>) at /tmp/build/80754af9/python_1614362349910/work/Modules/main.c:420
#30 pymain_run_python (pymain=0x7ffe3a986220) at /tmp/build/80754af9/python_1614362349910/work/Modules/main.c:2907
#31 pymain_main (pymain=0x7ffe3a986220) at /tmp/build/80754af9/python_1614362349910/work/Modules/main.c:3460
#32 0x0000558c8854122c in _Py_UnixMain (argc=<optimized out>, argv=<optimized out>) at /tmp/build/80754af9/python_1614362349910/work/Modules/main.c:3495
#33 0x00007efca4632493 in __libc_start_main () from /lib64/libc.so.6
#34 0x0000558c884e5e90 in _start () at ../sysdeps/x86_64/elf/start.S:103
```

This was likely caused by a static singleton that wasn't leaky. Following
the guidance in https://isocpp.org/wiki/faq/ctors#construct-on-first-use-v2,
this change uses a leaky singleton instead.
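The C++ fix is the construct-on-first-use idiom with an intentionally leaked heap allocation, so the singleton's destructor never runs during `exit()` teardown. A rough Python analogue of the shape of the pattern (hypothetical names; the real fix is in C++):

```python
_pool = None

def get_handle_pool():
    """Construct on first use; never torn down, so no destructor can
    deadlock during shutdown (mirrors `new`-and-never-delete in C++)."""
    global _pool
    if _pool is None:
        _pool = {"handles": []}  # stand-in for a cublas handle pool
    return _pool

print(get_handle_pool() is get_handle_pool())  # True: one shared instance
```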
ghstack-source-id: 132847448

Test Plan: Verified locally.

Reviewed By: malfet

Differential Revision: D29468866

fbshipit-source-id: 89250594c5cd2643417b1da584c658b742dc5a5c
imaginary-person pushed a commit that referenced this pull request Jul 20, 2021
Summary:
Pull Request resolved: pytorch#61588

As part of debugging pytorch#60290,
we discovered the following deadlock:

```
Thread 79 (Thread 0x7f52ff7fe700 (LWP 205437)):
#0  pthread_cond_timedwait@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
#1  0x0000564880199152 in PyCOND_TIMEDWAIT (cond=0x564880346080 <gil_cond>, mut=0x564880346100 <gil_mutex>, us=5000) at /home/builder/ktietz/cos6/ci_cos6/python_1622833237666/work/Python/condvar.h:103
#2  take_gil (tstate=0x7f5254005ef0) at /home/builder/ktietz/cos6/ci_cos6/python_1622833237666/work/Python/ceval_gil.h:224
#3  0x0000564880217b62 in PyEval_AcquireThread (tstate=0x7f5254005ef0) at /home/builder/ktietz/cos6/ci_cos6/python_1622833237666/work/Python/ceval.c:278
#4  0x00007f557d54aabd in pybind11::gil_scoped_acquire::gil_scoped_acquire() () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so
#5  0x00007f557da7792f in (anonymous namespace)::concrete_decref_fn(c10::impl::PyInterpreter const*, _object*) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so
#6  0x00007f5560dadba6 in c10::TensorImpl::release_resources() () from /opt/conda/lib/python3.6/site-packages/torch/lib/libc10.so
#7  0x00007f5574c885bc in std::_Sp_counted_ptr_inplace<torch::distributed::autograd::DistAutogradContext, std::allocator<torch::distributed::autograd::DistAutogradContext>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#8  0x00007f5574c815e9 in std::__detail::_Hashtable_alloc<std::allocator<std::__detail::_Hash_node<std::pair<long const, std::shared_ptr<torch::distributed::autograd::DistAutogradContext> >, false> > >::_M_deallocate_node(std::__detail::_Hash_node<std::pair<long const, std::shared_ptr<torch::distributed::autograd::DistAutogradContext> >, false>*) [clone .isra.325] () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#9  0x00007f5574c81bf1 in torch::distributed::autograd::DistAutogradContainer::eraseContextIdAndReset(torch::distributed::autograd::DistAutogradContainer::ContextsShard&, long) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#10 0x00007f5574c86e83 in torch::distributed::autograd::DistAutogradContainer::releaseContextIfPresent(long) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#11 0x00007f5574cc6395 in torch::distributed::rpc::RequestCallbackNoPython::processCleanupAutogradContextReq(torch::distributed::rpc::RpcCommandBase&) const () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#12 0x00007f5574cccf15 in torch::distributed::rpc::RequestCallbackNoPython::processRpc(torch::distributed::rpc::RpcCommandBase&, torch::distributed::rpc::MessageType const&, std::vector<c10::Stream, std::allocator<c10::Stream> >) const () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so

Thread 72 (Thread 0x7f53077fe700 (LWP 205412)):
#0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f55bc62adbd in __GI___pthread_mutex_lock (mutex=0x564884396440) at ../nptl/pthread_mutex_lock.c:80
#2  0x00007f5574c82a2f in torch::distributed::autograd::DistAutogradContainer::retrieveContext(long) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_cpu.so
#3  0x00007f557de9bb2f in pybind11::cpp_function::initialize<torch::distributed::autograd::(anonymous namespace)::dist_autograd_init(_object*, _object*)::{lambda(long)#11}, pybind11::dict, long, pybind11::name, pybind11::scope, pybind11::sibling, char [931], pybind11::arg>(torch::distributed::autograd::(anonymous namespace)::dist_autograd_init(_object*, _object*)::{lambda(long)#11}&&, pybind11::dict (*)(long), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, char const (&) [931], pybind11::arg const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call) () from /opt/conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so

```

Basically, Thread 72 holds the GIL and tries to acquire the lock for
DistAutogradContainer to perform a lookup on a map. On the other hand,
Thread 79 holds the lock on DistAutogradContainer to remove a Tensor, and as
part of the TensorImpl destructor, `concrete_decref_fn` is called, which waits
for the GIL. As a result, we have a deadlock.

To fix this issue, I've ensured we release the GIL when we call `retrieveContext`
and acquire it again later when needed.
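The fix enforces a consistent lock discipline: never hold the GIL while acquiring the container lock. The same shape can be sketched with two ordinary Python locks (hypothetical names; the real code uses pybind11's GIL helpers):

```python
import threading

gil = threading.Lock()              # stand-in for the GIL
container_lock = threading.Lock()   # stand-in for the container's mutex

def retrieve_context_fixed(ctx_id, contexts):
    """Look up a context without holding the 'GIL' across the container lock.

    The caller arrives holding `gil`.  Releasing it first means the
    destructor path (container lock -> GIL) can no longer deadlock with
    this path (GIL -> container lock): we never hold both at once.
    """
    gil.release()                   # like pybind11's gil_scoped_release
    with container_lock:
        ctx = contexts.get(ctx_id)
    gil.acquire()                   # reacquire before touching Python state
    return ctx

gil.acquire()                       # simulate entering from Python code
print(retrieve_context_fixed(7, {7: "ctx-7"}))  # ctx-7
gil.release()
```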
ghstack-source-id: 133493659

Test Plan: waitforbuildbot

Reviewed By: mrshenli

Differential Revision: D29682624

fbshipit-source-id: f68a1fb39040ca0447a26e456a97bce64af6b79c