Memory profiling#37775

Closed
ilia-cher wants to merge 61 commits into gh/ilia-cher/68/base from gh/ilia-cher/68/head

@ilia-cher ilia-cher commented May 4, 2020

Stack from ghstack:

Summary:
Adding memory usage into profiler table output

Test Plan:

BUILD_BINARY=1 USE_BLAS=MKL USE_MKLDNN=1 USE_CUDA=1 python setup.py develop install
$ python benchmarks/profiler_benchmark/resnet_memory_profiler.py
output: https://gist.github.com/ilia-cher/3f37d54c3b2afb24d6776858e6860f69
$ python test/test_autograd.py TestAutograd.test_memory_profiler
Couldn't download test skip set, leaving all tests enabled...
Running CPU test
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Name                         Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CPU Mem          Self CPU Mem     CUDA Mem         Self CUDA Mem    Number of Calls  Input Shapes
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
test_user_scope_alloc        60.58%           105.892us        93.42%           163.285us        163.285us        800 b            0 b              0 b              0 b              1                []
rand                         10.53%           18.405us         32.83%           57.393us         57.393us         800 b            0 b              0 b              0 b              1                []
empty                        1.77%            3.092us          1.77%            3.092us          3.092us          800 b            800 b            0 b              0 b              1                []
uniform_                     19.64%           34.325us         20.54%           35.896us         35.896us         0 b              0 b              0 b              0 b              1                [[10, 10]]
is_complex                   0.90%            1.571us          0.90%            1.571us          1.571us          0 b              0 b              0 b              0 b              1                [[10, 10]]
test_user_scope_dealloc      6.58%            11.508us         6.58%            11.508us         11.508us         -800 b           -800 b           0 b              0 b              1                []
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Self CPU time total: 174.793us

Running CUDA test
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Name                         Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CPU Mem          Self CPU Mem     CUDA Mem         Self CUDA Mem    Number of Calls  Input Shapes
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
test_user_scope_alloc        29.37%           86.836us         93.05%           275.143us        275.143us        0 b              -800 b           1.00 Kb          0 b              1                []
to                           7.42%            21.939us         51.31%           151.703us        151.703us        0 b              0 b              1.00 Kb          0 b              1                [[10, 10]]
empty_strided                6.19%            18.295us         6.19%            18.295us         18.295us         0 b              0 b              1.00 Kb          1.00 Kb          1                []
rand                         4.50%            13.316us         12.38%           36.604us         36.604us         800 b            0 b              0 b              0 b              1                []
empty                        0.83%            2.456us          0.83%            2.456us          2.456us          800 b            800 b            0 b              0 b              1                []
uniform_                     6.44%            19.044us         7.05%            20.832us         20.832us         0 b              0 b              0 b              0 b              1                [[10, 10]]
is_complex                   0.60%            1.788us          0.60%            1.788us          1.788us          0 b              0 b              0 b              0 b              1                [[10, 10]]
copy_                        37.70%           111.469us        37.70%           111.469us        111.469us        0 b              0 b              0 b              0 b              1                [[10, 10], [10, 10]]
test_user_scope_dealloc      6.95%            20.544us         6.95%            20.544us         20.544us         0 b              0 b              -1.00 Kb         -1.00 Kb         1                []
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Self CPU time total: 295.687us

Running MKLDNN test
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Name                         Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CPU Mem          Self CPU Mem     CUDA Mem         Self CUDA Mem    Number of Calls  Input Shapes
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
test_user_scope_alloc        34.23%           43.503us         88.57%           112.550us        112.550us        400 b            -400 b           0 b              0 b              1                []
rand                         8.00%            10.167us         18.34%           23.302us         23.302us         400 b            0 b              0 b              0 b              1                []
empty                        2.22%            2.815us          2.22%            2.815us          2.815us          400 b            400 b            0 b              0 b              1                []
to_mkldnn                    35.16%           44.675us         36.00%           45.745us         45.745us         400 b            400 b            0 b              0 b              1                [[10, 10]]
uniform_                     7.24%            9.198us          8.12%            10.320us         10.320us         0 b              0 b              0 b              0 b              1                [[10, 10]]
is_complex                   0.88%            1.122us          0.88%            1.122us          1.122us          0 b              0 b              0 b              0 b              1                [[10, 10]]
contiguous                   0.84%            1.070us          0.84%            1.070us          1.070us          0 b              0 b              0 b              0 b              1                [[10, 10]]
test_user_scope_dealloc      11.43%           14.525us         11.43%           14.525us         14.525us         -400 b           -400 b           0 b              0 b              1                []
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Self CPU time total: 127.075us

.
----------------------------------------------------------------------
Ran 1 test in 1.571s

OK

Differential Revision: D21384248
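
The "CPU Mem" and "Self CPU Mem" columns in the tables above appear to relate as total = self + sum of the children's totals. The sketch below checks that against the CPU columns of the "Running CUDA test" table. The self values are copied from that table; the call nesting in `children` is an assumption inferred from the row order and timings, since the table does not show it, and `total_mem` is a hypothetical helper rather than the profiler's own code.

```python
# Self CPU Mem (bytes) per event, copied from the "Running CUDA test" table.
self_cpu_mem = {
    "test_user_scope_alloc": -800,
    "rand": 0,
    "empty": 800,
    "uniform_": 0,
    "is_complex": 0,
    "to": 0,
    "empty_strided": 0,
    "copy_": 0,
}

# ASSUMPTION: parent -> children nesting; the table does not show it.
children = {
    "test_user_scope_alloc": ["rand", "to"],
    "rand": ["empty", "uniform_"],
    "uniform_": ["is_complex"],
    "to": ["empty_strided", "copy_"],
}


def total_mem(name):
    """Inclusive memory: the event's own delta plus that of all its descendants."""
    return self_cpu_mem[name] + sum(total_mem(c) for c in children.get(name, []))


totals = {name: total_mem(name) for name in self_cpu_mem}
for name in ("test_user_scope_alloc", "rand", "empty", "to"):
    print(name, totals[name])
```

Under this assumed nesting, the computed totals agree with the "CPU Mem" column for these rows: 0 b for `test_user_scope_alloc`, 800 b for `rand` and `empty`, and 0 b for `to`.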

Summary:
Adding memory usage into profiler table output

Test Plan:
BUILD_BINARY=1 USE_BLAS=MKL USE_MKLDNN=0 USE_CUDA=0 python setup.py develop install --cmake

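```python
import torch
import torchvision.models as models

# Build a ResNet-18 and a batch of five 224x224 RGB inputs.
model = models.resnet18()
inp = torch.randn(5, 3, 224, 224)

# Record per-operator memory usage and input shapes during one forward pass.
with torch.autograd.profiler.profile(profile_memory=True, record_shapes=True) as prof:
    model(inp)

# Show the 15 rows with the highest CPU memory usage, grouped by input shape.
print(prof.key_averages(group_by_input_shape=True).table(sort_by="cpu_memory_usage", row_limit=15))
```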

---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Name                         Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CPU Mem Total    Number of Calls  Input Shapes
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
resize_                      0.37%            577.936us        0.37%            577.936us        9.796us          339.03 Mb        59               [[0]]
empty                        0.69%            1.061ms          0.74%            1.139ms          5.556us          47.42 Mb         205              []
stride                       0.00%            0.853us          0.00%            0.853us          0.853us          19.53 Kb         1                [[5, 1000]]
empty_strided                0.01%            21.393us         0.02%            26.033us         5.207us          252 b            5                []
is_complex                   0.02%            37.425us         0.02%            37.425us         1.291us          208 b            29               [[]]
masked_select                0.04%            55.333us         0.06%            93.616us         46.808us         120 b            2                [[30], [30]]
conv2d                       0.01%            18.009us         9.62%            14.902ms         14.902ms         0 b              1                [[5, 3, 224, 224], [64, 3, 7, 7], [
convolution                  0.01%            12.436us         9.61%            14.884ms         14.884ms         0 b              1                [[5, 3, 224, 224], [64, 3, 7, 7], [
_convolution                 0.03%            52.381us         9.60%            14.871ms         14.871ms         0 b              1                [[5, 3, 224, 224], [64, 3, 7, 7], [
size                         0.00%            5.429us          0.00%            5.429us          0.339us          0 b              16               [[5, 3, 224, 224]]
contiguous                   0.00%            1.934us          0.00%            1.934us          0.967us          0 b              2                [[5, 3, 224, 224]]
_convolution_nogroup         0.02%            27.505us         9.57%            14.814ms         14.814ms         0 b              1                [[5, 3, 224, 224], [64, 3, 7, 7], [
_nnpack_available            0.02%            34.267us         0.02%            34.267us         1.713us          0 b              20               []
thnn_conv2d                  0.01%            13.274us         9.54%            14.771ms         14.771ms         0 b              1                [[5, 3, 224, 224], [64, 3, 7, 7], [
thnn_conv2d_forward          5.98%            9.264ms          19.02%           29.446ms         14.723ms         0 b              2                [[5, 3, 224, 224], [64, 3, 7, 7], [
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Self CPU time total: 154.855ms
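
The memory column above mixes units (b, Kb, Mb). The 19.53 Kb entry with input shape [[5, 1000]] is consistent with 5 * 1000 four-byte floats = 20000 bytes at 1024 bytes per Kb. A minimal sketch of a formatter that reproduces this style follows; the function name and the exact thresholds are assumptions for illustration, not taken from this PR's diff.

```python
def format_memory(nbytes):
    """Render a signed byte count in the style of the table (hypothetical helper)."""
    KB = 1024
    MB = 1024 * KB
    GB = 1024 * MB
    # Compare on the absolute value so that deallocations (negative deltas)
    # pick the same unit as allocations of the same size.
    if abs(nbytes) >= GB:
        return "{:.2f} Gb".format(nbytes / GB)
    if abs(nbytes) >= MB:
        return "{:.2f} Mb".format(nbytes / MB)
    if abs(nbytes) >= KB:
        return "{:.2f} Kb".format(nbytes / KB)
    return "{} b".format(nbytes)


print(format_memory(5 * 1000 * 4))  # 20000 bytes
print(format_memory(800))
print(format_memory(-800))
```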

[ghstack-poisoned]

dr-ci Bot commented May 4, 2020

💊 CI failures summary and remediations

As of commit 448af80 (more details on the Dr. CI page):


  • 2/2 failures possibly* introduced in this PR
    • 2/2 non-CircleCI failure(s)

Extra GitHub checks: 1 failed


ci.pytorch.org: 1 failed



ilia-cher added 2 commits May 4, 2020 12:37
Summary:
Adding memory usage into profiler table output

Test Plan:
BUILD_BINARY=1 USE_BLAS=MKL USE_MKLDNN=0 USE_CUDA=0 python setup.py
develop install --cmake

```
import torch
import torchvision.models as models
model = models.resnet18()
inp = torch.randn(5, 3, 224, 224)

with torch.autograd.profiler.profile(profile_memory=True, record_shapes=True) as prof:
    model(inp)

print(prof.key_averages(group_by_input_shape=True).table(sort_by="cpu_memory_usage", row_limit=15))

---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Name                         Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CPU Mem Total    Number of Calls  Input Shapes
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
resize_                      0.37%            577.936us        0.37%            577.936us        9.796us          339.03 Mb        59               [[0]]
empty                        0.69%            1.061ms          0.74%            1.139ms          5.556us          47.42 Mb         205              []
stride                       0.00%            0.853us          0.00%            0.853us          0.853us          19.53 Kb         1                [[5, 1000]]
empty_strided                0.01%            21.393us         0.02%            26.033us         5.207us          252 b            5                []
is_complex                   0.02%            37.425us         0.02%            37.425us         1.291us          208 b            29               [[]]
masked_select                0.04%            55.333us         0.06%            93.616us         46.808us         120 b            2                [[30], [30]]
conv2d                       0.01%            18.009us         9.62%            14.902ms         14.902ms         0 b              1                [[5, 3, 224, 224], [64, 3, 7, 7], [
convolution                  0.01%            12.436us         9.61%            14.884ms         14.884ms         0 b              1                [[5, 3, 224, 224], [64, 3, 7, 7], [
_convolution                 0.03%            52.381us         9.60%            14.871ms         14.871ms         0 b              1                [[5, 3, 224, 224], [64, 3, 7, 7], [
size                         0.00%            5.429us          0.00%            5.429us          0.339us          0 b              16               [[5, 3, 224, 224]]
contiguous                   0.00%            1.934us          0.00%            1.934us          0.967us          0 b              2                [[5, 3, 224, 224]]
_convolution_nogroup         0.02%            27.505us         9.57%            14.814ms         14.814ms         0 b              1                [[5, 3, 224, 224], [64, 3, 7, 7], [
_nnpack_available            0.02%            34.267us         0.02%            34.267us         1.713us          0 b              20               []
thnn_conv2d                  0.01%            13.274us         9.54%            14.771ms         14.771ms         0 b              1                [[5, 3, 224, 224], [64, 3, 7, 7], [
thnn_conv2d_forward          5.98%            9.264ms          19.02%           29.446ms         14.723ms         0 b              2                [[5, 3, 224, 224], [64, 3, 7, 7], [
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Self CPU time total: 154.855ms
```

Differential Revision: [D21384248](https://our.internmc.facebook.com/intern/diff/D21384248)

[ghstack-poisoned]

@dzhulgakov dzhulgakov left a comment

What about testing? Can you write unit tests, including a case that enables the profiler in the middle of the allocation?

I'd also double- and triple-check that memory allocation tracking works well with non-standard allocators, e.g. MKLDNN allocations (with to_mkldnn) or some of the internal ones (huge pages).

Comment thread c10/core/Allocator.cpp Outdated
Comment thread c10/core/Allocator.h Outdated
Comment thread c10/core/CPUAllocator.cpp Outdated
Comment thread c10/core/CPUAllocator.cpp Outdated
Comment thread c10/core/CPUAllocator.cpp
size_table_.erase(it);
void ProfiledCPUMemoryReporter::Delete(void* ptr) {
if (memoryProfilingEnabled()) {
std::lock_guard<std::mutex> guard(mutex_);

That would make execution much slower, but I guess that's OK for memory profiling. We should just make sure not to mix the two.
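
To make the trade-off concrete, here is a toy Python analogue of a size-table-based reporter guarded by a lock. Only the ideas visible in the diff excerpt (a size table, a mutex, an "is profiling enabled" check in `Delete`) come from the source. The class, its method names, and the choice to skip frees of pointers that were allocated before profiling was enabled are assumptions for illustration, not the PR's implementation.

```python
import threading


class ProfiledMemoryReporter:
    """Hypothetical sketch: tracks allocation sizes by address while enabled."""

    def __init__(self):
        self._lock = threading.Lock()  # plays the role of mutex_
        self._size_table = {}          # address -> size in bytes
        self.enabled = False           # plays the role of memoryProfilingEnabled()
        self.events = []               # signed byte deltas reported so far

    def new(self, ptr, nbytes):
        if not self.enabled:
            return
        # Every tracked allocation takes the lock, which is where the
        # slowdown mentioned in the review comment would come from.
        with self._lock:
            self._size_table[ptr] = nbytes
            self.events.append(nbytes)

    def delete(self, ptr):
        if not self.enabled:
            return
        with self._lock:
            nbytes = self._size_table.pop(ptr, None)
            if nbytes is None:
                # ASSUMPTION: the pointer was allocated before profiling was
                # enabled, so its size is unknown and nothing is reported.
                return
            self.events.append(-nbytes)


reporter = ProfiledMemoryReporter()
reporter.new(0x1000, 400)   # profiling is off: not recorded
reporter.enabled = True
reporter.new(0x2000, 800)   # recorded as +800
reporter.delete(0x2000)     # recorded as -800
reporter.delete(0x1000)     # unknown pointer: skipped
print(reporter.events)
```

The last `delete` call mimics the "profiler enabled in the middle of the allocation" case raised in the review: the free arrives for an address the table never saw.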

ilia-cher added 16 commits May 4, 2020 23:31
Summary:
Adding memory usage into profiler table output

Test Plan:
BUILD_BINARY=1 USE_BLAS=MKL USE_MKLDNN=0 USE_CUDA=0 python setup.py
develop install --cmake

```
import torch
import torchvision.models as models
model = models.resnet18()
inp = torch.randn(5, 3, 224, 224)

with torch.autograd.profiler.profile(profile_memory=True, record_shapes=True) as prof:
    model(inp)

print(prof.key_averages(group_by_input_shape=True).table(sort_by="cpu_memory_usage", row_limit=15))

---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Name                         Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CPU Mem Total    Number of Calls  Input Shapes
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
resize_                      0.37%            577.936us        0.37%            577.936us        9.796us          339.03 Mb        59               [[0]]
empty                        0.69%            1.061ms          0.74%            1.139ms          5.556us          47.42 Mb         205              []
stride                       0.00%            0.853us          0.00%            0.853us          0.853us          19.53 Kb         1                [[5, 1000]]
empty_strided                0.01%            21.393us         0.02%            26.033us         5.207us          252 b            5                []
is_complex                   0.02%            37.425us         0.02%            37.425us         1.291us          208 b            29               [[]]
masked_select                0.04%            55.333us         0.06%            93.616us         46.808us         120 b            2                [[30], [30]]
conv2d                       0.01%            18.009us         9.62%            14.902ms         14.902ms         0 b              1                [[5, 3, 224, 224], [64, 3, 7, 7], [
convolution                  0.01%            12.436us         9.61%            14.884ms         14.884ms         0 b              1                [[5, 3, 224, 224], [64, 3, 7, 7], [
_convolution                 0.03%            52.381us         9.60%            14.871ms         14.871ms         0 b              1                [[5, 3, 224, 224], [64, 3, 7, 7], [
size                         0.00%            5.429us          0.00%            5.429us          0.339us          0 b              16               [[5, 3, 224, 224]]
contiguous                   0.00%            1.934us          0.00%            1.934us          0.967us          0 b              2                [[5, 3, 224, 224]]
_convolution_nogroup         0.02%            27.505us         9.57%            14.814ms         14.814ms         0 b              1                [[5, 3, 224, 224], [64, 3, 7, 7], [
_nnpack_available            0.02%            34.267us         0.02%            34.267us         1.713us          0 b              20               []
thnn_conv2d                  0.01%            13.274us         9.54%            14.771ms         14.771ms         0 b              1                [[5, 3, 224, 224], [64, 3, 7, 7], [
thnn_conv2d_forward          5.98%            9.264ms          19.02%           29.446ms         14.723ms         0 b              2                [[5, 3, 224, 224], [64, 3, 7, 7], [
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Self CPU time total: 154.855ms
```

Differential Revision: [D21384248](https://our.internmc.facebook.com/intern/diff/D21384248)

[ghstack-poisoned]
Summary:
Adding memory usage into profiler table output

Test Plan:
BUILD_BINARY=1 USE_BLAS=MKL USE_MKLDNN=0 USE_CUDA=0 python setup.py develop install --cmake

```
import torch
import torchvision.models as models
model = models.resnet18()
inp = torch.randn(5, 3, 224, 224)

with torch.autograd.profiler.profile(profile_memory=True, record_shapes=True) as prof:
    model(inp)

print(prof.key_averages(group_by_input_shape=True).table(sort_by="cpu_memory_usage", row_limit=15))

---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Name                         Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CPU Mem Total    Number of Calls  Input Shapes
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
resize_                      0.37%            577.936us        0.37%            577.936us        9.796us          339.03 Mb        59               [[0]]
empty                        0.69%            1.061ms          0.74%            1.139ms          5.556us          47.42 Mb         205              []
stride                       0.00%            0.853us          0.00%            0.853us          0.853us          19.53 Kb         1                [[5, 1000]]
empty_strided                0.01%            21.393us         0.02%            26.033us         5.207us          252 b            5                []
is_complex                   0.02%            37.425us         0.02%            37.425us         1.291us          208 b            29               [[]]
masked_select                0.04%            55.333us         0.06%            93.616us         46.808us         120 b            2                [[30], [30]]
conv2d                       0.01%            18.009us         9.62%            14.902ms         14.902ms         0 b              1                [[5, 3, 224, 224], [64, 3, 7, 7], [
convolution                  0.01%            12.436us         9.61%            14.884ms         14.884ms         0 b              1                [[5, 3, 224, 224], [64, 3, 7, 7], [
_convolution                 0.03%            52.381us         9.60%            14.871ms         14.871ms         0 b              1                [[5, 3, 224, 224], [64, 3, 7, 7], [
size                         0.00%            5.429us          0.00%            5.429us          0.339us          0 b              16               [[5, 3, 224, 224]]
contiguous                   0.00%            1.934us          0.00%            1.934us          0.967us          0 b              2                [[5, 3, 224, 224]]
_convolution_nogroup         0.02%            27.505us         9.57%            14.814ms         14.814ms         0 b              1                [[5, 3, 224, 224], [64, 3, 7, 7], [
_nnpack_available            0.02%            34.267us         0.02%            34.267us         1.713us          0 b              20               []
thnn_conv2d                  0.01%            13.274us         9.54%            14.771ms         14.771ms         0 b              1                [[5, 3, 224, 224], [64, 3, 7, 7], [
thnn_conv2d_forward          5.98%            9.264ms          19.02%           29.446ms         14.723ms         0 b              2                [[5, 3, 224, 224], [64, 3, 7, 7], [
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Self CPU time total: 154.855ms
```
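
The `CPU Mem Total` column mixes units (`b`, `Kb`, `Mb`), which makes rows awkward to compare programmatically. Below is a minimal sketch for converting those cells back to bytes. It assumes binary multiples (1 Kb = 1024 bytes), which is only inferred from the sample above: the 19.53 Kb reported for input shape `[5, 1000]` is consistent with 20000 bytes (5 × 1000 four-byte floats). `parse_mem` is a hypothetical helper for illustration, not part of the profiler API.

```python
import re

# Assumption: units are binary multiples (1 Kb = 1024 bytes),
# inferred from the sample table rather than from the profiler source.
_UNIT_BYTES = {"b": 1, "Kb": 1024, "Mb": 1024 ** 2, "Gb": 1024 ** 3}


def parse_mem(cell):
    """Convert a memory cell such as '19.53 Kb' to an approximate byte count."""
    match = re.fullmatch(r"\s*(-?\d+(?:\.\d+)?)\s*(b|Kb|Mb|Gb)\s*", cell)
    if match is None:
        raise ValueError(f"unrecognized memory cell: {cell!r}")
    value, unit = match.groups()
    # The table rounds to two decimals, so the result is approximate.
    return round(float(value) * _UNIT_BYTES[unit])


# Cells copied from the table above.
for cell in ("339.03 Mb", "47.42 Mb", "19.53 Kb", "252 b"):
    print(cell, "->", parse_mem(cell), "bytes")
```

Because the table prints only two decimal places, the recovered byte counts are approximate for anything larger than a few kilobytes.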

Differential Revision: [D21384248](https://our.internmc.facebook.com/intern/diff/D21384248)

[ghstack-poisoned]
Summary:
Adding memory usage into profiler table output

Test Plan:
BUILD_BINARY=1 USE_BLAS=MKL USE_MKLDNN=0 USE_CUDA=0 python setup.py develop install --cmake

```
import torch
import torchvision.models as models
model = models.resnet18()
inp = torch.randn(5, 3, 224, 224)

with torch.autograd.profiler.profile(profile_memory=True, record_shapes=True) as prof:
    model(inp)

print(prof.key_averages(group_by_input_shape=True).table(sort_by="cpu_memory_usage", row_limit=15))

---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Name                         Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CPU Mem Total    Number of Calls  Input Shapes
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
resize_                      0.37%            577.936us        0.37%            577.936us        9.796us          339.03 Mb        59               [[0]]
empty                        0.69%            1.061ms          0.74%            1.139ms          5.556us          47.42 Mb         205              []
stride                       0.00%            0.853us          0.00%            0.853us          0.853us          19.53 Kb         1                [[5, 1000]]
empty_strided                0.01%            21.393us         0.02%            26.033us         5.207us          252 b            5                []
is_complex                   0.02%            37.425us         0.02%            37.425us         1.291us          208 b            29               [[]]
masked_select                0.04%            55.333us         0.06%            93.616us         46.808us         120 b            2                [[30], [30]]
conv2d                       0.01%            18.009us         9.62%            14.902ms         14.902ms         0 b              1                [[5, 3, 224, 224], [64, 3, 7, 7], [
convolution                  0.01%            12.436us         9.61%            14.884ms         14.884ms         0 b              1                [[5, 3, 224, 224], [64, 3, 7, 7], [
_convolution                 0.03%            52.381us         9.60%            14.871ms         14.871ms         0 b              1                [[5, 3, 224, 224], [64, 3, 7, 7], [
size                         0.00%            5.429us          0.00%            5.429us          0.339us          0 b              16               [[5, 3, 224, 224]]
contiguous                   0.00%            1.934us          0.00%            1.934us          0.967us          0 b              2                [[5, 3, 224, 224]]
_convolution_nogroup         0.02%            27.505us         9.57%            14.814ms         14.814ms         0 b              1                [[5, 3, 224, 224], [64, 3, 7, 7], [
_nnpack_available            0.02%            34.267us         0.02%            34.267us         1.713us          0 b              20               []
thnn_conv2d                  0.01%            13.274us         9.54%            14.771ms         14.771ms         0 b              1                [[5, 3, 224, 224], [64, 3, 7, 7], [
thnn_conv2d_forward          5.98%            9.264ms          19.02%           29.446ms         14.723ms         0 b              2                [[5, 3, 224, 224], [64, 3, 7, 7], [
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Self CPU time total: 154.855ms
```
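
For reading the `CPU Mem Total` column: the values look like signed byte counts rendered with 1024-based units (`b`, `Kb`, `Mb`). For example, the `stride` row's `19.53 Kb` is consistent with 20000 bytes, which is what a `[5, 1000]` tensor would occupy if its elements were 4 bytes each (the element size is an assumption, not something the output states). Below is a minimal sketch of such a formatter. `format_memory` is a hypothetical helper written for illustration and is not taken from this PR's diff.

```python
def format_memory(nbytes: int) -> str:
    """Render a signed byte count in the style of the profiler table.

    Hypothetical helper: units are assumed to be 1024-based, inferred
    from the numbers in the table rather than from the PR's code.
    """
    KB = 1024
    MB = 1024 * KB
    GB = 1024 * MB
    magnitude = abs(nbytes)  # deallocations show up as negative counts
    if magnitude >= GB:
        return f"{nbytes / GB:.2f} Gb"
    if magnitude >= MB:
        return f"{nbytes / MB:.2f} Mb"
    if magnitude >= KB:
        return f"{nbytes / KB:.2f} Kb"
    return f"{nbytes} b"


if __name__ == "__main__":
    # 5 * 1000 elements at an assumed 4 bytes per element
    for n in (800, -800, 1024, 5 * 1000 * 4):
        print(n, "->", format_memory(n))
```

Comparing on the absolute value keeps negative (freed) amounts in the same unit as the matching allocation.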

Differential Revision: [D21384248](https://our.internmc.facebook.com/intern/diff/D21384248)

[ghstack-poisoned]
ilia-cher added 4 commits May 13, 2020 19:17
Summary:
Adding memory usage into profiler table output

Test Plan:
```
BUILD_BINARY=1 USE_BLAS=MKL USE_MKLDNN=1 USE_CUDA=1 python setup.py develop install
```

```
$ python benchmarks/profiler_benchmark/resnet_memory_profiler.py
output: https://gist.github.com/ilia-cher/3f37d54c3b2afb24d6776858e6860f69
```

```
$ python test/test_autograd.py TestAutograd.test_memory_profiler
Couldn't download test skip set, leaving all tests enabled...
Running CPU test
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Name                         Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CPU Mem          Self CPU Mem     CUDA Mem         Self CUDA Mem    Number of Calls  Input Shapes
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
test_user_scope_alloc        60.58%           105.892us        93.42%           163.285us        163.285us        800 b            0 b              0 b              0 b              1                []
rand                         10.53%           18.405us         32.83%           57.393us         57.393us         800 b            0 b              0 b              0 b              1                []
empty                        1.77%            3.092us          1.77%            3.092us          3.092us          800 b            800 b            0 b              0 b              1                []
uniform_                     19.64%           34.325us         20.54%           35.896us         35.896us         0 b              0 b              0 b              0 b              1                [[10, 10]]
is_complex                   0.90%            1.571us          0.90%            1.571us          1.571us          0 b              0 b              0 b              0 b              1                [[10, 10]]
test_user_scope_dealloc      6.58%            11.508us         6.58%            11.508us         11.508us         -800 b           -800 b           0 b              0 b              1                []
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Self CPU time total: 174.793us

Running CUDA test
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Name                         Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CPU Mem          Self CPU Mem     CUDA Mem         Self CUDA Mem    Number of Calls  Input Shapes
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
test_user_scope_alloc        29.37%           86.836us         93.05%           275.143us        275.143us        0 b              -800 b           1.00 Kb          0 b              1                []
to                           7.42%            21.939us         51.31%           151.703us        151.703us        0 b              0 b              1.00 Kb          0 b              1                [[10, 10]]
empty_strided                6.19%            18.295us         6.19%            18.295us         18.295us         0 b              0 b              1.00 Kb          1.00 Kb          1                []
rand                         4.50%            13.316us         12.38%           36.604us         36.604us         800 b            0 b              0 b              0 b              1                []
empty                        0.83%            2.456us          0.83%            2.456us          2.456us          800 b            800 b            0 b              0 b              1                []
uniform_                     6.44%            19.044us         7.05%            20.832us         20.832us         0 b              0 b              0 b              0 b              1                [[10, 10]]
is_complex                   0.60%            1.788us          0.60%            1.788us          1.788us          0 b              0 b              0 b              0 b              1                [[10, 10]]
copy_                        37.70%           111.469us        37.70%           111.469us        111.469us        0 b              0 b              0 b              0 b              1                [[10, 10], [10, 10]]
test_user_scope_dealloc      6.95%            20.544us         6.95%            20.544us         20.544us         0 b              0 b              -1.00 Kb         -1.00 Kb         1                []
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Self CPU time total: 295.687us

Running MKLDNN test
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Name                         Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CPU Mem          Self CPU Mem     CUDA Mem         Self CUDA Mem    Number of Calls  Input Shapes
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
test_user_scope_alloc        34.23%           43.503us         88.57%           112.550us        112.550us        400 b            -400 b           0 b              0 b              1                []
rand                         8.00%            10.167us         18.34%           23.302us         23.302us         400 b            0 b              0 b              0 b              1                []
empty                        2.22%            2.815us          2.22%            2.815us          2.815us          400 b            400 b            0 b              0 b              1                []
to_mkldnn                    35.16%           44.675us         36.00%           45.745us         45.745us         400 b            400 b            0 b              0 b              1                [[10, 10]]
uniform_                     7.24%            9.198us          8.12%            10.320us         10.320us         0 b              0 b              0 b              0 b              1                [[10, 10]]
is_complex                   0.88%            1.122us          0.88%            1.122us          1.122us          0 b              0 b              0 b              0 b              1                [[10, 10]]
contiguous                   0.84%            1.070us          0.84%            1.070us          1.070us          0 b              0 b              0 b              0 b              1                [[10, 10]]
test_user_scope_dealloc      11.43%           14.525us         11.43%           14.525us         14.525us         -400 b           -400 b           0 b              0 b              1                []
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Self CPU time total: 127.075us

.
----------------------------------------------------------------------
Ran 1 test in 1.571s

OK
```
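
A note on reading the two memory columns in the output above: the numbers are consistent with `Self CPU Mem` counting only what a scope allocates or frees directly, and `CPU Mem` adding the totals of its nested ops. The sketch below checks that arithmetic against the MKLDNN table. The `Event` class is hypothetical, and the nesting is an assumption inferred from the timing columns (for example, `to_mkldnn`'s 45.745us total equals its 44.675us self time plus `contiguous`'s 1.070us); it is not read from the profiler itself.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Event:
    """Hypothetical stand-in for one profiler row."""
    name: str
    self_mem: int  # bytes allocated (+) or freed (-) directly in this scope
    children: List["Event"] = field(default_factory=list)

    def total_mem(self) -> int:
        # Inclusive total: own bytes plus the inclusive totals of nested ops.
        return self.self_mem + sum(c.total_mem() for c in self.children)


# Self values copied from the "Running MKLDNN test" table; nesting is assumed.
mkldnn_scope = Event("test_user_scope_alloc", -400, [
    Event("rand", 0, [
        Event("empty", 400),
        Event("uniform_", 0, [Event("is_complex", 0)]),
    ]),
    Event("to_mkldnn", 400, [Event("contiguous", 0)]),
])


def walk(event: Event, depth: int = 0) -> Iterator[Tuple[int, Event]]:
    """Yield (depth, event) pairs in pre-order."""
    yield depth, event
    for child in event.children:
        yield from walk(child, depth + 1)


if __name__ == "__main__":
    for depth, ev in walk(mkldnn_scope):
        indent = "  " * depth
        print(f"{indent}{ev.name}: self={ev.self_mem} b, total={ev.total_mem()} b")
```

Under these assumptions the outer scope comes out at -400 + 400 + 400 = 400 bytes, matching the `400 b` / `-400 b` pair reported for `test_user_scope_alloc`.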

Differential Revision: [D21384248](https://our.internmc.facebook.com/intern/diff/D21384248)

[ghstack-poisoned]
Summary:
Adding memory usage into profiler table output

Test Plan:
```
BUILD_BINARY=1 USE_BLAS=MKL USE_MKLDNN=1 USE_CUDA=1 python setup.py develop install
```

```
$ python benchmarks/profiler_benchmark/resnet_memory_profiler.py
output: https://gist.github.com/ilia-cher/3f37d54c3b2afb24d6776858e6860f69
```

```
$ python test/test_autograd.py TestAutograd.test_memory_profiler
Couldn't download test skip set, leaving all tests enabled...
Running CPU test
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Name                         Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CPU Mem          Self CPU Mem     CUDA Mem         Self CUDA Mem    Number of Calls  Input Shapes
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
test_user_scope_alloc        60.58%           105.892us        93.42%           163.285us        163.285us        800 b            0 b              0 b              0 b              1                []
rand                         10.53%           18.405us         32.83%           57.393us         57.393us         800 b            0 b              0 b              0 b              1                []
empty                        1.77%            3.092us          1.77%            3.092us          3.092us          800 b            800 b            0 b              0 b              1                []
uniform_                     19.64%           34.325us         20.54%           35.896us         35.896us         0 b              0 b              0 b              0 b              1                [[10, 10]]
is_complex                   0.90%            1.571us          0.90%            1.571us          1.571us          0 b              0 b              0 b              0 b              1                [[10, 10]]
test_user_scope_dealloc      6.58%            11.508us         6.58%            11.508us         11.508us         -800 b           -800 b           0 b              0 b              1                []
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Self CPU time total: 174.793us

Running CUDA test
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Name                         Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CPU Mem          Self CPU Mem     CUDA Mem         Self CUDA Mem    Number of Calls  Input Shapes
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
test_user_scope_alloc        29.37%           86.836us         93.05%           275.143us        275.143us        0 b              -800 b           1.00 Kb          0 b              1                []
to                           7.42%            21.939us         51.31%           151.703us        151.703us        0 b              0 b              1.00 Kb          0 b              1                [[10, 10]]
empty_strided                6.19%            18.295us         6.19%            18.295us         18.295us         0 b              0 b              1.00 Kb          1.00 Kb          1                []
rand                         4.50%            13.316us         12.38%           36.604us         36.604us         800 b            0 b              0 b              0 b              1                []
empty                        0.83%            2.456us          0.83%            2.456us          2.456us          800 b            800 b            0 b              0 b              1                []
uniform_                     6.44%            19.044us         7.05%            20.832us         20.832us         0 b              0 b              0 b              0 b              1                [[10, 10]]
is_complex                   0.60%            1.788us          0.60%            1.788us          1.788us          0 b              0 b              0 b              0 b              1                [[10, 10]]
copy_                        37.70%           111.469us        37.70%           111.469us        111.469us        0 b              0 b              0 b              0 b              1                [[10, 10], [10, 10]]
test_user_scope_dealloc      6.95%            20.544us         6.95%            20.544us         20.544us         0 b              0 b              -1.00 Kb         -1.00 Kb         1                []
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Self CPU time total: 295.687us

Running MKLDNN test
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Name                         Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CPU Mem          Self CPU Mem     CUDA Mem         Self CUDA Mem    Number of Calls  Input Shapes
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
test_user_scope_alloc        34.23%           43.503us         88.57%           112.550us        112.550us        400 b            -400 b           0 b              0 b              1                []
rand                         8.00%            10.167us         18.34%           23.302us         23.302us         400 b            0 b              0 b              0 b              1                []
empty                        2.22%            2.815us          2.22%            2.815us          2.815us          400 b            400 b            0 b              0 b              1                []
to_mkldnn                    35.16%           44.675us         36.00%           45.745us         45.745us         400 b            400 b            0 b              0 b              1                [[10, 10]]
uniform_                     7.24%            9.198us          8.12%            10.320us         10.320us         0 b              0 b              0 b              0 b              1                [[10, 10]]
is_complex                   0.88%            1.122us          0.88%            1.122us          1.122us          0 b              0 b              0 b              0 b              1                [[10, 10]]
contiguous                   0.84%            1.070us          0.84%            1.070us          1.070us          0 b              0 b              0 b              0 b              1                [[10, 10]]
test_user_scope_dealloc      11.43%           14.525us         11.43%           14.525us         14.525us         -400 b           -400 b           0 b              0 b              1                []
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Self CPU time total: 127.075us

.
----------------------------------------------------------------------
Ran 1 test in 1.571s

OK
```
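
One way to read the memory columns in the output above: `Self CPU Mem` looks like the bytes allocated or freed directly inside an event, and `CPU Mem` looks like that figure plus everything its nested events did. The sketch below checks this reading against the MKLDNN table. The call tree in `CHILDREN` is an assumption inferred from how the CPU times add up, not something the output states, and `inclusive_mem` is a hypothetical helper, not part of the profiler API.

```python
# Self (exclusive) CPU byte deltas, copied from the MKLDNN table above.
SELF_CPU_MEM = {
    "test_user_scope_alloc": -400,
    "rand": 0,
    "empty": 400,
    "to_mkldnn": 400,
    "uniform_": 0,
    "is_complex": 0,
    "contiguous": 0,
}

# Inclusive "CPU Mem" column from the same table, used only for comparison.
REPORTED_CPU_MEM = {
    "test_user_scope_alloc": 400,
    "rand": 400,
    "empty": 400,
    "to_mkldnn": 400,
    "uniform_": 0,
    "is_complex": 0,
    "contiguous": 0,
}

# ASSUMPTION: nesting of events, inferred from the timing columns.
CHILDREN = {
    "test_user_scope_alloc": ["rand", "to_mkldnn"],
    "rand": ["empty", "uniform_"],
    "uniform_": ["is_complex"],
    "to_mkldnn": ["contiguous"],
}


def inclusive_mem(name):
    """Return the event's own byte delta plus that of all nested events."""
    nested = sum(inclusive_mem(child) for child in CHILDREN.get(name, []))
    return SELF_CPU_MEM[name] + nested


# Events whose recomputed inclusive value differs from the reported column.
mismatches = {
    name: (inclusive_mem(name), reported)
    for name, reported in REPORTED_CPU_MEM.items()
    if inclusive_mem(name) != reported
}
```

Under that assumed nesting, the `-400 b` self value on `test_user_scope_alloc` is consistent with the scope itself releasing one 400-byte buffer while its nested `rand` and `to_mkldnn` calls each allocate one.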

Differential Revision: [D21384248](https://our.internmc.facebook.com/intern/diff/D21384248)

[ghstack-poisoned]
@ilia-cher ilia-cher requested a review from dzhulgakov May 14, 2020 03:40
ilia-cher added 3 commits May 14, 2020 02:49
Summary:
Adding memory usage into profiler table output

Test Plan:
```
BUILD_BINARY=1 USE_BLAS=MKL USE_MKLDNN=1 USE_CUDA=1 python setup.py develop install
```

```
$ python benchmarks/profiler_benchmark/resnet_memory_profiler.py
output: https://gist.github.com/ilia-cher/3f37d54c3b2afb24d6776858e6860f69
```

```
$ python test/test_autograd.py TestAutograd.test_memory_profiler
Couldn't download test skip set, leaving all tests enabled...
Running CPU test
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Name                         Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CPU Mem          Self CPU Mem     CUDA Mem         Self CUDA Mem    Number of Calls  Input Shapes
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
test_user_scope_alloc        60.58%           105.892us        93.42%           163.285us        163.285us        800 b            0 b              0 b              0 b              1                []
rand                         10.53%           18.405us         32.83%           57.393us         57.393us         800 b            0 b              0 b              0 b              1                []
empty                        1.77%            3.092us          1.77%            3.092us          3.092us          800 b            800 b            0 b              0 b              1                []
uniform_                     19.64%           34.325us         20.54%           35.896us         35.896us         0 b              0 b              0 b              0 b              1                [[10, 10]]
is_complex                   0.90%            1.571us          0.90%            1.571us          1.571us          0 b              0 b              0 b              0 b              1                [[10, 10]]
test_user_scope_dealloc      6.58%            11.508us         6.58%            11.508us         11.508us         -800 b           -800 b           0 b              0 b              1                []
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Self CPU time total: 174.793us

Running CUDA test
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Name                         Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CPU Mem          Self CPU Mem     CUDA Mem         Self CUDA Mem    Number of Calls  Input Shapes
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
test_user_scope_alloc        29.37%           86.836us         93.05%           275.143us        275.143us        0 b              -800 b           1.00 Kb          0 b              1                []
to                           7.42%            21.939us         51.31%           151.703us        151.703us        0 b              0 b              1.00 Kb          0 b              1                [[10, 10]]
empty_strided                6.19%            18.295us         6.19%            18.295us         18.295us         0 b              0 b              1.00 Kb          1.00 Kb          1                []
rand                         4.50%            13.316us         12.38%           36.604us         36.604us         800 b            0 b              0 b              0 b              1                []
empty                        0.83%            2.456us          0.83%            2.456us          2.456us          800 b            800 b            0 b              0 b              1                []
uniform_                     6.44%            19.044us         7.05%            20.832us         20.832us         0 b              0 b              0 b              0 b              1                [[10, 10]]
is_complex                   0.60%            1.788us          0.60%            1.788us          1.788us          0 b              0 b              0 b              0 b              1                [[10, 10]]
copy_                        37.70%           111.469us        37.70%           111.469us        111.469us        0 b              0 b              0 b              0 b              1                [[10, 10], [10, 10]]
test_user_scope_dealloc      6.95%            20.544us         6.95%            20.544us         20.544us         0 b              0 b              -1.00 Kb         -1.00 Kb         1                []
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Self CPU time total: 295.687us

Running MKLDNN test
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Name                         Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CPU Mem          Self CPU Mem     CUDA Mem         Self CUDA Mem    Number of Calls  Input Shapes
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
test_user_scope_alloc        34.23%           43.503us         88.57%           112.550us        112.550us        400 b            -400 b           0 b              0 b              1                []
rand                         8.00%            10.167us         18.34%           23.302us         23.302us         400 b            0 b              0 b              0 b              1                []
empty                        2.22%            2.815us          2.22%            2.815us          2.815us          400 b            400 b            0 b              0 b              1                []
to_mkldnn                    35.16%           44.675us         36.00%           45.745us         45.745us         400 b            400 b            0 b              0 b              1                [[10, 10]]
uniform_                     7.24%            9.198us          8.12%            10.320us         10.320us         0 b              0 b              0 b              0 b              1                [[10, 10]]
is_complex                   0.88%            1.122us          0.88%            1.122us          1.122us          0 b              0 b              0 b              0 b              1                [[10, 10]]
contiguous                   0.84%            1.070us          0.84%            1.070us          1.070us          0 b              0 b              0 b              0 b              1                [[10, 10]]
test_user_scope_dealloc      11.43%           14.525us         11.43%           14.525us         14.525us         -400 b           -400 b           0 b              0 b              1                []
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Self CPU time total: 127.075us

.
----------------------------------------------------------------------
Ran 1 test in 1.571s

OK
```
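One way to read the memory columns above: the rows are consistent with `CPU Mem` being an op's `Self CPU Mem` plus the `CPU Mem` of the ops it calls. The sketch below checks that arithmetic against the CUDA test rows. The `OpNode` class is hypothetical and not part of the profiler, and the call tree is an assumption inferred from the timing columns (for example, `to` total = `to` self + `empty_strided` + `copy_`), not something the profiler output states directly.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class OpNode:
    """Hypothetical stand-in for one profiler row; not a PyTorch type."""
    name: str
    self_cpu_mem: int  # bytes allocated (+) or freed (-) directly inside this op
    children: List["OpNode"] = field(default_factory=list)

    def cpu_mem(self) -> int:
        # Inclusive memory: this op's own delta plus everything its callees report.
        return self.self_cpu_mem + sum(child.cpu_mem() for child in self.children)


# "Self CPU Mem" values copied from the CUDA test table; nesting is assumed.
cuda_test_tree = OpNode("test_user_scope_alloc", -800, [
    OpNode("to", 0, [
        OpNode("empty_strided", 0),
        OpNode("copy_", 0),
    ]),
    OpNode("rand", 0, [
        OpNode("empty", 800),
        OpNode("uniform_", 0, [OpNode("is_complex", 0)]),
    ]),
])

rand_node = cuda_test_tree.children[1]
print(rand_node.name, rand_node.cpu_mem())            # table reports 800 b
print(cuda_test_tree.name, cuda_test_tree.cpu_mem())  # table reports 0 b
```

Under that assumed nesting, the negative `Self CPU Mem` on `test_user_scope_alloc` cancels the 800 bytes allocated under `rand`, which matches the `0 b` shown in its `CPU Mem` column.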

Differential Revision: [D21384248](https://our.internmc.facebook.com/intern/diff/D21384248)

[ghstack-poisoned]
Collaborator

@dzhulgakov dzhulgakov left a comment

Great! I think it's good to go! (and nice catch on overlapping ranges)

There's also some allocator stuff left in TH, but I'm not sure what it is and I didn't trace where it gets called:
https://github.com/pytorch/pytorch/blob/40265e2d663cc0027cffa6e80ee1ec67d467ca00/aten/src/TH/THAllocator.cpp

Comment thread c10/core/Allocator.h Outdated
// An interface for reporting thread local memory usage
// per device
struct C10_API MemoryReportingInfoBase : public c10::DebugInfoBase {
MemoryReportingInfoBase() {}
Collaborator

As discussed, move it to the .cpp file to avoid potentially duplicated symbols.

@ilia-cher
Contributor Author

From what I gathered (thanks @ezyang), this is a special allocator used to allocate tensors in shared memory for inter-process communication; I guess we can add memory reporting there too.

ilia-cher added 2 commits May 18, 2020 21:04
@facebook-github-bot
Contributor

@ilia-cher merged this pull request in a94fb71.

@ezyang
Contributor

ezyang commented May 20, 2020

This broke ROCm tests:


01:17:35 ======================================================================
01:17:35 FAIL: test_memory_profiler (__main__.TestAutograd)
01:17:35 ----------------------------------------------------------------------
01:17:35 Traceback (most recent call last):
01:17:35   File "test_autograd.py", line 2940, in test_memory_profiler
01:17:35     "test_user_scope_dealloc",
01:17:35   File "test_autograd.py", line 2897, in check_metrics
01:17:35     self.assertTrue(stat_metrics[alloc_fn] > 0)
01:17:35 AssertionError: False is not true
01:17:35 
01:17:35 ----------------------------------------------------------------------

@ezyang
Contributor

ezyang commented May 20, 2020

@jeffdaily
Collaborator

Thank you for the notification. In the future, if a new test for a new feature is breaking ROCm CI, can we have the developer(s) add the skipIfRocm decorator and tag me to look into it? This would help with our CI stability. Ideally, it would be fixed prior to merging the PR, but I understand at present it's a tough request.
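
For context, the kind of guard being asked for can be sketched with plain `unittest`. This is a stand-in, not PyTorch's own `skipIfRocm`: the helper name `skip_if_rocm`, its `on_rocm` parameter, and the assumption that ROCm jobs set `PYTORCH_TEST_WITH_ROCM=1` are all illustrative.

```python
import os
import unittest

# Assumption: ROCm CI jobs export this variable; adjust to the real signal.
TEST_WITH_ROCM = os.getenv("PYTORCH_TEST_WITH_ROCM", "0") == "1"


def skip_if_rocm(fn, on_rocm=TEST_WITH_ROCM):
    """Hypothetical stand-in for a skipIfRocm-style decorator.

    Marks the test as skipped when running on ROCm and returns it
    unchanged otherwise.
    """
    reason = "test doesn't currently work on the ROCm stack"
    return unittest.skipIf(on_rocm, reason)(fn)


class TestAutogradSketch(unittest.TestCase):
    @skip_if_rocm
    def test_memory_profiler(self):
        # Placeholder body; the real test inspects profiler memory stats.
        self.assertTrue(True)
```

On a ROCm job the test is then reported as skipped rather than failed, which keeps CI green while the failure is investigated.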

@jeffdaily
Collaborator

cc @ezyang #38790

@ilia-cher
Contributor Author

I'm landing the fix in #38795.

@ilia-cher
Contributor Author

Have you noticed, though, that py3.6-clang7-rocmdeb-ubuntu16.04-test2 has been broken on trunk since at least May 16?
https://ezyang.github.io/pytorch-ci-hud/build/pytorch-master

@albanD
Collaborator

albanD commented May 20, 2020

I don't think it was continuously broken: it was fixed on the 19th, then broken again 3 commits later :/

@facebook-github-bot facebook-github-bot deleted the gh/ilia-cher/68/head branch May 23, 2020 14:16
@seemethere seemethere added this to the 1.6.0 milestone Jun 22, 2020
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
Summary:
Pull Request resolved: pytorch#37775

Adding memory usage into profiler table output

Test Plan:
BUILD_BINARY=1 USE_BLAS=MKL USE_MKLDNN=0 USE_CUDA=0 python setup.py develop install --cmake

```
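import torch
import torchvision.models as models

# Build a ResNet-18 and a batch of five 224x224 RGB inputs
model = models.resnet18()
inp = torch.randn(5, 3, 224, 224)

# profile_memory=True tracks tensor memory allocations per operator;
# record_shapes=True records input shapes so results can be grouped by them
with torch.autograd.profiler.profile(profile_memory=True, record_shapes=True) as prof:
    model(inp)

# Show the top 15 rows sorted by CPU memory usage
print(prof.key_averages(group_by_input_shape=True).table(sort_by="cpu_memory_usage", row_limit=15))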
```

```
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Name                         Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CPU Mem Total    Number of Calls  Input Shapes
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
resize_                      0.37%            577.936us        0.37%            577.936us        9.796us          339.03 Mb        59               [[0]]
empty                        0.69%            1.061ms          0.74%            1.139ms          5.556us          47.42 Mb         205              []
stride                       0.00%            0.853us          0.00%            0.853us          0.853us          19.53 Kb         1                [[5, 1000]]
empty_strided                0.01%            21.393us         0.02%            26.033us         5.207us          252 b            5                []
is_complex                   0.02%            37.425us         0.02%            37.425us         1.291us          208 b            29               [[]]
masked_select                0.04%            55.333us         0.06%            93.616us         46.808us         120 b            2                [[30], [30]]
conv2d                       0.01%            18.009us         9.62%            14.902ms         14.902ms         0 b              1                [[5, 3, 224, 224], [64, 3, 7, 7], [
convolution                  0.01%            12.436us         9.61%            14.884ms         14.884ms         0 b              1                [[5, 3, 224, 224], [64, 3, 7, 7], [
_convolution                 0.03%            52.381us         9.60%            14.871ms         14.871ms         0 b              1                [[5, 3, 224, 224], [64, 3, 7, 7], [
size                         0.00%            5.429us          0.00%            5.429us          0.339us          0 b              16               [[5, 3, 224, 224]]
contiguous                   0.00%            1.934us          0.00%            1.934us          0.967us          0 b              2                [[5, 3, 224, 224]]
_convolution_nogroup         0.02%            27.505us         9.57%            14.814ms         14.814ms         0 b              1                [[5, 3, 224, 224], [64, 3, 7, 7], [
_nnpack_available            0.02%            34.267us         0.02%            34.267us         1.713us          0 b              20               []
thnn_conv2d                  0.01%            13.274us         9.54%            14.771ms         14.771ms         0 b              1                [[5, 3, 224, 224], [64, 3, 7, 7], [
thnn_conv2d_forward          5.98%            9.264ms          19.02%           29.446ms         14.723ms         0 b              2                [[5, 3, 224, 224], [64, 3, 7, 7], [
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Self CPU time total: 154.855ms
```
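
The memory column above mixes `b`, `Kb`, and `Mb` suffixes. A minimal sketch of a formatter that yields strings of that shape, assuming 1024-based units and two decimal places (the function name `format_memory` and the thresholds are assumptions, not taken from this commit):

```python
def format_memory(nbytes: int) -> str:
    """Render a byte count with b/Kb/Mb/Gb suffixes (1024-based, assumed)."""
    KB = 1024
    MB = 1024 * KB
    GB = 1024 * MB
    # Compare on the absolute value so deallocations (negative counts)
    # pick the same unit as allocations of equal size.
    if abs(nbytes) >= GB:
        return f"{nbytes / GB:.2f} Gb"
    if abs(nbytes) >= MB:
        return f"{nbytes / MB:.2f} Mb"
    if abs(nbytes) >= KB:
        return f"{nbytes / KB:.2f} Kb"
    return f"{nbytes} b"


print(format_memory(20000))
print(format_memory(-400))
```

Under these assumptions 20000 bytes renders as `19.53 Kb`, which would be consistent with a `[5, 1000]` float32 tensor (5 × 1000 × 4 bytes) behind the `19.53 Kb` row.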

Reviewed By: ngimel

Differential Revision: D21384248

Pulled By: ilia-cher

fbshipit-source-id: 31359cce2aa06f6255ed1ad8c60d03cb640bfec3