Memory profiling by ilia-cher · Pull Request #37775 · pytorch/pytorch

ilia-cher · 2020-05-04T17:35:51Z

Stack from ghstack:

#37587 RecordFunction in Dispatcher
#37639 Use TensorMethods.cpp
#38453 Fixes for profiling JIT code
#37775 Memory profiling

Summary:
Adding memory usage into profiler table output

Test Plan:

BUILD_BINARY=1 USE_BLAS=MKL USE_MKLDNN=1 USE_CUDA=1 python setup.py develop install

$ python benchmarks/profiler_benchmark/resnet_memory_profiler.py
output: https://gist.github.com/ilia-cher/3f37d54c3b2afb24d6776858e6860f69

$ python test/test_autograd.py TestAutograd.test_memory_profiler
Couldn't download test skip set, leaving all tests enabled...
Running CPU test
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Name                         Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CPU Mem          Self CPU Mem     CUDA Mem         Self CUDA Mem    Number of Calls  Input Shapes
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
test_user_scope_alloc        60.58%           105.892us        93.42%           163.285us        163.285us        800 b            0 b              0 b              0 b              1                []
rand                         10.53%           18.405us         32.83%           57.393us         57.393us         800 b            0 b              0 b              0 b              1                []
empty                        1.77%            3.092us          1.77%            3.092us          3.092us          800 b            800 b            0 b              0 b              1                []
uniform_                     19.64%           34.325us         20.54%           35.896us         35.896us         0 b              0 b              0 b              0 b              1                [[10, 10]]
is_complex                   0.90%            1.571us          0.90%            1.571us          1.571us          0 b              0 b              0 b              0 b              1                [[10, 10]]
test_user_scope_dealloc      6.58%            11.508us         6.58%            11.508us         11.508us         -800 b           -800 b           0 b              0 b              1                []
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Self CPU time total: 174.793us

Running CUDA test
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Name                         Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CPU Mem          Self CPU Mem     CUDA Mem         Self CUDA Mem    Number of Calls  Input Shapes
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
test_user_scope_alloc        29.37%           86.836us         93.05%           275.143us        275.143us        0 b              -800 b           1.00 Kb          0 b              1                []
to                           7.42%            21.939us         51.31%           151.703us        151.703us        0 b              0 b              1.00 Kb          0 b              1                [[10, 10]]
empty_strided                6.19%            18.295us         6.19%            18.295us         18.295us         0 b              0 b              1.00 Kb          1.00 Kb          1                []
rand                         4.50%            13.316us         12.38%           36.604us         36.604us         800 b            0 b              0 b              0 b              1                []
empty                        0.83%            2.456us          0.83%            2.456us          2.456us          800 b            800 b            0 b              0 b              1                []
uniform_                     6.44%            19.044us         7.05%            20.832us         20.832us         0 b              0 b              0 b              0 b              1                [[10, 10]]
is_complex                   0.60%            1.788us          0.60%            1.788us          1.788us          0 b              0 b              0 b              0 b              1                [[10, 10]]
copy_                        37.70%           111.469us        37.70%           111.469us        111.469us        0 b              0 b              0 b              0 b              1                [[10, 10], [10, 10]]
test_user_scope_dealloc      6.95%            20.544us         6.95%            20.544us         20.544us         0 b              0 b              -1.00 Kb         -1.00 Kb         1                []
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Self CPU time total: 295.687us

Running MKLDNN test
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Name                         Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CPU Mem          Self CPU Mem     CUDA Mem         Self CUDA Mem    Number of Calls  Input Shapes
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
test_user_scope_alloc        34.23%           43.503us         88.57%           112.550us        112.550us        400 b            -400 b           0 b              0 b              1                []
rand                         8.00%            10.167us         18.34%           23.302us         23.302us         400 b            0 b              0 b              0 b              1                []
empty                        2.22%            2.815us          2.22%            2.815us          2.815us          400 b            400 b            0 b              0 b              1                []
to_mkldnn                    35.16%           44.675us         36.00%           45.745us         45.745us         400 b            400 b            0 b              0 b              1                [[10, 10]]
uniform_                     7.24%            9.198us          8.12%            10.320us         10.320us         0 b              0 b              0 b              0 b              1                [[10, 10]]
is_complex                   0.88%            1.122us          0.88%            1.122us          1.122us          0 b              0 b              0 b              0 b              1                [[10, 10]]
contiguous                   0.84%            1.070us          0.84%            1.070us          1.070us          0 b              0 b              0 b              0 b              1                [[10, 10]]
test_user_scope_dealloc      11.43%           14.525us         11.43%           14.525us         14.525us         -400 b           -400 b           0 b              0 b              1                []
---------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Self CPU time total: 127.075us

.
----------------------------------------------------------------------
Ran 1 test in 1.571s

OK

Differential Revision: D21384248
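
One way to read the CPU Mem and Self CPU Mem columns in the tables above, sketched in plain Python rather than taken from the profiler's implementation: a range's total covers its own byte deltas plus everything recorded under it, and Self is that total minus the children's totals. The `Range` class, the `fmt_mem` helper, the 1 Kb = 1024 b convention, and the byte deltas below are all illustrative assumptions, chosen to reproduce the CPU columns of the CUDA test (where the 800-byte host tensor appears to be released directly inside the user scope).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Range:
    """Hypothetical profiler range: a name, byte deltas recorded directly in it, and child ranges."""
    name: str
    own_deltas: List[int] = field(default_factory=list)
    children: List["Range"] = field(default_factory=list)

    def total(self) -> int:
        # Inclusive usage: own deltas plus everything allocated or freed by children.
        return sum(self.own_deltas) + sum(c.total() for c in self.children)

    def self_mem(self) -> int:
        # Exclusive usage: inclusive total minus the children's inclusive totals.
        return self.total() - sum(c.total() for c in self.children)


def fmt_mem(nbytes: int) -> str:
    """Format a byte count in the style of the tables, assuming 1 Kb = 1024 b."""
    kb, mb = 1024, 1024 * 1024
    if abs(nbytes) >= mb:
        return f"{nbytes / mb:.2f} Mb"
    if abs(nbytes) >= kb:
        return f"{nbytes / kb:.2f} Kb"
    return f"{nbytes} b"


# CPU-side view of the CUDA test (assumed deltas): rand -> empty allocates
# 800 b on the host, and the host tensor is released inside the user scope.
empty = Range("empty", own_deltas=[800])
rand = Range("rand", children=[empty, Range("uniform_")])
scope = Range("test_user_scope_alloc", own_deltas=[-800], children=[rand, Range("to")])

for r in (scope, rand, empty):
    print(f"{r.name:24s} {fmt_mem(r.total()):>8s} {fmt_mem(r.self_mem()):>8s}")
```

Under these assumptions the user scope comes out with a total of 0 b and a Self value of -800 b, which matches the `test_user_scope_alloc` row of the CUDA test.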

Summary:
Adding memory usage into profiler table output

Test Plan:

BUILD_BINARY=1 USE_BLAS=MKL USE_MKLDNN=0 USE_CUDA=0 python setup.py develop install --cmake

python

import torch
import torchvision.models as models

model = models.resnet18()
inp = torch.randn(5, 3, 224, 224)

with torch.autograd.profiler.profile(profile_memory=True, record_shapes=True) as prof:
    model(inp)

print(prof.key_averages(group_by_input_shape=True).table(sort_by="cpu_memory_usage", row_limit=15))

--------------------  ----------------  --------------  -----------  ---------  ------------  -------------  ---------------  -----------------------------------
Name                  Self CPU total %  Self CPU total  CPU total %  CPU total  CPU time avg  CPU Mem Total  Number of Calls  Input Shapes
--------------------  ----------------  --------------  -----------  ---------  ------------  -------------  ---------------  -----------------------------------
resize_               0.37%             577.936us       0.37%        577.936us  9.796us       339.03 Mb      59               [[0]]
empty                 0.69%             1.061ms         0.74%        1.139ms    5.556us       47.42 Mb       205              []
stride                0.00%             0.853us         0.00%        0.853us    0.853us       19.53 Kb       1                [[5, 1000]]
empty_strided         0.01%             21.393us        0.02%        26.033us   5.207us       252 b          5                []
is_complex            0.02%             37.425us        0.02%        37.425us   1.291us       208 b          29               [[]]
masked_select         0.04%             55.333us        0.06%        93.616us   46.808us      120 b          2                [[30], [30]]
conv2d                0.01%             18.009us        9.62%        14.902ms   14.902ms      0 b            1                [[5, 3, 224, 224], [64, 3, 7, 7], [
convolution           0.01%             12.436us        9.61%        14.884ms   14.884ms      0 b            1                [[5, 3, 224, 224], [64, 3, 7, 7], [
_convolution          0.03%             52.381us        9.60%        14.871ms   14.871ms      0 b            1                [[5, 3, 224, 224], [64, 3, 7, 7], [
size                  0.00%             5.429us         0.00%        5.429us    0.339us       0 b            16               [[5, 3, 224, 224]]
contiguous            0.00%             1.934us         0.00%        1.934us    0.967us       0 b            2                [[5, 3, 224, 224]]
_convolution_nogroup  0.02%             27.505us        9.57%        14.814ms   14.814ms      0 b            1                [[5, 3, 224, 224], [64, 3, 7, 7], [
_nnpack_available     0.02%             34.267us        0.02%        34.267us   1.713us       0 b            20               []
thnn_conv2d           0.01%             13.274us        9.54%        14.771ms   14.771ms      0 b            1                [[5, 3, 224, 224], [64, 3, 7, 7], [
thnn_conv2d_forward   5.98%             9.264ms         19.02%       29.446ms   14.723ms      0 b            2                [[5, 3, 224, 224], [64, 3, 7, 7], [
--------------------  ----------------  --------------  -----------  ---------  ------------  -------------  ---------------  -----------------------------------
Self CPU time total: 154.855ms

[ghstack-poisoned]

dr-ci · 2020-05-04T17:45:06Z

💊 CI failures summary and remediations

As of commit 448af80 (more details on the Dr. CI page):

2/2 failures possibly* introduced in this PR
- 2/2 non-CircleCI failure(s)

Extra GitHub checks: 1 failed

Failed: GitHub Actions - quick-checks

ci.pytorch.org: 1 failed

Failed: pr/py3.6-clang7-rocmdeb-ubuntu16.04

This comment has been revised 271 times.

dzhulgakov

What about testing? Can you write unit tests, including one that enables the profiler in the middle of an allocation?

I'd also double-triple check that memory allocation tracking works well with non-standard allocators, e.g. if one does MKLDNN allocations (with to_mkldnn) or some of the internal ones (huge pages)

dzhulgakov · 2020-05-05T05:26:31Z

-  size_table_.erase(it);
+void ProfiledCPUMemoryReporter::Delete(void* ptr) {
+  if (memoryProfilingEnabled()) {
+    std::lock_guard<std::mutex> guard(mutex_);


That'd make execution way slower, but I guess it's OK for memory profiling. We should just make sure not to mix the two.
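
The trade-off in this comment can be sketched in Python (the PR itself is C++). The class and method names below are hypothetical stand-ins, not the PR's API. The lock is taken only while profiling is enabled, so the ordinary allocation path stays lock-free. As a design choice of this sketch, a free for a pointer that was allocated before profiling started is ignored rather than reported.

```python
import threading
from typing import Callable, Dict, List


class ProfiledMemoryReporter:
    """Hypothetical Python analogue of a pointer-to-size table guarded by a mutex."""

    def __init__(self, profiling_enabled: Callable[[], bool]) -> None:
        self._profiling_enabled = profiling_enabled
        self._lock = threading.Lock()
        self._size_table: Dict[int, int] = {}
        self.reported: List[int] = []  # byte deltas that would be handed to the profiler

    def new(self, ptr: int, nbytes: int) -> None:
        if not self._profiling_enabled():
            return  # fast path: no lock, no bookkeeping
        with self._lock:
            self._size_table[ptr] = nbytes
            self.reported.append(nbytes)

    def delete(self, ptr: int) -> None:
        if not self._profiling_enabled():
            return
        with self._lock:
            # Pointers allocated while profiling was off are not in the table.
            nbytes = self._size_table.pop(ptr, None)
            if nbytes is not None:
                self.reported.append(-nbytes)


enabled = threading.Event()
reporter = ProfiledMemoryReporter(enabled.is_set)

reporter.new(0x1000, 400)   # allocated before profiling: not tracked
enabled.set()               # profiler turned on "in the middle"
reporter.new(0x2000, 800)
reporter.delete(0x1000)     # unknown pointer: ignored
reporter.delete(0x2000)
print(reporter.reported)
```

One limitation of this sketch: an entry recorded while profiling is on and freed after it is turned off stays in the table, so a real implementation would need to decide how to clear it.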

dzhulgakov

Great! I think it's good to go! (and nice catch on overlapping ranges)

There's also some allocator stuff left in TH, but I'm not sure what it is and I didn't trace where it gets called:
https://github.com/pytorch/pytorch/blob/40265e2d663cc0027cffa6e80ee1ec67d467ca00/aten/src/TH/THAllocator.cpp

dzhulgakov · 2020-05-15T04:42:43Z

+// An interface for reporting thread local memory usage
+// per device
+struct C10_API MemoryReportingInfoBase : public c10::DebugInfoBase {
+  MemoryReportingInfoBase() {}


As discussed, move it to a .cpp file to avoid potentially duplicated symbols.

ilia-cher · 2020-05-18T21:43:57Z

From what I gathered (thanks @ezyang), this is a special allocator used to allocate tensors in shared memory for inter-process communication; I guess we can add memory reporting there too.

facebook-github-bot · 2020-05-20T00:13:49Z

@ilia-cher merged this pull request in a94fb71.

ezyang · 2020-05-20T14:50:13Z

This broke ROCm tests:


01:17:35 ======================================================================
01:17:35 FAIL: test_memory_profiler (__main__.TestAutograd)
01:17:35 ----------------------------------------------------------------------
01:17:35 Traceback (most recent call last):
01:17:35   File "test_autograd.py", line 2940, in test_memory_profiler
01:17:35     "test_user_scope_dealloc",
01:17:35   File "test_autograd.py", line 2897, in check_metrics
01:17:35     self.assertTrue(stat_metrics[alloc_fn] > 0)
01:17:35 AssertionError: False is not true
01:17:35 
01:17:35 ----------------------------------------------------------------------

ezyang · 2020-05-20T14:50:39Z

Was failing on https://ci.pytorch.org/jenkins/job/pytorch-builds/job/py3.6-clang7-rocmdeb-ubuntu16.04-test2/28393/console

cc @jeffdaily

jeffdaily · 2020-05-20T16:10:21Z

Thank you for the notification. In the future, if a new test for a new feature is breaking ROCm CI, can we have the developer(s) add the skipIfRocm decorator and tag me to look into it? This would help with our CI stability. Ideally, it would be fixed prior to merging the PR, but I understand at present it's a tough request.
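
A minimal sketch of the kind of decorator requested here, written with only the standard library. PyTorch ships its own `skipIfRocm` in its test utilities; the `skip_if_rocm` helper below, the `ExampleTest` class, and the use of the `PYTORCH_TEST_WITH_ROCM` environment variable as the switch are assumptions for illustration, not a copy of that implementation.

```python
import functools
import os
import unittest


def skip_if_rocm(test_fn):
    """Hypothetical stand-in: skip the wrapped test when the ROCm test flag is set."""
    @functools.wraps(test_fn)
    def wrapper(*args, **kwargs):
        # Assumed switch: an environment variable set by the ROCm CI job.
        if os.getenv("PYTORCH_TEST_WITH_ROCM", "0") == "1":
            raise unittest.SkipTest("test doesn't currently work on the ROCm stack")
        return test_fn(*args, **kwargs)
    return wrapper


class ExampleTest(unittest.TestCase):
    @skip_if_rocm
    def test_memory_profiler(self):
        # Placeholder body; the real test checks the profiler's memory metrics.
        self.assertTrue(True)


def run_suite() -> unittest.TestResult:
    """Run ExampleTest and return the collected result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ExampleTest)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

With the flag set to `1`, `run_suite()` reports the test as skipped instead of failed, which keeps the CI signal clean while the ROCm-specific problem is investigated.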

jeffdaily · 2020-05-20T16:38:17Z

cc @ezyang #38790

ilia-cher · 2020-05-20T20:31:23Z

I'm landing the fix #38795

ilia-cher · 2020-05-20T20:34:39Z

Have you noticed, though, that py3.6-clang7-rocmdeb-ubuntu16.04-test2 has been broken on trunk since at least May 16?
https://ezyang.github.io/pytorch-ci-hud/build/pytorch-master

albanD · 2020-05-20T20:37:59Z

I don't think it was continuously broken: it was fixed on the 19th, then broken again 3 commits later :/

Summary: Pull Request resolved: pytorch#37775

Adding memory usage into profiler table output

Reviewed By: ngimel

Differential Revision: D21384248

Pulled By: ilia-cher

fbshipit-source-id: 31359cce2aa06f6255ed1ad8c60d03cb640bfec3

ilia-cher requested review from albanD and apaszke as code owners May 4, 2020 17:35

ilia-cher added 2 commits May 4, 2020 12:37

dzhulgakov requested changes May 5, 2020

ilia-cher added 16 commits May 4, 2020 23:31

ilia-cher mentioned this pull request May 14, 2020

Fixes for profiling JIT code #38453

Closed

ilia-cher added 4 commits May 13, 2020 19:17

ilia-cher requested a review from dzhulgakov May 14, 2020 03:40

ilia-cher added 3 commits May 14, 2020 02:49

dzhulgakov approved these changes May 15, 2020

ilia-cher added 2 commits May 18, 2020 21:04

facebook-github-bot closed this in a94fb71 May 19, 2020

facebook-github-bot added the merged label May 20, 2020

facebook-github-bot deleted the gh/ilia-cher/68/head branch May 23, 2020 14:16

seemethere added this to the 1.6.0 milestone Jun 22, 2020

mruberry added the Merged label Oct 28, 2020

quinor mentioned this pull request Dec 2, 2020

add autograd profiler Lightning-AI/pytorch-lightning#1693

Closed

5 tasks

8 participants
