FLOPS Roofline Analysis Feature for PyTorch Profiler #46506
xuzhao9 wants to merge 1 commit into pytorch:master
Conversation
💊 CI failures summary and remediations
As of commit 1e0fcba1a1 (more details on the Dr. CI page):
3 failures not recognized by patterns
Extra GitHub checks: 2 failed
This comment was automatically generated by Dr. CI.
💊 CI failures summary and remediations
As of commit 6fa25f5 (more details on the Dr. CI page):
🚧 1 fixed upstream failure: these were probably caused by upstream breakages that were already fixed. Please rebase on the …
Force-pushed: 6535ac7 → 9443759
ilia-cher left a comment:
LG, a few comments inline and a high-level comment:
how about passing the extra args back into Python? Then we don't need FLOPs code in C++ and can implement the FLOPs computation as a Python module;
we could also find these args useful for other purposes.
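A hypothetical sketch of that direction: the profiler exports per-event extra args (e.g. saved input sizes) to Python, and the formulas live in a plain dict. Names like `extra_args`, `mat1_size`, and `FLOP_FORMULAS` are illustrative, not the PR's API:

```python
# Hypothetical Python-side FLOPs module; `extra_args` is assumed to be
# the per-event dict that the C++ profiler would export.
def mm_flops(extra_args):
    m, k = extra_args["mat1_size"]
    _, n = extra_args["mat2_size"]
    return m * k * n  # multiply-accumulate count for an (m, k) x (k, n) matmul

FLOP_FORMULAS = {"aten::mm": mm_flops}

def flops_for_event(op_name, extra_args):
    # Return None (rendered as "--" in the table) for ops without a formula.
    formula = FLOP_FORMULAS.get(op_name)
    return formula(extra_args) if formula is not None else None
```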
Btw, we could also expand the list of ops we set extra args for, e.g. some element-wise ops (add, mul). Not blocking, but could be easy to include too.
Discussed offline: we can export extra_args to Python in follow-ups and keep the logic in C++ for now. Let's move the FLOPs logic and the extra-args extraction logic into a separate .h/.cpp.
FYI: we've implemented FLOP counting for a few key aten operations at https://github.com/facebookresearch/fvcore/blob/master/fvcore/nn/jit_handles.py. One note: we deliberately decided to print warnings for every operator that's not counted, unless the operator is explicitly ignored, because silently ignored operations give users a false sense that the model has low FLOPs.
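That warn-unless-ignored policy can be sketched as follows (a minimal illustration of the behavior described above; the function and argument names are hypothetical, not fvcore's API):

```python
import warnings

def count_op_flops(op_name, handlers, ignored_ops, *args):
    """Warn for every op without a FLOPs handler unless it is explicitly
    ignored, so unsupported ops never silently read as zero FLOPs."""
    if op_name in handlers:
        return handlers[op_name](*args)
    if op_name not in ignored_ops:
        warnings.warn(f"No FLOPs formula for {op_name}; it is not counted.")
    return 0
```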
Thanks! Might be useful for us. At the moment we keep the formula code in C++, but Python is also an option. Also, having the formulas in Python would make it easy to implement them in C++ too.
There are a few reasons why in fvcore we prefer to count FLOPs in Python:
Force-pushed: 9443759 → 074c0f2
Force-pushed: 5ee9a3a → 79412b2
Also, please update the test section with the new test output.
We should check the type of the inputs to make sure that we compute actual FLOPs (probably just adding a check would be enough).
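A sketch of such a check (the PR's logic lives in C++; this is the Python equivalent, assuming tensor inputs):

```python
import torch

def counts_as_flops(tensor: torch.Tensor) -> bool:
    # Only floating-point inputs should contribute to the FLOPs column;
    # integer inputs (e.g. quantized matmuls) would otherwise be
    # miscounted as floating-point work.
    return tensor.is_floating_point()
```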
Force-pushed: 8bdfdd4 → d41f150
facebook-github-bot left a comment:
@ilia-cher has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Codecov Report
@@            Coverage Diff             @@
##           master   #46506      +/-   ##
==========================================
+ Coverage   80.60%   80.62%    +0.01%
==========================================
  Files        1879     1880        +1
  Lines      202892   203412      +520
==========================================
+ Hits       163543   163993      +450
- Misses      39349    39419       +70
Force-pushed: d41f150 → e00871b
malfet left a comment:
Looks good to me, although please consider getting rid of all string literals in the code. For example, instead of typing "mat1_size", "mat2_size", define `constexpr auto kMat1Size = "mat1_size";` in profiler_utils.h and then reference this constant in the code.
Otherwise, your code can be subject to typos between different literals that would be detected only at runtime.
Force-pushed: e00871b → 9909319
facebook-github-bot left a comment:
@xuzhao9 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Force-pushed: 9909319 → bd40860
Force-pushed: 15b859d → 4f8ed9c
Summary:
Currently, PyTorch Profiler lacks the ability to measure the FLOPs of operators such as mm and conv.
FLOPs help estimate the computational complexity of operators.
For now, we use input shapes to estimate the number of floating-point operations.
In the future, we may compute this information by tracking hardware counters.
Test Plan:
Run `python test/test_profiler_flops.py -k test_flops`. The test will print a profiler table with a "FLOPS" column, like the following:
---------------------------- ------------ ------------ ------------ ------------ ------------ ------------ --------------------------------------------- ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg # of Calls Input Shapes MFLOPS
---------------------------- ------------ ------------ ------------ ------------ ------------ ------------ --------------------------------------------- ------------
aten::matmul 0.06% 57.653us 82.97% 79.310ms 79.310ms 1 [[40, 33, 1, 243], [243, 243]] --
aten::mm 82.84% 79.186ms 82.86% 79.204ms 79.204ms 1 [[1320, 243], [243, 243]] 984.323
aten::conv2d 0.04% 36.345us 16.06% 15.347ms 15.347ms 1 [[40, 16, 18, 260], [33, 16, 18, 18], [33], [ 44065010.318
aten::convolution 0.02% 16.016us 16.02% 15.310ms 15.310ms 1 [[40, 16, 18, 260], [33, 16, 18, 18], [33], [ --
aten::_convolution 0.07% 63.855us 16.00% 15.294ms 15.294ms 1 [[40, 16, 18, 260], [33, 16, 18, 18], [33], [ --
aten::mkldnn_convolution 15.89% 15.188ms 15.93% 15.225ms 15.225ms 1 [[40, 16, 18, 260], [33, 16, 18, 18], [33], [ --
aten::relu 0.10% 98.223us 0.64% 612.157us 306.079us 2 [[40, 33, 1, 243]] --
aten::threshold 0.49% 465.416us 0.54% 513.934us 256.967us 2 [[40, 33, 1, 243], [], []] --
aten::add_ 0.29% 279.301us 0.29% 279.301us 279.301us 1 [[40, 33, 1, 243], [243], []] --
aten::empty 0.10% 99.113us 0.10% 99.113us 24.778us 4 [[], [], [], [], [], []] --
---------------------------- ------------ ------------ ------------ ------------ ------------ ------------ --------------------------------------------- ------------
Self CPU time total: 95.585ms
.
----------------------------------------------------------------------
Ran 1 test in 0.176s
For now, we only provide FLOPs calculation for aten::conv2d and aten::mm operators.
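For reference, a minimal usage sketch of the feature (assuming the `with_flops=True` profiler flag this PR introduces; the toy model and input sizes are illustrative, chosen to match the table above):

```python
import torch
import torch.nn as nn
from torch.autograd import profiler

model = nn.Conv2d(16, 33, kernel_size=18)  # exercises aten::conv2d
x = torch.randn(40, 16, 18, 260)

# record_shapes is needed so input sizes are available for the formulas.
with profiler.profile(record_shapes=True, with_flops=True) as prof:
    model(x)

# The printed table gains an extra MFLOPS column for supported ops.
print(prof.table(sort_by="cpu_time_total"))
```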
Force-pushed: 4f8ed9c → 6fa25f5
facebook-github-bot left a comment:
@xuzhao9 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Profiler now warns when profiling with …
Sending a fix: #49896
    kernel_sizes[2], kernel_sizes[3]);

// grouping is NOT properly handled yet
return conv2d_multiply_factor * minibatch * input_h * input_w * kernel_h * kernel_w * in_channels * out_channels;
To properly handle stride, padding, dilation, and so on, we should use output_h * output_w instead of input_h * input_w.
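A minimal sketch of the suggested correction (Python for clarity; the PR's formula lives in C++, and `output_h`/`output_w` would need to be recorded or derived):

```python
def conv2d_flops(minibatch, in_channels, out_channels,
                 output_h, output_w, kernel_h, kernel_w):
    # Counting per *output* element bakes stride, padding, and dilation
    # into the estimate: each output value is one inner product over
    # kernel_h * kernel_w * in_channels inputs.
    macs = (minibatch * output_h * output_w * out_channels *
            kernel_h * kernel_w * in_channels)
    return 2 * macs  # one multiply plus one add per MAC
```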
for (int64_t dim : mat2_size) {
  flops *= dim;
}
return flops;
Shouldn't we have an extra factor of 2 to account for multiplication and addition?
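For reference, the conventional count with that factor, as a sketch:

```python
def mm_flops(mat1_size, mat2_size):
    m, k = mat1_size
    k2, n = mat2_size
    assert k == k2, "inner dimensions must match"
    # Each of the m*n outputs is a length-k dot product:
    # k multiplies plus k adds, hence the factor of 2.
    return 2 * m * k * n
```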
std::tie(out_channels, std::ignore, kernel_h, kernel_w) = std::make_tuple(kernel_sizes[0], kernel_sizes[1],
    kernel_sizes[2], kernel_sizes[3]);

// grouping is NOT properly handled yet
Handling groups is as easy as just dividing the FLOPs by extra_args.at(kGroups), or is there any corner case I didn't think about?
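For standard grouped convolution that does seem sufficient: each group connects only in_channels/groups inputs to out_channels/groups outputs, so the ungrouped in_channels * out_channels product overcounts by exactly a factor of groups; the same division also holds for depthwise convolution (groups == in_channels). A sketch (the `extra_args`/`groups` naming mirrors the comment and is hypothetical here):

```python
def apply_groups(ungrouped_flops: int, extra_args: dict) -> int:
    # Divide out the overcounting: each output channel only sees
    # in_channels / groups of the input channels.
    groups = extra_args.get("groups", 1)
    return ungrouped_flops // groups
```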
Summary: FLOPs Roofline Analysis Feature for PyTorch Profiler.
Pull Request resolved: pytorch#46506
Reviewed By: ezyang
Differential Revision: D25214452
Pulled By: xuzhao9
fbshipit-source-id: 0ae841bd8dbdeb032346dc3d9d38e19875aa1da3
@jspark1105 thanks for the feedback! We'll send an update to the formulas. cc @xuzhao9
Thanks for the comments, @jspark1105! I have created #51377 to address them.