Prototype benchmarking util #38338

Closed
robieta wants to merge 27 commits into master from gh/taylorrobie/timeit_benchmark

Conversation

@robieta (Contributor) commented May 12, 2020

This is the prototype for the modular utils that we've been discussing. It is admittedly a large PR, but a good fraction of that is documentation and examples. I've trimmed a bit on the edges since we last discussed this design (for instance Timer is no longer Fuzzer aware), but it's mostly the same.

In addition to the library and hermetic examples, I've included examples.end_to_end which tests #38061 over a variety of shapes, dtypes, degrees of broadcasting, and layouts. (CC @crcrpar) I only did CPU as I'm not set up on a GPU machine yet. Results from my devserver

Key takeaways:

  1. For contiguous Tensors, larger dtypes (fp32 and fp64), and lots of reuse of the mask due to broadcasting, the improvements are significant. (Presumably due to better vectorization?)
  2. There is an extra ~1.5 us overhead, which dominates small kernels.
  3. Cases with lower write intensity (int8, lower mask fraction, etc) or non-contiguous seem to suffer.

Hopefully this demonstrates the proof-of-concept for how this tooling can be used to tune kernels and assess PRs. Looking forward to thoughts and feedback.
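The measurement workflow the PR's Timer targets (the utility later merged as `torch.utils.benchmark`) can be sketched with the stdlib `timeit` module alone. The function name and defaults below are illustrative, not the PR's actual API:

```python
# Minimal stdlib analogue of a Timer-style workflow: run a statement
# many times per trial, over several trials, and report per-call times.
import timeit

def measure(stmt, setup="pass", globals_=None, number=100, repeat=5):
    """Return per-call times (seconds) for each of `repeat` trials,
    where each trial runs `stmt` `number` times."""
    t = timeit.Timer(stmt=stmt, setup=setup, globals=globals_)
    return [total / number for total in t.repeat(repeat=repeat, number=number)]

times = measure("sorted(data)", globals_={"data": list(range(1000, 0, -1))})
print(len(times))  # 5 trials
```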

@dr-ci (Bot) commented May 12, 2020

💊 CI failures summary and remediations

As of commit d068b7a (more details on the Dr. CI page):


None of the CI failures appear to be your fault 💚



❄️ 2 failures tentatively classified as flaky, but reruns have not yet been triggered to confirm:

See CircleCI build caffe2_onnx_main_py3_6_clang7_ubuntu16_04_build (1/2)

Step: "Build" (full log | diagnosis details | 🔁 rerun) ❄️

Jun 29 23:01:44 Failed to recurse into submodule path 'third_party/ideep'
```
sys	0m0.060s
Jun 29 23:01:19 ++ export BUILD_ENVIRONMENT=caffe2-onnx-main-py3.6-clang7-ubuntu16.04-build
Jun 29 23:01:19 ++ BUILD_ENVIRONMENT=caffe2-onnx-main-py3.6-clang7-ubuntu16.04-build
Jun 29 23:01:19 ++ git submodule sync
Jun 29 23:01:19 ++ git submodule update -q --init --recursive
Jun 29 23:01:44 error: RPC failed; curl 56 GnuTLS recv error (-54): Error in the pull function.
Jun 29 23:01:44 fatal: The remote end hung up unexpectedly
Jun 29 23:01:44 fatal: early EOF
Jun 29 23:01:44 fatal: index-pack failed
Jun 29 23:01:44 fatal: clone of 'https://github.com/intel/mkl-dnn.git' into submodule path 'mkl-dnn' failed
Jun 29 23:01:44 Failed to recurse into submodule path 'third_party/ideep'
```

See CircleCI build binary_windows_libtorch_3_7_cpu_debug_build (2/2)

Step: "Build" (full log | diagnosis details | 🔁 rerun) ❄️

CondaHTTPError: HTTP 000 CONNECTION FAILED for url
```
The system cannot find the file specified.
Could Not Find C:\w\b\windows\miniconda.exe
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
   0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0 100 54.6M  100 54.6M    0     0  54.6M      0  0:00:01 --:--:--  0:00:01  205M
Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done

CondaHTTPError: HTTP 000 CONNECTION FAILED for url <https://repo.anaconda.com/pkgs/main/win-64/mkl-2020.1-216.conda>
Elapsed: -

An HTTP error occurred when trying to retrieve this URL.
HTTP errors are often intermittent, and a simple retry will get you on your way.

## Package Plan ##

  environment location: C:\w\b\windows\conda\envs\py37
```


@robieta (Contributor, Author) commented May 12, 2020

CI is failing because of merge conflicts when it tries to fast forward the branch. Looking into it now.

@vadimkantorov (Contributor) commented May 12, 2020

One thing to consider for CPU benchmarks (but may be hard to do properly): controlling for CPU throttling https://lemire.me/blog/2018/01/16/microbenchmarking-calls-for-idealized-conditions/ and maybe for CPU thread affinity

@robieta (Contributor, Author) commented May 12, 2020

> One thing to consider for CPU benchmarks (but may be hard to do properly): controlling for CPU throttling and maybe for CPU thread affinity

Yeah, it's a tough problem to be sure. I'm using the standard set of tricks to mitigate this:

  1. Conduct a number of trials rather than one long run. (Generally tens to hundreds)
  2. Use the median rather than the mean for robustness to outliers / systematic throttling / etc.
  3. Discard trials where too much variation is observed.
  4. Trim visualization based on estimated significant figures.

I've found that measurements are fairly stable (at least on my machine), but I agree that it's definitely something to watch out for and this is not a panacea. The hope is that later versions will have proper runtime integration so we can get counts, allocations, etc.
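Steps 2 and 3 above can be sketched in a few lines; the culling threshold here is an assumed illustration, not the PR's actual cutoff:

```python
# Summarize a set of short trials: report the median, but discard the
# measurement entirely if the inter-quartile spread is too wide.
import statistics

def summarize(trial_times, max_rel_iqr=0.1):
    med = statistics.median(trial_times)
    q1, _, q3 = statistics.quantiles(trial_times, n=4)
    if (q3 - q1) / med > max_rel_iqr:
        return None  # too noisy; discard this measurement
    return med

print(summarize([1.00, 1.01, 0.99, 1.02, 1.00]))  # stable -> median
print(summarize([1.0, 2.0, 0.5, 3.0, 1.5]))       # noisy -> None
```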

@facebook-github-bot (Contributor) left a comment

@robieta has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@dzhulgakov (Collaborator) left a comment

This is pretty cool, I like nice APIs!

For CPU control it's hard to do so from within process. But we could provide a standard wrapper script that would turn off turbo (not sure whether there's a standard way to do it), pin to a single cpu and set thread scheduler priority to performance.
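The affinity half of that wrapper idea can be sketched portably from Python; turning off turbo and setting the governor needs root and is platform-specific, so only CPU pinning is shown here (illustrative, not part of this PR):

```python
# Pin this process to one core, then launch the benchmark as a child;
# on Linux the child inherits the affinity mask.
import os
import subprocess
import sys

def run_pinned(cmd, core=0):
    if hasattr(os, "sched_setaffinity"):   # Linux-only API
        os.sched_setaffinity(0, {core})    # pin ourselves first
    return subprocess.run(cmd, check=True)

run_pinned([sys.executable, "-c", "print('benchmark would run here')"])
```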

Comment thread benchmarks/experimental_components/examples/end_to_end.py Outdated
Comment thread benchmarks/experimental_components/utils/timer.py Outdated
@robieta (Contributor, Author) commented May 20, 2020

> This is pretty cool, I like nice APIs!
>
> For CPU control it's hard to do so from within process. But we could provide a standard wrapper script that would turn off turbo (not sure whether there's a standard way to do it), pin to a single cpu and set thread scheduler priority to performance.

Thanks!

I agree, there doesn't seem to be any way to control the environment without spawning a subprocess. One concern would be overhead; given that most measurements tend to be short (< 1s), I worry that the vast majority of time would be spent creating and destroying these controlled envs. One could imagine keeping a pool of subprocesses and having Timers "submit" work to it, but you start to get into non-trivial engineering complexity vs just spacing out replicates.
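The "pool of workers" idea can be sketched as follows: pay startup cost once and submit many short measurements to warm workers. For portability this sketch uses threads; the real version would keep worker subprocesses alive (e.g. `concurrent.futures.ProcessPoolExecutor`) for isolation. Purely illustrative; the PR does not implement this:

```python
# Keep a pool of warm workers and have Timers "submit" work to it,
# instead of paying process creation per measurement.
import timeit
from concurrent.futures import ThreadPoolExecutor

def time_stmt(stmt, number=1000):
    return timeit.timeit(stmt, number=number) / number

stmts = ["sum(range(100))", "min(range(100))"]
with ThreadPoolExecutor(max_workers=2) as pool:
    results = dict(zip(stmts, pool.map(time_stmt, stmts)))
print(sorted(results))
```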

@facebook-github-bot (Contributor) left a comment

@robieta has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ngimel (Collaborator) left a comment

Overall this looks great!

broadcastable to the shape of `x`:

```
fuzzer = Fuzzer(
```
@ngimel (Collaborator) commented:

As a follow-up, it probably makes sense to create fuzzer helpers that would fuzz tensors for common cases - e.g. UnaryOpFuzzer, BinaryOpFuzzer (with/without broadcast) with number of dimensions as an argument and sensible defaults.

@robieta (Contributor, Author) replied:

I've added two which are pretty comprehensive for unary and binary ops, and I'll add more in the future. Let me know what you think.
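To give a sense of what such helpers generate, here is a hypothetical shape-fuzzing sketch; the function name, parameters, and defaults are invented for illustration and are not the PR's actual `BinaryOpFuzzer` API:

```python
# Generate a random shape for `x` and a broadcastable shape for `y`,
# the kind of case a binary-op fuzzer would enumerate.
import random

def fuzz_binary_shapes(dims=3, max_size=64, broadcast=True, seed=0):
    rng = random.Random(seed)
    x_shape = [rng.randint(1, max_size) for _ in range(dims)]
    y_shape = list(x_shape)
    if broadcast:
        # randomly collapse dimensions of y to 1 so it broadcasts against x
        for i in range(dims):
            if rng.random() < 0.5:
                y_shape[i] = 1
    return tuple(x_shape), tuple(y_shape)

x, y = fuzz_binary_shapes()
assert all(a == b or b == 1 for a, b in zip(x, y))  # y broadcasts to x
```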

Comment thread benchmarks/experimental_components/utils/common.py Outdated
Comment thread benchmarks/experimental_components/utils/common.py Outdated
Comment thread benchmarks/experimental_components/utils/common.py Outdated
```
    return output

    @staticmethod
    def color_segment(segment, value, group_values):
```
@ngimel (Collaborator) commented:

color-coding is very nice!

Comment thread benchmarks/experimental_components/utils/fuzzer.py Outdated
Comment thread benchmarks/experimental_components/examples/end_to_end.py Outdated
Comment thread benchmarks/experimental_components/utils/fuzzer.py Outdated
@facebook-github-bot (Contributor) left a comment

@robieta has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@robieta force-pushed the gh/taylorrobie/timeit_benchmark branch from 615a36b to a3db551 on June 4, 2020.
@robieta (Contributor, Author) commented Jun 24, 2020

I've updated the end-to-end example to test 3 PRs, and included a script to build the expected environments. The following were run on a 56-core, 8×P100 machine:

#39850 (cc @xwang233)

#39967 (cc @ShawnZhong)

#39744 (cc @nikitaved)

@robieta (Contributor, Author) commented Jun 24, 2020

@nikitaved Thanks for pointing out that the GPU variation for your PR is quite high. (I can't seem to find the comment)
I re-ran it using the same environment for both before and after and saw similarly high levels of variation. After some experimentation, I found two things:

  1. Running CPU and GPU at the same time significantly increases the variance (~ +/- 70% with vs. ~ +/- 30% without). I had limited the number of workers in the CPU pool, but htop indicates that I was still saturating the CPU at times so it was probably contending on the CPU portion. (e.g. stream management and op launch.)

  2. Replicates are essential. I had this notion that GPU measurements were more stable than CPU, but empirically I had to go to 5 replicates to get down to < 5% variability. (The Timer does internal replicates, but those are clearly correlated either by device or temporally.)

Even then, you have to throw out almost 50% of the test cases because the variation is too high. These are basically all <75 us cases; there's just too much overhead jitter. Which is a shame. I think I'm going to experiment with having a worker do more measurements if it detects that a result is noisy to see if that lets us save some of those results.
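The culling rule described here can be sketched as: take both sets of replicates in the same environment, discard the case if either side's spread is too wide, otherwise report the median ratio. Thresholds below are illustrative, not the PR's:

```python
# Compare "before" and "after" replicates; cull the case when the
# within-side spread exceeds a relative threshold.
import statistics

def compare(before_reps, after_reps, cull_rel_spread=0.05):
    for reps in (before_reps, after_reps):
        med = statistics.median(reps)
        if (max(reps) - min(reps)) / med > cull_rel_spread:
            return None  # variation too high; cull this case
    return statistics.median(after_reps) / statistics.median(before_reps)

print(compare([100, 101, 99, 100, 100], [90, 91, 89, 90, 90]))  # -> 0.9
```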

@robieta (Contributor, Author) commented Jun 25, 2020

It looks like the underlying issue is that the GPUs are not identical, so the variance was due to master and the branch being scheduled on different devices. Running them on the same GPU significantly reduces GPU run-to-run variation. And for CPU, 5 replicates is sufficient to clamp down on the variance. (Though a non-trivial number still get culled due to excessive variation.) To make this easier to validate, I added a --test_variance option that runs both the before and after in the same environment, so any difference is noise.

#39744 Variance test
#39744 summary (new methodology)

The variance test shows that we're now within ~5%. Rerunning the actual PR shows no meaningful change for GPU (yay!) and a CPU change that is beyond noise but much more muted than before, indicating that the +/- 50% swings were noise rather than real changes; once we clamp down on that noise, the real signal is a ~5-10% relative difference.

I'll rerun all of the PR benchmarks tomorrow.

@robieta (Contributor, Author) commented Jun 25, 2020

Updated runs: (This is on an 8xV100 machine since I lost the 8xP100 machine and assignment is random.)
#39850

#39967

#39744

@nikitaved (Collaborator) commented:

@robieta, thanks for the update! When it comes to sorting, I observed that TensorIterator might give a significant boost for large multidimensional contiguous tensors with the last dimensions sorted. Maybe it could be of use, as it might affect the performance of any dim-apply type of algorithm.

@ngimel (Collaborator) left a comment

This looks good and you should merge it. The end-to-end example still contains PR-specific pieces spread around, so some restructuring may be needed to make it easier to reuse for other PRs, but that can be done later. The general utilities are in good shape and it definitely makes sense to merge them.

```
def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--pr", type=str, default=_PR_LIST[0], choices=_PR_LIST)
    parser.add_argument("--num_gpus", type=int, default=8)
```
@ngimel (Collaborator) commented:

default probably should be "use all" instead of hardcoded 8 - number of available gpus can be queried later
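The suggested change can be sketched as: default to `None` meaning "use all", and query the device count at runtime. In the real code that query would be `torch.cuda.device_count()`; a stub stands in for it here so the sketch is self-contained:

```python
# Default num_gpus to "use all available" instead of a hardcoded 8.
import argparse

def device_count():  # stand-in for torch.cuda.device_count()
    return 8

parser = argparse.ArgumentParser()
parser.add_argument("--num_gpus", type=int, default=None,
                    help="default: use all available GPUs")
args = parser.parse_args([])
num_gpus = args.num_gpus if args.num_gpus is not None else device_count()
print(num_gpus)  # -> 8 with the stub
```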


```
layout:
    Indicates that `x` is not contiguous due to permutation. Invoking
    `x.permute(steps)` (e.g. x.permute((2, 0, 1)) if steps = [2, 0, 1])
```
@ngimel (Collaborator) commented:

did you mean steps here or order?

```
layout:
    Indicates that `x` is not contiguous due to permutation. Invoking
    `x.permute(steps)` (e.g. x.permute((2, 0, 1)) if steps = [2, 0, 1])
    would produce a Tensor whose shape matches memory order. (Though still
```
@ngimel (Collaborator) commented:

a Tensor with physical memory layout matching logical memory layout? shape matches memory order is not very clear

@facebook-github-bot (Contributor) left a comment

@robieta has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor) commented:

@robieta merged this pull request in f394979.

@fmassa deleted the gh/taylorrobie/timeit_benchmark branch on July 6, 2020.
facebook-github-bot referenced this pull request Jan 22, 2021
Summary:
This is benchmarking tooling for sparse tensors. To implement it, we extended the `benchmarking util` PR ([#38338](https://github.com/pytorch/pytorch/pull/38338)) to sparse tensors: the **FuzzedTensor** class was extended by creating the new **FuzzedSparseTensor** class, and two new operator classes were added, `UnaryOpSparseFuzzer` and `BinaryOpSparseFuzzer`.

The class `FuzzedSparseTensor` adds new input parameters to the constructor:
1. `sparse_dim`: The number of sparse dimensions in a sparse tensor.
2. `nnz`:   Number of non-zero elements in the sparse tensor.
3. `density`: The density of the sparse tensor.
4. `coalesced`: Whether the sparse tensor is coalesced or uncoalesced (the sparse format permits both).

and removes `probability_contiguous`, `max_allocation_bytes`, `roll_parameter`, `tensor_constructor`, as they are dense-tensor-related parameters.

In addition, I've extended the `torch.utils.benchmark.examples` to work with the new classes `FuzzedSparseTensor`, `UnaryOpSparseFuzzer` and `BinaryOpSparseFuzzer`.
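To illustrate what a sparse fuzzer has to produce, here is a self-contained sketch of generating COO-style indices and values for a given shape and density; the function name and parameters are invented for illustration and are not `FuzzedSparseTensor`'s actual implementation:

```python
# Generate COO indices/values for a sparse tensor of the given shape,
# with nnz derived from the requested density. Sampling distinct flat
# positions yields no duplicate indices (a coalesced-like layout).
import random

def fuzz_coo(shape, density, seed=0):
    rng = random.Random(seed)
    numel = 1
    for s in shape:
        numel *= s
    nnz = max(1, int(density * numel))
    flat = rng.sample(range(numel), nnz)  # distinct flat positions
    indices = []
    for f in flat:
        idx = []
        for s in reversed(shape):         # unravel flat index into coords
            idx.append(f % s)
            f //= s
        indices.append(tuple(reversed(idx)))
    values = [rng.gauss(0, 1) for _ in range(nnz)]
    return indices, values

idx, vals = fuzz_coo((4, 5), density=0.2)
print(len(vals))  # 0.2 * 20 = 4 non-zeros
```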

Hopefully, this tooling and these examples will help with benchmarking in other PRs. Looking forward to your thoughts and feedback. cc robieta, mruberry, ngimel

Pull Request resolved: #48397

Reviewed By: ejguan

Differential Revision: D26008137

Pulled By: mruberry

fbshipit-source-id: 2f37811c7c3eaa3494a0f2500e519267f2186dfb
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
Summary: duplicate of the PR description above.
Pull Request resolved: pytorch#38338

Differential Revision: D21551048

Pulled By: robieta

fbshipit-source-id: 6c50e5439a04eac98b8a2355ef731852ba0500db