compile time benchmarks for AOTDispatcher (inference/training/subclasses) by bdhirsh · Pull Request #136759 · pytorch/pytorch

bdhirsh · 2024-09-26T15:02:03Z

This adds a few compile-time benchmarks for some disjoint code paths in AOTDispatcher:

(1) inference vs training code paths
(2) "subclasses" vs "no subclasses" codepaths
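
The two axes above form a 2×2 grid, which is where the "4 paths" come from. A minimal sketch of that grid follows. The helper name `benchmark_names` is hypothetical, and the naming scheme is inferred from the benchmark names printed in this PR description, not taken from `aotdispatcher.py` itself:

```python
from itertools import product


def benchmark_names(device="cpu"):
    """Enumerate one benchmark name per (mode, subclass variant) combination.

    Hypothetical helper: the name format mirrors the names shown in the
    PR's local run output and is an assumption about the real benchmark file.
    """
    modes = ["inference", "training"]
    subclass_variants = ["nosubclass", "subclass"]
    return [
        f"aotdispatcher_{mode}_{variant}_{device}"
        for mode, variant in product(modes, subclass_variants)
    ]


for name in benchmark_names():
    print(name)
```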

Also see #136760 for a partitioner benchmark (I'm not sure why ghstack didn't display the stack nicely)

I ran the benchmarks locally and got these numbers for the 4 paths:

collecting compile time instruction count for aotdispatcher_inference_nosubclass_cpu
compile time instruction count for iteration 0 is 11692348671
compile time instruction count for iteration 1 is 3026287204
compile time instruction count for iteration 2 is 3011467318
compile time instruction count for iteration 3 is 3004485935
compile time instruction count for iteration 4 is 3003087410
collecting compile time instruction count for aotdispatcher_training_nosubclass_cpu
compile time instruction count for iteration 0 is 6068003223
compile time instruction count for iteration 1 is 5585418102
compile time instruction count for iteration 2 is 5581856618
compile time instruction count for iteration 3 is 5581651794
compile time instruction count for iteration 4 is 5578742619
collecting compile time instruction count for aotdispatcher_inference_subclass_cpu
compile time instruction count for iteration 0 is 8634984264
compile time instruction count for iteration 1 is 8633467573
compile time instruction count for iteration 2 is 8632182092
compile time instruction count for iteration 3 is 8632056925
compile time instruction count for iteration 4 is 8632543871
collecting compile time instruction count for aotdispatcher_training_subclass_cpu
compile time instruction count for iteration 0 is 14737239311
compile time instruction count for iteration 1 is 14734346427
compile time instruction count for iteration 2 is 14736493730
compile time instruction count for iteration 3 is 14734121272
compile time instruction count for iteration 4 is 14733852882
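
To compare the four paths, the output above can be reduced to one number per benchmark. The sketch below parses the log text and reports the mean of iterations 1 to 4, plus how much larger iteration 0 is. Treating iteration 0 as a warm-up run is an assumption made here for illustration; the PR does not say how the benchmark harness treats it. The function names `parse_counts` and `summarize` are hypothetical.

```python
import re
from statistics import mean

# Log text copied from the local run reported in this PR description.
LOG = """\
collecting compile time instruction count for aotdispatcher_inference_nosubclass_cpu
compile time instruction count for iteration 0 is 11692348671
compile time instruction count for iteration 1 is 3026287204
compile time instruction count for iteration 2 is 3011467318
compile time instruction count for iteration 3 is 3004485935
compile time instruction count for iteration 4 is 3003087410
collecting compile time instruction count for aotdispatcher_training_nosubclass_cpu
compile time instruction count for iteration 0 is 6068003223
compile time instruction count for iteration 1 is 5585418102
compile time instruction count for iteration 2 is 5581856618
compile time instruction count for iteration 3 is 5581651794
compile time instruction count for iteration 4 is 5578742619
collecting compile time instruction count for aotdispatcher_inference_subclass_cpu
compile time instruction count for iteration 0 is 8634984264
compile time instruction count for iteration 1 is 8633467573
compile time instruction count for iteration 2 is 8632182092
compile time instruction count for iteration 3 is 8632056925
compile time instruction count for iteration 4 is 8632543871
collecting compile time instruction count for aotdispatcher_training_subclass_cpu
compile time instruction count for iteration 0 is 14737239311
compile time instruction count for iteration 1 is 14734346427
compile time instruction count for iteration 2 is 14736493730
compile time instruction count for iteration 3 is 14734121272
compile time instruction count for iteration 4 is 14733852882
"""

HEADER = re.compile(r"collecting compile time instruction count for (\S+)")
SAMPLE = re.compile(r"compile time instruction count for iteration \d+ is (\d+)")


def parse_counts(log):
    """Map each benchmark name to its instruction counts, in iteration order."""
    results = {}
    current = None
    for line in log.splitlines():
        header = HEADER.match(line)
        if header:
            current = header.group(1)
            results[current] = []
            continue
        sample = SAMPLE.match(line)
        if sample and current is not None:
            results[current].append(int(sample.group(1)))
    return results


def summarize(counts):
    """Return (mean of iterations 1+, ratio of iteration 0 to that mean)."""
    steady = mean(counts[1:])  # assumption: iteration 0 is a warm-up run
    return steady, counts[0] / steady


for name, counts in parse_counts(LOG).items():
    steady, ratio = summarize(counts)
    print(f"{name}: ~{steady / 1e9:.2f}e9 instructions, iteration 0 is {ratio:.2f}x that")
```

Summarized this way, the inference/no-subclass path shows by far the largest gap between iteration 0 and the later iterations, while both subclass paths are nearly flat across all five iterations.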

Stack from ghstack (oldest at bottom):

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames @rec


pytorch-bot · 2024-09-26T15:02:08Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/136759

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 0148b0b with merge base e23e766:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

inductor-periodic / cuda12.1-py3.10-gcc9-sm80 / test (inductor_torchbench_smoketest_perf, 1, 1, linux.gcp.a100) (gh) (similar failure)
moco

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / linux-docs / build-docs-python-false (gh) (trunk failure)
/opt/conda/envs/py_3.9/lib/python3.9/site-packages/sphinx/util/inspect.py:376: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()

This comment was automatically generated by Dr. CI and updates every 15 minutes.

benchmarks/dynamo/pr_time_benchmarks/benchmarks/aotdispatcher.py

laithsakka · 2024-09-26T16:00:00Z

Looks good to me. I added some comments; it would be good for someone else with more AOTDispatcher knowledge to take a look as well.


compile time benchmark for the min cut partitioner. I'm hoping that this is a reasonable benchmark because:

(1) it consists of a single input + many weights that are used sequentially
(2) contains a mix of recompute vs non-recomputed ops (matmul + sin)
(3) it is relatively simple

from running locally:

```
collecting compile time instruction count for aotdispatcher_partitioner_cpu
compile time instruction count for iteration 0 is 21764219181
compile time instruction count for iteration 1 is 12475020009
compile time instruction count for iteration 2 is 12463710140
compile time instruction count for iteration 3 is 12455676489
compile time instruction count for iteration 4 is 12451344330
```

Pull Request resolved: #136760
Approved by: https://github.com/ezyang
ghstack dependencies: #136670, #136759

…/subclasses) (#136759)"

This reverts commit b17cd26.

Reverted #136759 on behalf of https://github.com/ZainRizvi due to Something in this stack seems to be causing tests to fail on trunk. See functorch/test_control_flow.py::TestControlFlow::test_associative_scan_dim_reverse_True_combine_mode_generic_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/11107079955/job/30872132411) [HUD commit link](https://hud.pytorch.org/pytorch/pytorch/commit/c010c6099bf304bbb681af534b9f3996c33ce582) ([comment](#136670 (comment)))

pytorchmergebot · 2024-10-01T15:24:01Z

@bdhirsh your PR has been successfully reverted.


…ses) (pytorch#136759)

Pull Request resolved: pytorch#136759
Approved by: https://github.com/laithsakka
ghstack dependencies: pytorch#136670




compile time benchmarks for AOTDispatcher (inference/training/subclas…

90efc12

…ses) [ghstack-poisoned]

pytorch-bot bot added ciflow/inductor module: dynamo labels Sep 26, 2024

This was referenced Sep 26, 2024

inductor: use previous guards to know if a size is 1 for broadcasting #136670

Closed

compile time benchmarks for AOTDispatcher (partitioner) #136760

Closed

github-actions bot requested review from SherlockNoMad, albanD, antoniojkim, ezyang and miladm September 26, 2024 15:02

bdhirsh added the topic: not user facing topic category label Sep 26, 2024

laithsakka approved these changes Sep 26, 2024


pytorchmergebot added the Merged label Sep 30, 2024

pytorchmergebot closed this in b17cd26 Sep 30, 2024

pytorchmergebot added the Reverted label Oct 1, 2024

pytorchmergebot reopened this Oct 1, 2024

albanD removed their request for review October 2, 2024 20:20

pytorchmergebot closed this in 48b8f81 Oct 8, 2024

github-actions bot deleted the gh/bdhirsh/615/head branch November 8, 2024 02:05

bdhirsh wants to merge 3 commits into gh/bdhirsh/615/base from gh/bdhirsh/615/head

3 participants