[inductor][cpp] complete vectorization for int32/int64 by jgong5 · Pull Request #122961 · pytorch/pytorch

jgong5 · 2024-03-29T14:31:28Z

Stack from ghstack (oldest at bottom):

Summary
Implement the complete vectorization of index_expr functionally. We also add heuristic from performance perspective to resolve the regressions posted below: #122961 (comment) by disabling vectorization of specific (Fused) scheduler Node:

Heuristic 1: when the num of non-contiguous index_expr/load/store exceeds the threshold, we disable the vectorization.
Heuristic 2: when the total number of elements along the vec dim is less than tiling_factor/2, we disable the vectorization.

cc @voznesenskym @penguinwu @EikanWang @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler @amjames @desertfire @chauhang

[ghstack-poisoned]

pytorch-bot · 2024-03-29T14:31:31Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/122961

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 29501d9 with merge base bce0cab ():

FLAKY - The following job failed but was likely due to flakiness present on trunk:

slow / linux-focal-cuda12.1-py3-gcc9-slow-gradcheck / test (default, 5, 6, amz2023.linux.g5.4xlarge.nvidia.gpu) (gh) (similar failure)
'Test'

This comment was automatically generated by Dr. CI and updates every 15 minutes.

cc voznesenskym penguinwu EikanWang Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang [ghstack-poisoned]

ghstack-source-id: dab9448 Pull Request resolved: #122961

cc voznesenskym penguinwu EikanWang Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang [ghstack-poisoned]

ghstack-source-id: 349ee59 Pull Request resolved: #122961

jgong5 · 2024-04-07T06:33:40Z

According to the fp32 inference benchmarking on the three suites, It improved perf on some of the models but caused regression on others. Need to figure out why.

Workload	single-threaded	multi-threaded
Background_Matting	1.19x	1.26x
hf_BigBird	0.83x	0.9x
hf_Longformer	0.84x	0.84x
hf_T5	1.28x	1.19x
hf_T5_base	1.46x	1.33x
hf_T5_large	1.24x	1.16x
lennard_jones	1.06x	0.99x
pyhpc_isoneutral_mixing	1.0x	0.76x
AllenaiLongformerBase	0.83x	0.87x
T5ForConditionalGeneration	1.38x	1.25x
T5Small	1.38x	1.23x

github-actions · 2024-06-06T06:35:34Z

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

cc voznesenskym penguinwu EikanWang Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang [ghstack-poisoned]

**Summary** Enable the vectorization of `load_seed` and `randn`. For now, `randn` is using the reference implementation. **Test Plan** ``` python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_vec_randn ``` Pull Request resolved: #130317 Approved by: https://github.com/jgong5 ghstack dependencies: #122961

atalman · 2024-08-19T15:27:35Z

@pytorchmergebot revert -c nosignal -m "Breaks slow jobs: inductor/test_cpu_repro.py::CPUReproTests::test__adaptive_avg_pool2d GH job link HUD commit link"

pytorchmergebot · 2024-08-19T15:29:05Z

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot · 2024-08-19T15:29:18Z

@jgong5 your PR has been successfully reverted.

…)" This reverts commit 99b3b58. Reverted #122961 on behalf of https://github.com/atalman due to Breaks slow jobs: inductor/test_cpu_repro.py::CPUReproTests::test__adaptive_avg_pool2d [GH job link](https://github.com/pytorch/pytorch/actions/runs/10432403692/job/28893704833) [HUD commit link](https://hud.pytorch.org/pytorch/pytorch/commit/a0ef8888e60d934ae7e4ddaec1c1274b12d0d39d) ([comment](#122961 (comment)))

leslie-fang-intel · 2024-08-19T23:56:41Z

Adding ciflow: slow to reproduce the failure

**Summary** Implement the complete vectorization of `index_expr` functionally. We also add heuristic from performance perspective to resolve the regressions posted below: #122961 (comment) by disabling vectorization of specific (Fused) scheduler Node: - Heuristic 1: when the num of non-contiguous `index_expr/load/store` exceeds the threshold, we disable the vectorization. - Heuristic 2: when the total number of elements along the vec dim is less than `tiling_factor/2`, we disable the vectorization. cc voznesenskym penguinwu EikanWang Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang [ghstack-poisoned]

… int32/int64" **Summary** Implement the complete vectorization of `index_expr` functionally. We also add heuristic from performance perspective to resolve the regressions posted below: #122961 (comment) by disabling vectorization of specific (Fused) scheduler Node: - Heuristic 1: when the num of non-contiguous `index_expr/load/store` exceeds the threshold, we disable the vectorization. - Heuristic 2: when the total number of elements along the vec dim is less than `tiling_factor/2`, we disable the vectorization. cc voznesenskym penguinwu EikanWang Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang [ghstack-poisoned]

leslie-fang-intel · 2024-08-20T00:32:01Z

Hi @atalman, thanks for the informing. I think I have fixed the failure of inductor/test_cpu_repro.py::CPUReproTests::test__adaptive_avg_pool2d and adding label slow for all the slow jobs of this PR. Let's see what's ci think about now. Thanks!

**Summary** Implement the complete vectorization of `index_expr` functionally. We also add heuristic from performance perspective to resolve the regressions posted below: #122961 (comment) by disabling vectorization of specific (Fused) scheduler Node: - Heuristic 1: when the num of non-contiguous `index_expr/load/store` exceeds the threshold, we disable the vectorization. - Heuristic 2: when the total number of elements along the vec dim is less than `tiling_factor/2`, we disable the vectorization. cc voznesenskym penguinwu EikanWang Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang [ghstack-poisoned]

… int32/int64" **Summary** Implement the complete vectorization of `index_expr` functionally. We also add heuristic from performance perspective to resolve the regressions posted below: #122961 (comment) by disabling vectorization of specific (Fused) scheduler Node: - Heuristic 1: when the num of non-contiguous `index_expr/load/store` exceeds the threshold, we disable the vectorization. - Heuristic 2: when the total number of elements along the vec dim is less than `tiling_factor/2`, we disable the vectorization. cc voznesenskym penguinwu EikanWang Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang [ghstack-poisoned]

leslie-fang-intel · 2024-08-21T13:03:28Z

@pytorchmergebot revert -c nosignal -m "Breaks slow jobs: inductor/test_cpu_repro.py::CPUReproTests::test__adaptive_avg_pool2d GH job link HUD commit link"

Hi @atalman, the failure posted previously passes now and the only failure marked as unrelated in Hud. I am going to re-land these 2 PRs. Thanks.

leslie-fang-intel · 2024-08-21T13:05:14Z

@pytorchbot merge

pytorchmergebot · 2024-08-21T13:07:06Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status
here

**Summary** Enable the vectorization of `load_seed` and `randn`. For now, `randn` is using the reference implementation. **Test Plan** ``` python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_vec_randn ``` Pull Request resolved: #130317 Approved by: https://github.com/jgong5 ghstack dependencies: #122961

masnesral · 2024-08-28T01:51:02Z

@jgong5 @leslie-fang-intel I've got an internal report of a failure that I've been able to repro. The backtrace looks like:

  Traceback (most recent call last):
    File "#link-tree/torch/_inductor/compile_fx.py", line 570, in codegen_and_compile
      compiled_graph = fx_codegen_and_compile(gm, example_inputs, **fx_kwargs)
    File "#link-tree/torch/_inductor/compile_fx.py", line 878, in fx_codegen_and_compile
      compiled_fn = graph.compile_to_fn()
    File "#link-tree/torch/_inductor/graph.py", line 1883, in compile_to_fn
      return self.compile_to_module().call
    File "#link-tree/torch/_inductor/graph.py", line 1809, in compile_to_module
      return self._compile_to_module()
    File "#link-tree/torch/_inductor/graph.py", line 1815, in _compile_to_module
      self.codegen_with_cpp_wrapper() if self.cpp_wrapper else self.codegen()
    File "#link-tree/torch/_inductor/graph.py", line 1754, in codegen
      self.scheduler.codegen()
    File "#link-tree/torch/_inductor/scheduler.py", line 3149, in codegen
      return self._codegen()
    File "#link-tree/torch/_inductor/scheduler.py", line 3207, in _codegen
      self.get_backend(device).codegen_node(node)
    File "#link-tree/torch/_inductor/codegen/cpp.py", line 4389, in codegen_node
      cpp_kernel_proxy.codegen_nodes(nodes)
    File "#link-tree/torch/_inductor/codegen/cpp.py", line 3880, in codegen_nodes
      self.codegen_functions(fn_list, var_sizes_list)
    File "#link-tree/torch/_inductor/codegen/cpp.py", line 3720, in codegen_functions
      tiling_factors, tiling_indices = tiling_select.select_tiling(
    File "#link-tree/torch/_inductor/codegen/cpp.py", line 3358, in select_tiling
      ].args[0]
  AttributeError: 'str' object has no attribute 'args'

Is this stack trace enough to allow you to debug without a full repro (the internal test is rather large and complicated). More importantly, do you have any objection to reverting while we investigate? I can help vet any fixes on the internal workload.

leslie-fang-intel · 2024-08-28T01:57:13Z

Hi @masnesral, could you share the commit ID you used? There are several PR landed in cpp.py recently. So, I am not sure cpp.py:3358 exactly points to which part of code.

leslie-fang-intel · 2024-08-28T01:58:33Z

More importantly, do you have any objection to reverting while we investigate? I can help vet any fixes on the internal workload.

This PR is important to us for PyTorch 2.5, may we try to open a hotfix with new PR at first?

masnesral · 2024-08-28T02:03:04Z

+                                )[
+                                    _node.args[
+                                        1 if _node.target == "index_expr" else 2
+                                    ].args[0]


@leslie-fang-intel this is the line that raises on the internal workload

Thanks, will try to submit a hotfix soon. Still not sure yet, how could the _node.args[1 if _node.target == "index_expr" else 2] be a str, but I guess we can bypass this special case.

I'll keep debugging on my side; unfortunately the failing test takes a while to run :/

Thanks, here we just want to get the name of index expression. So for all of the cases we see in the 3 test suits (TorchBench, Timms, Huggingface) we expect, take index_expr as example:

_node is a FX Node with target index_expr

_node.args[1 if _node.target == "index_expr" else 2] is another FX node with target get_index

_node.args[1 if _node.target == "index_expr" else 2].args[0] is a str for the name of this index expression

For your case, it seems _node.args[1 if _node.target == "index_expr" else 2] is a str instead of the FX node.

@leslie-fang-intel Sorry for the delay. I was trying to debug last night and I'm seeing some extremely strange behavior I can't explain. I thought I might be hallucinating and just needed sleep. When I poke around in the debugger, the element in question is not a string. Amazingly, when I change the code to the following, there is no error and the test runs successfully:

iii = 1 if _node.target == "index_expr" else 2 ... _node.args[ iii ].args[0]

What?!

masnesral · 2024-08-28T02:04:16Z

Hi @masnesral, could you share the commit ID you used? There are several PR landed in cpp.py recently. So, I am not sure cpp.py:3358 exactly points to which part of code.

Yeah, it's a little tricky because we mirror OSS commits to an internal repo asynchronously. But I was able to validate that backing out this PR removes the problem. I'll comment on the line that's raising the exception.

masnesral · 2024-08-28T02:06:49Z

This PR is important to us for PyTorch 2.5, may we try to open a hotfix with new PR at first?

I honestly don't know how difficult that is or how close we are to the deadline. Maybe @atalman can help out here (who's helped me recently with some other, unrelated hotfixes)

[inductor][cpp] complete vectorization for int32/int64

501ebf7

[ghstack-poisoned]

jgong5 mentioned this pull request Mar 29, 2024

[inductor][cpp] unified the vectorized conversion with at::vec::convert for all data types #119979

Closed

pytorch-bot Bot added ciflow/inductor module: inductor labels Mar 29, 2024

This was referenced Mar 29, 2024

[vec] test at::vec::convert in vec_test_all_types #122869

Closed

[vec] test VecMask in vec_test_all_types #122878

Closed

jgong5 marked this pull request as draft March 29, 2024 14:36

pytorchbot added the open source label Mar 29, 2024

jgong5 pushed a commit that referenced this pull request Mar 29, 2024

[inductor][cpp] complete vectorization for int32/int64

ef09a06

ghstack-source-id: dab9448 Pull Request resolved: #122961

jgong5 pushed a commit that referenced this pull request Mar 30, 2024

[inductor][cpp] complete vectorization for int32/int64

ec58b99

ghstack-source-id: 349ee59 Pull Request resolved: #122961

Valentine233 mentioned this pull request Apr 10, 2024

[cpu] Modify inductor opt flag --- ftree-loop-vectorize #121782

Closed

github-actions Bot added the Stale label Jun 6, 2024

jgong5 removed the Stale label Jun 7, 2024

This was referenced Jul 10, 2024

[Inductor][CPP] Support vectorization of remainder #129849

Closed

[Inductor][CPP] Support vectorization of load_seed and randn #130317

Closed

leslie-fang-intel added ciflow/trunk Trigger trunk jobs on your pull request topic: not user facing topic category labels Jul 10, 2024

leslie-fang-intel added 3 commits July 10, 2024 03:38

leslie-fang-intel marked this pull request as ready for review July 12, 2024 02:47

pytorchmergebot added the Reverted label Aug 19, 2024

pytorchmergebot reopened this Aug 19, 2024

leslie-fang-intel added the ciflow/slow label Aug 19, 2024

pytorchmergebot added the merging label Aug 21, 2024

pytorchmergebot closed this in 313bc11 Aug 21, 2024

pytorchmergebot removed the merging label Aug 21, 2024

masnesral reviewed Aug 28, 2024

View reviewed changes

leslie-fang-intel mentioned this pull request Aug 28, 2024

[Inductor][CPP] Fix Index name error #134645

Closed

Conversation

jgong5 commented Mar 29, 2024 • edited by leslie-fang-intel Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pytorch-bot Bot commented Mar 29, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/122961

✅ You can merge normally! (1 Unrelated Failure)

Uh oh!

jgong5 commented Apr 7, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented Jun 6, 2024

Uh oh!

atalman commented Aug 19, 2024

Uh oh!

pytorchmergebot commented Aug 19, 2024

Uh oh!

pytorchmergebot commented Aug 19, 2024

Uh oh!

leslie-fang-intel commented Aug 19, 2024

Uh oh!

leslie-fang-intel commented Aug 20, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

leslie-fang-intel commented Aug 21, 2024

Uh oh!

leslie-fang-intel commented Aug 21, 2024

Uh oh!

pytorchmergebot commented Aug 21, 2024

Merge started

Uh oh!

masnesral commented Aug 28, 2024

Uh oh!

leslie-fang-intel commented Aug 28, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

leslie-fang-intel commented Aug 28, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

masnesral Aug 28, 2024

Choose a reason for hiding this comment

Uh oh!

leslie-fang-intel Aug 28, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

masnesral Aug 28, 2024

Choose a reason for hiding this comment

Uh oh!

leslie-fang-intel Aug 28, 2024

Choose a reason for hiding this comment

Uh oh!

masnesral Aug 28, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

masnesral commented Aug 28, 2024

Uh oh!

masnesral commented Aug 28, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

jgong5 commented Mar 29, 2024 •

edited by leslie-fang-intel

Loading

pytorch-bot Bot commented Mar 29, 2024 •

edited

Loading

jgong5 commented Apr 7, 2024 •

edited

Loading

leslie-fang-intel commented Aug 20, 2024 •

edited

Loading

leslie-fang-intel commented Aug 28, 2024 •

edited

Loading

leslie-fang-intel commented Aug 28, 2024 •

edited

Loading

leslie-fang-intel Aug 28, 2024 •

edited

Loading

masnesral Aug 28, 2024 •

edited

Loading