Skip to content

[inductor][cpp] complete vectorization for int32/int64#122961

Closed
jgong5 wants to merge 40 commits intogh/jgong5/33/basefrom
gh/jgong5/33/head
Closed

[inductor][cpp] complete vectorization for int32/int64#122961
jgong5 wants to merge 40 commits intogh/jgong5/33/basefrom
gh/jgong5/33/head

Conversation

@jgong5
Copy link
Copy Markdown
Collaborator

@jgong5 jgong5 commented Mar 29, 2024

Stack from ghstack (oldest at bottom):

Summary
Implement the complete vectorization of index_expr functionally. We also add heuristic from performance perspective to resolve the regressions posted below: #122961 (comment) by disabling vectorization of specific (Fused) scheduler Node:

  • Heuristic 1: when the num of non-contiguous index_expr/load/store exceeds the threshold, we disable the vectorization.
  • Heuristic 2: when the total number of elements along the vec dim is less than tiling_factor/2, we disable the vectorization.

cc @voznesenskym @penguinwu @EikanWang @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov @ColinPeppler @amjames @desertfire @chauhang

@pytorch-bot
Copy link
Copy Markdown

pytorch-bot Bot commented Mar 29, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/122961

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 29501d9 with merge base bce0cab (image):

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

cc voznesenskym penguinwu EikanWang Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
jgong5 pushed a commit that referenced this pull request Mar 29, 2024
ghstack-source-id: dab9448
Pull Request resolved: #122961
cc voznesenskym penguinwu EikanWang Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
cc voznesenskym penguinwu EikanWang Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
jgong5 pushed a commit that referenced this pull request Mar 30, 2024
ghstack-source-id: 349ee59
Pull Request resolved: #122961
@jgong5
Copy link
Copy Markdown
Collaborator Author

jgong5 commented Apr 7, 2024

According to the fp32 inference benchmarking on the three suites, It improved perf on some of the models but caused regression on others. Need to figure out why.

Workload single-threaded multi-threaded
Background_Matting 1.19x 1.26x
hf_BigBird 0.83x 0.9x
hf_Longformer 0.84x 0.84x
hf_T5 1.28x 1.19x
hf_T5_base 1.46x 1.33x
hf_T5_large 1.24x 1.16x
lennard_jones 1.06x 0.99x
pyhpc_isoneutral_mixing 1.0x 0.76x
AllenaiLongformerBase 0.83x 0.87x
T5ForConditionalGeneration 1.38x 1.25x
T5Small 1.38x 1.23x

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 6, 2024

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions Bot added the Stale label Jun 6, 2024
@jgong5 jgong5 removed the Stale label Jun 7, 2024
cc voznesenskym penguinwu EikanWang Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
cc voznesenskym penguinwu EikanWang Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
cc voznesenskym penguinwu EikanWang Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
cc voznesenskym penguinwu EikanWang Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
@leslie-fang-intel leslie-fang-intel marked this pull request as ready for review July 12, 2024 02:47
pytorchmergebot pushed a commit that referenced this pull request Aug 17, 2024
**Summary**
Enable the vectorization of `load_seed` and `randn`. For now, `randn` is using the reference implementation.

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_vec_randn
```

Pull Request resolved: #130317
Approved by: https://github.com/jgong5
ghstack dependencies: #122961
@atalman
Copy link
Copy Markdown
Contributor

atalman commented Aug 19, 2024

@pytorchmergebot revert -c nosignal -m "Breaks slow jobs: inductor/test_cpu_repro.py::CPUReproTests::test__adaptive_avg_pool2d GH job link HUD commit link"

@pytorchmergebot
Copy link
Copy Markdown
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot
Copy link
Copy Markdown
Collaborator

@jgong5 your PR has been successfully reverted.

pytorchmergebot added a commit that referenced this pull request Aug 19, 2024
…)"

This reverts commit 99b3b58.

Reverted #122961 on behalf of https://github.com/atalman due to Breaks slow jobs: inductor/test_cpu_repro.py::CPUReproTests::test__adaptive_avg_pool2d [GH job link](https://github.com/pytorch/pytorch/actions/runs/10432403692/job/28893704833) [HUD commit link](https://hud.pytorch.org/pytorch/pytorch/commit/a0ef8888e60d934ae7e4ddaec1c1274b12d0d39d) ([comment](#122961 (comment)))
@leslie-fang-intel
Copy link
Copy Markdown
Collaborator

Adding ciflow: slow to reproduce the failure

**Summary**
Implement the complete vectorization of `index_expr` functionally. We also add heuristic from performance perspective to resolve the regressions posted below: #122961 (comment) by disabling vectorization of specific (Fused) scheduler Node:

- Heuristic 1: when the num of non-contiguous `index_expr/load/store` exceeds the threshold, we disable the vectorization.
- Heuristic 2: when the total number of elements along the vec dim is less than `tiling_factor/2`, we disable the vectorization.

cc voznesenskym penguinwu EikanWang Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
leslie-fang-intel added a commit that referenced this pull request Aug 20, 2024
… int32/int64"


**Summary**
Implement the complete vectorization of `index_expr` functionally. We also add heuristic from performance perspective to resolve the regressions posted below: #122961 (comment) by disabling vectorization of specific (Fused) scheduler Node:

- Heuristic 1: when the num of non-contiguous `index_expr/load/store` exceeds the threshold, we disable the vectorization.
- Heuristic 2: when the total number of elements along the vec dim is less than `tiling_factor/2`, we disable the vectorization.

cc voznesenskym penguinwu EikanWang Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
@leslie-fang-intel
Copy link
Copy Markdown
Collaborator

leslie-fang-intel commented Aug 20, 2024

Hi @atalman, thanks for the informing. I think I have fixed the failure of inductor/test_cpu_repro.py::CPUReproTests::test__adaptive_avg_pool2d and adding label slow for all the slow jobs of this PR. Let's see what's ci think about now. Thanks!

**Summary**
Implement the complete vectorization of `index_expr` functionally. We also add heuristic from performance perspective to resolve the regressions posted below: #122961 (comment) by disabling vectorization of specific (Fused) scheduler Node:

- Heuristic 1: when the num of non-contiguous `index_expr/load/store` exceeds the threshold, we disable the vectorization.
- Heuristic 2: when the total number of elements along the vec dim is less than `tiling_factor/2`, we disable the vectorization.

cc voznesenskym penguinwu EikanWang Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
leslie-fang-intel added a commit that referenced this pull request Aug 20, 2024
… int32/int64"


**Summary**
Implement the complete vectorization of `index_expr` functionally. We also add heuristic from performance perspective to resolve the regressions posted below: #122961 (comment) by disabling vectorization of specific (Fused) scheduler Node:

- Heuristic 1: when the num of non-contiguous `index_expr/load/store` exceeds the threshold, we disable the vectorization.
- Heuristic 2: when the total number of elements along the vec dim is less than `tiling_factor/2`, we disable the vectorization.

cc voznesenskym penguinwu EikanWang Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx peterbell10 ipiszy yf225 chenyang78 kadeng muchulee8 aakhundov ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
@leslie-fang-intel
Copy link
Copy Markdown
Collaborator

@pytorchmergebot revert -c nosignal -m "Breaks slow jobs: inductor/test_cpu_repro.py::CPUReproTests::test__adaptive_avg_pool2d GH job link HUD commit link"

Hi @atalman, the failure posted previously passes now and the only failure marked as unrelated in Hud. I am going to re-land these 2 PRs. Thanks.

@leslie-fang-intel
Copy link
Copy Markdown
Collaborator

@pytorchbot merge

@pytorchmergebot
Copy link
Copy Markdown
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

pytorchmergebot pushed a commit that referenced this pull request Aug 21, 2024
**Summary**
Enable the vectorization of `load_seed` and `randn`. For now, `randn` is using the reference implementation.

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_vec_randn
```

Pull Request resolved: #130317
Approved by: https://github.com/jgong5
ghstack dependencies: #122961
@masnesral
Copy link
Copy Markdown
Contributor

@jgong5 @leslie-fang-intel I've got an internal report of a failure that I've been able to repro. The backtrace looks like:

  Traceback (most recent call last):
    File "#link-tree/torch/_inductor/compile_fx.py", line 570, in codegen_and_compile
      compiled_graph = fx_codegen_and_compile(gm, example_inputs, **fx_kwargs)
    File "#link-tree/torch/_inductor/compile_fx.py", line 878, in fx_codegen_and_compile
      compiled_fn = graph.compile_to_fn()
    File "#link-tree/torch/_inductor/graph.py", line 1883, in compile_to_fn
      return self.compile_to_module().call
    File "#link-tree/torch/_inductor/graph.py", line 1809, in compile_to_module
      return self._compile_to_module()
    File "#link-tree/torch/_inductor/graph.py", line 1815, in _compile_to_module
      self.codegen_with_cpp_wrapper() if self.cpp_wrapper else self.codegen()
    File "#link-tree/torch/_inductor/graph.py", line 1754, in codegen
      self.scheduler.codegen()
    File "#link-tree/torch/_inductor/scheduler.py", line 3149, in codegen
      return self._codegen()
    File "#link-tree/torch/_inductor/scheduler.py", line 3207, in _codegen
      self.get_backend(device).codegen_node(node)
    File "#link-tree/torch/_inductor/codegen/cpp.py", line 4389, in codegen_node
      cpp_kernel_proxy.codegen_nodes(nodes)
    File "#link-tree/torch/_inductor/codegen/cpp.py", line 3880, in codegen_nodes
      self.codegen_functions(fn_list, var_sizes_list)
    File "#link-tree/torch/_inductor/codegen/cpp.py", line 3720, in codegen_functions
      tiling_factors, tiling_indices = tiling_select.select_tiling(
    File "#link-tree/torch/_inductor/codegen/cpp.py", line 3358, in select_tiling
      ].args[0]
  AttributeError: 'str' object has no attribute 'args'

Is this stack trace enough to allow you to debug without a full repro (the internal test is rather large and complicated). More importantly, do you have any objection to reverting while we investigate? I can help vet any fixes on the internal workload.

@leslie-fang-intel
Copy link
Copy Markdown
Collaborator

leslie-fang-intel commented Aug 28, 2024

Hi @masnesral, could you share the commit ID you used? There are several PR landed in cpp.py recently. So, I am not sure cpp.py:3358 exactly points to which part of code.

@leslie-fang-intel
Copy link
Copy Markdown
Collaborator

leslie-fang-intel commented Aug 28, 2024

More importantly, do you have any objection to reverting while we investigate? I can help vet any fixes on the internal workload.

This PR is important to us for PyTorch 2.5, may we try to open a hotfix with new PR at first?

)[
_node.args[
1 if _node.target == "index_expr" else 2
].args[0]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@leslie-fang-intel this is the line that raises on the internal workload

Copy link
Copy Markdown
Collaborator

@leslie-fang-intel leslie-fang-intel Aug 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, will try to submit a hotfix soon. Still not sure yet, how could the _node.args[1 if _node.target == "index_expr" else 2] be a str, but I guess we can bypass this special case.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll keep debugging on my side; unfortunately the failing test takes a while to run :/

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, here we just want to get the name of index expression. So for all of the cases we see in the 3 test suits (TorchBench, Timms, Huggingface) we expect, take index_expr as example:

  • _node is a FX Node with target index_expr
  • _node.args[1 if _node.target == "index_expr" else 2] is another FX node with target get_index
  • _node.args[1 if _node.target == "index_expr" else 2].args[0] is a str for the name of this index expression

For your case, it seems _node.args[1 if _node.target == "index_expr" else 2] is a str instead of the FX node.

Copy link
Copy Markdown
Contributor

@masnesral masnesral Aug 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@leslie-fang-intel Sorry for the delay. I was trying to debug last night and I'm seeing some extremely strange behavior I can't explain. I thought I might be hallucinating and just needed sleep. When I poke around in the debugger, the element in question is not a string. Amazingly, when I change the code to the following, there is no error and the test runs successfully:

                            iii = 1 if _node.target == "index_expr" else 2

                           ...

                            _node.args[
                                iii
                            ].args[0]

What?!

@masnesral
Copy link
Copy Markdown
Contributor

Hi @masnesral, could you share the commit ID you used? There are several PR landed in cpp.py recently. So, I am not sure cpp.py:3358 exactly points to which part of code.

Yeah, it's a little tricky because we mirror OSS commits to an internal repo asynchronously. But I was able to validate that backing out this PR removes the problem. I'll comment on the line that's raising the exception.

@masnesral
Copy link
Copy Markdown
Contributor

This PR is important to us for PyTorch 2.5, may we try to open a hotfix with new PR at first?

I honestly don't know how difficult that is or how close we are to the deadline. Maybe @atalman can help out here (who's helped me recently with some other, unrelated hotfixes)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants