[Inductor] support masked vectorization for the tail_loop for dynamic shapes #131745
jiayisunx wants to merge 27 commits into gh/jiayisunx/16/base
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/131745
✅ No Failures as of commit fcfebe5 with merge base 1754850. This comment was automatically generated by Dr. CI and updates every 15 minutes.
```diff
-template <typename T, int M, int N,
-          typename std::enable_if_t<std::is_same<T, BFloat16>::value && ((M < 32 && M != 16) || (N < 32 && N != 16)), int> = 0>
-inline void transpose_mxn(const BFloat16* src, int64_t ld_src, BFloat16* dst, int64_t ld_dst) {
+inline void transpose_mxn<BFloat16>(const BFloat16* src, int64_t ld_src, BFloat16* dst, int64_t ld_dst, int M, int N) {
```
I'm concerned about the perf impact. Originally, M and N were template args and hence compile-time constants.
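To illustrate the concern, a minimal standalone sketch (the `transpose_ct`/`transpose_rt` names and the scalar body are made up for illustration, not the actual ATen vec code): with `M` and `N` as template parameters the trip counts are compile-time constants the optimizer can fully unroll, whereas runtime arguments leave them dynamic unless the compiler can prove them at the call site.

```c++
#include <cstdint>

// Compile-time dims: the trip counts are constants, so the compiler
// can fully unroll both loops and keep the tile in registers.
template <int M, int N>
void transpose_ct(const float* src, int64_t ld_src, float* dst, int64_t ld_dst) {
  for (int i = 0; i < M; i++)
    for (int j = 0; j < N; j++)
      dst[j * ld_dst + i] = src[i * ld_src + j];
}

// Runtime dims: same logic, but the loop bounds are opaque to the
// optimizer unless it can prove their values at the (inlined) call site.
void transpose_rt(const float* src, int64_t ld_src, float* dst, int64_t ld_dst, int M, int N) {
  for (int i = 0; i < M; i++)
    for (int j = 0; j < N; j++)
      dst[j * ld_dst + i] = src[i * ld_src + j];
}
```

Whether this costs anything in practice depends on whether the call sites are inlined with constant arguments, which is what the benchmarking below is meant to confirm.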
I will check it with TAS.
No performance regression in TAS. We also compared the assembly code before and after this PR; it doesn't appear to have any performance impact.
Thanks! So perhaps we don't have to keep the version with M and N as template args.
leslie-fang-intel left a comment
LGTM, please kindly address Jiong's comment.
A bug was introduced by the rebase; I have fixed it. Please help review this PR again, thanks!
```python
steps_str = (
    f"{self.var}+=({cexpr_index(self.steps)} == 0 ? "
    f"1 : {cexpr_index(self.steps)})"
)
```
Can you add a comment on why we need this trick here?
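For context, a sketch of what the guard protects against (the variable names and values are illustrative, not from the generated kernel): with dynamic shapes the step expression can simplify to 0 at runtime, and a zero increment would make any non-empty C++ loop spin forever. My reading — an assumption, not something stated in this thread — is that the loop range is empty whenever the step is 0, so clamping the increment to 1 keeps the generated code well-formed without changing the iteration count.

```c++
#include <cstdint>

int main() {
  // Illustrative values: a dynamic-shape step expression that
  // simplified to 0, together with an empty tail range (assumed).
  int64_t n = 0;
  int64_t step = 0;

  // Unguarded form: "x += 0" never advances, so a non-empty range
  // would loop forever. The emitted guard clamps the increment to 1.
  for (int64_t x = 0; x < n; x += (step == 0 ? 1 : step)) {
    // ... tail-loop body ...
  }
  return 0;
}
```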
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Recent PR #131745 brought new VLA (variable-length array) logic into the cpp codegen, which raises a build failure on MSVC with `Compiler Error C2131`: https://learn.microsoft.com/en-us/cpp/error-messages/compiler-errors-1/compiler-error-c2131?view=msvc-170

Reproduce UT:
```cmd
pytest test\inductor\test_torchinductor_dynamic_shapes.py -v -k test_large_block_sizes_dynamic_shapes_cpu
```

Originally generated code:
```c++
alignas(16) float tmp1[static_cast<int64_t>(((-256LL)*(c10::div_floor_integer(static_cast<int64_t>(ks1), static_cast<int64_t>(16LL)))) + (16LL*ks1))];
```

Change: allocate a large-enough fixed-size buffer instead. The dynamic size above simplifies to `16*(ks1 % 16)`, which is at most `16*15`, so a fixed `16*16` buffer always suffices. Newly generated code:
```c++
alignas(16) float tmp1[16*16];
```

Pull Request resolved: #135307
Approved by: https://github.com/jgong5, https://github.com/jansel
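To reproduce the failure outside of Inductor, a minimal standalone sketch (the function name and sizes are illustrative): MSVC has no VLA extension, so any automatic array whose size depends on a runtime value trips C2131, while GCC and Clang accept it as a non-standard extension — which is why the break only surfaced on the MSVC build.

```c++
#include <cstdint>

void kernel_sketch(int64_t ks1) {
  // MSVC: error C2131 ("expression did not evaluate to a constant");
  // GCC/Clang compile it as a non-standard VLA extension.
  // float vla[16 * ks1];  // uncomment to reproduce C2131 on MSVC

  // The fix: a compile-time upper bound on the tail-tile size.
  alignas(16) float tmp1[16 * 16];
  (void)tmp1;
  (void)ks1;
}
```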
[Inductor] support masked vectorization for the tail_loop for dynamic shapes (pytorch#131745)
Pull Request resolved: pytorch#131745
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
Stack from ghstack (oldest at bottom):
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang