[inductor][cpu] Fix double-offset issue in GEMM_TEMPLATE#159233
Phoslight wants to merge 3 commits into pytorch:main
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/159233

✅ No Failures as of commit eeb09c0 with merge base f636736.
@pytorchbot label "topic: not user facing"
@CaoE could you help take a look at this fix?
hi @Phoslight, thanks for the fix. I'd like to understand more about why the offset causes an out-of-bounds access. AFAIK, the offset should come from the original view node.
Thank you for your reply, Leslie. Without this patch, the cpp template and the example inputs both add an offset, which causes the double-offset issue. Hope the chart above helps clarify the issue.
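The double-offset described above can be sketched with a plain-Python stand-in (illustrative names only, not the actual inductor code or the real generated kernel):

```python
# A list stands in for the backing storage of tensor W.
storage = list(range(20))
offset = 8                   # W's storage offset, in elements

# The "example input" pointer already includes the offset,
# like W.data_ptr() == storage_base + storage_offset:
data_ptr = storage[offset:]

def emitted_kernel_buggy(ptr, off, i):
    # Buggy emitted code indexes ptr[off + i] even though ptr already
    # includes the offset, so it effectively reads base + 2*off + i.
    return ptr[off + i]

def emitted_kernel_fixed(ptr, i):
    # Fixed emitted code drops the offset from the index expression.
    return ptr[i]

assert emitted_kernel_fixed(data_ptr, 0) == storage[offset]  # correct element
try:
    emitted_kernel_buggy(data_ptr, offset, 4)  # reads past the storage
except IndexError:
    print("out of bounds, as in the reported crash")
```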
Gentle ping. Any other reviews? Thanks in advance.
swolchok left a comment:
not familiar with this code, but approving workflows to run
```
# GEMM_TEMPLATE emits code like:
#   W.data_ptr[offset + ...]
# but the data_ptr already includes the offset.
```
This makes it sound like the correct fix is to remove the offset from the index calculation in the emitted code rather than copy. I assume that would break something else?
Sorry for the late reply, and thank you @swolchok for approving the workflows and for the review. My earlier code comment caused some confusion, so I’ve updated it — this fix actually "removes the offset from the index calculation in the emitted code". Hopefully it looks fine this time.
For the workflow errors:
- linux-jammy-py3_9-clang9-xla: likely due to a network issue.
- linux-jammy-cuda12.8-py3.10-gcc11-test: appears flaky; other PRs have failed here as well — please see: https://github.com/pytorch/pytorch/issues?q=state%3Aopen%20test_inductor_all_reduce_coalesced
- Lint: fixed in the updated commit (should have used the get_layout() API).
Also refactored the code to keep the main logic free of bulky comments.
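The two approaches discussed here (dropping the offset from the emitted index expression versus materializing an offset-free copy of the weight, which is roughly what `block_weight()` does in the blocked-weights branch) can be sketched in plain Python. These are illustrative stand-ins, not the actual inductor code:

```python
# Plain-Python stand-in for a tensor view: (backing storage, storage offset).
storage = list(range(100, 120))
view = (storage, 8)   # a view starting 8 elements into the storage

def drop_storage_offset(t, length):
    """Copy the viewed region into fresh storage so the result has
    storage_offset == 0; emitted code can then index it from 0 safely."""
    backing, off = t
    return (backing[off:off + length], 0)

clean = drop_storage_offset(view, 6)
assert clean == ([108, 109, 110, 111, 112, 113], 0)
```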
swolchok left a comment:
looks good to me, thanks
Thank you for the reviews @leslie-fang-intel @swolchok.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Thank you @swolchok for the help!
…59233)

Fixes pytorch#158076

Basically, the gemm template generates code like

```cpp
cpp_CppMicroGemmRef_micro_gemm<static_cast<bool>(false), static_cast<bool>(false)>(
    &(X[static_cast<int64_t>(k_start + 196LL*m_start + 38416LL*ks_b_index)]),
    &(W[static_cast<int64_t>(200704000LL + n_start + 80LL*k_start + 15680LL*ks_b_index)]),
    &(local_acc_buf[static_cast<int64_t>(Nr*nci + ((-1LL)*Nr*nc))]),
    static_cast<int64_t>(m_end + ((-1LL)*m_start)),
    static_cast<int64_t>(Nr),
    static_cast<int64_t>(k_end + ((-1LL)*k_start)),
    static_cast<int64_t>(196LL),
    static_cast<int64_t>(80LL),
    static_cast<int64_t>(Nc_blocks*Nr)
);
```

However, when the input tensor `W` has a storage offset, this results in a double-offset issue. That is, the resulting pointer is `2 * 200704000LL` away from `W.storage().data_ptr()`, which causes an out-of-bounds access.

The storage offset of `W` is introduced by [this patch](https://github.com/pytorch/pytorch/pull/136421/files), which I think is a reasonable change, so `cpp_gemm_template.py` should handle input matrices with storage offsets properly.

I think a good way to fix this issue is to create a new matrix that has no storage offset. When `should_block_weights` is true, `block_weight()` creates a clean new matrix, so that branch is not affected by this issue.

BTW, I've also examined the FX IRs generated by `torch.compile()`, as well as the generated Python module, and they are correct.

The newly added test in `test_cpu_select_algorithm.py` reproduces the issue; with this patch, the crash is fixed. It also resolves the crash reported in pytorch#158076.

I ran the CPU tests in `test_cpu_select_algorithm.py`, but many of them are skipped due to MKL and AMX requirements. I'd appreciate it if someone could help verify the test.

Pull Request resolved: pytorch#159233
Approved by: https://github.com/leslie-fang-intel, https://github.com/swolchok
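The address arithmetic behind the crash can be checked with the constant from the generated kernel above (no real tensor memory is touched; the zero `index` is an illustrative choice):

```python
storage_offset = 200704000  # W's storage offset in elements, from the kernel
index = 0                   # n_start + 80*k_start + 15680*ks_b_index, say

# W.data_ptr() already equals storage_base + storage_offset:
data_ptr_elem = storage_offset

# Buggy emitted code adds the offset again inside the index expression:
buggy_addr = data_ptr_elem + storage_offset + index
# Fixed emitted code indexes relative to data_ptr() alone:
fixed_addr = data_ptr_elem + index

assert buggy_addr == 2 * 200704000  # 2x offset: out of bounds
assert fixed_addr == 200704000      # correct address
```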
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben