
[inductor][cpu] Fix double-offset issue in GEMM_TEMPLATE#159233

Closed
Phoslight wants to merge 3 commits into pytorch:main from Phoslight:fix

Conversation

@Phoslight
Contributor

@Phoslight Phoslight commented Jul 27, 2025

Fixes #158076

Basically, the gemm template generates code like

```
cpp_CppMicroGemmRef_micro_gemm<static_cast<bool>(false), static_cast<bool>(false)>(
            &(X[static_cast<int64_t>(k_start + 196LL*m_start + 38416LL*ks_b_index)]),
            &(W[static_cast<int64_t>(200704000LL + n_start + 80LL*k_start + 15680LL*ks_b_index)]),
            &(local_acc_buf[static_cast<int64_t>(Nr*nci + ((-1LL)*Nr*nc))]),
            static_cast<int64_t>(m_end + ((-1LL)*m_start)),
            static_cast<int64_t>(Nr),
            static_cast<int64_t>(k_end + ((-1LL)*k_start)),
            static_cast<int64_t>(196LL),
            static_cast<int64_t>(80LL),
            static_cast<int64_t>(Nc_blocks*Nr)
        );
```

However, when the input tensor `W` has a storage offset, this results in a double-offset issue. That is, the resulting pointer is `2 * 200704000LL` away from `W.storage().data_ptr()`, which causes an out-of-bounds access.
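For reference, `data_ptr()` already accounts for the storage offset; a minimal standalone snippet (not part of this PR) showing the relationship:

```python
import torch

base = torch.randn(100)
W = base[10:]  # a view with a nonzero storage offset
assert W.storage_offset() == 10
# data_ptr() already points past the offset, so emitted code that adds
# the offset again in its index lands twice the offset past the storage base.
assert W.data_ptr() == (
    W.untyped_storage().data_ptr() + W.storage_offset() * W.element_size()
)
```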

The storage offset of `W` is introduced by [this patch](https://github.com/pytorch/pytorch/pull/136421/files), which I think is a reasonable change, so `cpp_gemm_template.py` should handle input matrices with storage offsets properly.

I think a good way to fix this issue is to create a new matrix that has no storage offset.
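A minimal sketch of that idea (a hypothetical helper, not the code in this PR):

```python
import torch

def materialize_without_offset(w: torch.Tensor) -> torch.Tensor:
    # Hypothetical helper: copy w into fresh storage so storage_offset() == 0
    # and the emitted index expression can start at zero.
    if w.storage_offset() != 0:
        w = w.clone(memory_format=torch.contiguous_format)
        assert w.storage_offset() == 0
    return w
```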

When `should_block_weights` is true, `block_weight()` creates a clean new matrix, so that branch is not affected by this issue.

BTW, I've also examined the FX IRs generated by `torch.compile()`, as well as the generated Python module, and they are correct.

The newly added test in `test_cpu_select_algorithm.py` reproduces the issue. With this patch, the crash is fixed. It also resolves the crash reported in #158076.
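A repro along these lines should hit the same path (shapes are illustrative and max-autotune is assumed; the actual test is the one added to `test_cpu_select_algorithm.py`):

```python
import torch

# The CPP GEMM template is only considered under max-autotune on CPU.
@torch.compile(mode="max-autotune")
def f(x, w):
    return x @ w

big = torch.randn(2, 196, 80)
w = big[1]                # weight view with storage_offset() == 196 * 80
x = torch.randn(64, 196)
f(x, w)                   # out-of-bounds read before this patch if the template is picked
```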

I ran the CPU tests in `test_cpu_select_algorithm.py`, but many of them are skipped due to MKL and AMX. I'd appreciate it if someone could help verify the test.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben

@pytorch-bot

pytorch-bot bot commented Jul 27, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/159233

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit eeb09c0 with merge base f636736:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@linux-foundation-easycla

linux-foundation-easycla bot commented Jul 27, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@Phoslight
Contributor Author

@pytorchbot label "topic: not user facing"

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Jul 27, 2025
@HDCharles HDCharles added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Jul 28, 2025
@leslie-fang-intel leslie-fang-intel requested a review from CaoE July 29, 2025 01:38
@leslie-fang-intel
Collaborator

I ran the CPU tests in `test_cpu_select_algorithm.py`, but many of them are skipped due to MKL and AMX. I'd appreciate it if someone could help verify the test.

@CaoE could you help take a look at this fix?

@leslie-fang-intel
Collaborator

That is, the resulting pointer is `2 * 200704000LL` away from `W.storage().data_ptr()`, which causes an out-of-bounds access.

Hi @Phoslight, thanks for the fix. I'd like to understand more about why the offset causes an out-of-bounds access. AFAIK, the offset should come from the original view node.

@Phoslight
Contributor Author

Phoslight commented Jul 29, 2025

That is, the resulting pointer is `2 * 200704000LL` away from `W.storage().data_ptr()`, which causes an out-of-bounds access.

Hi @Phoslight, thanks for the fix. I'd like to understand more about why the offset causes an out-of-bounds access. AFAIK, the offset should come from the original view node.

+----------------------------------+
| contiguous W with storage offset |  // should_block_weight == False
+-----+----------------------------+
      |
      |
      v
 compile_fx()
      +
      |
      |
      |  // (1) generates the cpp template:
      +------> run_node()
      |              +
      |              | ...
      |              v
      |        tuned_bmm()
      |              +
      |              | ...
      |              v
      |        CppBmmTemplate::render()   // generates template with W's storage offset
      |                                   // e.g. W[static_cast<int64_t>(200704000LL + n_start + 80LL*k_start + 15680LL*ks_b_index)]
      |                                   //   (in this patch I dropped the offset 200704000LL
      |                                   //    to align with should_block_weight branch in prep_weight)
      |
      |
      |  // (2) generates the example inputs
      +------> do_autotuning()
                     +
                     | ...
                     v
               AlgorithmSelectorCache::benchmark()
               +->benchmark_in_current_process()
                  +> get_inputs()
                     +> benchmark_example_value()
                     |  +> unwrap_view()   // drops the offset from W.data_ptr
                     |
                     | ...
                     |
                     +> DataProcessorChoiceCallerWrapper::benchmark()
                        +> CppGemmTemplate::preprocessor()
                           +> normalize_shapes()   // adds back the offset to W.data_ptr (your fix)

Thank you for your reply, Leslie.

Without this patch, the cpp template and the example inputs both add an offset, which causes the double-offset issue.
My proposed fix is to remove the offset from the cpp template to align with the behavior of the `should_block_weight` branch.
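In address terms (a toy calculation, reusing the `200704000LL` offset from the example above):

```python
# Toy address arithmetic for the double offset (off = W.storage_offset()).
off = 200704000
storage_ptr = 0                  # pretend storage base address, in elements
kernel_arg = storage_ptr + off   # the pointer handed to the kernel already includes off
emitted_index = off              # the emitted index expression adds off again
assert kernel_arg + emitted_index == storage_ptr + 2 * off  # out of bounds
```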

Hope the above chart helps clarify the issue.

@Phoslight
Contributor Author

Gentle ping. Any other reviews? Thanks in advance.

Contributor

@swolchok swolchok left a comment


not familiar with this code, but approving workflows to run

Comment on lines +1154 to +1156
# GEMM_TEMPLATE emits code like:
# W.data_ptr[offset + ...]
# but the data_ptr already includes the offset.
Contributor


This makes it sound like the correct fix is to remove the offset from the index calculation in the emitted code rather than copy. I assume that would break something else?

Contributor Author


Sorry for the late reply, and thank you @swolchok for approving the workflows and for the review. My earlier code comment caused some confusion, so I’ve updated it — this fix actually "removes the offset from the index calculation in the emitted code". Hopefully it looks fine this time.

For the workflow errors:

Also refactored the code to keep the main logic free of bulky comments.

Contributor

@swolchok swolchok left a comment


looks good to me, thanks

@Phoslight
Contributor Author

Thank you for the reviews @leslie-fang-intel @swolchok.
I was wondering if this fix could be merged - it looks like I don't have the authorization to push.

@swolchok
Contributor

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Aug 22, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@Phoslight
Contributor Author

Thank you @swolchok for the help!

markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
…59233)

Pull Request resolved: pytorch#159233
Approved by: https://github.com/leslie-fang-intel, https://github.com/swolchok

Labels

ciflow/trunk Trigger trunk jobs on your pull request
Merged
module: inductor
open source
topic: not user facing topic category
triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Projects

None yet

Development

Successfully merging this pull request may close these issues.

torch.compile on BFloat16 Segment Anything segfaults in cpp_CppMicroGemmRef_micro_gemm<false, false> on Mac

6 participants