[Pytorch] Add option to CPU Blas GEMM to avoid output downcast #154012
cyrusd98 wants to merge 1 commit into pytorch:main
Conversation
This appears to be a diff that was exported from phabricator, but the PR author does not have sufficient permissions to run CI. @cyrusd98, please do step 2 of internal wiki to get write access so you do not need to get CI approvals in the future. If you think this is a mistake, please contact the Pytorch Dev Infra team. |
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/154012
Note: Links to docs will display an error until the docs builds have been completed.
⏳ No Failures, 1 Pending. As of commit 3199dc9 with merge base 86a1603.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D75023858
Valentine233 left a comment:
LGTM, thanks for the fix.
aditew01 left a comment:
Thanks for the PR. LGTM!
Thanks for the fix.
Force-pushed 33d99a0 to ec88fad
Force-pushed ec88fad to d57c495
…ch#154012)
Summary: Pull Request resolved: pytorch#154012

Dot product for a single output element consists of 3 steps (both input vectors have elements of type scalar_t):
1. elementwise vector multiply (scalar_t x scalar_t -> opmath_t)
2. vector reduction to a scalar value (opmath_t -> opmath_t)
3. optional downcast if opmath_t != out_t

The current blas kernel performs steps 1 and 2 correctly, but for step 3 it always downcasts to scalar_t even when opmath_t == out_t (and then upcasts back to out_t), which results in precision loss. This diff fixes the precision loss in the BlasKernel.

Test Plan: Attention CI passes
Differential Revision: D75023858
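The precision loss described above can be demonstrated with a minimal Python sketch. Here float32 stands in for scalar_t and float64 for both opmath_t and out_t; this is an illustrative model, not PyTorch's actual kernel code:

```python
import struct

def to_f32(x: float) -> float:
    """Round a binary64 Python float to binary32, modeling a downcast to scalar_t."""
    return struct.unpack('f', struct.pack('f', x))[0]

# Inputs stored as scalar_t (float32 here); 0.1 is not exactly representable.
a = [to_f32(0.1)] * 1000
b = [to_f32(1.0)] * 1000

# Steps 1-2: elementwise multiply and reduce in opmath_t (float64 here).
acc = 0.0
for x, y in zip(a, b):
    acc += x * y

# Old behavior: always downcast to scalar_t, then upcast to out_t,
# even when opmath_t == out_t.
lossy = to_f32(acc)

# Fixed behavior: skip the downcast when opmath_t == out_t.
exact = acc

print(lossy)  # 100.0 -- the fractional part was rounded away by the downcast
print(exact)  # ~100.0000014901 -- full opmath_t precision preserved
```

The downcast rounds the accumulated sum to the nearest float32 (exactly 100.0 here), so the round trip through scalar_t silently discards the low-order bits that the fix preserves.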
Force-pushed d57c495 to 3199dc9
@pytorchbot merge
Merge failed. Reason: Approvers from one of the following sets are needed:
Followup after #154012. Fixes CPU part of #160841. Pull Request resolved: #161999. Approved by: https://github.com/drisspg
Followup after #154012. Since the introduction of `gemm_no_downcast_stub`, it's no longer necessary to allocate a temporary array and then manually implement the `beta` logic in the codebase. Pull Request resolved: #162001. Approved by: https://github.com/drisspg. ghstack dependencies: #161999
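For illustration, here is a hypothetical sketch of why a GEMM that writes results in the output type removes the need for a temporary array: the `beta` scaling can be folded into the final store. The function name and shape handling below are assumptions for this sketch, not the actual `gemm_no_downcast_stub` signature:

```python
def gemm_no_downcast(alpha, A, B, beta, C):
    """C = alpha * (A @ B) + beta * C, writing results in place.

    Hypothetical sketch: because the kernel can emit out_t results
    directly, beta is applied at the final store; no temporary
    scalar_t array or separate hand-rolled beta pass is needed.
    """
    m, k, n = len(A), len(B), len(B[0])
    for i in range(m):
        for j in range(n):
            # Steps 1-2: multiply and reduce in opmath_t (Python float).
            acc = sum(A[i][p] * B[p][j] for p in range(k))
            # beta logic applied in the final store, on the output type.
            C[i][j] = alpha * acc + beta * C[i][j]
    return C

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = [[1.0, 1.0], [1.0, 1.0]]
gemm_no_downcast(1.0, A, B, 1.0, C)
print(C)  # [[20.0, 23.0], [44.0, 51.0]]
```

Without such a kernel, a caller that needs `beta != 0` has to compute `A @ B` into a scratch buffer and blend it into `C` in a second pass; fusing the store removes that allocation and traversal.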
@pytorchbot revert -m "Breaks ADS internal tests, see D81845017" -c ghfirst
@pytorchbot successfully started a revert job. Check the current status here.
Reverting PR 154012 failed. Reason: Command failed. Details for Dev Infra team: raised by workflow job.
Forget about this revert attempt; I followed the wrong link.
Summary:
Dot product for a single output element consists of 3 steps (both input vectors have elements of type scalar_t):
1. elementwise vector multiply (scalar_t x scalar_t -> opmath_t)
2. vector reduction to a scalar value (opmath_t -> opmath_t)
3. optional downcast if opmath_t != out_t

The current blas kernel performs steps 1 and 2 correctly, but for step 3 it always downcasts to scalar_t even when opmath_t == out_t (and then upcasts back to out_t), which results in precision loss. This diff fixes the precision loss in the BlasKernel.
Test Plan: Attention CI passes
Differential Revision: D75023858
topic: not user facing
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168