[Pytorch] Add option to CPU Blas GEMM to avoid output downcast #154012
cyrusd98 wants to merge 1 commit into pytorch:main
Conversation
This appears to be a diff that was exported from phabricator, but the PR author does not have sufficient permissions to run CI. @cyrusd98, please do step 2 of internal wiki to get write access so you do not need to get CI approvals in the future. If you think this is a mistake, please contact the Pytorch Dev Infra team. |
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/154012
Note: Links to docs will display an error until the docs builds have been completed.
⏳ No Failures, 1 Pending. As of commit 3199dc9 with merge base 86a1603.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D75023858
Valentine233 left a comment:
LGTM, thanks for the fix.
aditew01 left a comment:
Thanks for the PR. LGTM!
Thanks for the fix.
Force-pushed 33d99a0 to ec88fad
Force-pushed ec88fad to d57c495
…ch#154012)
Summary: Pull Request resolved: pytorch#154012

Dot product for a single output element consists of 3 steps (both input vectors have elements of type scalar_t):
1. elementwise vector multiply (scalar_t x scalar_t -> opmath_t)
2. vector reduction to a scalar value (opmath_t -> opmath_t)
3. optional downcast if opmath_t != out_t

The current blas kernel performs steps 1 and 2 correctly, but for step 3 it always downcasts to scalar_t even when opmath_t == out_t (and then upcasts back to out_t), which results in precision loss. This diff fixes the precision loss in the BlasKernel.

Test Plan: Attention CI passes
Differential Revision: D75023858
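The precision loss described above can be demonstrated with a minimal Python sketch. Here float32 stands in for scalar_t and float64 for both opmath_t and out_t; this is an illustrative model, not PyTorch's actual kernel code:

```python
import struct

def to_f32(x: float) -> float:
    """Round a binary64 Python float to binary32, modeling a downcast to scalar_t."""
    return struct.unpack('f', struct.pack('f', x))[0]

# Inputs stored as scalar_t (float32 here); 0.1 is not exactly representable.
a = [to_f32(0.1)] * 1000
b = [to_f32(1.0)] * 1000

# Steps 1-2: elementwise multiply and reduce in opmath_t (float64 here).
acc = 0.0
for x, y in zip(a, b):
    acc += x * y

# Old behavior: always downcast to scalar_t, then upcast to out_t,
# even when opmath_t == out_t.
lossy = to_f32(acc)

# Fixed behavior: skip the downcast when opmath_t == out_t.
exact = acc

print(lossy)  # 100.0 -- the fractional part was rounded away by the downcast
print(exact)  # ~100.0000014901 -- full opmath_t precision preserved
```

The downcast rounds the accumulated sum to the nearest float32 (exactly 100.0 here), so the round trip through scalar_t silently discards the low-order bits that the fix preserves.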
Force-pushed d57c495 to 3199dc9
@pytorchbot merge
Merge failed. Reason: Approvers from one of the following sets are needed:
Followup after #154012. Fixes CPU part of #160841. Pull Request resolved: #161999. Approved by: https://github.com/drisspg
Followup after #154012. Since the introduction of `gemm_no_downcast_stub`, it's no longer necessary to allocate a temporary array and then manually implement the `beta` logic in the codebase. Pull Request resolved: #162001. Approved by: https://github.com/drisspg. ghstack dependencies: #161999
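For illustration, here is a hypothetical sketch of why a GEMM that writes results in the output type removes the need for a temporary array: the `beta` scaling can be folded into the final store. The function name and shape handling below are assumptions for this sketch, not the actual `gemm_no_downcast_stub` signature:

```python
def gemm_no_downcast(alpha, A, B, beta, C):
    """C = alpha * (A @ B) + beta * C, writing results in place.

    Hypothetical sketch: because the kernel can emit out_t results
    directly, beta is applied at the final store; no temporary
    scalar_t array or separate hand-rolled beta pass is needed.
    """
    m, k, n = len(A), len(B), len(B[0])
    for i in range(m):
        for j in range(n):
            # Steps 1-2: multiply and reduce in opmath_t (Python float).
            acc = sum(A[i][p] * B[p][j] for p in range(k))
            # beta logic applied in the final store, on the output type.
            C[i][j] = alpha * acc + beta * C[i][j]
    return C

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = [[1.0, 1.0], [1.0, 1.0]]
gemm_no_downcast(1.0, A, B, 1.0, C)
print(C)  # [[20.0, 23.0], [44.0, 51.0]]
```

Without such a kernel, a caller that needs `beta != 0` has to compute `A @ B` into a scratch buffer and blend it into `C` in a second pass; fusing the store removes that allocation and traversal.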
@pytorchbot revert -m "Breaks ADS internal tests, see D81845017" -c ghfirst
@pytorchbot successfully started a revert job. Check the current status here.
Reverting PR 154012 failed. Reason: Command failed. Details for Dev Infra team: raised by workflow job.
Forget about this revert attempt; I followed the wrong link.
Summary:
Dot product for a single output element consists of 3 steps (both input vectors have elements of type scalar_t):
1. elementwise vector multiply (scalar_t x scalar_t -> opmath_t)
2. vector reduction to a scalar value (opmath_t -> opmath_t)
3. optional downcast if opmath_t != out_t

The current blas kernel performs steps 1 and 2 correctly, but for step 3 it always downcasts to scalar_t even when opmath_t == out_t (and then upcasts back to out_t), which results in precision loss. This diff fixes the precision loss in the BlasKernel.
Test Plan: Attention CI passes
Differential Revision: D75023858
topic: not user facing
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168