[release/2.3] [ROCm] Correct numerical issues in layer norm backwards kernel (#140259) by jataylo · Pull Request #1766 · ROCm/pytorch

jataylo · 2024-12-04T14:37:29Z

It was reported that the layer norm backward pass on AMD was slightly less accurate than the equivalent NVIDIA implementation.

On AMD we call into a helper kernel, `cuLoadWriteStridedInputs`, which processes strided input and accumulates the partial gradients into shared memory.

In this kernel (pytorch#87635) we truncated `mean` and `rstd` from the accumulator type `T_ACC` to `T`, which causes numerical issues in the warp buffers created in this kernel. This PR uses the correct accumulator type for `mean` and `rstd`.

Note: Only AMD calls into this call stack for the layer norm backward pass, so this was not an issue for NVIDIA.
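To illustrate why the truncation matters, here is a minimal sketch in plain Python, not the kernel code itself. It assumes the per-element gamma-gradient term has the usual layer norm form `dout * (x - mean) * rstd`, uses half precision as a stand-in for `T`, and uses a Python float as a stand-in for `T_ACC`. The helper names `to_half` and `gamma_grad_partial` are hypothetical, and the inputs are deliberately chosen so that the rounding of `mean` is visible.

```python
import struct


def to_half(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def gamma_grad_partial(xs, douts, mean, rstd):
    """Accumulate sum(dout * (x - mean) * rstd) in the accumulator type."""
    acc = 0.0
    for x, dout in zip(xs, douts):
        acc += dout * (x - mean) * rstd
    return acc


# Inputs that are exactly representable in half precision (stand-in for T).
xs = [10.0 + k * 2.0 ** -7 for k in range(4)]
douts = [1.0, 2.0, 3.0, 4.0]

# Row statistics computed in the accumulator type (stand-in for T_ACC).
mean = sum(xs) / len(xs)
var = sum((x - mean) ** 2 for x in xs) / len(xs)
rstd = 1.0 / (var + 1e-5) ** 0.5

# Same accumulation, once with full-precision statistics and once with
# mean and rstd rounded to half precision before use.
exact = gamma_grad_partial(xs, douts, mean, rstd)
truncated = gamma_grad_partial(xs, douts, to_half(mean), to_half(rstd))

print(f"mean/rstd kept in accumulator type: {exact!r}")
print(f"mean/rstd rounded to half:          {truncated!r}")
```

The inputs here sit close together relative to their magnitude, so `x - mean` is on the order of one half-precision step and rounding `mean` changes the result substantially. Real workloads would typically show a much smaller discrepancy, consistent with the "slightly less accurate" behaviour described above.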

Pull Request resolved: pytorch#140259
Approved by: https://github.com/jianyuh

(cherry picked from commit 001f736)


rocm-repo-management-api · 2024-12-04T19:02:33Z

Jenkins build for 376941ea81baea6726add3a82a7e78144ed2bebe commit finished as FAILURE
Links: Blue Ocean view / Build artifacts


jataylo requested review from jithunnair-amd and pruthvistony December 4, 2024 14:37

pruthvistony approved these changes Dec 6, 2024


pruthvistony merged commit a7b07f9 into release/2.3 Dec 6, 2024

pruthvistony deleted the rel23-picks-jack branch December 6, 2024 05:57

ROCm deleted a comment from rocm-mici Dec 13, 2024

This was referenced Dec 13, 2024

[AUTOGENERATED] [release/2.3] Cherry-pick PR-1766 #1788

Closed

[AUTOGENERATED] [release/2.4] Cherry-pick PR-1766 #1789

Closed

rocm-mici mentioned this pull request Dec 13, 2024

[AUTOGENERATED] [release/2.5] Cherry-pick PR-1766 #1790

Closed

ROCm deleted a comment from rocm-mici Dec 13, 2024

rocm-mici mentioned this pull request Dec 13, 2024

[AUTOGENERATED] [release/2.5] Cherry-pick PR-1766 #1792

Closed

ROCm deleted a comment from rocm-mici Dec 13, 2024

rocm-mici mentioned this pull request Dec 13, 2024

[AUTOGENERATED] [release/2.5] Cherry-pick PR-1766 #1793

Closed

ROCm deleted a comment from rocm-mici Dec 13, 2024
