
fix for fp16 #134106

Closed
mayank31398 wants to merge 13 commits into pytorch:main from mayank31398:fix-rmsnorm

Conversation

@mayank31398
Contributor

@mayank31398 mayank31398 commented Aug 21, 2024

This PR is a replacement for #133085 for pushing a quick fix for RMSNorm.
The original author is @kkontny

Previous PR summary:
Since FP16 has quite a small dynamic range, it is very easy to overflow while computing `at::pow(input, 2)`, and this happens in real-world computation.

I've tried to use the fused `nn.RMSNorm` implementation instead of `LlamaRMSNorm` inside the `transformers` implementation of Llama (`src/transformers/models/llama/modeling_llama.py`). It started to give wrong answers in FP16 while still giving good ones in FP32. I figured out this happens due to overflow while computing the square of the input tensor.

The original `LlamaRMSNorm` implementation upcasts the input to FP32 to prevent this and give better numerical stability.

```
class LlamaRMSNorm(nn.Module):
    def __init__(self, hidden_size, eps=1e-6):
        """
        LlamaRMSNorm is equivalent to T5LayerNorm
        """
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states):
        input_dtype = hidden_states.dtype
        hidden_states = hidden_states.to(torch.float32)
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
        return self.weight * hidden_states.to(input_dtype)
```

The proposed commit fixes the issue. FP16 in RMSNorm has to be treated in a special way to be usable in real-world implementations.
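To see why squaring overflows, here is a minimal sketch using NumPy's float16 (which shares FP16's ~65504 maximum); this only illustrates the numerical point, not the PR's actual C++ change:

```python
import numpy as np

# float16 overflows just above 65504, so squaring values above ~256 yields inf.
x = np.full(8, 300.0, dtype=np.float16)

naive = np.mean(x ** 2)                      # squares computed in fp16 -> inf
upcast = np.mean(x.astype(np.float32) ** 2)  # upcast first, like LlamaRMSNorm

print(naive, upcast)  # prints: inf 90000.0
```

The upcast path produces the exact mean of squares, while the naive fp16 path is already infinite before the mean is taken.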

cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @gujinghui @PenghuiCheng @jianyuh @min-jean-cho @yanbing-j @Guobing-Chen @Xia-Weiwen @snadampal @voznesenskym @penguinwu @EikanWang @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @rec

@mayank31398 mayank31398 requested a review from mruberry as a code owner August 21, 2024 13:41
@pytorch-bot

pytorch-bot bot commented Aug 21, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/134106

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (9 Unrelated Failures)

As of commit 04c9c16 with merge base d7b57c4:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Contributor

@mikaylagawarecki mikaylagawarecki left a comment


Thanks

@mikaylagawarecki
Contributor

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Aug 21, 2024
@mikaylagawarecki mikaylagawarecki added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Aug 21, 2024
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: 2 jobs have failed, first few of them are: trunk / macos-py3-arm64-mps / test (mps, 1, 1, macos-m1-13), trunk / macos-py3-arm64-mps / test (mps, 1, 1, macos-m1-14)

Details for Dev Infra team Raised by workflow job

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: 3 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers

@mayank31398
Contributor Author

@mikaylagawarecki can you merge this?
I don't think the failing tests are coming from my PR.

@mayank31398
Contributor Author

@mikaylagawarecki pinging again for a quick resolution on the merge.
Unsure about the failing tests; pretty sure they are unrelated to this PR.

@mikaylagawarecki
Contributor

@pytorchbot merge -r

@mikaylagawarecki
Contributor

mikaylagawarecki commented Aug 23, 2024

The failing tests look related to me (`test_modules.py::TestModuleMPS::test_forward_nn_RMSNorm_mps_float16`); rebasing to check.

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased fix-rmsnorm onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout fix-rmsnorm && git pull --rebase)

@mayank31398
Contributor Author

@eqy @mikaylagawarecki pinging again

@mayank31398
Contributor Author

guys can we get this merged?

@eqy
Collaborator

eqy commented Sep 2, 2024

nit: could we use OpMath rather than hardcoding the half/float case here? e.g.,

`struct OpMathType {`

@eqy I don't understand how to use this. Can you give an example?

see e.g.,

`using opmath_t = at::opmath_type<scalar_t>;`

@mayank31398
Contributor Author

mayank31398 commented Sep 2, 2024

@eqy @kkontny my knowledge of C++ isn't great.

```
using opmath_t = opmath_type<scalar_t>;
upcasted_input = input.to(opmath_t())
```

This doesn't compile, can you push a fix?

@mayank31398
Contributor Author

@eqy figured it out.
I think it's fixed now.

```
weight = m.weight
dims = [ndim - i - 1 for i in range(len(normalized_shape))]
result = i * torch.rsqrt(i.pow(2).mean(dim=dims, keepdim=True) + m.eps)
upcasted_i = i.float()
```
Contributor

@malfet malfet Sep 2, 2024


This would fail if `i` is complex and would reduce the precision if `i` is double.

Suggested change:

```
- upcasted_i = i.float()
+ upcasted_i = i.to(dtype=torch.float) if i.dtype == torch.half else i
```

Contributor Author


I think half doesn't include bf16, right?

Contributor Author


Also, this is just for test cases, which should pass.
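Putting the review thread together, the conditional upcast in the Python test reference could look like the sketch below. The function name `reference_rms_norm` is hypothetical, and extending the dtype check to bfloat16 (raised in the comment above) is an assumption, not what the PR necessarily merged:

```python
import torch

def reference_rms_norm(i, weight, normalized_shape, eps=1e-6):
    # Upcast only low-precision floats; leave double/complex untouched,
    # per the review suggestion (including bfloat16 here is an assumption).
    if i.dtype in (torch.half, torch.bfloat16):
        upcasted_i = i.to(dtype=torch.float)
    else:
        upcasted_i = i
    # Reduce over the trailing dims covered by normalized_shape.
    dims = [i.ndim - k - 1 for k in range(len(normalized_shape))]
    result = upcasted_i * torch.rsqrt(
        upcasted_i.pow(2).mean(dim=dims, keepdim=True) + eps
    )
    # Downcast back to the input dtype only at the end.
    return (weight * result).to(i.dtype)

# fp16 input large enough that squaring would overflow without the upcast.
x = torch.full((2, 4), 300.0, dtype=torch.float16)
w = torch.ones(4, dtype=torch.float16)
out = reference_rms_norm(x, w, normalized_shape=(4,))
print(out)  # finite values close to 1.0, still float16
```

With all inputs equal, RMS normalization returns ones, so any inf/NaN here would indicate the fp16 overflow the PR guards against.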

@mayank31398
Contributor Author

@eqy pinging again

@mayank31398
Contributor Author

failing tests seem unrelated @eqy @mikaylagawarecki

@mayank31398
Contributor Author

@eqy any updates on this?

@mayank31398
Contributor Author

@eqy @mikaylagawarecki
pinging again

@eqy
Collaborator

eqy commented Sep 11, 2024

@pytorchmergebot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@mayank31398 mayank31398 deleted the fix-rmsnorm branch September 12, 2024 03:35
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Sep 20, 2024

Pull Request resolved: pytorch#134106
Approved by: https://github.com/mikaylagawarecki, https://github.com/eqy
pytorchmergebot pushed a commit that referenced this pull request Mar 8, 2025
Fixes #134106. This PR moves the `upcasted_result` down-casting after all computation is done.

Since the multiplication with the `weight_opt` input is not done in half precision, the current code path does the following: fp16 -> fp32 -> fp16 -> fp32 -> fp16. What we want, though, is to avoid the intermediate down-casting, and this PR proposes: fp16 -> fp32 -> fp16. This results in better accuracy as it avoids truncation.
Pull Request resolved: #147203
Approved by: https://github.com/eqy
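The effect of the two casting chains can be sketched numerically. This is a NumPy illustration of the extra rounding error, not the kernel code; all names here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1024).astype(np.float32)
w = rng.standard_normal(1024).astype(np.float32)

# Normalized value, computed in fp32 as both paths do.
norm = x / np.sqrt(np.mean(x * x) + 1e-6)

# Old path sketch: fp16 -> fp32 -> fp16 -> fp32 -> fp16; the extra fp16
# round-trip before multiplying by the weight truncates the mantissa twice.
old = (norm.astype(np.float16).astype(np.float32) * w).astype(np.float16)

# New path sketch: fp16 -> fp32 -> fp16; a single downcast at the very end.
new = (norm * w).astype(np.float16)

# Compare both against a float64 reference.
ref = norm.astype(np.float64) * w.astype(np.float64)
err_old = np.abs(old.astype(np.float64) - ref).mean()
err_new = np.abs(new.astype(np.float64) - ref).mean()
print(err_old, err_new)
```

On random data the mean error of the double-round-trip path exceeds that of the single-downcast path, which is the accuracy gain the commit message describes.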

Labels

ciflow/trunk (Trigger trunk jobs on your pull request)
Merged
module: cpu (CPU specific problem, e.g. perf, algorithm)
module: dynamo
module: inductor
module: mkldnn (Related to Intel IDEEP or oneDNN, a.k.a. mkldnn, integration)
oncall: distributed (Add this issue/PR to distributed oncall triage queue)
open source
release notes: quantization (release notes category)
topic: not user facing (topic category)
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)


8 participants