Fix rms_norm in fp16/bf16 by riccardofelluga · Pull Request #147203 · pytorch/pytorch

riccardofelluga · 2025-02-14T15:29:39Z

Fixes #134106. This PR moves the down-cast of upcasted_result so that it happens only after all computation is done.

Because the multiplication with the weight_opt input is not done in half precision, the current code path performs the following casts: fp16 -> fp32 -> fp16 -> fp32 -> fp16. We want to avoid the intermediate down-cast, so this PR proposes fp16 -> fp32 -> fp16 instead. This gives better accuracy because the intermediate result is no longer truncated to half precision.
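To make the two cast orders concrete, here is a minimal pure-Python sketch, not the ATen implementation. It assumes Python floats as a stand-in for the fp32 intermediate and uses the standard library's half-precision `struct` format to emulate the fp16 down-cast. The function names and the default `eps` are hypothetical and chosen only for illustration.

```python
import math
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]


def rms_norm_reference(x, weight, eps=1e-6):
    """RMS norm kept entirely in Python floats (stand-in for the up-cast path)."""
    scale = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * scale * w for v, w in zip(x, weight)]


def rms_norm_two_casts(x, weight, eps=1e-6):
    """Old order: round the normalized value to fp16, then multiply by the weight."""
    scale = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    normalized = [to_fp16(v * scale) for v in x]  # first down-cast
    return [to_fp16(n * w) for n, w in zip(normalized, weight)]  # second down-cast


def rms_norm_one_cast(x, weight, eps=1e-6):
    """Proposed order: a single down-cast after the weight multiply."""
    return [to_fp16(r) for r in rms_norm_reference(x, weight, eps)]


# Scalar illustration of the double rounding, independent of the norm itself.
n, w = 1.000732421875, 1.75     # n = 1 + 3 * 2**-12, not representable in fp16
print(to_fp16(to_fp16(n) * w))  # two casts
print(to_fp16(n * w))           # one cast
print(n * w)                    # unrounded product
```

In the scalar illustration the two-cast result is 1.751953125 and the one-cast result is 1.7509765625, while the unrounded product is 1.75128173828125, so the single cast lands on the closer fp16 value. In this sketch the one-cast output is by construction the nearest fp16 value to the reference, so it can never be further from the reference than the two-cast output.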

pytorch-bot · 2025-02-14T15:29:44Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/147203

✅ You can merge normally! (1 Unrelated Failure)

As of commit 5a0c8cf with merge base 81847d0:

BROKEN TRUNK - The following job failed but was also failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

trunk / libtorch-linux-focal-cuda12.6-py3.10-gcc9-debug / build (gh) (trunk failure)
undefined reference to `std::__throw_bad_array_new_length()'

This comment was automatically generated by Dr. CI and updates every 15 minutes.

linux-foundation-easycla · 2025-02-14T15:29:44Z

The committers listed above are authorized under a signed CLA.

✅ login: riccardofelluga / name: Riccardo Felluga (6ecb1d1, 5a0c8cf, 7fd4753)

riccardofelluga · 2025-02-14T15:35:52Z

@pytorchbot label "topic: not user facing"

eqy

Would it make sense to add a test that checks whether the expected tolerances are met?

riccardofelluga · 2025-03-03T16:23:13Z

@pytorchbot merge

pytorchmergebot · 2025-03-03T16:25:03Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

pytorchmergebot · 2025-03-03T16:25:13Z

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

pull / cuda12.4-py3.10-gcc9-sm75 / test (pr_time_benchmarks, 1, 1, linux.g4dn.metal.nvidia.gpu)

Dig deeper by viewing the failures on hud

Details for Dev Infra team

Raised by workflow job

Failing merge rule: Core Maintainers

riccardofelluga · 2025-03-03T20:33:25Z

@pytorchbot merge

pytorch-bot · 2025-03-03T20:33:30Z

Pull workflow has not been scheduled for the PR yet. It could be because author doesn't have permissions to run those or skip-checks keywords were added to PR/commits, aborting merge. Please get/give approval for the workflows and/or remove skip ci decorators before next merge attempt. If you think this is a mistake, please contact PyTorch Dev Infra.

riccardofelluga · 2025-03-07T11:06:28Z

@pytorchbot rebase

pytorch-bot · 2025-03-07T11:06:33Z

You don't have permissions to rebase this PR since you are a first time contributor. If you think this is a mistake, please contact PyTorch Dev Infra.

riccardofelluga · 2025-03-07T11:06:50Z

@pytorchbot merge

pytorchmergebot · 2025-03-07T11:08:31Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

pytorchmergebot · 2025-03-07T11:08:42Z

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

pull / cuda12.4-py3.10-gcc9-sm75 / test (pr_time_benchmarks, 1, 1, linux.g4dn.metal.nvidia.gpu)

Failing merge rule: Core Maintainers

riccardofelluga · 2025-03-07T11:08:59Z

@pytorchbot merge -i

pytorch-bot · 2025-03-07T11:09:04Z

-i flag is only allowed for users with write permissions

eqy · 2025-03-07T19:02:40Z

@pytorchmergebot rebase

pytorchmergebot · 2025-03-07T19:04:17Z

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

pytorchmergebot · 2025-03-07T19:04:20Z

Successfully rebased rms-cast onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout rms-cast && git pull --rebase)

riccardofelluga · 2025-03-07T22:56:06Z

@pytorchbot merge

pytorchmergebot · 2025-03-07T22:57:49Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

danielvegamyhre · 2025-04-08T17:01:37Z

@riccardofelluga @eqy This PR seems to break torchtitan float8 training with rowwise scales, when RMS norm is used.

Sometime between 2.6.0 and the present, a change was introduced in PyTorch core that causes the loss to stop decreasing and eventually become NaN after 40 or so steps.
I binary-searched the commits in this range and confirmed that this commit caused the regression (link).
I also confirmed that the issue reproduces with rmsnorm on the latest nightly build and does not reproduce with layernorm (link).

Can we either revert this change or look into a fix asap please? This NaN issue is currently blocking the release of a blog post on float8 rowwise training, so we are eager to resolve it as soon as possible. Thanks!

cc @vkuzo @lessw2020

danielvegamyhre · 2025-04-08T17:04:10Z

fyi @drisspg on #147203 (comment) as well since Vasiliy is OOO

pytorchbot added the open source label Feb 14, 2025

Skylion007 requested a review from eqy February 14, 2025 15:33

pytorch-bot bot added the topic: not user facing label Feb 14, 2025

This was referenced Feb 14, 2025

Fix rms_norm fp16/bf16 Lightning-AI/lightning-thunder#1751

Merged

Input upcast is missing in Thunder's implementation of torch.nn.functional.rms_norm Lightning-AI/lightning-thunder#1713

Closed

colesbury requested a review from albanD February 18, 2025 16:32

colesbury added the triaged label Feb 18, 2025

eqy reviewed Feb 21, 2025

eqy approved these changes Feb 27, 2025

pytorch-bot bot added the ciflow/trunk label Mar 3, 2025

pytorchmergebot added the merging label Mar 3, 2025

pytorchmergebot removed the merging label Mar 3, 2025

pytorch-bot bot removed the ciflow/trunk label Mar 3, 2025

riccardofelluga requested a review from eqy March 3, 2025 21:48

eqy approved these changes Mar 6, 2025

pytorch-bot bot added the ciflow/trunk label Mar 7, 2025

pytorchmergebot added the merging label Mar 7, 2025

pytorchmergebot removed the merging label Mar 7, 2025

riccardofelluga added 3 commits March 7, 2025 19:04

keep rms computation in full precision

7fd4753

add numerics test

6ecb1d1

specify only native devices

5a0c8cf

pytorchmergebot force-pushed the rms-cast branch from b15c62b to 5a0c8cf Compare March 7, 2025 19:04

pytorch-bot bot removed the ciflow/trunk label Mar 7, 2025

pytorch-bot bot added the ciflow/trunk label Mar 7, 2025

pytorchmergebot added the merging label Mar 7, 2025

pytorchmergebot added the Merged label Mar 8, 2025

pytorchmergebot closed this in 8f71d45 Mar 8, 2025

pytorchmergebot removed the merging label Mar 8, 2025

danielvegamyhre mentioned this pull request Apr 8, 2025

NaN loss with rowwise FP8 pytorch/torchtitan#1056

Closed

danielvegamyhre mentioned this pull request Apr 8, 2025

RMS norm causes NaNs when used with torch.compile + float8 with rowwise scales #150859

Closed
