
[inductor] Add FMA lowering for add-with-alpha on CUDA #175838

Closed
mlazos wants to merge 13 commits into gh/mlazos/109/base from gh/mlazos/109/head

Conversation

@mlazos
Contributor

@mlazos mlazos commented Feb 26, 2026

Stack from ghstack (oldest at bottom):

Eager CUDA computes `a + alpha * b` as `fma(b, alpha, a)`. Without this, Triton computes `b * alpha` and then adds it to `a` as separate operations, losing the FMA precision guarantee.

This affects optimizer weight_decay paths, which use `grad.add(param, alpha=weight_decay)` and `_foreach_add` with alpha.
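
A minimal sketch of the motivating pattern (illustrative only; the helper name is mine, a CUDA device is assumed, and the observed gap depends on dtype and inputs):

```python
import torch

def apply_weight_decay(grad, param, weight_decay=1e-2):
    # eager CUDA fuses this into fma(param, weight_decay, grad)
    return grad.add(param, alpha=weight_decay)

grad = torch.randn(1 << 20, device="cuda")
param = torch.randn(1 << 20, device="cuda")
eager = apply_weight_decay(grad, param)
compiled = torch.compile(apply_weight_decay)(grad, param)
# Without the FMA lowering, Triton emits param * weight_decay followed by a
# separate add, so eager and compiled can differ in the last ulp:
print((eager - compiled).abs().max())
```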

Authored with Claude.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo

[ghstack-poisoned]
@pytorch-bot

pytorch-bot Bot commented Feb 26, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/175838

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 330cd4f with merge base 0b6476f:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot

pytorch-bot Bot commented Feb 26, 2026

This PR needs a `release notes:` label

If your changes are user facing and intended to be a part of release notes, please use a label starting with `release notes:`.

If not, please add the `topic: not user facing` label.

To add a label, you can comment to pytorchbot, for example:
`@pytorchbot label "topic: not user facing"`

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

mlazos added a commit that referenced this pull request Feb 27, 2026
Eager CUDA computes `a + alpha * b` as `fma(b, alpha, a)`. Without this,
Triton computes `b * alpha` then adds to `a` as separate operations, losing
the FMA precision guarantee.

This affects optimizer weight_decay paths which use
`grad.add(param, alpha=weight_decay)` and `_foreach_add` with alpha.

Authored with Claude.

ghstack-source-id: bac0c83
Pull-Request: #175838
[ghstack-poisoned]
@mlazos mlazos added the ciflow/trunk (Trigger trunk jobs on your pull request) label Feb 28, 2026
@mlazos
Contributor Author

mlazos commented Mar 1, 2026

@pytorchbot merge -i

@pytorchmergebot
Collaborator

Merge started

Your change will be merged while ignoring the following 3 checks:

  • trunk / linux-jammy-cuda13.0-py3.10-gcc11 / test (default, 1, 5, linux.g6.4xlarge.experimental.nvidia.gpu)
  • inductor / unit-test / inductor-test / test (inductor, 2, 2, linux.g5.4xlarge.nvidia.gpu)
  • inductor / unit-test / inductor-test / test (inductor, 1, 2, linux.g5.4xlarge.nvidia.gpu)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here.

@pytorch-auto-revert

@pytorchbot revert -m "Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable" -c autorevert

This PR has been identified as the cause of a regression in:

Please investigate and fix the issues.
@claude Can you please read this revert comment, follow the links and read the errors, and then give a brief diagnosis of the cause of the error? If you judge the error to be a legitimate reason for a revert, please provide brief guidance on how the author could fix it.

@claude

claude Bot commented Mar 1, 2026

Claude finished @pytorch-auto-revert[bot]'s task in 4m 1s. View job


Diagnosing Auto-Revert for FMA Lowering PR

  • Read the PR changes to understand what was modified
  • Investigate the CI failure logs
  • Diagnose root cause and provide guidance

Root Cause

The revert is legitimate. The bug is in the FMA lowering path in `make_pointwise()` at `torch/_inductor/lowering.py:677-689`.

The FMA optimization is applied unconditionally for both `add` and `sub` (since both are registered with `allow_alpha=True`), but it only computes the correct result for `add`:

```python
# line 689 - always computes a + alpha * b
return _add_with_alpha_fma(inputs[0], inputs[1], alpha)
```

  • `aten.add(a, b, alpha=k)` should compute `a + k*b` -- FMA gives `fma(b, k, a) = b*k + a` -- correct
  • `aten.sub(a, b, alpha=k)` should compute `a - k*b` -- FMA gives `fma(b, k, a) = b*k + a = a + k*b` -- wrong (adds instead of subtracting)

This is why every test failure shows 100% of elements mismatched with large absolute differences (~35), and always with alpha=2 in the sample input. The FMA path computes a + 2*b instead of a - 2*b.
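
For illustration only (plain tensors, not the actual inductor code path), the mismatch looks like this:

```python
import torch

a = torch.tensor([1.0, 2.0])
b = torch.tensor([10.0, 20.0])
expected = a - 2 * b     # aten.sub(a, b, alpha=2): tensor([-19., -38.])
fma_path = b * 2 + a     # fma(b, 2, a), i.e. a + 2*b: tensor([21., 42.])
print(expected, fma_path)  # every element differs, consistent with the 100% mismatch reports
```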

Failing Tests

| Test | Root cause |
| --- | --- |
| `test_comprehensive_sub_cuda_float32` | `sub` with `alpha=2` uses FMA → computes `a+2b` instead of `a-2b` |
| `test_comprehensive_rsub_cuda_float16` | `rsub` decomposes to `sub` → same bug |
| `test_comprehensive_rsub_cuda_float64` (revert trigger) | Same as above, for float64 |
| `test_op_dtype_propagation_sub_cuda_float32` | Same `sub` bug in the dtype propagation test |

How to Fix

The make_pointwise FMA path needs to be aware of whether it's being called for add or sub. Two approaches:

  1. Negate alpha for sub: Pass fn info into the FMA decision and use `-alpha` for subtraction (see the numeric check after this list):

```python
# In make_pointwise, when the underlying fn is sub:
return _add_with_alpha_fma(inputs[0], inputs[1], -alpha)
```

  2. Only apply FMA for add: The simplest fix is to restrict the FMA path to `add` only, since `make_pointwise` doesn't currently know which op it's wrapping. This could be done by adding a parameter like `use_fma_for_alpha=False` and only setting it to `True` in the `add` registration. Alternatively, the FMA check could be moved out of `make_pointwise` and into a custom `add` lowering.
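
To make the sign flip in Option 1 concrete, here is a tiny numeric check using plain tensors as a stand-in for the inductor IR values (`fma_like` is a hypothetical helper, not the real lowering):

```python
import torch

def fma_like(b, alpha, a):
    # stand-in for the fused multiply-add the lowering would emit: b * alpha + a
    return b * alpha + a

a, b, alpha = torch.tensor([5.0]), torch.tensor([3.0]), 2.0
assert torch.equal(fma_like(b, alpha, a), a + alpha * b)    # add path: a + alpha*b
assert torch.equal(fma_like(b, -alpha, a), a - alpha * b)   # sub path: negated alpha gives a - alpha*b
```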

Option 2 is cleaner since it avoids leaking op-specific semantics into the generic `make_pointwise` helper. A dedicated `aten.add` lowering (similar to `addcmul`/`addcdiv`) would be the most explicit approach.
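
As a quick guard against the regression described above, a check along these lines could be added (sketch only; assumes a CUDA device and the default inductor backend):

```python
import torch

def sub_alpha(a, b):
    return torch.sub(a, b, alpha=2)   # should compute a - 2*b

a = torch.randn(64, device="cuda")
b = torch.randn(64, device="cuda")
eager = sub_alpha(a, b)
compiled = torch.compile(sub_alpha)(a, b)
# With the buggy lowering this fails on every element (a + 2*b vs a - 2*b).
torch.testing.assert_close(eager, compiled)
```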

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request Mar 1, 2026
…)"

This reverts commit 45dfce3.

Reverted #175838 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](#175838 (comment)))
@pytorchmergebot
Collaborator

@mlazos your PR has been successfully reverted.

@pytorchmergebot pytorchmergebot added the Reverted and ci-no-td (Do not run TD on this PR) labels Mar 1, 2026
@mlazos
Contributor Author

mlazos commented Mar 2, 2026

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here.

sandy-gags pushed a commit to sandy-gags/pytorch that referenced this pull request Mar 12, 2026
Eager CUDA computes `a + alpha * b` as `fma(b, alpha, a)`. Without this,
Triton computes `b * alpha` then adds to `a` as separate operations, losing
the FMA precision guarantee.

This affects optimizer weight_decay paths which use
`grad.add(param, alpha=weight_decay)` and `_foreach_add` with alpha.

Authored with Claude.

ghstack-source-id: c259808
Pull-Request: pytorch/pytorch#175838
EmanueleCoradin pushed a commit to EmanueleCoradin/pytorch that referenced this pull request Mar 30, 2026
Eager CUDA computes `a + alpha * b` as `fma(b, alpha, a)`. Without this,
Triton computes `b * alpha` then adds to `a` as separate operations, losing
the FMA precision guarantee.

This affects optimizer weight_decay paths which use
`grad.add(param, alpha=weight_decay)` and `_foreach_add` with alpha.

Authored with Claude.

Pull Request resolved: pytorch#175838
Approved by: https://github.com/v0i0
ghstack dependencies: pytorch#174912, pytorch#175309, pytorch#175310
EmanueleCoradin pushed a commit to EmanueleCoradin/pytorch that referenced this pull request Mar 30, 2026
…ch#175838)"

This reverts commit 45dfce3.

Reverted pytorch#175838 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](pytorch#175838 (comment)))
EmanueleCoradin pushed a commit to EmanueleCoradin/pytorch that referenced this pull request Mar 30, 2026
Eager CUDA computes `a + alpha * b` as `fma(b, alpha, a)`. Without this,
Triton computes `b * alpha` then adds to `a` as separate operations, losing
the FMA precision guarantee.

This affects optimizer weight_decay paths which use
`grad.add(param, alpha=weight_decay)` and `_foreach_add` with alpha.

Authored with Claude.

Pull Request resolved: pytorch#175838
Approved by: https://github.com/v0i0
ghstack dependencies: pytorch#174912, pytorch#175309, pytorch#175310
@github-actions github-actions Bot deleted the gh/mlazos/109/head branch April 2, 2026 02:23
