
[inductor] Add FMA lowering for add-with-alpha on CUDA #175838

Closed
mlazos wants to merge 13 commits into gh/mlazos/109/base from gh/mlazos/109/head

Conversation

@mlazos
Contributor

@mlazos mlazos commented Feb 26, 2026

Stack from ghstack (oldest at bottom):

Eager CUDA computes `a + alpha * b` as `fma(b, alpha, a)`. Without this, Triton computes `b * alpha` and then adds it to `a` as separate operations, losing the FMA precision guarantee.

This affects optimizer weight_decay paths, which use `grad.add(param, alpha=weight_decay)` and `_foreach_add` with alpha.
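
A minimal sketch of the motivating pattern (illustrative only; the helper name is mine, a CUDA device is assumed, and the observed gap depends on dtype and inputs):

```python
import torch

def apply_weight_decay(grad, param, weight_decay=1e-2):
    # eager CUDA fuses this into fma(param, weight_decay, grad)
    return grad.add(param, alpha=weight_decay)

grad = torch.randn(1 << 20, device="cuda")
param = torch.randn(1 << 20, device="cuda")
eager = apply_weight_decay(grad, param)
compiled = torch.compile(apply_weight_decay)(grad, param)
# Without the FMA lowering, Triton emits param * weight_decay followed by a
# separate add, so eager and compiled can differ in the last ulp:
print((eager - compiled).abs().max())
```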

Authored with Claude.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo

[ghstack-poisoned]
@pytorch-bot

pytorch-bot Bot commented Feb 26, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/175838

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit 330cd4f with merge base 0b6476f:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot

pytorch-bot Bot commented Feb 26, 2026

This PR needs a `release notes:` label

If your changes are user facing and intended to be a part of release notes, please use a label starting with `release notes:`.

If not, please add the `topic: not user facing` label.

To add a label, you can comment to pytorchbot, for example:
`@pytorchbot label "topic: not user facing"`

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

mlazos added a commit that referenced this pull request Feb 27, 2026
Eager CUDA computes `a + alpha * b` as `fma(b, alpha, a)`. Without this,
Triton computes `b * alpha` then adds to `a` as separate operations, losing
the FMA precision guarantee.

This affects optimizer weight_decay paths which use
`grad.add(param, alpha=weight_decay)` and `_foreach_add` with alpha.

Authored with Claude.

ghstack-source-id: bac0c83
Pull-Request: #175838
[ghstack-poisoned]
@mlazos mlazos added the ciflow/trunk (Trigger trunk jobs on your pull request) label Feb 28, 2026
@mlazos
Contributor Author

mlazos commented Mar 1, 2026

@pytorchbot merge -i

@pytorchmergebot
Collaborator

Merge started

Your change will be merged while ignoring the following 3 checks:

  • trunk / linux-jammy-cuda13.0-py3.10-gcc11 / test (default, 1, 5, linux.g6.4xlarge.experimental.nvidia.gpu)
  • inductor / unit-test / inductor-test / test (inductor, 2, 2, linux.g5.4xlarge.nvidia.gpu)
  • inductor / unit-test / inductor-test / test (inductor, 1, 2, linux.g5.4xlarge.nvidia.gpu)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here.

@pytorch-auto-revert

@pytorchbot revert -m "Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable" -c autorevert

This PR has been identified as the cause of a regression in:

Please investigate and fix the issues.
@claude Can you please read this revert comment, follow the links and read the errors, and then give a brief diagnosis of the cause of the error? If you judge the error to be a legitimate reason for a revert, please provide brief guidance on how the author could fix it.

@claude

claude Bot commented Mar 1, 2026

Claude finished @pytorch-auto-revert[bot]'s task in 4m 1s. View job


Diagnosing Auto-Revert for FMA Lowering PR

  • Read the PR changes to understand what was modified
  • Investigate the CI failure logs
  • Diagnose root cause and provide guidance

Root Cause

The revert is legitimate. The bug is in the FMA lowering path in `make_pointwise()` at `torch/_inductor/lowering.py:677-689`.

The FMA optimization is applied unconditionally for both `add` and `sub` (since both are registered with `allow_alpha=True`), but it only computes the correct result for `add`:

```python
# line 689 - always computes a + alpha * b
return _add_with_alpha_fma(inputs[0], inputs[1], alpha)
```

  • `aten.add(a, b, alpha=k)` should compute `a + k*b` -- FMA gives `fma(b, k, a) = b*k + a` -- correct
  • `aten.sub(a, b, alpha=k)` should compute `a - k*b` -- FMA gives `fma(b, k, a) = b*k + a = a + k*b` -- wrong (adds instead of subtracting)

This is why every test failure shows 100% of elements mismatched with large absolute differences (~35), and always with alpha=2 in the sample input. The FMA path computes a + 2*b instead of a - 2*b.
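
For illustration only (plain tensors, not the actual inductor code path), the mismatch looks like this:

```python
import torch

a = torch.tensor([1.0, 2.0])
b = torch.tensor([10.0, 20.0])
expected = a - 2 * b     # aten.sub(a, b, alpha=2): tensor([-19., -38.])
fma_path = b * 2 + a     # fma(b, 2, a), i.e. a + 2*b: tensor([21., 42.])
print(expected, fma_path)  # every element differs, consistent with the 100% mismatch reports
```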

Failing Tests

| Test | Root cause |
| --- | --- |
| `test_comprehensive_sub_cuda_float32` | `sub` with `alpha=2` uses FMA → computes `a+2b` instead of `a-2b` |
| `test_comprehensive_rsub_cuda_float16` | `rsub` decomposes to `sub` → same bug |
| `test_comprehensive_rsub_cuda_float64` (revert trigger) | Same as above, for float64 |
| `test_op_dtype_propagation_sub_cuda_float32` | Same `sub` bug in the dtype propagation test |

How to Fix

The make_pointwise FMA path needs to be aware of whether it's being called for add or sub. Two approaches:

  1. Negate alpha for sub: Pass fn info into the FMA decision and use `-alpha` for subtraction (see the numeric check after this list):

```python
# In make_pointwise, when the underlying fn is sub:
return _add_with_alpha_fma(inputs[0], inputs[1], -alpha)
```

  2. Only apply FMA for add: The simplest fix is to restrict the FMA path to `add` only, since `make_pointwise` doesn't currently know which op it's wrapping. This could be done by adding a parameter like `use_fma_for_alpha=False` and only setting it to `True` in the `add` registration. Alternatively, the FMA check could be moved out of `make_pointwise` and into a custom `add` lowering.
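
To make the sign flip in Option 1 concrete, here is a tiny numeric check using plain tensors as a stand-in for the inductor IR values (`fma_like` is a hypothetical helper, not the real lowering):

```python
import torch

def fma_like(b, alpha, a):
    # stand-in for the fused multiply-add the lowering would emit: b * alpha + a
    return b * alpha + a

a, b, alpha = torch.tensor([5.0]), torch.tensor([3.0]), 2.0
assert torch.equal(fma_like(b, alpha, a), a + alpha * b)    # add path: a + alpha*b
assert torch.equal(fma_like(b, -alpha, a), a - alpha * b)   # sub path: negated alpha gives a - alpha*b
```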

Option 2 is cleaner since it avoids leaking op-specific semantics into the generic `make_pointwise` helper. A dedicated `aten.add` lowering (similar to `addcmul`/`addcdiv`) would be the most explicit approach.
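
As a quick guard against the regression described above, a check along these lines could be added (sketch only; assumes a CUDA device and the default inductor backend):

```python
import torch

def sub_alpha(a, b):
    return torch.sub(a, b, alpha=2)   # should compute a - 2*b

a = torch.randn(64, device="cuda")
b = torch.randn(64, device="cuda")
eager = sub_alpha(a, b)
compiled = torch.compile(sub_alpha)(a, b)
# With the buggy lowering this fails on every element (a + 2*b vs a - 2*b).
torch.testing.assert_close(eager, compiled)
```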

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request Mar 1, 2026
…)"

This reverts commit 45dfce3.

Reverted #175838 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](#175838 (comment)))
@pytorchmergebot
Collaborator

@mlazos your PR has been successfully reverted.

@pytorchmergebot pytorchmergebot added the Reverted and ci-no-td (Do not run TD on this PR) labels Mar 1, 2026
@mlazos
Contributor Author

mlazos commented Mar 2, 2026

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here.

sandy-gags pushed a commit to sandy-gags/pytorch that referenced this pull request Mar 12, 2026
Eager CUDA computes `a + alpha * b` as `fma(b, alpha, a)`. Without this,
Triton computes `b * alpha` then adds to `a` as separate operations, losing
the FMA precision guarantee.

This affects optimizer weight_decay paths which use
`grad.add(param, alpha=weight_decay)` and `_foreach_add` with alpha.

Authored with Claude.

ghstack-source-id: c259808
Pull-Request: pytorch/pytorch#175838
EmanueleCoradin pushed a commit to EmanueleCoradin/pytorch that referenced this pull request Mar 30, 2026
Eager CUDA computes `a + alpha * b` as `fma(b, alpha, a)`. Without this,
Triton computes `b * alpha` then adds to `a` as separate operations, losing
the FMA precision guarantee.

This affects optimizer weight_decay paths which use
`grad.add(param, alpha=weight_decay)` and `_foreach_add` with alpha.

Authored with Claude.

Pull Request resolved: pytorch#175838
Approved by: https://github.com/v0i0
ghstack dependencies: pytorch#174912, pytorch#175309, pytorch#175310
EmanueleCoradin pushed a commit to EmanueleCoradin/pytorch that referenced this pull request Mar 30, 2026
…ch#175838)"

This reverts commit 45dfce3.

Reverted pytorch#175838 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](pytorch#175838 (comment)))
EmanueleCoradin pushed a commit to EmanueleCoradin/pytorch that referenced this pull request Mar 30, 2026
Eager CUDA computes `a + alpha * b` as `fma(b, alpha, a)`. Without this,
Triton computes `b * alpha` then adds to `a` as separate operations, losing
the FMA precision guarantee.

This affects optimizer weight_decay paths which use
`grad.add(param, alpha=weight_decay)` and `_foreach_add` with alpha.

Authored with Claude.

Pull Request resolved: pytorch#175838
Approved by: https://github.com/v0i0
ghstack dependencies: pytorch#174912, pytorch#175309, pytorch#175310
@github-actions github-actions Bot deleted the gh/mlazos/109/head branch April 2, 2026 02:23
