[inductor] Skip addcmul decomposition to enable FMA lowering #175309
mlazos wants to merge 20 commits into gh/mlazos/106/base
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/175309
Note: links to docs will display an error until the docs builds have been completed.
❗ 1 Active SEV: there is 1 currently active SEV. If your PR is affected, please view it below.
✅ No Failures as of commit 1541f87 with merge base 0b6476f.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This PR needs a `release notes:` label.
@pytorchbot merge
Merge started: your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Add `aten.addcdiv`, `aten.addcdiv_`, and `aten._foreach_addcdiv.Scalar` to `decomps_to_exclude` so that the FMA-based addcdiv lowering is used instead of the decomposition. Also simplify the dynamo handlers for addcdiv to always skip inline decomposition. This enables bitwise precision parity with eager CUDA for the addcdiv operations used in Adam/AdamW optimizers. Pull Request resolved: #175310 Approved by: https://github.com/v0i0 ghstack dependencies: #174912, #175309
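Not from the PR itself, but as a hedged illustration of the parity being targeted: a minimal check one could run on a CUDA machine. `adam_like_step`, the sizes, and the step size are made-up stand-ins for Adam's addcdiv update, not code from this change.

```python
# Minimal sketch (assumes a CUDA device; names and shapes are illustrative).
import torch

def adam_like_step(param, exp_avg, denom, step_size):
    # Adam's parameter update is an addcdiv: param + value * exp_avg / denom
    return torch.addcdiv(param, exp_avg, denom, value=-step_size)

param, exp_avg = (torch.randn(4096, device="cuda") for _ in range(2))
denom = torch.rand(4096, device="cuda") + 1e-8  # keep the denominator positive

eager = adam_like_step(param, exp_avg, denom, 1e-3)
compiled = torch.compile(adam_like_step)(param, exp_avg, denom, 1e-3)
# With the decomposition skipped, the FMA-based lowering should make these
# match bit-for-bit, not merely within a tolerance.
print(torch.equal(eager, compiled))
```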
Eager CUDA computes `a + alpha * b` as `fma(b, alpha, a)`. Without this, Triton computes `b * alpha` then adds to `a` as separate operations, losing the FMA precision guarantee. This affects optimizer weight_decay paths which use `grad.add(param, alpha=weight_decay)` and `_foreach_add` with alpha. Authored with Claude. Pull Request resolved: #175838 Approved by: https://github.com/v0i0 ghstack dependencies: #174912, #175309, #175310
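For context on why this matters numerically: an FMA rounds once, while a separate multiply-then-add rounds twice, so the two can differ in the last bit. A small standalone illustration (uses `math.fma`, available since Python 3.13; the inputs are contrived to expose the difference):

```python
import math  # math.fma requires Python 3.13+

a = -(1.0 + 2**-26)
b = alpha = 1.0 + 2**-27       # exact product: 1 + 2**-26 + 2**-54
fused = math.fma(b, alpha, a)  # one rounding: keeps the 2**-54 term
separate = a + b * alpha       # two roundings: b * alpha drops 2**-54 first
print(fused, separate)         # 5.551115123125783e-17 vs 0.0
```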
Add `aten.addcmul` and `aten._foreach_addcmul.Scalar` to `decomps_to_exclude` so that the FMA-based addcmul lowering is used instead of the decomposition. This enables bitwise precision parity with eager CUDA for addcmul operations. ghstack-source-id: 92d3504 Pull-Request: pytorch/pytorch#175309
[inductor] Skip addcmul decomposition to enable FMA lowering (#175309): Add `aten.addcmul` and `aten._foreach_addcmul.Scalar` to `decomps_to_exclude` so that the FMA-based addcmul lowering is used instead of the decomposition. This enables bitwise precision parity with eager CUDA for addcmul operations. Pull Request resolved: pytorch#175309 Approved by: https://github.com/v0i0 ghstack dependencies: pytorch#174912
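For readers unfamiliar with the mechanism: inductor registers a table of decompositions and then removes the ops it wants to lower directly. A hedged sketch of that pattern follows; the real change edits the existing list in `torch/_inductor/decomposition.py`, and the imports here are assumptions rather than the exact diff.

```python
import torch
from torch._decomp import remove_decompositions
from torch._inductor.decomposition import decompositions

aten = torch.ops.aten

# Ops in this list keep their dedicated inductor lowerings (which can emit an
# FMA) instead of being decomposed into separate multiply and add nodes.
decomps_to_exclude = [
    aten.addcmul,
    aten._foreach_addcmul.Scalar,
]
remove_decompositions(decompositions, decomps_to_exclude)
```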
Stack from ghstack (oldest at bottom):
Add `aten.addcmul` and `aten._foreach_addcmul.Scalar` to `decomps_to_exclude` so that the FMA-based addcmul lowering is used instead of the decomposition. This enables bitwise precision parity with eager CUDA for addcmul operations.
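A quick way to observe the effect on the `_foreach` overload mentioned above (not part of the PR; assumes a CUDA device, and the list sizes and `value` are arbitrary):

```python
import torch

params = [torch.randn(1024, device="cuda") for _ in range(3)]
avgs = [torch.randn(1024, device="cuda") for _ in range(3)]
grads = [torch.randn(1024, device="cuda") for _ in range(3)]

def step(params, avgs, grads):
    # The foreach variant covered by aten._foreach_addcmul.Scalar:
    # out[i] = params[i] + value * avgs[i] * grads[i]
    return torch._foreach_addcmul(params, avgs, grads, value=0.1)

eager = step(params, avgs, grads)
compiled = torch.compile(step)(params, avgs, grads)
# With the decomposition excluded, each compiled output should equal the
# eager CUDA result bit-for-bit.
print(all(torch.equal(e, c) for e, c in zip(eager, compiled)))
```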
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo