
[inductor] Add FMA-based addcdiv lowering for CUDA parity #174912

Closed

mlazos wants to merge 29 commits into gh/mlazos/101/base from gh/mlazos/101/head

Conversation

@mlazos
Contributor

@mlazos mlazos commented Feb 12, 2026

Stack from ghstack (oldest at bottom):

Add a custom lowering for aten.addcdiv that uses FMA (fused multiply-add)
to match the precision of ATen's CUDA kernel.

The lowering computes: self + value * (tensor1 / tensor2)

  • For value=1: self + tensor1 / tensor2 (simple add)
  • For value!=1: fma(value, tensor1 / tensor2, self)

Uses truediv, which internally uses div_rn (round-to-nearest division)
when config.eager_numerics.division_rounding is True, ensuring the
division result is properly rounded before the FMA.

This allows addcdiv operations to be fused into Triton kernels while
maintaining bitwise parity with eager execution.
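
For reference, a minimal parity check along these lines (illustrative only; it assumes a CUDA device and default inductor settings, and is not taken from this PR's test suite) might look like:

```python
# Hedged sketch: compare compiled addcdiv against eager CUDA, which uses FMA.
# Shapes and the value scalar are arbitrary; bitwise equality relies on this
# lowering (and round-to-nearest division) being active.
import torch

def step(acc, t1, t2):
    # self + value * (tensor1 / tensor2)
    return torch.addcdiv(acc, t1, t2, value=0.5)

acc = torch.randn(4096, device="cuda")
t1 = torch.randn(4096, device="cuda")
t2 = torch.randn(4096, device="cuda").abs() + 1e-3  # keep the divisor away from zero

eager = step(acc, t1, t2)
compiled = torch.compile(step)(acc, t1, t2)

# Bitwise comparison, not allclose: the point of the lowering is exact parity.
print(torch.equal(eager, compiled))
```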

Authored with Claude.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo

[ghstack-poisoned]
@pytorch-bot

pytorch-bot Bot commented Feb 12, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/174912

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit c589d47 with merge base 0b6476f:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

mlazos added a commit that referenced this pull request Feb 12, 2026
Add a custom lowering for aten.addcdiv that uses FMA (fused multiply-add)
to match the precision of ATen's CUDA kernel.

The lowering computes: self + value * (tensor1 / tensor2)
- For value=1: self + tensor1 / tensor2 (simple add)
- For value!=1: fma(value, tensor1 / tensor2, self)

Uses truediv which internally uses div_rn (round-to-nearest division)
when config.eager_numerics.division_rounding is True, ensuring the
division result is properly rounded before the FMA.

This allows addcdiv operations to be fused into triton kernels while
maintaining bitwise parity with eager execution.

Authored with Claude.


ghstack-source-id: 821fd77
Pull-Request: #174912
@pytorch-bot

pytorch-bot Bot commented Feb 12, 2026

This PR needs a release notes: label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@mlazos
Contributor Author

mlazos commented Feb 27, 2026

@pytorchbot merge

@pytorch-bot pytorch-bot Bot added the ciflow/trunk (Trigger trunk jobs on your pull request) label Feb 27, 2026
@pytorchmergebot
Collaborator

Merge failed

Reason: This PR needs a release notes: label
If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Details for Dev Infra team Raised by workflow job

@mlazos
Contributor Author

mlazos commented Feb 27, 2026

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

pytorchmergebot pushed a commit that referenced this pull request Feb 27, 2026
Add aten.addcmul and aten._foreach_addcmul.Scalar to decomps_to_exclude
so that the FMA-based addcmul lowering is used instead of decomposition.

This enables bitwise precision parity with eager CUDA for addcmul operations.

Pull Request resolved: #175309
Approved by: https://github.com/v0i0
ghstack dependencies: #174912
pytorchmergebot pushed a commit that referenced this pull request Feb 28, 2026
Add aten.addcdiv, aten.addcdiv_, and aten._foreach_addcdiv.Scalar to
decomps_to_exclude so that the FMA-based addcdiv lowering is used
instead of decomposition.

Also simplify the dynamo handlers for addcdiv to always skip inline
decomposition.

This enables bitwise precision parity with eager CUDA for addcdiv
operations used in Adam/AdamW optimizers.

Pull Request resolved: #175310
Approved by: https://github.com/v0i0
ghstack dependencies: #174912, #175309
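
For context, this is roughly where addcdiv appears in an Adam-style parameter update (an illustrative sketch, not the actual torch.optim code; the tensor names and values are assumptions):

```python
import torch

param = torch.randn(10, device="cuda")
exp_avg = torch.randn(10, device="cuda")        # first-moment estimate
denom = torch.rand(10, device="cuda") + 1e-8    # sqrt of second moment plus eps
step_size = 1e-3

# param <- param - step_size * (exp_avg / denom); this is the addcdiv that the
# FMA lowering keeps bitwise-identical between eager and compiled runs.
param.addcdiv_(exp_avg, denom, value=-step_size)
```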
pytorchmergebot pushed a commit that referenced this pull request Mar 1, 2026
Eager CUDA computes `a + alpha * b` as `fma(b, alpha, a)`. Without this,
Triton computes `b * alpha` then adds to `a` as separate operations, losing
the FMA precision guarantee.

This affects optimizer weight_decay paths which use
`grad.add(param, alpha=weight_decay)` and `_foreach_add` with alpha.

Authored with Claude.

Pull Request resolved: #175838
Approved by: https://github.com/v0i0
ghstack dependencies: #174912, #175309, #175310
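
The commit message above explains that eager CUDA fuses the multiply and add; a quick sketch of the kind of bitwise difference this avoids (illustrative, assumes a CUDA device; how many elements differ depends on the inputs):

```python
import torch

grad = torch.randn(1 << 20, device="cuda", dtype=torch.float32)
param = torch.randn(1 << 20, device="cuda", dtype=torch.float32)
weight_decay = 0.01

fused = grad.add(param, alpha=weight_decay)   # eager CUDA: fma(param, weight_decay, grad)
separate = grad + weight_decay * param        # explicit mul then add: one extra rounding

print("elements differing bitwise:", (fused != separate).sum().item())
```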
pytorchmergebot pushed a commit that referenced this pull request Mar 4, 2026
# Motivation
#174912 introduces this feature on CUDA, this PR aims to enable it for XPU.

# Additional Context
fix #176152
fix #176157
fix #176168
fix #176407

Pull Request resolved: #176163
Approved by: https://github.com/jansel
@github-actions github-actions Bot deleted the gh/mlazos/101/head branch March 30, 2026 02:24