
[inductor] Add FMA-based addcdiv lowering for CUDA parity #174912

Closed

mlazos wants to merge 29 commits into gh/mlazos/101/base from gh/mlazos/101/head

Conversation

@mlazos
Contributor

@mlazos mlazos commented Feb 12, 2026

Stack from ghstack (oldest at bottom):

Add a custom lowering for aten.addcdiv that uses FMA (fused multiply-add)
to match the precision of ATen's CUDA kernel.

The lowering computes: self + value * (tensor1 / tensor2)

  • For value=1: self + tensor1 / tensor2 (simple add)
  • For value!=1: fma(value, tensor1 / tensor2, self)

Uses truediv, which internally uses div_rn (round-to-nearest division)
when config.eager_numerics.division_rounding is True, ensuring the
division result is properly rounded before the FMA.

This allows addcdiv operations to be fused into Triton kernels while
maintaining bitwise parity with eager execution.
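
For reference, a minimal parity check along these lines (illustrative only; it assumes a CUDA device and default inductor settings, and is not taken from this PR's test suite) might look like:

```python
# Hedged sketch: compare compiled addcdiv against eager CUDA, which uses FMA.
# Shapes and the value scalar are arbitrary; bitwise equality relies on this
# lowering (and round-to-nearest division) being active.
import torch

def step(acc, t1, t2):
    # self + value * (tensor1 / tensor2)
    return torch.addcdiv(acc, t1, t2, value=0.5)

acc = torch.randn(4096, device="cuda")
t1 = torch.randn(4096, device="cuda")
t2 = torch.randn(4096, device="cuda").abs() + 1e-3  # keep the divisor away from zero

eager = step(acc, t1, t2)
compiled = torch.compile(step)(acc, t1, t2)

# Bitwise comparison, not allclose: the point of the lowering is exact parity.
print(torch.equal(eager, compiled))
```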

Authored with Claude.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo

[ghstack-poisoned]
@pytorch-bot

pytorch-bot Bot commented Feb 12, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/174912

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit c589d47 with merge base 0b6476f:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

mlazos added a commit that referenced this pull request Feb 12, 2026
Add a custom lowering for aten.addcdiv that uses FMA (fused multiply-add)
to match the precision of ATen's CUDA kernel.

The lowering computes: self + value * (tensor1 / tensor2)
- For value=1: self + tensor1 / tensor2 (simple add)
- For value!=1: fma(value, tensor1 / tensor2, self)

Uses truediv which internally uses div_rn (round-to-nearest division)
when config.eager_numerics.division_rounding is True, ensuring the
division result is properly rounded before the FMA.

This allows addcdiv operations to be fused into triton kernels while
maintaining bitwise parity with eager execution.

Authored with Claude.


ghstack-source-id: 821fd77
Pull-Request: #174912
@pytorch-bot

pytorch-bot Bot commented Feb 12, 2026

This PR needs a release notes: label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@mlazos
Contributor Author

mlazos commented Feb 27, 2026

@pytorchbot merge

@pytorch-bot pytorch-bot Bot added the ciflow/trunk (Trigger trunk jobs on your pull request) label Feb 27, 2026
@pytorchmergebot
Collaborator

Merge failed

Reason: This PR needs a release notes: label
If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Details for Dev Infra team Raised by workflow job

@mlazos
Contributor Author

mlazos commented Feb 27, 2026

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

pytorchmergebot pushed a commit that referenced this pull request Feb 27, 2026
Add aten.addcmul and aten._foreach_addcmul.Scalar to decomps_to_exclude
so that the FMA-based addcmul lowering is used instead of decomposition.

This enables bitwise precision parity with eager CUDA for addcmul operations.

Pull Request resolved: #175309
Approved by: https://github.com/v0i0
ghstack dependencies: #174912
pytorchmergebot pushed a commit that referenced this pull request Feb 28, 2026
Add aten.addcdiv, aten.addcdiv_, and aten._foreach_addcdiv.Scalar to
decomps_to_exclude so that the FMA-based addcdiv lowering is used
instead of decomposition.

Also simplify the dynamo handlers for addcdiv to always skip inline
decomposition.

This enables bitwise precision parity with eager CUDA for addcdiv
operations used in Adam/AdamW optimizers.

Pull Request resolved: #175310
Approved by: https://github.com/v0i0
ghstack dependencies: #174912, #175309
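
For context, this is roughly where addcdiv appears in an Adam-style parameter update (an illustrative sketch, not the actual torch.optim code; the tensor names and values are assumptions):

```python
import torch

param = torch.randn(10, device="cuda")
exp_avg = torch.randn(10, device="cuda")        # first-moment estimate
denom = torch.rand(10, device="cuda") + 1e-8    # sqrt of second moment plus eps
step_size = 1e-3

# param <- param - step_size * (exp_avg / denom); this is the addcdiv that the
# FMA lowering keeps bitwise-identical between eager and compiled runs.
param.addcdiv_(exp_avg, denom, value=-step_size)
```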
pytorchmergebot pushed a commit that referenced this pull request Mar 1, 2026
Eager CUDA computes `a + alpha * b` as `fma(b, alpha, a)`. Without this,
Triton computes `b * alpha` then adds to `a` as separate operations, losing
the FMA precision guarantee.

This affects optimizer weight_decay paths which use
`grad.add(param, alpha=weight_decay)` and `_foreach_add` with alpha.

Authored with Claude.

Pull Request resolved: #175838
Approved by: https://github.com/v0i0
ghstack dependencies: #174912, #175309, #175310
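
The commit message above explains that eager CUDA fuses the multiply and add; a quick sketch of the kind of bitwise difference this avoids (illustrative, assumes a CUDA device; how many elements differ depends on the inputs):

```python
import torch

grad = torch.randn(1 << 20, device="cuda", dtype=torch.float32)
param = torch.randn(1 << 20, device="cuda", dtype=torch.float32)
weight_decay = 0.01

fused = grad.add(param, alpha=weight_decay)   # eager CUDA: fma(param, weight_decay, grad)
separate = grad + weight_decay * param        # explicit mul then add: one extra rounding

print("elements differing bitwise:", (fused != separate).sum().item())
```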
pytorchmergebot pushed a commit that referenced this pull request Mar 4, 2026
# Motivation
#174912 introduces this feature on CUDA, this PR aims to enable it for XPU.

# Additional Context
fix #176152
fix #176157
fix #176168
fix #176407

Pull Request resolved: #176163
Approved by: https://github.com/jansel
@github-actions github-actions Bot deleted the gh/mlazos/101/head branch March 30, 2026 02:24