[inductor] Skip addcdiv decomposition to enable FMA lowering#175310

Closed
mlazos wants to merge 22 commits into gh/mlazos/107/base from gh/mlazos/107/head

Contributor

@mlazos mlazos commented Feb 19, 2026

Stack from ghstack (oldest at bottom):

Add aten.addcdiv, aten.addcdiv_, and aten._foreach_addcdiv.Scalar to
decomps_to_exclude so that the FMA-based addcdiv lowering is used
instead of decomposition.
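
As a rough illustration of the mechanism described above, the sketch below models a decomposition table as a plain dictionary and shows how excluding an op leaves it to a dedicated lowering. The registry, `remove_decompositions`, and `route` are hypothetical stand-ins for illustration only; they are not Inductor's actual data structures or API.

```python
from typing import Callable, Dict, Iterable

# Hypothetical registry: op name -> decomposition into simpler ops.
decompositions: Dict[str, Callable[..., float]] = {
    "aten.addcdiv": lambda inp, t1, t2, value=1.0: inp + value * (t1 / t2),
    "aten.addcmul": lambda inp, t1, t2, value=1.0: inp + value * (t1 * t2),
}

# Op names taken from the PR description.
decomps_to_exclude = [
    "aten.addcdiv",
    "aten.addcdiv_",
    "aten._foreach_addcdiv.Scalar",
]


def remove_decompositions(
    table: Dict[str, Callable[..., float]], excluded: Iterable[str]
) -> Dict[str, Callable[..., float]]:
    """Return a copy of `table` without the excluded ops; absent names are ignored."""
    excluded_set = set(excluded)
    return {op: fn for op, fn in table.items() if op not in excluded_set}


active = remove_decompositions(decompositions, decomps_to_exclude)


def route(op: str) -> str:
    """Say which path an op would take in this toy model."""
    return "decomposition" if op in active else "lowering"


print(route("aten.addcdiv"))  # excluded, so it reaches the lowering
print(route("aten.addcmul"))  # still decomposed
```

In this toy model an excluded op is never split into divide, multiply, and add, so a lowering that sees the whole `addcdiv` can choose to emit a fused multiply-add.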

Also simplify the dynamo handlers for addcdiv to always skip inline
decomposition.

This enables bitwise precision parity with eager CUDA for addcdiv
operations used in Adam/AdamW optimizers.
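
To make the parity claim concrete, here is a minimal sketch of why a fused and an unfused `addcdiv` can differ in their low-order bits. It runs in plain Python on float64 and emulates FMA with exact rational arithmetic; the operand order `fma(value, t1 / t2, input)` is an assumption made by analogy with the `a + alpha * b` case, not something taken from the lowering itself.

```python
from fractions import Fraction


def fma(a: float, b: float, c: float) -> float:
    """Emulated fused multiply-add: exact a*b + c, rounded once to float64."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def addcdiv_unfused(inp: float, t1: float, t2: float, value: float) -> float:
    # Decomposed form: the divide, multiply, and add are each rounded.
    return inp + value * (t1 / t2)


def addcdiv_fused(inp: float, t1: float, t2: float, value: float) -> float:
    # Assumed FMA form: the quotient is rounded, then value * q + inp
    # is evaluated with a single rounding.
    return fma(value, t1 / t2, inp)


# Inputs chosen so that the rounding of the product matters.
x = 1.0 + 2.0 ** -27
inp = -(1.0 + 2.0 ** -26)

unfused = addcdiv_unfused(inp, x, 1.0, x)
fused = addcdiv_fused(inp, x, 1.0, x)

print(unfused)  # 0.0: the 2**-54 term is lost when x * x is rounded
print(fused == 2.0 ** -54)  # True: the fused form keeps it
```

The same effect at float32 on a GPU is what would show up as a bitwise mismatch between eager and compiled optimizer steps.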

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo @Lucaskabela

[ghstack-poisoned]

pytorch-bot Bot commented Feb 19, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/175310

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 1 Unrelated Failure

As of commit b342d39 with merge base 0b6476f:

NEW FAILURE - The following job has failed:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.


pytorch-bot Bot commented Feb 19, 2026

This PR needs a release notes: label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

mlazos changed the title from "[inductor] Skip addcdiv decomposition for AdamW bitwise precision" to "[inductor] Skip addcdiv decomposition to enable FMA lowering" Feb 19, 2026
mlazos added a commit that referenced this pull request Feb 21, 2026
ghstack-source-id: 370d088
Pull-Request: #175310
mlazos added a commit that referenced this pull request Feb 24, 2026
ghstack-source-id: 7d5f500
Pull-Request: #175310
mlazos added a commit that referenced this pull request Feb 24, 2026
ghstack-source-id: 43afebc
Pull-Request: #175310
mlazos added a commit that referenced this pull request Feb 26, 2026
ghstack-source-id: f30177a
Pull-Request: #175310
mlazos added a commit that referenced this pull request Feb 27, 2026
ghstack-source-id: a180c70
Pull-Request: #175310
mlazos added the ciflow/trunk label Feb 27, 2026
Contributor Author

mlazos commented Feb 28, 2026

@pytorchbot merge -i

@pytorchmergebot
Collaborator

Merge started

Your change will be merged while ignoring the following 2 checks: inductor / inductor-test / test (inductor_torchbench, 1, 2, linux.g5.4xlarge.nvidia.gpu), inductor / inductor-test-cuda13 / test (inductor_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

pytorchmergebot pushed a commit that referenced this pull request Mar 1, 2026
Eager CUDA computes `a + alpha * b` as `fma(b, alpha, a)`. Without this
change, Triton computes `b * alpha` and then adds `a` as separate operations,
so the product is rounded before the add and FMA's single rounding is lost.

This affects optimizer weight_decay paths which use
`grad.add(param, alpha=weight_decay)` and `_foreach_add` with alpha.

Authored with Claude.

Pull Request resolved: #175838
Approved by: https://github.com/v0i0
ghstack dependencies: #174912, #175309, #175310
pytorchmergebot pushed a commit that referenced this pull request Mar 2, 2026
Pull Request resolved: #175838
Approved by: https://github.com/v0i0
ghstack dependencies: #174912, #175309, #175310
sandy-gags pushed a commit to sandy-gags/pytorch that referenced this pull request Mar 12, 2026
ghstack-source-id: b0ac9d4
Pull-Request: pytorch/pytorch#175310
EmanueleCoradin pushed a commit to EmanueleCoradin/pytorch that referenced this pull request Mar 30, 2026
…#175310)

Pull Request resolved: pytorch#175310
Approved by: https://github.com/v0i0
ghstack dependencies: pytorch#174912, pytorch#175309
EmanueleCoradin pushed a commit to EmanueleCoradin/pytorch that referenced this pull request Mar 30, 2026
Pull Request resolved: pytorch#175838
Approved by: https://github.com/v0i0
ghstack dependencies: pytorch#174912, pytorch#175309, pytorch#175310
@github-actions github-actions Bot deleted the gh/mlazos/107/head branch March 31, 2026 02:23

3 participants