[inductor] Skip addcmul decomposition to enable FMA lowering #175309
mlazos wants to merge 20 commits into gh/mlazos/106/base
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/175309
Note: links to docs will display an error until the docs builds have been completed.
❗ 1 Active SEV: there is 1 currently active SEV. If your PR is affected, please view it below.
✅ No Failures as of commit 1541f87 with merge base 0b6476f.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This PR needs a `release notes:` label.
@pytorchbot merge
Merge started: your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Add `aten.addcdiv`, `aten.addcdiv_`, and `aten._foreach_addcdiv.Scalar` to `decomps_to_exclude` so that the FMA-based addcdiv lowering is used instead of the decomposition. Also simplify the dynamo handlers for addcdiv to always skip inline decomposition. This enables bitwise precision parity with eager CUDA for the addcdiv operations used in Adam/AdamW optimizers. Pull Request resolved: #175310 Approved by: https://github.com/v0i0 ghstack dependencies: #174912, #175309
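Not from the PR itself, but as a hedged illustration of the parity being targeted: a minimal check one could run on a CUDA machine. `adam_like_step`, the sizes, and the step size are made-up stand-ins for Adam's addcdiv update, not code from this change.

```python
# Minimal sketch (assumes a CUDA device; names and shapes are illustrative).
import torch

def adam_like_step(param, exp_avg, denom, step_size):
    # Adam's parameter update is an addcdiv: param + value * exp_avg / denom
    return torch.addcdiv(param, exp_avg, denom, value=-step_size)

param, exp_avg = (torch.randn(4096, device="cuda") for _ in range(2))
denom = torch.rand(4096, device="cuda") + 1e-8  # keep the denominator positive

eager = adam_like_step(param, exp_avg, denom, 1e-3)
compiled = torch.compile(adam_like_step)(param, exp_avg, denom, 1e-3)
# With the decomposition skipped, the FMA-based lowering should make these
# match bit-for-bit, not merely within a tolerance.
print(torch.equal(eager, compiled))
```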
Eager CUDA computes `a + alpha * b` as `fma(b, alpha, a)`. Without this, Triton computes `b * alpha` then adds to `a` as separate operations, losing the FMA precision guarantee. This affects optimizer weight_decay paths which use `grad.add(param, alpha=weight_decay)` and `_foreach_add` with alpha. Authored with Claude. Pull Request resolved: #175838 Approved by: https://github.com/v0i0 ghstack dependencies: #174912, #175309, #175310
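For context on why this matters numerically: an FMA rounds once, while a separate multiply-then-add rounds twice, so the two can differ in the last bit. A small standalone illustration (uses `math.fma`, available since Python 3.13; the inputs are contrived to expose the difference):

```python
import math  # math.fma requires Python 3.13+

a = -(1.0 + 2**-26)
b = alpha = 1.0 + 2**-27       # exact product: 1 + 2**-26 + 2**-54
fused = math.fma(b, alpha, a)  # one rounding: keeps the 2**-54 term
separate = a + b * alpha       # two roundings: b * alpha drops 2**-54 first
print(fused, separate)         # 5.551115123125783e-17 vs 0.0
```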
Add `aten.addcmul` and `aten._foreach_addcmul.Scalar` to `decomps_to_exclude` so that the FMA-based addcmul lowering is used instead of the decomposition. This enables bitwise precision parity with eager CUDA for addcmul operations. ghstack-source-id: 92d3504 Pull-Request: pytorch/pytorch#175309
[inductor] Skip addcmul decomposition to enable FMA lowering (#175309): Add `aten.addcmul` and `aten._foreach_addcmul.Scalar` to `decomps_to_exclude` so that the FMA-based addcmul lowering is used instead of the decomposition. This enables bitwise precision parity with eager CUDA for addcmul operations. Pull Request resolved: pytorch#175309 Approved by: https://github.com/v0i0 ghstack dependencies: pytorch#174912
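For readers unfamiliar with the mechanism: inductor registers a table of decompositions and then removes the ops it wants to lower directly. A hedged sketch of that pattern follows; the real change edits the existing list in `torch/_inductor/decomposition.py`, and the imports here are assumptions rather than the exact diff.

```python
import torch
from torch._decomp import remove_decompositions
from torch._inductor.decomposition import decompositions

aten = torch.ops.aten

# Ops in this list keep their dedicated inductor lowerings (which can emit an
# FMA) instead of being decomposed into separate multiply and add nodes.
decomps_to_exclude = [
    aten.addcmul,
    aten._foreach_addcmul.Scalar,
]
remove_decompositions(decompositions, decomps_to_exclude)
```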
Stack from ghstack (oldest at bottom):
Add `aten.addcmul` and `aten._foreach_addcmul.Scalar` to `decomps_to_exclude` so that the FMA-based addcmul lowering is used instead of the decomposition. This enables bitwise precision parity with eager CUDA for addcmul operations.
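A quick way to observe the effect on the `_foreach` overload mentioned above (not part of the PR; assumes a CUDA device, and the list sizes and `value` are arbitrary):

```python
import torch

params = [torch.randn(1024, device="cuda") for _ in range(3)]
avgs = [torch.randn(1024, device="cuda") for _ in range(3)]
grads = [torch.randn(1024, device="cuda") for _ in range(3)]

def step(params, avgs, grads):
    # The foreach variant covered by aten._foreach_addcmul.Scalar:
    # out[i] = params[i] + value * avgs[i] * grads[i]
    return torch._foreach_addcmul(params, avgs, grads, value=0.1)

eager = step(params, avgs, grads)
compiled = torch.compile(step)(params, avgs, grads)
# With the decomposition excluded, each compiled output should equal the
# eager CUDA result bit-for-bit.
print(all(torch.equal(e, c) for e, c in zip(eager, compiled)))
```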
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo