[inductor] Add FMA-based addcdiv lowering for CUDA parity #174912
mlazos wants to merge 29 commits into gh/mlazos/101/base
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/174912
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures
As of commit c589d47 with merge base 0b6476f.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Add a custom lowering for aten.addcdiv that uses FMA (fused multiply-add) to match the precision of ATen's CUDA kernel.

The lowering computes: self + value * (tensor1 / tensor2)
- For value=1: self + tensor1 / tensor2 (simple add)
- For value!=1: fma(value, tensor1 / tensor2, self)

Uses truediv, which internally uses div_rn (round-to-nearest division) when config.eager_numerics.division_rounding is True, ensuring the division result is properly rounded before the FMA.

This allows addcdiv operations to be fused into Triton kernels while maintaining bitwise parity with eager execution.

Authored with Claude.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo

ghstack-source-id: 821fd77
Pull-Request: #174912
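For intuition, here is a minimal scalar sketch of the arithmetic described above (an illustration, not the actual inductor lowering; `math.fma`, available in Python 3.13+, stands in for the hardware FMA, and plain `/` for the round-to-nearest division):

```python
import math  # math.fma requires Python 3.13+


def addcdiv_fma(self_val: float, t1: float, t2: float, value: float = 1.0) -> float:
    q = t1 / t2  # round-to-nearest division (the div_rn analogue)
    if value == 1.0:
        return self_val + q  # simple add; no multiply, so no FMA needed
    return math.fma(value, q, self_val)  # multiply and add with a single rounding
```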
@pytorchbot merge
Merge failed
Reason: This PR needs a `release notes:` label. If not, please add the `topic: not user facing` label.
To add a label, you can comment to pytorchbot, for example `@pytorchbot label "topic: not user facing"`.
For more information, see the PyTorch AutoLabel Bot wiki.
Details for Dev Infra team: raised by workflow job.
@pytorchbot merge
Merge started
Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Add aten.addcmul and aten._foreach_addcmul.Scalar to decomps_to_exclude so that the FMA-based addcmul lowering is used instead of the decomposition. This enables bitwise precision parity with eager CUDA for addcmul operations.

Pull Request resolved: #175309
Approved by: https://github.com/v0i0
ghstack dependencies: #174912
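To see why excluding the decomposition matters, here is a hedged scalar comparison (plain Python, not the inductor code) of the decomposed and fused forms of addcmul; the two can differ in the last bit because the fused form rounds the final multiply-add only once:

```python
import math  # math.fma requires Python 3.13+


def addcmul_decomposed(s: float, t1: float, t2: float, value: float) -> float:
    # Three rounding steps: t1 * t2, then value * (...), then the add.
    return s + value * (t1 * t2)


def addcmul_fused(s: float, t1: float, t2: float, value: float) -> float:
    # The final multiply-add is a single fma, so it rounds only once.
    return math.fma(value, t1 * t2, s)
```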
Add aten.addcdiv, aten.addcdiv_, and aten._foreach_addcdiv.Scalar to decomps_to_exclude so that the FMA-based addcdiv lowering is used instead of the decomposition. Also simplify the dynamo handlers for addcdiv to always skip inline decomposition. This enables bitwise precision parity with eager CUDA for the addcdiv operations used in Adam/AdamW optimizers.

Pull Request resolved: #175310
Approved by: https://github.com/v0i0
ghstack dependencies: #174912, #175309
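As a usage sketch (hypothetical setup, assuming a CUDA build that includes this stack), a compiled Adam step now routes its addcdiv through the FMA lowering rather than the decomposition:

```python
import torch

param = torch.nn.Parameter(torch.randn(16, device="cuda"))
opt = torch.optim.Adam([param], foreach=True)


@torch.compile
def step():
    opt.step()


param.grad = torch.randn_like(param)
step()  # the addcdiv in the Adam update uses the FMA-based lowering
```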
Eager CUDA computes `a + alpha * b` as `fma(b, alpha, a)`. Without this, Triton computes `b * alpha` and then adds it to `a` as separate operations, losing the FMA precision guarantee. This affects optimizer weight_decay paths, which use `grad.add(param, alpha=weight_decay)` and `_foreach_add` with alpha.

Authored with Claude.

Pull Request resolved: #175838
Approved by: https://github.com/v0i0
ghstack dependencies: #174912, #175309, #175310
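A concrete, hand-constructed double-precision example of the difference (plain Python, `math.fma` requires 3.13+): the product `b * alpha` is not exactly representable, so the unfused form loses the low bits that the fused form keeps.

```python
import math  # math.fma requires Python 3.13+

b = alpha = 1.0 + 2.0**-52    # b * alpha = 1 + 2**-51 + 2**-104, not representable
a = -(1.0 + 2.0**-51)

print(b * alpha + a)          # 0.0: the product rounds to 1 + 2**-51 before the add
print(math.fma(b, alpha, a))  # 2**-104 (about 4.9e-32): one rounding of the exact product
```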
# Motivation
pytorch#174912 introduces this feature on CUDA; this PR aims to enable it for XPU.

# Additional Context
fix pytorch#176152
fix pytorch#176157
fix pytorch#176168
fix pytorch#176407

Pull Request resolved: pytorch#176163
Approved by: https://github.com/jansel