[inductor] Add FMA-based lerp lowering for CUDA parity #174749
mlazos wants to merge 18 commits into gh/mlazos/95/base
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/174749
Note: Links to docs will display an error until the docs builds have been completed. ✅ You can merge normally! (1 Unrelated Failure)
As of commit c5cbae9 with merge base 197c376. FLAKY - The following job failed but was likely due to flakiness present on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Add a native inductor lowering for lerp and _foreach_lerp that uses FMA (fused multiply-add) to match CUDA's native lerp behavior. CUDA's lerp uses fma(weight, end-start, start) internally, which computes weight*(end-start)+start without intermediate rounding. Test: test_lerp_fma_precision in test_cuda_repro.py ghstack-source-id: 6bd04d1 Pull-Request: #174749
Add a native inductor lowering for lerp and _foreach_lerp that uses FMA (fused multiply-add) to match CUDA's native lerp behavior. CUDA's lerp uses fma(weight, end-start, start) internally, which computes weight*(end-start)+start without intermediate rounding. Test: test_lerp_fma_precision in test_cuda_repro.py Pull Request resolved: #174749 Approved by: https://github.com/eellison
…)" This reverts commit 03af6e5. Reverted #174749 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](#174749 (comment)))
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Configure Triton to use the CUDA toolkit's libdevice instead of its bundled version. The path is auto-detected via CUDA_HOME from torch.utils.cpp_extension. A warning is emitted if the CUDA libdevice cannot be found. Pull Request resolved: #174933 Approved by: https://github.com/v0i0, https://github.com/eellison ghstack dependencies: #174749
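A rough sketch of the auto-detection described above, assuming standard CUDA toolkit layout; the helper name is illustrative, though CUDA_HOME really is exposed by torch.utils.cpp_extension.

```python
# Illustrative sketch of libdevice auto-detection; find_cuda_libdevice is a
# hypothetical helper name. CUDA_HOME comes from torch.utils.cpp_extension.
import os
import warnings

from torch.utils.cpp_extension import CUDA_HOME


def find_cuda_libdevice() -> str | None:
    # Standard CUDA toolkit layout places libdevice under nvvm/libdevice/.
    if CUDA_HOME is not None:
        path = os.path.join(CUDA_HOME, "nvvm", "libdevice", "libdevice.10.bc")
        if os.path.isfile(path):
            return path
    # Warn and fall back to Triton's bundled libdevice, as the PR describes.
    warnings.warn("CUDA libdevice not found; using Triton's bundled copy")
    return None
```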
Use float32 constant instead of int for reciprocal to ensure proper floating-point division when emulating eager division rounding. Test: test_div_precision_rounding in test_cuda_repro.py Pull Request resolved: #174751 Approved by: https://github.com/v0i0 ghstack dependencies: #174749, #174933
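The dtype issue is easiest to see in a small sketch; the helper below is hypothetical and only demonstrates why the reciprocal constant must be a float, not an int.

```python
# Hypothetical sketch: when division is emulated as multiplication by a
# reciprocal, the constant must be float32 so 1/b is a true floating-point
# division at the tensor's precision, matching eager rounding.
import torch


def div_via_reciprocal(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    one = torch.tensor(1.0, dtype=torch.float32, device=a.device)  # not int 1
    return a * (one / b)
```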
Stack from ghstack (oldest at bottom):
Add a native inductor lowering for `lerp` and `_foreach_lerp` that uses
FMA (fused multiply-add) to match CUDA's native lerp behavior.
CUDA's lerp uses `fma(weight, end - start, start)` internally, which
computes `weight * (end - start) + start` without intermediate rounding.
Test: `test_lerp_fma_precision` in `test_cuda_repro.py`
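To illustrate the single-rounding behavior, here is a minimal sketch (not the actual inductor lowering) contrasting the fused and unfused formulations; it assumes Python 3.13+, where `math.fma` is available.

```python
# Minimal sketch of fused vs. unfused lerp; assumes Python 3.13+ for math.fma.
import math


def lerp_fma(start: float, end: float, weight: float) -> float:
    # Matches CUDA: the multiply-add is fused, so weight * (end - start)
    # is never rounded before start is added -- one rounding total.
    return math.fma(weight, end - start, start)


def lerp_unfused(start: float, end: float, weight: float) -> float:
    # Two roundings: one after the multiply and one after the add, so the
    # result can differ from the fused version in the last bit.
    return weight * (end - start) + start
```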
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo @Lucaskabela