[inductor] Add FMA-based lerp lowering for CUDA parity #174749

Closed

mlazos wants to merge 18 commits into gh/mlazos/95/base from gh/mlazos/95/head

Conversation

Contributor

@mlazos mlazos commented Feb 11, 2026

[ghstack-poisoned]

pytorch-bot Bot commented Feb 11, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/174749

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit c5cbae9 with merge base 197c376:

FLAKY - The following job failed but was likely due to flakiness present on trunk.

This comment was automatically generated by Dr. CI and updates every 15 minutes.


pytorch-bot Bot commented Feb 11, 2026

This PR needs a release notes: label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment with a pytorchbot command, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

mlazos added a commit that referenced this pull request Feb 11, 2026
Add a native inductor lowering for lerp and _foreach_lerp that uses
FMA (fused multiply-add) to match CUDA's native lerp behavior.

CUDA's lerp uses fma(weight, end-start, start) internally, which
computes weight*(end-start)+start without intermediate rounding.

Test: test_lerp_fma_precision in test_cuda_repro.py
ghstack-source-id: 6bd04d1
Pull-Request: #174749
[ghstack-poisoned]
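For illustration, here is a minimal sketch of the FMA-shaped computation, using torch.addcmul to express the a + b*c structure in a single op. The function name is hypothetical; this is a reference for the formula, not the actual lowering code.

```python
import torch

# Hypothetical reference for the formula, not the Inductor lowering itself.
def lerp_fma_reference(start, end, weight):
    # CUDA's lerp evaluates fma(weight, end - start, start), i.e.
    # weight * (end - start) + start. torch.addcmul expresses the same
    # a + b*c shape in one op (rounding still depends on the backend kernel).
    return torch.addcmul(start, weight, end - start)

start = torch.randn(8)
end = torch.randn(8)
weight = torch.full_like(start, 0.3)
torch.testing.assert_close(
    lerp_fma_reference(start, end, weight),
    torch.lerp(start, end, weight),
)
```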
Contributor

@eellison eellison left a comment


Looks good, although if we're testing CUDA parity, it would be great to unify this with a more generic test suite. cc @v0i0

Outdated review threads on torch/_dynamo/variables/torch.py, torch/_inductor/decomposition.py, and torch/_inductor/lowering.py.
mlazos added a commit that referenced this pull request Feb 12, 2026 (ghstack-source-id: c9ba15b)
mlazos added a commit that referenced this pull request Feb 18, 2026 (ghstack-source-id: 2481f3c)
mlazos added a commit that referenced this pull request Feb 18, 2026 (ghstack-source-id: a9ec2c4)
mlazos added a commit that referenced this pull request Feb 24, 2026 (ghstack-source-id: 0c57474)
norx1991 pushed a commit that referenced this pull request Feb 24, 2026

Pull Request resolved: #174749
Approved by: https://github.com/eellison
norx1991 pushed a commit that referenced this pull request Feb 24, 2026
…)"

This reverts commit 03af6e5.

Reverted #174749 on behalf of https://github.com/pytorch-auto-revert: reverted automatically by PyTorch's autorevert. To avoid this behavior, add the tag autorevert: disable ([comment](#174749 (comment)))
Contributor Author

mlazos commented Feb 25, 2026

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

pytorchmergebot pushed a commit that referenced this pull request Feb 27, 2026
Configure Triton to use the CUDA toolkit's libdevice instead of its bundled
version.

The path is auto-detected via CUDA_HOME from torch.utils.cpp_extension.
A warning is emitted if the CUDA libdevice cannot be found.

Pull Request resolved: #174933
Approved by: https://github.com/v0i0, https://github.com/eellison
ghstack dependencies: #174749
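A rough sketch of what such auto-detection could look like, assuming the standard toolkit layout under nvvm/libdevice. The function name is illustrative; this is not the actual implementation.

```python
import glob
import os
import warnings

from torch.utils.cpp_extension import CUDA_HOME  # None when no toolkit is found

def find_cuda_libdevice():
    # Illustrative sketch: look for <CUDA_HOME>/nvvm/libdevice/libdevice*.bc.
    if CUDA_HOME is None:
        warnings.warn("CUDA toolkit not found; using Triton's bundled libdevice")
        return None
    matches = glob.glob(os.path.join(CUDA_HOME, "nvvm", "libdevice", "libdevice*.bc"))
    if not matches:
        warnings.warn(f"libdevice not found under {CUDA_HOME}; using the bundled copy")
        return None
    return matches[0]
```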
pytorchmergebot pushed a commit that referenced this pull request Feb 27, 2026
Use float32 constant instead of int for reciprocal to ensure proper
floating-point division when emulating eager division rounding.

Test: test_div_precision_rounding in test_cuda_repro.py

Pull Request resolved: #174751
Approved by: https://github.com/v0i0
ghstack dependencies: #174749, #174933
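A small illustration of why the constant's dtype matters (illustrative values only, not the Inductor codegen): in C-like kernel code, 1 / d with integer operands is integer division and truncates to zero, which Python's // mimics below, and even a correct float reciprocal can round differently than eager division.

```python
import numpy as np

x = np.float32(7.0)
d = 3

int_recip = 1 // d                           # integer division truncates to 0
f32_recip = np.float32(1.0) / np.float32(d)  # proper float32 reciprocal

print(x * int_recip)       # 0.0: the int-typed reciprocal collapsed to zero
print(x * f32_recip)       # 2.3333335: emulation can differ by one ulp
print(x / np.float32(d))   # 2.3333333: eager float32 division
```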
pytorchmergebot pushed a commit that referenced this pull request Feb 28, 2026 (#174933)
pytorchmergebot pushed a commit that referenced this pull request Feb 28, 2026 (#174751)
sandy-gags pushed a commit to sandy-gags/pytorch that referenced this pull request Mar 12, 2026 (ghstack-source-id: c0cd4f5)
sandy-gags pushed a commit to sandy-gags/pytorch that referenced this pull request Mar 12, 2026 (ghstack-source-id: 714b4e2)
sandy-gags pushed a commit to sandy-gags/pytorch that referenced this pull request Mar 12, 2026 (ghstack-source-id: 3895307)
sandy-gags pushed a commit to sandy-gags/pytorch that referenced this pull request Mar 13, 2026 (ghstack-source-id: 2481f3c)
github-actions bot deleted the gh/mlazos/95/head branch March 28, 2026 02:22
EmanueleCoradin pushed a commit to EmanueleCoradin/pytorch that referenced this pull request Mar 30, 2026 (pytorch#174749)
EmanueleCoradin pushed a commit to EmanueleCoradin/pytorch that referenced this pull request Mar 30, 2026 (revert of pytorch#174749)
EmanueleCoradin pushed a commit to EmanueleCoradin/pytorch that referenced this pull request Mar 30, 2026 (pytorch#174933)
EmanueleCoradin pushed a commit to EmanueleCoradin/pytorch that referenced this pull request Mar 30, 2026 (pytorch#174751)