[inductor] Add FMA-based lerp lowering for CUDA parity #174749

Closed

mlazos wants to merge 18 commits into gh/mlazos/95/base from gh/mlazos/95/head

Conversation

Contributor

@mlazos mlazos commented Feb 11, 2026

[ghstack-poisoned]

pytorch-bot Bot commented Feb 11, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/174749

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit c5cbae9 with merge base 197c376:

FLAKY - The following job failed but was likely due to flakiness present on trunk.

This comment was automatically generated by Dr. CI and updates every 15 minutes.


pytorch-bot Bot commented Feb 11, 2026

This PR needs a release notes: label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment with a pytorchbot command, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

mlazos added a commit that referenced this pull request Feb 11, 2026
Add a native inductor lowering for lerp and _foreach_lerp that uses
FMA (fused multiply-add) to match CUDA's native lerp behavior.

CUDA's lerp uses fma(weight, end-start, start) internally, which
computes weight*(end-start)+start without intermediate rounding.

Test: test_lerp_fma_precision in test_cuda_repro.py
ghstack-source-id: 6bd04d1
Pull-Request: #174749
[ghstack-poisoned]
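For illustration, here is a minimal sketch of the FMA-shaped computation, using torch.addcmul to express the a + b*c structure in a single op. The function name is hypothetical; this is a reference for the formula, not the actual lowering code.

```python
import torch

# Hypothetical reference for the formula, not the Inductor lowering itself.
def lerp_fma_reference(start, end, weight):
    # CUDA's lerp evaluates fma(weight, end - start, start), i.e.
    # weight * (end - start) + start. torch.addcmul expresses the same
    # a + b*c shape in one op (rounding still depends on the backend kernel).
    return torch.addcmul(start, weight, end - start)

start = torch.randn(8)
end = torch.randn(8)
weight = torch.full_like(start, 0.3)
torch.testing.assert_close(
    lerp_fma_reference(start, end, weight),
    torch.lerp(start, end, weight),
)
```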
Contributor

@eellison eellison left a comment


Looks good, although if we're testing CUDA parity, it would be great to unify this with a more generic test suite. cc @v0i0

Outdated review threads on torch/_dynamo/variables/torch.py, torch/_inductor/decomposition.py, and torch/_inductor/lowering.py.
mlazos added a commit that referenced this pull request Feb 12, 2026 (ghstack-source-id: c9ba15b)
mlazos added a commit that referenced this pull request Feb 18, 2026 (ghstack-source-id: 2481f3c)
mlazos added a commit that referenced this pull request Feb 18, 2026 (ghstack-source-id: a9ec2c4)
mlazos added a commit that referenced this pull request Feb 24, 2026 (ghstack-source-id: 0c57474)
norx1991 pushed a commit that referenced this pull request Feb 24, 2026

Pull Request resolved: #174749
Approved by: https://github.com/eellison
norx1991 pushed a commit that referenced this pull request Feb 24, 2026
…)"

This reverts commit 03af6e5.

Reverted #174749 on behalf of https://github.com/pytorch-auto-revert: reverted automatically by PyTorch's autorevert. To avoid this behavior, add the tag autorevert: disable ([comment](#174749 (comment)))
Contributor Author

mlazos commented Feb 25, 2026

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

pytorchmergebot pushed a commit that referenced this pull request Feb 27, 2026
Configure Triton to use the CUDA toolkit's libdevice instead of its bundled
version.

The path is auto-detected via CUDA_HOME from torch.utils.cpp_extension.
A warning is emitted if the CUDA libdevice cannot be found.

Pull Request resolved: #174933
Approved by: https://github.com/v0i0, https://github.com/eellison
ghstack dependencies: #174749
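A rough sketch of what such auto-detection could look like, assuming the standard toolkit layout under nvvm/libdevice. The function name is illustrative; this is not the actual implementation.

```python
import glob
import os
import warnings

from torch.utils.cpp_extension import CUDA_HOME  # None when no toolkit is found

def find_cuda_libdevice():
    # Illustrative sketch: look for <CUDA_HOME>/nvvm/libdevice/libdevice*.bc.
    if CUDA_HOME is None:
        warnings.warn("CUDA toolkit not found; using Triton's bundled libdevice")
        return None
    matches = glob.glob(os.path.join(CUDA_HOME, "nvvm", "libdevice", "libdevice*.bc"))
    if not matches:
        warnings.warn(f"libdevice not found under {CUDA_HOME}; using the bundled copy")
        return None
    return matches[0]
```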
pytorchmergebot pushed a commit that referenced this pull request Feb 27, 2026
Use float32 constant instead of int for reciprocal to ensure proper
floating-point division when emulating eager division rounding.

Test: test_div_precision_rounding in test_cuda_repro.py

Pull Request resolved: #174751
Approved by: https://github.com/v0i0
ghstack dependencies: #174749, #174933
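A small illustration of why the constant's dtype matters (illustrative values only, not the Inductor codegen): in C-like kernel code, 1 / d with integer operands is integer division and truncates to zero, which Python's // mimics below, and even a correct float reciprocal can round differently than eager division.

```python
import numpy as np

x = np.float32(7.0)
d = 3

int_recip = 1 // d                           # integer division truncates to 0
f32_recip = np.float32(1.0) / np.float32(d)  # proper float32 reciprocal

print(x * int_recip)       # 0.0: the int-typed reciprocal collapsed to zero
print(x * f32_recip)       # 2.3333335: emulation can differ by one ulp
print(x / np.float32(d))   # 2.3333333: eager float32 division
```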
pytorchmergebot pushed a commit that referenced this pull request Feb 28, 2026 (#174933)
pytorchmergebot pushed a commit that referenced this pull request Feb 28, 2026 (#174751)
sandy-gags pushed a commit to sandy-gags/pytorch that referenced this pull request Mar 12, 2026 (ghstack-source-id: c0cd4f5)
sandy-gags pushed a commit to sandy-gags/pytorch that referenced this pull request Mar 12, 2026 (ghstack-source-id: 714b4e2)
sandy-gags pushed a commit to sandy-gags/pytorch that referenced this pull request Mar 12, 2026 (ghstack-source-id: 3895307)
sandy-gags pushed a commit to sandy-gags/pytorch that referenced this pull request Mar 13, 2026 (ghstack-source-id: 2481f3c)
github-actions bot deleted the gh/mlazos/95/head branch March 28, 2026 02:22
EmanueleCoradin pushed a commit to EmanueleCoradin/pytorch that referenced this pull request Mar 30, 2026 (pytorch#174749)
EmanueleCoradin pushed a commit to EmanueleCoradin/pytorch that referenced this pull request Mar 30, 2026 (revert of pytorch#174749)
EmanueleCoradin pushed a commit to EmanueleCoradin/pytorch that referenced this pull request Mar 30, 2026 (pytorch#174933)
EmanueleCoradin pushed a commit to EmanueleCoradin/pytorch that referenced this pull request Mar 30, 2026 (pytorch#174751)