[inductor] Add inline PTX pow for bitwise CUDA parity #175227
mlazos wants to merge 16 commits into gh/mlazos/104/base
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/175227
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 3f8e00b with merge base 197c376.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Add powf_cuda helper using inline PTX with non-FTZ (flush-to-zero) instructions to match CUDA's powf exactly. Triton's libdevice.pow uses FTZ instructions (fma.rn.ftz.f32) which cause 1-3 ULP differences compared to CUDA's powf (fma.rn.f32). This is used when config.emulate_precision_casts is enabled.
Co-authored-by: Claude <noreply@anthropic.com>
ghstack-source-id: 71954f9
Pull-Request: #175227
How does this relate to the libdevice fix? Was that insufficient?
```python
result = tl.inline_asm_elementwise(
    asm="""
    {
    .reg .pred p2, p3, p4, p5, p6, p7, p8;
    .reg .f32 f1, f2, f3, f4, f5, f6, f7, f8, f9, f10;
    .reg .f32 f11, f12, f13, f14, f15, f16, f17, f18, f19, f20;
    .reg .f32 f21, f22, f23, f24, f25, f26, f27, f28, f29, f30;
    .reg .f32 f31, f32, f33, f34, f35, f36, f37, f38, f39, f40;
    .reg .f32 f41, f42, f43, f44, f45, f46, f47, f48, f49, f50;
    .reg .f32 f51, f52, f53, f54, f55, f56, f57, f58, f59, f60;
    .reg .f32 f61, f62, f63, f64, f65, f66, f67, f68, f69, f70;
    .reg .f32 f71, f72, f73, f74, f75, f76, f77, f78, f79, f80;
    .reg .f32 base_in, exp_in, result_out;
    .reg .s32 r6, r7, r8, r9, r10, r11, r12, r13, r14;
```
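For context on the API used in this hunk, here is a minimal, hypothetical sketch of emitting a single non-FTZ instruction with `tl.inline_asm_elementwise`; this is the pattern the large blob above repeats for the full powf algorithm. The helper name and the `"=f,f,f,f"` constraint string are illustrative and not taken from this PR.

```python
import triton
import triton.language as tl


@triton.jit
def fma_rn_f32(a, b, c):
    # Emit a non-FTZ fused multiply-add directly as PTX. Because the asm
    # string is passed through verbatim, it should not be rewritten into the
    # fma.rn.ftz.f32 form that ordinary Triton float math can lower to.
    # "=f,f,f,f" are LLVM NVPTX constraints: one .f32 output, three .f32 inputs.
    return tl.inline_asm_elementwise(
        asm="fma.rn.f32 $0, $1, $2, $3;",
        constraints="=f,f,f,f",
        args=[a, b, c],
        dtype=tl.float32,
        is_pure=True,
        pack=1,
    )
```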
I wonder if this would be a bit more maintainable/readable if we had a triton.jit version of each of the composed operations, instead of one huge blob. Like, what are the equivalent CUDA operations? Could we just triton.jit each as a helper, and then compose?
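A rough sketch of what that decomposition might look like (the helper names are made up, and the follow-up replies explain why this route was rejected: Triton can still lower these ops with .ftz variants):

```python
import triton
import triton.language as tl


@triton.jit
def _log2_f32(x):
    return tl.log2(x)


@triton.jit
def _exp2_f32(x):
    return tl.exp2(x)


@triton.jit
def powf_composed(base, exp):
    # Main branch of powf expressed as exp2(y * log2(x)); a real powf also
    # needs the special cases (negative bases, zeros, infs/NaNs, integer
    # exponents) that the PTX blob above handles explicitly.
    return _exp2_f32(exp * _log2_f32(base))
```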
I don't think we can use triton.jit at all; it will then convert to FTZ.
Ah, there is another flag for that I recently added: disable_ftz in inductor, enable_reflect_ftz in triton.
Yeah, it didn't work @markus, I think there is a bug in Triton w/ respecting the FTZ stuff for libdevice?
Yeah, I actually don't need the libdevice fix with this. I left the libdevice one though because I think it's a good change anyway.
See my other comment above. I am not convinced this will match results on all inputs; I'd expect libdevice to be more accurate for a bunch of values.
I'm not trying to comment on this PR itself, just to study the impact of 1-3 ULP. @mlazos, may I know if there is any workload that suffers from the 1-3 ULP difference? Are there any principles for evaluating the impact of different ULP counts on accuracy? It would be helpful for other hardware backends as well.
We've established exact bitwise matching as a priority this half - on large and long training runs it's easier to just ensure bitwise matching than to figure out how 1-3 ULP changed things over time.
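Not from this PR, but one concrete way to quantify "1-3 ULP" when evaluating a backend is to count the representable float32 values between two results. A small NumPy sketch (the tensor names in the usage comment are placeholders):

```python
import numpy as np


def ulp_distance(a, b):
    """Count representable float32 values between a and b, elementwise.

    0 means bitwise-identical (treating -0.0 and +0.0 as equal); 1-3 is the
    kind of gap discussed above between libdevice.pow and CUDA's powf.
    """
    ai = np.asarray(a, dtype=np.float32).view(np.int32).astype(np.int64)
    bi = np.asarray(b, dtype=np.float32).view(np.int32).astype(np.int64)
    # Map the sign-magnitude bit patterns onto a monotonically ordered integer
    # line so that a simple subtraction counts the floats between two values.
    ai = np.where(ai < 0, np.int64(-(2**31)) - ai, ai)
    bi = np.where(bi < 0, np.int64(-(2**31)) - bi, bi)
    return np.abs(ai - bi)


# Hypothetical usage: compare a compiled kernel's output against eager CUDA.
# ulps = ulp_distance(compiled_out.cpu().numpy(), eager_out.cpu().numpy())
# print(ulps.max())  # 0 everywhere means bitwise parity (modulo signed zero)
```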
Closing - no longer needed
Stack from ghstack (oldest at bottom):
Add powf_cuda helper using inline PTX with non-FTZ (flush-to-zero)
instructions to match CUDA's powf exactly.
Triton's libdevice.pow uses FTZ instructions (fma.rn.ftz.f32) which
cause 1-3 ULP differences compared to CUDA's powf (fma.rn.f32).
This is used when eager_numerics.pow_precision is enabled (we will evaluate whether that can be on by default).
Co-authored-by: Claude <noreply@anthropic.com>
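As a usage note (not part of the PR itself), the gate described above would be flipped from user code roughly as sketched below. `emulate_precision_casts` is the existing inductor flag an earlier revision of this description referenced; `eager_numerics.pow_precision` is the name used here and, since the PR was closed, may never have landed.

```python
import torch
from torch._inductor import config as inductor_config

# Earlier revisions of this PR gated the helper on the existing
# emulate_precision_casts flag; the final revision names
# eager_numerics.pow_precision, which may not exist in released builds.
inductor_config.emulate_precision_casts = True


@torch.compile
def f(x, y):
    return x.pow(y)


out = f(torch.randn(1024, device="cuda"), torch.randn(1024, device="cuda"))
```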