[inductor] Add inline PTX pow for bitwise CUDA parity#175227

Closed
mlazos wants to merge 16 commits into gh/mlazos/104/base from gh/mlazos/104/head

Conversation

@mlazos
Contributor

@mlazos mlazos commented Feb 18, 2026

Stack from ghstack (oldest at bottom):

Add a powf_cuda helper using inline PTX with non-FTZ (no flush-to-zero) instructions to match CUDA's powf bitwise.

Triton's libdevice.pow uses FTZ instructions (fma.rn.ftz.f32), which cause 1-3 ULP differences compared to CUDA's powf (fma.rn.f32). The helper is used when eager_numerics.pow_precision is enabled (we will evaluate whether it can be on by default).

Co-authored-by: Claude <noreply@anthropic.com>

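For readers unfamiliar with the unit: a 1-3 ULP mismatch means the two results are within a few representable float32 values of each other. A standalone sketch (not from this PR; ulp_distance is a hypothetical helper using the standard sign-remapping trick) that counts that distance:

```python
import struct

def ordered_bits(x: float) -> int:
    """Map a float32 bit pattern onto a monotonically ordered integer line."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Negative floats sort in reverse bit order; remap them so that integer
    # order matches numeric order (and -0.0 coincides with +0.0).
    return 0x80000000 - bits if bits & 0x80000000 else bits

def ulp_distance(a: float, b: float) -> int:
    """Count the number of representable float32 values between a and b."""
    return abs(ordered_bits(a) - ordered_bits(b))

print(ulp_distance(1.0, 1.0))           # 0
print(ulp_distance(1.0, 1.0 + 2**-23))  # 1: adjacent float32 values
print(ulp_distance(-0.0, 0.0))          # 0
```

So "1-3 ULP" means the mismatched result is one of the one-to-three float32 values adjacent to the reference result.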
@pytorch-bot

pytorch-bot Bot commented Feb 18, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/175227

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 3f8e00b with merge base 197c376:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot

pytorch-bot Bot commented Feb 18, 2026

This PR needs a release notes: label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@mlazos mlazos requested review from eellison and v0i0 February 18, 2026 04:19
mlazos added a commit that referenced this pull request Feb 18, 2026
Add powf_cuda helper using inline PTX with non-FTZ (flush-to-zero)
instructions to match CUDA's powf exactly.

Triton's libdevice.pow uses FTZ instructions (fma.rn.ftz.f32) which
cause 1-3 ULP differences compared to CUDA's powf (fma.rn.f32).
This is used when config.emulate_precision_casts is enabled.

Co-authored-by: Claude <noreply@anthropic.com>
ghstack-source-id: 71954f9
Pull-Request: #175227
mlazos added a commit that referenced this pull request Feb 18, 2026 (same powf_cuda commit message, now gated by config.eager_numerics.pow_precision; ghstack-source-id: 71954f9)
mlazos added a commit that referenced this pull request Feb 18, 2026 (same powf_cuda commit message; ghstack-source-id: 127a4f5)
Review comment threads were opened on test/inductor/test_cuda_repro.py, torch/_inductor/codegen/triton.py, and torch/_inductor/runtime/triton_helpers.py.
@v0i0
Contributor

v0i0 commented Feb 18, 2026

How does this relate to the libdevice fix? Was that insufficient?

Comment on lines +779 to +793
result = tl.inline_asm_elementwise(
asm="""
{
.reg .pred p2, p3, p4, p5, p6, p7, p8;
.reg .f32 f1, f2, f3, f4, f5, f6, f7, f8, f9, f10;
.reg .f32 f11, f12, f13, f14, f15, f16, f17, f18, f19, f20;
.reg .f32 f21, f22, f23, f24, f25, f26, f27, f28, f29, f30;
.reg .f32 f31, f32, f33, f34, f35, f36, f37, f38, f39, f40;
.reg .f32 f41, f42, f43, f44, f45, f46, f47, f48, f49, f50;
.reg .f32 f51, f52, f53, f54, f55, f56, f57, f58, f59, f60;
.reg .f32 f61, f62, f63, f64, f65, f66, f67, f68, f69, f70;
.reg .f32 f71, f72, f73, f74, f75, f76, f77, f78, f79, f80;
.reg .f32 base_in, exp_in, result_out;
.reg .s32 r6, r7, r8, r9, r10, r11, r12, r13, r14;

Contributor

I wonder if this would be a bit more maintainable/readable if we had a triton.jit version of each of the composed operations instead of one huge blob. What are the equivalent CUDA operations? Could we just triton.jit each as a helper and then compose them?

Contributor Author

I don't think we can use triton.jit at all; it will then convert to FTZ.

Contributor

Ah, there is another flag for that which I recently added: disable_ftz in inductor, enable_reflect_ftz in triton.

Contributor Author

Yeah, it didn't work @markus; I think there is a bug in Triton with respecting the FTZ settings for libdevice?
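
For context on the FTZ discussion above: the .ftz variants of PTX float32 instructions treat subnormal inputs and outputs as zero. A rough pure-Python emulation (illustrative only; ftz is a hypothetical helper, not the PR's code):

```python
import struct

def ftz(x: float) -> float:
    """Emulate flush-to-zero on a float32 value: subnormals become signed zero."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    if exponent == 0 and mantissa != 0:  # subnormal (denormal) value
        return -0.0 if bits >> 31 else 0.0
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# The smallest positive float32 subnormal, 2**-149 (about 1.4e-45).
smallest_subnormal = struct.unpack("<f", struct.pack("<I", 1))[0]
print(ftz(smallest_subnormal))  # 0.0 under FTZ
print(ftz(1.0))                 # 1.0, normal values pass through unchanged
```

Since powf internally chains fma operations, an FTZ fma can produce a different final bit pattern than the non-FTZ one whenever an intermediate value dips into the subnormal range.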

@mlazos
Contributor Author

mlazos commented Feb 19, 2026

How does this relate to the libdevice fix? Was that insufficient?

Yeah, I actually don't need the libdevice fix with this. I left the libdevice change in, though, because I think it's a good change anyway.

@mlazos mlazos requested a review from eellison February 19, 2026 19:54
mlazos added a commit that referenced this pull request Feb 21, 2026 (same powf_cuda commit message; ghstack-source-id: 6c669f3)
mlazos added a commit that referenced this pull request Feb 21, 2026 (same powf_cuda commit message; ghstack-source-id: 6c669f3)
@v0i0
Contributor

v0i0 commented Feb 23, 2026

How does this relate to the libdevice fix? Was that insufficient?

Yeah, I actually don't need the libdevice fix with this. I left the libdevice change in, though, because I think it's a good change anyway.

See my other comment above. I am not convinced this will match results on all inputs; I'd expect libdevice to be more accurate for a bunch of values.
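
The concern that two implementations can agree on many inputs yet drift on others can be probed with a small ULP sweep. A sketch (mine, not the PR's test; pow_reference and pow_exp_log are hypothetical stand-ins) comparing a once-rounded double-precision reference against a deliberately decomposed exp/log path with float32 rounding at every step:

```python
import math
import struct

def to_f32(x: float) -> float:
    """Round a double to the nearest representable float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def ulps(a: float, b: float) -> int:
    """ULP distance between two float32 values via the sign-remapping trick."""
    def key(x: float) -> int:
        bits = struct.unpack("<I", struct.pack("<f", x))[0]
        return 0x80000000 - bits if bits & 0x80000000 else bits
    return abs(key(a) - key(b))

def pow_reference(x: float, y: float) -> float:
    # Compute in double precision, round once to float32.
    return to_f32(math.pow(x, y))

def pow_exp_log(x: float, y: float) -> float:
    # exp(y * log x) with float32 rounding at every step, loosely mimicking
    # how a decomposed single-precision implementation accumulates error.
    return to_f32(math.exp(to_f32(to_f32(y) * to_f32(math.log(x)))))

worst = max(ulps(pow_reference(x, y), pow_exp_log(x, y))
            for x in (0.5, 1.1, 2.0, 3.7, 9.9)
            for y in (0.3, 1.0, 2.5, 7.0))
print("worst ULP gap over the sample grid:", worst)
```

A real harness would sweep a much denser grid (or all float32 values), since the worst-case ULP gap typically lives at inputs a coarse grid misses.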

@EikanWang
Collaborator

EikanWang commented Feb 24, 2026

I'm not trying to comment on this PR, but just to study the impacts of 1-3ULP. @mlazos , may I know if there is any workload that suffers from the 1-3ULP difference? Are there any principles for evaluating the impact of different ULPs on accuracy? It would be helpful for other hardware backends as well.

@mlazos
Contributor Author

mlazos commented Feb 24, 2026

I'm not trying to comment on this PR, but just to study the impacts of 1-3ULP. @mlazos , may I know if there is any workload that suffers from the 1-3ULP difference? Are there any principles for evaluating the impact of different ULPs on accuracy? It would be helpful for other hardware backends as well.

We've established exact bitwise matching as a priority this half: on large, long training runs it's easier to just ensure bitwise matching than to figure out how a 1-3 ULP difference compounded over time.
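
The bitwise-matching bar described here is stricter than ordinary floating-point equality, which, for example, treats -0.0 and +0.0 as equal. A minimal illustration (bitwise_equal_f32 is a hypothetical helper, not a PyTorch API):

```python
import struct

def bitwise_equal_f32(a: float, b: float) -> bool:
    """True only when a and b round to identical float32 bit patterns.

    Stricter than ordinary ==: it distinguishes -0.0 from +0.0.
    """
    return struct.pack("<f", a) == struct.pack("<f", b)

print(bitwise_equal_f32(0.0, -0.0))  # False: the sign bits differ
print(0.0 == -0.0)                   # True under ordinary comparison
print(bitwise_equal_f32(1.5, 1.5))   # True: identical bit patterns
```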

@mlazos
Contributor Author

mlazos commented Feb 26, 2026

Closing - no longer needed

@mlazos mlazos closed this Feb 26, 2026
sandy-gags pushed a commit to sandy-gags/pytorch that referenced this pull request Mar 12, 2026 (same powf_cuda commit message; ghstack-source-id: d6c25af)
sandy-gags pushed a commit to sandy-gags/pytorch that referenced this pull request Mar 12, 2026 (same powf_cuda commit message; ghstack-source-id: d6c25af)
sandy-gags pushed a commit to sandy-gags/pytorch that referenced this pull request Mar 12, 2026 (same powf_cuda commit message; ghstack-source-id: f912d1a)
@github-actions github-actions Bot deleted the gh/mlazos/104/head branch March 29, 2026 02:23