[inductor] Add inline PTX pow for bitwise CUDA parity #175227
mlazos wants to merge 16 commits into gh/mlazos/104/base
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/175227
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 3f8e00b with merge base 197c376.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Add powf_cuda helper using inline PTX with non-FTZ (flush-to-zero) instructions to match CUDA's powf exactly. Triton's libdevice.pow uses FTZ instructions (fma.rn.ftz.f32) which cause 1-3 ULP differences compared to CUDA's powf (fma.rn.f32). This is used when config.emulate_precision_casts is enabled.
Co-authored-by: Claude <noreply@anthropic.com>
ghstack-source-id: 71954f9
Pull-Request: #175227
How does this relate to the libdevice fix? Was that insufficient?
```python
result = tl.inline_asm_elementwise(
    asm="""
    {
    .reg .pred p2, p3, p4, p5, p6, p7, p8;
    .reg .f32 f1, f2, f3, f4, f5, f6, f7, f8, f9, f10;
    .reg .f32 f11, f12, f13, f14, f15, f16, f17, f18, f19, f20;
    .reg .f32 f21, f22, f23, f24, f25, f26, f27, f28, f29, f30;
    .reg .f32 f31, f32, f33, f34, f35, f36, f37, f38, f39, f40;
    .reg .f32 f41, f42, f43, f44, f45, f46, f47, f48, f49, f50;
    .reg .f32 f51, f52, f53, f54, f55, f56, f57, f58, f59, f60;
    .reg .f32 f61, f62, f63, f64, f65, f66, f67, f68, f69, f70;
    .reg .f32 f71, f72, f73, f74, f75, f76, f77, f78, f79, f80;
    .reg .f32 base_in, exp_in, result_out;
    .reg .s32 r6, r7, r8, r9, r10, r11, r12, r13, r14;
```
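For context on the API used in this hunk, here is a minimal, hypothetical sketch of emitting a single non-FTZ instruction with `tl.inline_asm_elementwise`; this is the pattern the large blob above repeats for the full powf algorithm. The helper name and the `"=f,f,f,f"` constraint string are illustrative and not taken from this PR.

```python
import triton
import triton.language as tl


@triton.jit
def fma_rn_f32(a, b, c):
    # Emit a non-FTZ fused multiply-add directly as PTX. Because the asm
    # string is passed through verbatim, it should not be rewritten into the
    # fma.rn.ftz.f32 form that ordinary Triton float math can lower to.
    # "=f,f,f,f" are LLVM NVPTX constraints: one .f32 output, three .f32 inputs.
    return tl.inline_asm_elementwise(
        asm="fma.rn.f32 $0, $1, $2, $3;",
        constraints="=f,f,f,f",
        args=[a, b, c],
        dtype=tl.float32,
        is_pure=True,
        pack=1,
    )
```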
I wonder if this would be a bit more maintainable/readable if we had a triton.jit version of each of the composed operations, instead of one huge blob. Like, what are the equivalent CUDA operations? Could we just triton.jit each as a helper, and then compose?
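A rough sketch of what that decomposition might look like (the helper names are made up, and the follow-up replies explain why this route was rejected: Triton can still lower these ops with .ftz variants):

```python
import triton
import triton.language as tl


@triton.jit
def _log2_f32(x):
    return tl.log2(x)


@triton.jit
def _exp2_f32(x):
    return tl.exp2(x)


@triton.jit
def powf_composed(base, exp):
    # Main branch of powf expressed as exp2(y * log2(x)); a real powf also
    # needs the special cases (negative bases, zeros, infs/NaNs, integer
    # exponents) that the PTX blob above handles explicitly.
    return _exp2_f32(exp * _log2_f32(base))
```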
I don't think we can use triton.jit at all; it will then convert to FTZ.
Ah, there is another flag for that I recently added: disable_ftz in inductor, enable_reflect_ftz in triton.
Yeah, it didn't work @markus, I think there is a bug in Triton w/ respecting the FTZ stuff for libdevice?
Yeah, I actually don't need the libdevice fix with this. I left the libdevice one though because I think it's a good change anyway.
See my other comment above. I am not convinced this will match results on all inputs; I'd expect libdevice to be more accurate for a bunch of values.
I'm not trying to comment on this PR itself, just to study the impact of 1-3 ULP. @mlazos, may I know if there is any workload that suffers from the 1-3 ULP difference? Are there any principles for evaluating the impact of different ULP counts on accuracy? It would be helpful for other hardware backends as well.
We've established exact bitwise matching as a priority this half - on large and long training runs it's easier to just ensure bitwise matching than to figure out how 1-3 ULP changed things over time.
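Not from this PR, but one concrete way to quantify "1-3 ULP" when evaluating a backend is to count the representable float32 values between two results. A small NumPy sketch (the tensor names in the usage comment are placeholders):

```python
import numpy as np


def ulp_distance(a, b):
    """Count representable float32 values between a and b, elementwise.

    0 means bitwise-identical (treating -0.0 and +0.0 as equal); 1-3 is the
    kind of gap discussed above between libdevice.pow and CUDA's powf.
    """
    ai = np.asarray(a, dtype=np.float32).view(np.int32).astype(np.int64)
    bi = np.asarray(b, dtype=np.float32).view(np.int32).astype(np.int64)
    # Map the sign-magnitude bit patterns onto a monotonically ordered integer
    # line so that a simple subtraction counts the floats between two values.
    ai = np.where(ai < 0, np.int64(-(2**31)) - ai, ai)
    bi = np.where(bi < 0, np.int64(-(2**31)) - bi, bi)
    return np.abs(ai - bi)


# Hypothetical usage: compare a compiled kernel's output against eager CUDA.
# ulps = ulp_distance(compiled_out.cpu().numpy(), eager_out.cpu().numpy())
# print(ulps.max())  # 0 everywhere means bitwise parity (modulo signed zero)
```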
Closing - no longer needed
Stack from ghstack (oldest at bottom):
Add powf_cuda helper using inline PTX with non-FTZ (flush-to-zero)
instructions to match CUDA's powf exactly.
Triton's libdevice.pow uses FTZ instructions (fma.rn.ftz.f32) which
cause 1-3 ULP differences compared to CUDA's powf (fma.rn.f32).
This is used when eager_numerics.pow_precision is enabled (we will evaluate whether that can be on by default).
Co-authored-by: Claude <noreply@anthropic.com>
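As a usage note (not part of the PR itself), the gate described above would be flipped from user code roughly as sketched below. `emulate_precision_casts` is the existing inductor flag an earlier revision of this description referenced; `eager_numerics.pow_precision` is the name used here and, since the PR was closed, may never have landed.

```python
import torch
from torch._inductor import config as inductor_config

# Earlier revisions of this PR gated the helper on the existing
# emulate_precision_casts flag; the final revision names
# eager_numerics.pow_precision, which may not exist in released builds.
inductor_config.emulate_precision_casts = True


@torch.compile
def f(x, y):
    return x.pow(y)


out = f(torch.randn(1024, device="cuda"), torch.randn(1024, device="cuda"))
```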