[inductor] Use CUDA toolkit libdevice for Triton #174933
mlazos wants to merge 21 commits into gh/mlazos/103/base
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/174933
Note: Links to docs will display an error until the docs builds have been completed.
❌ 2 New Failures, 4 Unrelated Failures as of commit 681a870 with merge base 197c376.
NEW FAILURES - The following jobs have failed:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This PR needs a `release notes:` label.
Configure Triton to use the CUDA toolkit's libdevice instead of its bundled version. This ensures Triton's libdevice.pow matches CUDA's powf for bitwise precision in compiled optimizers. Triton bundles its own libdevice.10.bc which uses different polynomial coefficients than CUDA's version, causing 1-5 ULP differences in pow results: pow(0.9, 3.0): eager=0x3f3a9fbd triton=0x3f3a9fbe (1 ULP) pow(0.9, 100.0): eager=0x37ded005 triton=0x37ded000 (5 ULP) By setting TRITON_LIBDEVICE_PATH to the CUDA toolkit's libdevice, we eliminate these precision differences without needing ATen fallbacks that would break kernel fusion. The path is auto-detected via CUDA_HOME from torch.utils.cpp_extension. A warning is emitted if the CUDA libdevice cannot be found. Test: test_pow_scalar_tensor_precision in test_cuda_repro.py ghstack-source-id: 9dfbf52 Pull-Request: #174933
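For reference, a bitwise comparison like the one quoted above can be reproduced by reinterpreting the float32 results as their raw 32-bit patterns. The snippet below is a minimal sketch (the helper names are made up here, and it assumes a CUDA build with Triton available), not the actual test_pow_scalar_tensor_precision test:

```python
import torch

def float_bits(t: torch.Tensor) -> str:
    # Reinterpret the float32 bits as an unsigned 32-bit pattern in hex.
    return hex(t.view(torch.int32).item() & 0xFFFFFFFF)

def compare_pow(base: float, exponent: float) -> None:
    x = torch.tensor([base], device="cuda", dtype=torch.float32)
    eager = torch.pow(x, exponent)
    compiled = torch.compile(lambda t: torch.pow(t, exponent))(x)
    print(f"pow({base}, {exponent}): eager={float_bits(eager)} "
          f"compiled={float_bits(compiled)}")

compare_pow(0.9, 3.0)    # the 1 ULP case from the description
compare_pow(0.9, 100.0)  # the 5 ULP case from the description
```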
Configure Triton to use the CUDA toolkit's libdevice instead of its bundled version. This ensures Triton's libdevice.pow matches CUDA's powf for bitwise precision in compiled optimizers. Triton bundles its own libdevice.10.bc which uses different polynomial coefficients than CUDA's version, causing 1-5 ULP differences in pow results: pow(0.9, 3.0): eager=0x3f3a9fbd triton=0x3f3a9fbe (1 ULP) pow(0.9, 100.0): eager=0x37ded005 triton=0x37ded000 (5 ULP) By setting TRITON_LIBDEVICE_PATH to the CUDA toolkit's libdevice, we eliminate these precision differences without needing ATen fallbacks that would break kernel fusion. The path is auto-detected via CUDA_HOME from torch.utils.cpp_extension. A warning is emitted if the CUDA libdevice cannot be found. Test: test_sets_cuda_libdevice_path in test_compile_worker.py ghstack-source-id: 9dfbf52 Pull-Request: #174933
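As a rough illustration of what a test like test_sets_cuda_libdevice_path could assert (a sketch under assumptions, not the actual PyTorch test; in particular it assumes the environment variable has already been exported by the time the test runs):

```python
import os
from torch.utils.cpp_extension import CUDA_HOME

def test_sets_cuda_libdevice_path():
    path = os.environ.get("TRITON_LIBDEVICE_PATH")
    if CUDA_HOME is None:
        # Without a CUDA toolkit there is nothing to point at;
        # the change only emits a warning in that case.
        return
    assert path is not None
    assert path.endswith("libdevice.10.bc")
    assert os.path.exists(path)
```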
eellison
left a comment
Looks good, but you lost your test checking correctness. Can you add it back? And add one for the subprocess and one for the main process.
Yeah, so it turns out the next PR is what was actually needed, but I still think this change is beneficial. The final explanation is that Triton still ends up with FTZ instructions even when it's using the same libdevice, because Triton's own compilation settings dictate that (it only links against the libdevice, and then emits code based on its own compilation settings). I'll add the subprocess test back. Claude put it in the next PR and then I had it deleted by mistake.
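One way to see the FTZ point described here is to compile a trivial Triton kernel and grep the emitted PTX for .ftz-suffixed instructions. The sketch below assumes a recent Triton where launching a @triton.jit kernel returns the compiled kernel, whose .asm dict exposes the PTX text:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def demo_kernel(x_ptr, out_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    # Plain fp32 arithmetic is enough to surface .ftz instructions when
    # Triton compiles with flush-to-zero enabled.
    tl.store(out_ptr + offs, tl.sqrt(x) * 0.5 + x, mask=mask)

x = torch.rand(1024, device="cuda", dtype=torch.float32)
out = torch.empty_like(x)
handle = demo_kernel[(1,)](x, out, x.numel(), BLOCK=1024)
print("contains .ftz instructions:", ".ftz" in handle.asm["ptx"])
```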
Use float32 constant instead of int for reciprocal to ensure proper floating-point division when emulating eager division rounding. Test: test_div_precision_rounding in test_cuda_repro.py Pull Request resolved: #174751 Approved by: https://github.com/v0i0 ghstack dependencies: #174749, #174933
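For context on why the rounding emulation is sensitive to this: a true float32 division is rounded once, while a multiply-by-reciprocal is rounded twice, so the two can disagree in the last bit. A hedged, standalone illustration (not the Inductor codegen itself):

```python
import torch

a = torch.tensor([0.3], dtype=torch.float32)
b = torch.tensor([0.7], dtype=torch.float32)

direct = a / b             # one correctly rounded division
via_recip = a * (1.0 / b)  # two roundings: reciprocal, then multiply

same = direct.view(torch.int32).item() == via_recip.view(torch.int32).item()
print("bitwise identical:", same)  # may be False depending on the inputs
```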
@pytorchbot revert -m "Need to revert in order to revert #175555 - see D94699526" -c ghfirst
@pytorchbot successfully started a revert job. Check the current status here.
#174751)" This reverts commit 1b9046a. Reverted #174751 on behalf of https://github.com/jeanschmidt due to Need to revert in order to revert #175555 - see D94699526 ([comment](#174933 (comment)))
This reverts commit ba59c42. Reverted #174933 on behalf of https://github.com/jeanschmidt due to Need to revert in order to revert #175555 - see D94699526 ([comment](#174933 (comment)))
@mlazos your PR has been successfully reverted.
Starting merge as part of PR stack under #174751
@pytorchbot merge -i
Merge failed
Reason: This PR needs a `release notes:` label. If your changes are user facing and intended to be a part of release notes, please use a label starting with `release notes:`. If not, please add the `topic: not user facing` label. To add a label, you can comment to pytorchbot, for example `@pytorchbot label "topic: not user facing"`.
Details for Dev Infra team: Raised by workflow job
Configure Triton to use the CUDA toolkit's libdevice instead of its bundled version. The path is auto-detected via CUDA_HOME from torch.utils.cpp_extension. A warning is emitted if the CUDA libdevice cannot be found. Pull Request resolved: pytorch#174933 Approved by: https://github.com/v0i0, https://github.com/eellison ghstack dependencies: pytorch#174749
Stack from ghstack (oldest at bottom):
Configure Triton to use the CUDA toolkit's libdevice instead of its bundled
version.
The path is auto-detected via CUDA_HOME from torch.utils.cpp_extension.
A warning is emitted if the CUDA libdevice cannot be found.
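A minimal sketch of the auto-detection described above, assuming the standard CUDA toolkit layout (nvvm/libdevice/libdevice.10.bc); this is illustrative, not the exact Inductor code:

```python
import logging
import os

from torch.utils.cpp_extension import CUDA_HOME

log = logging.getLogger(__name__)

def maybe_set_triton_libdevice_path() -> None:
    if CUDA_HOME is None:
        log.warning("CUDA_HOME not found; Triton will use its bundled libdevice")
        return
    # Standard CUDA toolkit layout for the libdevice bitcode library.
    candidate = os.path.join(CUDA_HOME, "nvvm", "libdevice", "libdevice.10.bc")
    if not os.path.exists(candidate):
        log.warning("CUDA libdevice not found at %s", candidate)
        return
    # Triton picks this up via TRITON_LIBDEVICE_PATH when linking libdevice.
    os.environ.setdefault("TRITON_LIBDEVICE_PATH", candidate)
```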
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo