
[inductor] Use CUDA toolkit libdevice for Triton#174933

Closed
mlazos wants to merge 21 commits into gh/mlazos/103/base from gh/mlazos/103/head

Conversation

[ghstack-poisoned]

pytorch-bot Bot commented Feb 13, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/174933

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 4 Unrelated Failures

As of commit 681a870 with merge base 197c376:

NEW FAILURES - The following jobs have failed:

  • Nitpicker / triage (gh)
    API rate limit exceeded for installation. If you reach out to GitHub Support for help, please include the request ID 1802:33016A:6615ED:1B58A14:69A2284D and timestamp 2026-02-27 23:27:09 UTC. For more on scraping GitHub and how it may affect your rights, please review our Terms of Service (https://docs.github.com/en/site-policy/github-terms/github-terms-of-service)
  • trunk / linux-jammy-rocm-py3.10 / test (distributed, 3, 3, linux.rocm.gpu.gfx950.4) (gh)
    distributed/_composable/test_replicate_training

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.


pytorch-bot Bot commented Feb 13, 2026

This PR needs a release notes: label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

[ghstack-poisoned]
Comment threads: torch/_inductor/runtime/compile_tasks.py (one open, one outdated); test/inductor/test_cuda_repro.py (two outdated).
mlazos added a commit that referenced this pull request Feb 18, 2026
Configure Triton to use the CUDA toolkit's libdevice instead of its bundled
version. This ensures Triton's libdevice.pow matches CUDA's powf for bitwise
precision in compiled optimizers.

Triton bundles its own libdevice.10.bc which uses different polynomial
coefficients than CUDA's version, causing 1-5 ULP differences in pow results:
  pow(0.9, 3.0):   eager=0x3f3a9fbd  triton=0x3f3a9fbe  (1 ULP)
  pow(0.9, 100.0): eager=0x37ded005  triton=0x37ded000  (5 ULP)

By setting TRITON_LIBDEVICE_PATH to the CUDA toolkit's libdevice, we eliminate
these precision differences without needing ATen fallbacks that would break
kernel fusion.

The path is auto-detected via CUDA_HOME from torch.utils.cpp_extension.
A warning is emitted if the CUDA libdevice cannot be found.

Test: test_pow_scalar_tensor_precision in test_cuda_repro.py
ghstack-source-id: 9dfbf52
Pull-Request: #174933
[ghstack-poisoned]
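The bit patterns in the commit message can be checked by reinterpreting float32 values as integers; a minimal sketch (the helpers are illustrative, not code from this PR):

```python
import struct

def f32_bits(x: float) -> int:
    """Bit pattern of x after rounding to float32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def ulp_distance(a_bits: int, b_bits: int) -> int:
    """ULP distance between two same-sign float32 bit patterns."""
    return abs(a_bits - b_bits)

# The pow(0.9, 100.0) case quoted above:
eager_bits, triton_bits = 0x37DED005, 0x37DED000
print(ulp_distance(eager_bits, triton_bits))  # -> 5
print(hex(f32_bits(0.9 ** 100.0)))            # float64 pow rounded to float32, for reference
```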
mlazos added a commit that referenced this pull request Feb 18, 2026

Same commit message as above, with the test changed to:

Test: test_sets_cuda_libdevice_path in test_compile_worker.py
ghstack-source-id: 9dfbf52
Pull-Request: #174933
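A rough sketch of the auto-detection the commit message describes, assuming the standard CUDA toolkit layout (the function name is hypothetical; CUDA_HOME comes from torch.utils.cpp_extension and libdevice.10.bc sits at its documented toolkit location):

```python
import logging
import os

from torch.utils.cpp_extension import CUDA_HOME  # None when no toolkit is found

log = logging.getLogger(__name__)

def _maybe_set_triton_libdevice_path() -> None:
    """Point Triton at the CUDA toolkit's libdevice, if it can be located."""
    if os.environ.get("TRITON_LIBDEVICE_PATH") or CUDA_HOME is None:
        return  # respect an explicit override; nothing to do without a toolkit
    # libdevice.10.bc lives at a fixed location inside the toolkit
    libdevice = os.path.join(CUDA_HOME, "nvvm", "libdevice", "libdevice.10.bc")
    if os.path.exists(libdevice):
        os.environ["TRITON_LIBDEVICE_PATH"] = libdevice
    else:
        log.warning(
            "CUDA libdevice not found at %s; Triton will fall back to its bundled copy",
            libdevice,
        )
```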
mlazos added a commit that referenced this pull request Feb 18, 2026

Same commit message as above.

Test: test_sets_cuda_libdevice_path in test_compile_worker.py
ghstack-source-id: b7b0de4
Pull-Request: #174933
[ghstack-poisoned]
Contributor

eellison left a comment


Looks good, but you lost your test checking correctness. Can you add it back, and add one for the subprocess and one for the main process?

@mlazos changed the title from "[inductor] Use CUDA toolkit libdevice for Triton pow precision" to "[inductor] Use CUDA toolkit libdevice for Triton" on Feb 18, 2026
Contributor Author

mlazos commented Feb 18, 2026

Looks good, but you lost your test checking correctness. Can you add it back, and add one for the subprocess and one for the main process?

Yeah, so it turns out the next PR is what was actually needed, but I still think this change is beneficial. The final explanation is that Triton still ends up with FTZ instructions even when it's using the same libdevice, because Triton's compilation settings dictate that (it only links against libdevice, and then emits code based on its own compilation settings).

I'll add the subprocess test back. Claude put it in the next PR and then I had it delete it by mistake.
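For context, FTZ (flush-to-zero) instructions zero out subnormal float32 values. A small sketch of how the difference can surface (illustrative only; whether the compiled kernel actually flushes depends on Triton's codegen flags, per the explanation above):

```python
import torch

def halve(x):
    return x * 0.5

# 1e-42 is a subnormal float32 (below the ~1.18e-38 normal minimum)
denorm = torch.full((4,), 1e-42, dtype=torch.float32, device="cuda")
print(halve(denorm))                 # eager kernel
print(torch.compile(halve)(denorm))  # FTZ codegen would flush these to 0.0
```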

[ghstack-poisoned]
pytorchmergebot pushed a commit that referenced this pull request Feb 27, 2026
Use float32 constant instead of int for reciprocal to ensure proper
floating-point division when emulating eager division rounding.

Test: test_div_precision_rounding in test_cuda_repro.py

Pull Request resolved: #174751
Approved by: https://github.com/v0i0
ghstack dependencies: #174749, #174933
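Division via a reciprocal multiply adds a second rounding step, which is why the dtype of the emitted constant matters when emulating eager rounding; a small illustration of the effect (not the actual inductor codegen):

```python
import torch

x = torch.tensor([3.0], dtype=torch.float32)
y = torch.tensor([7.0], dtype=torch.float32)

true_div  = x / y          # one correctly-rounded division
recip_div = x * (1.0 / y)  # reciprocal first, then multiply: two roundings

# Compare the raw bit patterns; the two results can differ by a ULP.
print(true_div.view(torch.int32).item(), recip_div.view(torch.int32).item())
```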
@jeanschmidt
Contributor

@pytorchbot revert -m "Need to revert in order to revert #175555 - see D94699526" -c ghfirst

@pytorchmergebot
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

pytorchmergebot added a commit that referenced this pull request Feb 27, 2026
#174751)"

This reverts commit 1b9046a.

Reverted #174751 on behalf of https://github.com/jeanschmidt due to "Need to revert in order to revert #175555 - see D94699526" (comment on #174933)
pytorchmergebot added a commit that referenced this pull request Feb 27, 2026
This reverts commit ba59c42.

Reverted #174933 on behalf of https://github.com/jeanschmidt due to "Need to revert in order to revert #175555 - see D94699526" (comment on #174933)
@pytorchmergebot
Collaborator

@mlazos your PR has been successfully reverted.

pytorchmergebot added the "Reverted" and "ci-no-td (Do not run TD on this PR)" labels on Feb 27, 2026
@pytorchmergebot
Collaborator

Starting merge as part of PR stack under #174751

Contributor Author

mlazos commented Feb 28, 2026

@pytorchbot merge -i

@pytorchmergebot
Collaborator

Merge failed

Reason: This PR needs a release notes: label
If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Details for Dev Infra team: raised by workflow job.


pytorch-bot Bot commented Feb 28, 2026

This PR needs a release notes: label (same reminder as above).

pytorchmergebot pushed a commit that referenced this pull request Feb 28, 2026

Same commit message as above (float32 constant for the reciprocal).

Pull Request resolved: #174751
Approved by: https://github.com/v0i0
ghstack dependencies: #174749, #174933
sandy-gags pushed a commit to sandy-gags/pytorch that referenced this pull request Mar 12, 2026 (same commit message as above; ghstack-source-id: 9013dfc; Pull-Request: pytorch/pytorch#174933).
sandy-gags pushed two further commits to sandy-gags/pytorch referencing this pull request Mar 12, 2026 (same commit message; ghstack-source-id: 2d30b32; Pull-Request: pytorch/pytorch#174933).
EmanueleCoradin pushed a commit to EmanueleCoradin/pytorch that referenced this pull request Mar 30, 2026

Configure Triton to use the CUDA toolkit's libdevice instead of its bundled version. The path is auto-detected via CUDA_HOME from torch.utils.cpp_extension; a warning is emitted if the CUDA libdevice cannot be found.

Pull Request resolved: pytorch#174933
Approved by: https://github.com/v0i0, https://github.com/eellison
ghstack dependencies: pytorch#174749
EmanueleCoradin pushed further commits to EmanueleCoradin/pytorch mirroring the merge, revert, and re-land sequence above (Mar 30, 2026).
github-actions Bot deleted the gh/mlazos/103/head branch March 31, 2026 02:23
