[inductor] Use CUDA toolkit libdevice for Triton #174933
mlazos wants to merge 21 commits into gh/mlazos/103/base
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/174933
Note: Links to docs will display an error until the docs builds have been completed.
❌ 2 New Failures, 4 Unrelated Failures as of commit 681a870 with merge base 197c376.
NEW FAILURES - The following jobs have failed:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This PR needs a `release notes:` label.
Configure Triton to use the CUDA toolkit's libdevice instead of its bundled version. This ensures Triton's libdevice.pow matches CUDA's powf for bitwise precision in compiled optimizers. Triton bundles its own libdevice.10.bc which uses different polynomial coefficients than CUDA's version, causing 1-5 ULP differences in pow results: pow(0.9, 3.0): eager=0x3f3a9fbd triton=0x3f3a9fbe (1 ULP) pow(0.9, 100.0): eager=0x37ded005 triton=0x37ded000 (5 ULP) By setting TRITON_LIBDEVICE_PATH to the CUDA toolkit's libdevice, we eliminate these precision differences without needing ATen fallbacks that would break kernel fusion. The path is auto-detected via CUDA_HOME from torch.utils.cpp_extension. A warning is emitted if the CUDA libdevice cannot be found. Test: test_pow_scalar_tensor_precision in test_cuda_repro.py ghstack-source-id: 9dfbf52 Pull-Request: #174933
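For reference, a bitwise comparison like the one quoted above can be reproduced by reinterpreting the float32 results as their raw 32-bit patterns. The snippet below is a minimal sketch (the helper names are made up here, and it assumes a CUDA build with Triton available), not the actual test_pow_scalar_tensor_precision test:

```python
import torch

def float_bits(t: torch.Tensor) -> str:
    # Reinterpret the float32 bits as an unsigned 32-bit pattern in hex.
    return hex(t.view(torch.int32).item() & 0xFFFFFFFF)

def compare_pow(base: float, exponent: float) -> None:
    x = torch.tensor([base], device="cuda", dtype=torch.float32)
    eager = torch.pow(x, exponent)
    compiled = torch.compile(lambda t: torch.pow(t, exponent))(x)
    print(f"pow({base}, {exponent}): eager={float_bits(eager)} "
          f"compiled={float_bits(compiled)}")

compare_pow(0.9, 3.0)    # the 1 ULP case from the description
compare_pow(0.9, 100.0)  # the 5 ULP case from the description
```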
Configure Triton to use the CUDA toolkit's libdevice instead of its bundled version. This ensures Triton's libdevice.pow matches CUDA's powf for bitwise precision in compiled optimizers. Triton bundles its own libdevice.10.bc which uses different polynomial coefficients than CUDA's version, causing 1-5 ULP differences in pow results: pow(0.9, 3.0): eager=0x3f3a9fbd triton=0x3f3a9fbe (1 ULP) pow(0.9, 100.0): eager=0x37ded005 triton=0x37ded000 (5 ULP) By setting TRITON_LIBDEVICE_PATH to the CUDA toolkit's libdevice, we eliminate these precision differences without needing ATen fallbacks that would break kernel fusion. The path is auto-detected via CUDA_HOME from torch.utils.cpp_extension. A warning is emitted if the CUDA libdevice cannot be found. Test: test_sets_cuda_libdevice_path in test_compile_worker.py ghstack-source-id: 9dfbf52 Pull-Request: #174933
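As a rough illustration of what a test like test_sets_cuda_libdevice_path could assert (a sketch under assumptions, not the actual PyTorch test; in particular it assumes the environment variable has already been exported by the time the test runs):

```python
import os
from torch.utils.cpp_extension import CUDA_HOME

def test_sets_cuda_libdevice_path():
    path = os.environ.get("TRITON_LIBDEVICE_PATH")
    if CUDA_HOME is None:
        # Without a CUDA toolkit there is nothing to point at;
        # the change only emits a warning in that case.
        return
    assert path is not None
    assert path.endswith("libdevice.10.bc")
    assert os.path.exists(path)
```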
eellison
left a comment
Looks good, but you lost your test checking correctness. Can you add it back? And add one for the subprocess and one for the main process.
Yeah, so it turns out the next PR is what was actually needed, but I still think this change is beneficial. The final explanation is that Triton still ends up with FTZ instructions even when it's using the same libdevice, because Triton's own compilation settings dictate that (it only links against the libdevice, and then emits code based on its own compilation settings). I'll add the subprocess test back. Claude put it in the next PR and then I had it deleted by mistake.
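One way to see the FTZ point described here is to compile a trivial Triton kernel and grep the emitted PTX for .ftz-suffixed instructions. The sketch below assumes a recent Triton where launching a @triton.jit kernel returns the compiled kernel, whose .asm dict exposes the PTX text:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def demo_kernel(x_ptr, out_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    # Plain fp32 arithmetic is enough to surface .ftz instructions when
    # Triton compiles with flush-to-zero enabled.
    tl.store(out_ptr + offs, tl.sqrt(x) * 0.5 + x, mask=mask)

x = torch.rand(1024, device="cuda", dtype=torch.float32)
out = torch.empty_like(x)
handle = demo_kernel[(1,)](x, out, x.numel(), BLOCK=1024)
print("contains .ftz instructions:", ".ftz" in handle.asm["ptx"])
```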
Use float32 constant instead of int for reciprocal to ensure proper floating-point division when emulating eager division rounding. Test: test_div_precision_rounding in test_cuda_repro.py Pull Request resolved: #174751 Approved by: https://github.com/v0i0 ghstack dependencies: #174749, #174933
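For context on why the rounding emulation is sensitive to this: a true float32 division is rounded once, while a multiply-by-reciprocal is rounded twice, so the two can disagree in the last bit. A hedged, standalone illustration (not the Inductor codegen itself):

```python
import torch

a = torch.tensor([0.3], dtype=torch.float32)
b = torch.tensor([0.7], dtype=torch.float32)

direct = a / b             # one correctly rounded division
via_recip = a * (1.0 / b)  # two roundings: reciprocal, then multiply

same = direct.view(torch.int32).item() == via_recip.view(torch.int32).item()
print("bitwise identical:", same)  # may be False depending on the inputs
```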
@pytorchbot revert -m "Need to revert in order to revert #175555 - see D94699526" -c ghfirst
@pytorchbot successfully started a revert job. Check the current status here.
#174751)" This reverts commit 1b9046a. Reverted #174751 on behalf of https://github.com/jeanschmidt due to Need to revert in order to revert #175555 - see D94699526 ([comment](#174933 (comment)))
This reverts commit ba59c42. Reverted #174933 on behalf of https://github.com/jeanschmidt due to Need to revert in order to revert #175555 - see D94699526 ([comment](#174933 (comment)))
@mlazos your PR has been successfully reverted.
Starting merge as part of PR stack under #174751
@pytorchbot merge -i
Merge failed
Reason: This PR needs a `release notes:` label. If your changes are user facing and intended to be a part of release notes, please use a label starting with `release notes:`. If not, please add the `topic: not user facing` label. To add a label, you can comment to pytorchbot, for example `@pytorchbot label "topic: not user facing"`.
Details for Dev Infra team: Raised by workflow job
Configure Triton to use the CUDA toolkit's libdevice instead of its bundled version. The path is auto-detected via CUDA_HOME from torch.utils.cpp_extension. A warning is emitted if the CUDA libdevice cannot be found. Pull Request resolved: pytorch#174933 Approved by: https://github.com/v0i0, https://github.com/eellison ghstack dependencies: pytorch#174749
Stack from ghstack (oldest at bottom):
Configure Triton to use the CUDA toolkit's libdevice instead of its bundled
version.
The path is auto-detected via CUDA_HOME from torch.utils.cpp_extension.
A warning is emitted if the CUDA libdevice cannot be found.
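A minimal sketch of the auto-detection described above, assuming the standard CUDA toolkit layout (nvvm/libdevice/libdevice.10.bc); this is illustrative, not the exact Inductor code:

```python
import logging
import os

from torch.utils.cpp_extension import CUDA_HOME

log = logging.getLogger(__name__)

def maybe_set_triton_libdevice_path() -> None:
    if CUDA_HOME is None:
        log.warning("CUDA_HOME not found; Triton will use its bundled libdevice")
        return
    # Standard CUDA toolkit layout for the libdevice bitcode library.
    candidate = os.path.join(CUDA_HOME, "nvvm", "libdevice", "libdevice.10.bc")
    if not os.path.exists(candidate):
        log.warning("CUDA libdevice not found at %s", candidate)
        return
    # Triton picks this up via TRITON_LIBDEVICE_PATH when linking libdevice.
    os.environ.setdefault("TRITON_LIBDEVICE_PATH", candidate)
```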
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo