Fix AOTI incorrect loads from bool tensor pointers in user-defined Triton kernels #176353
mergennachin wants to merge 1 commit into main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/176353
Note: links to docs will display an error until the docs builds have completed.
❌ 1 New Failure, 1 Unrelated Failure as of commit b0a89d6 with merge base da0eb66.
NEW FAILURE: one job has failed.
UNSTABLE: one job is marked as unstable, possibly due to flakiness on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Force-pushed from 2e75703 to 6212b3f (Compare)
Fix AOTI incorrect loads from bool tensor pointers in user-defined Triton kernels

User-defined Triton kernels (via @triton.jit or @triton_op) that take bool tensor arguments produce incorrect results when compiled through AOTI. The root cause is that Triton's mangle_type maps torch.bool tensors to *i1/*u1 (1-bit pointer), but PyTorch stores bool tensors as uint8 (1 byte per element). The compiled cubin kernel generates bit-packed loads for *i1/*u1 pointers, reading garbled data from the byte-addressed memory.

Inductor-generated kernels already work around this (Triton issue #2151) by adding .to(tl.int1) after loads and converting to int8 for stores. But user-defined kernels don't get these workarounds, since their code is user-written.

Fix: override *i1/*u1 -> *u8 in the mangle_type signature for user-defined kernels. This makes the compiled kernel use byte-addressed loads that match PyTorch's bool memory layout.
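To make the root cause concrete, here is a pure-Python illustration (no Triton or PyTorch required; the byte values are made up for the example) of how reading byte-per-bool storage as packed bits garbles the data:

```python
# PyTorch stores each bool in its own byte: [True, False, True, True]
# is laid out in memory as the bytes 01 00 01 01.
stored = bytes([1, 0, 1, 1])

# A *u8-style (byte-addressed) read recovers the values correctly.
byte_addressed = [b != 0 for b in stored]

# A *i1-style (bit-packed) read of the same memory instead takes the
# first four bits of byte 0, which only encodes the FIRST element.
bit_packed = [(stored[0] >> i) & 1 == 1 for i in range(4)]

print(byte_addressed)  # [True, False, True, True]
print(bit_packed)      # [True, False, False, False] -- garbled
```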
Force-pushed from 6212b3f to b0a89d6 (Compare)
Review comment on the changed lines:

    result = "*u8"
    return result
    ...
    else:

Reviewer: I guess the else branch is for older versions of Triton. Probably no need to worry about it.
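A minimal sketch of the remapping the reviewed lines implement; the function name and the dict-shaped signature are assumptions for illustration, not the actual Inductor API:

```python
def remap_bool_ptr_types(signature: dict) -> dict:
    """Remap Triton's 1-bit bool pointer types (*i1/*u1) to byte pointers
    (*u8) so the compiled kernel's loads match PyTorch's uint8-backed
    bool storage. Hypothetical helper, for illustration only."""
    remap = {"*i1": "*u8", "*u1": "*u8"}
    return {arg: remap.get(ty, ty) for arg, ty in signature.items()}

# Example: only the bool tensor pointer is rewritten.
sig = {"in_ptr0": "*i1", "out_ptr0": "*fp32", "n_elements": "i32"}
print(remap_bool_ptr_types(sig))
# {'in_ptr0': '*u8', 'out_ptr0': '*fp32', 'n_elements': 'i32'}
```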
@pytorchbot merge
Merge failed. Reason: this PR needs a required label; if one is missing, please add it. To add a label, you can comment to pytorchbot. (Details for Dev Infra team; raised by workflow job.)
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 job has failed; the first few are: linux-aarch64 / linux-jammy-aarch64-py3.10 / test (openreg, 1, 1, lf.linux.arm64.m7g.4xlarge). (Details for Dev Infra team; raised by workflow job.)
@pytorchbot merge -f "Unrelated CI failures"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: <urlopen error [Errno 111] Connection refused>. (Details for Dev Infra team; raised by workflow job.)
@pytorchbot merge -f "Unrelated CI failures"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Summary:
1. #173662 added more tests to test/inductor/test_triton_kernels.py, and #175416 enabled the cpp-wrapper test on test/inductor/test_triton_kernels.py. There was a land race, so #173662 didn't have the failing CI signal at landing time. Forward-fix by updating the code-checking target for cpp-wrapper.
2. #176353 also had a land race. Skip for now; the fix is coming later.

Pull Request resolved: #176745
Approved by: https://github.com/AmesingFlank, https://github.com/zou3519
Fix AOTI incorrect loads from bool tensor pointers in user-defined Triton kernels (pytorch#176353)

User-defined Triton kernels (via @triton.jit or @triton_op) that take bool tensor arguments produce incorrect results when compiled through AOTI. The root cause is that Triton's mangle_type maps torch.bool tensors to *i1/*u1 (1-bit pointer), but PyTorch stores bool tensors as uint8 (1 byte per element). The compiled cubin kernel generates bit-packed loads for *i1/*u1 pointers, reading garbled data from the byte-addressed memory.

Inductor-generated kernels already work around this (Triton issue triton-lang/triton#2151 and the corresponding workaround in PyTorch, https://github.com/pytorch/pytorch/blob/da0eb6647126f1b0e57112a79a83f55393de635f/torch/_inductor/codegen/triton.py#L3657-L3661) by adding .to(tl.int1) after loads and converting to int8 for stores. But user-defined kernels don't get these workarounds, since their code is user-written.

Fix: override *i1/*u1 -> *u8 in the mangle_type signature for user-defined kernels. This makes the compiled kernel use byte-addressed loads that match PyTorch's bool memory layout.

Test Plan:
```
# Existing bool param test (should still pass)
python -m pytest test/inductor/test_aot_inductor.py -k "test_triton_kernel_bool_param" -x -v
# New bool tensor test
python -m pytest test/inductor/test_aot_inductor.py -k "test_triton_kernel_bool_tensor_arg" -x -v
# Inductor torch.compile path
python -m pytest test/inductor/test_torchinductor.py -k "test_triton_kernel_bool_tensor_arg" -x -v
# Broader regression check: all user-defined triton kernel tests
python -m pytest test/inductor/test_aot_inductor.py -k "triton_kernel" -x -v
```

Pull Request resolved: pytorch#176353
Approved by: https://github.com/desertfire
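For contrast with the fix for user-defined kernels, here is a sketch of the load/store casts that Inductor already emits for its generated kernels; the emitter helpers below are hypothetical, for illustration only, not the actual codegen API:

```python
def emit_bool_load(ptr_expr: str) -> str:
    # Load one byte per element, then reinterpret as 1-bit booleans
    # (the .to(tl.int1) workaround after loads).
    return f"tl.load({ptr_expr}).to(tl.int1)"

def emit_bool_store(ptr_expr: str, value_expr: str) -> str:
    # Widen each boolean back to one byte before writing
    # (the int8 conversion for stores).
    return f"tl.store({ptr_expr}, {value_expr}.to(tl.int8))"

print(emit_bool_load("mask_ptr + offsets"))
# tl.load(mask_ptr + offsets).to(tl.int1)
```

User-defined kernels cannot be rewritten this way, which is why the fix instead adjusts the pointer type in the kernel's signature.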