
Fix AOTI incorrect loads from bool tensor pointers in user-defined Triton kernels#176353

Closed
mergennachin wants to merge 1 commit into main from fix-triton-bool-tensor-aoti

Conversation

@mergennachin
Contributor

@mergennachin mergennachin commented Mar 4, 2026

User-defined Triton kernels (via @triton.jit or @triton_op) that take
bool tensor arguments produce incorrect results when compiled through
AOTI. The root cause is that Triton's mangle_type maps torch.bool
tensors to *i1/*u1 (1-bit pointer), but PyTorch stores bool tensors as
uint8 (1 byte per element). The compiled cubin kernel generates
bit-packed loads for *i1/*u1 pointers, reading garbled data from the
byte-addressed memory.
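To make the mismatch concrete, here is a toy, pure-Python illustration (not Triton's actual codegen; the bit order within the packed byte is an assumption) of how a bit-packed interpretation garbles byte-per-element bool storage:

```python
# PyTorch stores each bool element in its own byte (uint8 layout):
byte_per_element = bytes([1, 0, 1, 1, 0, 0, 0, 1])  # 8 bools, 8 bytes

def bitpacked_read(buf, i):
    # How a *i1/*u1 pointer is interpreted: one bit per element,
    # so 8 bools are assumed to live in a single byte.
    return (buf[i // 8] >> (i % 8)) & 1

def byte_read(buf, i):
    # How the data is actually laid out: one byte per element.
    return buf[i]

bit_view = [bitpacked_read(byte_per_element, i) for i in range(8)]
byte_view = [byte_read(byte_per_element, i) for i in range(8)]
# bit_view reads only the bits of the first byte (0x01), so
# elements 1..7 come out wrong; byte_view recovers the real values.
```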

Inductor-generated kernels already work around this (Triton issue triton-lang/triton#2151 and the corresponding workaround in pytorch, https://github.com/pytorch/pytorch/blob/da0eb6647126f1b0e57112a79a83f55393de635f/torch/_inductor/codegen/triton.py#L3657-L3661):

```python
# Workaround for https://github.com/triton-lang/triton/issues/2151
# tl.load returns int8 when loading from pointer to int1
# NOTE: Currently causes hangs on bool UTs for ROCm
line += ".to(tl.int1)"
```

by adding .to(tl.int1) after loads and converting to int8 for stores.
But user-defined kernels don't get these workarounds since their code is
user-written.

Fix: override *i1/*u1 -> *u8 in the mangle_type signature for
user-defined kernels. This makes the compiled kernel use byte-addressed
loads matching PyTorch's bool memory layout.
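The gist of the fix can be sketched as follows (a standalone sketch with a hypothetical helper name, not the actual Inductor code path):

```python
def remap_bool_pointer(mangled_type: str) -> str:
    """Rewrite 1-bit pointer type strings to byte-addressed *u8,
    matching PyTorch's one-byte-per-element bool storage."""
    if mangled_type in ("*i1", "*u1"):
        return "*u8"
    return mangled_type

# e.g. a kernel signature with a float-tensor pointer, a bool-tensor
# pointer, and a scalar int argument:
signature = ["*fp32", "*i1", "i32"]
remapped = [remap_bool_pointer(t) for t in signature]
# the bool-tensor pointer is now byte-addressed: ["*fp32", "*u8", "i32"]
```

With the remapped signature, the compiled kernel issues one-byte loads and stores for bool tensors, so no bit-unpacking is applied to data that was never packed.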

Test Plan:

```
# Existing bool param test (should still pass)
python -m pytest test/inductor/test_aot_inductor.py -k "test_triton_kernel_bool_param" -x -v

# New bool tensor test
python -m pytest test/inductor/test_aot_inductor.py -k "test_triton_kernel_bool_tensor_arg" -x -v

# Inductor torch.compile path
python -m pytest test/inductor/test_torchinductor.py -k "test_triton_kernel_bool_tensor_arg" -x -v

# Broader regression check — all user-defined triton kernel tests
python -m pytest test/inductor/test_aot_inductor.py -k "triton_kernel" -x -v
```

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo

@pytorch-bot

pytorch-bot Bot commented Mar 4, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/176353

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 1 Unrelated Failure

As of commit b0a89d6 with merge base da0eb66:

NEW FAILURE - The following job has failed:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot

pytorch-bot Bot commented Mar 4, 2026

This PR needs a release notes: label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@mergennachin mergennachin requested review from dolpm, jansel and oulgen March 4, 2026 00:12
@mergennachin mergennachin marked this pull request as draft March 4, 2026 00:16
@mergennachin mergennachin requested a review from desertfire March 4, 2026 00:18
@mergennachin mergennachin force-pushed the fix-triton-bool-tensor-aoti branch from 2e75703 to 6212b3f Compare March 4, 2026 04:55
@mergennachin mergennachin force-pushed the fix-triton-bool-tensor-aoti branch from 6212b3f to b0a89d6 Compare March 4, 2026 04:56
@mergennachin mergennachin marked this pull request as ready for review March 4, 2026 13:40
@mergennachin mergennachin added the topic: bug fixes topic category label Mar 4, 2026
```python
result = "*u8"
return result
else:
```
Contributor


I guess the else branch is for older versions of Triton. Probably no need to worry about it.

@mergennachin
Contributor Author

@pytorchbot merge

@pytorch-bot pytorch-bot Bot added the ciflow/trunk Trigger trunk jobs on your pull request label Mar 4, 2026
@pytorchmergebot
Collaborator

Merge failed

Reason: This PR needs a release notes: label
If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Details for Dev Infra team: raised by workflow job


@mergennachin
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 jobs have failed, first few of them are: linux-aarch64 / linux-jammy-aarch64-py3.10 / test (openreg, 1, 1, lf.linux.arm64.m7g.4xlarge)

Details for Dev Infra team: raised by workflow job

@mergennachin
Contributor Author

@pytorchbot merge -f "Unrelated CI failures"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: <urlopen error [Errno 111] Connection refused>

Details for Dev Infra team: raised by workflow job

@mergennachin
Contributor Author

@pytorchbot merge -f "Unrelated CI failures"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

desertfire added a commit that referenced this pull request Mar 6, 2026
Summary:
1. #173662 added more tests to test/inductor/test_triton_kernels.py, and #175416 enable cpp-wrapper test on test/inductor/test_triton_kernels.py. So there was a land race and #173662 didn't have the failing CI signal at the landing time.

Forward fix by updating the code checking target for cpp-wrapper.

2. #176353 also had land race. Skip now and the fix is coming later.

[ghstack-poisoned]
desertfire added a commit that referenced this pull request Mar 6, 2026
Summary:
1. #173662 added more tests to test/inductor/test_triton_kernels.py, and #175416 enable cpp-wrapper test on test/inductor/test_triton_kernels.py. So there was a land race and #173662 didn't have the failing CI signal at the landing time.

Forward fix by updating the code checking target for cpp-wrapper.

2. #176353 also had land race. Skip now and the fix is coming later.

ghstack-source-id: c856a94
Pull Request resolved: #176745
pytorchmergebot pushed a commit that referenced this pull request Mar 7, 2026
Summary:
1. #173662 added more tests to test/inductor/test_triton_kernels.py, and #175416 enable cpp-wrapper test on test/inductor/test_triton_kernels.py. So there was a land race and #173662 didn't have the failing CI signal at the landing time.

Forward fix by updating the code checking target for cpp-wrapper.

2. #176353 also had land race. Skip now and the fix is coming later.
Pull Request resolved: #176745
Approved by: https://github.com/AmesingFlank, https://github.com/zou3519
EmanueleCoradin pushed a commit to EmanueleCoradin/pytorch that referenced this pull request Mar 30, 2026
…iton kernels (pytorch#176353)

Pull Request resolved: pytorch#176353
Approved by: https://github.com/desertfire
EmanueleCoradin pushed a commit to EmanueleCoradin/pytorch that referenced this pull request Mar 30, 2026
Pull Request resolved: pytorch#176745
Approved by: https://github.com/AmesingFlank, https://github.com/zou3519
@github-actions github-actions Bot deleted the fix-triton-bool-tensor-aoti branch April 4, 2026 02:23