Fix AOTI incorrect loads from bool tensor pointers in user-defined Triton kernels #176353
mergennachin wants to merge 1 commit into main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/176353
Note: links to docs will display an error until the docs builds have completed.
❌ 1 New Failure, 1 Unrelated Failure as of commit b0a89d6 with merge base da0eb66.
NEW FAILURE: one job has failed.
UNSTABLE: one job is marked as unstable, possibly due to flakiness on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Force-pushed from 2e75703 to 6212b3f (Compare)
Fix AOTI incorrect loads from bool tensor pointers in user-defined Triton kernels

User-defined Triton kernels (via @triton.jit or @triton_op) that take bool tensor arguments produce incorrect results when compiled through AOTI. The root cause is that Triton's mangle_type maps torch.bool tensors to *i1/*u1 (1-bit pointer), but PyTorch stores bool tensors as uint8 (1 byte per element). The compiled cubin kernel generates bit-packed loads for *i1/*u1 pointers, reading garbled data from the byte-addressed memory.

Inductor-generated kernels already work around this (Triton issue #2151) by adding .to(tl.int1) after loads and converting to int8 for stores. But user-defined kernels don't get these workarounds, since their code is user-written.

Fix: override *i1/*u1 -> *u8 in the mangle_type signature for user-defined kernels. This makes the compiled kernel use byte-addressed loads that match PyTorch's bool memory layout.
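To make the root cause concrete, here is a pure-Python illustration (no Triton or PyTorch required; the byte values are made up for the example) of how reading byte-per-bool storage as packed bits garbles the data:

```python
# PyTorch stores each bool in its own byte: [True, False, True, True]
# is laid out in memory as the bytes 01 00 01 01.
stored = bytes([1, 0, 1, 1])

# A *u8-style (byte-addressed) read recovers the values correctly.
byte_addressed = [b != 0 for b in stored]

# A *i1-style (bit-packed) read of the same memory instead takes the
# first four bits of byte 0, which only encodes the FIRST element.
bit_packed = [(stored[0] >> i) & 1 == 1 for i in range(4)]

print(byte_addressed)  # [True, False, True, True]
print(bit_packed)      # [True, False, False, False] -- garbled
```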
Force-pushed from 6212b3f to b0a89d6 (Compare)
Review comment on the changed lines:

    result = "*u8"
    return result
    ...
    else:

Reviewer: I guess the else branch is for older versions of Triton. Probably no need to worry about it.
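A minimal sketch of the remapping the reviewed lines implement; the function name and the dict-shaped signature are assumptions for illustration, not the actual Inductor API:

```python
def remap_bool_ptr_types(signature: dict) -> dict:
    """Remap Triton's 1-bit bool pointer types (*i1/*u1) to byte pointers
    (*u8) so the compiled kernel's loads match PyTorch's uint8-backed
    bool storage. Hypothetical helper, for illustration only."""
    remap = {"*i1": "*u8", "*u1": "*u8"}
    return {arg: remap.get(ty, ty) for arg, ty in signature.items()}

# Example: only the bool tensor pointer is rewritten.
sig = {"in_ptr0": "*i1", "out_ptr0": "*fp32", "n_elements": "i32"}
print(remap_bool_ptr_types(sig))
# {'in_ptr0': '*u8', 'out_ptr0': '*fp32', 'n_elements': 'i32'}
```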
@pytorchbot merge
Merge failed. Reason: this PR needs a required label; if one is missing, please add it. To add a label, you can comment to pytorchbot. (Details for Dev Infra team; raised by workflow job.)
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 job has failed; the first few are: linux-aarch64 / linux-jammy-aarch64-py3.10 / test (openreg, 1, 1, lf.linux.arm64.m7g.4xlarge). (Details for Dev Infra team; raised by workflow job.)
@pytorchbot merge -f "Unrelated CI failures"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: <urlopen error [Errno 111] Connection refused>. (Details for Dev Infra team; raised by workflow job.)
@pytorchbot merge -f "Unrelated CI failures"
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Summary:
1. #173662 added more tests to test/inductor/test_triton_kernels.py, and #175416 enabled the cpp-wrapper test on test/inductor/test_triton_kernels.py. There was a land race, so #173662 didn't have the failing CI signal at landing time. Forward-fix by updating the code-checking target for cpp-wrapper.
2. #176353 also had a land race. Skip for now; the fix is coming later.

Pull Request resolved: #176745
Approved by: https://github.com/AmesingFlank, https://github.com/zou3519
Fix AOTI incorrect loads from bool tensor pointers in user-defined Triton kernels (pytorch#176353)

User-defined Triton kernels (via @triton.jit or @triton_op) that take bool tensor arguments produce incorrect results when compiled through AOTI. The root cause is that Triton's mangle_type maps torch.bool tensors to *i1/*u1 (1-bit pointer), but PyTorch stores bool tensors as uint8 (1 byte per element). The compiled cubin kernel generates bit-packed loads for *i1/*u1 pointers, reading garbled data from the byte-addressed memory.

Inductor-generated kernels already work around this (Triton issue triton-lang/triton#2151 and the corresponding workaround in PyTorch, https://github.com/pytorch/pytorch/blob/da0eb6647126f1b0e57112a79a83f55393de635f/torch/_inductor/codegen/triton.py#L3657-L3661) by adding .to(tl.int1) after loads and converting to int8 for stores. But user-defined kernels don't get these workarounds, since their code is user-written.

Fix: override *i1/*u1 -> *u8 in the mangle_type signature for user-defined kernels. This makes the compiled kernel use byte-addressed loads that match PyTorch's bool memory layout.

Test Plan:
```
# Existing bool param test (should still pass)
python -m pytest test/inductor/test_aot_inductor.py -k "test_triton_kernel_bool_param" -x -v
# New bool tensor test
python -m pytest test/inductor/test_aot_inductor.py -k "test_triton_kernel_bool_tensor_arg" -x -v
# Inductor torch.compile path
python -m pytest test/inductor/test_torchinductor.py -k "test_triton_kernel_bool_tensor_arg" -x -v
# Broader regression check: all user-defined triton kernel tests
python -m pytest test/inductor/test_aot_inductor.py -k "triton_kernel" -x -v
```

Pull Request resolved: pytorch#176353
Approved by: https://github.com/desertfire
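For contrast with the fix for user-defined kernels, here is a sketch of the load/store casts that Inductor already emits for its generated kernels; the emitter helpers below are hypothetical, for illustration only, not the actual codegen API:

```python
def emit_bool_load(ptr_expr: str) -> str:
    # Load one byte per element, then reinterpret as 1-bit booleans
    # (the .to(tl.int1) workaround after loads).
    return f"tl.load({ptr_expr}).to(tl.int1)"

def emit_bool_store(ptr_expr: str, value_expr: str) -> str:
    # Widen each boolean back to one byte before writing
    # (the int8 conversion for stores).
    return f"tl.store({ptr_expr}, {value_expr}.to(tl.int8))"

print(emit_bool_load("mask_ptr + offsets"))
# tl.load(mask_ptr + offsets).to(tl.int1)
```

User-defined kernels cannot be rewritten this way, which is why the fix instead adjusts the pointer type in the kernel's signature.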