
[inductor] Add TMA support for lazy Triton kernel compilation#175548

Closed
desertfire wants to merge 15 commits into gh/desertfire/654/base from
gh/desertfire/654/head

Conversation

@desertfire
Contributor

@desertfire desertfire commented Feb 23, 2026

Stack from ghstack (oldest at bottom):

Summary: Host-side TMA descriptors (StableTMADescriptor) are now handled in the
lazy compile path. The generated C++ wrapper receives both the TMA
descriptor and the underlying tensor as parameters. On the first call,
the tensor is passed to Python, where _wrap_tma_args calls
TensorDescriptor.from_tensor() to reconstruct the descriptor for Triton's
autotuner. On cached launches, the StableTMADescriptor fields are unpacked
directly into the kernel launch args. Scratch space is now allocated
dynamically at runtime using sizes from the autotuning result.

Authored with Claude.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo
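
The first-call vs. cached-launch split described above can be sketched in plain Python. This is a minimal illustration only: the class, helpers, and the tuple encoding below are stand-ins, not Inductor's actual `_wrap_tma_args` or `StableTMADescriptor` implementations.

```python
# Illustrative sketch of the lazy-compile TMA flow (all names are
# hypothetical stand-ins for the real Inductor/Triton types).
from dataclasses import dataclass


@dataclass
class StableTMADescriptor:
    tensor: object        # the underlying tensor backing the descriptor
    block_shape: tuple    # block dims captured when the descriptor was built


def _wrap_tma_args(args):
    """First (autotuning) call: rebuild host-side descriptors so the
    autotuner sees real descriptor objects. In Triton this would be
    TensorDescriptor.from_tensor(a.tensor, a.block_shape); here we just
    tag the arg so the flow is visible."""
    wrapped = []
    for a in args:
        if isinstance(a, StableTMADescriptor):
            wrapped.append(("tma_desc", a.tensor, a.block_shape))
        else:
            wrapped.append(a)
    return wrapped


def launch(kernel_cache, key, args):
    if key not in kernel_cache:
        # First call: route through Python so autotuning can run.
        wrapped = _wrap_tma_args(args)
        kernel_cache[key] = True  # pretend compile + autotune happened
        return wrapped
    # Cached launch: unpack descriptor fields directly into launch args.
    flat = []
    for a in args:
        if isinstance(a, StableTMADescriptor):
            flat.extend([a.tensor, *a.block_shape])
        else:
            flat.append(a)
    return flat
```

The point of the split is that the slow Python round-trip happens once per kernel; every later launch stays on the fast path with pre-flattened fields.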


[ghstack-poisoned]
@pytorch-bot

pytorch-bot Bot commented Feb 23, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/175548

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (2 Unrelated Failures)

As of commit d306ef2 with merge base b180c2f:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

desertfire added a commit that referenced this pull request Feb 23, 2026
ghstack-source-id: 8b1e0a7
Pull Request resolved: #175548
desertfire added a commit that referenced this pull request Feb 24, 2026
ghstack-source-id: eb027b8
Pull Request resolved: #175548
desertfire added a commit that referenced this pull request Feb 25, 2026
ghstack-source-id: bef5145
Pull Request resolved: #175548
desertfire added a commit that referenced this pull request Feb 26, 2026
ghstack-source-id: 32b8558
Pull Request resolved: #175548
desertfire added a commit that referenced this pull request Feb 26, 2026
ghstack-source-id: ab2a6eb
Pull Request resolved: #175548
desertfire added a commit that referenced this pull request Feb 27, 2026
ghstack-source-id: be73cda
Pull Request resolved: #175548
desertfire added a commit that referenced this pull request Feb 27, 2026
ghstack-source-id: 006c684
Pull Request resolved: #175548
desertfire added a commit that referenced this pull request Feb 27, 2026
ghstack-source-id: ecafe2f
Pull Request resolved: #175548
desertfire added a commit that referenced this pull request Mar 2, 2026
ghstack-source-id: 036c080
Pull Request resolved: #175548
desertfire added a commit that referenced this pull request Mar 2, 2026
ghstack-source-id: fcd9ed8
Pull Request resolved: #175548
desertfire added a commit that referenced this pull request Mar 4, 2026
ghstack-source-id: 3491f11
Pull Request resolved: #175548
sig_type = signature.get(key, "")
if isinstance(sig_type, str) and signature_is_tma_desc(sig_type):
    if isinstance(
        raw_arg, (TMADescriptorExperimental, TMADescriptorStable)
Contributor


Potentially brittle?

Contributor Author


I will raise an AssertionError in the else branch.
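
A minimal sketch of that fix, using stand-in classes rather than the real Inductor types (`signature_is_tma_desc` and the descriptor classes below are hypothetical placeholders mirroring the snippet under review):

```python
# Illustrative sketch of the discussed fix: fail loudly when the signature
# declares a TMA descriptor but the runtime arg is neither descriptor type.
# The classes and helper are stand-ins, not the real Inductor code.
class TMADescriptorExperimental: ...
class TMADescriptorStable: ...


def signature_is_tma_desc(sig_type: str) -> bool:
    # Hypothetical check; the real helper inspects Triton signature strings.
    return "tma_descriptor" in sig_type


def check_tma_arg(key, signature, raw_arg):
    sig_type = signature.get(key, "")
    if isinstance(sig_type, str) and signature_is_tma_desc(sig_type):
        if isinstance(raw_arg, (TMADescriptorExperimental, TMADescriptorStable)):
            return raw_arg
        else:
            # The else branch now raises instead of silently falling through.
            raise AssertionError(
                f"arg {key!r} declared as a TMA descriptor but got "
                f"{type(raw_arg).__name__}"
            )
    return raw_arg
```

Raising in the else branch turns a silent mismatch into an immediate, debuggable failure, which addresses the brittleness concern.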


Differential Revision: [D96125146](https://our.internmc.facebook.com/intern/diff/D96125146)

[ghstack-poisoned]
@desertfire
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge failed

Reason: This PR has internal changes and must be landed via Phabricator! Please try re-importing/re-exporting the PR!

Details for Dev Infra team: raised by workflow job.

@desertfire
Contributor Author

@desertfire has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@desertfire
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge failed

Reason: This PR has internal changes and must be landed via Phabricator! Please try re-importing/re-exporting the PR!

Details for Dev Infra team: raised by workflow job.

@desertfire
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

pytorchmergebot pushed a commit that referenced this pull request Mar 14, 2026
…#177306)

Remove most cpp_wrapper skips from test_torchinductor.py since they can
pass now. For some tests, change their skips to be conditioned on
autotune_at_compile_time instead of cpp_wrapper.

Fix `run_and_get_kernels` to extract kernel code using `R"TRITON(...)"` pattern
for lazy compile cpp_wrapper mode, since kernels are embedded in C++ raw strings
rather than Python triple-quoted strings.
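
The raw-string extraction could be sketched as follows. The regex and helper name are assumptions for illustration, not the actual `run_and_get_kernels` code:

```python
import re

# With lazy compile + cpp_wrapper, Triton kernel source is embedded in the
# generated C++ as a raw string literal R"TRITON(...)TRITON" rather than a
# Python triple-quoted string, so kernel extraction must match that form.
CPP_RAW_KERNEL = re.compile(r'R"TRITON\((.*?)\)TRITON"', re.DOTALL)


def extract_kernels(cpp_source: str):
    """Return the Triton kernel bodies embedded in generated C++ code."""
    return CPP_RAW_KERNEL.findall(cpp_source)
```

The non-greedy `(.*?)` with `re.DOTALL` stops at the first `)TRITON"` terminator, so multiple embedded kernels in one wrapper file are each matched separately.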

The remaining skips require more feature parity work to match cpp_wrapper with
python_wrapper.

Authored with Claude.

Pull Request resolved: #177306
Approved by: https://github.com/PaulZhang12
ghstack dependencies: #175548
pytorchmergebot pushed a commit that referenced this pull request Mar 14, 2026
Add `aten._grouped_mm.default` to the AOTI fallback ops list so that
a c-shim is generated, enabling cpp_wrapper mode for grouped_mm.

Authored with Claude.

Pull Request resolved: #177307
Approved by: https://github.com/yushangdi
ghstack dependencies: #175548, #177306
AaronWang04 pushed a commit to AaronWang04/pytorch that referenced this pull request Mar 24, 2026
Pull Request resolved: pytorch#175548
Approved by: https://github.com/PaulZhang12
EmanueleCoradin pushed a commit to EmanueleCoradin/pytorch that referenced this pull request Mar 30, 2026
Pull Request resolved: pytorch#175548
Approved by: https://github.com/PaulZhang12
EmanueleCoradin pushed a commit to EmanueleCoradin/pytorch that referenced this pull request Mar 30, 2026
Pull Request resolved: pytorch#177306
Approved by: https://github.com/PaulZhang12
ghstack dependencies: pytorch#175548
EmanueleCoradin pushed a commit to EmanueleCoradin/pytorch that referenced this pull request Mar 30, 2026
Pull Request resolved: pytorch#177307
Approved by: https://github.com/yushangdi
ghstack dependencies: pytorch#175548, pytorch#177306
AaronWang04 pushed a commit to AaronWang04/pytorch that referenced this pull request Mar 31, 2026
Pull Request resolved: pytorch#177306
Approved by: https://github.com/PaulZhang12
ghstack dependencies: pytorch#175548
AaronWang04 pushed a commit to AaronWang04/pytorch that referenced this pull request Mar 31, 2026
Pull Request resolved: pytorch#177307
Approved by: https://github.com/yushangdi
ghstack dependencies: pytorch#175548, pytorch#177306
@github-actions github-actions Bot deleted the gh/desertfire/654/head branch April 13, 2026 02:25
