[inductor] Add TMA support for lazy Triton kernel compilation #175548
desertfire wants to merge 15 commits into gh/desertfire/654/base from
Conversation
Summary: Host-side TMA descriptors (StableTMADescriptor) are now handled in the lazy compile path. The generated C++ wrapper receives both the TMA descriptor and the underlying tensor as parameters. On the first call, the tensor is passed to Python, where _wrap_tma_args reconstructs TensorDescriptor.from_tensor() for Triton's autotuner. On cached launches, the StableTMADescriptor fields are unpacked directly into the kernel launch args. Scratch space is now allocated dynamically at runtime using sizes from the autotuning result. Authored with Claude.
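For readers unfamiliar with the flow, here is a minimal Python sketch of the first-call path the summary describes. `TensorDescriptor.from_tensor` is Triton's API as named in the summary; the `StableTMADescriptor` layout and the helper's exact shape are assumptions for illustration, not the PR's actual code.

```python
# Sketch (not the PR's exact code) of the first-call path: the C++ wrapper
# hands both the descriptor and its underlying tensor to Python, and
# _wrap_tma_args rebuilds a TensorDescriptor so Triton's autotuner can
# benchmark with a real host-side TMA argument.
from dataclasses import dataclass

import torch
from triton.tools.tensor_descriptor import TensorDescriptor


@dataclass
class StableTMADescriptor:  # hypothetical stand-in for the inductor type
    tensor: torch.Tensor      # underlying storage the descriptor views
    block_shape: list[int]    # TMA tile shape fixed at compile time


def _wrap_tma_args(args: list) -> list:
    """Rebuild TensorDescriptor objects for Triton's autotuner (sketch)."""
    wrapped = []
    for arg in args:
        if isinstance(arg, StableTMADescriptor):
            # First call: give Triton the tensor; it constructs the
            # device-side TMA descriptor itself.
            wrapped.append(
                TensorDescriptor.from_tensor(arg.tensor, arg.block_shape)
            )
        else:
            wrapped.append(arg)
    return wrapped
```

On cached launches this Python round-trip is skipped: the wrapper unpacks the descriptor's fields straight into the kernel launch args, so a sketch like the above only runs on the first call per kernel.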
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/175548
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (2 unrelated failures) As of commit d306ef2 with merge base b180c2f:
FLAKY - The following job failed but was likely due to flakiness present on trunk:
UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
```python
sig_type = signature.get(key, "")
if isinstance(sig_type, str) and signature_is_tma_desc(sig_type):
    if isinstance(
        raw_arg, (TMADescriptorExperimental, TMADescriptorStable)
    ):
```
I will raise an AssertionError in the else branch.
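For context, a sketch of the hunk with the change described above. `signature_is_tma_desc` and the two descriptor classes come from the diff; the surrounding scaffolding and the error message are assumptions.

```python
sig_type = signature.get(key, "")
if isinstance(sig_type, str) and signature_is_tma_desc(sig_type):
    if isinstance(raw_arg, (TMADescriptorExperimental, TMADescriptorStable)):
        ...  # unpack the TMA descriptor into wrapper parameters
    else:
        # The signature promises a TMA descriptor, so anything else is a bug.
        raise AssertionError(
            f"expected a TMA descriptor for {key!r}, got {type(raw_arg).__name__}"
        )
```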
…ion" Summary: Host-side TMA descriptors (StableTMADescriptor) are now handled in the lazy compile path. The generated C++ wrapper receives both the TMA descriptor and the underlying tensor as parameters. On the first call, the tensor is passed to Python where _wrap_tma_args reconstructs TensorDescriptor.from_tensor() for Triton's autotuner. On cached launches, the StableTMADescriptor fields are unpacked directly into kernel launch args. Scratch space is now allocated dynamically at runtime using sizes from the autotuning result. Authored with Claude. cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy kadeng muchulee8 amjames chauhang aakhundov coconutruben jataylo Differential Revision: [D96125146](https://our.internmc.facebook.com/intern/diff/D96125146) [ghstack-poisoned]
@pytorchbot merge
Merge failed. Reason: This PR has internal changes and must be landed via Phabricator! Please try reimporting/re-exporting the PR! Details for Dev Infra team: raised by workflow job.
@desertfire has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@pytorchbot merge
Merge failed. Reason: This PR has internal changes and must be landed via Phabricator! Please try reimporting/re-exporting the PR! Details for Dev Infra team: raised by workflow job.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
…#177306) Remove most cpp_wrapper skips from test_torchinductor.py since they can pass now. For some tests, change their skips to be conditioned on autotune_at_compile_time instead of cpp_wrapper. Fix `run_and_get_kernels` to extract kernel code using `R"TRITON(...)"` pattern for lazy compile cpp_wrapper mode, since kernels are embedded in C++ raw strings rather than Python triple-quoted strings. The remaining skips require more feature parity work to match cpp_wrapper with python_wrapper. Authored with Claude. Pull Request resolved: #177306 Approved by: https://github.com/PaulZhang12 ghstack dependencies: #175548
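A minimal sketch of the extraction that fix describes, assuming kernels are embedded in the generated C++ as raw string literals delimited by `TRITON`; the helper name and return shape are simplified stand-ins for `run_and_get_kernels`.

```python
import re

# Match the body between R"TRITON( and )TRITON" in generated C++ code;
# DOTALL lets the pattern span the multi-line kernel source.
_TRITON_RAW_STRING = re.compile(r'R"TRITON\((.*?)\)TRITON"', re.DOTALL)


def extract_kernels_from_cpp_wrapper(cpp_code: str) -> list[str]:
    """Pull Triton kernel sources out of a lazy-compile C++ wrapper (sketch)."""
    return _TRITON_RAW_STRING.findall(cpp_code)
```

Unlike the Python-wrapper path, where kernels sit in triple-quoted Python strings, the lazy-compile cpp_wrapper embeds them verbatim in the generated C++, so the test helper has to match the raw-string delimiters instead.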
Add `aten._grouped_mm.default` to the AOTI fallback ops list so that a c-shim is generated, enabling cpp_wrapper mode for grouped_mm. Authored with Claude. Pull Request resolved: #177307 Approved by: https://github.com/yushangdi ghstack dependencies: #175548, #177306
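Illustratively, that change amounts to one new entry in the AOTI fallback-op table from which c-shims are generated; the file location (torchgen/aoti/fallback_ops.py upstream) and the schema shown are assumptions here.

```python
# Each listed op gets a generated c-shim so cpp_wrapper/AOTI can call it
# without going through Python. Entry metadata shown is illustrative.
inductor_fallback_ops: dict[str, dict] = {
    # ... existing entries ...
    "aten._grouped_mm.default": {},  # new: enables cpp_wrapper for grouped_mm
}
```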
Stack from ghstack (oldest at bottom):
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo