[inductor] Add lazy Triton kernel compilation for cpp-wrapper #175416
desertfire wants to merge 11 commits into gh/desertfire/653/base from
Conversation
Summary: This adds support for lazy Triton kernel compilation when using cpp-wrapper mode (TORCHINDUCTOR_CPP_WRAPPER=1). Previously, the cpp-wrapper mode relied on autotune_at_compile_time to compile Triton kernels into cubin files, which had two problems: 1. Kernels were not tuned with real inputs, which can lead to sub-optimal performance or even IMA. 2. Extra GPU memory consumption, which can trigger OOM compared to the default Inductor. Key changes: 1. CppWrapperGpu.generate_lazy() method that generates deferred kernel compilation code. In the generated cpp code, a Triton kernel is compiled with AsyncCompile when the code is loaded; the first time the kernel is called, it performs autotuning and stores the generated cubin file and metadata. Subsequent kernel calls only need to compute grids and then cudaLaunch. 2. triton_lazy_compile.py takes care of the interaction with the cpp code at runtime, triggering Triton kernel compilation and autotuning. Limitation: This version does not support TMA yet. It will come in future PRs. Authored with Claude. [ghstack-poisoned]
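For orientation, here is a minimal Python sketch of the runtime flow described above. The helper names (`run_kernel_lazily`, `_compile_and_autotune`, the `_launch_params` cache) are illustrative stand-ins for what triton_lazy_compile.py does, not the PR's actual API:

```python
from typing import Any

# kernel_name -> cached cubin path and launch metadata
_launch_params: dict[str, dict[str, Any]] = {}

def run_kernel_lazily(kernel_name: str, args: list[Any]) -> dict[str, Any]:
    """First call: compile and autotune with the real inputs, then cache the
    cubin path and launch metadata. Later calls return the cached params so
    the C++ side can compute grids and cudaLaunch directly."""
    params = _launch_params.get(kernel_name)
    if params is None:
        params = _compile_and_autotune(kernel_name, args)
        _launch_params[kernel_name] = params
    return params

def _compile_and_autotune(kernel_name: str, args: list[Any]) -> dict[str, Any]:
    # Placeholder: the real code compiles the Triton kernel and benchmarks
    # candidate configs against the actual runtime inputs.
    return {"cubin_path": f"/tmp/{kernel_name}.cubin", "num_warps": 4}
```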
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/175416
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure, 2 Unrelated Failures as of commit f86da80 with merge base fabd1c4. NEW FAILURE - The following job has failed:
BROKEN TRUNK - The following job failed but was also present on the merge base: 👉 Rebase onto the `viable/strict` branch to avoid these failures
UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
    @patch.object(config, "profile_bandwidth", True)
    @skip_if_cpp_wrapper("cpp output code is different")
I tried to limit the changes to these test files to make reviewing easier. More clean-up and fixes will come in follow-up PRs.
    stream: Any,
    args: list[Any],
) -> tuple[
    str,  # cubin_path
Can we make this an individual class/dict instead of just a tuple? It will be hard to track.
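For illustration, the suggestion could look like the following sketch; the field names are guessed from the tuple comments in this diff and the cached-params keys below, not taken from the actual change:

```python
from dataclasses import dataclass

@dataclass
class KernelLaunchParams:
    # Named fields are self-documenting at call sites, unlike a positional
    # tuple whose meaning lives only in comments.
    cubin_path: str
    mangled_name: str
    num_warps: int
```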
    raise RuntimeError(f"cubin path not found in cached params for {kernel_name}")
    ...
    mangled_name = cached_params.get("mangled_name", "")
    num_warps = cached_params.get("num_warps", 4)
Where are these defaults coming from?
I think this one and xblock = config.get("XBLOCK", 128) are somewhat ad-hoc. We probably should just error out if these values are missing. The other default values of 1 are fine.
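A minimal sketch of the error-out pattern being suggested, using a hypothetical `require` helper and key names taken from the diff above:

```python
def require(params: dict, key: str, kernel_name: str):
    """Raise instead of silently falling back to an ad-hoc default."""
    if key not in params:
        raise RuntimeError(f"{key} not found in cached params for {kernel_name}")
    return params[key]

# Example usage with a made-up params dict:
cached_params = {"num_warps": 8, "XBLOCK": 64}
num_warps = require(cached_params, "num_warps", "triton_poi_fused_add_0")
xblock = require(cached_params, "XBLOCK", "triton_poi_fused_add_0")
```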
How does the autotuning happen? Once inputs are encountered? And does compilation happen on C++ program startup?
Yes, it happens the first time we run a kernel, through runTritonKernelWithAutotune (C++) -> run_triton_kernel_with_autotune (Python).
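A small Python sketch of that load-time vs. first-call split (the class and helper names are illustrative, not the PR's actual code):

```python
import threading

class LazyKernel:
    def __init__(self, source: str):
        # Load time: start compiling in the background (the AsyncCompile role).
        self._compiled: str | None = None
        self._params: dict | None = None
        self._thread = threading.Thread(target=self._compile, args=(source,))
        self._thread.start()

    def _compile(self, source: str) -> None:
        # Stand-in for the real Triton compilation producing a cubin.
        self._compiled = f"cubin-for-{hash(source)}"

    def launch(self, args: list) -> None:
        if self._params is None:
            # First call: wait for compilation, then autotune on real inputs.
            self._thread.join()
            self._params = {"cubin": self._compiled, "num_warps": 4}
        # Subsequent calls: just compute the grid and launch.
        print("launching", self._params["cubin"], "with", len(args), "args")
```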
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
Merge failed. Reason: 1 job has failed: inductor / inductor-test / test (inductor_timm, 2, 2, linux.g5.4xlarge.nvidia.gpu). Details for Dev Infra team: Raised by workflow job
@pytorchbot merge -i
Merge started. Your change will be merged while ignoring the following 3 checks: inductor / inductor-cpu-test / test (cpu_inductor_torchbench, 1, 2, linux.2xlarge.amx, unstable), inductor / inductor-test / test (inductor_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu), inductor / inductor-test / test (inductor_timm, 2, 2, linux.g5.4xlarge.nvidia.gpu). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
…h#175416) Summary: This adds support for lazy Triton kernel compilation when using cpp-wrapper mode (TORCHINDUCTOR_CPP_WRAPPER=1). Previously, the cpp-wrapper mode relied on autotune_at_compile_time to compile Triton kernels into cubin files, which had two problems: 1. Kernels were not tuned with real inputs, which can lead to sub-optimal performance or even IMA. 2. Extra GPU memory consumption, which can trigger OOM compared to the default Inductor. Key changes: 1. CppWrapperGpu.generate_lazy() method that generates deferred kernel compilation code. In the generated cpp code, a Triton kernel is compiled with AsyncCompile when the code is loaded; the first time the kernel is called, it performs autotuning and stores the generated cubin file and metadata. Subsequent kernel calls only need to compute grids and then cudaLaunch. 2. triton_lazy_compile.py takes care of the interaction with the cpp code at runtime, triggering Triton kernel compilation and autotuning. Limitation: This version does not support TMA yet. It will come in future PRs. https://gist.github.com/desertfire/8ee1ca889f411d3f2fca08d7658ea88e is an example of the generated code. Authored with Claude. Differential Revision: [D95154174](https://our.internmc.facebook.com/intern/diff/D95154174) Pull Request resolved: pytorch#175416 Approved by: https://github.com/mlazos, https://github.com/PaulZhang12
Summary: 1. #173662 added more tests to test/inductor/test_triton_kernels.py, and #175416 enabled cpp-wrapper tests on test/inductor/test_triton_kernels.py, so there was a land race and #173662 didn't have the failing CI signal at landing time. Forward fix by updating the code-checking target for cpp-wrapper. 2. #176353 also had a land race. Skip for now; the fix is coming later. Pull Request resolved: #176745 Approved by: https://github.com/AmesingFlank, https://github.com/zou3519
… XPU (#179239)
The lazy Triton kernel compilation feature (#175416) introduced CUDA-specific assumptions that broke XPU cpp-wrapper codegen. Fix the XPU incompatibilities:
- lazy_triton_compile.h: conditional include for XPU vs CUDA device headers; change the runTritonKernelWithAutotune stream param from cudaStream_t to void* (both pointer types convert implicitly)
- cpp_wrapper_gpu.py: use a device-appropriate pointer type for scratch allocations instead of the hardcoded CUdeviceptr
- sycl_runtime_wrappers.h: query threads_per_warp from the kernel object via compile_sub_group_size instead of requiring it as an external parameter
- Remove the threads_per_warp plumbing from triton_heuristics.py and cpp_wrapper_gpu.py codegen since it is no longer needed
Pull Request resolved: #179239 Approved by: https://github.com/desertfire
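To make the codegen change concrete, here is a sketch of a device-dependent choice like the scratch-pointer fix. The function name is an assumption, and so is the XPU pointer type emitted here; the real logic lives in cpp_wrapper_gpu.py:

```python
def scratch_ptr_type(device_type: str) -> str:
    # Hardcoding CUdeviceptr broke XPU; emit a pointer type per backend.
    # "void*" for the non-CUDA branch is an assumption for illustration.
    return "CUdeviceptr" if device_type == "cuda" else "void*"

assert scratch_ptr_type("cuda") == "CUdeviceptr"
assert scratch_ptr_type("xpu") == "void*"
```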
Stack from ghstack (oldest at bottom):
Summary: This adds support for lazy Triton kernel compilation when using
cpp-wrapper mode (TORCHINDUCTOR_CPP_WRAPPER=1). Previously,
the cpp-wrapper mode relied on autotune_at_compile_time to
compile Triton kernels into cubin files, which had two problems:
1. Kernels were not tuned with real inputs, which can lead to
sub-optimal performance or even IMA.
2. Extra GPU memory consumption, which can trigger OOM compared
to the default Inductor.
Key changes:
1. CppWrapperGpu.generate_lazy() method that generates deferred
kernel compilation code. In the generated cpp code, a Triton
kernel is compiled with AsyncCompile when the code is
loaded; the first time the kernel is called, it
performs autotuning and stores the generated cubin file and
metadata. Subsequent kernel calls only need to compute
grids and then cudaLaunch.
2. triton_lazy_compile.py takes care of the interaction with
the cpp code at runtime, triggering Triton kernel compilation
and autotuning (see the sketch after this list).
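A sketch of what the per-kernel cached metadata could contain, with the field set inferred from the review diffs above (illustrative, not the exact schema):

```python
cached_params = {
    "cubin_path": "/path/to/kernel.cubin",     # compiled binary, loaded once
    "mangled_name": "triton_poi_fused_add_0",  # symbol to launch in the cubin
    "num_warps": 4,                            # winning autotune config
    "XBLOCK": 128,                             # block size used for grid math
}
```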
Limitation:
This version does not support TMA yet. It will come in future PRs.
https://gist.github.com/desertfire/8ee1ca889f411d3f2fca08d7658ea88e is an example of the generated code.
Authored with Claude.
Differential Revision: D95154174
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo