
[inductor] Add lazy Triton kernel compilation for cpp-wrapper #175416

Closed
desertfire wants to merge 11 commits into gh/desertfire/653/base from gh/desertfire/653/head

Conversation

@desertfire
Contributor

@desertfire desertfire commented Feb 20, 2026

Stack from ghstack (oldest at bottom):

Summary: This adds support for lazy Triton kernel compilation when using
cpp-wrapper mode (TORCHINDUCTOR_CPP_WRAPPER=1). Previously,
cpp-wrapper mode relied on autotune_at_compile_time to
compile Triton kernels into cubin files, which had two problems:

  1. Kernels were not tuned with real inputs, which can lead to
    sub-optimal performance or even an IMA (illegal memory access).

  2. Extra GPU memory consumption, which can trigger OOMs compared
    to the default Inductor.

Key changes:

  1. A CppWrapperGpu.generate_lazy() method that generates deferred
    kernel compilation code. In the generated cpp code, a Triton
    kernel is compiled with AsyncCompile when the code is loaded;
    the first time the kernel is called, it performs autotuning and
    stores the generated cubin file and metadata. Subsequent kernel
    calls only need to compute grids and then cudaLaunch (a minimal
    sketch of this lifecycle follows the list below).

  2. triton_lazy_compile.py takes care of the interaction with the
    cpp code at runtime, triggering Triton kernel compilation and
    autotuning.
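
To make the flow concrete, here is a minimal, hypothetical Python sketch of the first-call lifecycle described in item 1. `LazyKernelState`, `autotune_and_cache`, and `launch_cached` are illustrative stand-ins, not the PR's actual API; in the PR the fast path lives in generated C++.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


def autotune_and_cache(kernel_name: str, stream: Any, args: list[Any]) -> tuple[str, dict[str, Any]]:
    """Placeholder for the slow path: benchmark candidate configs on the
    real inputs, write the winning cubin to disk, and return its path plus
    the launch metadata needed for subsequent launches."""
    return f"/tmp/{kernel_name}.cubin", {"num_warps": 8, "grid": (1, 1, 1)}


def launch_cached(cubin_path: str, meta: dict[str, Any], stream: Any, args: list[Any]) -> None:
    """Placeholder for the fast path: compute the grid and launch the
    already-compiled cubin (the generated C++ does this via cudaLaunch)."""
    print(f"launching {cubin_path} with {meta}")


@dataclass
class LazyKernelState:
    kernel_name: str
    cubin_path: Optional[str] = None            # filled in on the first call
    launch_meta: dict[str, Any] = field(default_factory=dict)

    def run(self, stream: Any, args: list[Any]) -> None:
        if self.cubin_path is None:
            # First call: autotune with the real inputs, then persist the
            # winning config's cubin path and launch metadata.
            self.cubin_path, self.launch_meta = autotune_and_cache(
                self.kernel_name, stream, args
            )
        # Every later call skips straight to the cached launch.
        launch_cached(self.cubin_path, self.launch_meta, stream, args)
```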

Limitation:

This version does not support TMA yet. It will come in future PRs.

https://gist.github.com/desertfire/8ee1ca889f411d3f2fca08d7658ea88e is an example of the generated code.

Authored with Claude.

Differential Revision: D95154174

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo

@pytorch-bot

pytorch-bot Bot commented Feb 20, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/175416

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 2 Unrelated Failures

As of commit f86da80 with merge base fabd1c4:

NEW FAILURE - The following job has failed:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

desertfire added a commit that referenced this pull request Feb 20, 2026
ghstack-source-id: a4f9f20
Pull Request resolved: #175416
…per"

Summary: This adds support for lazy Triton kernel compilation when using
cpp-wrapper mode (TORCHINDUCTOR_CPP_WRAPPER=1). Previously,
the cpp-wrapper mode relies on autotune_at_compile_time to
compile Triton kernels into cubin files, which has two problems,

1. Kernels are not tuned with real inputs which can lead to
   sub-optimal performance or even IMA.

2. Extra GPU memory consumption which can trigger OOM comparing
   to the default Inductor.

Key changes:

1. CppWrapperGpu.generate_lazy() method that generates deferred
   kernel compilation code. In the generated cpp code, a Triton
   kernel will be compiled with AsyncCompile when the code is
   loaded, and the first time the kernel is called, it will
   perform autotuning and store the generated cubin file and
   metadata. Subsequent kernel calls will just need to compute
   grids and then cudaLaunch.

2. triton_lazy_compile.py takes care of the interaction with
   cpp code at the runtime, triggering Triton kernel compilation
   and autotuning.

Limitation:

This version does not support TMA yet. It wil come in future PRs.

Authored with Claude.

[ghstack-poisoned]
desertfire added a commit that referenced this pull request Feb 20, 2026
ghstack-source-id: a4f9f20
Pull Request resolved: #175416

@patch.object(config, "profile_bandwidth", True)
@skip_if_cpp_wrapper("cpp output code is different")
Contributor Author


I tried to limit the changes to these test files, to make the reviewing easier. More clean-up and fixes will come in follow-up PRs.

…per"

Summary: This adds support for lazy Triton kernel compilation when using
cpp-wrapper mode (TORCHINDUCTOR_CPP_WRAPPER=1). Previously,
the cpp-wrapper mode relies on autotune_at_compile_time to
compile Triton kernels into cubin files, which has two problems,

1. Kernels are not tuned with real inputs which can lead to
   sub-optimal performance or even IMA.

2. Extra GPU memory consumption which can trigger OOM comparing
   to the default Inductor.

Key changes:

1. CppWrapperGpu.generate_lazy() method that generates deferred
   kernel compilation code. In the generated cpp code, a Triton
   kernel will be compiled with AsyncCompile when the code is
   loaded, and the first time the kernel is called, it will
   perform autotuning and store the generated cubin file and
   metadata. Subsequent kernel calls will just need to compute
   grids and then cudaLaunch.

2. triton_lazy_compile.py takes care of the interaction with
   cpp code at the runtime, triggering Triton kernel compilation
   and autotuning.

Limitation:

This version does not support TMA yet. It wil come in future PRs.

Authored with Claude.

[ghstack-poisoned]
desertfire added a commit that referenced this pull request Feb 20, 2026
ghstack-source-id: 6df8c86
Pull Request resolved: #175416
Comment thread torch/_inductor/runtime/triton_heuristics.py
stream: Any,
args: list[Any],
) -> tuple[
str, # cubin_path
Contributor


Can we make this an individual class/dict instead of just a tuple? It will be hard to track otherwise.
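
A named structure along the lines of this suggestion could look like the sketch below. Only cubin_path is confirmed by the tuple comment in the diff context above; mangled_name and num_warps are taken from the cached-params snippet further down, and the class name is invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompiledKernelInfo:
    """Hypothetical named replacement for the positional tuple return."""
    cubin_path: str    # where the tuned cubin was written
    mangled_name: str  # kernel symbol name inside the cubin
    num_warps: int     # launch configuration chosen by autotuning
```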

raise RuntimeError(f"cubin path not found in cached params for {kernel_name}")

mangled_name = cached_params.get("mangled_name", "")
num_warps = cached_params.get("num_warps", 4)
Contributor


Where are these defaults coming from?

Contributor Author


I think this one and xblock = config.get("XBLOCK", 128) are somewhat ad-hoc. We probably should just error out if these values are missing. The other defaults of 1 are fine.
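
A minimal sketch of the error-out behavior both commenters prefer, assuming cached_params is the dict read in the snippet above (require_param is an invented helper, not part of the patch):

```python
def require_param(cached_params: dict, key: str, kernel_name: str):
    # Fail loudly instead of silently falling back to ad-hoc defaults
    # like num_warps=4 or XBLOCK=128, which could mask tuning bugs.
    if key not in cached_params:
        raise RuntimeError(f"{key} not found in cached params for {kernel_name}")
    return cached_params[key]


# Instead of: num_warps = cached_params.get("num_warps", 4)
# num_warps = require_param(cached_params, "num_warps", kernel_name)
```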


@PaulZhang12
Contributor

How does the autotuning happen? Once inputs are encountered? And compilation happens on the C++ program startup?

I think this one and xblock = config.get("XBLOCK", 128) are somewhat ad-hoc
Yeah, I think for this we should error out if we don't get the expected result. This could lead to a lot of silent performance issues.

…per"


Summary: This adds support for lazy Triton kernel compilation when using
cpp-wrapper mode (TORCHINDUCTOR_CPP_WRAPPER=1). Previously,
the cpp-wrapper mode relies on autotune_at_compile_time to
compile Triton kernels into cubin files, which has two problems,

1. Kernels are not tuned with real inputs which can lead to
   sub-optimal performance or even IMA.

2. Extra GPU memory consumption which can trigger OOM comparing
   to the default Inductor.

Key changes:

1. CppWrapperGpu.generate_lazy() method that generates deferred
   kernel compilation code. In the generated cpp code, a Triton
   kernel will be compiled with AsyncCompile when the code is
   loaded, and the first time the kernel is called, it will
   perform autotuning and store the generated cubin file and
   metadata. Subsequent kernel calls will just need to compute
   grids and then cudaLaunch.

2. triton_lazy_compile.py takes care of the interaction with
   cpp code at the runtime, triggering Triton kernel compilation
   and autotuning.

Limitation:

This version does not support TMA yet. It wil come in future PRs.

https://gist.github.com/desertfire/8ee1ca889f411d3f2fca08d7658ea88e is an example of the generated code.

Authored with Claude.

[ghstack-poisoned]
…per"


Summary: This adds support for lazy Triton kernel compilation when using
cpp-wrapper mode (TORCHINDUCTOR_CPP_WRAPPER=1). Previously,
the cpp-wrapper mode relies on autotune_at_compile_time to
compile Triton kernels into cubin files, which has two problems,

1. Kernels are not tuned with real inputs which can lead to
   sub-optimal performance or even IMA.

2. Extra GPU memory consumption which can trigger OOM comparing
   to the default Inductor.

Key changes:

1. CppWrapperGpu.generate_lazy() method that generates deferred
   kernel compilation code. In the generated cpp code, a Triton
   kernel will be compiled with AsyncCompile when the code is
   loaded, and the first time the kernel is called, it will
   perform autotuning and store the generated cubin file and
   metadata. Subsequent kernel calls will just need to compute
   grids and then cudaLaunch.

2. triton_lazy_compile.py takes care of the interaction with
   cpp code at the runtime, triggering Triton kernel compilation
   and autotuning.

Limitation:

This version does not support TMA yet. It wil come in future PRs.

https://gist.github.com/desertfire/8ee1ca889f411d3f2fca08d7658ea88e is an example of the generated code.

Authored with Claude.

[ghstack-poisoned]
…per"


Summary: This adds support for lazy Triton kernel compilation when using
cpp-wrapper mode (TORCHINDUCTOR_CPP_WRAPPER=1). Previously,
the cpp-wrapper mode relies on autotune_at_compile_time to
compile Triton kernels into cubin files, which has two problems,

1. Kernels are not tuned with real inputs which can lead to
   sub-optimal performance or even IMA.

2. Extra GPU memory consumption which can trigger OOM comparing
   to the default Inductor.

Key changes:

1. CppWrapperGpu.generate_lazy() method that generates deferred
   kernel compilation code. In the generated cpp code, a Triton
   kernel will be compiled with AsyncCompile when the code is
   loaded, and the first time the kernel is called, it will
   perform autotuning and store the generated cubin file and
   metadata. Subsequent kernel calls will just need to compute
   grids and then cudaLaunch.

2. triton_lazy_compile.py takes care of the interaction with
   cpp code at the runtime, triggering Triton kernel compilation
   and autotuning.

Limitation:

This version does not support TMA yet. It wil come in future PRs.

https://gist.github.com/desertfire/8ee1ca889f411d3f2fca08d7658ea88e is an example of the generated code.

Authored with Claude.

[ghstack-poisoned]
@desertfire
Contributor Author

How does the autotuning happen? Once inputs are encountered? And compilation happens on the C++ program startup?

Yes, it happens the first time we run a kernel, through runTritonKernelWithAutotune (C++) -> run_triton_kernel_with_autotune (Python).

@desertfire
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 job has failed: inductor / inductor-test / test (inductor_timm, 2, 2, linux.g5.4xlarge.nvidia.gpu)

Details for Dev Infra team Raised by workflow job

@desertfire
Contributor Author

@pytorchbot merge -i

@pytorchmergebot
Collaborator

Merge started

Your change will be merged while ignoring the following 3 checks: inductor / inductor-cpu-test / test (cpu_inductor_torchbench, 1, 2, linux.2xlarge.amx, unstable), inductor / inductor-test / test (inductor_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu), inductor / inductor-test / test (inductor_timm, 2, 2, linux.g5.4xlarge.nvidia.gpu)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

pytorchmergebot pushed a commit to anatoliylitv/pytorch that referenced this pull request Mar 4, 2026

Pull Request resolved: pytorch#175416
Approved by: https://github.com/mlazos, https://github.com/PaulZhang12
desertfire added a commit that referenced this pull request Mar 6, 2026
Summary: #173662 added more tests to test/inductor/test_triton_kernels.py, and #175416 enabled cpp-wrapper testing on test/inductor/test_triton_kernels.py. So there was a land race, and #173662 didn't have the failing CI signal at landing time.

Forward fix by updating the code checking target for cpp-wrapper.

[ghstack-poisoned]
desertfire added a commit that referenced this pull request Mar 6, 2026
ghstack-source-id: 5244ba6
Pull Request resolved: #176745
desertfire added a commit that referenced this pull request Mar 6, 2026
Summary:
1. #173662 added more tests to test/inductor/test_triton_kernels.py, and #175416 enabled cpp-wrapper testing on test/inductor/test_triton_kernels.py. So there was a land race, and #173662 didn't have the failing CI signal at landing time.

Forward fix by updating the code checking target for cpp-wrapper.

2. #176353 also had a land race. Skip for now; the fix is coming later.

[ghstack-poisoned]
desertfire added a commit that referenced this pull request Mar 6, 2026
ghstack-source-id: c856a94
Pull Request resolved: #176745
pytorchmergebot pushed a commit that referenced this pull request Mar 7, 2026
Pull Request resolved: #176745
Approved by: https://github.com/AmesingFlank, https://github.com/zou3519
sandy-gags pushed a commit to sandy-gags/pytorch that referenced this pull request Mar 12, 2026
ghstack-source-id: ec0ced1
Pull Request resolved: pytorch/pytorch#175416
EmanueleCoradin pushed a commit to EmanueleCoradin/pytorch that referenced this pull request Mar 30, 2026
Pull Request resolved: pytorch#175416
Approved by: https://github.com/mlazos, https://github.com/PaulZhang12
EmanueleCoradin pushed a commit to EmanueleCoradin/pytorch that referenced this pull request Mar 30, 2026
Pull Request resolved: pytorch#176745
Approved by: https://github.com/AmesingFlank, https://github.com/zou3519
etaf added a commit that referenced this pull request Apr 3, 2026
… XPU

The lazy Triton kernel compilation feature (#175416) introduced
CUDA-specific assumptions that broke XPU cpp-wrapper codegen.

Fix the XPU incompatibilities:
- lazy_triton_compile.h: conditional include for XPU vs CUDA device
  headers; change runTritonKernelWithAutotune stream param from
  cudaStream_t to void* (both pointer types convert implicitly)
- cpp_wrapper_gpu.py: use device-appropriate pointer type for scratch
  allocations instead of hardcoded CUdeviceptr
- sycl_runtime_wrappers.h: query threads_per_warp from the kernel
  object via compile_sub_group_size instead of requiring it as an
  external parameter, matching static_launcher/xpu.cpp
- Remove threads_per_warp plumbing from triton_heuristics.py and
  cpp_wrapper_gpu.py codegen since it is no longer needed


ghstack-source-id: 58de386
Pull-Request: #179239
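
As an illustration of the cpp_wrapper_gpu.py bullet in the commit message above, device-appropriate pointer selection in codegen might look roughly like this; scratch_ptr_type is an invented helper for the sketch, not the actual patch:

```python
def scratch_ptr_type(device_type: str) -> str:
    # Pick the C-side pointer type emitted for scratch allocations by
    # backend, instead of hardcoding CUdeviceptr everywhere.
    if device_type == "cuda":
        return "CUdeviceptr"
    if device_type == "xpu":
        # Illustrative choice: a plain void* on the SYCL side, mirroring
        # how the stream parameter was generalized to void* in this fix.
        return "void*"
    raise AssertionError(f"unexpected device_type: {device_type}")
```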
etaf added a commit that referenced this pull request Apr 3, 2026
… XPU

ghstack-source-id: 9bcb595
Pull-Request: #179239
pytorchmergebot pushed a commit that referenced this pull request Apr 3, 2026
… XPU (#179239)

Pull Request resolved: #179239
Approved by: https://github.com/desertfire
@github-actions github-actions Bot deleted the gh/desertfire/653/head branch April 4, 2026 02:24
weifengpy pushed a commit that referenced this pull request Apr 7, 2026
… XPU (#179239)

nklshy-aws pushed a commit to nklshy-aws/pytorch that referenced this pull request Apr 7, 2026
… XPU (pytorch#179239)

bobrenjc93 pushed a commit to bobrenjc93/pytorch that referenced this pull request Apr 10, 2026
… XPU (pytorch#179239)


Labels

ciflow/inductor, ciflow/inductor-rocm, ciflow/inductor-rocm-mi300, ciflow/trunk, Merged, module: inductor, release notes: inductor (aoti)
