[inductor] Add lazy Triton kernel compilation for cpp-wrapper #175416
desertfire wants to merge 11 commits into gh/desertfire/653/base from
Conversation
Summary: This adds support for lazy Triton kernel compilation when using cpp-wrapper mode (TORCHINDUCTOR_CPP_WRAPPER=1). Previously, the cpp-wrapper mode relied on autotune_at_compile_time to compile Triton kernels into cubin files, which had two problems: 1. Kernels were not tuned with real inputs, which can lead to sub-optimal performance or even IMA. 2. Extra GPU memory consumption, which can trigger OOM compared to the default Inductor. Key changes: 1. CppWrapperGpu.generate_lazy() method that generates deferred kernel compilation code. In the generated cpp code, a Triton kernel is compiled with AsyncCompile when the code is loaded; the first time the kernel is called, it performs autotuning and stores the generated cubin file and metadata. Subsequent kernel calls only need to compute grids and then cudaLaunch. 2. triton_lazy_compile.py takes care of the interaction with the cpp code at runtime, triggering Triton kernel compilation and autotuning. Limitation: This version does not support TMA yet. It will come in future PRs. Authored with Claude. [ghstack-poisoned]
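For orientation, here is a minimal Python sketch of the runtime flow described above. The helper names (`run_kernel_lazily`, `_compile_and_autotune`, the `_launch_params` cache) are illustrative stand-ins for what triton_lazy_compile.py does, not the PR's actual API:

```python
from typing import Any

# kernel_name -> cached cubin path and launch metadata
_launch_params: dict[str, dict[str, Any]] = {}

def run_kernel_lazily(kernel_name: str, args: list[Any]) -> dict[str, Any]:
    """First call: compile and autotune with the real inputs, then cache the
    cubin path and launch metadata. Later calls return the cached params so
    the C++ side can compute grids and cudaLaunch directly."""
    params = _launch_params.get(kernel_name)
    if params is None:
        params = _compile_and_autotune(kernel_name, args)
        _launch_params[kernel_name] = params
    return params

def _compile_and_autotune(kernel_name: str, args: list[Any]) -> dict[str, Any]:
    # Placeholder: the real code compiles the Triton kernel and benchmarks
    # candidate configs against the actual runtime inputs.
    return {"cubin_path": f"/tmp/{kernel_name}.cubin", "num_warps": 4}
```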
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/175416
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure, 2 Unrelated Failures as of commit f86da80 with merge base fabd1c4. NEW FAILURE - The following job has failed:
BROKEN TRUNK - The following job failed but was also present on the merge base: 👉 Rebase onto the `viable/strict` branch to avoid these failures
UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
    @patch.object(config, "profile_bandwidth", True)
    @skip_if_cpp_wrapper("cpp output code is different")
I tried to limit the changes to these test files to make reviewing easier. More clean-up and fixes will come in follow-up PRs.
    stream: Any,
    args: list[Any],
) -> tuple[
    str,  # cubin_path
Can we make this an individual class/dict instead of just a tuple? It will be hard to track.
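For illustration, the suggestion could look like the following sketch; the field names are guessed from the tuple comments in this diff and the cached-params keys below, not taken from the actual change:

```python
from dataclasses import dataclass

@dataclass
class KernelLaunchParams:
    # Named fields are self-documenting at call sites, unlike a positional
    # tuple whose meaning lives only in comments.
    cubin_path: str
    mangled_name: str
    num_warps: int
```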
    raise RuntimeError(f"cubin path not found in cached params for {kernel_name}")
    ...
    mangled_name = cached_params.get("mangled_name", "")
    num_warps = cached_params.get("num_warps", 4)
Where are these defaults coming from?
I think this one and xblock = config.get("XBLOCK", 128) are somewhat ad-hoc. We probably should just error out if these values are missing. The other default values of 1 are fine.
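A minimal sketch of the error-out pattern being suggested, using a hypothetical `require` helper and key names taken from the diff above:

```python
def require(params: dict, key: str, kernel_name: str):
    """Raise instead of silently falling back to an ad-hoc default."""
    if key not in params:
        raise RuntimeError(f"{key} not found in cached params for {kernel_name}")
    return params[key]

# Example usage with a made-up params dict:
cached_params = {"num_warps": 8, "XBLOCK": 64}
num_warps = require(cached_params, "num_warps", "triton_poi_fused_add_0")
xblock = require(cached_params, "XBLOCK", "triton_poi_fused_add_0")
```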
How does the autotuning happen? Once inputs are encountered? And does compilation happen on C++ program startup?
Yes, it happens the first time we run a kernel, through runTritonKernelWithAutotune (C++) -> run_triton_kernel_with_autotune (Python).
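A small Python sketch of that load-time vs. first-call split (the class and helper names are illustrative, not the PR's actual code):

```python
import threading

class LazyKernel:
    def __init__(self, source: str):
        # Load time: start compiling in the background (the AsyncCompile role).
        self._compiled: str | None = None
        self._params: dict | None = None
        self._thread = threading.Thread(target=self._compile, args=(source,))
        self._thread.start()

    def _compile(self, source: str) -> None:
        # Stand-in for the real Triton compilation producing a cubin.
        self._compiled = f"cubin-for-{hash(source)}"

    def launch(self, args: list) -> None:
        if self._params is None:
            # First call: wait for compilation, then autotune on real inputs.
            self._thread.join()
            self._params = {"cubin": self._compiled, "num_warps": 4}
        # Subsequent calls: just compute the grid and launch.
        print("launching", self._params["cubin"], "with", len(args), "args")
```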
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
Merge failed. Reason: 1 job has failed: inductor / inductor-test / test (inductor_timm, 2, 2, linux.g5.4xlarge.nvidia.gpu). Details for Dev Infra team: Raised by workflow job
@pytorchbot merge -i
Merge started. Your change will be merged while ignoring the following 3 checks: inductor / inductor-cpu-test / test (cpu_inductor_torchbench, 1, 2, linux.2xlarge.amx, unstable), inductor / inductor-test / test (inductor_torchbench, 2, 2, linux.g5.4xlarge.nvidia.gpu), inductor / inductor-test / test (inductor_timm, 2, 2, linux.g5.4xlarge.nvidia.gpu). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
…h#175416) Summary: This adds support for lazy Triton kernel compilation when using cpp-wrapper mode (TORCHINDUCTOR_CPP_WRAPPER=1). Previously, the cpp-wrapper mode relied on autotune_at_compile_time to compile Triton kernels into cubin files, which had two problems: 1. Kernels were not tuned with real inputs, which can lead to sub-optimal performance or even IMA. 2. Extra GPU memory consumption, which can trigger OOM compared to the default Inductor. Key changes: 1. CppWrapperGpu.generate_lazy() method that generates deferred kernel compilation code. In the generated cpp code, a Triton kernel is compiled with AsyncCompile when the code is loaded; the first time the kernel is called, it performs autotuning and stores the generated cubin file and metadata. Subsequent kernel calls only need to compute grids and then cudaLaunch. 2. triton_lazy_compile.py takes care of the interaction with the cpp code at runtime, triggering Triton kernel compilation and autotuning. Limitation: This version does not support TMA yet. It will come in future PRs. https://gist.github.com/desertfire/8ee1ca889f411d3f2fca08d7658ea88e is an example of the generated code. Authored with Claude. Differential Revision: [D95154174](https://our.internmc.facebook.com/intern/diff/D95154174) Pull Request resolved: pytorch#175416 Approved by: https://github.com/mlazos, https://github.com/PaulZhang12
Summary: 1. #173662 added more tests to test/inductor/test_triton_kernels.py, and #175416 enabled cpp-wrapper tests on test/inductor/test_triton_kernels.py, so there was a land race and #173662 didn't have the failing CI signal at landing time. Forward fix by updating the code-checking target for cpp-wrapper. 2. #176353 also had a land race. Skip for now; the fix is coming later. Pull Request resolved: #176745 Approved by: https://github.com/AmesingFlank, https://github.com/zou3519
… XPU (#179239)
The lazy Triton kernel compilation feature (#175416) introduced CUDA-specific assumptions that broke XPU cpp-wrapper codegen. Fix the XPU incompatibilities:
- lazy_triton_compile.h: conditional include for XPU vs CUDA device headers; change the runTritonKernelWithAutotune stream param from cudaStream_t to void* (both pointer types convert implicitly)
- cpp_wrapper_gpu.py: use a device-appropriate pointer type for scratch allocations instead of the hardcoded CUdeviceptr
- sycl_runtime_wrappers.h: query threads_per_warp from the kernel object via compile_sub_group_size instead of requiring it as an external parameter
- Remove the threads_per_warp plumbing from triton_heuristics.py and cpp_wrapper_gpu.py codegen since it is no longer needed
Pull Request resolved: #179239 Approved by: https://github.com/desertfire
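To make the codegen change concrete, here is a sketch of a device-dependent choice like the scratch-pointer fix. The function name is an assumption, and so is the XPU pointer type emitted here; the real logic lives in cpp_wrapper_gpu.py:

```python
def scratch_ptr_type(device_type: str) -> str:
    # Hardcoding CUdeviceptr broke XPU; emit a pointer type per backend.
    # "void*" for the non-CUDA branch is an assumption for illustration.
    return "CUdeviceptr" if device_type == "cuda" else "void*"

assert scratch_ptr_type("cuda") == "CUdeviceptr"
assert scratch_ptr_type("xpu") == "void*"
```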
Stack from ghstack (oldest at bottom):
Summary: This adds support for lazy Triton kernel compilation when using
cpp-wrapper mode (TORCHINDUCTOR_CPP_WRAPPER=1). Previously,
the cpp-wrapper mode relied on autotune_at_compile_time to
compile Triton kernels into cubin files, which had two problems:
1. Kernels were not tuned with real inputs, which can lead to
sub-optimal performance or even IMA.
2. Extra GPU memory consumption, which can trigger OOM compared
to the default Inductor.
Key changes:
1. CppWrapperGpu.generate_lazy() method that generates deferred
kernel compilation code. In the generated cpp code, a Triton
kernel is compiled with AsyncCompile when the code is
loaded; the first time the kernel is called, it
performs autotuning and stores the generated cubin file and
metadata. Subsequent kernel calls only need to compute
grids and then cudaLaunch.
2. triton_lazy_compile.py takes care of the interaction with
the cpp code at runtime, triggering Triton kernel compilation
and autotuning (see the sketch after this list).
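A sketch of what the per-kernel cached metadata could contain, with the field set inferred from the review diffs above (illustrative, not the exact schema):

```python
cached_params = {
    "cubin_path": "/path/to/kernel.cubin",     # compiled binary, loaded once
    "mangled_name": "triton_poi_fused_add_0",  # symbol to launch in the cubin
    "num_warps": 4,                            # winning autotune config
    "XBLOCK": 128,                             # block size used for grid math
}
```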
Limitation:
This version does not support TMA yet. It will come in future PRs.
https://gist.github.com/desertfire/8ee1ca889f411d3f2fca08d7658ea88e is an example of the generated code.
Authored with Claude.
Differential Revision: D95154174
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo