…rch#159496)

Summary: Report the kernel choices and benchmark data to better understand how kernels are selected and the performance gap between the best kernel (likely a CUDA kernel) and Triton kernels.

**Example**

Event: mm_template_autotuning
Column: autotune_choices
```json
{
  "num_choices": 52,
  "num_triton_choices": 19,
  "best_kernel": "cutlass_f6c25cf2",
  "best_kernel_desc": "cutlass3x_sm90_tensorop_gemm_f16_f16_f32_void_f16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8",
  "best_time": 0.6283040046691895,
  "best_triton_pos": 26,
  "best_triton_time": 0.6832960247993469,
  "best_triton_kernel": "triton_mm_17",
  "best_triton_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0"
}
```
Test Plan:
```
TORCHINDUCTOR_MAX_AUTOTUNE_REPORT_CHOICES_STATS=1 buck2 run //scripts/wychi:test_autotune_mm 2>&1 > /tmp/mylog.txt
```
Rollback Plan:
Reviewed By: masnesral, stashuk-olek
Differential Revision: D79235037
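Given a logged `autotune_choices` payload like the one above, the gap between the winning kernel and the best Triton kernel can be computed downstream. A minimal sketch, using only field names from the example (the analysis helper itself is hypothetical, not part of the PR):

```python
import json

# Hypothetical payload mirroring the autotune_choices column above
# (kernel-description fields omitted for brevity).
event = json.loads("""
{
  "num_choices": 52,
  "num_triton_choices": 19,
  "best_time": 0.6283040046691895,
  "best_triton_pos": 26,
  "best_triton_time": 0.6832960247993469
}
""")

# Relative slowdown of the best Triton kernel vs. the overall winner.
gap_pct = (event["best_triton_time"] / event["best_time"] - 1.0) * 100
print(f"best Triton kernel is {gap_pct:.1f}% slower "
      f"(ranked #{event['best_triton_pos']} of {event['num_choices']})")
```

For the sample numbers this reports a gap of roughly 8.8%, which is the kind of CUDA-vs-Triton comparison the event is meant to enable.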
Summary:
Exceptions during autotune kernel precompilation are now systematically captured and reported via the chromium_event_logger, enabling better debugging and analysis of autotune failures.
Currently, exceptions are dumped to the console in the following format:
```
[0/0] RuntimeError: No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 262144 Hardware limit:232448 Reducing block sizes or `num_stages` may help.
[0/0] Runtime error during autotuning:
[0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 262144 Hardware limit:232448 Reducing block sizes or `num_stages` may help..
[0/0] Ignoring this choice.
```
With this change, the exception details will also be logged in the metadata of the `{name}_template_precompiling` event.
The format:
```
{
  "exceptions": [
    {
      "choice_type": "triton",
      "choice": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0",
      "exception_message": "No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 262144 Hardware limit:232448 Reducing block sizes or `num_stages` may help.",
      "exception": "OutOfMemoryError",
      "required_memory": "262144",
      "hardware_limit": "232448"
    }
  ]
}
```
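The structured `required_memory` / `hardware_limit` fields mirror numbers embedded in the raw error message. A minimal sketch of how they could be extracted (the regex and the `parse_oom` helper are assumptions for illustration, not the PR's actual implementation):

```python
import re

# Raw message as it appears in the console output above.
MSG = (
    "No valid triton configs. OutOfMemoryError: out of resource: triton_mm "
    "Required: 262144 Hardware limit:232448 Reducing block sizes or "
    "`num_stages` may help."
)

def parse_oom(message):
    """Pull the memory figures out of an OutOfMemoryError message."""
    m = re.search(r"Required:\s*(\d+)\s*Hardware limit:\s*(\d+)", message)
    if m is None:
        return {}
    return {"required_memory": m.group(1), "hardware_limit": m.group(2)}

print(parse_oom(MSG))
# → {'required_memory': '262144', 'hardware_limit': '232448'}
```

Keeping the values as strings matches the event format shown above, where both fields are quoted.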
Test Plan:
buck2 run //scripts/wychi:test_autotune_mm 2>&1 > /tmp/mylog.txt
Rollback Plan:
Differential Revision: D79420953
This pull request was exported from Phabricator. Differential Revision: D79420953
wychi added a commit to wychi/pytorch that referenced this pull request on Aug 1, 2025.
wychi added a commit to wychi/pytorch that referenced this pull request on Aug 5, 2025.
Summary: Pull Request resolved: pytorch#159688. Pull Request resolved: pytorch#159687. Reviewed By: stashuk-olek. Differential Revision: D79420953
pytorchmergebot pushed a commit that referenced this pull request on Aug 8, 2025.
Summary:
Exceptions during autotune kernel precompilation are now systematically captured and reported via the chromium_event_logger, enabling better debugging and analysis of autotune failures.
Currently, exceptions are dumped to the console in the following format:
```
[0/0] RuntimeError: No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 262144 Hardware limit:232448 Reducing block sizes or `num_stages` may help.
[0/0] Runtime error during autotuning:
[0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 262144 Hardware limit:232448 Reducing block sizes or `num_stages` may help..
[0/0] Ignoring this choice.
```
The exception tracebacks:
```
# inner exception
traceback:
File "/torch/_inductor/runtime/triton_heuristics.py", line 603, in _make_launchers
launchers.append(result.make_launcher())
^^^^^^^^^^^^^^^^^^^^^^
File "/torch/_inductor/runtime/triton_heuristics.py", line 1503, in make_launcher
self.kernel.load_kernel(device)
File "/torch/_inductor/runtime/static_cuda_launcher.py", line 113, in load_kernel
(self.function, self.n_regs, self.n_spills) = _StaticCudaLauncher._load_kernel(
# wrapped exception
traceback:
File "/usr/local/fbcode/platform010/lib/python3.12/concurrent/futures/thread.py", line 59, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<trimmed>#link-tree/torch/_inductor/select_algorithm.py", line 2596, in precompile_with_captured_stdout
choice.precompile()
File "<trimmed>#link-tree/torch/_inductor/select_algorithm.py", line 1881, in precompile
self.bmreq.precompile()
File "<trimmed>#link-tree/torch/_inductor/autotune_process.py", line 660, in precompile
getattr(mod, self.kernel_name).precompile()
File "<trimmed>#link-tree/torch/_inductor/runtime/triton_heuristics.py", line 440, in precompile
self._make_launchers()
File "<trimmed>#link-tree/torch/_inductor/runtime/triton_heuristics.py", line 608, in _make_launchers
raise RuntimeError(f"No valid triton configs. {type(exc).__name__}: {exc}")
```
With this change, the exception details will also be logged in the metadata of the `{name}_template_precompiling` event.
The format:
```
{
  "exceptions": [
    {
      "choice_type": "triton",
      "choice": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0",
      "exception_message": "No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 262144 Hardware limit:232448 Reducing block sizes or `num_stages` may help.",
      "exception": "OutOfMemoryError",
      "required_memory": "262144",
      "hardware_limit": "232448"
    }
  ]
}
```
Test Plan:
buck2 run //scripts/wychi:test_autotune_mm 2>&1 > /tmp/mylog.txt
Rollback Plan:
Differential Revision: D79420953
Pull Request resolved: #159688
Approved by: https://github.com/stashuk-olek
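The "wrapped exception" traceback above passes through `concurrent.futures`, so the failure surfaces only when the future's result is collected. A hedged sketch of gathering such per-choice precompile failures into the `exceptions` metadata shape (the `precompile` worker and the choice strings are hypothetical stand-ins, not Inductor's actual API):

```python
import concurrent.futures

# Stand-in for a per-choice precompile step that can fail.
def precompile(choice):
    raise RuntimeError(f"No valid triton configs. {choice}")

def collect_exceptions(choices):
    """Run precompiles in worker threads; record each failure as metadata."""
    exceptions = []
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = {pool.submit(precompile, c): c for c in choices}
        for fut in concurrent.futures.as_completed(futures):
            try:
                fut.result()  # re-raises the worker's exception here
            except Exception as exc:
                exceptions.append({
                    "choice_type": "triton",
                    "choice": futures[fut],
                    "exception": type(exc).__name__,
                    "exception_message": str(exc),
                })
    return {"exceptions": exceptions}

meta = collect_exceptions(["BLOCK_M=64", "BLOCK_M=128"])
```

The key point is that catching at `fut.result()` recovers the worker-side exception type and message, which is what the event metadata records per ignored choice.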
hinriksnaer pushed a commit to hinriksnaer/pytorch that referenced this pull request on Aug 8, 2025.
markc-614 pushed a commit to markc-614/pytorch that referenced this pull request on Sep 17, 2025.
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben