Log max_autotune exceptions #159687

Closed
wychi wants to merge 2 commits into pytorch:main from wychi:export-D79420953

@wychi wychi commented Aug 1, 2025

Summary:
Exceptions during autotune kernel precompilation are now systematically captured and reported via the chromium_event_logger, enabling better debugging and analysis of autotune failures.

Currently, exceptions are dumped to the console in the following format:

[0/0] RuntimeError: No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 262144 Hardware limit:232448 Reducing block sizes or `num_stages` may help.
[0/0] Runtime error during autotuning: 
[0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 262144 Hardware limit:232448 Reducing block sizes or `num_stages` may help.. 
[0/0] Ignoring this choice.

With this change, the exception details will also be logged in the metadata of the {name}_template_precompiling event.

The format:

{
  "exceptions": [
    {
      "choice_type": "triton",
      "choice": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0",
      "exception_message": "No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 262144 Hardware limit:232448 Reducing block sizes or `num_stages` may help.",
      "exception": "OutOfMemoryError",
      "required_memory": "262144",
      "hardware_limit": "232448"
    }
  ]
}
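
As a rough illustration of how one entry in `exceptions` could be derived from the console message, here is a minimal sketch. The helper `exception_to_metadata` and its regular expressions are hypothetical and are not taken from this PR's diff; they only assume the message shape shown in the example above.

```python
import re

# Hypothetical helper, not the PR's actual code: pulls structured fields
# out of an exception message shaped like the example above.
_INNER_TYPE = re.compile(r"No valid triton configs\. (\w+):")
_REQUIRED = re.compile(r"Required:\s*(\d+)")
_LIMIT = re.compile(r"Hardware limit:\s*(\d+)")


def exception_to_metadata(choice_type, choice, message):
    """Build one entry of the "exceptions" list from a failed choice."""
    entry = {
        "choice_type": choice_type,
        "choice": choice,
        "exception_message": message,
    }
    inner = _INNER_TYPE.search(message)
    if inner:
        # Name of the wrapped exception, e.g. "OutOfMemoryError".
        entry["exception"] = inner.group(1)
    required = _REQUIRED.search(message)
    limit = _LIMIT.search(message)
    if required and limit:
        # Kept as strings to match the format shown above.
        entry["required_memory"] = required.group(1)
        entry["hardware_limit"] = limit.group(1)
    return entry


message = (
    "No valid triton configs. OutOfMemoryError: out of resource: triton_mm "
    "Required: 262144 Hardware limit:232448 "
    "Reducing block sizes or `num_stages` may help."
)
entry = exception_to_metadata("triton", "BLOCK_K=128, BLOCK_M=64, BLOCK_N=64", message)
```

Messages that do not match the patterns still produce an entry with `choice_type`, `choice`, and `exception_message`, so no failure would be dropped in this sketch.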

Test Plan:
buck2 run //scripts/wychi:test_autotune_mm > /tmp/mylog.txt 2>&1

Rollback Plan:

Differential Revision: D79420953

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben

wychi added 2 commits August 1, 2025 16:00
…rch#159496)

Summary:

Report the kernel choices and benchmark data to better understand how kernels are selected and the performance gap between the best kernel (likely a CUDA kernel) and Triton kernels.


**Example**

Event: mm_template_autotuning
Column: autotune_choices

```json
{
  "num_choices": 52,
  "num_triton_choices": 19,
  "best_kernel": "cutlass_f6c25cf2",
  "best_kernel_desc": "cutlass3x_sm90_tensorop_gemm_f16_f16_f32_void_f16_128x256x64_2x1x1_0_tnn_align8_stream_k_warpspecialized_cooperative_epi_tma swizzle=8",
  "best_time": 0.6283040046691895,
  "best_triton_pos": 26,
  "best_triton_time": 0.6832960247993469,
  "best_triton_kernel": "triton_mm_17",
  "best_triton_kernel_desc": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=3, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0"
}
```
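
To show how these fields can be used to quantify the gap between the best kernel and the best Triton kernel, here is a small sketch. The function `triton_gap` is a hypothetical name used only for illustration; the field names and values are copied from the example above.

```python
import json

# Subset of the autotune_choices example above.
stats = json.loads("""
{
  "best_kernel": "cutlass_f6c25cf2",
  "best_time": 0.6283040046691895,
  "best_triton_kernel": "triton_mm_17",
  "best_triton_time": 0.6832960247993469
}
""")


def triton_gap(stats):
    """Relative slowdown of the best Triton kernel versus the overall best kernel."""
    return stats["best_triton_time"] / stats["best_time"] - 1.0


gap = triton_gap(stats)
print(f"{stats['best_triton_kernel']} is {gap:.1%} slower than {stats['best_kernel']}")
```

A ratio is used rather than an absolute difference so that the gap is comparable across problem sizes.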

Test Plan:
```
TORCHINDUCTOR_MAX_AUTOTUNE_REPORT_CHOICES_STATS=1 buck2 run //scripts/wychi:test_autotune_mm > /tmp/mylog.txt 2>&1
```

Rollback Plan:

Reviewed By: masnesral, stashuk-olek

Differential Revision: D79235037

pytorch-bot bot commented Aug 1, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/159687

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D79420953


github-actions bot commented Aug 1, 2025

This PR needs a `release notes:` label

If your changes are user facing and intended to be a part of release notes, please use a label starting with `release notes:`.

If not, please add the `topic: not user facing` label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@wychi wychi closed this Aug 1, 2025
wychi added a commit to wychi/pytorch that referenced this pull request Aug 1, 2025
wychi added a commit to wychi/pytorch that referenced this pull request Aug 5, 2025
Summary:
Pull Request resolved: pytorch#159688

Pull Request resolved: pytorch#159687

Reviewed By: stashuk-olek

Differential Revision: D79420953
pytorchmergebot pushed a commit that referenced this pull request Aug 8, 2025
Summary:

Exceptions during autotune kernel precompilation are now systematically captured and reported via the chromium_event_logger, enabling better debugging and analysis of autotune failures.

Currently, exceptions are dumped to the console in the following format:
```
[0/0] RuntimeError: No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 262144 Hardware limit:232448 Reducing block sizes or `num_stages` may help.
[0/0] Runtime error during autotuning:
[0/0] No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 262144 Hardware limit:232448 Reducing block sizes or `num_stages` may help..
[0/0] Ignoring this choice.
```

The exception tracebacks:
```
# inner exception
traceback:
  File "/torch/_inductor/runtime/triton_heuristics.py", line 603, in _make_launchers
    launchers.append(result.make_launcher())
                     ^^^^^^^^^^^^^^^^^^^^^^
  File "/torch/_inductor/runtime/triton_heuristics.py", line 1503, in make_launcher
    self.kernel.load_kernel(device)
  File "/torch/_inductor/runtime/static_cuda_launcher.py", line 113, in load_kernel
    (self.function, self.n_regs, self.n_spills) = _StaticCudaLauncher._load_kernel(

# wrapped exception
traceback:
  File "/usr/local/fbcode/platform010/lib/python3.12/concurrent/futures/thread.py", line 59, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<trimmed>#link-tree/torch/_inductor/select_algorithm.py", line 2596, in precompile_with_captured_stdout
    choice.precompile()
  File "<trimmed>#link-tree/torch/_inductor/select_algorithm.py", line 1881, in precompile
    self.bmreq.precompile()
  File "<trimmed>#link-tree/torch/_inductor/autotune_process.py", line 660, in precompile
    getattr(mod, self.kernel_name).precompile()
  File "<trimmed>#link-tree/torch/_inductor/runtime/triton_heuristics.py", line 440, in precompile
    self._make_launchers()
  File "<trimmed>#link-tree/torch/_inductor/runtime/triton_heuristics.py", line 608, in _make_launchers
    raise RuntimeError(f"No valid triton configs. {type(exc).__name__}: {exc}")
```
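
The two tracebacks reflect an inner exception that is re-raised as a `RuntimeError` whose message embeds the inner type name. The sketch below reproduces that pattern in isolation to show how the inner type can be recovered from the wrapped exception. `OutOfMemoryError`, `load_kernel`, `make_launchers`, and `inner_exception_name` here are stand-ins written for illustration, not the Inductor implementations.

```python
class OutOfMemoryError(Exception):
    """Stand-in for the real out-of-resource error (hypothetical, illustration only)."""


def load_kernel():
    # Simulates the failure at the bottom of the inner traceback.
    raise OutOfMemoryError(
        "out of resource: triton_mm Required: 262144 Hardware limit:232448"
    )


def make_launchers():
    try:
        load_kernel()
    except Exception as exc:
        # Same message shape as the raise shown in the wrapped traceback.
        raise RuntimeError(f"No valid triton configs. {type(exc).__name__}: {exc}")


def inner_exception_name(err):
    """Return the type name of the innermost chained exception, if any."""
    # Raising inside an `except` block sets __context__ to the handled exception;
    # `raise ... from exc` would additionally set __cause__.
    inner = err.__cause__ or err.__context__
    return type(inner or err).__name__


try:
    make_launchers()
except RuntimeError as err:
    wrapped = err

name = inner_exception_name(wrapped)
```

Reading the chained exception avoids parsing the message text, but it only works while the exception object is still available in the same process.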

With this change, the exception details will also be logged in the metadata of the `{name}_template_precompiling` event.

The format:
```
{
  "exceptions": [
    {
      "choice_type": "triton",
      "choice": "ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=128, BLOCK_M=64, BLOCK_N=64, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0",
      "exception_message": "No valid triton configs. OutOfMemoryError: out of resource: triton_mm Required: 262144 Hardware limit:232448 Reducing block sizes or `num_stages` may help.",
      "exception": "OutOfMemoryError",
      "required_memory": "262144",
      "hardware_limit": "232448"
    }
  ]
}
```

Test Plan:
buck2 run //scripts/wychi:test_autotune_mm > /tmp/mylog.txt 2>&1

Rollback Plan:

Differential Revision: D79420953

Pull Request resolved: #159688
Approved by: https://github.com/stashuk-olek
hinriksnaer pushed a commit to hinriksnaer/pytorch that referenced this pull request Aug 8, 2025
markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025