[NO CP][release/2.7][ROCm][inductor] Inductor heuristic upstream backports#2807
Conversation
Jenkins build for 7850a9c97813ff2687769efd9a6c4ff5ff749187 commit finished as FAILURE
Jenkins build for dbdb5542c2ae0f09415495c33bfd7d5d0f77bc53 commit finished as FAILURE
Added a check that includes autotune configs for 2D POI only if their size is big enough. (cherry picked from commit a2b0fd7)
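The size gate described above can be sketched roughly as follows. This is a hypothetical illustration, not the actual ROCm/pytorch change: the class, function, config values, and threshold are all assumptions; the real logic lives in `torch/_inductor/runtime/triton_heuristics.py`.

```python
# Hypothetical sketch: only include the extra 2D pointwise (POI) autotune
# configs when the problem size is large enough. Names and the threshold
# below are illustrative assumptions, not the values used in the PR.
from dataclasses import dataclass


@dataclass(frozen=True)
class TileConfig:
    x_block: int
    y_block: int


# Illustrative cutoff; the actual PR applies its own size criterion.
MIN_2D_NUMEL = 1 << 16


def pointwise_2d_configs(xnumel: int, ynumel: int) -> list[TileConfig]:
    configs = [TileConfig(32, 32)]  # baseline config, always present
    # Only add the larger-tile autotune candidates when the 2D problem is
    # big enough for the extra autotuning cost to pay off.
    if xnumel * ynumel >= MIN_2D_NUMEL:
        configs.append(TileConfig(64, 64))
        configs.append(TileConfig(128, 64))
    return configs
```

For a small kernel the function returns only the baseline config, while a large kernel also gets the extra candidates, which is the behavior the check is meant to enforce.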
Jenkins build for d235a1504f6702249dd72deef1a8f68ce991320a commit finished as FAILURE
Jenkins build for 627a5718c93f8c54fca6787f3167b2b454717226 commit finished as FAILURE
Jenkins build for b1cdd5584626c1f0c2c6bad6b58272da6901e619 commit finished as FAILURE
Jenkins build for d356b844b19b6dfb588b2f5815ebbefca0bba579 commit finished as FAILURE
Tested with TORCHINDUCTOR_MAX_AUTOTUNE_POINTWISE=1 to confirm we are getting the extra configs (note that some of them are getting filtered/scaled out as expected), for:
triton_red_fused_sum_view_22.py
triton_poi_fused_threshold_backward_36 (1D)
triton_poi_fused_slice_13 (2D)
triton_poi_fused__to_copy_index_add_new_zeros_4 (contains the atomic add config)
triton_per_fused_sum_view_23
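A sketch of that verification environment is below. Both environment variables are real PyTorch/Inductor debug knobs, but the workload script is a placeholder, so the final command is only echoed rather than executed.

```shell
# Hedged repro sketch of the test run described above.
export TORCHINDUCTOR_MAX_AUTOTUNE_POINTWISE=1   # enable the extra pointwise autotune configs
export TORCH_COMPILE_DEBUG=1                    # dump generated kernels under torch_compile_debug/
# <your_workload>.py is a placeholder for the actual model/test script.
echo "python <your_workload>.py  # then inspect torch_compile_debug/ for triton_poi_* kernels"
```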
Ran the linter several times to clean the file up.
a5d6423 to badfab0
Jenkins build for badfab0d09d48b0a580339e5119455ce0f30fcc7 commit finished as FAILURE
Ran the test suites; no new regressions were reported.
…ports (#2807)

These are backports based on these upstream PRs. Cherry-picks were performed where possible.
pytorch#163908 (persistent reduction autotune)
pytorch#161280 (reduction)
pytorch#162053 (foreach)
pytorch#163197 (pointwise)
pytorch#166470 (pointwise config for atomic add)

Also included are some additional customer-specific configs which were not upstreamed but are in this backport to 2.9: #2723

Did not backport filter functions such as `_maybe_filter_configs_for_tma_restrictions`:
https://github.com/ROCm/pytorch/blob/release/2.9/torch/_inductor/runtime/triton_heuristics.py#L2614

---------

Co-authored-by: Jack Taylor <jack.taylor@amd.com>
Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com>
Co-authored-by: Sampsa Riikonen <sriikone@amd.com>
Co-authored-by: AmdSampsa <sampsa.riikonen@amd.com>
(cherry picked from commit 7de1214)
…ports (#3006)

This is identical to the release/2.7 PR #2807, but for release/2.8. Because release/2.7 and release/2.8 are similar, changes were first backported to release/2.7 and then cherry-picked into release/2.8. The description from the release/2.7 PR is included below.

These are backports based on these upstream PRs. Cherry-picks were performed where possible.
pytorch#163908 (persistent reduction autotune)
pytorch#161280 (reduction)
pytorch#162053 (foreach)
pytorch#163197 (pointwise)
pytorch#166470 (pointwise config for atomic add)

Also included are some additional customer-specific configs which were not upstreamed but are in this backport to 2.9: #2723

Did not backport filter functions such as `_maybe_filter_configs_for_tma_restrictions`:
https://github.com/ROCm/pytorch/blob/release/2.9/torch/_inductor/runtime/triton_heuristics.py#L2614

---------

(cherry picked from commit 7de1214)

Co-authored-by: Jack Taylor <jack.taylor@amd.com>
Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com>
Co-authored-by: Sampsa Riikonen <sriikone@amd.com>
Co-authored-by: AmdSampsa <sampsa.riikonen@amd.com>
These are backports based on these upstream PRs. Cherry-picks were performed where possible.
pytorch#163908 (persistent reduction autotune)
pytorch#161280 (reduction)
pytorch#162053 (foreach)
pytorch#163197 (pointwise)
pytorch#166470 (pointwise config for atomic add)
Also included are some additional customer-specific configs which were not upstreamed but are in this backport to 2.9: #2723
Did not backport filter functions such as `_maybe_filter_configs_for_tma_restrictions`:
https://github.com/ROCm/pytorch/blob/release/2.9/torch/_inductor/runtime/triton_heuristics.py#L2614