[Inductor][Triton][FP8] Add a Blackwell-specific scaled persistent + TMA template for GEMMs by jananisriram · Pull Request #163147 · pytorch/pytorch

jananisriram · 2025-09-17T06:19:17Z

Summary:
X-link: meta-pytorch/tritonbench#432

Add a Blackwell-specific scaled persistent + TMA Triton template to Inductor. This diff builds on D82515450 by adding a new set of mixins that inherit the scaling epilogue and add the scaled persistent + TMA kwargs to the template.
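
As a rough illustration of the mixin composition described above, the sketch below shows how one mixin could contribute the persistent + TMA kwargs while another contributes the scaling epilogue, with a combined class inheriting both. Every class name, method name, and kwarg key here is hypothetical and chosen only for illustration; none of them are taken from `torch/_inductor/template_heuristics/triton.py`.

```python
class TemplateConfigBase:
    """Hypothetical base: kwargs shared by every GEMM template."""

    def template_kwargs(self) -> dict:
        # Illustrative tile sizes only; not the values Inductor uses.
        return {"BLOCK_M": 128, "BLOCK_N": 128, "BLOCK_K": 64}


class PersistentTMAMixin(TemplateConfigBase):
    """Hypothetical mixin: adds persistent + TMA options."""

    def template_kwargs(self) -> dict:
        kwargs = super().template_kwargs()
        kwargs.update({"PERSISTENT": True, "USE_TMA": True})
        return kwargs


class ScaledEpilogueMixin(TemplateConfigBase):
    """Hypothetical mixin: requests a scaling epilogue."""

    def template_kwargs(self) -> dict:
        kwargs = super().template_kwargs()
        kwargs["EPILOGUE"] = "apply_scales"
        return kwargs


class ScaledPersistentTMAConfig(ScaledEpilogueMixin, PersistentTMAMixin):
    """Combines both mixins; no new kernel logic is defined here."""


# Cooperative super() calls walk the MRO, so each mixin adds its own keys.
config_kwargs = ScaledPersistentTMAConfig().template_kwargs()
print(config_kwargs)
```

The point of the sketch is only that the scaled variant can be assembled from existing pieces through cooperative `super()` calls, which matches the "minimal extension" framing of this diff.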

This diff also adds a benchmark for the scaled Blackwell persistent + TMA template to TritonBench fp8_gemm.

Note that this diff is a minimal extension of D82515450: rather than adding a new kernel for the scaled version, we opted to extend the epilogue to apply the scales. The template is accurate for per-tensor and per-row scaling, but may require modifications for other scaling modes, such as DeepSeek-style scaling, which apply scaling prior to the GEMM computation.
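
To see why applying the scales in the epilogue is sufficient for per-tensor and per-row scaling, note that a scale depending only on the output row `i` or output column `j` factors out of the sum over `k`. The pure-Python sketch below checks this on a tiny example; the helper names and the input values are illustrative assumptions, not code from this diff, and the scales are powers of two so the floating-point comparison is exact.

```python
def matmul(a, b):
    """Plain triple-loop matrix multiply on nested lists."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


def scaled_mm_epilogue(a_q, b_q, scale_a, scale_b):
    """Unscaled GEMM first, then multiply output (i, j) by scale_a[i] * scale_b[j]."""
    acc = matmul(a_q, b_q)
    return [
        [acc[i][j] * scale_a[i] * scale_b[j] for j in range(len(acc[0]))]
        for i in range(len(acc))
    ]


def scaled_mm_reference(a_q, b_q, scale_a, scale_b):
    """Rescale the operands first (rows of A, columns of B), then multiply."""
    a = [[scale_a[i] * v for v in row] for i, row in enumerate(a_q)]
    b = [[scale_b[j] * v for j, v in enumerate(row)] for row in b_q]
    return matmul(a, b)


a_q = [[1, 2, 3], [4, 5, 6]]    # 2 x 3 "quantized" A
b_q = [[1, 0], [2, 1], [0, 3]]  # 3 x 2 "quantized" B

# Per-row scaling: one scale per row of A and one per column of B.
rowwise_epilogue = scaled_mm_epilogue(a_q, b_q, [0.5, 2.0], [0.25, 4.0])
rowwise_reference = scaled_mm_reference(a_q, b_q, [0.5, 2.0], [0.25, 4.0])

# Per-tensor scaling is the special case where every row/column shares one scale.
tensor_epilogue = scaled_mm_epilogue(a_q, b_q, [0.5, 0.5], [4.0, 4.0])
tensor_reference = scaled_mm_reference(a_q, b_q, [0.5, 0.5], [4.0, 4.0])

print(rowwise_epilogue == rowwise_reference, tensor_epilogue == tensor_reference)
```

A scale that varies along the reduction dimension `k` does not factor out of the sum in this way, which is consistent with the caveat above that scaling modes applied prior to the GEMM may need a different template.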

In addition, note that epilogue subtiling is currently unsupported for both the scaled and non-scaled Blackwell templates; support will be added in a subsequent diff.

Test Plan:
Verified, by inspecting the Inductor-generated Triton kernel, that the scaled Blackwell template adds the scaling epilogue.

Benchmarking command:

TRITON_PRINT_AUTOTUNING=1 TORCHINDUCTOR_CACHE_DIR=~/personal/cache_dir_inductor TRITON_CACHE_DIR=~/personal/cache_dir_triton TRITON_ALWAYS_COMPILE=1 TORCH_LOGS=+inductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 ENABLE_PERSISTENT_TMA_MATMUL=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 buck2 run mode/{opt,inplace} pytorch/tritonbench:run -c fbcode.nvcc_arch=b200a -c fbcode.enable_gpu_sections=true -c fbcode.platform010_cuda_version=12.8 -- --op fp8_gemm --only torch_fp8_gemm,blackwell_pt2_fp8_gemm --metrics tflops,accuracy --input-loader=/home/jananisriram/personal/fp8_shapes_testing.json --scaling_rowwise --output="/home/jananisriram/personal/fp8_shapes_testing_results.csv" --atol=1e-2 --rtol=0.5 2>&1 | tee ~/personal/fp8_shapes_testing.log

Rollback Plan:

Differential Revision: D82597111

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben

pytorch-bot · 2025-09-17T06:19:21Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/163147


✅ No Failures

As of commit 52e36fa with merge base ddc56f6:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot · 2025-09-17T06:19:26Z

@jananisriram has exported this pull request. If you are a Meta employee, you can view the originating diff in D82597111.


njriasan

LGTM! Thanks!

test/inductor/test_max_autotune.py

torch/_inductor/template_heuristics/triton.py


jananisriram · 2025-09-19T17:16:08Z

@pytorchbot merge

pytorchmergebot · 2025-09-19T17:17:59Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

…TMA template for GEMMs (pytorch#163147)

Summary:
X-link: meta-pytorch/tritonbench#432

Add a Blackwell-specific scaled persistent + TMA Triton template to Inductor. This diff builds on D82515450 by adding a new set of mixins which inherit the scaling epilogue and add scaled persistent + TMA kwargs to the template.

This diff also adds a benchmark for the scaled Blackwell persistent + TMA template to TritonBench `fp8_gemm`.

Note that this diff is a minimal extension to the above diff; rather than adding a new kernel for the scaled version, we opted to simply extend the epilogue to account for scaling. This template is accurate for per-tensor and per-row scaling but may require modifications for other scaling modes, such as deepseek-style scaling, which apply scaling prior to the GEMM computation.

In addition, note that epilogue subtiling is currently unsupported for both the scaled and non-scaled Blackwell templates, and functionality will be added in a subsequent diff.

Test Plan:
Verified that the scaled Blackwell template adds the scaling epilogue to the generated Triton kernel by inspecting the Inductor-generated Triton kernel.

Benchmarking command:

```
TRITON_PRINT_AUTOTUNING=1 TORCHINDUCTOR_CACHE_DIR=~/personal/cache_dir_inductor TRITON_CACHE_DIR=~/personal/cache_dir_triton TRITON_ALWAYS_COMPILE=1 TORCH_LOGS=+inductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 ENABLE_PERSISTENT_TMA_MATMUL=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 buck2 run mode/{opt,inplace} pytorch/tritonbench:run -c fbcode.nvcc_arch=b200a -c fbcode.enable_gpu_sections=true -c fbcode.platform010_cuda_version=12.8 -- --op fp8_gemm --only torch_fp8_gemm,blackwell_pt2_fp8_gemm --metrics tflops,accuracy --input-loader=/home/jananisriram/personal/fp8_shapes_testing.json --scaling_rowwise --output="/home/jananisriram/personal/fp8_shapes_testing_results.csv" --atol=1e-2 --rtol=0.5 2>&1 | tee ~/personal/fp8_shapes_testing.log
```

Rollback Plan:

Differential Revision: D82597111

Pull Request resolved: pytorch#163147
Approved by: https://github.com/njriasan

pytorch-bot bot added ciflow/inductor module: inductor labels Sep 17, 2025

facebook-github-bot added fb-exported meta-exported labels Sep 17, 2025

facebook-github-bot force-pushed the export-D82597111 branch from ef53dc0 to a711aab Compare September 18, 2025 23:27

jananisriram added the topic: not user facing topic category label Sep 18, 2025

jananisriram requested a review from njriasan September 18, 2025 23:28

njriasan approved these changes Sep 18, 2025

test/inductor/test_max_autotune.py (resolved)

torch/_inductor/template_heuristics/triton.py (outdated, resolved)

pytorch-bot bot added the ciflow/trunk label Sep 18, 2025

facebook-github-bot force-pushed the export-D82597111 branch from a711aab to edab96f Compare September 18, 2025 23:44

facebook-github-bot force-pushed the export-D82597111 branch from edab96f to 52e36fa Compare September 18, 2025 23:52

pytorchmergebot added the merging label Sep 19, 2025

pytorchmergebot added the Merged label Sep 19, 2025

pytorchmergebot closed this in 3e663ce Sep 19, 2025

pytorchmergebot removed the merging label Sep 19, 2025

github-actions bot deleted the export-D82597111 branch October 20, 2025 02:18

jananisriram wants to merge 1 commit into main from export-D82597111
