[Inductor][NVGEMM] Enable nvMatmulHeuristics for FP4 blockscaled GEMM #176548
NikhilAPatel wants to merge 9 commits into gh/NikhilAPatel/117/base
Conversation
Authored with Claude. [ghstack-poisoned]
🔗 Helpful Links: 🧪 see artifacts and rendered test results at hud.pytorch.org/pr/176548
Note: links to docs will display an error until the docs builds have been completed.
❌ 2 New Failures, 3 Unrelated Failures as of commit af585a9 with merge base 572f0d0:
NEW FAILURES: the following jobs have failed.
FLAKY: the following jobs failed but were likely due to flakiness present on trunk.
UNSTABLE: the following job is marked as unstable, possibly due to flakiness on trunk.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
…scaled GEMM" cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy kadeng muchulee8 amjames chauhang aakhundov coconutruben jataylo [ghstack-poisoned]
mlazos left a comment:
Should we test this as well?
Starting merge as part of PR stack under #176549
…led GEMM (#176548)" This reverts commit 9db6179. Reverted #176548 on behalf of https://github.com/zou3519 because it broke CI ([comment](#176543 (comment)))
Shared helpers (module-level):
- _round_up — deduplicated from two test methods
- _prep_k — deduplicated from two test methods
- _create_tensor_with_layout — unified layout creation for all dtypes (float16, bf16, fp8, fp4)
- _nvgemm_config — standard config dict with autotune_fallback_to_aten: False always set
Bug fixes:
- Added missing torch._dynamo.reset() to test_scaled_gemm_mxfp8, test_scaled_gemm_nvf4, test_grouped_gemm, test_grouped_gemm_varying_offsets
- ceildiv moved to top-level import (was imported inside two test methods)
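`ceildiv` is the standard ceiling-division helper; hoisting it to a top-level import just means it is defined once instead of inside each test method. For reference, the identity it implements is:

```python
def ceildiv(a: int, b: int) -> int:
    # Ceiling division without floats: ceil(a / b) for positive b.
    # Equivalent to (a + b - 1) // b.
    return -(-a // b)
```

In blockscaled-GEMM tests this typically computes how many scale blocks cover a K dimension, e.g. `ceildiv(100, 16)` blocks of 16 elements (illustrative use, not taken from the PR).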
New test coverage:
- test_matmul: added ("contiguous", "aligned_offset") and ("contiguous", "padded") layout combos (7 combos, up from 5)
- test_scaled_gemm_mxfp8: added shape parametrization (4 shapes, was 1)
- test_grouped_gemm: added layout_a parametrization (contiguous, aligned_offset, view, padded)
- test_grouped_gemm_varying_offsets: split out from original test_grouped_gemm — tests different offset distributions separately
- test_fp8_heuristic_configs: new heuristics integration test for FP8 precision strings
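The layout coverage above can be driven by a plain parametrized list. In the sketch below, only the two `("contiguous", ...)` pairs are confirmed by the PR text; the other five combos and the helper are hypothetical, included just to show the 5-to-7 expansion:

```python
# Illustrative parametrization; all combos except the two marked
# "added in this PR" are guesses at the pre-existing five.
LAYOUTS = ["contiguous", "aligned_offset", "view", "padded"]

MATMUL_COMBOS = [
    ("contiguous", "contiguous"),
    ("aligned_offset", "contiguous"),
    ("view", "contiguous"),
    ("padded", "contiguous"),
    ("view", "view"),
    ("contiguous", "aligned_offset"),  # added in this PR
    ("contiguous", "padded"),          # added in this PR
]


def iter_cases():
    # Yield each (layout_a, layout_b) pair, sanity-checking the names.
    for layout_a, layout_b in MATMUL_COMBOS:
        assert layout_a in LAYOUTS and layout_b in LAYOUTS
        yield layout_a, layout_b
```

A flat list like this keeps the test body a single loop (or `parametrize` decorator) while making the added combos visible in the diff.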
Consistency fixes:
- All tests now use _nvgemm_config() with autotune_fallback_to_aten: False
- TestNVUniversalGemmDynamicShapes also uses _nvgemm_config()
Pull Request resolved: #176847
Approved by: https://github.com/mlazos
ghstack dependencies: #176543, #176544, #176545, #176546, #176547, #176548, #176549, #176845
…rch#176845) Pull Request resolved: pytorch#176845 Approved by: https://github.com/mlazos ghstack dependencies: pytorch#176543, pytorch#176544, pytorch#176545, pytorch#176546, pytorch#176547, pytorch#176548, pytorch#176549
…ytorch#176859) Pull Request resolved: pytorch#176859 Approved by: https://github.com/mlazos ghstack dependencies: pytorch#176543, pytorch#176544, pytorch#176545, pytorch#176546, pytorch#176547, pytorch#176548, pytorch#176549, pytorch#176845, pytorch#176847
…pytorch#176548) Pull Request resolved: pytorch#176548 Approved by: https://github.com/mlazos ghstack dependencies: pytorch#176543, pytorch#176544, pytorch#176545, pytorch#176546, pytorch#176547
Needed to remove some docstrings from pytorch#176546 in order to fit in the 2000 LoC limit. This PR adds them back. Pull Request resolved: pytorch#176549 Approved by: https://github.com/mlazos ghstack dependencies: pytorch#176543, pytorch#176544, pytorch#176545, pytorch#176546, pytorch#176547, pytorch#176548
Stack from ghstack (oldest at bottom):
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo