
[Inductor][NVGEMM] Register CuTeDSL Blockscaled GEMM with NVGEMM Backend#176547

Closed
NikhilAPatel wants to merge 9 commits into gh/NikhilAPatel/116/base from gh/NikhilAPatel/116/head

Conversation

@pytorch-bot

pytorch-bot bot commented Mar 5, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/176547

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 3 Unrelated Failures

As of commit 286a853 with merge base 572f0d0:

NEW FAILURE - The following job has failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot

pytorch-bot bot commented Mar 5, 2026

This PR needs a release notes: label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

…upport"

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy kadeng muchulee8 amjames chauhang aakhundov coconutruben jataylo

[ghstack-poisoned]
@NikhilAPatel NikhilAPatel changed the title [Inductor][NVGEMM] Add blockscaled GEMM wrapper with FP4 support [Inductor][NVGEMM] Register CuTeDSL Blockscaled GEMM with NVGEMM Mar 5, 2026
@NikhilAPatel NikhilAPatel changed the title [Inductor][NVGEMM] Register CuTeDSL Blockscaled GEMM with NVGEMM [Inductor][NVGEMM] Register CuTeDSL Blockscaled GEMM with NVGEMM Backend Mar 5, 2026
@NikhilAPatel NikhilAPatel marked this pull request as ready for review March 6, 2026 00:04
@NikhilAPatel NikhilAPatel requested a review from mlazos March 6, 2026 00:10
Contributor

@mlazos mlazos left a comment


One testing comment, otherwise looks good.

…NVGEMM Backend"

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy kadeng muchulee8 amjames chauhang aakhundov coconutruben jataylo

[ghstack-poisoned]
@pytorchmergebot
Collaborator

Starting merge as part of PR stack under #176549

dshi7 pushed a commit to dshi7/pytorch that referenced this pull request Mar 23, 2026
  Shared helpers (module-level):
  - _round_up — deduplicated from two test methods
  - _prep_k — deduplicated from two test methods
  - _create_tensor_with_layout — unified layout creation for all dtypes (float16, bf16, fp8, fp4)
  - _nvgemm_config — standard config dict with autotune_fallback_to_aten: False always set

  Bug fixes:
  - Added missing torch._dynamo.reset() to test_scaled_gemm_mxfp8, test_scaled_gemm_nvf4, test_grouped_gemm, test_grouped_gemm_varying_offsets
  - ceildiv moved to top-level import (was imported inside two test methods)

  New test coverage:
  - test_matmul: added ("contiguous", "aligned_offset") and ("contiguous", "padded") layout combos (7 combos, up from 5)
  - test_scaled_gemm_mxfp8: added shape parametrization (4 shapes, was 1)
  - test_grouped_gemm: added layout_a parametrization (contiguous, aligned_offset, view, padded)
  - test_grouped_gemm_varying_offsets: split out from original test_grouped_gemm — tests different offset distributions separately
  - test_fp8_heuristic_configs: new heuristics integration test for FP8 precision strings

  Consistency fixes:
  - All tests now use _nvgemm_config() with autotune_fallback_to_aten: False
  - TestNVUniversalGemmDynamicShapes also uses _nvgemm_config()

Pull Request resolved: pytorch#176847
Approved by: https://github.com/mlazos
ghstack dependencies: pytorch#176543, pytorch#176544, pytorch#176545, pytorch#176546, pytorch#176547, pytorch#176548, pytorch#176549, pytorch#176845
EmanueleCoradin pushed a commit to EmanueleCoradin/pytorch that referenced this pull request Mar 30, 2026
Needed to remove some docstrings from pytorch#176546 in order to fit in the 2000 LoC limit. This PR adds them back.

Pull Request resolved: pytorch#176549
Approved by: https://github.com/mlazos
ghstack dependencies: pytorch#176543, pytorch#176544, pytorch#176545, pytorch#176546, pytorch#176547, pytorch#176548
AaronWang04 pushed a commit to AaronWang04/pytorch that referenced this pull request Mar 31, 2026
nklshy-aws pushed a commit to nklshy-aws/pytorch that referenced this pull request Apr 7, 2026
@github-actions github-actions bot deleted the gh/NikhilAPatel/116/head branch April 10, 2026 02:26

Labels

ci-no-td (Do not run TD on this PR), ciflow/b200, ciflow/inductor, ciflow/torchtitan (Run TorchTitan integration tests), ciflow/trunk (Trigger trunk jobs on your pull request), Merged, module: inductor, Reverted, topic: not user facing (topic category)
