[Inductor][NVGEMM] Add infrastructure for registering custom kernels with NVGEMM Cutlass API #176543

Closed
NikhilAPatel wants to merge 3 commits into gh/NikhilAPatel/112/base from gh/NikhilAPatel/112/head

Conversation

[Inductor][NVGEMM] Add infrastructure for vendored CuTeDSL kernel wrappers

Authored with Claude.

[ghstack-poisoned]

pytorch-bot bot commented Mar 5, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/176543

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 1 Cancelled Job, 3 Unrelated Failures

As of commit ff29833 with merge base 572f0d0:

NEW FAILURES - The following jobs have failed:

CANCELLED JOB - The following job was cancelled. Please retry:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.


pytorch-bot bot commented Mar 5, 2026

This PR needs a release notes: label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Update on "[Inductor][NVGEMM] Add infrastructure for vendored CuTeDSL kernel wrappers"

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy kadeng muchulee8 amjames chauhang aakhundov coconutruben jataylo

[ghstack-poisoned]
@NikhilAPatel NikhilAPatel changed the title [Inductor][NVGEMM] Add infrastructure for vendored CuTeDSL kernel wrappers [Inductor][NVGEMM] Add infrastructure for registering custom kernels with NVGEMM Cutlass API Mar 5, 2026
@NikhilAPatel NikhilAPatel marked this pull request as ready for review March 6, 2026 00:03
@NikhilAPatel NikhilAPatel requested a review from mlazos March 6, 2026 00:03
@pytorchmergebot
Collaborator

Starting merge as part of PR stack under #176549

pytorchmergebot pushed a commit that referenced this pull request Mar 9, 2026
Instead of cloning this directly from the Cutlass repo via `setup.py`, we need to own it ourselves inside Inductor so that we can do some tensor mode reordering, since Inductor and this kernel expect the dims ordered differently.

Pull Request resolved: #176546
Approved by: https://github.com/mlazos
ghstack dependencies: #176543, #176544, #176545
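
For illustration, here is a minimal sketch of the kind of tensor mode (dim) reordering this commit describes, assuming a simple permute suffices; the `reorder_modes` helper and the concrete orderings are hypothetical, not the actual Inductor code:

```python
# Hypothetical sketch only: reordering tensor modes (dims) between
# Inductor's layout and the layout a vendored kernel expects.
import torch

def reorder_modes(t: torch.Tensor, kernel_order: tuple) -> torch.Tensor:
    # permute returns a strided view, so no data is copied; only the
    # logical mode order changes.
    return t.permute(kernel_order)

# Example: suppose Inductor hands the kernel a (batch, M, K) operand but
# the kernel wants (M, K, batch).
a = torch.randn(8, 128, 64)
a_kernel = reorder_modes(a, (1, 2, 0))
assert a_kernel.shape == (128, 64, 8)
```
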
pytorchmergebot pushed a commit that referenced this pull request Mar 9, 2026
EmanueleCoradin pushed a commit to EmanueleCoradin/pytorch that referenced this pull request Mar 30, 2026
Revert "[Inductor][NVGEMM] Add infrastructure for registering custom kernels with NVGEMM Cutlass API (pytorch#176543)"

This reverts commit 9e49f44.

Reverted pytorch#176543 on behalf of https://github.com/zou3519 due to broke CI ([comment](pytorch#176543 (comment)))
EmanueleCoradin pushed a commit to EmanueleCoradin/pytorch that referenced this pull request Mar 30, 2026
Needed to remove some docstrings from pytorch#176546 in order to fit in the 2000 LoC limit. This PR adds them back.

Pull Request resolved: pytorch#176549
Approved by: https://github.com/mlazos
ghstack dependencies: pytorch#176543, pytorch#176544, pytorch#176545, pytorch#176546, pytorch#176547, pytorch#176548
AaronWang04 pushed a commit to AaronWang04/pytorch that referenced this pull request Mar 31, 2026
Shared helpers (module-level; a hedged sketch of these follows this commit message):
- _round_up: deduplicated from two test methods
- _prep_k: deduplicated from two test methods
- _create_tensor_with_layout: unified layout creation for all dtypes (float16, bf16, fp8, fp4)
- _nvgemm_config: standard config dict with autotune_fallback_to_aten: False always set

Bug fixes:
- Added missing torch._dynamo.reset() to test_scaled_gemm_mxfp8, test_scaled_gemm_nvf4, test_grouped_gemm, and test_grouped_gemm_varying_offsets
- Moved ceildiv to a top-level import (it was previously imported inside two test methods)

New test coverage:
- test_matmul: added ("contiguous", "aligned_offset") and ("contiguous", "padded") layout combos (7 combos, up from 5)
- test_scaled_gemm_mxfp8: added shape parametrization (4 shapes, was 1)
- test_grouped_gemm: added layout_a parametrization (contiguous, aligned_offset, view, padded)
- test_grouped_gemm_varying_offsets: split out from the original test_grouped_gemm; tests different offset distributions separately
- test_fp8_heuristic_configs: new heuristics integration test for FP8 precision strings

Consistency fixes:
- All tests now use _nvgemm_config() with autotune_fallback_to_aten: False
- TestNVUniversalGemmDynamicShapes also uses _nvgemm_config()

Pull Request resolved: pytorch#176847
Approved by: https://github.com/mlazos
ghstack dependencies: pytorch#176543, pytorch#176544, pytorch#176545, pytorch#176546, pytorch#176547, pytorch#176548, pytorch#176549, pytorch#176845
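
To make the test refactor above concrete, here is a minimal, hypothetical sketch of one shared helper and the torch._dynamo.reset() fix. The helper names and the autotune_fallback_to_aten: False key come from the commit message; everything else (the bodies, the other config keys, and the usage pattern) is an assumption, not the actual test code:

```python
# Hypothetical sketch: helper names taken from the commit message,
# bodies and usage assumed.
import torch
import torch._dynamo
import torch._inductor.config as inductor_config

def _round_up(x: int, multiple: int) -> int:
    # Round x up to the nearest multiple (useful when building the
    # padded/aligned tensor layouts the tests exercise).
    return ((x + multiple - 1) // multiple) * multiple

def _nvgemm_config(**overrides) -> dict:
    # Standard config dict for these tests. Per the commit message,
    # autotune_fallback_to_aten is always False so a test fails loudly
    # if the NVGEMM kernel is not selected, instead of silently falling
    # back to ATen.
    cfg = {
        "max_autotune": True,
        "autotune_fallback_to_aten": False,
    }
    cfg.update(overrides)
    return cfg

# Assumed usage pattern inside a test method:
def test_matmul_sketch():
    torch._dynamo.reset()  # the bug fix: clear compile caches between runs
    with inductor_config.patch(_nvgemm_config()):
        compiled = torch.compile(torch.mm)
        a = torch.randn(128, 64, device="cuda", dtype=torch.float16)
        b = torch.randn(64, 128, device="cuda", dtype=torch.float16)
        torch.testing.assert_close(compiled(a, b), a @ b)
```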