
[Inductor][NVGEMM] Register CuTeDSL Blockscaled GEMM with NVGEMM Backend#176547

Closed
NikhilAPatel wants to merge 9 commits into gh/NikhilAPatel/116/base from gh/NikhilAPatel/116/head

Conversation

@pytorch-bot

pytorch-bot bot commented Mar 5, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/176547

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 3 Unrelated Failures

As of commit 286a853 with merge base 572f0d0:

NEW FAILURE - The following job has failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot

pytorch-bot bot commented Mar 5, 2026

This PR needs a release notes: label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

…upport"

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy kadeng muchulee8 amjames chauhang aakhundov coconutruben jataylo

[ghstack-poisoned]
@NikhilAPatel NikhilAPatel changed the title [Inductor][NVGEMM] Add blockscaled GEMM wrapper with FP4 support [Inductor][NVGEMM] Register CuTeDSL Blockscaled GEMM with NVGEMM Mar 5, 2026
@NikhilAPatel NikhilAPatel changed the title [Inductor][NVGEMM] Register CuTeDSL Blockscaled GEMM with NVGEMM [Inductor][NVGEMM] Register CuTeDSL Blockscaled GEMM with NVGEMM Backend Mar 5, 2026
@NikhilAPatel NikhilAPatel marked this pull request as ready for review March 6, 2026 00:04
@NikhilAPatel NikhilAPatel requested a review from mlazos March 6, 2026 00:10
Contributor

@mlazos mlazos left a comment


One testing comment, otherwise looks good.

…NVGEMM Backend"

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy kadeng muchulee8 amjames chauhang aakhundov coconutruben jataylo

[ghstack-poisoned]
@pytorchmergebot
Collaborator

Starting merge as part of PR stack under #176549

dshi7 pushed a commit to dshi7/pytorch that referenced this pull request Mar 23, 2026
  Shared helpers (module-level):
  - _round_up — deduplicated from two test methods
  - _prep_k — deduplicated from two test methods
  - _create_tensor_with_layout — unified layout creation for all dtypes (float16, bf16, fp8, fp4)
  - _nvgemm_config — standard config dict with autotune_fallback_to_aten: False always set

  Bug fixes:
  - Added missing torch._dynamo.reset() to test_scaled_gemm_mxfp8, test_scaled_gemm_nvf4, test_grouped_gemm, test_grouped_gemm_varying_offsets
  - ceildiv moved to top-level import (was imported inside two test methods)

  New test coverage:
  - test_matmul: added ("contiguous", "aligned_offset") and ("contiguous", "padded") layout combos (7 combos, up from 5)
  - test_scaled_gemm_mxfp8: added shape parametrization (4 shapes, was 1)
  - test_grouped_gemm: added layout_a parametrization (contiguous, aligned_offset, view, padded)
  - test_grouped_gemm_varying_offsets: split out from original test_grouped_gemm — tests different offset distributions separately
  - test_fp8_heuristic_configs: new heuristics integration test for FP8 precision strings

  Consistency fixes:
  - All tests now use _nvgemm_config() with autotune_fallback_to_aten: False
  - TestNVUniversalGemmDynamicShapes also uses _nvgemm_config()

Pull Request resolved: pytorch#176847
Approved by: https://github.com/mlazos
ghstack dependencies: pytorch#176543, pytorch#176544, pytorch#176545, pytorch#176546, pytorch#176547, pytorch#176548, pytorch#176549, pytorch#176845
EmanueleCoradin pushed a commit to EmanueleCoradin/pytorch that referenced this pull request Mar 30, 2026
Needed to remove some docstrings from pytorch#176546 in order to fit in the 2000 LoC limit. This PR adds them back.

Pull Request resolved: pytorch#176549
Approved by: https://github.com/mlazos
ghstack dependencies: pytorch#176543, pytorch#176544, pytorch#176545, pytorch#176546, pytorch#176547, pytorch#176548
AaronWang04 pushed a commit to AaronWang04/pytorch that referenced this pull request Mar 31, 2026
nklshy-aws pushed a commit to nklshy-aws/pytorch that referenced this pull request Apr 7, 2026
@github-actions github-actions bot deleted the gh/NikhilAPatel/116/head branch April 10, 2026 02:26

Labels

ci-no-td (Do not run TD on this PR), ciflow/b200, ciflow/inductor, ciflow/torchtitan (Run TorchTitan integration tests), ciflow/trunk (Trigger trunk jobs on your pull request), Merged, module: inductor, Reverted, topic: not user facing (topic category)
