fix: Add cutlass as an mm_fp4 backend in compute capability 12.0 in benchmark code#1959

Merged
yzh119 merged 1 commit into flashinfer-ai:main from bkryu:benchmark_sm120_mm_fp4_cutlass
Oct 21, 2025

Conversation

Collaborator

@bkryu bkryu commented Oct 21, 2025

📌 Description

Previously, backend='cutlass' could not be benchmarked in flashinfer_benchmark.py on compute capability 12.0, even though the kernel itself has been available there. This PR marks the backend as supported.

Example output after this PR, showing the routine is now runnable:

# python3 flashinfer_benchmark.py --routine mm_fp4 --m 1024 --n 7168 --k 512 --out_dtype bfloat16 --backends cudnn cutlass trtllm --use_128x4_sf_layout --use_nvfp4 --refcheck -vv                                                  
[INFO] args = Namespace(routine='mm_fp4', no_cuda_graph=False, use_cupti=False, refcheck=True, allow_output_mismatch=False, random_seed=42, verbose=2, output_path=None, num_iters=30, dry_run_iters=5, case_tag=None, generate_repro_command=False, repro_command='', batch_size=1, m=1024, n=7168, k=512, tile_size=128, group_size=1, scale_major_mode='MN', input_dtype='fp8_e4m3', mat2_dtype='fp8_e4m3', out_dtype='bfloat16', mma_sm=1, backends=['cudnn', 'cutlass', 'trtllm'], use_128x4_sf_layout=True, use_nvfp4=True, autotune=False)
[INFO] Running testMmFp4
[INFO] FlashInfer version: 0.4.1
[VVERBOSE] gpu_name = 'NVIDIA_RTX_PRO_6000_Blackwell_Server_Edition'
[WARNING] trtllm for routine mm_fp4 is not supported on compute capability 12.0. Skipping.
[VVERBOSE] input_fp4.shape = torch.Size([1024, 256])
[VVERBOSE] input_fp4.dtype = torch.uint8
[VVERBOSE] mat2_fp4.shape = torch.Size([7168, 256])
[VVERBOSE] mat2_fp4.dtype = torch.uint8
[PERF] cudnn          :: median time 0.014 ms; std 0.000 ms; achieved tflops 535.891 TFLOPs/sec; achieved tb_per_sec 1.196 TB/sec
[PERF] cutlass        :: median time 0.015 ms; std 0.000 ms; achieved tflops 515.203 TFLOPs/sec; achieved tb_per_sec 1.150 TB/sec

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • Chores
    • Expanded backend support for benchmarking routines on compute capability 12.0, adding compatibility with additional processing backends.

@bkryu bkryu marked this pull request as ready for review October 21, 2025 05:15
@coderabbitai
Contributor

coderabbitai Bot commented Oct 21, 2025

Walkthrough

The pull request updates the backend support configuration for the mm_fp4 routine at compute capability 12.0, expanding supported backends from cudnn only to include both cudnn and cutlass in the benchmarking utilities.

Changes

Cohort / File(s) Change Summary
Backend Configuration Update
benchmarks/routines/flashinfer_benchmark_utils.py
Updated routine_cc_to_supported_backends mapping to add "cutlass" as a supported backend for mm_fp4 at compute capability 12.0, alongside existing "cudnn" support
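The change itself is a one-line edit to that mapping. A minimal sketch of what the structure in benchmarks/routines/flashinfer_benchmark_utils.py might look like — the exact keys and surrounding entries here are assumptions for illustration, not copied from the repository:

```python
# Hypothetical sketch of the mapping updated by this PR.
# Keys pair a routine name with a compute capability string; values list
# the backends the benchmark harness will allow for that pair.
routine_cc_to_supported_backends = {
    ("mm_fp4", "10.0"): ["cudnn", "cutlass", "trtllm"],
    ("mm_fp4", "10.3"): ["cudnn", "cutlass", "trtllm"],
    # Before this PR the 12.0 entry listed only "cudnn"; "cutlass" is now
    # included, while "trtllm" remains unsupported on this capability.
    ("mm_fp4", "12.0"): ["cudnn", "cutlass"],
}


def supported_backends(routine: str, cc: str) -> list[str]:
    """Return the benchmarkable backends for a routine / compute-capability pair."""
    return routine_cc_to_supported_backends.get((routine, cc), [])
```

With an entry like this in place, requesting --backends cudnn cutlass trtllm on an SM 120 GPU benchmarks cudnn and cutlass while trtllm is skipped with a warning, which matches the example output in the PR description.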

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Poem

🐰 Cutlass joins the flashy team,
For benchmarks bright and mm_fp4 dream,
Twelve point zero now runs twice as fast,
With backends paired, a feature to last! ✨

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Title Check ✅ Passed The pull request title "fix: Add cutlass as an mm_fp4 backend in compute capability 12.0 in benchmark code" directly reflects the primary change in the changeset. The title is concise, specific, and clearly communicates that the cutlass backend support is being added for the mm_fp4 routine at compute capability 12.0 in the benchmark utilities. A teammate reviewing the commit history would immediately understand the purpose of this change without ambiguity.
Description Check ✅ Passed The pull request description follows the repository template and provides all required sections. The Description section clearly explains what the PR does—marking the cutlass backend as available for benchmarking mm_fp4 on compute capability 12.0—and includes a helpful example output demonstrating the change works as intended. The Related Issues section is present but empty, which is acceptable for routine updates. The Pull Request Checklist is properly filled out with pre-commit checks and tests marked as complete, indicating the author has followed the contribution process.
Docstring Coverage ✅ Passed No functions found in the changes. Docstring coverage check skipped.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c3f2596 and 34ab3b7.

📒 Files selected for processing (1)
  • benchmarks/routines/flashinfer_benchmark_utils.py (1 hunks)
🔇 Additional comments (1)
benchmarks/routines/flashinfer_benchmark_utils.py (1)

243-243: LGTM! Cutlass backend correctly added for compute capability 12.0.

The change appropriately adds the cutlass backend to the list of supported backends for mm_fp4 on compute capability 12.0, consistent with the support pattern established for compute capabilities 10.0 and 10.3. The omission of "trtllm" for 12.0 aligns with the PR description indicating that backend is skipped on this compute capability.



@yzh119 yzh119 enabled auto-merge (squash) October 21, 2025 16:22
@yzh119 yzh119 merged commit ffcc5f4 into flashinfer-ai:main Oct 21, 2025
4 checks passed
@bkryu bkryu deleted the benchmark_sm120_mm_fp4_cutlass branch October 23, 2025 20:49