fix: Add cutlass as an mm_fp4 backend in compute capability 12.0 in benchmark code#1959

Merged
yzh119 merged 1 commit into flashinfer-ai:main from bkryu:benchmark_sm120_mm_fp4_cutlass
Oct 21, 2025

Conversation

Collaborator

@bkryu bkryu commented Oct 21, 2025

📌 Description

Previously, backend='cutlass' could not be benchmarked in flashinfer_benchmark.py on compute capability 12.0, even though the kernel itself has been available there. This PR marks the backend as supported.

Example output after this PR, showing the routine is now runnable:

# python3 flashinfer_benchmark.py --routine mm_fp4 --m 1024 --n 7168 --k 512 --out_dtype bfloat16 --backends cudnn cutlass trtllm --use_128x4_sf_layout --use_nvfp4 --refcheck -vv                                                  
[INFO] args = Namespace(routine='mm_fp4', no_cuda_graph=False, use_cupti=False, refcheck=True, allow_output_mismatch=False, random_seed=42, verbose=2, output_path=None, num_iters=30, dry_run_iters=5, case_tag=None, generate_repro_command=False, repro_command='', batch_size=1, m=1024, n=7168, k=512, tile_size=128, group_size=1, scale_major_mode='MN', input_dtype='fp8_e4m3', mat2_dtype='fp8_e4m3', out_dtype='bfloat16', mma_sm=1, backends=['cudnn', 'cutlass', 'trtllm'], use_128x4_sf_layout=True, use_nvfp4=True, autotune=False)
[INFO] Running testMmFp4
[INFO] FlashInfer version: 0.4.1
[VVERBOSE] gpu_name = 'NVIDIA_RTX_PRO_6000_Blackwell_Server_Edition'
[WARNING] trtllm for routine mm_fp4 is not supported on compute capability 12.0. Skipping.
[VVERBOSE] input_fp4.shape = torch.Size([1024, 256])
[VVERBOSE] input_fp4.dtype = torch.uint8
[VVERBOSE] mat2_fp4.shape = torch.Size([7168, 256])
[VVERBOSE] mat2_fp4.dtype = torch.uint8
[PERF] cudnn          :: median time 0.014 ms; std 0.000 ms; achieved tflops 535.891 TFLOPs/sec; achieved tb_per_sec 1.196 TB/sec
[PERF] cutlass        :: median time 0.015 ms; std 0.000 ms; achieved tflops 515.203 TFLOPs/sec; achieved tb_per_sec 1.150 TB/sec

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • Chores
    • Expanded backend support for benchmarking routines on compute capability 12.0, adding compatibility with additional processing backends.

@bkryu bkryu marked this pull request as ready for review October 21, 2025 05:15
@coderabbitai
Contributor

coderabbitai Bot commented Oct 21, 2025

Walkthrough

The pull request updates the backend support configuration for the mm_fp4 routine at compute capability 12.0, expanding supported backends from cudnn only to include both cudnn and cutlass in the benchmarking utilities.

Changes

Cohort / File(s) Change Summary
Backend Configuration Update
benchmarks/routines/flashinfer_benchmark_utils.py
Updated routine_cc_to_supported_backends mapping to add "cutlass" as a supported backend for mm_fp4 at compute capability 12.0, alongside existing "cudnn" support
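The change itself is a one-line edit to that mapping. A minimal sketch of what the structure in benchmarks/routines/flashinfer_benchmark_utils.py might look like — the exact keys and surrounding entries here are assumptions for illustration, not copied from the repository:

```python
# Hypothetical sketch of the mapping updated by this PR.
# Keys pair a routine name with a compute capability string; values list
# the backends the benchmark harness will allow for that pair.
routine_cc_to_supported_backends = {
    ("mm_fp4", "10.0"): ["cudnn", "cutlass", "trtllm"],
    ("mm_fp4", "10.3"): ["cudnn", "cutlass", "trtllm"],
    # Before this PR the 12.0 entry listed only "cudnn"; "cutlass" is now
    # included, while "trtllm" remains unsupported on this capability.
    ("mm_fp4", "12.0"): ["cudnn", "cutlass"],
}


def supported_backends(routine: str, cc: str) -> list[str]:
    """Return the benchmarkable backends for a routine / compute-capability pair."""
    return routine_cc_to_supported_backends.get((routine, cc), [])
```

With an entry like this in place, requesting --backends cudnn cutlass trtllm on an SM 120 GPU benchmarks cudnn and cutlass while trtllm is skipped with a warning, which matches the example output in the PR description.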

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Poem

🐰 Cutlass joins the flashy team,
For benchmarks bright and mm_fp4 dream,
Twelve point zero now runs twice as fast,
With backends paired, a feature to last! ✨

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Title Check ✅ Passed The pull request title "fix: Add cutlass as an mm_fp4 backend in compute capability 12.0 in benchmark code" directly reflects the primary change in the changeset. The title is concise, specific, and clearly communicates that the cutlass backend support is being added for the mm_fp4 routine at compute capability 12.0 in the benchmark utilities. A teammate reviewing the commit history would immediately understand the purpose of this change without ambiguity.
Description Check ✅ Passed The pull request description follows the repository template and provides all required sections. The Description section clearly explains what the PR does—marking the cutlass backend as available for benchmarking mm_fp4 on compute capability 12.0—and includes a helpful example output demonstrating the change works as intended. The Related Issues section is present but empty, which is acceptable for routine updates. The Pull Request Checklist is properly filled out with pre-commit checks and tests marked as complete, indicating the author has followed the contribution process.
Docstring Coverage ✅ Passed No functions found in the changes. Docstring coverage check skipped.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c3f2596 and 34ab3b7.

📒 Files selected for processing (1)
  • benchmarks/routines/flashinfer_benchmark_utils.py (1 hunks)
🔇 Additional comments (1)
benchmarks/routines/flashinfer_benchmark_utils.py (1)

243-243: LGTM! Cutlass backend correctly added for compute capability 12.0.

The change appropriately adds the cutlass backend to the list of supported backends for mm_fp4 on compute capability 12.0, consistent with the support pattern established for compute capabilities 10.0 and 10.3. The omission of "trtllm" for 12.0 aligns with the PR description indicating that backend is skipped on this compute capability.



@yzh119 yzh119 enabled auto-merge (squash) October 21, 2025 16:22
@yzh119 yzh119 merged commit ffcc5f4 into flashinfer-ai:main Oct 21, 2025
4 checks passed
@bkryu bkryu deleted the benchmark_sm120_mm_fp4_cutlass branch October 23, 2025 20:49