Support checks PoC #1809
Conversation
The checks currently live very far away from the implementation, and keeping them consistent with each other can eventually become a maintenance problem. The conditional checks are also quite tricky to get right. For example, it's not easy to tell whether the mxfp4 checks are correct:

```python
if not use_nvfp4 and block_size != 32:
    raise ValueError("mxfp4 supports block_size = 32.")
if backend != "cudnn" and not use_nvfp4:
    raise ValueError("Only cudnn FP4 GEMM supports mxfp4 quantization.")
```

Shouldn't the checks be reordered to avoid confusing error messages?
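For illustration, one possible reordering — a sketch only, not code from the PR, and the helper name is hypothetical:

```python
# Sketch only (hypothetical helper): the backend is checked before block_size,
# so an mxfp4 caller on a non-cudnn backend gets the relevant error first
# instead of a block_size complaint.
def _check_mxfp4_args(backend: str, use_nvfp4: bool, block_size: int) -> None:
    if use_nvfp4:
        return  # nvfp4 arguments are validated elsewhere
    if backend != "cudnn":
        raise ValueError("Only cudnn FP4 GEMM supports mxfp4 quantization.")
    if block_size != 32:
        raise ValueError("mxfp4 supports block_size = 32.")
```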
Instead of having one top-level check, the requirements could be attached to each backend.
For example:

```python
def cudnn_gemm_fp4_requirement(
    # ...
):
    if (
        not use_nvfp4
        and _match_sm_version(a.device, ["120"])
        and cudnn.backend_version() < 91400
    ):
        raise LibraryError(
            "cudnn FP4 GEMM with mxfp4 quantization is not supported on SM120 with cuDNN backend version < 9.14.0."
        )
    _check_cudnn_fp4_availability()
    # ...

@requirement(cudnn_gemm_fp4_requirement, capability=["100", "101", "102"])
def execute_cudnn_gemm_fp4_graph(
    # ...
):
    ...

@backend_requirement({
    "cudnn": execute_cudnn_gemm_fp4_graph.requirement,
    "trtllm": ...,  # ...
})
def mm_fp4(
    # ...
):
    ...
```

This also means that all requirements are enforced to be local to their backend and won't affect each other.
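For illustration only, a hypothetical sketch of what a `requirement` decorator along these lines could look like; the name and the exposed `.requirement` attribute follow the snippet above, everything else is an assumption rather than actual FlashInfer code:

```python
import functools

def requirement(check_fn, capability=None):
    """Hypothetical sketch: attach a requirement check to a backend kernel.

    `check_fn` raises if the arguments or hardware are unsupported; `capability`
    is accepted here but not enforced in this sketch.
    """
    def wrap(kernel):
        @functools.wraps(kernel)
        def inner(*args, **kwargs):
            check_fn(*args, **kwargs)  # raises on unsupported configurations
            return kernel(*args, **kwargs)
        inner.requirement = check_fn   # exposed so @backend_requirement can reuse it
        return inner
    return wrap
```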
Thank you @nvjullin for the excellent suggestion.
While both are valid points, we prioritize separating the checks for now. Not all APIs are as clean to separate (2) at the moment, and there is a plan for a more OO Backend class @Anerudhan. That does overlap somewhat with the support checks, as eventually we would be able to do something like `cudnn_backend->check_mmfp4_support()`. So I think, as an intermediate step and to get tighter checks in, we could do something like this:

```python
@supported_compute_capability(["100", "101", "102"])
def cudnn_gemm_fp4_requirement(
    # ...
):
    if (
        not use_nvfp4
        and _match_sm_version(a.device, ["120"])
        and cudnn.backend_version() < 91400
    ):
        raise LibraryError(
            "cudnn FP4 GEMM with mxfp4 quantization is not supported on SM120 with cuDNN backend version < 9.14.0."
        )
    _check_cudnn_fp4_availability()
    # ...

@backend_requirement(
    {
        "cudnn": execute_cudnn_gemm_fp4_graph.requirement,
        "trtllm": ...,  # ...
    },
    common_check=common_fp4_checks,  # to be called by all backend checks
)
def mm_fp4(
    # ...
):
    if backend == "cudnn":
        # cudnn path
        ...
    elif backend == "trtllm":
        # trtllm path
        ...
```
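Likewise illustrative only: a hypothetical sketch of how `backend_requirement` with a `common_check` and `supported_compute_capability` might fit together; the decorator names come from the comment above, and the implementations are assumptions, not FlashInfer's actual code:

```python
import functools

def supported_compute_capability(capabilities):
    """Hypothetical sketch: tag a requirement function with the SM versions it covers."""
    def wrap(check_fn):
        check_fn.supported_capabilities = list(capabilities)
        return check_fn
    return wrap

def backend_requirement(backend_checks, common_check=None):
    """Hypothetical sketch: run the shared check, then the selected backend's check."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, backend="cudnn", **kwargs):
            if common_check is not None:
                common_check(*args, **kwargs)        # checks shared by all backends
            backend_checks[backend](*args, **kwargs)  # backend-specific check
            return fn(*args, backend=backend, **kwargs)
        return inner
    return wrap
```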
I wasn't aware, thanks for the info. LGTM.
Thank you for the excellent suggestions, @nvjullin |
Force-pushed 3c9f687 to 151fc7e
/bot run
aleozlx left a comment:
looks like a good step forward
[SUCCESS] Pipeline #36524696: 13/17 passed
sricketts left a comment:
Overall LGTM. Added one suggestion.
From the description of a follow-up PR that references #1809:

## 📌 Description

In #1809 we previously added a compute-capability-based support check for `mm_fp4`. However, we missed enabling SM121 for backend = `cudnn` and `cutlass`. Additionally, we marked `trtllm` as supported on SM120 when it is not. Current PR fixes it.

Example benchmark and pytest command on SM121 after the fix:

```
(py312) root@f414f262f02a:/flashinfer/benchmarks# python3 flashinfer_benchmark.py --routine mm_fp4 --m 8192 --n 7168 --k 512 --out_dtype bfloat16 --backends cudnn cutlass --use_128x4_sf_layout --use_nvfp4 --refcheck --use_cupti
/opt/conda/envs/py312/lib/python3.12/site-packages/torch/cuda/__init__.py:285: UserWarning: Found GPU0 NVIDIA GB10 which is of cuda capability 12.1. Minimum and Maximum cuda capability supported by this version of PyTorch is (8.0) - (12.0)
  warnings.warn(
[PERF] cudnn   :: median time 0.656 ms; std 0.025 ms; achieved tflops 91.701 TFLOPs/sec; achieved tb_per_sec 0.185 TB/sec
[PERF] cutlass :: median time 0.669 ms; std 0.022 ms; achieved tflops 89.859 TFLOPs/sec; achieved tb_per_sec 0.181 TB/sec

(py312) root@f414f262f02a:/flashinfer# pytest tests/gemm/test_mm_fp4.py
=============================== test session starts ===============================
platform linux -- Python 3.12.11, pytest-8.4.2, pluggy-1.6.0
rootdir: /flashinfer
configfile: pytest.ini
collected 3240 items
...
================================ warnings summary =================================
../opt/conda/envs/py312/lib/python3.12/site-packages/torch/cuda/__init__.py:285
  /opt/conda/envs/py312/lib/python3.12/site-packages/torch/cuda/__init__.py:285: UserWarning: Found GPU0 NVIDIA GB10 which is of cuda capability 12.1. Minimum and Maximum cuda capability supported by this version of PyTorch is (8.0) - (12.0)
    warnings.warn(
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================== 450 passed, 2790 skipped, 1 warning in 8.24s ===================
```

## Summary by CodeRabbit

* **New Features**
  * Expanded hardware compatibility by adding support for newer NVIDIA GPU architectures.
  * FP4 quantized operations now available across multiple backends on supported devices.
## 📌 Description

This PR adds `is_*supported` checks for backend and compute capability, through decorators.
Example:
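A minimal, hypothetical usage sketch (not the PR's actual example), assuming the decorator sketches from the review discussion above are in scope; the capability list and check bodies reuse values from that discussion:

```python
# Hypothetical usage sketch: a per-backend requirement check attached to
# mm_fp4 via the decorators discussed above.

@supported_compute_capability(["100", "101", "102"])
def cudnn_gemm_fp4_requirement(a, b, block_size=32, use_nvfp4=True, **kwargs):
    # Raise if the requested quantization configuration isn't supported.
    if not use_nvfp4 and block_size != 32:
        raise ValueError("mxfp4 supports block_size = 32.")

@backend_requirement({
    "cudnn": cudnn_gemm_fp4_requirement,
    # "trtllm": trtllm_gemm_fp4_requirement,  # hypothetical second backend check
})
def mm_fp4(a, b, *, block_size=32, use_nvfp4=True, backend="cudnn"):
    ...  # dispatch to the selected backend's kernel

# A caller could then query support before dispatching via an
# is_*_supported-style helper (exact name not confirmed here).
```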

## 🔍 Related Issues

## 🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
### ✅ Pre-commit Checks

- I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- I have installed the hooks with `pre-commit install`.
- I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

## 🧪 Tests

- Tests have been added or updated as needed.
- All tests are passing (`unittest`, etc.).

## Reviewer Notes