
perf: optimize per-token nvfp4 quantization kernel.#3237

Merged
aleozlx merged 57 commits into flashinfer-ai:main from IwakuraRein:per-token-opt
May 8, 2026

Conversation

@IwakuraRein (Collaborator) commented May 5, 2026

📌 Description

Optimize the performance of the per-token nvfp4 quantization kernel introduced by #3027.

  1. Default the block size to 128.
  2. Default to the fast-math path; rename TE_EXACT_FP4 to TRTLLM_DISABLE_FP4_QUANT_FAST_MATH and control it via an environment variable.
  3. Change the argument lists of get_sf_out_offset_128x4 and get_sf_out_offset_8x4.

TODOs:

  1. Optimize low-latency cases.

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • New Features

    • Added environment variable configuration to disable fast-math optimization in FP4 quantization, enabling behavior alignment with alternative implementations.
  • Tests

    • Added test fixture to validate FP4 quantization functionality with fast-math mode disabled.

@coderabbitai (bot, Contributor) commented May 5, 2026


📝 Walkthrough

This PR introduces runtime control over FP4 quantization fast-math behavior via a new environment variable TRTLLM_DISABLE_FP4_QUANT_FAST_MATH. It refactors kernel signatures to support this flag, updates quantization utility functions to remove std::optional overhead, and integrates the control into the quantization dispatch path and tests.
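
To illustrate the std::optional removal mentioned above, here is a minimal sketch of the overload pattern, under assumptions: the simplified name get_sf_out_offset and its parameters are hypothetical, not the actual signatures of get_sf_out_offset_128x4 / get_sf_out_offset_8x4.

// Hypothetical, simplified illustration (not the actual FlashInfer signatures):
// the batched/unbatched distinction moves from a runtime std::optional check
// into two plain-int overloads resolved at compile time.
#include <cstdint>

// Unbatched case: no batch index.
__device__ inline int64_t get_sf_out_offset(int rowIdx, int colIdx, int numCols) {
  return static_cast<int64_t>(rowIdx) * numCols + colIdx;
}

// Batched case: adds a batch index and per-batch row count on top of the
// unbatched computation.
__device__ inline int64_t get_sf_out_offset(int batchIdx, int rowIdx, int colIdx,
                                            int numCols, int rowsPerBatch) {
  return static_cast<int64_t>(batchIdx) * rowsPerBatch * numCols +
         get_sf_out_offset(rowIdx, colIdx, numCols);
}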

Changes

FP4 Quantization Fast-Math Runtime Control

  • Environment Variable Declaration & Implementation (csrc/nv_internal/tensorrt_llm/common/envUtils.h, csrc/nv_internal/cpp/common/envUtils.cpp):
    New getEnvDisableFP4QuantFastMath() function reads and caches TRTLLM_DISABLE_FP4_QUANT_FAST_MATH (see the sketch after this list); existing getEnvEplbForceGdrcopy() refactored to use the same static caching pattern.
  • Quantization Utility Functions (csrc/nv_internal/tensorrt_llm/kernels/quantization_utils.cuh):
    cvt_warp_fp16_to_fp4 adds a DISABLE_FP4_QUANT_FAST_MATH template parameter; SF offset helpers get_sf_out_offset_128x4 and get_sf_out_offset_8x4 refactored from std::optional-based signatures to overloaded int-parameter versions for the batched/unbatched cases.
  • Kernel Template Signature Updates (csrc/nv_internal/tensorrt_llm/kernels/quantization.cuh):
    nvfp4QuantAndPerTokenScaleKernel template parameters changed from CACHE_LOCAL_AMAX/TE_EXACT_NVFP4 to CACHE_INPUT/DISABLE_FP4_QUANT_FAST_MATH; shared-memory input caching introduced; SF offset computation refactored to use the helper functions.
  • Kernel Dispatch & Integration (csrc/nv_internal/cpp/kernels/quantization.cu, csrc/nv_internal/tensorrt_llm/kernels/quantization.h):
    Dispatch macro updated to pass the DISABLE_FP4_QUANT_FAST_MATH template parameter; invokeNvfp4QuantAndPerTokenScale retrieves the environment flag and routes to the appropriate kernel instantiation; SF interleave kernels now call the offset helpers directly without std::optional wrappers.
  • Test Infrastructure (tests/utils/test_fp4_quantize.py):
    PyTest fixture set_te_reference_test_env manages the TRTLLM_DISABLE_FP4_QUANT_FAST_MATH environment variable; the TE reference test is updated to set the flag and conditionally skip based on the variable state.
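
A rough sketch of the static-caching env-getter pattern referenced in the first item above. The helper name readBoolEnv and the exact parsing rule are assumptions for illustration; the real implementation lives in envUtils.cpp and may differ.

#include <cstdlib>
#include <cstring>

// Illustrative only: parse an env var as a boolean flag (unset or "0" = false).
static bool readBoolEnv(char const* name) {
  char const* value = std::getenv(name);
  return value != nullptr && std::strcmp(value, "0") != 0;
}

bool getEnvDisableFP4QuantFastMath() {
  // The static initializer runs once per process, so later changes to the
  // environment variable are not observed (see the review thread on test
  // ordering further below).
  static bool const disable = readBoolEnv("TRTLLM_DISABLE_FP4_QUANT_FAST_MATH");
  return disable;
}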

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

run-ci, op: moe

Suggested reviewers

  • yzh119
  • nv-yunzheq
  • bkryu
  • cyx-6
  • samuellees
  • yongwww
  • kahyunnam
  • jiahanc

Poem

🐰 Whiskers twitching with glee,
Fast-math flags now dance free,
DISABLE or not? The choice is yours!
Kernels heed what env assures.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 7.69%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Title check ✅: The title accurately describes the main optimization objective of the pull request: improving performance of the per-token nvfp4 quantization kernel.
  • Description check ✅: The PR description addresses the template requirements: it explains what the PR does (optimizes the per-token nvfp4 kernel), includes pre-commit checks marked as completed, and indicates tests have been added or updated and are passing.
  • Linked Issues check ✅: Skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check ✅: Skipped because no linked issues were found for this pull request.



@gemini-code-assist (bot, Contributor) left a comment

Code Review

This pull request introduces a mechanism to disable fast FP4 quantization math via an environment variable and implements shared memory caching for input vectors in NVFP4 kernels to improve performance. The review feedback highlights several critical concerns regarding potential integer overflows in memory offset calculations where uint32_t was used instead of int64_t, which could lead to out-of-bounds memory access on large tensors. Additionally, the reviewer suggests caching environment variable lookups for better efficiency and resolving inconsistencies between code comments and the implementation of shared memory caching.
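
To make the overflow concern concrete, here is a hedged, standalone example of the bug class; the function names offset32/offset64 are hypothetical and this is not the kernel's actual offset code.

#include <cstdint>

// With 32-bit arithmetic the product wraps once it exceeds 2^32 - 1, so the
// resulting offset can point at the wrong element or out of bounds on large
// tensors.
uint32_t offset32(uint32_t rowIdx, uint32_t numCols) {
  return rowIdx * numCols;  // e.g. 100000 * 65536 overflows uint32_t
}

// Widening one operand before the multiply keeps the computation in 64 bits.
int64_t offset64(uint32_t rowIdx, uint32_t numCols) {
  return static_cast<int64_t>(rowIdx) * numCols;
}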

Comment thread csrc/nv_internal/tensorrt_llm/kernels/quantization.cuh
Comment thread csrc/nv_internal/tensorrt_llm/kernels/quantization.cuh Outdated
Comment thread csrc/nv_internal/tensorrt_llm/kernels/quantization.cuh Outdated
Comment thread csrc/nv_internal/cpp/common/envUtils.cpp Outdated
Comment thread csrc/nv_internal/cpp/kernels/quantization.cu Outdated
Comment thread csrc/nv_internal/tensorrt_llm/kernels/quantization_utils.cuh Outdated
Comment thread tests/utils/test_fp4_quantize.py
@IwakuraRein (Collaborator, Author) commented:

/bot run

@flashinfer-bot (Collaborator) commented:

GitLab MR !632 has been created, and the CI pipeline #50368900 is currently running. I'll report back once the pipeline job completes.

@coderabbitai (bot, Contributor) left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
csrc/nv_internal/tensorrt_llm/kernels/quantization_utils.cuh (1)

392-415: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Complete the template parameter rename in the sibling function.

cvt_warp_fp16_to_fp4 (line 292) was renamed to use DISABLE_FP4_QUANT_FAST_MATH, but its sibling cvt_warp_fp16_to_fp4_with_vec_max (line 393) still uses the old TE_EXACT_NVFP4 template parameter and branch check (line 415). Both functions implement the same compile-time switch. Rename both occurrences in cvt_warp_fp16_to_fp4_with_vec_max to maintain consistency and avoid confusion for callers using named template arguments.

♻️ Proposed rename
 template <class Type, int SF_VEC_SIZE, int CVT_ELTS_PER_THREAD, bool UE8M0_SF,
-          bool TE_EXACT_NVFP4 = false>
+          bool DISABLE_FP4_QUANT_FAST_MATH = false>
 __device__ std::conditional_t<CVT_ELTS_PER_THREAD == 16, uint64_t, uint32_t>
 cvt_warp_fp16_to_fp4_with_vec_max(PackedVec<Type, CVT_ELTS_PER_THREAD>& vec, float SFScaleVal,
                                   float reciprocalSFScaleVal, float vecMax, uint8_t* SFout) {
@@
-  } else if constexpr (TE_EXACT_NVFP4) {
+  } else if constexpr (DISABLE_FP4_QUANT_FAST_MATH) {
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@csrc/nv_internal/tensorrt_llm/kernels/quantization_utils.cuh` around lines
392 - 415, The sibling function cvt_warp_fp16_to_fp4_with_vec_max still declares
and checks the old template parameter TE_EXACT_NVFP4; update its template
parameter list and the corresponding constexpr branch (the occurrence after the
UE8M0_SF branch) to use the new name DISABLE_FP4_QUANT_FAST_MATH so it matches
cvt_warp_fp16_to_fp4 and the compile-time switch is consistent for callers using
named template arguments.
🧹 Nitpick comments (1)
csrc/nv_internal/cpp/kernels/quantization.cu (1)

239-250: 💤 Low value

Typo in macro name: NVP4 should be NVFP4.

The macro is named DISPATCH_NVP4_QUANT_AND_PER_TOKEN_SCALE_KERNEL (missing the F) and is referenced by the same misspelling at lines 275, 277, 282, 284, 289, 291. Worth fixing while the surface is fresh, since the macro is local to this file.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@csrc/nv_internal/cpp/kernels/quantization.cu` around lines 239 - 250, The
macro name is misspelled: rename the macro
DISPATCH_NVP4_QUANT_AND_PER_TOKEN_SCALE_KERNEL to
DISPATCH_NVFP4_QUANT_AND_PER_TOKEN_SCALE_KERNEL (note the added "F") and update
all local references that use the misspelled identifier (e.g., the call sites at
the later references in this file) so they match the new name; ensure you change
both the macro definition and every usage of
DISPATCH_NVP4_QUANT_AND_PER_TOKEN_SCALE_KERNEL to
DISPATCH_NVFP4_QUANT_AND_PER_TOKEN_SCALE_KERNEL so the dispatcher resolves
correctly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 675e3764-bc08-4651-a522-870fb4c63ad0

📥 Commits

Reviewing files that changed from the base of the PR and between ba30d4f and 62dadb3.

📒 Files selected for processing (7)
  • csrc/nv_internal/cpp/common/envUtils.cpp
  • csrc/nv_internal/cpp/kernels/quantization.cu
  • csrc/nv_internal/tensorrt_llm/common/envUtils.h
  • csrc/nv_internal/tensorrt_llm/kernels/quantization.cuh
  • csrc/nv_internal/tensorrt_llm/kernels/quantization.h
  • csrc/nv_internal/tensorrt_llm/kernels/quantization_utils.cuh
  • tests/utils/test_fp4_quantize.py

Comment on lines +553 to +571
@pytest.fixture
def set_te_reference_test_env():
    """Fixture to set and reset TRTLLM_DISABLE_FP4_QUANT_FAST_MATH environment variable."""
    original_value = os.environ.get("TRTLLM_DISABLE_FP4_QUANT_FAST_MATH", None)

    def _set_algo(algo: str):
        if algo == "auto":
            os.environ.pop("TRTLLM_DISABLE_FP4_QUANT_FAST_MATH", None)
        else:
            os.environ["TRTLLM_DISABLE_FP4_QUANT_FAST_MATH"] = algo

    yield _set_algo

    # Restore original value
    if original_value is None:
        os.environ.pop("TRTLLM_DISABLE_FP4_QUANT_FAST_MATH", None)
    else:
        os.environ["TRTLLM_DISABLE_FP4_QUANT_FAST_MATH"] = original_value


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Confirm the env getter is cached and locate all call sites of the per-token dispatch.
rg -nP -C2 'getEnvDisableFP4QuantFastMath'
rg -nP -C2 'invokeNvfp4QuantAndPerTokenScale\b'
rg -nP -C2 'per_token_activation\s*=\s*True' tests/

Repository: flashinfer-ai/flashinfer

Length of output: 6776


🏁 Script executed:

sed -n '553,595p' tests/utils/test_fp4_quantize.py | cat -n

Repository: flashinfer-ai/flashinfer

Length of output: 2066


C++ static cache defeats fixture's "restore" semantics, causing test order fragility.

getEnvDisableFP4QuantFastMath() in csrc/nv_internal/cpp/common/envUtils.cpp initializes a static bool const on first call, so the C++ side captures whatever the env var is at the moment of the first per-token NVFP4 dispatch and never re-reads it. Consequences:

  • The "restore original value" branch (lines 567–571) only restores os.environ; the C++ runtime keeps the cached value for the rest of the process.
  • If any earlier test in the run dispatches invokeNvfp4QuantAndPerTokenScale before this fixture sets the env var, the cache is locked and the fixture's env var manipulation becomes a no-op for the kernel.
  • Conversely, once this test runs first and locks the cache to a value, all later tests using per-token activation (e.g., at test_fp4_quantize.py:636,688 or test_trtllm_gen_per_token_moe.py:113) will permanently use that cached path.

This works only due to pytest's declaration ordering; it breaks with test sharding, -k filtering, or pytest-randomly. Either document this constraint on the fixture or refactor the kernel to read the env var dynamically when needed for tests.

Minor: Line 36 is a dangling string literal, not a docstring. The first executable statement in the function is at line 35 (set_te_reference_test_env("1")), so the triple-quoted string on line 36 is discarded at runtime. Move it to line 35 or restructure as a proper docstring.
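
One possible shape of the "read the env var dynamically / allow a reset" option suggested above, purely as a sketch under assumptions: the forceRefresh parameter and its behavior are hypothetical, not an existing FlashInfer API.

#include <cstdlib>
#include <cstring>

// Hypothetical variant: cache the flag but let callers (e.g. a test hook)
// force a re-read after mutating the environment.
bool getEnvDisableFP4QuantFastMath(bool forceRefresh = false) {
  static bool cached = false;
  static bool initialized = false;
  if (!initialized || forceRefresh) {
    char const* value = std::getenv("TRTLLM_DISABLE_FP4_QUANT_FAST_MATH");
    cached = (value != nullptr && std::strcmp(value, "0") != 0);
    initialized = true;
  }
  return cached;
}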

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/utils/test_fp4_quantize.py` around lines 553 - 571, The C++ function
getEnvDisableFP4QuantFastMath() caches the env var in a static bool so changing
os.environ in the Python fixture set_te_reference_test_env() doesn't affect the
kernel after the first use (invokeNvfp4QuantAndPerTokenScale), making tests
order-dependent; fix by either (preferred) adding a C++ API to reset/re-read
that cached value (e.g., expose a resetEnvDisableFP4QuantFastMath() or make
getEnvDisableFP4QuantFastMath() read the env dynamically) and call that reset
from the fixture (set_te_reference_test_env) after changing the env, or
alternatively document the ordering constraint in the fixture docstring; also
move the dangling triple-quoted string into the function start so it becomes a
real docstring for set_te_reference_test_env.

Comment on lines +587 to +595
    set_te_reference_test_env("1")
    """Per-token NVFP4 quantization should match the TE Python reference bitwise."""
    if not _is_fp4_supported(torch.device(device)):
        pytest.skip("Nvfp4 Requires compute capability >= 10 and CUDA >= 12.8")
    if os.getenv("TRTLLM_DISABLE_FP4_QUANT_FAST_MATH", "0") == "0":
        pytest.skip(
            "Environment variable TRTLLM_DISABLE_FP4_QUANT_FAST_MATH is not set or false, "
            "skipping test_nvfp4_per_token_quantize_te_reference."
        )

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Misplaced docstring and unreachable skip block.

Two issues in this body:

  1. Docstring is now a no-op: set_te_reference_test_env("1") is the first statement in the function, so the triple-quoted string on line 588 is no longer interpreted as a docstring — it is a discarded string expression and won't show up in --collect-only/--co -q listings or pydoc.
  2. Dead skip: with the fixture forcing the env var to "1" on line 587, the check at lines 591-595 can never evaluate to True, so the skip is unreachable. (This appears to address the prior request to override the env var inside the test, but the leftover skip should be removed.)
🔧 Suggested cleanup
 def test_nvfp4_per_token_quantize_te_reference(
     dtype: torch.dtype,
     shape: tuple[int, int],
     is_sf_swizzled_layout: bool,
     init_data: str,
     device: str,
     set_te_reference_test_env,
 ) -> None:
-    set_te_reference_test_env("1")
     """Per-token NVFP4 quantization should match the TE Python reference bitwise."""
+    set_te_reference_test_env("1")
     if not _is_fp4_supported(torch.device(device)):
         pytest.skip("Nvfp4 Requires compute capability >= 10 and CUDA >= 12.8")
-    if os.getenv("TRTLLM_DISABLE_FP4_QUANT_FAST_MATH", "0") == "0":
-        pytest.skip(
-            "Environment variable TRTLLM_DISABLE_FP4_QUANT_FAST_MATH is not set or false, "
-            "skipping test_nvfp4_per_token_quantize_te_reference."
-        )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-    set_te_reference_test_env("1")
-    """Per-token NVFP4 quantization should match the TE Python reference bitwise."""
-    if not _is_fp4_supported(torch.device(device)):
-        pytest.skip("Nvfp4 Requires compute capability >= 10 and CUDA >= 12.8")
-    if os.getenv("TRTLLM_DISABLE_FP4_QUANT_FAST_MATH", "0") == "0":
-        pytest.skip(
-            "Environment variable TRTLLM_DISABLE_FP4_QUANT_FAST_MATH is not set or false, "
-            "skipping test_nvfp4_per_token_quantize_te_reference."
-        )
+    """Per-token NVFP4 quantization should match the TE Python reference bitwise."""
+    set_te_reference_test_env("1")
+    if not _is_fp4_supported(torch.device(device)):
+        pytest.skip("Nvfp4 Requires compute capability >= 10 and CUDA >= 12.8")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/utils/test_fp4_quantize.py` around lines 587 - 595, The triple-quoted
string meant as the test docstring is placed after
set_te_reference_test_env("1") so it becomes a discarded string expression; move
that triple-quoted string to be the very first statement in the test function
(so it is the actual docstring for test_nvfp4_per_token_quantize_te_reference)
and remove the unreachable skip block that checks
os.getenv("TRTLLM_DISABLE_FP4_QUANT_FAST_MATH", "0") == "0" (since
set_te_reference_test_env forces the env var to "1"), leaving only the valid
device/FP4 support skip using _is_fp4_supported(torch.device(device)).

@aleozlx aleozlx merged commit 202af00 into flashinfer-ai:main May 8, 2026
46 of 47 checks passed