perf: optimize per-token nvfp4 quantization kernel. #3237

aleozlx merged 57 commits into flashinfer-ai:main
Conversation
Signed-off-by: Siyuan Fu <siyuanf@nvidia.com>
📝 Walkthrough

This PR introduces runtime control over FP4 quantization fast-math behavior via a new environment variable.

Changes: FP4 Quantization Fast-Math Runtime Control

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Code Review
This pull request introduces a mechanism to disable fast FP4 quantization math via an environment variable and implements shared memory caching for input vectors in NVFP4 kernels to improve performance. The review feedback highlights several critical concerns regarding potential integer overflows in memory offset calculations where uint32_t was used instead of int64_t, which could lead to out-of-bounds memory access on large tensors. Additionally, the reviewer suggests caching environment variable lookups for better efficiency and resolving inconsistencies between code comments and the implementation of shared memory caching.
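The overflow class the review flags can be illustrated by emulating `uint32_t` arithmetic in Python; the row and column counts below are hypothetical, chosen only so that the element count exceeds 2**32.

```python
def offset_u32(row: int, cols: int) -> int:
    # Emulate uint32_t: the product wraps modulo 2**32.
    return (row * cols) & 0xFFFFFFFF

def offset_i64(row: int, cols: int) -> int:
    # int64_t-style arithmetic holds the full product.
    return row * cols

row, cols = 1_000_000, 8192        # 8.192e9 elements > 2**32 (~4.29e9)
full = offset_i64(row, cols)       # 8_192_000_000: correct base offset
wrapped = offset_u32(row, cols)    # 3_897_032_704: silently wrapped
assert wrapped != full             # a uint32_t offset would index out of bounds
```

A wrapped offset like this does not fail loudly; the kernel simply reads or writes the wrong address, which is why the review treats the `uint32_t` usages as a correctness risk on large tensors rather than a style issue.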
/bot run
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
csrc/nv_internal/tensorrt_llm/kernels/quantization_utils.cuh (1)
392-415: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win: Complete the template parameter rename in the sibling function.

cvt_warp_fp16_to_fp4 (line 292) was renamed to use DISABLE_FP4_QUANT_FAST_MATH, but its sibling cvt_warp_fp16_to_fp4_with_vec_max (line 393) still uses the old TE_EXACT_NVFP4 template parameter and branch check (line 415). Both functions implement the same compile-time switch. Rename both occurrences in cvt_warp_fp16_to_fp4_with_vec_max to maintain consistency and avoid confusion for callers using named template arguments.
♻️ Proposed rename

```diff
 template <class Type, int SF_VEC_SIZE, int CVT_ELTS_PER_THREAD, bool UE8M0_SF,
-          bool TE_EXACT_NVFP4 = false>
+          bool DISABLE_FP4_QUANT_FAST_MATH = false>
 __device__ std::conditional_t<CVT_ELTS_PER_THREAD == 16, uint64_t, uint32_t>
 cvt_warp_fp16_to_fp4_with_vec_max(PackedVec<Type, CVT_ELTS_PER_THREAD>& vec, float SFScaleVal,
                                   float reciprocalSFScaleVal, float vecMax, uint8_t* SFout) {
 @@
-  } else if constexpr (TE_EXACT_NVFP4) {
+  } else if constexpr (DISABLE_FP4_QUANT_FAST_MATH) {
```

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@csrc/nv_internal/tensorrt_llm/kernels/quantization_utils.cuh` around lines 392 - 415, The sibling function cvt_warp_fp16_to_fp4_with_vec_max still declares and checks the old template parameter TE_EXACT_NVFP4; update its template parameter list and the corresponding constexpr branch (the occurrence after the UE8M0_SF branch) to use the new name DISABLE_FP4_QUANT_FAST_MATH so it matches cvt_warp_fp16_to_fp4 and the compile-time switch is consistent for callers using named template arguments.
🧹 Nitpick comments (1)
csrc/nv_internal/cpp/kernels/quantization.cu (1)
239-250: 💤 Low value | Typo in macro name: NVP4 should be NVFP4.

The macro is named DISPATCH_NVP4_QUANT_AND_PER_TOKEN_SCALE_KERNEL (missing the F) and is referenced by the same misspelling at lines 275, 277, 282, 284, 289, 291. Worth fixing while the surface is fresh, since the macro is local to this file.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@csrc/nv_internal/cpp/kernels/quantization.cu` around lines 239 - 250, The macro name is misspelled: rename the macro DISPATCH_NVP4_QUANT_AND_PER_TOKEN_SCALE_KERNEL to DISPATCH_NVFP4_QUANT_AND_PER_TOKEN_SCALE_KERNEL (note the added "F") and update all local references that use the misspelled identifier (e.g., the call sites at the later references in this file) so they match the new name; ensure you change both the macro definition and every usage of DISPATCH_NVP4_QUANT_AND_PER_TOKEN_SCALE_KERNEL to DISPATCH_NVFP4_QUANT_AND_PER_TOKEN_SCALE_KERNEL so the dispatcher resolves correctly.
📒 Files selected for processing (7):

- csrc/nv_internal/cpp/common/envUtils.cpp
- csrc/nv_internal/cpp/kernels/quantization.cu
- csrc/nv_internal/tensorrt_llm/common/envUtils.h
- csrc/nv_internal/tensorrt_llm/kernels/quantization.cuh
- csrc/nv_internal/tensorrt_llm/kernels/quantization.h
- csrc/nv_internal/tensorrt_llm/kernels/quantization_utils.cuh
- tests/utils/test_fp4_quantize.py
```python
@pytest.fixture
def set_te_reference_test_env():
    """Fixture to set and reset TRTLLM_DISABLE_FP4_QUANT_FAST_MATH environment variable."""
    original_value = os.environ.get("TRTLLM_DISABLE_FP4_QUANT_FAST_MATH", None)

    def _set_algo(algo: str):
        if algo == "auto":
            os.environ.pop("TRTLLM_DISABLE_FP4_QUANT_FAST_MATH", None)
        else:
            os.environ["TRTLLM_DISABLE_FP4_QUANT_FAST_MATH"] = algo

    yield _set_algo

    # Restore original value
    if original_value is None:
        os.environ.pop("TRTLLM_DISABLE_FP4_QUANT_FAST_MATH", None)
    else:
        os.environ["TRTLLM_DISABLE_FP4_QUANT_FAST_MATH"] = original_value
```
🧩 Analysis chain
🏁 Script executed:

```bash
#!/bin/bash
# Confirm the env getter is cached and locate all call sites of the per-token dispatch.
rg -nP -C2 'getEnvDisableFP4QuantFastMath'
rg -nP -C2 'invokeNvfp4QuantAndPerTokenScale\b'
rg -nP -C2 'per_token_activation\s*=\s*True' tests/
```

Repository: flashinfer-ai/flashinfer
Length of output: 6776

🏁 Script executed:

```bash
sed -n '553,595p' tests/utils/test_fp4_quantize.py | cat -n
```

Repository: flashinfer-ai/flashinfer
Length of output: 2066
C++ static cache defeats fixture's "restore" semantics, causing test order fragility.
getEnvDisableFP4QuantFastMath() in csrc/nv_internal/cpp/common/envUtils.cpp initializes a static bool const on first call, so the C++ side captures whatever the env var is at the moment of the first per-token NVFP4 dispatch and never re-reads it. Consequences:
- The "restore original value" branch (lines 567–571) only restores
os.environ; the C++ runtime keeps the cached value for the rest of the process. - If any earlier test in the run dispatches
invokeNvfp4QuantAndPerTokenScalebefore this fixture sets the env var, the cache is locked and the fixture's env var manipulation becomes a no-op for the kernel. - Conversely, once this test runs first and locks the cache to a value, all later tests using per-token activation (e.g., at
test_fp4_quantize.py:636,688ortest_trtllm_gen_per_token_moe.py:113) will permanently use that cached path.
This works only due to pytest's declaration ordering; it breaks with test sharding, -k filtering, or pytest-randomly. Either document this constraint on the fixture or refactor the kernel to read the env var dynamically when needed for tests.
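The cache-lock behavior can be sketched in Python. The function names here are hypothetical stand-ins for the real C++ getter, which caches the lookup in a function-local static bool:

```python
import os

_cache: dict = {}  # stands in for the C++ function-local `static bool const`

def get_env_disable_fast_math_cached() -> bool:
    # First call locks in whatever the env var holds at that moment.
    if "v" not in _cache:
        _cache["v"] = os.environ.get("TRTLLM_DISABLE_FP4_QUANT_FAST_MATH", "0") == "1"
    return _cache["v"]

def get_env_disable_fast_math_dynamic() -> bool:
    # Re-reads the environment on every call, so test fixtures can toggle it.
    return os.environ.get("TRTLLM_DISABLE_FP4_QUANT_FAST_MATH", "0") == "1"

os.environ.pop("TRTLLM_DISABLE_FP4_QUANT_FAST_MATH", None)
get_env_disable_fast_math_cached()                  # first call caches False
os.environ["TRTLLM_DISABLE_FP4_QUANT_FAST_MATH"] = "1"
assert get_env_disable_fast_math_cached() is False  # stale: the cache wins
assert get_env_disable_fast_math_dynamic() is True  # sees the update
```

The cached variant is what makes the fixture's env-var manipulation a no-op after the first dispatch; the dynamic variant is the behavior the review suggests exposing (or approximating via a reset hook) for tests.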
Minor: Line 36 is a dangling string literal, not a docstring. The first executable statement in the function is at line 35 (set_te_reference_test_env("1")), so the triple-quoted string on line 36 is discarded at runtime. Move it to line 35 or restructure as a proper docstring.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/utils/test_fp4_quantize.py` around lines 553 - 571, The C++ function
getEnvDisableFP4QuantFastMath() caches the env var in a static bool so changing
os.environ in the Python fixture set_te_reference_test_env() doesn't affect the
kernel after the first use (invokeNvfp4QuantAndPerTokenScale), making tests
order-dependent; fix by either (preferred) adding a C++ API to reset/re-read
that cached value (e.g., expose a resetEnvDisableFP4QuantFastMath() or make
getEnvDisableFP4QuantFastMath() read the env dynamically) and call that reset
from the fixture (set_te_reference_test_env) after changing the env, or
alternatively document the ordering constraint in the fixture docstring; also
move the dangling triple-quoted string into the function start so it becomes a
real docstring for set_te_reference_test_env.
```python
    set_te_reference_test_env("1")
    """Per-token NVFP4 quantization should match the TE Python reference bitwise."""
    if not _is_fp4_supported(torch.device(device)):
        pytest.skip("Nvfp4 Requires compute capability >= 10 and CUDA >= 12.8")
    if os.getenv("TRTLLM_DISABLE_FP4_QUANT_FAST_MATH", "0") == "0":
        pytest.skip(
            "Environment variable TRTLLM_DISABLE_FP4_QUANT_FAST_MATH is not set or false, "
            "skipping test_nvfp4_per_token_quantize_te_reference."
        )
```
Misplaced docstring and unreachable skip block.
Two issues in this body:
- Docstring is now a no-op: set_te_reference_test_env("1") is the first statement in the function, so the triple-quoted string on line 588 is no longer interpreted as a docstring; it is a discarded string expression and won't show up in --collect-only / --co -q listings or pydoc.
- Dead skip: with the fixture forcing the env var to "1" on line 587, the check at lines 591-595 can never evaluate to True, so the skip is unreachable. (This appears to address the prior request to override the env var inside the test, but the leftover skip should be removed.)
🔧 Suggested cleanup

```diff
 def test_nvfp4_per_token_quantize_te_reference(
     dtype: torch.dtype,
     shape: tuple[int, int],
     is_sf_swizzled_layout: bool,
     init_data: str,
     device: str,
     set_te_reference_test_env,
 ) -> None:
-    set_te_reference_test_env("1")
     """Per-token NVFP4 quantization should match the TE Python reference bitwise."""
+    set_te_reference_test_env("1")
     if not _is_fp4_supported(torch.device(device)):
         pytest.skip("Nvfp4 Requires compute capability >= 10 and CUDA >= 12.8")
-    if os.getenv("TRTLLM_DISABLE_FP4_QUANT_FAST_MATH", "0") == "0":
-        pytest.skip(
-            "Environment variable TRTLLM_DISABLE_FP4_QUANT_FAST_MATH is not set or false, "
-            "skipping test_nvfp4_per_token_quantize_te_reference."
-        )
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
    """Per-token NVFP4 quantization should match the TE Python reference bitwise."""
    set_te_reference_test_env("1")
    if not _is_fp4_supported(torch.device(device)):
        pytest.skip("Nvfp4 Requires compute capability >= 10 and CUDA >= 12.8")
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/utils/test_fp4_quantize.py` around lines 587 - 595, The triple-quoted
string meant as the test docstring is placed after
set_te_reference_test_env("1") so it becomes a discarded string expression; move
that triple-quoted string to be the very first statement in the test function
(so it is the actual docstring for test_nvfp4_per_token_quantize_te_reference)
and remove the unreachable skip block that checks
os.getenv("TRTLLM_DISABLE_FP4_QUANT_FAST_MATH", "0") == "0" (since
set_te_reference_test_env forces the env var to "1"), leaving only the valid
device/FP4 support skip using _is_fp4_supported(torch.device(device)).
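The docstring rule driving this comment can be shown with a minimal, hypothetical example (the function names here are invented for illustration): only a string literal that is the first statement of a function body becomes its `__doc__`; a string after any other statement is evaluated and discarded.

```python
def with_docstring():
    """This is a real docstring."""

def without_docstring():
    _ = 1  # first statement is code...
    """...so this string literal is evaluated and thrown away."""

assert with_docstring.__doc__ == "This is a real docstring."
assert without_docstring.__doc__ is None
```

This is why swapping the order of the fixture call and the triple-quoted string restores the docstring in test listings and pydoc output.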
📌 Description
Optimize the performance of the per-token nvfp4 quantization kernel introduced by #3027.
- TE_EXACT_FP4 renamed to TRTLLM_DISABLE_FP4_QUANT_FAST_MATH, controlled by an environment variable.
- get_sf_out_offset_128x4 and get_sf_out_offset_8x4.

TODOs:
🔍 Related Issues
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- Installed pre-commit by running pip install pre-commit (or used your preferred method).
- Installed the hooks with pre-commit install.
- Ran pre-commit run --all-files and fixed any reported issues.

🧪 Tests

- Tests added or updated as needed (unittest, etc.).

Reviewer Notes
Summary by CodeRabbit
New Features
Tests