Support 4over6 nvfp4 for quantizer and fused MoE #3264
zianglih wants to merge 15 commits into flashinfer-ai:main from
Conversation
📝 Walkthrough
This PR adds a runtime-configurable NVFP4 4-over-6 per-token quantization mode with MSE-based scale-candidate selection, and extends the kernel templates to support it.
Changes: NVFP4 4-over-6 per-token quantization
🚥 Pre-merge checks: 4 passed, 1 failed (1 warning).
Code Review
This pull request introduces a new NVFP4 quantization mode called '4/6 MSE scale-candidate mode,' which is activated via the FLASHINFER_NVFP4_FOUR_OVER_SIX environment variable. The implementation includes updates to CUDA kernels for per-token scaling and quantization, as well as corresponding Python tests and documentation. Reviewer feedback suggests several optimizations for the CUDA code, including refactoring duplicated logic into helper functions, precalculating values to reduce redundant arithmetic operations within loops, and replacing switch statements with lookup tables to improve performance and readability.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/utils/test_fp4_quantize.py (1)
706-747: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win — Pin FOUR_OVER_SIX off for the baseline TE-reference test.
Line 706 validates the non-4/6 reference path, but this test can be affected by an externally set `FLASHINFER_NVFP4_FOUR_OVER_SIX`. Make the mode explicit in-test to avoid environment-coupled failures.
🔧 Proposed fix
```diff
 def test_nvfp4_per_token_quantize_te_reference(
     dtype: torch.dtype,
     shape: tuple[int, int],
     sf_layout: SfLayout,
     init_data: str,
     device: str,
+    monkeypatch: pytest.MonkeyPatch,
 ) -> None:
     """Per-token NVFP4 quantization should match the TE Python reference bitwise."""
     if not _is_fp4_supported(torch.device(device)):
         pytest.skip("Nvfp4 Requires compute capability >= 10 and CUDA >= 12.8")
+    monkeypatch.setenv("FLASHINFER_NVFP4_FOUR_OVER_SIX", "0")
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/utils/test_fp4_quantize.py` around lines 706 - 747, In test_nvfp4_per_token_quantize_te_reference ensure the FOUR_OVER_SIX mode is pinned off so the TE-reference path is deterministic: at the start of test_nvfp4_per_token_quantize_te_reference set the environment flag FLASHINFER_NVFP4_FOUR_OVER_SIX="0" (or call your library’s setter if available) before creating x and running ref_fp4_quant_te/nvfp4_quantize, and restore the previous value at the end of the test to avoid leaking global state.
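Outside pytest, the same idea — pin the flag for one scope and restore the prior state afterwards — can be expressed as a small context manager. This is a sketch, not code from the PR; only the `FLASHINFER_NVFP4_FOUR_OVER_SIX` flag name comes from this review, the `pinned_env` helper is hypothetical:

```python
import os
from contextlib import contextmanager


@contextmanager
def pinned_env(name: str, value: str):
    """Temporarily pin an environment variable, restoring the prior state on exit."""
    prior = os.environ.get(name)
    os.environ[name] = value
    try:
        yield
    finally:
        if prior is None:
            os.environ.pop(name, None)
        else:
            os.environ[name] = prior


# Usage: keep the reference path deterministic regardless of outer env state.
with pinned_env("FLASHINFER_NVFP4_FOUR_OVER_SIX", "0"):
    pass  # run the quantize + reference comparison here
```

The `monkeypatch.setenv` fixture in the proposed fix does the save/restore automatically, which is why it is preferable inside pytest.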
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@tests/utils/test_fp4_quantize.py`:
- Around line 706-747: In test_nvfp4_per_token_quantize_te_reference ensure the
FOUR_OVER_SIX mode is pinned off so the TE-reference path is deterministic: at
the start of test_nvfp4_per_token_quantize_te_reference set the environment flag
FLASHINFER_NVFP4_FOUR_OVER_SIX="0" (or call your library’s setter if available)
before creating x and running ref_fp4_quant_te/nvfp4_quantize, and restore the
previous value at the end of the test to avoid leaking global state.
📒 Files selected for processing (5)
- csrc/nv_internal/cpp/kernels/quantization.cu
- csrc/nv_internal/tensorrt_llm/kernels/quantization.cuh
- csrc/nv_internal/tensorrt_llm/kernels/quantization_utils.cuh
- flashinfer/quantization/fp4_quantization.py
- tests/utils/test_fp4_quantize.py
aleozlx left a comment:
looks good to me so far!
thx for the contrib. pls address conflicts
Actionable comments posted: 1
📒 Files selected for processing (7)
- csrc/nv_internal/cpp/common/envUtils.cpp
- csrc/nv_internal/cpp/kernels/quantization.cu
- csrc/nv_internal/tensorrt_llm/common/envUtils.h
- csrc/nv_internal/tensorrt_llm/kernels/quantization.cuh
- csrc/nv_internal/tensorrt_llm/kernels/quantization_utils.cuh
- flashinfer/quantization/fp4_quantization.py
- tests/utils/test_fp4_quantize.py
✅ Files skipped from review due to trivial changes (2)
- csrc/nv_internal/tensorrt_llm/common/envUtils.h
- flashinfer/quantization/fp4_quantization.py
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/moe/test_trtllm_gen_per_token_moe.py (1)
114-134: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win — This changes the scales, but not the backend mode.
The new `use_4over6` branch only rewrites the Python-side NVFP4 scale factors. The test never enables 4over6 via `set_nvfp4_4over6_env` before calling `nvfp4_quantize()` and `trtllm_fp4_block_scale_routed_moe()`, so the `True` cases are not validating the actual 4over6 implementation. Apply the shared env helper around the quantize + kernel section so both sides run in the same mode.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/moe/test_trtllm_gen_per_token_moe.py` around lines 114 - 134, The test only updates Python-side scales via nvfp4_global_decode_scale_te but never flips the backend mode, so wrap the quantize+kernel calls with the shared helper set_nvfp4_4over6_env(use_4over6) so the backend is actually in 4over6 mode when calling nvfp4_quantize and trtllm_fp4_block_scale_routed_moe; specifically, call set_nvfp4_4over6_env(use_4over6) around the block that computes hidden_states/hidden_states_scale/per_token_scale_inv with nvfp4_quantize and the subsequent trtllm_fp4_block_scale_routed_moe invocation so both scale computation and kernel execution use the same mode (references: nvfp4_global_decode_scale_te, nvfp4_quantize, set_nvfp4_4over6_env, trtllm_fp4_block_scale_routed_moe).
♻️ Duplicate comments (1)
csrc/nv_internal/cpp/kernels/quantization.cu (1)
338-362: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win — FP32 input still aborts when `FLASHINFER_NVFP4_4OVER6=1`.
`use4Over6` is read unconditionally from the process-global env var, and the `if constexpr (std::is_same_v<T, float>)` branch then aborts via `TLLM_CHECK_WITH_INFO(!USE_4OVER6, ...)`. Any caller that quantizes a `float` input in a process where the env var is set (e.g. an MoE test running after a 4-over-6 test set the env in the same process) will fail, even though the legacy FP32 kernel is unchanged and capable of handling the request. Force `use4Over6 = false` for `T = float` at the env-read site instead of aborting downstream.
💡 Suggested fix
```diff
-  bool const disableFP4QuantFastMath = tensorrt_llm::common::getEnvDisableFP4QuantFastMath();
-  bool const use4Over6 = tensorrt_llm::common::getEnvNVFP4Use4Over6();
-  bool const disable4Over6MSEFastMath = tensorrt_llm::common::getEnvNVFP4Disable4Over6MSEFastMath();
+  bool const disableFP4QuantFastMath = tensorrt_llm::common::getEnvDisableFP4QuantFastMath();
+  bool const use4Over6 =
+      !std::is_same_v<T, float> && tensorrt_llm::common::getEnvNVFP4Use4Over6();
+  bool const disable4Over6MSEFastMath =
+      use4Over6 && tensorrt_llm::common::getEnvNVFP4Disable4Over6MSEFastMath();
```
With that, the `TLLM_CHECK_WITH_INFO(!USE_4OVER6, ...)` inside the `T = float` branch becomes unreachable and can be dropped (or kept defensively).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@csrc/nv_internal/cpp/kernels/quantization.cu` around lines 338 - 362, The code reads the process-global use4Over6 unconditionally which causes FP32 instantiations to abort; fix by making the env-read T-aware: move or re-evaluate tensorrt_llm::common::getEnvNVFP4Use4Over6() into the template/lambda scope where T is visible (the launchKernel capture/instantiation) and force it false for T=float (e.g. compute auto const use4Over6 = tensorrt_llm::common::getEnvNVFP4Use4Over6() && !std::is_same_v<T,float> and pass that as the use4Over6Tag/std::bool_constant), then remove or leave the now-unreachable TLLM_CHECK_WITH_INFO(!USE_4OVER6, ...) in the float branch.
🧹 Nitpick comments (1)
tests/test_helpers/utils_fp4.py (1)
295-302: ⚡ Quick win — Vectorize the per-element MSE accumulation.
The explicit Python loop over `block_size=16` is unnecessary work and obscures the intent. A vectorized form is shorter, faster, and (because the reduction order across the last dim is implementation-defined either way) preserves the strict `<` tiebreak on `pick_four`.
♻️ Proposed refactor
```diff
-    err4 = torch.zeros((m, n // block_size), dtype=torch.float32, device=x.device)
-    err6 = torch.zeros((m, n // block_size), dtype=torch.float32, device=x.device)
-    for i in range(block_size):
-        diff4 = dq4[:, :, i] - x_blocks[:, :, i]
-        diff6 = dq6[:, :, i] - x_blocks[:, :, i]
-        err4 += diff4 * diff4
-        err6 += diff6 * diff6
-    pick_four = err4 < err6
+    err4 = ((dq4 - x_blocks) ** 2).sum(dim=-1)
+    err6 = ((dq6 - x_blocks) ** 2).sum(dim=-1)
+    pick_four = err4 < err6
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/test_helpers/utils_fp4.py` around lines 295 - 302, The loop computes per-block MSE by accumulating squared differences across the last dim; replace the explicit for-loop with a vectorized reduction: compute diff4 = dq4 - x_blocks and diff6 = dq6 - x_blocks, square them and sum over the last axis to produce err4 and err6, then set pick_four = err4 < err6 (preserving the strict < tiebreak). Update variables err4, err6, diff4, diff6 and use the existing dq4, dq6, x_blocks, and pick_four names so the change is localized to that block.
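As a standalone illustration of that vectorized MSE candidate pick, here is a NumPy sketch with made-up dequantized candidates (`dq4`, `dq6`, `x_blocks` are synthetic stand-ins, not the test's actual tensors):

```python
import numpy as np

rng = np.random.default_rng(0)
m, blocks, block_size = 4, 8, 16

# x_blocks: original values; dq4 / dq6: two dequantized candidates per block.
x_blocks = rng.standard_normal((m, blocks, block_size)).astype(np.float32)
dq4 = x_blocks + 0.01 * rng.standard_normal(x_blocks.shape).astype(np.float32)
dq6 = x_blocks + 0.02 * rng.standard_normal(x_blocks.shape).astype(np.float32)

# Per-block squared-error totals, reduced over the last axis in one shot
# instead of a Python loop over the 16 block elements.
err4 = ((dq4 - x_blocks) ** 2).sum(axis=-1)
err6 = ((dq6 - x_blocks) ** 2).sum(axis=-1)

# Strict < keeps the 6-candidate on ties, matching the loop it replaces.
pick_four = err4 < err6
```

The reduction collapses the `block_size` axis, so `err4`/`err6`/`pick_four` all have one entry per (token, block) pair.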
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/moe/test_trtllm_gen_fused_moe.py`:
- Around line 2680-2687: run_moe_test currently only uses use_4over6 for
skipping but never actually sets the process env, so 4over6 paths may not be
exercised; wrap the quantize/reference/production section inside the
set_nvfp4_4over6_env context by calling set_nvfp4_4over6_env(use_4over6) (and
ensure the helper is imported) before entering the FP4
quantize/reference/production logic in run_moe_test and restore/unset it after
that block so the FLASHINFER_NVFP4_4OVER6 env state is consistently applied only
for those test cases.
In `@tests/moe/test_trtllm_gen_moe_autotune_tactics.py`:
- Around line 160-169: The test never actually enables the 4over6 NVFP4 runtime
flag because set_nvfp4_4over6_env is never applied; update the test harness so
that when _quant_mode_config is called with use_4over6=True the runtime
environment is toggled for the kernel run: call set_nvfp4_4over6_env(True)
before invoking _run_kernel_with_tactic (and set_nvfp4_4over6_env(False) or
restore the previous state after) so the launched kernel uses the 4over6 path;
adjust every place that constructs the use_4over6=True matrix (including the
other occurrences you noted) to wrap the kernel invocation with the env setter
rather than only changing scales.
In `@tests/moe/test_trtllm_gen_routed_fused_moe.py`:
- Around line 82-83: The test toggles use_4over6 but never actually flips the
NVFP4 4over6 environment, so fp4_quantize() and the routed/non-routed MoE kernel
calls still use the global env; fix by wrapping the sections that perform FP4
quantization and invoke the MoE kernels (references: fp4_quantize, the routed
MoE kernel call(s) and the non-routed MoE kernel call(s)) in the
set_nvfp4_4over6_env context when use_4over6 is True (e.g., with
set_nvfp4_4over6_env(): ...) so the env is applied for those operations and is
restored afterward; apply this same wrapping to the other similar test blocks
currently duplicated later in the file.
In `@tests/moe/utils.py`:
- Around line 40-65: The fixture set_nvfp4_4over6_env currently force-sets
TRTLLM_DISABLE_FP4_QUANT_FAST_MATH and
FLASHINFER_NVFP4_4OVER6_DISABLE_MSE_FAST_MATH unconditionally; change it so
those two env vars are only set when request.getfixturevalue("use_4over6") is
truthy (i.e., set them inside the branch where use_4over6 is True and leave them
untouched when False), while still recording original_values and restoring them
after yield; keep FLASHINFER_NVFP4_4OVER6 set to "1"/"0" based on use_4over6 as
before.
---
Outside diff comments:
In `@tests/moe/test_trtllm_gen_per_token_moe.py`:
- Around line 114-134: The test only updates Python-side scales via
nvfp4_global_decode_scale_te but never flips the backend mode, so wrap the
quantize+kernel calls with the shared helper set_nvfp4_4over6_env(use_4over6) so
the backend is actually in 4over6 mode when calling nvfp4_quantize and
trtllm_fp4_block_scale_routed_moe; specifically, call
set_nvfp4_4over6_env(use_4over6) around the block that computes
hidden_states/hidden_states_scale/per_token_scale_inv with nvfp4_quantize and
the subsequent trtllm_fp4_block_scale_routed_moe invocation so both scale
computation and kernel execution use the same mode (references:
nvfp4_global_decode_scale_te, nvfp4_quantize, set_nvfp4_4over6_env,
trtllm_fp4_block_scale_routed_moe).
---
Duplicate comments:
In `@csrc/nv_internal/cpp/kernels/quantization.cu`:
- Around line 338-362: The code reads the process-global use4Over6
unconditionally which causes FP32 instantiations to abort; fix by making the
env-read T-aware: move or re-evaluate
tensorrt_llm::common::getEnvNVFP4Use4Over6() into the template/lambda scope
where T is visible (the launchKernel capture/instantiation) and force it false
for T=float (e.g. compute auto const use4Over6 =
tensorrt_llm::common::getEnvNVFP4Use4Over6() && !std::is_same_v<T,float> and
pass that as the use4Over6Tag/std::bool_constant), then remove or leave the
now-unreachable TLLM_CHECK_WITH_INFO(!USE_4OVER6, ...) in the float branch.
---
Nitpick comments:
In `@tests/test_helpers/utils_fp4.py`:
- Around line 295-302: The loop computes per-block MSE by accumulating squared
differences across the last dim; replace the explicit for-loop with a vectorized
reduction: compute diff4 = dq4 - x_blocks and diff6 = dq6 - x_blocks, square
them and sum over the last axis to produce err4 and err6, then set pick_four =
err4 < err6 (preserving the strict < tiebreak). Update variables err4, err6,
diff4, diff6 and use the existing dq4, dq6, x_blocks, and pick_four names so the
change is localized to that block.
📒 Files selected for processing (14)
- csrc/fused_moe/cutlass_backend/cutlass_fused_moe_kernels.cuh
- csrc/nv_internal/cpp/common/envUtils.cpp
- csrc/nv_internal/cpp/kernels/quantization.cu
- csrc/nv_internal/tensorrt_llm/common/envUtils.h
- csrc/nv_internal/tensorrt_llm/kernels/quantization.cuh
- csrc/nv_internal/tensorrt_llm/kernels/quantization_utils.cuh
- tests/moe/test_trtllm_cutlass_fused_moe.py
- tests/moe/test_trtllm_gen_fused_moe.py
- tests/moe/test_trtllm_gen_moe_autotune_tactics.py
- tests/moe/test_trtllm_gen_per_token_moe.py
- tests/moe/test_trtllm_gen_routed_fused_moe.py
- tests/moe/utils.py
- tests/test_helpers/utils_fp4.py
- tests/utils/test_fp4_quantize.py
🚧 Files skipped from review as they are similar to previous changes (2)
- csrc/nv_internal/cpp/common/envUtils.cpp
- csrc/nv_internal/tensorrt_llm/kernels/quantization_utils.cuh
/bot run
📌 Description
Implement 4over6 nvfp4 from:
TE PR:
Both original nvfp4 and per-token nvfp4 quantizer and moe are supported.
The results are bitwise exact with the reference implementation by enabling:
FLASHINFER_NVFP4_4OVER6_DISABLE_MSE_FAST_MATH=1
TRTLLM_DISABLE_FP4_QUANT_FAST_MATH=1
Under strict no-fast-math mode, the quantizer is bitwise exact with the PyTorch reference implementation.
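For instance, the strict mode can be enabled from Python before the first quantizer call (a sketch; only the two flag names come from this PR, the placement is illustrative):

```python
import os

# Set both flags before the quantization kernels read them; they disable
# fast-math shortcuts so results match the PyTorch reference bitwise.
os.environ["FLASHINFER_NVFP4_4OVER6_DISABLE_MSE_FAST_MATH"] = "1"
os.environ["TRTLLM_DISABLE_FP4_QUANT_FAST_MATH"] = "1"
```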
Need to rebase after:
Future work:
🔍 Related Issues
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
I have installed pre-commit by running pip install pre-commit (or used your preferred method).
I have installed the hooks with pre-commit install.
I have run the hooks with pre-commit run --all-files and fixed any reported issues.
🧪 Tests
Tests have been added or updated as needed (unittest, etc.).
Reviewer Notes
Summary by CodeRabbit
Release Notes
New Features
New environment variables (FLASHINFER_NVFP4_4OVER6, TRTLLM_DISABLE_FP4_QUANT_FAST_MATH, FLASHINFER_NVFP4_4OVER6_DISABLE_MSE_FAST_MATH) for quantization behavior customization.
Tests