
Support 4over6 nvfp4 for quantizer and fused MoE #3264

Open
zianglih wants to merge 15 commits into flashinfer-ai:main from zianglih:4over6

Conversation

@zianglih
Contributor

@zianglih zianglih commented May 7, 2026

📌 Description

@HumansAnd

Implement 4over6 nvfp4 from:

TE PR:

Both the original nvfp4 and per-token nvfp4 quantizer and fused MoE are supported.

The results are bitwise exact with the reference implementation when enabling:

  • FLASHINFER_NVFP4_4OVER6_DISABLE_MSE_FAST_MATH=1
  • TRTLLM_DISABLE_FP4_QUANT_FAST_MATH=1
    Note:
        Set `FLASHINFER_NVFP4_4OVER6=1` to enable the CUDA backend's 4over6
        MSE scale-candidate mode for fp16/bf16 NVFP4 quantization. This mode
        uses the fouroversix adaptive NV scale range, `256 * 6`, instead of the
        standard NVFP4 range, `448 * 6`. For non-per-token outputs, downstream
        dequantization or GEMM code must use the corresponding adjusted global
        scale. Set `FLASHINFER_NVFP4_4OVER6_DISABLE_MSE_FAST_MATH=1` to use
        the bitwise-exact MSE comparison path.
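As a rough sketch of the scale-range difference described in the note above (the helper name and the amax-based formula here are illustrative simplifications, not the library's API):

```python
def nvfp4_global_encode_scale(amax: float, use_4over6: bool) -> float:
    """Illustrative global encode scale: map the tensor's amax into the
    block-scale range times the FP4 max magnitude (6.0 for E2M1)."""
    fp4_max = 6.0
    # fouroversix adaptive range (256 * 6) vs standard NVFP4 range (448 * 6)
    sf_max = 256.0 if use_4over6 else 448.0
    return (sf_max * fp4_max) / amax

# With amax = 672: standard range gives 448*6/672 = 4.0,
# while the 4over6 range gives 256*6/672 ≈ 2.286.
print(nvfp4_global_encode_scale(672.0, use_4over6=False))
print(nvfp4_global_encode_scale(672.0, use_4over6=True))
```

This is why, for non-per-token outputs, downstream dequantization or GEMM must use the adjusted global scale: the same tensor produces a different global encode scale in 4over6 mode.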

Under strict no-fast-math mode, the quantizer is bitwise exact with the PyTorch reference implementation.

Need to rebase after:

Future work:

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

Release Notes

  • New Features

    • Added NVFP4 "4-over-6" quantization mode enabling runtime selection between two FP4 scale candidates for improved accuracy.
    • Introduced environment variable controls (FLASHINFER_NVFP4_4OVER6, TRTLLM_DISABLE_FP4_QUANT_FAST_MATH, FLASHINFER_NVFP4_4OVER6_DISABLE_MSE_FAST_MATH) for quantization behavior customization.
  • Tests

    • Extended test coverage for 4-over-6 quantization mode across FP4 and MoE quantization tests.
    • Updated quantization reference helpers to support the new mode.

@coderabbitai
Contributor

coderabbitai Bot commented May 7, 2026

📝 Walkthrough

Walkthrough

This PR adds a runtime-configurable NVFP4 4-over-6 per-token quantization mode with MSE-based scale-candidate selection. It extends kernel templates with USE_4OVER6 and DISABLE_4OVER6_MSE_FAST_MATH parameters, implements dual-candidate FP4 scale generation in the conversion path with warp-reduced error selection, refactors kernel dispatch to read environment toggles, updates per-token scale derivation logic, and provides comprehensive test coverage with Python reference implementations.
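The dual-candidate selection can be illustrated with a simplified NumPy sketch: one scale candidate maps the block's amax to code 4.0, the other to code 6.0, and the candidate with lower per-block squared error wins. This is only an assumption-laden sketch of the idea; the actual kernel additionally quantizes scales to FP8, handles ties and denormals, and performs the error reduction per warp.

```python
import numpy as np

# Representable E2M1 (FP4) magnitudes.
E2M1_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def dequantize_with_scale(block: np.ndarray, scale: float) -> np.ndarray:
    """Quantize each element to the nearest E2M1 magnitude, then dequantize."""
    scaled = np.abs(block) / scale
    idx = np.abs(scaled[:, None] - E2M1_VALUES[None, :]).argmin(axis=1)
    return np.sign(block) * E2M1_VALUES[idx] * scale

def pick_block_scale_by_mse(block: np.ndarray) -> float:
    """Pick between the 4-candidate and the 6-candidate scale by squared error."""
    amax = np.abs(block).max()
    cand4, cand6 = amax / 4.0, amax / 6.0  # amax maps to code 4.0 vs code 6.0
    err4 = ((dequantize_with_scale(block, cand4) - block) ** 2).sum()
    err6 = ((dequantize_with_scale(block, cand6) - block) ** 2).sum()
    return cand4 if err4 < err6 else cand6
```

For a block like [6, 3, 1.5, 0.5] the 6-candidate reproduces every element exactly, so it wins; blocks whose values cluster well below the maximum can instead favor the 4-candidate's finer resolution.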

Changes

NVFP4 4-over-6 per-token quantization

Layer / File(s) Summary
Environment configuration interface
csrc/nv_internal/tensorrt_llm/common/envUtils.h, csrc/nv_internal/cpp/common/envUtils.cpp
Add getEnvNVFP4Use4Over6() and getEnvNVFP4Disable4Over6MSEFastMath() env getters; remove static caching from getEnvDisableFP4QuantFastMath() to read dynamically.
Template parameter contracts
csrc/nv_internal/tensorrt_llm/kernels/quantization.cuh
Extend quantize_with_block_size, quantize_with_block_size_tma, cvt_fp16_to_fp4_expert, and block_scale_interleave_kernel with USE_4OVER6 and DISABLE_4OVER6_MSE_FAST_MATH template parameters; add compile-time constraints.
Kernel dispatch and template selection
csrc/nv_internal/cpp/kernels/quantization.cu
Add reusable dispatch helpers (dispatchBool, dispatchSFLayout, dispatchFP4QuantMathMode, dispatchFP4KernelConfig); refactor invokeNvfp4QuantAndPerTokenScale and launchFP4QuantizationTma to read env toggles and use dispatch helpers.
FP16→FP4 conversion with candidate selection
csrc/nv_internal/tensorrt_llm/kernels/quantization_utils.cuh
Add e2m1_code_to_float() helper; extend cvt_warp_fp16_to_fp4 with USE_4OVER6 and DISABLE_4OVER6_MSE_FAST_MATH; implement dual-candidate generation, MSE computation, warp reduction, and lower-error selection.
Per-token scale derivation and kernel logic
csrc/nv_internal/tensorrt_llm/kernels/quantization.cuh
Update inverse-scale handling to conditionally use precise division; add nvfp4QuantAndPerTokenScaleKernel branch for USE_4OVER6 && DISABLE_FP4_QUANT_FAST_MATH that derives adjusted per-token scale with zero/denormal handling.
Cutlass MoE kernel integration
csrc/fused_moe/cutlass_backend/cutlass_fused_moe_kernels.cuh
Add dispatchNVFP4QuantConfig helper; extend quantizePackedFPXValue, expandInputRowsKernel, and doActivationKernel with USE_4OVER6/DISABLE_4OVER6_MSE_FAST_MATH parameters; wire dispatch into NVFP4 launches.
Python reference quantization helpers
tests/test_helpers/utils_fp4.py
Update nvfp4_global_encode_scale_te and nvfp4_global_decode_scale_te to support use_4over6 parameter; add _ref_fp4_quant_te_with_decode_scale and ref_fp4_quant_4over6_te for dual-candidate reference quantization.
Test environment and scale layout helpers
tests/moe/utils.py, tests/utils/test_fp4_quantize.py
Add set_nvfp4_4over6_env autouse fixture for env-variable management; add _te_ref_scale_bytes_for_layout to convert reference scales into per-layout indexing.
Test parametrization and use_4over6 coverage
tests/utils/test_fp4_quantize.py, tests/moe/test_trtllm_cutlass_fused_moe.py, tests/moe/test_trtllm_gen_*.py, tests/moe/test_trtllm_gen_per_token_moe.py, tests/moe/test_trtllm_gen_routed_fused_moe.py
Parametrize multiple tests over use_4over6; update global scale computation to use use_4over6-aware helpers; apply conditional MSE tolerances and assertions for 4-over-6 vs standard mode.
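The env-fixture row above can be sketched as a context manager. This is a hypothetical stand-in for the `set_nvfp4_4over6_env` helper described in the table; the real fixture also manages the two fast-math variables.

```python
import os
from contextlib import contextmanager

@contextmanager
def nvfp4_4over6_env(enabled: bool):
    """Toggle FLASHINFER_NVFP4_4OVER6 and restore the previous value on exit."""
    key = "FLASHINFER_NVFP4_4OVER6"
    previous = os.environ.get(key)
    os.environ[key] = "1" if enabled else "0"
    try:
        yield
    finally:
        if previous is None:
            os.environ.pop(key, None)
        else:
            os.environ[key] = previous
```

Because the walkthrough notes that static caching was removed from the env getters, both quantization and kernel launch inside such a context observe the same mode at dispatch time.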

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • sricketts
  • yongwww
  • yzh119
  • cyx-6
  • samuellees

Poem

A rabbit hops through quantized lands,
Where four meets six, both hand in hand,
Candidates duel for error's grace,
As warps reduce across the space—
A 4-over-6 dance, precise and grand! 🐰✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 44.90%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
  • Title check: ✅ Passed. The PR title clearly and concisely summarizes the main change: adding 4over6 NVFP4 support to the quantizer and fused MoE components.
  • Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Description check: ✅ Passed. The pull request description comprehensively documents the implementation of 4over6 NVFP4 quantization, includes rationale with paper/code references, describes the feature scope, and mentions test coverage and pre-commit verification.


Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a new NVFP4 quantization mode called '4/6 MSE scale-candidate mode,' which is activated via the FLASHINFER_NVFP4_FOUR_OVER_SIX environment variable. The implementation includes updates to CUDA kernels for per-token scaling and quantization, as well as corresponding Python tests and documentation. Reviewer feedback suggests several optimizations for the CUDA code, including refactoring duplicated logic into helper functions, precalculating values to reduce redundant arithmetic operations within loops, and replacing switch statements with lookup tables to improve performance and readability.

Comment thread csrc/nv_internal/tensorrt_llm/kernels/quantization.cuh Outdated
Comment thread csrc/nv_internal/tensorrt_llm/kernels/quantization_utils.cuh Outdated
Comment thread csrc/nv_internal/tensorrt_llm/kernels/quantization_utils.cuh Outdated
Comment thread csrc/nv_internal/tensorrt_llm/kernels/quantization_utils.cuh
Comment thread csrc/nv_internal/tensorrt_llm/kernels/quantization_utils.cuh Outdated
Comment thread csrc/nv_internal/tensorrt_llm/kernels/quantization_utils.cuh Outdated
Contributor

@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/utils/test_fp4_quantize.py (1)

706-747: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Pin FOUR_OVER_SIX off for the baseline TE-reference test.

Line 706 validates the non-4/6 reference path, but this test can be affected by an externally set FLASHINFER_NVFP4_FOUR_OVER_SIX. Make the mode explicit in-test to avoid environment-coupled failures.

🔧 Proposed fix
 def test_nvfp4_per_token_quantize_te_reference(
     dtype: torch.dtype,
     shape: tuple[int, int],
     sf_layout: SfLayout,
     init_data: str,
     device: str,
+    monkeypatch: pytest.MonkeyPatch,
 ) -> None:
     """Per-token NVFP4 quantization should match the TE Python reference bitwise."""
     if not _is_fp4_supported(torch.device(device)):
         pytest.skip("Nvfp4 Requires compute capability >= 10 and CUDA >= 12.8")
+    monkeypatch.setenv("FLASHINFER_NVFP4_FOUR_OVER_SIX", "0")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/utils/test_fp4_quantize.py` around lines 706 - 747, In
test_nvfp4_per_token_quantize_te_reference ensure the FOUR_OVER_SIX mode is
pinned off so the TE-reference path is deterministic: at the start of
test_nvfp4_per_token_quantize_te_reference set the environment flag
FLASHINFER_NVFP4_FOUR_OVER_SIX="0" (or call your library’s setter if available)
before creating x and running ref_fp4_quant_te/nvfp4_quantize, and restore the
previous value at the end of the test to avoid leaking global state.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@tests/utils/test_fp4_quantize.py`:
- Around line 706-747: In test_nvfp4_per_token_quantize_te_reference ensure the
FOUR_OVER_SIX mode is pinned off so the TE-reference path is deterministic: at
the start of test_nvfp4_per_token_quantize_te_reference set the environment flag
FLASHINFER_NVFP4_FOUR_OVER_SIX="0" (or call your library’s setter if available)
before creating x and running ref_fp4_quant_te/nvfp4_quantize, and restore the
previous value at the end of the test to avoid leaking global state.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e9dcc260-81db-4c62-b9e1-585a7ba243bb

📥 Commits

Reviewing files that changed from the base of the PR and between c5c089b and 0b79d4f.

📒 Files selected for processing (5)
  • csrc/nv_internal/cpp/kernels/quantization.cu
  • csrc/nv_internal/tensorrt_llm/kernels/quantization.cuh
  • csrc/nv_internal/tensorrt_llm/kernels/quantization_utils.cuh
  • flashinfer/quantization/fp4_quantization.py
  • tests/utils/test_fp4_quantize.py

Collaborator

@aleozlx aleozlx left a comment


looks good to me so far!

thx for the contrib. pls address conflicts

@zianglih zianglih marked this pull request as draft May 8, 2026 19:08
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 064e42d2-1286-4387-8bd1-9c66fe18ddac

📥 Commits

Reviewing files that changed from the base of the PR and between 0b79d4f and b36e9a6.

📒 Files selected for processing (7)
  • csrc/nv_internal/cpp/common/envUtils.cpp
  • csrc/nv_internal/cpp/kernels/quantization.cu
  • csrc/nv_internal/tensorrt_llm/common/envUtils.h
  • csrc/nv_internal/tensorrt_llm/kernels/quantization.cuh
  • csrc/nv_internal/tensorrt_llm/kernels/quantization_utils.cuh
  • flashinfer/quantization/fp4_quantization.py
  • tests/utils/test_fp4_quantize.py
✅ Files skipped from review due to trivial changes (2)
  • csrc/nv_internal/tensorrt_llm/common/envUtils.h
  • flashinfer/quantization/fp4_quantization.py

Comment thread csrc/nv_internal/cpp/kernels/quantization.cu Outdated
@zianglih changed the title from "Implement 4 over 6 nvfp4 quantizer for per-token nvfp4" to "Implement 4over6 nvfp4 quantizer for per-token nvfp4" on May 8, 2026
@zianglih changed the title from "Implement 4over6 nvfp4 quantizer for per-token nvfp4" to "Implement 4over6 nvfp4 quantizer" on May 8, 2026
@zianglih changed the title from "Implement 4over6 nvfp4 quantizer" to "Support 4over6 nvfp4 for quantizer and fused MoE" on May 9, 2026
@zianglih zianglih marked this pull request as ready for review May 9, 2026 01:06
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/moe/test_trtllm_gen_per_token_moe.py (1)

114-134: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

This changes the scales, but not the backend mode.

The new use_4over6 branch only rewrites the Python-side NVFP4 scale factors. The test never enables 4over6 via set_nvfp4_4over6_env before calling nvfp4_quantize() and trtllm_fp4_block_scale_routed_moe(), so the True cases are not validating the actual 4over6 implementation. Apply the shared env helper around the quantize + kernel section so both sides run in the same mode.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/moe/test_trtllm_gen_per_token_moe.py` around lines 114 - 134, The test
only updates Python-side scales via nvfp4_global_decode_scale_te but never flips
the backend mode, so wrap the quantize+kernel calls with the shared helper
set_nvfp4_4over6_env(use_4over6) so the backend is actually in 4over6 mode when
calling nvfp4_quantize and trtllm_fp4_block_scale_routed_moe; specifically, call
set_nvfp4_4over6_env(use_4over6) around the block that computes
hidden_states/hidden_states_scale/per_token_scale_inv with nvfp4_quantize and
the subsequent trtllm_fp4_block_scale_routed_moe invocation so both scale
computation and kernel execution use the same mode (references:
nvfp4_global_decode_scale_te, nvfp4_quantize, set_nvfp4_4over6_env,
trtllm_fp4_block_scale_routed_moe).
♻️ Duplicate comments (1)
csrc/nv_internal/cpp/kernels/quantization.cu (1)

338-362: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

FP32 input still aborts when FLASHINFER_NVFP4_4OVER6=1.

use4Over6 is read unconditionally from the process-global env var, and the if constexpr (std::is_same_v<T, float>) branch then aborts via TLLM_CHECK_WITH_INFO(!USE_4OVER6, ...). Any caller that quantizes a float input in a process where the env var is set (e.g. an MoE test running after a 4-over-6 test set the env in the same process) will fail, even though the legacy FP32 kernel is unchanged and capable of handling the request. Force use4Over6=false for T=float at the env-read site instead of aborting downstream.

💡 Suggested fix
-  bool const disableFP4QuantFastMath = tensorrt_llm::common::getEnvDisableFP4QuantFastMath();
-  bool const use4Over6 = tensorrt_llm::common::getEnvNVFP4Use4Over6();
-  bool const disable4Over6MSEFastMath = tensorrt_llm::common::getEnvNVFP4Disable4Over6MSEFastMath();
+  bool const disableFP4QuantFastMath = tensorrt_llm::common::getEnvDisableFP4QuantFastMath();
+  bool const use4Over6 =
+      !std::is_same_v<T, float> && tensorrt_llm::common::getEnvNVFP4Use4Over6();
+  bool const disable4Over6MSEFastMath =
+      use4Over6 && tensorrt_llm::common::getEnvNVFP4Disable4Over6MSEFastMath();

With that, the TLLM_CHECK_WITH_INFO(!USE_4OVER6, ...) inside the T=float branch becomes unreachable and can be dropped (or kept defensively).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@csrc/nv_internal/cpp/kernels/quantization.cu` around lines 338 - 362, The
code reads the process-global use4Over6 unconditionally which causes FP32
instantiations to abort; fix by making the env-read T-aware: move or re-evaluate
tensorrt_llm::common::getEnvNVFP4Use4Over6() into the template/lambda scope
where T is visible (the launchKernel capture/instantiation) and force it false
for T=float (e.g. compute auto const use4Over6 =
tensorrt_llm::common::getEnvNVFP4Use4Over6() && !std::is_same_v<T,float> and
pass that as the use4Over6Tag/std::bool_constant), then remove or leave the
now-unreachable TLLM_CHECK_WITH_INFO(!USE_4OVER6, ...) in the float branch.
🧹 Nitpick comments (1)
tests/test_helpers/utils_fp4.py (1)

295-302: ⚡ Quick win

Vectorize the per-element MSE accumulation.

The explicit Python loop over block_size=16 is unnecessary work and obscures the intent. A vectorized form is shorter, faster, and (because the reduction order across the last dim is implementation-defined either way) preserves the strict < tiebreak on pick_four.

♻️ Proposed refactor
-    err4 = torch.zeros((m, n // block_size), dtype=torch.float32, device=x.device)
-    err6 = torch.zeros((m, n // block_size), dtype=torch.float32, device=x.device)
-    for i in range(block_size):
-        diff4 = dq4[:, :, i] - x_blocks[:, :, i]
-        diff6 = dq6[:, :, i] - x_blocks[:, :, i]
-        err4 += diff4 * diff4
-        err6 += diff6 * diff6
-    pick_four = err4 < err6
+    err4 = ((dq4 - x_blocks) ** 2).sum(dim=-1)
+    err6 = ((dq6 - x_blocks) ** 2).sum(dim=-1)
+    pick_four = err4 < err6
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_helpers/utils_fp4.py` around lines 295 - 302, The loop computes
per-block MSE by accumulating squared differences across the last dim; replace
the explicit for-loop with a vectorized reduction: compute diff4 = dq4 -
x_blocks and diff6 = dq6 - x_blocks, square them and sum over the last axis to
produce err4 and err6, then set pick_four = err4 < err6 (preserving the strict <
tiebreak). Update variables err4, err6, diff4, diff6 and use the existing dq4,
dq6, x_blocks, and pick_four names so the change is localized to that block.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/moe/test_trtllm_gen_fused_moe.py`:
- Around line 2680-2687: run_moe_test currently only uses use_4over6 for
skipping but never actually sets the process env, so 4over6 paths may not be
exercised; wrap the quantize/reference/production section inside the
set_nvfp4_4over6_env context by calling set_nvfp4_4over6_env(use_4over6) (and
ensure the helper is imported) before entering the FP4
quantize/reference/production logic in run_moe_test and restore/unset it after
that block so the FLASHINFER_NVFP4_4OVER6 env state is consistently applied only
for those test cases.

In `@tests/moe/test_trtllm_gen_moe_autotune_tactics.py`:
- Around line 160-169: The test never actually enables the 4over6 NVFP4 runtime
flag because set_nvfp4_4over6_env is never applied; update the test harness so
that when _quant_mode_config is called with use_4over6=True the runtime
environment is toggled for the kernel run: call set_nvfp4_4over6_env(True)
before invoking _run_kernel_with_tactic (and set_nvfp4_4over6_env(False) or
restore the previous state after) so the launched kernel uses the 4over6 path;
adjust every place that constructs the use_4over6=True matrix (including the
other occurrences you noted) to wrap the kernel invocation with the env setter
rather than only changing scales.

In `@tests/moe/test_trtllm_gen_routed_fused_moe.py`:
- Around line 82-83: The test toggles use_4over6 but never actually flips the
NVFP4 4over6 environment, so fp4_quantize() and the routed/non-routed MoE kernel
calls still use the global env; fix by wrapping the sections that perform FP4
quantization and invoke the MoE kernels (references: fp4_quantize, the routed
MoE kernel call(s) and the non-routed MoE kernel call(s)) in the
set_nvfp4_4over6_env context when use_4over6 is True (e.g., with
set_nvfp4_4over6_env(): ...) so the env is applied for those operations and is
restored afterward; apply this same wrapping to the other similar test blocks
currently duplicated later in the file.

In `@tests/moe/utils.py`:
- Around line 40-65: The fixture set_nvfp4_4over6_env currently force-sets
TRTLLM_DISABLE_FP4_QUANT_FAST_MATH and
FLASHINFER_NVFP4_4OVER6_DISABLE_MSE_FAST_MATH unconditionally; change it so
those two env vars are only set when request.getfixturevalue("use_4over6") is
truthy (i.e., set them inside the branch where use_4over6 is True and leave them
untouched when False), while still recording original_values and restoring them
after yield; keep FLASHINFER_NVFP4_4OVER6 set to "1"/"0" based on use_4over6 as
before.

---

Outside diff comments:
In `@tests/moe/test_trtllm_gen_per_token_moe.py`:
- Around line 114-134: The test only updates Python-side scales via
nvfp4_global_decode_scale_te but never flips the backend mode, so wrap the
quantize+kernel calls with the shared helper set_nvfp4_4over6_env(use_4over6) so
the backend is actually in 4over6 mode when calling nvfp4_quantize and
trtllm_fp4_block_scale_routed_moe; specifically, call
set_nvfp4_4over6_env(use_4over6) around the block that computes
hidden_states/hidden_states_scale/per_token_scale_inv with nvfp4_quantize and
the subsequent trtllm_fp4_block_scale_routed_moe invocation so both scale
computation and kernel execution use the same mode (references:
nvfp4_global_decode_scale_te, nvfp4_quantize, set_nvfp4_4over6_env,
trtllm_fp4_block_scale_routed_moe).

---

Duplicate comments:
In `@csrc/nv_internal/cpp/kernels/quantization.cu`:
- Around line 338-362: The code reads the process-global use4Over6
unconditionally which causes FP32 instantiations to abort; fix by making the
env-read T-aware: move or re-evaluate
tensorrt_llm::common::getEnvNVFP4Use4Over6() into the template/lambda scope
where T is visible (the launchKernel capture/instantiation) and force it false
for T=float (e.g. compute auto const use4Over6 =
tensorrt_llm::common::getEnvNVFP4Use4Over6() && !std::is_same_v<T,float> and
pass that as the use4Over6Tag/std::bool_constant), then remove or leave the
now-unreachable TLLM_CHECK_WITH_INFO(!USE_4OVER6, ...) in the float branch.

---

Nitpick comments:
In `@tests/test_helpers/utils_fp4.py`:
- Around line 295-302: The loop computes per-block MSE by accumulating squared
differences across the last dim; replace the explicit for-loop with a vectorized
reduction: compute diff4 = dq4 - x_blocks and diff6 = dq6 - x_blocks, square
them and sum over the last axis to produce err4 and err6, then set pick_four =
err4 < err6 (preserving the strict < tiebreak). Update variables err4, err6,
diff4, diff6 and use the existing dq4, dq6, x_blocks, and pick_four names so the
change is localized to that block.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 07674709-c056-41fd-8bc9-27c3e59e1102

📥 Commits

Reviewing files that changed from the base of the PR and between b36e9a6 and 7d2f214.

📒 Files selected for processing (14)
  • csrc/fused_moe/cutlass_backend/cutlass_fused_moe_kernels.cuh
  • csrc/nv_internal/cpp/common/envUtils.cpp
  • csrc/nv_internal/cpp/kernels/quantization.cu
  • csrc/nv_internal/tensorrt_llm/common/envUtils.h
  • csrc/nv_internal/tensorrt_llm/kernels/quantization.cuh
  • csrc/nv_internal/tensorrt_llm/kernels/quantization_utils.cuh
  • tests/moe/test_trtllm_cutlass_fused_moe.py
  • tests/moe/test_trtllm_gen_fused_moe.py
  • tests/moe/test_trtllm_gen_moe_autotune_tactics.py
  • tests/moe/test_trtllm_gen_per_token_moe.py
  • tests/moe/test_trtllm_gen_routed_fused_moe.py
  • tests/moe/utils.py
  • tests/test_helpers/utils_fp4.py
  • tests/utils/test_fp4_quantize.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • csrc/nv_internal/cpp/common/envUtils.cpp
  • csrc/nv_internal/tensorrt_llm/kernels/quantization_utils.cuh

Comment thread tests/moe/test_trtllm_gen_fused_moe.py
Comment thread tests/moe/test_trtllm_gen_moe_autotune_tactics.py
Comment thread tests/moe/test_trtllm_gen_routed_fused_moe.py
Comment thread tests/moe/utils.py
@IwakuraRein
Collaborator

/bot run

@flashinfer-bot
Collaborator

GitLab MR !655 has been created, and the CI pipeline #50739016 is currently running. I'll report back once the pipeline job completes.
