Implement 4over6 NVFP4 recipe #2972
Conversation
Signed-off-by: Ziang Li <ziangli@umich.edu>
Greptile Summary

This PR implements 4over6 block-scale selection for NVFP4 1D quantization: for each 16-element block the kernel computes two scale candidates (map-to-4 and map-to-6), quantizes with both, and picks the one with lower MSE, using a reduced global-scale ceiling of 256 instead of 448 to give the map-4 branch room to represent larger blocks.
Confidence Score: 4/5

The 4over6 logic is well-guarded with incompatible-mode checks at every entry point; the most notable gap is that NVTE_USE_FAST_MATH is wired only into the split-quantization path and has no effect on the standard single-tensor path. The core quantization math and scale-selection logic look correct and are backed by a reference implementation with tests. The main concerns are the NVTE_USE_FAST_MATH env var being silently ignored for single-tensor 4over6 quantization, and the hardcoded [2] array sizes in the CUDA kernel that rely on an implicit loop-bound assumption without a static assert.

Important Files Changed

- transformer_engine/common/transpose/quantize_transpose_vector_blockwise_fp4.cu (hardcoded array sizing)
- transformer_engine/pytorch/csrc/extensions/cast.cpp (fast-math env var only applied in the split path)
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["NVFP4BlockScaling(enable_4over6=True)"] --> B[NVFP4BlockScalingRecipeState]
B --> C["NVFP4Quantizer(use_4over6=True)"]
C --> D{quantize path}
D -->|single tensor| E["quantize_impl (quantizer.cpp)"]
D -->|split tensor| F["split_quantize_nvfp4_impl_helper (cast.cpp)"]
E --> G["QuantizationConfig.set_nvfp4_4over6(true)"]
F --> G
F -->|reads env var| H["NVTE_USE_FAST_MATH → config.set_use_fast_math"]
G --> I{kernel dispatch}
I -->|tuned 1D| J["quantize_transpose_nvfp4_tuned_1D_kernel<USE_4OVER6=true>"]
I -->|vector blockwise| K["block_scaled_1d_cast_transpose_kernel<kUse4Over6=true>"]
J --> L["rowwise_scaling: compute map4+map6 scales, pick lower MSE"]
K --> M["cvt_fp32_to_fp4_8x_with_mse_rn: err_map4 vs err_map6"]
L --> N["NVFP4Tensor with _use_4over6=True"]
M --> N
```
Reviews (1): Last reviewed commit: "Initial implementation"
```cpp
        need_separate_rng_states, quant_config_list,
        dummy_quant_config_list_colwise);  // colwise rng states are not needed in this case

    for (auto &config : quant_config_list) {
      config.set_nvfp4_4over6(quantizer.use_4over6);
    }

    const auto use_fast_math = transformer_engine::getenv<bool>("NVTE_USE_FAST_MATH");
    if (use_fast_math) {
```
NVTE_USE_FAST_MATH only applied in split-quantization path
NVTE_USE_FAST_MATH is read and forwarded to quant_config_list only here in split_quantize_nvfp4_impl_helper. However, in quantizer.cpp::quantize_impl, no equivalent env-var read exists, so the use_fast_math field on the single-tensor QuantizationConfig stays false even when NVTE_USE_FAST_MATH=1. This means the fast-math variant of cvt_fp32_to_fp4_8x_with_mse_rn (USE_FAST_MATH=true) is unreachable for ordinary single-tensor 4over6 quantization — the env var is silently ignored on that path.
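The fix the comment implies is to read the env var in every entry point that builds a QuantizationConfig, not just the split path. A minimal Python model of that gating pattern (the real code is C++ and uses transformer_engine::getenv<bool>; the helper and class here are invented for illustration):

```python
import os

def getenv_bool(name: str) -> bool:
    """Parse a boolean env var the way a typical getenv<bool> helper would."""
    return os.environ.get(name, "0").lower() in ("1", "true", "yes")

class QuantizationConfig:
    """Simplified stand-in for the C++ QuantizationConfig."""
    def __init__(self):
        self.use_fast_math = False
        self.nvfp4_4over6 = False

def configure(config, use_4over6):
    # Reading the env var here, in the shared configuration step, makes
    # NVTE_USE_FAST_MATH effective for both single-tensor and split paths.
    config.nvfp4_4over6 = use_4over6
    config.use_fast_math = getenv_bool("NVTE_USE_FAST_MATH")
    return config
```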
```cpp
      const float x[8] = {
          static_cast<float>(smem_vec[2 * (i + 0)].data.elt[smem_idx]),
          static_cast<float>(smem_vec[2 * (i + 0) + 1].data.elt[smem_idx]),
          static_cast<float>(smem_vec[2 * (i + 1)].data.elt[smem_idx]),
          static_cast<float>(smem_vec[2 * (i + 1) + 1].data.elt[smem_idx]),
          static_cast<float>(smem_vec[2 * (i + 2)].data.elt[smem_idx]),
          static_cast<float>(smem_vec[2 * (i + 2) + 1].data.elt[smem_idx]),
          static_cast<float>(smem_vec[2 * (i + 3)].data.elt[smem_idx]),
          static_cast<float>(smem_vec[2 * (i + 3) + 1].data.elt[smem_idx]),
      };
      output_vec_map4[out_idx] =
          transformer_engine::dispatch::nvfp4::core::cvt_fp32_to_fp4_8x_with_mse_rn<
              kUseFastMath>(x, encode_scale_map4, scale_inv_map4, global_amax[0], &err_map4);
      output_vec_map6[out_idx] =
          transformer_engine::dispatch::nvfp4::core::cvt_fp32_to_fp4_8x_with_mse_rn<
              kUseFastMath>(x, encode_scale_map6, scale_inv_map6, global_amax[0], &err_map6);
    }

    if (err_map4 < err_map6) {
      scale_inv = scale_inv_map4;
      *reinterpret_cast<uint32_t*>(&output_vec.data.elt[0]) = output_vec_map4[0];
      *reinterpret_cast<uint32_t*>(&output_vec.data.elt[4]) = output_vec_map4[1];
    } else {
```
Hardcoded [2] array sizes tied to implicit loop-bound assumption
output_vec_map4[2] and output_vec_map6[2] are sized for exactly two uint32_t outputs, which is correct only when kNVecOut / kNVecSMem == 8 (so the loop runs i = 0, 4). The same pattern appears for the transpose path (kNVecOut / kNFP4PerContainer). There is no static_assert verifying that either ratio equals 8. If a future tuning changes kNVecSMem, kNFP4PerContainer, or kNVecOut, the index out_idx = i / 4 will silently write past the end of these stack arrays, corrupting adjacent local state inside the register file.
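The indexing claim is easy to model. A small Python sketch of the loop bound (constant names mirror the kernel but the concrete values below are illustrative, not taken from the tuning tables):

```python
def out_indices(n_vec_out, n_vec_smem):
    """Model the kernel loop: i steps by 4 over kNVecOut // kNVecSMem,
    and each iteration writes output_vec_mapX[i // 4]."""
    ratio = n_vec_out // n_vec_smem
    return [i // 4 for i in range(0, ratio, 4)]

# With the current tuning the ratio is 8, so exactly two slots (0 and 1)
# are written, matching the hardcoded output_vec_map4[2] / output_vec_map6[2].
assert out_indices(32, 4) == [0, 1]

# Any tuning that pushes the ratio past 8 would index past slot 1; this is
# exactly the condition a static_assert(kNVecOut / kNVecSMem == 8) would reject.
assert max(out_indices(64, 4)) > 1
```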
```python
assert (
    self.backward_override in _BACKWARD_OVERRIDES
), "NVTE_BACKWARD_OVERRIDE must be unset or one of: 'high_precision', 'dequantized'."
if self.enable_4over6:
    assert self.disable_rht, "NVFP4 4over6 currently requires RHT to be disabled"
    assert (
        self.disable_stochastic_rounding
    ), "NVFP4 4over6 currently requires stochastic rounding to be disabled"
```
enable_4over6 raises an unhelpful error when set via env var while the required sibling env vars are unset
enable_4over6 is read from NVTE_NVFP4_ENABLE_4OVER6, but the __post_init__ asserts immediately require disable_rht, disable_stochastic_rounding, and disable_2d_quantization to all be True. A user who sets only NVTE_NVFP4_ENABLE_4OVER6=1 — following a natural reading of the env var docs — gets an AssertionError at recipe construction with no actionable hint. Consider surfacing the required sibling env vars in the assertion message (e.g. "Set NVTE_NVFP4_DISABLE_RHT=1").
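One way to make the failure actionable, as the comment suggests, is to collect the missing prerequisites and name their env vars in a single message. A simplified sketch (the class is a stand-in for the recipe; only NVTE_NVFP4_DISABLE_RHT appears in the comment above, so the other two env-var names are guesses for illustration):

```python
from dataclasses import dataclass

@dataclass
class NVFP4RecipeFlags:
    """Simplified stand-in for the recipe's 4over6-related flags."""
    enable_4over6: bool = False
    disable_rht: bool = False
    disable_stochastic_rounding: bool = False
    disable_2d_quantization: bool = False

    def __post_init__(self):
        if self.enable_4over6:
            # Env-var hints: only NVTE_NVFP4_DISABLE_RHT is confirmed by the
            # review comment; the other two names are hypothetical.
            required = {
                "disable_rht": "NVTE_NVFP4_DISABLE_RHT=1",
                "disable_stochastic_rounding": "NVTE_NVFP4_DISABLE_STOCHASTIC_ROUNDING=1",
                "disable_2d_quantization": "NVTE_NVFP4_DISABLE_2D_QUANTIZATION=1",
            }
            missing = [hint for flag, hint in required.items() if not getattr(self, flag)]
            assert not missing, (
                "NVFP4 4over6 requires RHT, stochastic rounding, and 2D quantization "
                f"to be disabled. Also set: {', '.join(missing)}"
            )
```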
```diff
-if use_bias:
-    check_grouped_bias(te_grouped_linear, num_gemms, ffn_hidden_size)
+if use_bias and te_grouped_linear.single_grouped_bias:
+    check_grouped_bias(te_grouped_linear, num_gemms, ffn_hidden_size)
```
We need this to get the test to pass, but it seems unrelated to our changes.
Description
@HumansAnd
Implement 4over6 nvfp4 from:
FlashInfer PR:
Enable per-block map-to-4 versus map-to-6 candidate selection for NVFP4 1D quantization in the NVFP4BlockScaling recipe. This mode currently requires RHT, stochastic rounding, and 2D quantization to be disabled. Both the original per-tensor scaling and the row-scaling NVFP4 introduced by #2931 are supported. This PR also fixes a few minor bugs for row-scaled NVFP4 from #2931.
Type of change
Changes
Please list the changes introduced in this PR:
Checklist: