Round the ue8m0 FP8 scale before quantizing so dequant matches the stored inverse #46763

Merged
SunMarc merged 1 commit into huggingface:main from Incheonkirin:fix-ue8m0-quantize-scale-ordering
Jun 24, 2026
Conversation

@Incheonkirin

Contributor

Fp8Quantize._quantize_one quantizes the weight with the unrounded block scale, then for scale_fmt="ue8m0" (DeepSeek-V4 style) rounds only the stored weight_scale_inv up to a power of two. At load time Fp8Dequantize computes weight * weight_scale_inv, so the dequant scale can be up to a full octave (~2x) larger than the scale the weight was actually divided by. Nothing flags it: the shapes and dtypes are valid, the weights are just wrong.
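The size of the mismatch can be written out per block. The symbols below are illustrative rather than taken from the code, and the assumption that the block scale is the block's absolute maximum divided by the e4m3 maximum of 448 is a common convention, not something this description states:

```latex
% Assumed notation: s is the unrounded inverse scale of one block,
% \hat{s} is the value stored as weight_scale_inv.
\begin{aligned}
s &= \frac{\max_i |w_i|}{448}, \qquad
\hat{s} = 2^{\lceil \log_2 s \rceil} \\
q_i &= \operatorname{fp8}\!\left(\frac{w_i}{s}\right)
      \quad\text{(old order: quantize with the unrounded scale)} \\
\tilde{w}_i &= q_i\,\hat{s} \;\approx\; w_i\,\frac{\hat{s}}{s},
      \qquad 1 \le \frac{\hat{s}}{s} < 2
\end{aligned}
% Example: \max_i |w_i| = 2.5 gives s \approx 0.00558 and \hat{s} = 2^{-7} = 0.0078125,
% so every weight in the block is dequantized about 1.4x too large.
```

With the rounded scale used on both sides, the factor $\hat{s}/s$ cancels and only the FP8 rounding error remains.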

DeepSeek's own DeepGEMM reference rounds the scale first and quantizes with that same rounded scale (deep_gemm/utils/math.py: sf = ceil_to_ue8m0(sf); x_fp8 = x / sf), so quant and dequant agree. This change matches that ordering: round inv_scales to the power-of-two grid, re-derive scales = 1 / inv_scales, then quantize. The rounding direction is unchanged; only the order moves. The scale_fmt="float" path never reassigns scales, so non-ue8m0 checkpoints quantize bit-identically.
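The two orderings can be compared in a small pure-Python sketch. It is not the code from `Fp8Quantize._quantize_one`: `fake_e4m3`, `quantize_block`, `dequantize_block`, `max_rel_error`, and the `round_first` flag are hypothetical helpers, the FP8 cast is a crude emulation that keeps four significant bits and ignores subnormals, and the block scale is assumed to be `amax / 448`.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite float8_e4m3fn value


def ceil_to_ue8m0(x: float) -> float:
    """Round a positive scale up to the next power of two."""
    return 2.0 ** math.ceil(math.log2(x))


def fake_e4m3(x: float) -> float:
    """Crude stand-in for an e4m3 cast: clamp to +-448 and keep
    1 implicit + 3 explicit mantissa bits. Subnormals are ignored."""
    if x == 0.0:
        return 0.0
    x = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x))
    m, e = math.frexp(x)  # x == m * 2**e with 0.5 <= |m| < 1
    return math.ldexp(round(m * 16) / 16, e)


def quantize_block(block, scale_fmt="ue8m0", round_first=True):
    """Return (quantized values, stored inverse scale) for one block.

    round_first=True  -> ordering described in this PR (round, then quantize)
    round_first=False -> previous ordering (quantize, then round the stored scale)
    """
    amax = max(max(abs(v) for v in block), 1e-12)  # guard against all-zero blocks
    inv_scales = amax / FP8_E4M3_MAX  # multiplier used at dequant time
    scales = 1.0 / inv_scales         # multiplier used at quant time
    if scale_fmt == "ue8m0":
        rounded = ceil_to_ue8m0(inv_scales)
        if round_first:
            # Re-derive the forward scale from the rounded inverse. Rounding
            # up only shrinks `scales`, so values stay within +-448.
            scales = 1.0 / rounded
        inv_scales = rounded
    q = [fake_e4m3(v * scales) for v in block]
    return q, inv_scales


def dequantize_block(q, inv_scales):
    return [v * inv_scales for v in q]


def max_rel_error(block, **kwargs):
    q, inv_scales = quantize_block(block, **kwargs)
    restored = dequantize_block(q, inv_scales)
    return max(abs(r - v) / abs(v) for r, v in zip(restored, block) if v != 0.0)


if __name__ == "__main__":
    block = [0.3, -1.7, 2.5, 0.01]
    print("round first :", max_rel_error(block, round_first=True))
    print("round after :", max_rel_error(block, round_first=False))
```

In this sketch the `scale_fmt="float"` branch never touches `scales`, so `round_first` has no effect on it, which mirrors the claim that non-ue8m0 checkpoints are unaffected.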

Tests: test_fp8_ue8m0_quantize_dequantize_round_trip checks the round-trip (main ~0.6, with this change ~0.022, matching the e4m3 floor); test_fp8_float_scale_fmt_quantization_unchanged asserts the scale_fmt="float" path stays bit-identical to the original formula across 100 random 128x128 blocks. CPU-only.

…ored inverse

For scale_fmt="ue8m0", Fp8Quantize._quantize_one quantized the weight with the
unrounded block scale but stored a weight_scale_inv rounded up to a power of two.
Fp8Dequantize multiplies by that stored inverse, so the round-trip was off by up to
a full octave per block. Round the inverse scale first, re-derive the forward scale
from it, then quantize, matching DeepGEMM's order. The scale_fmt="float" path is
unchanged.
@github-actions

Contributor

CI Dashboard: View test results in Grafana

@Rocketknight1

Member

cc @SunMarc for quants

@SunMarc SunMarc left a comment

Member


Indeed, otherwise, it doesn't match. cc @ArthurZucker for confirmation

)
torch.testing.assert_close(dequantized_q, expected_q, rtol=1e-2, atol=1e-2)

def test_fp8_ue8m0_quantize_dequantize_round_trip(self):

Member


don't put that here, you should put that in quantization folder / fp8 folder.

@SunMarc SunMarc requested a review from ArthurZucker June 22, 2026 17:03
@SunMarc SunMarc enabled auto-merge June 24, 2026 14:04
@SunMarc SunMarc added this pull request to the merge queue Jun 24, 2026
@HuggingFaceDocBuilderDev


The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Merged via the queue into huggingface:main with commit e32afc7 Jun 24, 2026
106 checks passed
4 participants