Round the ue8m0 FP8 scale before quantizing so dequant matches the stored inverse by Incheonkirin · Pull Request #46763 · huggingface/transformers

Incheonkirin · 2026-06-19T08:13:20Z

Fp8Quantize._quantize_one quantizes the weight with the unrounded block scale, then for scale_fmt="ue8m0" (DeepSeek-V4 style) rounds only the stored weight_scale_inv up to a power of two. At load time Fp8Dequantize computes weight * weight_scale_inv, so the dequant scale can be up to a full octave (~2x) larger than the scale the weight was actually divided by. Nothing flags it: the shapes and dtypes are valid, the weights are just wrong.
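
To make the size of the mismatch concrete, here is a minimal scalar sketch of the old ordering. It is not the transformers code: `ceil_to_pow2` and `scale_mismatch` are hypothetical helper names, and the block is reduced to its absolute maximum, assuming the usual `absmax / 448` block scale for float8_e4m3fn.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite float8_e4m3fn value


def ceil_to_pow2(x: float) -> float:
    """Round a positive value up to the nearest power of two (the ue8m0 grid)."""
    mantissa, exponent = math.frexp(x)  # x == mantissa * 2**exponent, 0.5 <= mantissa < 1
    if mantissa == 0.5:                 # x is already an exact power of two
        return math.ldexp(1.0, exponent - 1)
    return math.ldexp(1.0, exponent)


def scale_mismatch(block_absmax: float) -> float:
    """Ratio of the stored dequant scale to the scale the weight was divided by."""
    inv_scale = block_absmax / FP8_E4M3_MAX     # unrounded scale used for quantizing
    stored_inv_scale = ceil_to_pow2(inv_scale)  # old ordering: only the stored copy is rounded
    return stored_inv_scale / inv_scale
```

The ratio is exactly 1 only when the unrounded scale already sits on a power of two, and it approaches 2 when the scale lands just above one, which is the "up to a full octave" case.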

DeepSeek's own DeepGEMM reference rounds the scale first and quantizes with that same rounded scale (deep_gemm/utils/math.py: sf = ceil_to_ue8m0(sf); x_fp8 = x / sf), so quant and dequant agree. This change matches that ordering: round inv_scales to the power-of-two grid, re-derive scales = 1 / inv_scales, then quantize. The rounding direction is unchanged; only the order moves. The scale_fmt="float" path never reassigns scales, so non-ue8m0 checkpoints quantize bit-identically.
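
The two orderings can be compared side by side with a small pure-Python sketch. Everything here is illustrative rather than the PR's implementation: `round_to_e4m3` is a simplified, saturating emulation of the float8_e4m3fn cast, and `quantize_block`, `dequantize_block`, and `max_rel_error` are hypothetical names operating on a flat list instead of a tensor block.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite float8_e4m3fn value


def ceil_to_pow2(x: float) -> float:
    """Round a positive value up to the nearest power of two (the ue8m0 grid)."""
    mantissa, exponent = math.frexp(x)
    if mantissa == 0.5:
        return math.ldexp(1.0, exponent - 1)
    return math.ldexp(1.0, exponent)


def round_to_e4m3(x: float) -> float:
    """Simplified e4m3 rounding: 3 mantissa bits, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    magnitude = min(abs(x), FP8_E4M3_MAX)
    _, exponent = math.frexp(magnitude)
    # Exponent of the leading bit, floored at -6 so subnormals share one step size.
    leading = max(exponent - 1, -6)
    step = math.ldexp(1.0, leading - 3)
    rounded = min(round(magnitude / step) * step, FP8_E4M3_MAX)
    return math.copysign(rounded, x)


def quantize_block(block, round_scale_first):
    """Return (emulated fp8 values, stored inverse scale) for one block."""
    inv_scale = max(abs(v) for v in block) / FP8_E4M3_MAX
    stored_inv_scale = ceil_to_pow2(inv_scale)
    # New ordering quantizes with the rounded scale; old ordering with the unrounded one.
    used_inv_scale = stored_inv_scale if round_scale_first else inv_scale
    scale = 1.0 / used_inv_scale
    return [round_to_e4m3(v * scale) for v in block], stored_inv_scale


def dequantize_block(quantized, stored_inv_scale):
    """Dequantize the way the loader does: multiply by the stored inverse scale."""
    return [q * stored_inv_scale for q in quantized]


def max_rel_error(block, round_scale_first):
    """Largest round-trip error in the block, relative to the block's absmax."""
    quantized, stored_inv_scale = quantize_block(block, round_scale_first)
    restored = dequantize_block(quantized, stored_inv_scale)
    absmax = max(abs(v) for v in block)
    return max(abs(a - b) for a, b in zip(block, restored)) / absmax
```

Calling `max_rel_error(block, round_scale_first=True)` and `max_rel_error(block, round_scale_first=False)` on the same block shows the effect of the reordering: with the rounded scale used on both sides, only the e4m3 rounding error remains, while the old ordering adds the scale mismatch on top.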

Tests: test_fp8_ue8m0_quantize_dequantize_round_trip checks the round-trip (main ~0.6, with this change ~0.022, matching the e4m3 floor); test_fp8_float_scale_fmt_quantization_unchanged asserts the scale_fmt="float" path stays bit-identical to the original formula across 100 random 128x128 blocks. CPU-only.
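
The shape of the second check can be sketched in plain Python. This is an assumption-laden stand-in, not the test in the PR: it uses flat 64-value blocks instead of 128x128 tensors, a simplified e4m3 emulation instead of a torch cast, and hypothetical function names throughout. The point it illustrates is that only the `"ue8m0"` branch reassigns `scale`, so the `"float"` branch reproduces the original formula exactly.

```python
import math
import random

FP8_E4M3_MAX = 448.0  # largest finite float8_e4m3fn value


def ceil_to_pow2(x: float) -> float:
    """Round a positive value up to the nearest power of two."""
    mantissa, exponent = math.frexp(x)
    return math.ldexp(1.0, exponent - 1 if mantissa == 0.5 else exponent)


def round_to_e4m3(x: float) -> float:
    """Simplified e4m3 rounding: 3 mantissa bits, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    magnitude = min(abs(x), FP8_E4M3_MAX)
    step = math.ldexp(1.0, max(math.frexp(magnitude)[1] - 1, -6) - 3)
    return math.copysign(min(round(magnitude / step) * step, FP8_E4M3_MAX), x)


def quantize_block(block, scale_fmt="float"):
    """Quantize one block; the scale is re-derived only on the ue8m0 path."""
    scale = FP8_E4M3_MAX / max(abs(v) for v in block)
    inv_scale = 1.0 / scale
    if scale_fmt == "ue8m0":
        inv_scale = ceil_to_pow2(inv_scale)  # round first ...
        scale = 1.0 / inv_scale              # ... then quantize with the rounded scale
    return [round_to_e4m3(v * scale) for v in block], inv_scale


def reference_float_quantize(block):
    """The original (pre-change) formula for the float scale format."""
    scale = FP8_E4M3_MAX / max(abs(v) for v in block)
    return [round_to_e4m3(v * scale) for v in block], 1.0 / scale


def float_path_unchanged(num_blocks=100, block_len=64, seed=0):
    """True if the float path matches the reference exactly on every random block."""
    rng = random.Random(seed)
    for _ in range(num_blocks):
        block = [rng.gauss(0.0, 1.0) for _ in range(block_len)]
        if quantize_block(block, "float") != reference_float_quantize(block):
            return False
    return True
```

Exact equality (rather than a tolerance) is the deliberate choice here, since the claim being checked is that existing non-ue8m0 checkpoints do not change at all.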

…ored inverse

For scale_fmt="ue8m0", Fp8Quantize._quantize_one quantized the weight with the unrounded block scale but stored a weight_scale_inv rounded up to a power of two. Fp8Dequantize multiplies by that stored inverse, so the round-trip was off by up to a full octave per block. Round the inverse scale first, re-derive the forward scale from it, then quantize, matching DeepGEMM's order. The scale_fmt="float" path is unchanged.

github-actions · 2026-06-19T08:32:31Z

CI Dashboard: View test results in Grafana

Rocketknight1 · 2026-06-19T14:13:01Z

cc @SunMarc for quants

SunMarc

Indeed, otherwise, it doesn't match. cc @ArthurZucker for confirmation

SunMarc · 2026-06-22T17:01:34Z

        )
        torch.testing.assert_close(dequantized_q, expected_q, rtol=1e-2, atol=1e-2)

+    def test_fp8_ue8m0_quantize_dequantize_round_trip(self):


Don't put that here; you should put it in the quantization folder / fp8 folder.

HuggingFaceDocBuilderDev · 2026-06-24T14:17:09Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

SunMarc approved these changes Jun 22, 2026

SunMarc requested a review from ArthurZucker June 22, 2026 17:03

SunMarc enabled auto-merge June 24, 2026 14:04

SunMarc added this pull request to the merge queue Jun 24, 2026

Merged via the queue into huggingface:main with commit e32afc7 Jun 24, 2026
106 checks passed

SunMarc merged 1 commit into huggingface:main from Incheonkirin:fix-ue8m0-quantize-scale-ordering
