[CPU] add Float8OpaqueTensor for dynamic float8 act float8 weight by Xia-Weiwen · Pull Request #3075 · pytorch/ao

Xia-Weiwen · 2025-09-26T03:18:08Z

Summary
We split the original big PR #2505 into the following smaller ones:

Unify get_block_size #3039 (relanded by [Reland] Unify get_block_size #3059)
[CPU] Add ops for float8 linear #3052
This PR, [CPU] add Float8OpaqueTensor for dynamic float8 act float8 weight #3075, which adds Float8OpaqueTensor for dynamic float8 activation and float8 weight quantization on CPU

Test plan

pytest -sv test/quantization/quantize_/workflows/float8/test_float8_opaque_tensor.py

pytorch-bot · 2025-09-26T03:18:12Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3075

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit c26d34b with merge base 1a9b6f4:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Xia-Weiwen · 2025-09-26T03:19:01Z

CC @mingfeima for review. Thanks.

Xia-Weiwen · 2025-09-28T01:16:59Z

Hi @mingfeima @jerryzh168 @andrewor14 Could you please review this PR? Thanks.

Xia-Weiwen · 2025-09-30T01:38:23Z

Hi @mingfeima @jerryzh168 @andrewor14 Though this PR depends on #3100, could you please review this PR? Thanks.

Xia-Weiwen · 2025-10-14T02:04:04Z

@jerryzh168 Could you please review this PR again? Thanks.

jerryzh168 · 2025-11-11T18:33:19Z

+        f"Shapes of input and weight do not match, input:{input_tensor.shape}, weight: {weight_tensor.shape}"
+    )
+
+    act_mat = input_tensor.contiguous()


isn't this going to be slow?

On CPU, we require input tensors to be contiguous. In practice the inputs are almost always contiguous already, so no copy actually happens; the call just enforces that assumption.

if it's always contiguous I feel it might be better to do an assert here

I mean in most cases it's contiguous so we don't need to worry about performance. But we cannot guarantee that. It is acceptable for us if the input tensor is not contiguous and it's slow to make it contiguous.

OK sounds good
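
To make the trade-off in this thread concrete: PyTorch's `Tensor.contiguous()` returns the tensor itself when the layout is already contiguous, so a copy is only paid for strided views such as a transpose. The sketch below is plain Python, not torchao code; `row_major_strides`, `is_contiguous`, and `needs_copy` are hypothetical helpers that only model the stride check.

```python
from typing import Sequence, Tuple


def row_major_strides(shape: Sequence[int]) -> Tuple[int, ...]:
    """Element strides of a densely packed row-major tensor with this shape."""
    strides = []
    step = 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def is_contiguous(shape: Sequence[int], strides: Sequence[int]) -> bool:
    """True when the layout is dense row-major (size-1 dims are ignored)."""
    expected = row_major_strides(shape)
    return all(d == 1 or s == e for d, s, e in zip(shape, strides, expected))


def needs_copy(shape: Sequence[int], strides: Sequence[int]) -> bool:
    """A contiguous() call only has to copy when the layout is not dense."""
    return not is_contiguous(shape, strides)


# A freshly allocated (4, 8) activation has strides (8, 1): no copy needed.
fresh_needs_copy = needs_copy((4, 8), (8, 1))

# The transpose of an (8, 4) tensor has shape (4, 8) and strides (1, 4):
# this is the rare case where making it contiguous costs a copy.
transposed_needs_copy = needs_copy((4, 8), (1, 4))
```

An assert, as suggested in the thread, would reject the second case outright; calling `contiguous()` accepts it and pays for the copy, which is the behavior the author preferred.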

jerryzh168 · 2025-11-11T18:34:12Z

+        granularity = weight_tensor.act_quant_kwargs.granularity
+        if isinstance(granularity, PerGroup):
+            group_size = granularity.group_size
+            if weight_tensor.block_size[1] < weight_tensor.size(-1):
+                # weight_tensor is also per group quantized
+                assert weight_tensor.block_size[1] == group_size, (
+                    "input and weight should have the same group size but got"
+                    f" {weight_tensor.block_size[1]} and {group_size}"
+                )
+        act_block_size = get_block_size(act_mat.shape, granularity)
+        act_scale = _choose_scale_float8(
+            act_mat,
+            float8_dtype=torch.float8_e4m3fn,
+            block_size=act_block_size,
+        )
+        act_mat = _quantize_affine_float8(act_mat, act_scale, torch.float8_e4m3fn)
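
For readers unfamiliar with the hunk above, the following is a minimal plain-Python sketch of what per-row and per-group scale selection amounts to. It is an illustration under assumptions, not the torchao implementation: `block_size_for`, `choose_scales`, and `check_group_sizes` are hypothetical stand-ins for `get_block_size`, `_choose_scale_float8`, and the assertion in the hunk, and the only hard fact used is that the largest finite value of `torch.float8_e4m3fn` is 448.

```python
E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn


def block_size_for(shape, group_size=None):
    """Hypothetical helper: one scale per row, or per `group_size` columns."""
    cols = shape[-1]
    width = cols if group_size is None else group_size
    if cols % width != 0:
        raise ValueError(f"last dim {cols} is not divisible by group size {width}")
    return (1,) * (len(shape) - 1) + (width,)


def choose_scales(rows, block_width, eps=1e-12):
    """Per-block scale = abs-max of the block divided by the float8 max."""
    scales = []
    for row in rows:
        row_scales = []
        for start in range(0, len(row), block_width):
            amax = max(abs(v) for v in row[start:start + block_width])
            row_scales.append(max(amax, eps) / E4M3_MAX)  # eps avoids zero scales
        scales.append(row_scales)
    return scales


def check_group_sizes(weight_block_width, weight_cols, act_group_size):
    """Mirrors the assertion in the hunk: checked only for per-group weights."""
    if weight_block_width < weight_cols and weight_block_width != act_group_size:
        raise ValueError(
            "input and weight should have the same group size but got "
            f"{weight_block_width} and {act_group_size}"
        )


activation = [[1.0, -2.0, 448.0, 4.0]]
block = block_size_for((1, 4), group_size=2)    # (1, 2)
scales = choose_scales(activation, block[-1])   # one scale per group of 2
```

Dividing each block by its scale maps the block's largest magnitude onto the float8 maximum, which is why the activation and weight group sizes must agree when both are quantized per group.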


why is this not using

ao/torchao/quantization/quantize_/workflows/float8/float8_tensor.py

Line 311 in b4ec4cb

input_tensor = _choose_quant_func_and_quantize_tensor(

Thanks for the pointer. However, _choose_quant_func_and_quantize_tensor does the following:

if isinstance(quant_kwargs, QuantizeTensorToFloat8Kwargs): return Float8Tensor.from_hp(...)

Unfortunately, Float8OpaqueTensor also uses QuantizeTensorToFloat8Kwargs so it cannot distinguish them.
Besides, in the implementation of Float8Tensor, the activation is quantized by Float8Tensor.from_hp to a Float8Tensor and then unwrapped to get the quantized tensor data for computation. This logic is not exposed to users, so I feel it's unnecessary to use Float8OpaqueTensor.from_hp to quantize the activation and then unwrap it. Quantizing it directly with _quantize_affine_float8 seems fine.
What do you think? If you want Float8OpaqueTensor to be aligned with Float8Tensor, we may need to define a counterpart of QuantizeTensorToFloat8Kwargs for Float8OpaqueTensor so that we can distinguish them. Thanks.

we should add one of QuantizeTensorToFloat8Kwargs for each tensor I think, so should create QuantizeTensorToOpaqueFloat8Kwargs

yeah it's optional, if you feel it's better to not use it, it's OK as well.

main thing is using this will reduce duplication and make it easier to adapt in the future

Thanks for the suggestion. Since input tensors should be plain, we can reuse QuantizeTensorToFloat8Kwargs and _choose_quant_func_and_quantize_tensor here. I have updated this part.
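
The dispatch problem discussed in this thread can be pictured with a small, self-contained sketch. Everything below is a simplified stand-in and an assumption for illustration, not the torchao API: `Float8Kwargs`, `OpaqueFloat8Kwargs`, `PlainFloat8`, `OpaqueFloat8`, and `choose_and_quantize` are hypothetical names that only mimic the shape of dispatching on the kwargs type.

```python
from dataclasses import dataclass


@dataclass
class Float8Kwargs:
    """Stand-in for a kwargs class shared by several tensor subclasses."""
    granularity: str = "per_row"


@dataclass
class OpaqueFloat8Kwargs:
    """Stand-in for a dedicated kwargs class (deliberately not a subclass)."""
    granularity: str = "per_row"


class PlainFloat8:
    def __init__(self, data, kwargs):
        self.data, self.kwargs = data, kwargs

    @classmethod
    def from_hp(cls, data, kwargs):
        return cls(data, kwargs)


class OpaqueFloat8(PlainFloat8):
    pass


def choose_and_quantize(data, kwargs, table):
    """Pick the tensor class purely from the type of `kwargs`."""
    for kwargs_type, tensor_cls in table.items():
        if isinstance(kwargs, kwargs_type):
            return tensor_cls.from_hp(data, kwargs)
    raise NotImplementedError(f"no quantizer registered for {type(kwargs).__name__}")


# With one shared kwargs type, the dispatcher can only ever build PlainFloat8.
shared_table = {Float8Kwargs: PlainFloat8}
plain = choose_and_quantize([1.0, 2.0], Float8Kwargs(), shared_table)

# A dedicated kwargs type per tensor class removes the ambiguity.
split_table = {Float8Kwargs: PlainFloat8, OpaqueFloat8Kwargs: OpaqueFloat8}
opaque = choose_and_quantize([1.0, 2.0], OpaqueFloat8Kwargs(), split_table)
```

In this PR the second table turned out to be unnecessary: because activations are expected to be plain tensors, the author reused the existing kwargs class and dispatcher rather than adding a new kwargs type.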

Xia-Weiwen · 2025-11-13T03:08:49Z

Hi @jerryzh168 I have updated this PR per your comments. Could you please review again? Thanks.

jerryzh168 · 2025-11-14T01:32:58Z

    return activation


+def _input_activation_quant_cpu_fp8(


is this function used?

Thanks. Removed.

jerryzh168 · 2025-11-14T01:33:52Z

            return weight

-    elif not _fp8_mm_compat(weight):
+    elif float8_packing_format == Float8PackingFormat.PLAIN and not _fp8_mm_compat(


to make these less complicated, can you lift the float8_packing_format == Float8PackingFormat.PLAIN before line 1851

Updated. Thanks.

jerryzh168 · 2025-11-14T01:40:13Z


    # Note: Tiny-GEMM kernel only uses BF16 inputs
-    def example_inputs(self, batch_size=1):
+    def example_inputs(self, batch_size=1, dtype=None):


when do you use the dtype that's not the same as the original weight dtype?

I have removed this. Thanks.

jerryzh168 · 2025-11-14T01:40:47Z

+                return
+        device = "cpu"
+        m = ToyTwoLinearModel(256, 256, 256, dtype, device, bias).eval()
+        example_inputs = m.example_inputs(batch_size=bs, dtype=dtype)


dtype here seems to be the same as the one taken by model? so you don't need to specify it right?

Updated. Thanks.

Xia-Weiwen · 2025-11-14T02:14:24Z

Hi @jerryzh168 Please review again. Thanks.

…torch#3075) * [CPU] add Float8OpaqueTensor for dynamic float8 act float8 weight * Update _normalize_granularity * Update torchao/quantization/quant_api.py * Fix CI * remove unnecessary changes * Refine code * Refine code * Refine code

meta-cla Bot added the CLA Signed label Sep 26, 2025

Xia-Weiwen added the topic: new feature label Sep 26, 2025

Xia-Weiwen requested review from andrewor14 and jerryzh168 September 26, 2025 06:10

Xia-Weiwen marked this pull request as ready for review September 26, 2025 06:10

Xia-Weiwen mentioned this pull request Sep 26, 2025

[CPU] Add Float8OpaqueTensor for dynamic float8 act float8 weight #2505

Closed

[CPU] add Float8OpaqueTensor for dynamic float8 act float8 weight

d460134

mingfeima reviewed Sep 28, 2025

Comment thread test/quantization/quantize_/workflows/float8/test_float8_opaque_tensor.py

Comment thread torchao/quantization/quantize_/workflows/float8/float8_opaque_tensor.py

Comment thread torchao/quantization/quantize_/workflows/float8/float8_opaque_tensor.py

Xia-Weiwen requested a review from mingfeima September 28, 2025 02:08

Xia-Weiwen marked this pull request as draft September 30, 2025 01:28

Xia-Weiwen marked this pull request as ready for review September 30, 2025 01:35

jerryzh168 reviewed Oct 6, 2025

Comment thread torchao/quantization/quantize_/workflows/float8/float8_opaque_tensor.py

jerryzh168 reviewed Oct 6, 2025

Comment thread torchao/float8/inference.py Outdated

jerryzh168 reviewed Oct 6, 2025

Comment thread torchao/float8/types.py Outdated

jerryzh168 reviewed Oct 6, 2025

Comment thread torchao/quantization/quant_api.py Outdated

jerryzh168 reviewed Oct 6, 2025

Comment thread torchao/quantization/quant_api.py Outdated

jerryzh168 reviewed Oct 6, 2025

Comment thread torchao/quantization/quant_api.py Outdated

Xia-Weiwen added 3 commits October 9, 2025 10:00

Update _normalize_granularity

cf8dc09

Update torchao/quantization/quant_api.py

4333727

Fix CI

6e1c2a2

Xia-Weiwen requested a review from jerryzh168 October 14, 2025 01:59

jerryzh168 reviewed Oct 14, 2025

Comment thread torchao/float8/inference.py Outdated

Xia-Weiwen added 3 commits October 14, 2025 09:50

Merge branch 'main' into float8_opaque_tensor_new

7980de8

Merge branch 'main' into float8_opaque_tensor_new

ecf5e1a

remove unnecessary changes

1044dca