[CPU] add Float8OpaqueTensor for dynamic float8 act float8 weight#3075
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3075
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit c26d34b with merge base 1a9b6f4 ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
|
CC @mingfeima for review. Thanks. |
|
Hi @mingfeima @jerryzh168 @andrewor14 Could you please review this PR? Thanks. |
|
Hi @mingfeima @jerryzh168 @andrewor14 Though this PR depends on #3100, could you please review this PR? Thanks. |
|
@jerryzh168 Could you please review this PR again? Thanks. |
| f"Shapes of input and weight do not match, input:{input_tensor.shape}, weight: {weight_tensor.shape}" | ||
| ) | ||
|
|
||
| act_mat = input_tensor.contiguous() |
There was a problem hiding this comment.
isn't this going to be slow?
There was a problem hiding this comment.
On CPU, we require input tensors to be contiguous. In fact, we almost always get contiguous inputs. So, the reordering won't actually happen. Here it just ensures the assumption.
There was a problem hiding this comment.
if it's always contiguous I feel it might be better to do an assert here
There was a problem hiding this comment.
I mean in most cases it's contiguous so we don't need to worry about performance. But we cannot guarantee that. It is acceptable for us if the input tensor is not contiguous and it's slow to make it contiguous.
| granularity = weight_tensor.act_quant_kwargs.granularity | ||
| if isinstance(granularity, PerGroup): | ||
| group_size = granularity.group_size | ||
| if weight_tensor.block_size[1] < weight_tensor.size(-1): | ||
| # weight_tensor is also per group quantized | ||
| assert weight_tensor.block_size[1] == group_size, ( | ||
| "input and weight should have the same group size but got" | ||
| f" {weight_tensor.block_size[1]} and {group_size}" | ||
| ) | ||
| act_block_size = get_block_size(act_mat.shape, granularity) | ||
| act_scale = _choose_scale_float8( | ||
| act_mat, | ||
| float8_dtype=torch.float8_e4m3fn, | ||
| block_size=act_block_size, | ||
| ) | ||
| act_mat = _quantize_affine_float8(act_mat, act_scale, torch.float8_e4m3fn) |
There was a problem hiding this comment.
why is this not using
There was a problem hiding this comment.
Thanks for the pointer. However, _choose_quant_func_and_quantize_tensor does the following:
if isinstance(quant_kwargs, QuantizeTensorToFloat8Kwargs):
return Float8Tensor.from_hp(...)Unfortunately, Float8OpaqueTensor also uses QuantizeTensorToFloat8Kwargs so it cannot distinguish them.
Besides, in the implementation of Float8Tensor, activation is quantized by Float8Tensor.from_hp to a Float8Tensor and then unwrapped to get the quantized tensor data for computation. And this part of logic is not exposed to users. So, I feel that it's unnecessary to use Float8OpaqueTensor.from_hp to quantize the activation then unwrap it. It looks good to quantize it with _quantize_affine_float8.
What do you think? If you want Float8OpaqueTensor to be aligned with Float8Tensor, we may need to define a counterpart of QuantizeTensorToFloat8Kwargs for Float8OpaqueTensor so that we can distinguish them. Thanks.
There was a problem hiding this comment.
we should add one of QuantizeTensorToFloat8Kwargs for each tensor I think, so should create QuantizeTensorToOpaqueFloat8Kwargs
There was a problem hiding this comment.
yeah it's optional, if you feel it's better to not use it, it's OK as well.
main thing is using this will reduce duplication and make it easier to adapt in the future
There was a problem hiding this comment.
Thanks for the suggestion. Since input tensors should be plain, we can reuse QuantizeTensorToFloat8Kwargs and _choose_quant_func_and_quantize_tensor here. I have updated this part.
|
Hi @jerryzh168 I have updated this PR per your comments. Could you please review again? Thanks. |
| return activation | ||
|
|
||
|
|
||
| def _input_activation_quant_cpu_fp8( |
There was a problem hiding this comment.
Thanks. Removed.
| return weight | ||
|
|
||
| elif not _fp8_mm_compat(weight): | ||
| elif float8_packing_format == Float8PackingFormat.PLAIN and not _fp8_mm_compat( |
There was a problem hiding this comment.
to make these less complicated, can you lift the float8_packing_format == Float8PackingFormat.PLAIN before line 1851
There was a problem hiding this comment.
Updated. Thanks.
|
|
||
| # Note: Tiny-GEMM kernel only uses BF16 inputs | ||
| def example_inputs(self, batch_size=1): | ||
| def example_inputs(self, batch_size=1, dtype=None): |
There was a problem hiding this comment.
when do you use the dtype that's not the same as the original weight dtype?
There was a problem hiding this comment.
I have removed this. Thanks.
| return | ||
| device = "cpu" | ||
| m = ToyTwoLinearModel(256, 256, 256, dtype, device, bias).eval() | ||
| example_inputs = m.example_inputs(batch_size=bs, dtype=dtype) |
There was a problem hiding this comment.
dtype here seems to be the same as the one taken by model? so you don't need to specify it right?
There was a problem hiding this comment.
Updated. Thanks.
|
Hi @jerryzh168 Please review again. Thanks. |
…torch#3075) * [CPU] add Float8OpaqueTensor for dynamic float8 act float8 weight * Update _normalize_granularity * Update torchao/quantization/quant_api.py * Fix CI * remove unnecessary changes * Refine code * Refine code * Refine code
Summary
We split the original big PR #2505 into the following smaller ones:
Test plan