`GemliteUIntXWeightOnlyConfig` (`class GemliteUIntXWeightOnlyConfig(AOBaseConfig):`, defined in torchao/prototype/quantization/quant_api.py, line 174 at e915c07) is still using the old AQT design, and we'd like to migrate it to the v2 design. See #2752 for details.
There are many examples of the new design; you can trace the code path of `Float8WeightOnlyConfig` (torchao/quantization/quant_api.py, lines 1333 to 1365 at e915c07), for example:

```python
class Float8WeightOnlyConfig(AOBaseConfig):
    """
    Configuration for applying float8 weight-only symmetric per-channel quantization to linear layers.

    Args:
        weight_dtype (torch.dtype): The target data type for weight quantization. Default is torch.float8_e4m3fn.
        set_inductor_config (bool): if True, adjusts `torchinductor` settings to recommended values.
        version (int): the version of the config, version 1 is deprecated, version 2 is using Float8Tensor (default)

    Note:
        The actual matmul will be computed in original precision of the weight tensor.

    Example:

    .. literalinclude:: ../../examples/inference/float8_weight_only.py
        :language: python
    """

    weight_dtype: torch.dtype = e4m3_dtype
    set_inductor_config: bool = True
    version: int = 2

    def __post_init__(self):
        torch._C._log_api_usage_once("torchao.quantization.Float8WeightOnlyConfig")


def _float8_weight_only_quant_tensor(weight, config):
    assert config.version == 2, f"Unexpected version: {config.version}"
    weight_dtype = config.weight_dtype
    new_weight = Float8Tensor.from_hp(
        weight, float8_dtype=weight_dtype, granularity=PerRow()
    )
    return new_weight
```
In the end, we are trying to remove `AffineQuantizedTensor` from torchao, and this is one of the blockers.
For implementation details and file structure, see https://github.com/pytorch/ao/tree/main/torchao/prototype/quantization/float8_static_quant for an example.
```python
# in torchao/prototype/quantization/quant_api.py
UIntxWeightOnlyConfig(..., uintx_packing_format="bit_packed")

# in torchao/prototype/quantization/uintx/uintx_bit_packed_tensor.py
# we will have UIntxBitPackedTensor defined here

# tests will be defined in test/prototype/test_uintx_bit_packed_tensor.py
```
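Since the core job of `UIntxBitPackedTensor` is to store sub-byte unsigned integers densely, here is a minimal pure-Python sketch of the bit-packing idea for the uint4 case (two values per byte). The helper names are hypothetical; the real tensor subclass would operate on torch tensors and integrate with the quantization flow:

```python
def pack_uint4(values):
    # Pack uint4 values (0..15) into bytes, two values per byte
    # (high nibble first); pad with a zero if the count is odd.
    assert all(0 <= v < 16 for v in values), "values must fit in 4 bits"
    if len(values) % 2:
        values = list(values) + [0]
    return bytes(
        (values[i] << 4) | values[i + 1] for i in range(0, len(values), 2)
    )


def unpack_uint4(packed, n):
    # Recover the first n uint4 values from the packed bytes.
    out = []
    for b in packed:
        out.append(b >> 4)   # high nibble
        out.append(b & 0xF)  # low nibble
    return out[:n]
```

A real uintx tensor subclass would additionally carry scale/zero-point metadata so the packed integers can be dequantized back to floating point.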
Note:
- gemlite is requested by the executorch team for desktop use cases
- `GemliteUIntXWeightOnlyConfig` doesn't align with our design (we shouldn't have "gemlite" in the name), so we'll migrate it to `UIntxWeightOnlyConfig`