Add scale-only version of the HQQ algorithm for IntxWeightOnlyConfig/Int8DynamicActivationIntxWeightConfig#3110
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3110
Note: Links to docs will display an error until the docs builds have been completed. ⏳ No Failures, 1 PendingAs of commit be24ef2 with merge base 5cbbd73 ( This comment was automatically generated by Dr. CI and updates every 15 minutes. |
|
Thanks! Add some unit tests for this path as well? |
Added new E2E test |
| """ | ||
| Uses `torchao.quantization.quant_primitives._choose_qparams_and_quantize_scale_only_hqq` | ||
| """ | ||
| HQQ = "hqq" |
There was a problem hiding this comment.
There are two variants of HQQ today, which makes it confusing:
_choose_qparams_and_quantize_affine_hqq -- which is used in version 1 of the config when use_hqq=True
_choose_qparams_and_quantize_scale_only_hqq - this one
There was a problem hiding this comment.
These configs are used in ExecuTorch, and never supported HQQ in any previous version.
use_hqq was used in version 1 of the int4 server config.
If you think it would be clearer, I could call the algorithm enum HQQ_SCALE_ONLY or HQQ_NO_ZERO_POINT
There was a problem hiding this comment.
If you think it would be clearer, I could call the algorithm enum HQQ_SCALE_ONLY or HQQ_NO_ZERO_POINT
Yeah, let's do HQQ_SCALE_ONLY then
|
This is great btw Why would one ever not do HQQ_scale_only on XNNPACK on ET if it improves the accuracy compared to naive? It doesn't take a long time to quantize, right? Roughly a few minutes? Hope we can make it to the 0.14 branch cut, which is Oct 6. |
|
|
||
| # can switch to StrEnum (https://docs.python.org/3/library/enum.html#enum.StrEnum) | ||
| # after python 3.10 is end of life (https://devguide.python.org/versions/) | ||
| class IntxChooseQParamsAlgorithm(str, Enum): |
There was a problem hiding this comment.
Also, do we need to introduce yet another class?
Can we just extend existing Int4ChooseQParamsAlgorithm and add affine and hqq_scale_only?
And then rename/promote Int4ChooseQParamsAlgorithm to IntxChooseQParamsAlgorithm in a follow-up PR?
There was a problem hiding this comment.
In torchao's refactor (removing AffineQuantizedTensor), the direction is for subclasses to not share higher-level abstractions, but instead define their own enums. This is how packing format works as well (intx_packing_format for the intx subclass, and int4_packing_format for the int4 subclass).
I'll let @jerryzh168 comment here as well
There was a problem hiding this comment.
yeah we want local abstractions instead of global abstractions unless it's required.
Yes, I want to measure on some more models and benchmarks and if results are neutral to positive, I'll update the default in etLLM. |
mergennachin
left a comment
There was a problem hiding this comment.
LGTM, will defer the question to @jerryzh168
|
|
||
| # can switch to StrEnum (https://docs.python.org/3/library/enum.html#enum.StrEnum) | ||
| # after python 3.10 is end of life (https://devguide.python.org/versions/) | ||
| class IntxChooseQParamsAlgorithm(str, Enum): |
There was a problem hiding this comment.
btw, are the changes in this file tested as well?
There was a problem hiding this comment.
I think this exisiting unit test covers them: https://github.com/pytorch/ao/blob/main/test/quantization/test_qat.py#L2321
This PR introduces a scale-only version of the HQQ algorithm for IntxWeightOnlyConfig/Int8DynamicActivationIntxWeightConfig, which we find improves model quality on mmlu from 0.50 to 0.55 when quantizing Gemma3-4B.