Build SmoothQuant release pipeline #3010
Adds SMOOTHQUANT-W8A8 quantization method to the TorchAO model release pipeline.

- Adjusted defaults: increased calibration samples from 10 to 128 to ensure consistency, reduced max sequence length (SeqLen) from 2048 to 1024
- Updated HF CLI command: `huggingface-cli login` to `hf auth login`

Test plan:

```bash
python quantize_and_upload.py --model_id Qwen/Qwen3-8B --quant SMOOTHQUANT-W8A8 --push_to_hub --task bbh
```
|
|
||
| ### AWQ-INT4 | ||
| [AWQ](https://arxiv.org/abs/2306.00978) is a technique to improve accuracy for weight-only quantization. It preserves "salient" weight channels that have a high impact on output accuracy by multiplying each such weight channel by a scale and applying the inverse scale to the corresponding activation. Since the activation is not quantized, this adds no loss on the activation side, while the quantization loss from the weights can be reduced. | ||
| ### SMOOTHQUANT-W8A8 & AWQ-INT4 |
can you add a separate section for smoothquant?
Yes, separating them and linking them seems better.
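The AWQ description in the diff above rests on one identity: scaling a weight's input channels and applying the inverse scale to the matching activation channels leaves the layer output unchanged. A minimal plain-Python sketch of that identity follows. The helper names, the toy values, and the hand-picked `scales` are illustrative assumptions, not code from `quantize_and_upload.py` or TorchAO, and the sketch does not perform any quantization.

```python
# Illustrative only: plain lists stand in for tensors, and all names are hypothetical.

def linear(x, w):
    """Compute y[o] = sum_i x[i] * w[o][i] for a single input vector x."""
    return [sum(xi * wi for xi, wi in zip(x, row)) for row in w]


def scale_input_channels(w, scales):
    """Multiply each input channel (column) of w by its per-channel scale."""
    return [[wi * s for wi, s in zip(row, scales)] for row in w]


def unscale_activation(x, scales):
    """Apply the inverse per-channel scale to the activation."""
    return [xi / s for xi, s in zip(x, scales)]


x = [8.0, 0.5, 0.25, 0.5]                      # toy activation vector
w = [[0.11, 0.9, -0.7, 0.3],                   # toy weight, 2 outputs x 4 inputs
     [-0.13, 0.5, 0.8, -0.6]]
scales = [2.0, 1.0, 1.0, 1.0]                  # assumed: channel 0 treated as "salient"

reference = linear(x, w)
equivalent = linear(unscale_activation(x, scales), scale_input_channels(w, scales))

# The two outputs agree up to floating-point rounding.
print(reference, equivalent)
```

In a real pipeline the scales would come from calibration data rather than being chosen by hand; the sketch only shows why the rescaling itself is lossless before the weights are quantized.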
| "model.embed_tokens": _int8_int4_embedding_config, | ||
| } | ||
| ), | ||
| "SMOOTHQUANT-W8A8": Int8DynamicActivationInt8WeightConfig(), |
to standardize on naming, this should be: SmoothQuant-INT8-INT8 I think
Thanks for correcting it; I missed the standard name in this script.
|
|
||
| Note: for initial release, please include `--populate_model_card_template` to populate model card template. | ||
|
|
||
| ### SMOOTHQUANT-W8A8 |
can you add the command to generate smoothquant checkpoints as well? similar to AWQ-INT4
| # with some calibration_limit (number of samples) | ||
| python quantize_and_upload.py --model_id Qwen/Qwen3-8B --quant AWQ-INT4 --push_to_hub --task bbh --calibration_limit 2 | ||
|
|
||
| # release SMOOTHQUANT-INT8-INT8 model, calibrated with a specific task |
this should be added in SMOOTHQUANT-INT8-INT8 section I think
| quantized_model = model | ||
| quant_config = AWQConfig(base_config, step="prepare_for_loading") | ||
| quantized_model.config.quantization_config = TorchAoConfig(quant_config) | ||
| elif quant == "SMOOTHQUANT-INT8-INT8": |
nit: can you change to SmoothQuant-INT8-INT8? I feel that's slightly easier to read
But how about keeping uppercase for consistency with the other keys in quant_to_quant_code? Uppercase seems to be the right pattern, I think.
those are abbreviations, that's why they are uppercase. maybe you can use SQ-INT8-INT8 then? SmoothQuant would be clearer though
okay then SmoothQuant-INT8-INT8 looks best
|
|
||
| Note: for initial release, please include `--populate_model_card_template` to populate model card template. | ||
|
|
||
| ### SMOOTHQUANT-INT8-INT8 |
nit: can you update this to SmoothQuant-INT8-INT8 as well
|
|
||
| Examples: | ||
| ``` | ||
| # release SMOOTHQUANT-INT8-INT8 model, calibrated with a specific task |
jerryzh168
left a comment
looks good, see some nit comments inline
| type=int, | ||
| default=2048, | ||
| help="Maximum sequence length of examples to calibrate and evaluate model on. Default is 2048", | ||
| default=1024, |
actually for this one, can you keep as is? I remember some models even need larger like 4096
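The request above keeps 2048 as the default while leaving room for models that need a longer context. A rough `argparse` sketch of that arrangement follows. The flag `--calibration_limit` appears in the example command in this PR and `max_seq_length` in the commit list, but the exact definitions in `quantize_and_upload.py` are not shown here, so the help strings, the parser layout, and the assumption that 128 is the `--calibration_limit` default are all illustrative.

```python
import argparse


def build_parser():
    """Hypothetical parser covering only the two calibration-related options."""
    parser = argparse.ArgumentParser(description="Sketch of calibration options")
    parser.add_argument(
        "--calibration_limit",
        type=int,
        default=128,  # assumed from the PR description (raised from 10 to 128)
        help="Number of samples to use for calibration. Default is 128",
    )
    parser.add_argument(
        "--max_seq_length",
        type=int,
        default=2048,  # kept as is, per the review comment
        help="Maximum sequence length of examples to calibrate and evaluate model on. Default is 2048",
    )
    return parser


# Defaults apply when no flags are given.
defaults = build_parser().parse_args([])

# A model that needs a longer context overrides the default explicitly.
override = build_parser().parse_args(
    ["--max_seq_length", "4096", "--calibration_limit", "2"]
)

print(defaults.max_seq_length, override.max_seq_length, override.calibration_limit)
```

Keeping the larger default and letting callers pass a smaller value when memory is tight avoids silently truncating calibration examples for models that rely on long sequences.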
* Summary:
  Adds SMOOTHQUANT-W8A8 quantization method to the TorchAO model release pipeline.
  - Adjusted defaults: increased calibration samples from 10 to 128 to ensure consistency, reduced max sequence length (SeqLen) from 2048 to 1024
  - Updated HF CLI command: `huggingface-cli login` to `hf auth login`

  Test plan:
  ```bash
  python quantize_and_upload.py --model_id Qwen/Qwen3-8B --quant SMOOTHQUANT-W8A8 --push_to_hub --task bbh
  ```
* add SmoothQuant uploader
* separate docs for AWQ & SmoothQuant
* rename SMOOTHQUANT-W8A8 to SMOOTHQUANT-INT8-INT8
* add SmoothQuant release example
* update example in docs
* rename SMOOTHQUANT-INT8-INT8 to SmoothQuant-INT8-INT8
* rename SMOOTHQUANT to SmoothQuant
* revert max_seq_length default to 2048