Efficient Inference Kernel for SpQR #34976
Conversation
Hey @elvircrn, thanks for adding this quantization method! A very smooth integration 🔥! Just left a few comments.
No need to add the checks here, validate_environment in the quantizer will take care of that
Resolved by removing the checks.
Resolved by removing.
Let's raise a warning here if you need to override the torch_dtype, like in other quantizers.
Since SpQR really only supports float16, I raise an error if it's None or anything other than float16. Is this the desired behavior?
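For reference, the behavior being described can be sketched like this (a minimal illustration, not the actual transformers hook; plain strings stand in for torch dtypes so the logic is easy to follow):

```python
def update_torch_dtype(torch_dtype):
    # Sketch of the dtype validation discussed above: SpQR kernels only
    # ship a float16 path, so anything else is rejected up front.
    # Strings stand in for torch.dtype objects in this illustration.
    if torch_dtype is None:
        return "float16"  # default to float16, mirroring other quantizers
    if torch_dtype != "float16":
        raise ValueError(
            "You cannot use any type other than torch.float16 for SpQR. "
            "Please either leave it None or set it to torch.float16 explicitly."
        )
    return torch_dtype
```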
Sorry I can't find this model on the hub 🤔
Ah, didn't expect a review so soon. Thank you!
I've uploaded the model and updated this part of the test.
No worries 😊 take your time and ping me when you finish the integration
No need to add the requirement here, it's already verified with the decorator require_spqr
Resolved by removing.
docs/source/en/quantization/spqr.md
We can also add a small section about how to load a model quantized with SpQR using transformers.
Resolved by updating the README.
If the quantization method requires a GPU, we need to check that it's available here.
Resolved by adding an is_cuda check.
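The guard in question can be summarized with this small sketch (my own illustration of the check's shape, not the exact transformers hook signature):

```python
def validate_environment(cuda_available, spqr_installed):
    # Sketch of the environment guard discussed above: SpQR's fused kernel
    # is CUDA-only, so we fail fast without a GPU, and we require the
    # spqr_quant package before any import of it is attempted.
    if not cuda_available:
        raise RuntimeError("SpQR quantization requires a CUDA-capable GPU.")
    if not spqr_installed:
        raise ImportError("Loading an SpQR model requires the spqr_quant package.")
```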
Is it possible to add tests for multi_gpu setups?
Resolved by bringing this back from AQLM. Will the CI be able to pick this up? I don't have the means of testing this.
+1 on that, you can add the test and we will check to make sure it passes with the right expected output.
Force-pushed fda6b66 to ee10336.
@MekkCyber I've removed the Draft tag from the PR. It seems that I am unable to pass the …
Force-pushed 56c016d to e1a91fa.
Could you rebase @elvircrn to update the branch :)
Force-pushed e1a91fa to 4ea16a9.
@MekkCyber Rebase done.
Force-pushed 4ea16a9 to 51c9f8d.
Hey @elvircrn, thanks for iterating! It looks great, I just left some minor comments.
We need to add checks here for whether the accelerate and spqr_quant packages are available before doing the import. What we don't need to do is raise errors, because the packages are assumed to be installed when this function is executed.
Not sure if I understood this, please double-check my newest change.
We are not assuming here because we already validated the environment 😉
```diff
 elif torch_dtype != torch.float16:
     raise ValueError(
-        "You cannot any type other than torch.float16 for SpQR. Please either leave it None or set it to"
+        "You cannot use any type other than torch.float16 for SpQR. Please either leave it None or set it to"
         "torch.float16 explicitly."
     )
 return torch_dtype
```
Just FMI, are the beta and shapes parameters used solely for loading, or do they also play a role in quantization? In other words, if a model is quantized using specific beta or shapes values, can it only be loaded with those same parameters?
SpQR quantized weights are the product of tile-based bi-level quantization: beta corresponds to the dimensions of this tile, where beta1 and beta2 represent the tile width and height, respectively. You always specify beta before quantization starts.
Each SpQR weight also comes with a sparse tensor representing the outlier weights. As this tensor is unstructured, one has to keep track of its size during loading.
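To illustrate why the outlier size has to be serialized ahead of time, here is a toy sketch (my own illustration, not the spqr-quant storage format) of a CSR-style outlier layout: the row-offsets array has a fixed length, but the column/value array is exactly as long as the number of outliers, which is unknown until quantization has run.

```python
def csr_outliers(dense, threshold):
    # Toy CSR-style extraction of "outlier" weights above a magnitude
    # threshold. row_offsets always has len(dense) + 1 entries, but the
    # length of col_vals depends on the data, so it must be recorded in
    # the checkpoint before buffers can be allocated at load time.
    row_offsets, col_vals = [0], []
    for row in dense:
        for j, w in enumerate(row):
            if abs(w) > threshold:
                col_vals.append((j, w))
        row_offsets.append(len(col_vals))
    return row_offsets, col_vals
```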
The following is a visualization of the compression format from the original publication (https://arxiv.org/pdf/2306.03078):
Thanks for the detailed explanation
Let's add it to the doc!
In the shapes attribute, do we need to specify the shape for all the tensors?
The colval size needs to be specified ahead of time for all weights. The rest can be computed from m, n, and bits. Can we keep it the way it is now?
I can do it both ways easily, not sure what is preferred here.
I'm just worried about the quantization config: I'm not sure we can verify that shapes contains all the necessary elements in post_init. Maybe, before trying to access the key in replace_with_spqr_linear, we verify that it's there, or we raise an error to inform the user that there is something wrong with the config.json they are using. wdyt @SunMarc
In my estimation, these configurations are directly tied to the quantized weights. It would be a fatal quantization-serialization error if the shapes didn't cover the full set of quantized weights. Perhaps this is something that should be handled on the SpQR side (https://pypi.org/project/spqr-quant/)?
If we were to do it on the huggingface/transformers side:
a) during replace_with_linear, we raise a fatal error (please let me know which one would be appropriate here) if a quantized weight doesn't have a corresponding key in the shapes config;
b) we do a full check before conducting the replacement, in order to conserve memory/device cycles.
Let me know what best suits transformers here.
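Option (b) could look roughly like this sketch (the function name and key format here are hypothetical; the real keys come from the SpQR checkpoint):

```python
def check_shapes_cover_modules(shapes, module_names):
    # Pre-flight check (option b): verify the config's `shapes` dict has
    # an entry for every module about to be replaced, before allocating
    # any SpQR layer. Raises if the config.json does not match the
    # quantized checkpoint.
    missing = [name for name in module_names if name not in shapes]
    if missing:
        raise ValueError(
            "Quantization config `shapes` is missing entries for: "
            + ", ".join(missing)
            + ". There is something wrong with the config.json you are using."
        )
```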
I didn't see the tag to @SunMarc; I thought the wdyt referred to me.
If possible it's better to use small models for the CI like Qwen 0.5, Bloom 560M, SmolLM 135M...
Will have to get back to you on this (might require a significant time investment).
Is this a blocker for the merge?
Not really a blocker for now
Force-pushed 436697e to 16c0278.
@MekkCyber The latest batch of comments has been addressed.
Thanks for the quick iteration @elvircrn 🔥! Can you re-rebase please 😅
Yes that's exactly what I meant, thank you for fixing it !
Force-pushed 16c0278 to efd9f20.
@MekkCyber rebased.
@MekkCyber Resolved the latest batch of comments.
@SunMarc @MekkCyber
Pinging @ArthurZucker to have a quick look!
Could you fix the conflicts @elvircrn?
@SunMarc resolved quickly, as it was just an md conflict in the overview table.
(Branch update pulled in unrelated commits from main, chiefly the GPTQModel integration by jiqing-feng and the ModelCloud contributors plus a Nemotron processor fix; full commit log omitted.)
Force-pushed 64d2039 to c4273e2.
@SunMarc @ArthurZucker Conflicts resolved.
```python
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("elvircrn/Llama-2-7b-SPQR-3Bit-16x16-red_pajama-hf")
```
We can't quantize a model on the fly with quantization_config=... using SpQR as the quantization type?
We can't; this is not supported yet. This PR only adds inference support.
Let's add it to the doc!
Can you just fix this: Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
@ArthurZucker Is the PR in order now?
Force-pushed 0340152 to 14f21c1.
@ArthurZucker Erroneous rebase issue resolved.
Hi @elvircrn, can you just resolve the conflicts, and once the CI is green we can merge! Again, sorry for the delay 🙏 and thanks for your patience.
@MekkCyber Done.
@MekkCyber The failing tests, based on the logs, seem to be unrelated to my own source code.
Yes, they seem unrelated @elvircrn. I tried updating the branch; I hope it works.
@MekkCyber I don't think that did it; unrelated tests are still failing.
@MekkCyber Seems that everything is green now. :)
Thanks everyone!
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

What does this PR do?
Adds support for efficient single-batch inference for SpQR.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@SunMarc @MekkCyber