Conversation
younesbelkada
left a comment
Thanks a lot for this work and integration! Left some initial comments before the review of @fxmarty with respect to potential transformers integration later on
    if not torch.cuda.is_available():
        raise RuntimeError("No GPU found. A GPU is needed to quantize model.")
Do we actually need a GPU to quantize the model? I don't have the details off the top of my head, but is it a strong requirement?
(Not considering the actual inference process which I understand requires a GPU, maybe quantization/calibration can be done on CPU too?)
I don't think it is a strong requirement. I need to test, but you can have a look at the fasterquant method of the GPTQ class. It uses CUDA to speed up some operations (cholesky, cholesky_inverse, matmul), and it calls torch.cuda.synchronize().
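For what it's worth, the linear-algebra operations mentioned above are not CUDA-only; a quick sanity check (plain torch, no GPTQ code involved) shows they run on CPU too, just slower:

```python
import torch

# Build a well-conditioned symmetric positive-definite matrix on CPU.
A = torch.randn(64, 64)
H = A @ A.T + 64 * torch.eye(64)

# The operations fasterquant relies on all have CPU implementations;
# CUDA only makes them faster.
L = torch.linalg.cholesky(H)
H_inv = torch.cholesky_inverse(L)

# The reconstructed inverse is accurate, so the CPU path works.
print(torch.allclose(H @ H_inv, torch.eye(64), atol=1e-3))
```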
It is in good shape, thank you for working on it!
I still think that using the exllama kernel would be very interesting (for reference: https://github.com/fxmarty/q4f16-gemm-gemv-benchmark and huggingface/text-generation-inference#553 (comment); the second one uses Triton, which I see is not used here), but integrating with AutoGPTQ is already a nice step (making sure that cuda-old is used when possible).
Is there a hard requirement on accelerate? If so, why?
I think it would be good to add a documentation as well in this PR, and CI.
It is also not clear to me what should go in accelerate and optimum, if this PR puts a hard requirement on accelerate. Why not put it in accelerate directly? Why bitsandbytes in accelerate & this in optimum? I think it can be quite confusing to users.
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com> Co-authored-by: fxmarty <9808326+fxmarty@users.noreply.github.com>
fxmarty
left a comment
Looks great! I left a few questions/style comments
Co-authored-by: fxmarty <9808326+fxmarty@users.noreply.github.com>
fxmarty
left a comment
LGTM, thank you for adding it!
Just left taste comments and answers to the threads above. I believe this breaks with transformers 4.31 now, but maybe that's fine.
Can you run […]?
I've changed it so that we install transformers from source. I will change it back after the transformers release.
regisss
left a comment
Very nice PR @SunMarc 🔥 🚀
Just commenting on the title of the new doc section.
Also, could you add a bullet point for GPTQ in this section please?
What does this PR do?
This PR adds the possibility to perform GPTQ quantization. I tried to be as generic as possible in order to support any kind of model (whether it is a Transformers model or not). The backend relies on the `auto_gptq` library, where we use the `GPTQ` class and the `QuantLinear` class. With this API you can convert a model, then save and load the quantized weights. We depend on `accelerate` to save the weights and to load the quantized weights efficiently, without allocating more memory than needed.

Quantize model:
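The PR description doesn't inline the snippet here; the sketch below shows roughly how quantization could look. The names `GPTQQuantizer` and `quantize_model` are assumed from the optimum GPTQ integration and may differ from this PR's final code; imports are kept inside the function so the sketch reads without the heavy dependencies installed.

```python
def quantize_gptq(model_id: str):
    # Hypothetical API sketch; exact class and argument names may differ.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from optimum.gptq import GPTQQuantizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

    # 4-bit GPTQ with a calibration dataset; layers are quantized one block
    # at a time, so only the current block needs to sit in memory.
    quantizer = GPTQQuantizer(bits=4, dataset="c4", model_seqlen=2048)
    quantized_model = quantizer.quantize_model(model, tokenizer)
    return quantizer, quantized_model
```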
Save model
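A hedged sketch of the save step, assuming the quantizer exposes a `save` method (as in the optimum GPTQ API; the exact signature may differ here):

```python
def save_quantized(quantizer, quantized_model, save_dir: str):
    # Hypothetical API: the quantizer writes the packed integer weights plus
    # a quantization config to save_dir; accelerate handles sharding large
    # checkpoints without allocating extra memory.
    quantizer.save(quantized_model, save_dir)
```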
Convert model and Load quantized weights
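A sketch of the convert-and-load path. The names `init_empty_weights` (from accelerate) and `load_quantized_model` are assumed from the accelerate / optimum APIs; the key idea from the PR description is that no full-precision weights are ever materialized:

```python
def load_gptq_model(model_id: str, save_dir: str):
    # Hypothetical API sketch; exact function names may differ in this PR.
    import torch
    from accelerate import init_empty_weights
    from transformers import AutoConfig, AutoModelForCausalLM
    from optimum.gptq import load_quantized_model

    # 1) Instantiate the model skeleton on the meta device: no real weight
    #    tensors are allocated, which keeps peak memory low.
    config = AutoConfig.from_pretrained(model_id)
    with init_empty_weights():
        empty_model = AutoModelForCausalLM.from_config(
            config, torch_dtype=torch.float16
        )
    empty_model.tie_weights()

    # 2) Convert linear layers to QuantLinear and load the saved quantized
    #    weights directly onto the target devices.
    return load_quantized_model(empty_model, save_folder=save_dir, device_map="auto")
```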
PS: In a follow-up PR, I will use this API to integrate GPTQ quantization into transformers, so that GPTQ-quantized models are taken into account, just like what we did for `bitsandbytes`.

TODO: