Conversation
The documentation is not available anymore as the PR was closed or merged.
younesbelkada
left a comment
Looks really great, thanks a lot!
What about changing `disable_exllamav2` to `disable_exllama_v2`?
Can you also add a few lines in the documentation with some expected speedups and how to use this argument? You can copy-paste the table you shared in the auto-gptq PR.
I've named it that way in auto-gptq, so let's keep it.
I added a link to the optimum benchmark and I will update the benchmark in a follow-up PR. It will be easier to maintain this way.
ArthurZucker
left a comment
Thanks. A bit curious as to why we are naming this `disable_` when it's disabled by default for v1 and enabled for v2. A bit counterintuitive, no?
> With the release of the exllamav2 kernel, you can get faster inference speed compared to the exllama kernels. You just need to
> pass `disable_exllamav2` in [`GPTQConfig`]:
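For context, a minimal sketch of the usage this doc snippet describes, assuming a GPTQ-quantized checkpoint on the Hub. The model id is illustrative, and the flag name matches the snippet as quoted here (it gets renamed to `use_exllama_v2` later in this thread):

```python
from transformers import AutoModelForCausalLM, GPTQConfig

# Illustrative checkpoint; any GPTQ-quantized model id would do.
model_id = "TheBloke/Llama-2-7B-GPTQ"

# disable_exllamav2=False (the proposed default) selects the exllamav2 kernel.
quantization_config = GPTQConfig(bits=4, disable_exllamav2=False)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto",
)
```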
It doesn't make a lot of sense to have a `disable_exllamav2` instead of `enable_` or `use_`, which we have for most of the additional features like flash attention or `use_cache`. No?
It is named this way because the goal was to enable it by default (`False` value). Since v2 came out, I wanted to disable v1 and enable v2 by default, as v2 is just a faster version of v1. cc @fxmarty as you did the v1 integration.
Yes, I agree the name is not very scalable. It could make sense to rename them.
```diff
     batch_size: int = 1,
     pad_token_id: Optional[int] = None,
-    disable_exllama: bool = False,
+    disable_exllama: bool = True,
```
Is this not breaking?
Yeah, I was worried about that too. It would be great to force users to use exllamav2 instead of exllama. Otherwise, we can keep it at `False` and either:
- set `disable_exllamav2` to `False`, and we check whether they indeed have access to the exllamav2 kernel; otherwise the user will fall back to the exllama kernel.
- set `disable_exllamav2` to `True`, and the user will have to enable it manually.

Compared to the second option, the first option will make it easy for current and new users to discover and use exllamav2. However, it will be more verbose (fallback to exllama).
LMK what you think =) cc @younesbelkada
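To make the two options concrete, here is a hypothetical sketch of the kernel-selection logic being discussed. The helper name and the availability check are made up for illustration; only the flag name comes from the thread:

```python
import logging

logger = logging.getLogger(__name__)

def pick_exllama_kernel(disable_exllamav2: bool, v2_kernel_available: bool) -> str:
    """Hypothetical helper contrasting the two options discussed above."""
    # Option 1: disable_exllamav2 defaults to False, with a fallback.
    if not disable_exllamav2:
        if v2_kernel_available:
            return "exllamav2"
        # The "verbose" part of option 1: warn and fall back to the v1 kernel.
        logger.warning("exllamav2 kernel unavailable, falling back to exllama.")
    # Option 2 behaves as if disable_exllamav2 defaulted to True:
    # users only leave this branch when they opt in explicitly.
    return "exllama"
```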
I decided to go with the second option and renamed the variable.
ArthurZucker
left a comment
Looks good. I really think we should make things clearer (deprecate `disable_exllama`) / try to think about exllamav3 and how not to have to do the same PR again! But good to go otherwise.
```diff
     def get_loading_attributes(self):
         attibutes_dict = copy.deepcopy(self.__dict__)
-        loading_attibutes = ["disable_exllama", "use_cuda_fp16", "max_input_length"]
+        loading_attibutes = ["disable_exllama", "use_exllama_v2", "use_cuda_fp16", "max_input_length"]
```
Would be great if we can do a deprecation cycle for `disable_exllama` 👿
By "great" I meant: can you do a deprecation cycle for this?
younesbelkada
left a comment
Thanks a mile! Just left one comment - LGTM
```diff
         self.max_input_length = max_input_length
+        self.use_exllama_v2 = use_exllama_v2
+        # needed for compatibility with optimum gptq config
+        self.disable_exllamav2 = not use_exllama_v2
```
Yeah, let's just warn users here (or inside `post_init`) that `disable_exllama` will be deprecated in future versions, and re-think a better API in the future. For now I think it's all good!
I've added it in the `post_init`.
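A minimal sketch of what such a deprecation warning in `post_init` could look like. The class body is a trimmed-down stand-in, and the exact message and warning category are assumptions, not the PR's actual code:

```python
import warnings

class GPTQConfig:
    """Trimmed-down stand-in for the real config; only the flags under discussion."""

    def __init__(self, disable_exllama=None, use_exllama_v2: bool = False):
        self.disable_exllama = disable_exllama
        self.use_exllama_v2 = use_exllama_v2
        self.post_init()

    def post_init(self):
        if self.disable_exllama is not None:
            # Deprecation cycle discussed above: keep honoring the old flag,
            # but tell users it will go away.
            warnings.warn(
                "`disable_exllama` is deprecated and will be removed in a "
                "future version; use `use_exllama_v2` instead.",
                FutureWarning,
            )
```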
```diff
     group_size=self.group_size,
     bits=self.bits,
     disable_exllama=self.disable_exllama,
+    disable_exllamav2=True,
```
Ouch, hurts my eyes 🤣. If you use one or the other, the other will be disabled.
Oh, for these tests we don't use exllamav2 at all. It's just that `dynamically_import_QuantLinear` has `disable_exllamav2=False` by default, so I need to set it to `True` so that I don't import the wrong `QuantLinear`.
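For reference, a sketch of the call pattern in question. The keyword names `group_size`, `bits`, `disable_exllama`, and `disable_exllamav2` come from the diff above; the import path, the `use_triton`/`desc_act` keywords, and the concrete values are assumptions for illustration:

```python
from auto_gptq.utils.import_utils import dynamically_import_QuantLinear

# Pick the QuantLinear implementation explicitly; with disable_exllamav2=True
# the v2 kernel's QuantLinear cannot be imported by accident in these tests.
QuantLinear = dynamically_import_QuantLinear(
    use_triton=False,      # assumed kwarg
    desc_act=False,        # assumed kwarg
    group_size=128,        # illustrative value
    bits=4,                # illustrative value
    disable_exllama=False,
    disable_exllamav2=True,
)
```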
This reverts commit 8214d6e.
* add exllamav2 arg
* add test
* style
* add check
* add doc
* replace by use_exllama_v2
* fix tests
* fix doc
* style
* better condition
* fix logic
* add deprecate msg
Revert "add exllamav2 arg (huggingface#26437)"
This reverts commit 8214d6e.
What does this PR do?
This PR adds the possibility to choose exllamav2 kernels for GPTQ models. It follows the integration of the kernels in auto-gptq and the integration in optimum.
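As a rough end-to-end sketch of the resulting usage after the rename to `use_exllama_v2` discussed in the review (the model id and prompt are illustrative, not from the PR):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "TheBloke/Llama-2-7B-GPTQ"  # any GPTQ-quantized checkpoint

# use_exllama_v2=True opts in to the faster exllamav2 kernels.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=GPTQConfig(bits=4, use_exllama_v2=True),
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Hello, exllamav2!", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```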