Added cache_block_outputs option to enable GPTQ for non-regular models #27032
Merged: amyeroberts merged 5 commits into huggingface:main on Nov 1, 2023
Conversation
SunMarc (Member) reviewed on Oct 24, 2023:
Thanks @AlexKoff88, let's wait for the optimum PR to be merged. We might not need this argument.
Member: @AlexKoff88 please run
AlexKoff88 (Contributor, Author): Fixed
amyeroberts (Contributor) approved these changes on Nov 1, 2023:
Thanks for adding this!
Just a small nit on the docstring.
EduardoPach pushed a commit to EduardoPach/transformers that referenced this pull request on Nov 19, 2023:
Added cache_block_outputs option to enable GPTQ for non-regular models (huggingface#27032)

* Added cache_block_outputs option to enable GPTQ for non-regular models
* Update src/transformers/utils/quantization_config.py (Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>)
* Update src/transformers/utils/quantization_config.py (Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>)
* Fixed style
* Update src/transformers/utils/quantization_config.py (Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>)

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
cache_block_outputs enables caching of each block's outputs to speed up the GPTQ process. However, it does not work for some models, such as ChatGLM, where a LayerNorm is the first layer in the block.
Just compare:
OPT structure:
model.decoder.layers.0.self_attn
model.decoder.layers.0.self_attn.k_proj
model.decoder.layers.0.self_attn.v_proj
model.decoder.layers.0.self_attn.q_proj
model.decoder.layers.0.self_attn.out_proj
model.decoder.layers.0.activation_fn
model.decoder.layers.0.self_attn_layer_norm
model.decoder.layers.0.fc1
model.decoder.layers.0.fc2
model.decoder.layers.0.final_layer_norm
ChatGLM structure:
transformer.encoder.layers.0
transformer.encoder.layers.0.input_layernorm
transformer.encoder.layers.0.self_attention
transformer.encoder.layers.0.self_attention.query_key_value
transformer.encoder.layers.0.self_attention.core_attention
transformer.encoder.layers.0.self_attention.core_attention.attention_dropout
transformer.encoder.layers.0.self_attention.dense
transformer.encoder.layers.0.post_attention_layernorm
transformer.encoder.layers.0.mlp
transformer.encoder.layers.0.mlp.dense_h_to_4h
transformer.encoder.layers.0.mlp.dense_4h_to_h
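For reference, listings like the two above can be reproduced with PyTorch's named_modules(); a quick sketch (the checkpoint id is illustrative):

```python
from transformers import AutoModelForCausalLM

# Illustrative checkpoint id; any OPT checkpoint shows the same layout.
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
for name, _ in model.named_modules():
    if name.startswith("model.decoder.layers.0"):
        print(name)
```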
The solution is to disable caching of the self-attention block outputs and to collect each quantized block's inputs by running the model from the beginning. This slows down quantization a bit, but it is more stable.
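Conceptually, the two calibration strategies differ as in the minimal sketch below. This is a toy PyTorch stand-in, not the actual optimum implementation; quantize_block is a placeholder for the real GPTQ step.

```python
import torch
from torch import nn

# Toy stand-in for a decoder: three "blocks" in sequence.
blocks = nn.ModuleList(nn.Linear(8, 8) for _ in range(3))
model = nn.Sequential(*blocks)
samples = [torch.randn(1, 8) for _ in range(4)]

def quantize_block(block, inputs):
    """Placeholder for the real GPTQ step, which would accumulate
    Hessian statistics for `block` from `inputs`."""

# cache_block_outputs=True: reuse each block's outputs as the
# calibration inputs of the next block (fast path).
inputs = samples
for block in blocks:
    quantize_block(block, inputs)
    with torch.no_grad():
        inputs = [block(x) for x in inputs]

# cache_block_outputs=False: for every block, re-run the model from
# the start and capture that block's inputs with a pre-forward hook
# (slower, but robust to blocks whose first layer is a LayerNorm).
for block in blocks:
    captured = []
    hook = block.register_forward_pre_hook(
        lambda module, args: captured.append(args[0])
    )
    with torch.no_grad():
        for x in samples:
            model(x)
    hook.remove()
    quantize_block(block, captured)
```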
Related PR to Optimum: huggingface/optimum#1479
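Once both PRs are merged, the option is exposed through GPTQConfig in transformers. A minimal usage sketch, assuming a ChatGLM-style model (the checkpoint id is illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "THUDM/chatglm2-6b"  # illustrative checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# cache_block_outputs=False tells the GPTQ backend to re-collect each
# block's calibration inputs from the start of the model instead of
# reusing the previous block's cached outputs.
gptq_config = GPTQConfig(
    bits=4,
    dataset="c4",
    tokenizer=tokenizer,
    cache_block_outputs=False,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,
    trust_remote_code=True,
)
```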