quantize: Handle user-defined quantization levels for additional tensors #12511
ggerganov merged 35 commits into ggml-org:master
Conversation
That's an excellent idea! It'll allow me to add all supported tensor types (50+) without creating a mess of parameters. Plus, it will give me something to do over the weekend 😆
Yeah, I think this is definitely the way to go - the regex support of that PR gives really good flexibility.
TL;DR: A combination of Tensor-Wise Quantization (TWQ) and Layer-Wise Quantization (LWQ) is useful for generating custom models. Using DeepSeek-R1-Distill-Llama-8B-Q4_K_M as an example, LWQ yields a 10.4% smaller model with only a 0.83% 𝜌PPL penalty compared to the naive model. More info here: Test results
@EAddario |
```cpp
    void * kv_overrides;  // pointer to vector containing overrides
    void * tensor_types;  // pointer to vector containing tensor types
} llama_model_quantize_params;
```
This changes the public interface, so add a comment in #9289.
Note that passing C++ objects here is not correct, and we will eventually have to fix this API to not do that. It hasn't become a problem yet because the quantization functions are likely not used frequently by 3rd-party applications.
@EAddario If you are interested, you can give it a shot in another PR and fix these structs to become C compatible.
slaren left a comment:
This is a bit too hacky for my preference, but I suppose if people are already creating custom mixes by modifying the code it is better to at least have a tool to do it.
I would prefer if the allowed-tensor check were removed; it doesn't really work as a reliable check, and it will prevent some legitimate uses.
Thanks for approving, @slaren. Is there any particular use case you have in mind that it will prevent? Maybe I can work it into the logic.
Got a better-quality LWQ mix using the stats from the modified llama-imatrix. More info here: Test results
For example, using
I see what you mean. The choice of approach was a trade-off between ensuring the program continues to work exactly as before (backwards compatibility), not introducing new options that duplicate existing ones (--pure, --output-tensor-type and --token-embedding-type), and adding new capabilities in a way that's consistent with the existing error-checking logic. By restricting the tensors, users won't be able to do things that are clearly not useful, like trying to quantize norms, lerps, ropes, etc., but you're right that users wanting to quantize all attn tensors would need to pass three options (--tensor-type attn_q=q4_k --tensor-type attn_k=q4_k --tensor-type attn_v=q4_k) instead of just one (--tensor-type attn=q4_k). Once the changes are merged, I'll open a new PR to address this within the tensor-checking logic, avoiding matches on instances like attn_norm, ffn_norm, etc., and also implementing @ggerganov's recommendation to make the struct C compatible.
Late to this conversation, but isn't this case already handled by a regex that uses grouping?
Not quite, @acbits. For the reasons described above, the program requires the full tensor name, with the regex applying only to preceding characters. I'll improve this behaviour in the next PR.
@EAddario Congrats! 🚀
Really late to this, but nice PR. One idea: could we define all the logic in a more general way, for example using a JSON format, and possibly read it from a file for the most advanced cases?
Thanks @Djip007. @ngxson had a similar suggestion, and it's on my to-do list. The way I'm thinking about it is for llama-imatrix (#12718) to generate a file with "recommended" quants, based on relevant statistics, which can then be processed by llama-quantize. The file can of course be changed/created by hand. I don't yet know exactly what "recommended" means, so I'm open to suggestions.
I'll think about it... if I have any ideas, I'll try to share them. |
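As a starting point for that discussion, a recipe file of the kind described above might look something like this. This is a purely hypothetical sketch of a possible format; no such file format exists in llama.cpp:

```json
{
  "default_type": "q4_k",
  "overrides": [
    { "tensor": "token_embd",             "type": "q6_k" },
    { "tensor": "blk\\.\\d+\\.ffn_down",  "type": "q5_k" },
    { "tensor": "output",                 "type": "q6_k" }
  ]
}
```

A structure like this would let llama-imatrix emit its statistics-based recommendations and llama-quantize consume them, while remaining easy to edit by hand.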
Feel free to comment on #12718 |
…ors (ggml-org#12511)

* Add llama_model_quantize_params parameters
* Add new quantize parameters parsing and validation
* Update usage
* Add new parameters defaults
* Add new quantization parameters logic
* Minor refactoring as per the contributors' coding guidelines
* Update descriptions to match existing style
* Implement general --tensor-type instead of tensor-specific command option
* Fix implied type bug
* Restore missing #includes
* Add regex capability for tensor selection
* Refactor function name and update ALLOWED_TENSOR_TYPE
* Add missing #include
* Handle edge case when tensor name is cls.output
* Minor logging improvement
This PR adds the ability to quantize other tensors, beyond token-embedding and output-tensor. It handles most of the supported architectures, except Mamba, RWKV6, RWKV6QWEN2 and T5, to avoid having too many command options, but these can be added as well if maintainers request it. For full background on the PR, please see: Squeezing Tensor Bits: the quest for smaller LLMs