Adding IQ2_S and IQ2_M to complete coverage of the 2-3 bit quantization range #5721
Conversation
ggerganov left a comment:
To me it looks like we need a quantization type with about 4 bpw to close the gap between `IQ3_M` and `Q4_K`.
Yes, I agree. The question is: …
At 5 bits and above there isn't much gain from alternative quantization, at least not for the models that I'm using for testing where, once you use an imatrix, …
fun fact, …
dranger003: @ikawrakow Thanks for the amazing work. While testing IQ3_S/IQ3_M from #5676 I'm getting a segfault when using more than 2 threads; I added the output here: #5676 (comment).
ikawrakow: @dranger003 Can you post a failing model somewhere I can download it? I have quantized many models with these quantization types without issue (and yes, I'm always using multi-threading), so I don't know what could be wrong without a test case.
dranger003: @ikawrakow Yes, although you might hate me quite a bit given its size. See here.

EDIT: Adding details here as I find out more; hopefully this can help. Another finding is that it crashes using 8 or 12 threads, but it doesn't crash using 2 or 16 threads. I have devtools installed and can debug the code if you need me to look up something specific, but otherwise I just don't know where to look without some guidance.

EDIT2: I think this may be a race condition and not directly tied to the thread count. For example, if I run the quantize several times in a row with the same thread count, say 12, then after a number of failed attempts one of the runs goes through fine. Also, I just tested IQ2_S/IQ2_M and I get the same behavior. I have been quantizing several models and I only get this issue with the new IQ3/IQ2 quant types.
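The pass/fail pattern described above — intermittent failures at some thread counts, occasional clean runs at the same count — is the classic signature of a data race. Below is a minimal, generic C++ sketch of the effect (not the actual llama.cpp code; the shared counter is purely illustrative), the kind of bug that ThreadSanitizer flags deterministically:

```cpp
// Generic illustration of why a data race only fails intermittently.
// Build with: g++ -pthread -fsanitize=thread race.cpp
#include <cstdio>
#include <thread>
#include <vector>

static long counter = 0; // shared, unsynchronized: this is the bug

int main() {
    std::vector<std::thread> workers;
    for (int i = 0; i < 12; ++i) {
        workers.emplace_back([] {
            for (int j = 0; j < 100000; ++j) {
                ++counter; // racy read-modify-write; loses updates only sometimes
            }
        });
    }
    for (auto & t : workers) {
        t.join();
    }
    // Expected 1200000, but many runs print less, and some runs pass by
    // luck, matching the "fails on some runs, not others" behaviour above.
    printf("counter = %ld (expected 1200000)\n", counter);
}
```

Rebuilding with `-fsanitize=thread` and re-running the failing quantization would typically point at the racing accesses directly, which is much faster than bisecting thread counts.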
…on range (ggml-org#5721)

* Adding IQ2_S and IQ2_M as a single cumulative commit
* Update examples/quantize/quantize.cpp

  Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
This PR adds two new quantization types, `IQ2_S` and `IQ2_M`, to complete the coverage of the 2-3 bit quantization range.

Why? The reason for having all these new quantization types is best explained with the following graph, which shows the quantization error, defined as `PPL(Q)/PPL(fp16)-1`, as a function of bits-per-weight (bpw). The bpw is for the complete model, including the `output.weight` and `token_embd.weight` tensors. The data is for LLaMA-v2-13B, but other models show a very similar behavior.

The black/blue symbols show the results for k-/legacy quants using `668b31f`, which is the last commit before I started adding i-quants and imatrix stuff. The red symbols represent the new i-quants and updated k-quants, including `IQ2_S` and `IQ2_M` added by this PR; magenta circles are for legacy quants (with all i-, k-, and legacy quants using imatrix from `wiki.train.raw`). So, in a nutshell:

* `Q4_1` behaves as expected, instead of having a higher quantization error than `Q4_0` as was often the case.
* The i-quant counterparts of the k-quants at similar bpw are: `Q2_K -> IQ3_XXS`, `Q3_K_S -> IQ3_XS`, `Q3_K_M -> IQ3_S`, `Q3_K_L -> IQ3_M`.
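To make the two axes of the graph concrete, here is a minimal C++ sketch (not from the PR; the PPL values and sizes are hypothetical placeholders) of how the quantization error and the full-model bpw are computed:

```cpp
#include <cstdint>
#include <cstdio>

// Quantization error as defined above: PPL(Q)/PPL(fp16) - 1.
double quantization_error(double ppl_quantized, double ppl_fp16) {
    return ppl_quantized / ppl_fp16 - 1.0;
}

// Bits-per-weight for the complete model: total size in bits divided by
// the total number of weights (output.weight and token_embd.weight included).
double bits_per_weight(uint64_t model_size_bytes, uint64_t n_weights) {
    return 8.0 * double(model_size_bytes) / double(n_weights);
}

int main() {
    // Hypothetical numbers for illustration only.
    const double err = quantization_error(5.21, 4.90);
    const double bpw = bits_per_weight(4ull * 1024 * 1024 * 1024,   // ~4 GiB file
                                       13ull * 1000 * 1000 * 1000); // ~13B weights
    printf("error = %.4f, bpw = %.2f\n", err, bpw);
}
```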
Interestingly enough, the `IQ2_XXS`...`IQ3_M` quantization error can be described with a simple fit of the form `a * exp(-b * bpw)`. The 1.5 bpw quantization `IQ1_S` (which I'm not showing here so as not to have too large a y-axis range) nearly falls onto the same fit. If we were able to keep this rate of quantization error reduction with bpw beyond 4 bpw, we would get `Q6_K` performance at about 5.3 bpw.

To me it looks like we need a quantization type with about 4 bpw to close the gap between `IQ3_M` and `Q4_K`.
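As an illustration of how such a two-parameter fit works, here is a small C++ sketch; the data points and the `Q6_K` error target are made-up stand-ins, not the values behind the 5.3 bpw figure. Given two points on the curve `err = a * exp(-b * bpw)`, the parameters follow directly: `b = ln(e1/e2) / (x2 - x1)` and `a = e1 * exp(b * x1)`.

```cpp
#include <cmath>
#include <cstdio>

int main() {
    // Hypothetical (bpw, error) pairs standing in for two i-quant data points.
    const double x1 = 2.0, e1 = 0.30;
    const double x2 = 3.5, e2 = 0.05;

    // Solve the two-parameter exponential fit err(bpw) = a * exp(-b * bpw).
    const double b = std::log(e1 / e2) / (x2 - x1);
    const double a = e1 * std::exp(b * x1);

    // If the trend continued past 4 bpw, the bpw needed to reach a target
    // error would be: bpw = ln(a / err_target) / b.
    const double err_target = 0.001; // hypothetical stand-in for Q6_K's error
    printf("a = %.4f, b = %.4f, bpw(err=%.3f) = %.2f\n",
           a, b, err_target, std::log(a / err_target) / b);
}
```

Inverting the fit this way is how one extrapolates where a hypothetical future quant would land if the observed error-vs-bpw trend held beyond 4 bpw.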