Conversation
Great work! Why not make an IQ4_XXS by using IQ3_S for the attn_k and attn_q?
get_k_quant_type : tensor cols 13696 x 5120 are not divisible by 256, required for iq4_xs
llama_model_quantize: failed to quantize:
Because I need to fix
@sorasoras Thanks! I keep forgetting this check. It should be fixed now.
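For readers following along, here is a minimal sketch of the kind of guard this refers to; the names are hypothetical and this is not the actual patch. Types with super-blocks of 256 need the tensor row size to be a multiple of 256, so a tensor like the 13696 x 5120 one above has to fall back to a smaller-block type.

```c
#include <stdint.h>

// Hypothetical sketch (not the actual patch): super-block types such
// as IQ4_XS need the row size to be a multiple of 256, so fall back
// to a smaller-block 4-bit type such as IQ4_NL when it is not.
enum qtype { QTYPE_IQ4_XS, QTYPE_IQ4_NL };

static enum qtype pick_quant_type(enum qtype wanted, int64_t n_cols) {
    if (wanted == QTYPE_IQ4_XS && n_cols % 256 != 0) {
        return QTYPE_IQ4_NL;  // same non-linear codebook, smaller blocks
    }
    return wanted;
}
```

For the tensor above, 13696 % 256 = 128, so the fallback path would be taken.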
It's working. That's great!
@ikawrakow Thanks a lot for your hard work! It is very much appreciated. Do you think that we can fix the slower Metal speeds with better kernels, or does it require a whole new quantisation type? I am wondering why there is such a difference. Is it because of the additional overhead/calculations that are required for the new IQ quant methods?
KL-divergence data for Mistral-7B
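For reference, the KL-divergence used in this kind of comparison measures how far the quantized model's next-token distribution $q$ drifts from the full-precision model's distribution $p$, averaged over the evaluated tokens (lower is better):

$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_i p_i \log \frac{p_i}{q_i}$$

This complements PPL: two quantizations with similar perplexity can still differ in how closely they track the original model's predictions.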
Very nice, seems to be a solid replacement for Q4KS, which was my default recommendation.
The quantization in this PR is non-linear, hence it requires a table lookup. If you compare to
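To illustrate the table lookup: each 4-bit quant is an index into a small non-linear codebook rather than a value used directly in arithmetic. A minimal dequantization sketch, with illustrative codebook values (the exact table in ggml may differ):

```c
#include <stdint.h>

// Illustrative 16-entry non-linear codebook: levels are denser near
// zero and sparser at the tails (the exact values in ggml may differ).
static const int8_t kvalues[16] = {
    -127, -104, -83, -65, -49, -35, -22, -10,
       1,   13,  25,  38,  53,  69,  89, 113,
};

// Dequantize one block of 32 weights: map each 4-bit index through
// the codebook, then multiply by the block scale.
static void dequant_block32(float scale, const uint8_t qs[16], float out[32]) {
    for (int i = 0; i < 16; ++i) {
        out[i +  0] = scale * kvalues[qs[i] & 0x0F];  // low nibbles first
        out[i + 16] = scale * kvalues[qs[i] >> 4];    // then high nibbles
    }
}
```

A linear type like Q4_0 instead reconstructs with plain arithmetic, scale * (q - 8), so the extra gather per weight is a plausible source of the Metal gap.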
* Try IQ4_NL with blocks of 64 - does not look good
* iq4_xs: go to super-blocks of 256 and 6-bit scales for blocks of 32
* iq4_xs: CUDA works - 133.2 t/s
* iq4_xs: AVX2 dot product
* iq4_xs: ARM_NEON dot product
* iq4_nl: Metal implementation

  As usual, Metal / Apple Silicon don't like my quants.

* iq3_xs: minor fix
* iq4_xs: shrink by using IQ3_S for attn_k and attn_q
* iq4_xs: revert using IQ3_S for attn_k and attn_v

  PPL vs size is good, but CPU performance suffers: on M2 Max TG-128 drops to 21.7 t/s from 28.8, and on a Ryzen-7950X to 14.5 t/s from 15.8 t/s. On CUDA we have 135 t/s when using IQ3_S vs 133 t/s with pure IQ4_XS.

* Fix CI
* iq4_xs: Added forgotten check for 256 divisibility

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
@ikawrakow |
This is basically the same as IQ4_NL, but in super-blocks of 256 with 6-bit scales for the blocks of 32 weights. It looks pretty good on the quantization error vs quantized model size curve:

[figure: quantization error vs quantized model size]

It is possible to move the point closer to the IQ2_XXS...IQ3_M fit line by using IQ3_S for the attn_k and attn_q tensors. This reduces the quantized model size to about 4.1 bpw at the expense of a ~0.3% increase in PPL. But given that currently CPU performance for IQ3_S is pretty bad, I decided against this. Speaking of performance, it is excellent on all platforms where I can test except Metal (as usual):

[table: performance comparison vs Q4_0 across platforms]
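To make the size arithmetic concrete, here is a hedged sketch of what a super-block of 256 with 6-bit scales works out to; the field names and exact packing are illustrative rather than the shipped ggml struct:

```c
#include <stdint.h>

#define QK_K 256  // weights per super-block

// Sketch of an IQ4_XS-style super-block: 8 sub-blocks of 32 weights,
// each with a 6-bit scale (4 low bits + 2 high bits), one fp16
// super-block scale, and 128 bytes of packed 4-bit indices.
typedef struct {
    uint16_t d;                  // fp16 super-block scale       (2 bytes)
    uint16_t scales_h;           // 2 high scale bits x 8 blocks (2 bytes)
    uint8_t  scales_l[QK_K/64];  // 4 low scale bits x 8 blocks  (4 bytes)
    uint8_t  qs[QK_K/2];         // 4-bit indices, 2 per byte  (128 bytes)
} block_iq4_xs_sketch;

// 2 + 2 + 4 + 128 = 136 bytes per 256 weights
// => 136 * 8 / 256 = 4.25 bits per weight
```

The 4.25 bpw total is what places the point between IQ4_NL and the K-quants on the size axis; swapping attn_k and attn_q to IQ3_S is what would pull it down to the ~4.1 bpw mentioned above.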