IQ4_NL: 4-bit non-linear quants with blocks of 32 (#5590)
Conversation
looking good. So basically, the question is: can I have a mix of Q4_K super-blocks of 256 with 32-weight blocks of IQ4_nl to get even bigger space savings?
@sorasoras
hmm, could we expect an even denser version of IQ4 in the future?
ROCm benchmarks
7900XTX at 400W TGP
It's surprising that NL offers comparable performance to Q4_1.
Tested on QWEN1.5-14B: saved about 150MB of file size on 3K_X_S (3.71 BPW --> 3.63 BPW) with roughly the same PPL. Thanks for the contribution.
With the changes introduced by IQ4_NL, IQ2_XS can beat the mainline Q2_K_S in terms of PPL with the same imatrix.
@ikawrakow due to the recent big changes and new k-quant implementations, could you help compile a table showing the differences among all quant types?
Cannot run IQ4_NL with mmq on a 4070 Ti.
* iq4_nl: squash commits for easier rebase
* Basics (quantize, dequantize)
* CUDA dequantize and dot product
* Slightly faster CUDA dot product (120 t/s)
* Switch to 6-bit scales
* Scalar dot product
* AVX2 dot product
* ARM_NEON dot product
* Works on metal, but still slow
* Slightly better Metal dot product
* Another small Metal improvement
* Metal dot product is getting there
* Faster CUDA dot product
* Add 1/8 ffn_down layers as Q5_K when no imatrix has been provided
* Report the actual bpw
* Add _xs mix that is 4.05 bpw for non-MoE models
* Remove IQ4_XS for now, slightly adjust kvalues_iq4nl
* AVX2 dot product uses Q8_0 instead of Q8_K
* Add to test-backend-ops
* Minor fix
* Also use Q5_K for attn_output in MoE models
* Fixes after merging latest master
* Switching to blocks of 32
* AVX2 for blocks of 32
* Scalar dot product for blocks of 32
* ARM_NEON dot product for blocks of 32
* Metal kernels for blocks of 32
* Slightly faster Metal kernels
* iq4_nl: Fix after merging with master
* iq4_nl: another fix after merging with master
* Use IQ4_NL instead of Q4_K when using k-quants is not possible
* Fix typo that makes several tests fail
* It was the ggml_vdotq thing missed inside the brackets

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Description source: ggml-org/llama.cpp#5590

TL;DR
The main purpose of this PR is to provide a 4-bit quantization type that can be used when k- and i-quants that use blocks of 256 are not available (because the number of columns in some tensors is not a multiple of 256).
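To illustrate the fallback this enables, here is a minimal sketch (not the actual llama.cpp code; the function and type names are hypothetical): k-quants such as `Q4_K` need the row length to be a multiple of their 256-weight super-block, while `IQ4_NL` only needs blocks of 32.

```c
#include <stdint.h>

enum quant_choice { CHOOSE_Q4_K, CHOOSE_IQ4_NL };

// Hypothetical helper (not the actual llama.cpp logic): pick a 4-bit type for
// a tensor row. If the number of columns is not divisible by the 256-weight
// super-block used by k-quants, fall back to the 32-weight IQ4_NL blocks.
enum quant_choice choose_4bit_quant(int64_t n_cols) {
    const int64_t k_quant_superblock = 256;
    return (n_cols % k_quant_superblock == 0) ? CHOOSE_Q4_K : CHOOSE_IQ4_NL;
}
```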
In short
* `IQ4_NL` uses blocks of 32 weights with a `fp16` block scale, exactly like `Q4_0`, so models quantized with `IQ4_NL` are the exact same size as `Q4_0` and `Q4_K_S`.
* `IQ4_NL` uses a non-linear mapping to convert quants to weights (more on this below).
* Quantization error is lower than `Q4_0` and almost on par with `Q4_K_S`.
* Performance is similar to `Q4_0`, except on Metal, where it is 8% (prompt processing) or 20% (token generation) slower than `Q4_0`.
* If the `fp16` block scales are replaced with `int8_t` block scales (plus one floating point scale per row, which adds a negligible amount of bits), this would be a 4.25 bpw quantization, which has the same quantization error as the 4.5 bpw `IQ4_NL` added by this PR.
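For reference, the stated sizes follow from simple bit counting: a block of 32 weights at 4 bits each plus one `fp16` scale is 32×4 + 16 = 144 bits, i.e. 144/32 = 4.5 bpw (the same as `Q4_0`), while replacing the `fp16` block scale with an `int8_t` one gives 32×4 + 8 = 136 bits, i.e. 4.25 bpw (plus the negligible per-row floating point scale).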
PPL comparisons

The following tables show PPL comparisons between `Q4_0`, `Q4_K_S`, and `IQ4_NL`. We start with the case of not using an importance matrix (I find this to be an important use case, as at 4-bit quantization one should ideally not worry too much about having a suitable imatrix to quantize a model).

Table 1: PPL comparison without imatrix for context of 512 tokens
The next table is with an imatrix created from `wiki.train.raw`.

Table 2: PPL comparison with imatrix for context of 512 tokens
Just in case researchers working on quantization happen to see this PR, here are some PPL results for a context of 4096 (LLaMA-v2 and Mistral) or 2048 (LLaMA-v1).

Table 3: PPL comparison with imatrix for context of 4096/2048 tokens
To make the comparison with the approaches that are currently claiming to be SOTA, the next table shows the quantization error defined as `QErr = PPL(Quantized)/PPL(fp16) - 1`. I took the values for AQLM and QuIP# from the latest QuIP# paper.
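As an illustration with made-up numbers: a quantized model with PPL = 5.20 against an `fp16` baseline with PPL = 5.00 gives `QErr = 5.20/5.00 - 1 = 0.04`, i.e. a 4% quantization error.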
Table 4: Quantization error comparisons

Performance comparisons
Table 5 shows PP-512 and TG-128 values for a 7B LLaMA on various platforms
* `Metal` is on an M2-Max 30-core GPU
* `ARM_NEON` is on an M2-Max CPU using the 8 performance cores
* `CUDA` is on an RTX-4080
* `AVX2` is on a Ryzen-7950X CPU using 16 (PP-512) or 8 (TG-128) threads

Additional details
It all comes down to this set of 16 magic values
Where do they come from? I had implemented a K-means clustering based quantization in my private repository (similar to what, e.g., SqueezeLLM does), with clustering done per tensor row. Although I was getting similar or even slightly better results than SqueezeLLM, I was not particularly happy with the quantization quality, so I decided to see what happens if I apply block-wise scaling before clustering. It turned out that the cluster means end up being (nearly) independent of the tensor/tensor row. I collected statistics of the cluster means from a few quantized models and saw that the 16 means of the cluster means can be fit with a 3rd order polynomial that maps quant index to a (scaled) model weight. Using the polynomial fit directly results in very decent performance on CUDA and acceptable performance on Metal, but is a no-go for CPU SIMD instructions. On the CPU the only thing that gives good performance is a lookup table containing `int8_t` values. So, after scaling the polynomial fit to the full `int8_t` range and rounding to the nearest integer, we end up with the above 16 values.
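To show what "non-linear" means in practice, here is a minimal, self-contained sketch of block dequantization with such a lookup table. It is not the actual ggml implementation; the block layout, the helper name, and the table values are illustrative stand-ins (the real table is the `kvalues_iq4nl` array, and the real block scale is stored as `fp16`).

```c
#include <stdint.h>

#define QK4_NL 32  // weights per block

// Illustrative non-linear lookup table: 16 int8_t levels indexed by the 4-bit
// quant. These are NOT the real kvalues_iq4nl values, just a made-up monotone
// set showing the non-uniform spacing between quantization levels.
static const int8_t kvalues_sketch[16] = {
    -120, -98, -78, -60, -44, -30, -18, -7, 3, 14, 26, 40, 56, 74, 94, 116
};

// Sketch of one block: a per-block scale and 32 packed 4-bit quant indices.
typedef struct {
    float   d;               // block scale (fp16 in the actual format)
    uint8_t qs[QK4_NL / 2];  // two 4-bit indices per byte
} block_iq4_nl_sketch;

// Dequantize one block: each 4-bit index selects a table entry, which is then
// multiplied by the block scale. This is the non-linear mapping from quants to
// weights: the reconstructed value is d * table[q] instead of Q4_0's d * (q - 8).
void dequantize_iq4_nl_sketch(const block_iq4_nl_sketch *x, float *y) {
    for (int j = 0; j < QK4_NL / 2; ++j) {
        y[j]              = x->d * kvalues_sketch[x->qs[j] & 0x0F]; // low nibble
        y[j + QK4_NL / 2] = x->d * kvalues_sketch[x->qs[j] >> 4];   // high nibble
    }
}
```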
The initial work on this was done before I implemented the importance matrix. Without imatrix, the non-linear quantization was basically on par with `Q4_K` in terms of quantization error (see Table 1), while using ~7% fewer bits (if implemented row-wise with blocks of 32). But after the imatrix was added, `Q4_K` became slightly better again (Tables 2 and 3). The non-linear quantization outperforms `Q4_K` with blocks of 16. If implemented using super-blocks of 256 with 6-bit block scales, this would be a 4.4375 bpw SOTA quantization (SOTA in the sense that I'm not aware of a quantization approach that achieves a lower quantization error with less than 5 bpw).
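For what it's worth, the 4.4375 bpw figure is consistent with a simple bit count, assuming blocks of 16 inside a 256-weight super-block and one `fp16` scale per super-block: 256×4 (quants) + 16×6 (block scales) + 16 (super-block scale) = 1136 bits, and 1136/256 = 4.4375 bpw.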