Conversation
JohannesGaessler left a comment
Sorry for causing more work for you; I thought I had checked QK_K = 64, but it seems I forgot. I would have fixed it myself, but I haven't worked on llama.cpp in the last few days.
Using `LLAMA_CUDA_FORCE_DMMV=ON` and `-nommq`, it runs and produces a meaningful result.
Keep in mind that mul_mat_q reduces VRAM usage and thus allows you to run a better quantization, though. So I would argue that with the same hardware you can still achieve better perplexity.
The overwhelming majority of users are running LLaMA-based models and I think the defaults should reflect that. So I think mul_mat_q should remain the default.

I just remembered: the …
This is highly likely to be causing problems. On Metal, shaders are built with fast math enabled by default, so math functions can be compiled to faster, less accurate implementations (see the Metal Shading Language Specification: https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf). One can explicitly use "precise" math functions by calling them through the `precise::` namespace. Simply changing the kernel to use the `precise::` variants should show whether this is the cause.
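To illustrate the idea, here is a minimal Metal Shading Language sketch (a hypothetical elementwise kernel, not the actual ggml-metal code; only the `precise::` namespace itself is from the MSL spec):

```metal
#include <metal_stdlib>
using namespace metal;

// Hypothetical kernel, for illustration only. Under Metal's default fast
// math, a plain exp() call may be compiled to the lower-accuracy fast
// variant; precise::exp() always uses the accurate implementation.
kernel void exp_precise_sketch(device const float * src [[buffer(0)]],
                               device       float * dst [[buffer(1)]],
                               uint tpig [[thread_position_in_grid]]) {
    // fast-math default (potentially inaccurate):
    // dst[tpig] = exp(src[tpig]);

    // explicitly precise:
    dst[tpig] = precise::exp(src[tpig]);
}
```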
I'm not sure what …
```diff
 int nx = tensor->ne[0];
-if (nx % QK_K == 0) {
+if (model.arch == LLM_ARCH_FALCON || nx % QK_K != 0) {
     new_type = GGML_TYPE_Q8_0;
```
Why don't we use `Q8_0` when `GGML_USE_K_QUANTS` is disabled?
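For context, here is a sketch of the selection logic around the diff (paraphrased, not verbatim llama.cpp; the else-branch and the `Q6_K` fallback are my assumptions):

```cpp
// Sketch of how the output tensor's quantization type is chosen. In
// llama.cpp this block sits inside #ifdef GGML_USE_K_QUANTS, which is
// what the question above refers to.
if (name == "output.weight") {
    const int nx = tensor->ne[0]; // row size; 4544 for Falcon-7B, which is
                                  // divisible by 64 but not by 256
    if (model.arch == LLM_ARCH_FALCON || nx % QK_K != 0) {
        // Falcon (or any row size incompatible with QK_K): use Q8_0, which
        // also gives the large perplexity win reported in this PR
        new_type = GGML_TYPE_Q8_0;
    } else {
        new_type = GGML_TYPE_Q6_K; // assumed default for LLaMA-style models
    }
}
```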


Falcon-7b requires using k-quants super-blocks of `QK_K=64` instead of the usual `QK_K=256` (`LLAMA_QKK_64=ON` when building). This PR:

* On CUDA, requires building with `LLAMA_CUDA_FORCE_DMMV=ON` and running with `-nommq` (CUDA does not build when QK_K = 64 #2815). There are also many warnings when compiling `ggml-cuda.cu`.
* Covers both `QK_K = 256` and `QK_K = 64`.
* Uses `Q8_0` quantization of the `output.weight` tensor for Falcon models for all quantization types. This makes a huge difference for `Q4/5_0/1`. For instance, `Q4_0` perplexity becomes 7.2451 from 8.3948 without the changes in this PR! For `Q5_0` the change is from 7.4725 to 7.1605 (Falcon-7b perplexity for `fp16` is 7.1213).
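To see why promoting `output.weight` to `Q8_0` trades a few extra bits for accuracy, here are the two block layouts as I understand them from ggml's sources at the time (a sketch; `ggml_fp16_t` is ggml's 16-bit float type):

```c
#include <stdint.h>
typedef uint16_t ggml_fp16_t; // as in ggml.h

// Q8_0: 32 weights, one fp16 scale -> (16 + 32*8) / 32 = 8.5 bits per weight
#define QK8_0 32
typedef struct {
    ggml_fp16_t d;          // block scale (delta)
    int8_t      qs[QK8_0];  // signed 8-bit quantized values
} block_q8_0;

// Q4_0: 32 weights, one fp16 scale -> (16 + 32*4) / 32 = 4.5 bits per weight
#define QK4_0 32
typedef struct {
    ggml_fp16_t d;              // block scale (delta)
    uint8_t     qs[QK4_0 / 2];  // 4-bit values, two packed per byte
} block_q4_0;
```

Since `output.weight` is a small fraction of the model's parameters, the size increase from 4.5 to 8.5 bits per weight on that one tensor is modest relative to the perplexity gain.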
Some observations:

* `Q3_K_M` and smaller quantizations are not really viable for Falcon-7b.
* `Q4/5_0/1` are highly competitive with the k_quants when the `output.weight` tensor is quantized with `Q8_0`.
* On CUDA, the perplexity difference between the quantized matrix multiplications (mul_mat_q) and dequantize + cuBLAS (`-nommq`) is much bigger compared to the LLaMA models (see the sketch after this description for where the two paths diverge). For instance, for `Q4_0`, `-nommq` is 0.031 lower, which I think is not acceptable. In comparison, for LLaMA-v2-7B the difference is 0.006 (which is also quite big for my taste, but borderline acceptable). Perhaps we should consider reverting CUDA: use mul_mat_q kernels by default #2683 so quantized matrix multiplications are opt-in rather than the default?

The following graph shows perplexity scores for Falcon-7B for different quantization types using this PR. All calculations were run with `-nommq`.
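Regarding the mul_mat_q vs `-nommq` observation above, a rough sketch of where the two CUDA paths diverge (paraphrased from my reading of `ggml-cuda.cu`; the dispatch function and the op signatures are simplified for illustration, not the real interfaces):

```cpp
#include "ggml.h"

// Simplified declarations for illustration; the real ops in ggml-cuda.cu
// take more parameters.
void ggml_cuda_op_mul_mat_q(const ggml_tensor *, const ggml_tensor *, ggml_tensor *);
void ggml_cuda_op_mul_mat_cublas(const ggml_tensor *, const ggml_tensor *, ggml_tensor *);

static bool g_mul_mat_q = true; // default after PR #2683; -nommq sets it to false

static void mul_mat_dispatch_sketch(const ggml_tensor * src0,
                                    const ggml_tensor * src1,
                                    ggml_tensor * dst) {
    if (g_mul_mat_q && ggml_is_quantized(src0->type)) {
        // mul_mat_q: integer dot products directly on the quantized blocks;
        // saves VRAM but introduces the extra rounding error behind the
        // 0.031 perplexity gap measured above
        ggml_cuda_op_mul_mat_q(src0, src1, dst);
    } else {
        // -nommq path: dequantize src0 and run the mat-mul through cuBLAS
        ggml_cuda_op_mul_mat_cublas(src0, src1, dst);
    }
}
```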