remove bug in convert.py permute function #3364
Conversation
|
@TheBloke did you ever have issues with GQA 70B and hf models? |
|
No I haven't had issues with those for ages - not since the issues were fixed shortly after GGUF released. I've done loads in the last few weeks, all work fine. @jzhang38 it's not true to say that all 70B models come from Meta PTH weights. 99% of 70B conversions now are done from HF weights in pytorch_model.bin or model.safetensors format - because they're fine tuned models. Do you want me to test this updated script with a 70B HF model? I have one to convert in a minute actually |
yea please do, it's kind of hard to "just" convert one of those for me 😅 |
|
@jzhang38 I can indeed confirm that this fixes the converted tinyllama models 👍 |
|
@TheBloke Yeah, the actual reason would be that Llama 2 70B uses 64 heads and 8 key-value heads, which makes `n_head //= n_head_kv` (64 // 8 = 8) the same as `n_head = n_head_kv` (8). So the bug is not triggered. |
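To make the arithmetic concrete, here is a minimal check (head counts taken from the discussion above; the variable names mirror convert.py, but the snippet itself is illustrative rather than code from the repo):

```python
# Llama 2 70B head configuration (from the discussion above)
n_head, n_head_kv = 64, 8

buggy_n_head = n_head // n_head_kv  # old code: n_head //= n_head_kv -> 8
fixed_n_head = n_head_kv            # fixed code: n_head = n_head_kv -> 8

# Both paths land on 8, so the resulting permute is identical
# and the bug stays hidden for this particular model.
assert buggy_n_head == fixed_n_head == 8
```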
|
ran perplexity on the first 300 chunks (batch 512) of wikitext on the f32 and q8_0 models (GPU): it is safe to say that it works with this PR |
|
are there other GQA/MQA models we can test? |
|
The 70B Llama 2 model worked fine BTW |
Mistral 7B (#3362) seems to be GQA, but I don't know if there is an HF conversion already. |
I've noticed https://huggingface.co/PY007/TinyLlama-1.1B-Chat-v0.1/discussions/4#651432a05d12b3abdd5d16bd
This one has different context management ("sliding window context"), so the trained context of 32768 is going to result in a wrong user experience. The window size should be 4096. |
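If it helps, a small sketch for checking this from the model's HF config.json (assuming it exposes the sliding_window and max_position_embeddings fields, as Mistral's config does; the local path is hypothetical):

```python
import json

# Hypothetical local path to the downloaded HF model
with open("Mistral-7B-v0.1/config.json") as f:
    cfg = json.load(f)

trained_ctx = cfg.get("max_position_embeddings")  # e.g. 32768
window = cfg.get("sliding_window")                # e.g. 4096

# Until llama.cpp supports sliding window attention, the safe context
# to run with is the window size, not the full trained context.
print(f"trained context: {trained_ctx}, window / safe context: {window}")
```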
|
@TheBloke how did you convert Mistral-7B-v0.1 ? |
|
I applied the GQA fix from this PR, and then I deleted added_tokens.json. Then I just ran convert.py as normal. Same with Mistral-7B-Instruct-v0.1, except I didn't need to delete added_tokens.json there, so I guess they realised it wasn't meant to be there. |
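In case anyone wants to reproduce this, the workflow is roughly the following sketch (the local path is hypothetical; it just removes the stray added_tokens.json and invokes convert.py on the model directory):

```python
import os
import subprocess

# Hypothetical local checkout of the HF model
model_dir = "Mistral-7B-v0.1"

# The base model shipped an added_tokens.json that was deleted before
# converting; the Instruct variant did not need this step.
extra = os.path.join(model_dir, "added_tokens.json")
if os.path.exists(extra):
    os.remove(extra)

# Run llama.cpp's convert.py (with the GQA fix from this PR applied)
subprocess.run(["python", "convert.py", model_dir], check=True)
```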
|
Oh, that easy... can you add a note that llama.cpp does not currently perform sliding window context, and that max context should be set to 4096? |
|
OK sure. Someone on the other thread said it seemed to work at 8192? But I'll say it's not yet supported |
this might be just like llama2 where, contrary to llama1, it does not immediately deteriorate when going past the trained size. |
|
from #3362
so you used this pr? |
|
Yes, changing `//=` to `=`. Before I applied that, the GGUFs produced gibberish after a few words |
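For reference, a sketch of what the fixed function looks like (paraphrased from convert.py around the time of this PR; treat it as illustrative rather than the exact upstream code):

```python
import numpy as np

def permute(weights: np.ndarray, n_head: int, n_head_kv: int) -> np.ndarray:
    # Buggy version:  n_head //= n_head_kv
    # Fixed version:  n_head = n_head_kv
    # The two only agree when n_head // n_head_kv == n_head_kv (e.g. Llama 2 70B).
    if n_head_kv is not None and n_head != n_head_kv:
        n_head = n_head_kv
    return (weights.reshape(n_head, 2, weights.shape[0] // n_head // 2, *weights.shape[1:])
                   .swapaxes(1, 2)
                   .reshape(weights.shape))
```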
…example

* 'master' of github.com:ggerganov/llama.cpp:
  convert : remove bug in convert.py permute function (ggml-org#3364)
  make-ggml.py : compatibility with more models and GGUF (ggml-org#3290)
  gguf : fix a few general keys (ggml-org#3341)
  metal : reusing llama.cpp logging (ggml-org#3152)
  build : add ACCELERATE_NEW_LAPACK to fix warning on macOS Sonoma (ggml-org#3342)
  readme : add some recent perplexity and bpw measurements to READMES, link for k-quants (ggml-org#3340)
  cmake : fix build-info.h on MSVC (ggml-org#3309)
  docs : Fix typo CLBlast_DIR var. (ggml-org#3330)
  nix : add cuda, use a symlinked toolkit for cmake (ggml-org#3202)
This bug will only be triggered by HuggingFace GQA models. Nobody realized it because:

* we never used convert.py to convert the HF Llama 2 70B model
* Llama 2 70B has 64 heads and 8 num_key_value_heads, and 64 / 8 = 8, so the buggy and the fixed code happen to produce the same result

This bug has caused models from the [TinyLlama](https://github.com/jzhang38/TinyLlama) project to not convert correctly. (TinyLlama is a 1.1B model that uses GQA.)
jzhang38/TinyLlama#24
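The contrast with TinyLlama makes the trigger condition concrete. A minimal check, assuming TinyLlama 1.1B uses 32 attention heads and 4 key-value heads (an assumption based on its published config, not stated in this thread):

```python
# TinyLlama 1.1B head configuration (assumed: 32 heads, 4 key-value heads)
n_head, n_head_kv = 32, 4

buggy_n_head = n_head // n_head_kv  # old code: n_head //= n_head_kv -> 8
fixed_n_head = n_head_kv            # fixed code: n_head = n_head_kv -> 4

# 8 != 4, so the old code permuted the K tensor with the wrong head count,
# which is why converted TinyLlama models came out broken.
assert buggy_n_head != fixed_n_head
```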