remove bug in convert.py permute function #3364
Conversation
|
@TheBloke did you ever have issues with GQA 70B and hf models? |
|
No I haven't had issues with those for ages - not since the issues were fixed shortly after GGUF released. I've done loads in the last few weeks, all work fine. @jzhang38 it's not true to say that all 70B models come from Meta PTH weights. 99% of 70B conversions now are done from HF weights in pytorch_model.bin or model.safetensors format - because they're fine tuned models. Do you want me to test this updated script with a 70B HF model? I have one to convert in a minute actually |
yea please do, it's kind of hard to "just" convert one of those for me 😅 |
|
@jzhang38 I can indeed confirm that this fixes the converted tinyllama models 👍 |
|
@TheBloke Yeah, the actual reason would be that Llama 2 70B uses 64 heads and 8 key-value heads, which makes `n_head //= n_head_kv` (64 // 8 = 8) the same as `n_head = n_head_kv` (8). So the bug is not triggered. |
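To make the arithmetic concrete, here is a minimal check (head counts taken from the discussion above; the variable names mirror convert.py, but the snippet itself is illustrative rather than code from the repo):

```python
# Llama 2 70B head configuration (from the discussion above)
n_head, n_head_kv = 64, 8

buggy_n_head = n_head // n_head_kv  # old code: n_head //= n_head_kv -> 8
fixed_n_head = n_head_kv            # fixed code: n_head = n_head_kv -> 8

# Both paths land on 8, so the resulting permute is identical
# and the bug stays hidden for this particular model.
assert buggy_n_head == fixed_n_head == 8
```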
|
ran perplexity on the first 300 chunks (batch 512) of wikitext on the f32 and q8_0 models (GPU): it is safe to say that it works with this PR |
|
are there other GQA/MQA models we can test? |
|
The 70B Llama 2 model worked fine BTW |
Mistral 7B (#3362) seems to be GQA, but I don't know if there is an HF conversion already. |
I've noticed https://huggingface.co/PY007/TinyLlama-1.1B-Chat-v0.1/discussions/4#651432a05d12b3abdd5d16bd
This one has different context management ("sliding window context"), so the trained context of 32768 is going to result in a wrong user experience. The window size should be 4096. |
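If it helps, a small sketch for checking this from the model's HF config.json (assuming it exposes the sliding_window and max_position_embeddings fields, as Mistral's config does; the local path is hypothetical):

```python
import json

# Hypothetical local path to the downloaded HF model
with open("Mistral-7B-v0.1/config.json") as f:
    cfg = json.load(f)

trained_ctx = cfg.get("max_position_embeddings")  # e.g. 32768
window = cfg.get("sliding_window")                # e.g. 4096

# Until llama.cpp supports sliding window attention, the safe context
# to run with is the window size, not the full trained context.
print(f"trained context: {trained_ctx}, window / safe context: {window}")
```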
|
@TheBloke how did you convert Mistral-7B-v0.1 ? |
|
I applied the GQA fix from this PR, and then I deleted added_tokens.json. Then I just ran convert.py as normal. Same with Mistral-7B-Instruct-v0.1, except I didn't need to delete added_tokens.json there, so I guess they realised it wasn't meant to be there. |
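In case anyone wants to reproduce this, the workflow is roughly the following sketch (the local path is hypothetical; it just removes the stray added_tokens.json and invokes convert.py on the model directory):

```python
import os
import subprocess

# Hypothetical local checkout of the HF model
model_dir = "Mistral-7B-v0.1"

# The base model shipped an added_tokens.json that was deleted before
# converting; the Instruct variant did not need this step.
extra = os.path.join(model_dir, "added_tokens.json")
if os.path.exists(extra):
    os.remove(extra)

# Run llama.cpp's convert.py (with the GQA fix from this PR applied)
subprocess.run(["python", "convert.py", model_dir], check=True)
```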
|
Oh, that easy... can you add a note that llama.cpp does not currently perform sliding window context, and that max context should be set to 4096? |
|
OK sure. Someone on the other thread said it seemed to work at 8192? But I'll say it's not yet supported |
this might be just like llama2 where, contrary to llama1, it does not immediately deteriorate when going past the trained size. |
|
from #3362
so you used this pr? |
|
Yes, changing `//=` to `=`. Before I applied that, the GGUFs produced gibberish after a few words |
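For reference, a sketch of what the fixed function looks like (paraphrased from convert.py around the time of this PR; treat it as illustrative rather than the exact upstream code):

```python
import numpy as np

def permute(weights: np.ndarray, n_head: int, n_head_kv: int) -> np.ndarray:
    # Buggy version:  n_head //= n_head_kv
    # Fixed version:  n_head = n_head_kv
    # The two only agree when n_head // n_head_kv == n_head_kv (e.g. Llama 2 70B).
    if n_head_kv is not None and n_head != n_head_kv:
        n_head = n_head_kv
    return (weights.reshape(n_head, 2, weights.shape[0] // n_head // 2, *weights.shape[1:])
                   .swapaxes(1, 2)
                   .reshape(weights.shape))
```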
…example

* 'master' of github.com:ggerganov/llama.cpp:
  convert : remove bug in convert.py permute function (ggml-org#3364)
  make-ggml.py : compatibility with more models and GGUF (ggml-org#3290)
  gguf : fix a few general keys (ggml-org#3341)
  metal : reusing llama.cpp logging (ggml-org#3152)
  build : add ACCELERATE_NEW_LAPACK to fix warning on macOS Sonoma (ggml-org#3342)
  readme : add some recent perplexity and bpw measurements to READMES, link for k-quants (ggml-org#3340)
  cmake : fix build-info.h on MSVC (ggml-org#3309)
  docs : Fix typo CLBlast_DIR var. (ggml-org#3330)
  nix : add cuda, use a symlinked toolkit for cmake (ggml-org#3202)
This bug will only be triggered by HuggingFace GQA models. Nobody realized it because:

* we never used convert.py to convert the HF Llama 2 70B model
* Llama 2 70B has 64 heads and 8 num_key_value_heads, and 64 / 8 = 8, so the buggy and the fixed code happen to produce the same result

This bug has caused models from the [TinyLlama](https://github.com/jzhang38/TinyLlama) project to not convert correctly. (TinyLlama is a 1.1B model that uses GQA.)
jzhang38/TinyLlama#24
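The contrast with TinyLlama makes the trigger condition concrete. A minimal check, assuming TinyLlama 1.1B uses 32 attention heads and 4 key-value heads (an assumption based on its published config, not stated in this thread):

```python
# TinyLlama 1.1B head configuration (assumed: 32 heads, 4 key-value heads)
n_head, n_head_kv = 32, 4

buggy_n_head = n_head // n_head_kv  # old code: n_head //= n_head_kv -> 8
fixed_n_head = n_head_kv            # fixed code: n_head = n_head_kv -> 4

# 8 != 4, so the old code permuted the K tensor with the wrong head count,
# which is why converted TinyLlama models came out broken.
assert buggy_n_head != fixed_n_head
```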