support bpe tokenizer in convert by ftgreat · Pull Request #2228 · ggml-org/llama.cpp

ftgreat · 2023-07-15T06:20:36Z

Our released Aquila models use a BPE tokenizer, so in convert.py we add one branch that preprocesses the BPE tokenizer vocab into the sentencepiece format, so that downstream modules such as inference or int4 quantization can be used. We have verified that all encoding ids stay the same and that other modules are not affected.

Could you please review this PR? Thanks.
Related issue: #2093

Signed-off-by: ldwang <ftgreat@gmail.com>
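
As a rough illustration of the preprocessing described above, the following is a minimal sketch, not the code from this PR. It assumes the vocab is a GPT-2 style `vocab.json` mapping token strings to ids, where each character stands for one byte. The function names and the rank-based score are hypothetical placeholders chosen for this example.

```python
from typing import Dict, Iterator, Tuple


def bytes_to_unicode() -> Dict[int, str]:
    """GPT-2 style table mapping each byte value (0-255) to a printable character."""
    # Bytes that are already printable keep their own code point.
    keep = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    table = {b: chr(b) for b in keep}
    # The remaining bytes are shifted to code points 256, 257, ...
    shift = 0
    for b in range(256):
        if b not in table:
            table[b] = chr(256 + shift)
            shift += 1
    return table


def bpe_vocab_to_tokens(vocab: Dict[str, int]) -> Iterator[Tuple[bytes, float]]:
    """Yield (raw token bytes, score) pairs in id order, sentencepiece style."""
    byte_decoder = {ch: b for b, ch in bytes_to_unicode().items()}
    ordered = sorted(vocab.items(), key=lambda kv: kv[1])
    for rank, (token, _token_id) in enumerate(ordered):
        if all(ch in byte_decoder for ch in token):
            raw = bytes(byte_decoder[ch] for ch in token)
        else:
            # Assumption: tokens outside the byte table are kept as UTF-8 text.
            raw = token.encode("utf-8")
        # Assumption: BPE vocabs carry no scores, so use a rank-based placeholder.
        yield raw, float(-rank)


if __name__ == "__main__":
    for raw, score in bpe_vocab_to_tokens({"a": 0, "Ġhello": 1}):
        print(raw, score)
```

In practice the dictionary would come from `json.load` on the tokenizer's `vocab.json`. Emitting tokens in id order is what keeps the encoding ids unchanged in this sketch.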

howard0su · 2023-07-17T12:16:00Z

Can you provide test instructions so that I can verify the change?

ftgreat · 2023-07-18T03:20:37Z

> Can you provide test instructions so that I can verify the change?

Instructions:
python convert.py models/7B --vocab-only --outfile models/aquila-vocab.bin --vocabtype bpe

Requirements:
Put vocab.json in the models directory. The file comes from the Aquila tokenizer: https://github.com/FlagAI-Open/FlagAI/blob/master/examples/Aquila/Aquila-tokenizer-hf/vocab.json
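
Before running the conversion, it can help to sanity-check the downloaded vocab file. The sketch below is an optional pre-check and not part of convert.py. It assumes `vocab.json` is a flat JSON object of token to id, and the helper names are hypothetical. Whether convert.py itself requires contiguous ids is an assumption here.

```python
import json
from typing import Dict, List


def load_vocab(path: str) -> Dict[str, int]:
    """Read a token -> id mapping from a JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def check_vocab(vocab: Dict[str, int]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the ids look sane."""
    problems: List[str] = []
    ids = sorted(vocab.values())
    if len(set(ids)) != len(ids):
        problems.append("duplicate ids")
    if ids != list(range(len(ids))):
        problems.append("ids are not contiguous from 0")
    return problems
```

For example, `check_vocab(load_vocab(path))` with `path` pointing at the downloaded vocab.json should return an empty list if every id from 0 to the vocab size minus one appears exactly once.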

klosax · 2023-07-19T22:46:15Z

Note: Using an llama model with gpt2 tokenizer will be fully supported in the new ggml file format. ggml-org/ggml#302

ftgreat · 2023-07-25T10:37:31Z

> Note: Using an llama model with gpt2 tokenizer will be fully supported in the new ggml file format. ggerganov/ggml#302

Could you please share the schedule for this support?
Also, how can we add our released models? Thanks.

ldwang added 2 commits July 15, 2023 14:12

support bpe tokenizer in convert

d7aab2e

Signed-off-by: ldwang <ftgreat@gmail.com>

support bpe tokenizer in convert

ee6bc14

Signed-off-by: ldwang <ftgreat@gmail.com>

ftgreat mentioned this pull request Jul 15, 2023

add support of Aquila 7B models #2093

Closed

support bpe tokenizer in convert, fix

64b8aaf

Signed-off-by: ldwang <ftgreat@gmail.com>

ggerganov approved these changes Jul 25, 2023

ggerganov merged commit fce48ca into ggml-org:master Jul 25, 2023

This was referenced Jul 27, 2023

Enable support more diverse tokenizers #2418

Closed

supporting more diverse tokenizers #2420

Merged

ftgreat mentioned this pull request Aug 2, 2023

support Aquila-7B model series #2487

Merged

klosax mentioned this pull request Aug 3, 2023

[User] Producing tokenizer.model from transformers tokenizers.json #2443

Closed

ggerganov merged 3 commits into ggml-org:master from ftgreat:master
