Fix parsing single-byte UTF-8 tokens by manually parsing the protobuf #73
j-f1 wants to merge 2 commits into ggml-org:master
Conversation
|
Nice! You will still have to update the tokenizer in the C++ code quite a bit. I think this is a test prompt to verify it is working: 关于爱因斯坦的生平。他出生于 ("About Einstein's life. He was born in"). If not, you can try just this character as a prompt: 篇篇篇篇篇篇 (the character 篇, "chapter", repeated). |
|
You will also need to replace spaces in the input text with the Unicode underscore you used in the Python script, in order for it to find any token containing a space. |
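A minimal sketch of that input-side mapping (only needed if the vocab kept the raw SentencePiece pieces; the prompt string here is just an example, not from this PR):

    # SentencePiece represents a space as U+2581 (LOWER ONE EIGHTH BLOCK),
    # so a raw-piece vocab would require the same mapping on the prompt
    # before any token lookup.
    prompt = "Hello world"
    normalized = prompt.replace(" ", "\u2581")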
|
Seems fine to me? (Using Apple Terminal.) The token readout at the start is messed up as expected (since some of the tokens aren't valid UTF-8 strings), but that's fine IMO.
(screenshots)
Compare the 13B model (without my patch): |
|
Those underscore things are in the token file, so I’m replacing them with a regular space when constructing the ggml bin file. I don’t think the C++ code needs to be updated to handle that? |
|
Oh interesting, you're right, that's good. But beware of the input. What does the program report about your input prompt? Your input may be garbled, I would assume, because this code can't find the tokens, and garbled input can still result in a correct-looking output. |
|
See here:
The tokenizer in this code can only return one token per string, but you need multiple tokens for a string. Edit: maybe I'm wrong, wrong function!
Maybe it just works!? |
|
Please check the sequence of tokens. Using the tokenizer I get this, and yours should match (I also get garbled output at the start; it's a consequence of the other code there): |
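To get a reference token sequence to compare against, the sentencepiece Python package can be used directly (a sketch, assuming the original tokenizer.model file is at hand; not part of this PR):

    import sentencepiece as spm

    sp = spm.SentencePieceProcessor()
    sp.Load("tokenizer.model")  # path is an assumption
    # One string generally maps to several token ids:
    print(sp.EncodeAsIds("关于爱因斯坦的生平。他出生于"))
    print(sp.EncodeAsPieces("关于爱因斯坦的生平。他出生于"))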
|
Looks right: |
|
That's beautiful, ship it! But now I have to regenerate my models :( |
|
I think this might work to avoid using protobuf? (This assumes `tokenizer` is a loaded sentencepiece `SentencePieceProcessor` and `fout` is the output file.)

    import struct
    import sys

    for i in range(32000):
        if tokenizer.is_unknown(i):
            # "<unk>" token, written as " ⁇ " (U+2047 DOUBLE QUESTION MARK)
            text = " \u2047 ".encode("utf-8")
            fout.write(struct.pack("i", len(text)))
            fout.write(text)
        elif tokenizer.is_control(i):
            # "<s>"/"</s>" tokens: write a zero-length entry
            fout.write(struct.pack("i", 0))
        elif tokenizer.is_byte(i):
            # byte-fallback tokens such as "<0x0A>" (may be invalid UTF-8 on their own)
            piece = tokenizer.id_to_piece(i)
            if len(piece) != 6:
                print("Invalid token: " + piece)
                sys.exit(1)
            byte_value = int(piece[3:-1], 16)  # parse the two hex digits
            fout.write(struct.pack("i", 1))
            fout.write(struct.pack("B", byte_value))
        else:
            # normal token; SentencePiece uses U+2581 (LOWER ONE EIGHTH BLOCK) for spaces
            text = tokenizer.id_to_piece(i).replace("\u2581", " ").encode("utf-8")
            fout.write(struct.pack("i", len(text)))
            fout.write(text)

I can see that it writes the correct bytes, but my terminal has a hard time handling them for some reason. |
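For reference, the entries the snippet writes are length-prefixed, so they can be read back with a loop like this (a sketch, not code from this PR; it uses the same native-endian struct format as above):

    import struct

    def read_token(f):
        # Each entry is a 4-byte length followed by that many bytes;
        # byte-fallback tokens come back as a single (possibly invalid) UTF-8 byte.
        (n,) = struct.unpack("i", f.read(4))
        return f.read(n)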
|
@kharvd yes, that is true. I'm somewhat confused, because sentencepiece uses protobuf. Maybe the C++ version compiled into the Python wheel has it built in, which made it impossible to use as a subpackage of sentencepiece? Either way, I think that approach also works. But it's ironic: we don't want to use the sentencepiece C++ library, so instead we will require the sentencepiece Python library 😅 Here is the PR to include the sentencepiece C++ library. Happy to close it if we merge this masterpiece. But some questions remain, such as: this model is no longer portable to the webui etc. If we used the C++ version, we could have portable model files floating around between the two projects, I think. |
|
Oh yeah, I figured out why my terminal still made weird characters: it's the |
|
Sample output: |
|
@kharvd, what model are you using there? The Google Translate of your output appears to be gibberish. I think we need a translator :) Here are some examples from me with the 16B model: 关于爱因斯坦的生平。他出生于1856年，是一位欧洲科学家和教育家。他在1902年获得了诺基丛大学院士学位。 (roughly: "About Einstein's life. He was born in 1856 and was a European scientist and educator. In 1902 he received an academician degree from Nuojicong University.") I don't have 7B ready at the moment, but it shouldn't be that bad, I would think? Here is the Google Translate of your output: |
|
This is 7B |
|
Ah, no worries. With your settings I also get gibberish at 16B.
Try
|
|
Here's 13B with default parameters: "About Einstein's life. Born in 1856, he was a German chemist, astronomer and thermostat researcher. Found in high-flying aircraft carriers in the early 20th century, Einstein used" |
|
Closing, #79 is better. |
|
Everything seems to be working fine after regenerating and requantizing the 7B model!
There may still be issues with printing the tokens; my quantization step hasn't finished yet, so I haven't tested the updated models. |
|
I decided to vendor the protobuf file (and the .py file generated via protoc --python_out=. sentencepiece_model.proto), since they are very unlikely to change and so that the install process can remain simple. |
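For context, here is a sketch of how the vendored, generated module can be used to read the tokenizer model (the module name follows protoc's output convention; the field names come from sentencepiece_model.proto):

    import sentencepiece_model_pb2 as model

    # Parse tokenizer.model using only the vendored generated bindings,
    # without the sentencepiece runtime itself.
    m = model.ModelProto()
    with open("tokenizer.model", "rb") as f:
        m.ParseFromString(f.read())
    for p in m.pieces:
        print(p.piece, p.score, p.type)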