Closed
Labels
bug, duplicate, enhancement
Description
I have found that when a prompt contains a Unicode emoji character such as
Unicode Character "👍" (U+1F44D)
the prompt breaks up.
I'm reading a sample prompt from a text file:
cat prompt
Tweet: "I hate it when my phone battery dies."
Sentiment: Negative
###
Tweet: "My day has been 👍"
Sentiment: Positive
###
Tweet: "This is the link to the article"
Sentiment: Neutral
###
Tweet: "This new music video was incredibile"
Sentiment:
Looking at the logs, I can see that the tokenizer does in fact break at the (U+1F44D) character:
(base)$ p=$(cat prompt); ./main -m ./models/13B/ggml-model-q4_0.bin -p $p -t 4 -n 512
main: seed = 1678656464
llama_model_load: loading model from './models/13B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 5120
llama_model_load: n_mult = 256
llama_model_load: n_head = 40
llama_model_load: n_layer = 40
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 13824
llama_model_load: n_parts = 2
llama_model_load: ggml ctx size = 8559.49 MB
llama_model_load: memory_size = 800.00 MB, n_mem = 20480
llama_model_load: loading model part 1/2 from './models/13B/ggml-model-q4_0.bin'
llama_model_load: ............................................. done
llama_model_load: model size = 3880.49 MB / num tensors = 363
llama_model_load: loading model part 2/2 from './models/13B/ggml-model-q4_0.bin.1'
llama_model_load: ............................................. done
llama_model_load: model size = 3880.49 MB / num tensors = 363
main: prompt: 'Tweet: "I hate it when my phone battery dies."
Sentiment: Negative
###
Tweet: "My day has been 👍"
Sentiment: Positive
###
Tweet: "This is the link to the article"
Sentiment: Neutral
###
Tweet: "This new music video was incredibile"
Sentiment:'
main: number of tokens in prompt = 36
1 -> ''
27418 -> 'Tw'
3905 -> 'ee'
29873 -> 't'
29901 -> ':'
376 -> ' "'
29902 -> 'I'
26277 -> ' hate'
372 -> ' it'
746 -> ' when'
590 -> ' my'
9008 -> ' phone'
16988 -> ' battery'
2977 -> ' dies'
1213 -> '."'
13 -> '
'
2008 -> 'Se'
593 -> 'nt'
2073 -> 'iment'
29901 -> ':'
12610 -> ' Neg'
1230 -> 'ative'
13 -> '
'
2277 -> '##'
29937 -> '#'
13 -> '
'
27418 -> 'Tw'
3905 -> 'ee'
29873 -> 't'
29901 -> ':'
376 -> ' "'
3421 -> 'My'
2462 -> ' day'
756 -> ' has'
1063 -> ' been'
29871 -> ' '
sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000
Tweet: "I hate it when my phone battery dies."
Sentiment: Negative
###
Tweet: "My day has been 10 times better than yesterday. Now I have to sleep again..."
Sentiment: Neutral
###
Twitter is not about talking; Twitter is a social network for listening and responding instantly, as the tweets of Steve Jobs demonstrate well in Figure A-2 (page ). Just be sure you can interpret the information accurately. If the sentiment isn't clearly positive or negativeโas^C
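Tokenization stops at the emoji (only 36 tokens are produced, ending just before U+1F44D), so the rest of the prompt is dropped, resulting in a broken input prompt.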
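To illustrate the symptom, here is a minimal Python sketch, not the project's actual tokenizer. It assumes a toy greedy longest-match tokenizer over a tiny hypothetical vocabulary (`TOY_VOCAB`, `tokenize_stop_on_unknown` and `tokenize_with_byte_fallback` are made-up names). The first variant gives up at the first character it cannot match, which reproduces the truncation seen in the log above. The second emits one `<0xNN>` piece per UTF-8 byte of the unmatched character, a SentencePiece-style byte fallback; whether that is the right fix here is an assumption.

```python
# Toy illustration only: the vocabulary and both functions are hypothetical
# and are NOT taken from the project's source code.
TOY_VOCAB = ["My", " day", " has", " been", " ", '"', "\n"]


def longest_match(text, pos, vocab):
    """Return the longest vocabulary entry that starts at text[pos], or None."""
    return max((v for v in vocab if text.startswith(v, pos)), key=len, default=None)


def tokenize_stop_on_unknown(text, vocab):
    """Greedy longest-match that stops at the first unmatched character."""
    pieces, pos = [], 0
    while pos < len(text):
        match = longest_match(text, pos, vocab)
        if match is None:
            break  # everything from here on is silently dropped
        pieces.append(match)
        pos += len(match)
    return pieces


def tokenize_with_byte_fallback(text, vocab):
    """Greedy longest-match that falls back to one piece per UTF-8 byte."""
    pieces, pos = [], 0
    while pos < len(text):
        match = longest_match(text, pos, vocab)
        if match is None:
            # U+1F44D encodes to four UTF-8 bytes: F0 9F 91 8D
            pieces.extend(f"<0x{b:02X}>" for b in text[pos].encode("utf-8"))
            pos += 1
        else:
            pieces.append(match)
            pos += len(match)
    return pieces


line = 'My day has been \U0001F44D"\n'
truncated = tokenize_stop_on_unknown(line, TOY_VOCAB)
complete = tokenize_with_byte_fallback(line, TOY_VOCAB)
print(truncated)
print(complete)
```

In this sketch the first call ends with the lone `' '` piece, just like the last token (29871) in the log, while the second call keeps the closing quote and newline that follow the emoji.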