We could use std::unordered_map over std::map#305
Conversation
…d::map<id, token> id_to_token; to std::vector<token> id_to_token;
…token_to_id.size());
|
the token vector should prob be a struct now which also includes the score (see 074bea2) |
I did a commit merging with the new changes using struct, I am not sure about the names or the organization of the struct |
|
Are you able to measure the performance gains from these changes? Interested to see how much of an impact they have. Great work! |
Thanks! Sadly it's hard to do consistent tests, but what I got is: last commit (bd4b46d) + this pull request: (same command above) This differences should be more noticeable with larger datasets and more tokens, I am using the model 7B. |
there should not really be any. none of the code is particularly hot. |
utils.h
Outdated
| // Vocab utils | ||
| // | ||
|
|
||
| struct token_score { |
There was a problem hiding this comment.
this is confusingly named, same with token_t. the type is only used inside gpt_vocab, so why not nest it.
There was a problem hiding this comment.
also gpt_vocab is token_t already in this case
There was a problem hiding this comment.
hm, first thought was token_t, but that is too close to token, so, just leave it as token_score.
There was a problem hiding this comment.
This compiler version seems to not accept token token;
/home/runner/work/llama.cpp/llama.cpp/utils.h:68:15: error: declaration of ‘gpt_vocab::token gpt_vocab::token_score::token’ changes meaning of ‘token’ [-fpermissive]
68 | token token;
| ^~~~~
/home/runner/work/llama.cpp/llama.cpp/utils.h:65:11: note: ‘token’ declared here as ‘using token = std::string’
65 | using token = std::string;
Should I rename the using token = std::string; to token_t?
There was a problem hiding this comment.
The quickest and simplest fix to that would be to just rename the data member to tok.
|
Apologies for the conflicts - lets resolve and merge. Regarding the C++ standard question from the other thread: Overall, my experience tells me this is the better way - or at least it is better in my views and understandings. And if there ever appears a very good reason to bump the standard - we will do it. But at the moment, there is no good enough reason to do it. |
Thanks! I think I resolved the conflicts, if there are some problems I am happy to fix. |

If it is not necessary sorted maps, change std::map to std::unordered_map
std::unordered_map is a hash table so it should be faster than std::map when storing many items.
std::map<id, token> can be a std::vector since the vector index can be equal to the token id