Add benchmark for deserializing large added vocab + optimizations #1782
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@ArthurZucker this PR is fantastic! I was facing the same issue of extremely slow loading of tokenizers with many AddedTokens. But the PR has been open for half a year; is there any chance it could be merged soon? It would be tremendously helpful. I can try to help make it happen, please let me know.
Hey! I am actually not sure it made anything faster, which might be why I did not pursue it further!

Does it help?
Hi @ArthurZucker, I'm sure it makes things much faster! I have a tokenizer with 160K tokens, of which 149K are special tokens, and the loading time with this branch is dramatically lower. It would be amazing to have this included in the next release.

Ok, I will finish it then! My bench might just have needed a bigger vocab!
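For context, a minimal way to reproduce the slow path is to build a tokenizer with a large added vocabulary, save it, and time deserialization. The sketch below uses the `tokenizers` Python bindings under assumptions of mine: the token counts just mirror the 160K/149K case reported above, and the file name and token strings (`big_added_vocab.json`, `<tok_i>`, `<special_i>`) are illustrative, not the benchmark actually added in this PR.

```python
import time

from tokenizers import Tokenizer
from tokenizers.models import WordLevel

# Base model with a trivial vocab; all the weight is in the added tokens.
tokenizer = Tokenizer(WordLevel(vocab={"[UNK]": 0}, unk_token="[UNK]"))

# Roughly mirror the reported case: 160K added tokens, 149K of them special.
tokenizer.add_tokens([f"<tok_{i}>" for i in range(11_000)])
tokenizer.add_special_tokens([f"<special_{i}>" for i in range(149_000)])

tokenizer.save("big_added_vocab.json")

# Time deserialization, the path this PR optimizes.
start = time.perf_counter()
Tokenizer.from_file("big_added_vocab.json")
print(f"load time: {time.perf_counter() - start:.2f}s")
```

Running this once on the main branch and once on this branch should make the difference visible without any real model weights involved.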
@ArthurZucker thanks for the great work on this and the pre-release 0.22.2rc0. Any chance you are going for a full release anytime soon? Normally it comes within a day or two of the pre-release, but not this time.
@ArthurZucker Sure, installing from source works, but PyPI only has the pre-release. It looks like something went wrong in the release CI, see here.
ah shit

I am stupid, sorry
