
Add benchmark for deserializing large added vocab + optimizations#1782

Closed
ArthurZucker wants to merge 20 commits into huggingface:main from
ArthurZucker:codex/optimize-addedvocabulary-deserialization-and-add-token-metho

Conversation

@ArthurZucker
Collaborator

No description provided.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@jannisborn

@ArthurZucker this PR is fantastic! I was facing the same issue of extremely slow loading of tokenizers with many AddedTokens. But the PR has been open for half a year; is there any chance it could be merged soon? It would be tremendously helpful. I can try to help make it happen, please let me know.

@ArthurZucker
Collaborator Author

Hey! I am actually not sure it made anything faster, which might be why I did not pursue it further!

@ArthurZucker
Collaborator Author

Does it help?

@jannisborn

Hi @ArthurZucker, I'm sure it makes things much faster! I have a tokenizer with 160K tokens, of which 149K are special tokens, and loading time with tokenizers==0.22.1 is around 3 minutes. With this PR, the same tokenizer loads in 3 seconds. To verify, I tokenized a test dataset of 200K samples with both versions and the results are identical down to the last token.
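The exact Rust-side change in this PR is not visible in the thread, but the minutes-to-seconds speedup is consistent with removing quadratic behavior from added-vocab deserialization. As a rough illustration only (function names are hypothetical, not the tokenizers API), here is a pure-Python sketch of the difference between inserting tokens one at a time with a rescan of the existing vocab versus building the vocab in a single deduplicated pass:

```python
# Illustrative sketch only -- models the complexity difference, not the
# actual tokenizers-rs implementation. Names below are hypothetical.

def add_token_slow(vocab, token):
    # Rescans all existing values on every insertion:
    # O(n) per token, O(n^2) for n tokens.
    if token not in vocab.values():
        vocab[len(vocab)] = token

def build_vocab_fast(tokens):
    # Deduplicates with a set and builds the id->token map in one pass:
    # O(n) total for n tokens.
    seen = set()
    vocab = {}
    for token in tokens:
        if token not in seen:
            seen.add(token)
            vocab[len(vocab)] = token
    return vocab

tokens = [f"<extra_{i}>" for i in range(1000)]

slow = {}
for t in tokens:
    add_token_slow(slow, t)

fast = build_vocab_fast(tokens)
assert slow == fast  # both paths yield the same vocab
```

With ~149K special tokens, the gap between the two strategies is roughly the same ratio as the reported 3 minutes vs. 3 seconds.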

It would be amazing to have this included in the next release

@ArthurZucker
Collaborator Author

Ok, I will finish it then! My benchmark might have just needed a bigger vocab!
On it!

@jannisborn

@ArthurZucker thanks for the great work on this and the pre-release 0.22.2rc0. Any chance you are going for a full release anytime soon? Normally it comes within a day or two of the pre-release, but not this time.


@jannisborn

@ArthurZucker Sure, installing from source works, but PyPI only has the pre-release. Looks like something went wrong in the release CI, see here.

@ArthurZucker
Collaborator Author

ah shit

@ArthurZucker
Collaborator Author

I am stupid sorry

