Tokenizer: Add native async bindings via pyo3-async-runtimes #1843
Merged
ArthurZucker merged 8 commits into huggingface:main on Aug 29, 2025
Conversation
Collaborator
Thanks for the PR, will have a look!
ArthurZucker approved these changes on Aug 28, 2025
Collaborator
ArthurZucker left a comment
Okay! I had to ask help from @McPatate as I am not super super familiar with all this!
- This is definitely something we want to address: if you have a big batch, or just one very long request, tokenizers will block the Python thread, which can be non-optimal.
- Let's also add async_encode, to showcase a good example of how we can do this in a non-batch manner.
- Can you detail the test a little bit, with a long_batch that has longer texts?
Otherwise happy to merge
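A sketch of what the requested long_batch test could look like. All names here are assumptions based on the review comment (the real async_encode signature is not shown in this capture), and a stub encoder stands in for the tokenizer so the snippet is self-contained:

```python
import asyncio

# Stub standing in for tokenizers' blocking encode; the async_encode
# name and shape are assumptions taken from the review comment above.
def encode(text):
    return text.split()

async def async_encode(text):
    # Offload the blocking encode so the event loop stays responsive.
    return await asyncio.get_running_loop().run_in_executor(None, encode, text)

def test_long_batch():
    # A "long_batch": few items, but each item is much longer text.
    long_batch = ["lorem ipsum " * 5000 for _ in range(4)]

    async def run():
        return await asyncio.gather(*(async_encode(t) for t in long_batch))

    results = asyncio.run(run())
    # Async results should match the sync path exactly.
    assert results == [encode(t) for t in long_batch]

test_long_batch()
```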
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Collaborator
@michaelfeil I took the liberty to commit my updates as I want to release today!
…eil/tokenizers into mf/add-async-tokenizer-bindings
force-pushed from 38045b6 to c4eb850
Collaborator
thanks @michaelfeil
This PR adds native async support via pyo3-async-runtimes bindings: https://github.com/PyO3/pyo3-async-runtimes
Why is this relevant:
This is mostly relevant for online-inference engines (vLLM, SGLang, TRT-LLM, ...) that have only one Python thread.
A common scenario is that a few users request very long inputs (e.g. 160k tokens), which typically take >0.5s to process.
One solution is to use the batch_encode() PyO3 API, which releases the GIL. Since the operation is still blocking, a single task would still starve the asyncio Python runtime, which has only one thread. Relief would come from using e.g. Ray workers or thread pools.
Quote: vLLM docs: https://docs.vllm.ai/en/v0.8.3/serving/openai_compatible_server.html
The ray dependency is much heavier than e.g. PyO3. For a small project of mine (github.com/michaelfeil/infinity), it seems like overkill.
Summary:
How is it implemented
I use the same approach that has been tested with PyO3 here: https://github.com/basetenlabs/truss/tree/main/baseten-performance-client/python_bindings.
The pyo3-async-runtimes runtime requires a Rust runtime to be initialized on a non-main thread. This is done via LazyInit.
https://github.com/PyO3/pyo3-async-runtimes
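The actual implementation is in Rust, but the pattern of a lazily initialized runtime living on a non-main thread can be illustrated in Python: a background event loop is started once, on first use, in its own thread. All names here are illustrative, not the PR's API:

```python
import asyncio
import threading

_loop = None
_lock = threading.Lock()

def get_runtime():
    # Lazily start one event loop on a dedicated non-main thread,
    # mirroring the LazyInit'ed Rust runtime described above.
    global _loop
    with _lock:
        if _loop is None:
            _loop = asyncio.new_event_loop()
            threading.Thread(target=_loop.run_forever, daemon=True).start()
    return _loop

def run_on_runtime(coro):
    # Submit a coroutine to the background loop from any thread.
    return asyncio.run_coroutine_threadsafe(coro, get_runtime()).result()

async def demo():
    return threading.current_thread() is not threading.main_thread()

print(run_on_runtime(demo()))  # True: the coroutine ran off the main thread
```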
Performance:
Performance is okay-ish: probably better than the current thread pools, but worse than the sync bindings. If you have many threads available and can contend for the GIL without running the GPU in the same pid, the sync approach is probably faster.
Below is an example of what you need to do in async inference engines:
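The original example code does not appear in this capture. A minimal sketch of the pattern in an async serving loop, with a stub standing in for the new native binding (the async_encode_batch name and handler shape are assumptions):

```python
import asyncio

# Stub standing in for the PR's native async binding; with the real
# bindings this await would not block the event loop during tokenization.
async def async_encode_batch(texts):
    await asyncio.sleep(0)  # yield to the loop, as the native call would
    return [t.split() for t in texts]

async def handle_request(texts):
    # In an inference server (vLLM, SGLang, ...) this handler shares one
    # Python thread with every other request; awaiting the tokenizer
    # instead of calling a blocking encode keeps other requests moving.
    ids = await async_encode_batch(texts)
    return {"n_sequences": len(ids)}

async def main():
    # Several concurrent "requests" interleave on the single event loop.
    results = await asyncio.gather(
        handle_request(["a very long prompt"]),
        handle_request(["another prompt", "and one more"]),
    )
    print(results)  # [{'n_sequences': 1}, {'n_sequences': 2}]

asyncio.run(main())
```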
#1797 cc @ArthurZucker