[Entrypoint][Model-Gateway] Add GetTokenizer endpoint for gRPC server and load Tokenizer to registry from temp zip for model gateway (#12407)
Conversation
Summary of Changes: Hello @YouNeedCryDear, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed: this pull request significantly enhances the gRPC server's capabilities by introducing a new GetTokenizer endpoint.
Code Review
This pull request introduces a new GetTokenizer gRPC endpoint to stream tokenizer artifacts. The implementation is well-structured, including bundling tokenizer files into a zip archive, streaming it in chunks, and providing comprehensive tests for the new functionality. The protobuf definitions are clear and follow good practices for streaming data.
My review includes two suggestions for improvement. One addresses a potential non-determinism issue in how chat templates are resolved, which could lead to incorrect behavior. The other is a maintainability suggestion to use a dataclass instead of a tuple for storing file information, which would make the code more readable and robust against future changes.
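To make the second suggestion concrete, here is a minimal Python sketch of a dataclass for the bundled-file metadata. The field and function names are hypothetical, not taken from the PR; the sort also illustrates one way to make archive (and chat-template) ordering deterministic:

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenizerFileInfo:
    """Metadata for one file bundled into the tokenizer zip archive."""
    path: str        # absolute path on the backend node
    arcname: str     # name the file gets inside the zip
    size_bytes: int  # lets the client validate the assembled archive


def collect_files(named_paths: dict) -> list:
    # Sort by archive name so the bundle (and any chat-template
    # resolution that walks it) is deterministic across runs.
    return [
        TokenizerFileInfo(path=p, arcname=name, size_bytes=os.path.getsize(p))
        for name, p in sorted(named_paths.items())
    ]
```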
💡 Codex Review
Here are some automated review suggestions for this pull request.
Could we merge #12408 into this PR? Both have proto changes, and it is hard to compare them separately.
CatherineSue left a comment:
I'm a bit unclear about this requirement. Why do we need the gRPC server to stream back tokenizer files? We already have tokenizer_path in GetModelInfo. This raises concerns about what kinds of files we are streaming back and about the latency.
The tokenizer crate doesn't handle multimodal files and chat templates, which are required for correct request processing in gRPC.
I feel all the problems we face here can be solved by adding a TokenizerRegistry.
@CatherineSue If the router is running on a different node than the backend workers, then the tokenizer_path returned by GetModelInfo is not accessible from the router.
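For illustration, a TokenizerRegistry along the lines discussed above could look like the sketch below. This is written in Python purely for readability; the router component may be implemented in another language, and all names here are hypothetical:

```python
import threading


class TokenizerRegistry:
    """Hypothetical sketch: maps a model ID to one shared tokenizer."""

    def __init__(self, fetch_tokenizer):
        # fetch_tokenizer(model_id) -> tokenizer; injected so the registry
        # stays agnostic of the transport (local path or gRPC stream).
        self._fetch = fetch_tokenizer
        self._tokenizers = {}
        self._lock = threading.Lock()

    def get_or_load(self, model_id: str):
        # The first caller for a model triggers the fetch; later
        # callers for the same model reuse the loaded tokenizer.
        with self._lock:
            if model_id not in self._tokenizers:
                self._tokenizers[model_id] = self._fetch(model_id)
            return self._tokenizers[model_id]
```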
Hi, are there any updates on the compatibility of "router + gRPC + service discovery (ome)"? I am happy to work on it quickly, but since there is already an ongoing PR, I don't feel right doing so and would rather wait for the existing PR, so I'm wondering whether there is an ETA for it.
/tag-and-rerun-ci
Excluded those two tests for tokenizer streaming and loading from the GPU-related CI.
For the gRPC server: add a GetTokenizer method to the SGLang scheduler.
In the router: add tokenizer bundle loading from the SGLang server.
Summary
This PR adds a new GetTokenizer endpoint to the gRPC server, exposed as a GetTokenizer method on the SGLang scheduler service. It enables the model gateway to dynamically fetch tokenizer files from the backend via streaming and load them into the tokenizer registry. This eliminates the need for static tokenizer configuration or local access to tokenizer files in the model gateway, and ensures the router always uses the same tokenizer as the backend.
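As a rough illustration of the server-side flow described above, the sketch below bundles tokenizer files into an in-memory zip and yields fixed-size byte chunks. The chunk size, function name, and raw-bytes interface are assumptions; the real endpoint wraps each chunk in a gRPC response message:

```python
import io
import os
import zipfile

CHUNK_SIZE = 1 << 20  # 1 MiB per chunk; the PR's actual size is not shown here


def stream_tokenizer_zip(file_paths):
    """Bundle tokenizer files into an in-memory zip and yield byte chunks."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(file_paths):  # deterministic archive order
            zf.write(path, arcname=os.path.basename(path))
    buf.seek(0)
    # A real gRPC handler would wrap each chunk in a response message.
    while chunk := buf.read(CHUNK_SIZE):
        yield chunk
```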
Motivation
Currently, the SGLang model gateway requires local access to the tokenizer files. This creates several challenges:
Configuration complexity: Users must manually ensure the model gateway has access to tokenizer files matching the backend model
Deployment overhead: Tokenizer files must be distributed separately from the backend
Dynamic model support: The model gateway cannot adapt to backend model changes without reconfiguration
By enabling dynamic tokenizer streaming, the router can automatically obtain the correct tokenizer from the backend, simplifying deployment and ensuring consistency.
The GetTokenizer endpoint will be used by the SGLang model gateway when inference gateway mode (multi-model) is turned on. The router has no knowledge of which tokenizer should be used during startup; it fetches the tokenizer at runtime when workers are added into the registry, as sketched below.
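On the gateway side, the runtime fetch could look roughly like the following sketch: streamed chunks are written to a temporary zip, extracted, and the resulting tokenizer is registered. All names here are illustrative, not the PR's exact API:

```python
import tempfile
import zipfile
from pathlib import Path


def build_tokenizer(extract_dir: Path):
    # Placeholder: a real implementation would construct the tokenizer
    # from the extracted files (config, vocab, chat template, ...)
    # before the temporary directory is cleaned up.
    return str(extract_dir)


def load_tokenizer_from_stream(chunks, registry: dict, model_id: str):
    """Assemble streamed zip chunks into a temp file, extract, and register."""
    with tempfile.TemporaryDirectory() as tmp:
        zip_path = Path(tmp) / "tokenizer.zip"
        with open(zip_path, "wb") as f:
            for chunk in chunks:  # raw bytes received over the gRPC stream
                f.write(chunk)
        extract_dir = Path(tmp) / "tokenizer"
        with zipfile.ZipFile(zip_path) as zf:
            zf.extractall(extract_dir)
        registry[model_id] = build_tokenizer(extract_dir)
```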
Changes
On SGLang Server:
On SGLang Router:
Output