[Entrypoint][Model-Gateway] Add GetTokenizer endpoint for gRPC server and load Tokenizer to registry from temp zip for model gateway #12407

Closed

YouNeedCryDear wants to merge 1 commit into sgl-project:main from YouNeedCryDear:grpc-server-get-tokenizer

Conversation

@YouNeedCryDear (Contributor) commented Oct 30, 2025

Summary

This PR adds a new GetTokenizer endpoint to the gRPC server, enabling clients to retrieve tokenizer information. It also adds a new GetTokenizer gRPC method to the SGLang scheduler service, enabling the model gateway to dynamically fetch tokenizer files from the backend via streaming and load them into the tokenizer registry. This eliminates the need for static tokenizer configuration or local access to tokenizer files in the model gateway, and it ensures the router always uses the same tokenizer as the backend.

Motivation

Currently, the SGLang model gateway requires local access to the tokenizer files. This creates several challenges:

  • Configuration complexity: Users must manually ensure the model gateway has access to the same tokenizer files as the backend model
  • Deployment overhead: Tokenizer files must be distributed separately from the backend
  • Dynamic model support: The model gateway cannot adapt to backend model changes without reconfiguration

By enabling dynamic tokenizer streaming, the router can automatically obtain the correct tokenizer from the backend, simplifying deployment and ensuring consistency.

The GetTokenizer endpoint will be used by the SGLang model gateway when inference gateway mode (multi-model) is turned on. The router has no knowledge of which tokenizer should be used at startup, so it fetches the tokenizer at runtime when workers are added to the registry.

Changes

On SGLang Server:

  • Added GetTokenizer RPC definition to sglang_scheduler.proto
  • Implemented GetTokenizer endpoint in grpc_server.py with streaming support
  • Generated updated protocol buffer bindings (pb2.py, pb2.pyi, pb2_grpc.py)
  • Added comprehensive test coverage in test_tokenizer_stream.py

On SGLang Router:

  • Added new protobuf messages for tokenizer streaming
  • Added get_tokenizer() method to SglangSchedulerClient
  • Added logic to load the tokenizer from the streamed zip bundle into the registry during worker registration (a sketch of the bundle-and-chunk scheme follows)
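
As a rough illustration of the streaming scheme, here is a minimal, self-contained Python sketch of the bundle-and-chunk flow. The file list and the 2 MiB chunk size are assumptions inferred from the output below; the actual implementation lives in grpc_server.py and the router's SglangSchedulerClient.

    import io
    import os
    import zipfile
    from typing import Iterator, Tuple

    CHUNK_SIZE = 2 * 1024 * 1024  # 2 MiB per chunk, matching the chunk sizes in the output below


    def bundle_tokenizer(tokenizer_dir: str) -> bytes:
        """Zip tokenizer-related files from a directory into an in-memory archive."""
        buf = io.BytesIO()
        with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
            for name in ("tokenizer.json", "tokenizer_config.json", "chat_template.jinja"):
                path = os.path.join(tokenizer_dir, name)
                if os.path.exists(path):
                    zf.write(path, arcname=name)
        return buf.getvalue()


    def stream_chunks(data: bytes) -> Iterator[Tuple[bytes, bool]]:
        """Yield (chunk, is_last) pairs, mirroring how the file-chunk messages are streamed."""
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset : offset + CHUNK_SIZE]
            yield chunk, offset + CHUNK_SIZE >= len(data)

Streaming the archive in fixed-size chunks keeps each individual gRPC message well under gRPC's default 4 MiB message-size limit, regardless of the total bundle size.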

Output

sglang git:(grpc-server-get-tokenizer) ✗ python3 ./test/srt/entrypoints/grpc_server/test_get_tokenizer.py
Sending GetTokenizer request...
Received metadata: meta.llama-3-8b
Received file chunk: tokenizer_bundle.zip, size: 2097152
Received file chunk: tokenizer_bundle.zip, size: 237091
Total zip size: 2334243 bytes
Extracting to /tmp/tmpm4e0l07n
Listing extracted files:
 - tokenizer_config.json
 - tokenizer.json
 - tokenizer.zip
Loading tokenizer with transformers...
Tokenizer loaded successfully!
Test text: Hello, world! This is a test.
Encoded: [128000, 9906, 11, 1917, 0, 1115, 374, 264, 1296, 13]
Decoded: <|begin_of_text|>Hello, world! This is a test.
2026-01-08 22:06:31  INFO smg::core::steps::worker::shared::activate: src/core/steps/worker/shared/activate.rs:27: Activated 1 worker(s) (marked as healthy)
2026-01-08 22:06:31  INFO smg::workflow::event: src/workflow/event.rs:131: Step succeeded instance_id=06e7a708-2146-470d-bf17-7bf768b84200 step_id=activate_workers duration_ms=0
2026-01-08 22:06:31 DEBUG smg::core::steps::worker::local::register_tokenizer: src/core/steps/worker/local/register_tokenizer.rs:41: Registering tokenizer for model gpt-oss-20b from /models/openai/gpt-oss-20b
2026-01-08 22:06:31 DEBUG smg::policies::registry: src/policies/registry.rs:61: Worker added for model gpt-oss-20b, count: 1
2026-01-08 22:06:31 DEBUG smg::workflow::engine: src/workflow/engine.rs:353: Step completed step_id=activate_workers result=Success
2026-01-08 22:06:31 DEBUG smg::tokenizer::registry: src/tokenizer/registry.rs:103: Tokenizer cache miss for name: gpt-oss-20b
2026-01-08 22:06:31 DEBUG smg::policies::registry: src/policies/registry.rs:156: Using default policy for model gpt-oss-20b
2026-01-08 22:06:31  INFO smg::policies::registry: src/policies/registry.rs:77: Assigning policy cache_aware to new model gpt-oss-20b
2026-01-08 22:06:31  INFO smg::tokenizer::registry: src/tokenizer/registry.rs:121: Loading tokenizer 'gpt-oss-20b' from source: /models/openai/gpt-oss-20b
2026-01-08 22:06:31 DEBUG smg::policies::registry: src/policies/registry.rs:272: Initializing cache-aware policy with 1 workers for model gpt-oss-20b
2026-01-08 22:06:31  INFO smg::core::steps::worker::local::register_tokenizer: src/core/steps/worker/local/register_tokenizer.rs:87: Fetching tokenizer from worker: grpc://127.0.0.1:18080
2026-01-08 22:06:31  INFO smg::core::worker: src/core/worker.rs:761: Lazily initializing gRPC client (sglang) for worker: grpc://127.0.0.1:18080
2026-01-08 22:06:31 DEBUG smg::grpc_client::sglang_scheduler: src/grpc_client/sglang_scheduler.rs:138: Connecting to SGLang scheduler at grpc://127.0.0.1:18080
2026-01-08 22:06:31 DEBUG smg::core::steps::worker::shared::update_policies: src/core/steps/worker/shared/update_policies.rs:133: Updated policies for 1 workers across 1 models
2026-01-08 22:06:31  INFO smg::workflow::event: src/workflow/event.rs:131: Step succeeded instance_id=06e7a708-2146-470d-bf17-7bf768b84200 step_id=update_policies duration_ms=0
2026-01-08 22:06:31 DEBUG smg::workflow::engine: src/workflow/engine.rs:353: Step completed step_id=update_policies result=Success
2026-01-08 22:06:31  INFO smg::core::worker: src/core/worker.rs:768: Successfully connected gRPC client (sglang) for worker: grpc://127.0.0.1:18080
2026-01-08 22:06:31 DEBUG smg::grpc_client::sglang_scheduler: src/grpc_client/sglang_scheduler.rs:303: Requesting tokenizer from backend
2026-01-08 22:06:31 DEBUG smg::grpc_client::sglang_scheduler: src/grpc_client/sglang_scheduler.rs:317: Received tokenizer metadata: model=gpt-oss-20b, fingerprint=04b544516abc8dde49f9755b38e7b740c9041aa13f5f5613ecd0fb6d8443af0b, files=3, format=zip
2026-01-08 22:06:31 DEBUG smg::grpc_client::sglang_scheduler: src/grpc_client/sglang_scheduler.rs:339: Starting to receive file chunks
2026-01-08 22:06:31 DEBUG smg::grpc_client::sglang_scheduler: src/grpc_client/sglang_scheduler.rs:352: Received file chunk 0: 2097152 bytes, is_last=false
2026-01-08 22:06:31 DEBUG smg::grpc_client::sglang_scheduler: src/grpc_client/sglang_scheduler.rs:352: Received file chunk 1: 2097152 bytes, is_last=false
2026-01-08 22:06:31 DEBUG smg::grpc_client::sglang_scheduler: src/grpc_client/sglang_scheduler.rs:352: Received file chunk 2: 386192 bytes, is_last=true
2026-01-08 22:06:31 DEBUG smg::grpc_client::sglang_scheduler: src/grpc_client/sglang_scheduler.rs:365: Last chunk received, total data size: 4580496 bytes
2026-01-08 22:06:31 DEBUG smg::grpc_client::sglang_scheduler: src/grpc_client/sglang_scheduler.rs:400: Tokenizer bundle download complete
2026-01-08 22:06:31  INFO smg::core::steps::worker::local::register_tokenizer: src/core/steps/worker/local/register_tokenizer.rs:113: Tokenizer extracted to temporary path: /tmp/.tmph5xhMK
2026-01-08 22:06:31 DEBUG smg::tokenizer::factory: src/tokenizer/factory.rs:221: Auto-discovered chat template in '/tmp/.tmph5xhMK': /tmp/.tmph5xhMK/chat_template.jinja
2026-01-08 22:06:33  INFO smg::tokenizer::registry: src/tokenizer/registry.rs:141: Successfully registered tokenizer 'gpt-oss-20b' with id: accb4650-06e3-493e-b5a8-e2934b0f1ea9
2026-01-08 22:06:33 DEBUG smg::core::steps::worker::local::register_tokenizer: src/core/steps/worker/local/register_tokenizer.rs:129: Successfully registered tokenizer for model gpt-oss-20b from /models/openai/gpt-oss-20b
2026-01-08 22:06:33  INFO smg::workflow::event: src/workflow/event.rs:131: Step succeeded instance_id=06e7a708-2146-470d-bf17-7bf768b84200 step_id=register_tokenizer duration_ms=2289
2026-01-08 22:06:33 DEBUG smg::workflow::engine: src/workflow/engine.rs:353: Step completed step_id=register_tokenizer result=Success
2026-01-08 22:06:33  INFO smg::workflow::event: src/workflow/event.rs:170: Workflow completed instance_id=06e7a708-2146-470d-bf17-7bf768b84200 duration_ms=3296
2026-01-08 22:06:33 DEBUG smg::core::job_queue: src/core/job_queue.rs:872: Completed job: type=AddWorker, worker=grpc://127.0.0.1:18080, result=Worker grpc://127.0.0.1:18080 registered and activated successfully via workflow
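
For reference, the client side of the test above reduces to roughly the following sketch (not the actual test code; load_tokenizer_from_chunks is a hypothetical helper, and AutoTokenizer from Hugging Face transformers is what the output shows being used to validate the bundle):

    import os
    import tempfile
    import zipfile

    from transformers import AutoTokenizer


    def load_tokenizer_from_chunks(chunks):
        """Reassemble streamed zip chunks, extract the bundle, and load the tokenizer."""
        data = b"".join(chunks)  # payloads of the streamed file-chunk messages
        tmp_dir = tempfile.mkdtemp()
        zip_path = os.path.join(tmp_dir, "tokenizer_bundle.zip")
        with open(zip_path, "wb") as f:
            f.write(data)
        with zipfile.ZipFile(zip_path) as zf:
            zf.extractall(tmp_dir)  # yields tokenizer.json, tokenizer_config.json, ...
        return AutoTokenizer.from_pretrained(tmp_dir)

Calling .encode("Hello, world! This is a test.") on the returned tokenizer would then reproduce the token IDs shown in the output above.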

@gemini-code-assist (Contributor) commented

Summary of Changes

Hello @YouNeedCryDear, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the gRPC server's capabilities by introducing a new GetTokenizer endpoint. This endpoint allows remote clients to efficiently retrieve tokenizer-related files, such as tokenizer.json and chat templates, directly from the server. The implementation supports streaming of these artifacts, bundled as a zip file, ensuring that clients can obtain the necessary tokenizer configuration to accurately process model inputs. This change streamlines the deployment and synchronization of tokenizer assets across distributed systems.

Highlights

  • New gRPC Endpoint: Introduced a GetTokenizer RPC endpoint to the gRPC server, allowing clients to request tokenizer artifacts.
  • Tokenizer Streaming Support: Implemented streaming functionality for the GetTokenizer endpoint, enabling efficient transfer of tokenizer files in chunks.
  • Dynamic Tokenizer Bundling: The server can now dynamically locate and bundle tokenizer-related files (e.g., tokenizer.json, tokenizer_config.json, chat templates) into a zip archive for client retrieval.
  • Protocol Buffer Updates: Updated the .proto definition and regenerated Python bindings to include new message types (GetTokenizerRequest, GetTokenizerChunk, TokenizerMetadata, TokenizerFileDescriptor, TokenizerFileChunk) and the GetTokenizer service method.
  • Comprehensive Testing: Added a new dedicated test file (test_tokenizer_stream.py) to validate the end-to-end functionality of the tokenizer streaming feature, including file bundling and integrity checks.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a new GetTokenizer gRPC endpoint to stream tokenizer artifacts. The implementation is well-structured, including bundling tokenizer files into a zip archive, streaming it in chunks, and providing comprehensive tests for the new functionality. The protobuf definitions are clear and follow good practices for streaming data.

My review includes two suggestions for improvement. One addresses a potential non-determinism issue in how chat templates are resolved, which could lead to incorrect behavior. The other is a maintainability suggestion to use a dataclass instead of a tuple for storing file information, which would make the code more readable and robust against future changes.

Comment thread python/sglang/srt/entrypoints/grpc_server.py
Comment thread python/sglang/srt/entrypoints/grpc_server.py
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment
💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment thread python/sglang/srt/entrypoints/grpc_server.py
@YouNeedCryDear YouNeedCryDear changed the title Add GetTokenizer endpoint for gRPC server [Entrypoint] Add GetTokenizer endpoint for gRPC server Oct 31, 2025
@CatherineSue (Collaborator) commented

Could we merge #12408 into this PR?

Both have proto changes, and it is hard to compare the changes.

@CatherineSue (Collaborator) left a comment

I'm a bit unclear on this requirement. Why do we need the grpc server to stream back tokenizer files? We already have tokenizer_path in GetModelInfo. This raises concerns about what kinds of files we are streaming back and the latency.

The tokenizer crate doesn't take care of multimodal files and chat templates. These will be required for correct request processing in grpc.

I feel all the problems we face here can be solved by adding a TokenizerRegistry.

@YouNeedCryDear (Contributor, Author) commented

> I'm a bit unclear on this requirement. Why do we need the grpc server to stream back tokenizer files? We already have tokenizer_path in GetModelInfo. This raises concerns about what kinds of files we are streaming back and the latency.
>
> The tokenizer crate doesn't take care of multimodal files and chat templates. These will be required for correct request processing in grpc.
>
> I feel all the problems we face here can be solved by adding a TokenizerRegistry.

@CatherineSue If the router is running on a different node than the backend workers, then the tokenizer_path will not be accessible.

@fzyzcjy (Collaborator) commented Dec 18, 2025

Hi, are there any updates on the compatibility of "router + gRPC + service discovery (ome)"? I am happy to work on it quickly, but since there is already an ongoing PR, I don't feel good about doing so and need to wait for the existing PR, so I am wondering whether there is an ETA for it.

@YouNeedCryDear YouNeedCryDear force-pushed the grpc-server-get-tokenizer branch 4 times, most recently from 4d0fed0 to 9f26e93 on December 26, 2025 22:54
@YouNeedCryDear YouNeedCryDear force-pushed the grpc-server-get-tokenizer branch 7 times, most recently from 46b227d to f529a9c on January 8, 2026 21:22
@github-actions github-actions Bot added the dependencies label (Pull requests that update a dependency file) Jan 8, 2026
@YouNeedCryDear YouNeedCryDear force-pushed the grpc-server-get-tokenizer branch from f529a9c to d249eeb on January 8, 2026 22:27
@YouNeedCryDear YouNeedCryDear changed the title [Entrypoint] Add GetTokenizer endpoint for gRPC server [Entrypoint] Add GetTokenizer endpoint for gRPC server and load Tokenizer to registry from temp zip Jan 8, 2026
@YouNeedCryDear YouNeedCryDear changed the title [Entrypoint] Add GetTokenizer endpoint for gRPC server and load Tokenizer to registry from temp zip [Entrypoint][Model-Gateway] Add GetTokenizer endpoint for gRPC server and load Tokenizer to registry from temp zip for model gateway Jan 8, 2026
@slin1237 (Collaborator) commented Jan 8, 2026

/tag-and-rerun-ci

@github-actions github-actions Bot added the run-ci label Jan 8, 2026
@YouNeedCryDear YouNeedCryDear force-pushed the grpc-server-get-tokenizer branch from d249eeb to b959827 on January 9, 2026 00:15
@YouNeedCryDear (Contributor, Author) commented

Excluded those two tests for tokenizer streaming and loading from the GPU-related CI.

@YouNeedCryDear YouNeedCryDear force-pushed the grpc-server-get-tokenizer branch 4 times, most recently from 9a5b981 to 507fad1 on January 13, 2026 23:18

Commit message: "... for grpc server; add get tokenizer method to sglang schedule in router; add tokenizer bundle loading from SGLang server"

@YouNeedCryDear YouNeedCryDear force-pushed the grpc-server-get-tokenizer branch from 507fad1 to 2e86770 on January 14, 2026 21:37
@slin1237 slin1237 closed this Feb 16, 2026
Labels

dependencies (Pull requests that update a dependency file), model-gateway, run-ci