[Entrypoint][Model-Gateway] Add GetTokenizer endpoint for gRPC server and load Tokenizer to registry from temp zip for model gateway (#12407)
Conversation
Summary of Changes: Hello @YouNeedCryDear, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed: this pull request significantly enhances the gRPC server's capabilities by introducing a new GetTokenizer endpoint.
Code Review
This pull request introduces a new GetTokenizer gRPC endpoint to stream tokenizer artifacts. The implementation is well-structured, including bundling tokenizer files into a zip archive, streaming it in chunks, and providing comprehensive tests for the new functionality. The protobuf definitions are clear and follow good practices for streaming data.
My review includes two suggestions for improvement. One addresses a potential non-determinism issue in how chat templates are resolved, which could lead to incorrect behavior. The other is a maintainability suggestion to use a dataclass instead of a tuple for storing file information, which would make the code more readable and robust against future changes.
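To make the second suggestion concrete, here is a minimal Python sketch of a dataclass for the bundled-file metadata. The field and function names are hypothetical, not taken from the PR; the sort also illustrates one way to make archive (and chat-template) ordering deterministic:

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenizerFileInfo:
    """Metadata for one file bundled into the tokenizer zip archive."""
    path: str        # absolute path on the backend node
    arcname: str     # name the file gets inside the zip
    size_bytes: int  # lets the client validate the assembled archive


def collect_files(named_paths: dict) -> list:
    # Sort by archive name so the bundle (and any chat-template
    # resolution that walks it) is deterministic across runs.
    return [
        TokenizerFileInfo(path=p, arcname=name, size_bytes=os.path.getsize(p))
        for name, p in sorted(named_paths.items())
    ]
```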
💡 Codex Review
Here are some automated review suggestions for this pull request.
Could we merge #12408 into this PR? Both have proto changes, and it is hard to compare them separately.
CatherineSue left a comment:
I'm a bit unclear about this requirement. Why do we need the gRPC server to stream back tokenizer files? We already have tokenizer_path in GetModelInfo. This raises concerns about what kinds of files we are streaming back and about the latency.
The tokenizer crate doesn't handle multimodal files and chat templates, which are required for correct request processing in gRPC.
I feel all the problems we face here can be solved by adding a TokenizerRegistry.
@CatherineSue If the router is running on a different node than the backend workers, then the tokenizer_path returned by GetModelInfo is not accessible from the router.
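For illustration, a TokenizerRegistry along the lines discussed above could look like the sketch below. This is written in Python purely for readability; the router component may be implemented in another language, and all names here are hypothetical:

```python
import threading


class TokenizerRegistry:
    """Hypothetical sketch: maps a model ID to one shared tokenizer."""

    def __init__(self, fetch_tokenizer):
        # fetch_tokenizer(model_id) -> tokenizer; injected so the registry
        # stays agnostic of the transport (local path or gRPC stream).
        self._fetch = fetch_tokenizer
        self._tokenizers = {}
        self._lock = threading.Lock()

    def get_or_load(self, model_id: str):
        # The first caller for a model triggers the fetch; later
        # callers for the same model reuse the loaded tokenizer.
        with self._lock:
            if model_id not in self._tokenizers:
                self._tokenizers[model_id] = self._fetch(model_id)
            return self._tokenizers[model_id]
```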
Hi, are there any updates on the compatibility of "router + gRPC + service discovery (ome)"? I am happy to work on it quickly, but since there is already an ongoing PR, I don't feel right doing so and would rather wait for the existing PR, so I'm wondering whether there is an ETA for it.
/tag-and-rerun-ci
Excluded those two tests for tokenizer streaming and loading from the GPU-related CI.
For the gRPC server: add a GetTokenizer method to the SGLang scheduler.
In the router: add tokenizer bundle loading from the SGLang server.
Summary
This PR adds a new GetTokenizer endpoint to the gRPC server, exposed as a GetTokenizer method on the SGLang scheduler service. It enables the model gateway to dynamically fetch tokenizer files from the backend via streaming and load them into the tokenizer registry. This eliminates the need for static tokenizer configuration or local access to tokenizer files in the model gateway, and ensures the router always uses the same tokenizer as the backend.
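As a rough illustration of the server-side flow described above, the sketch below bundles tokenizer files into an in-memory zip and yields fixed-size byte chunks. The chunk size, function name, and raw-bytes interface are assumptions; the real endpoint wraps each chunk in a gRPC response message:

```python
import io
import os
import zipfile

CHUNK_SIZE = 1 << 20  # 1 MiB per chunk; the PR's actual size is not shown here


def stream_tokenizer_zip(file_paths):
    """Bundle tokenizer files into an in-memory zip and yield byte chunks."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(file_paths):  # deterministic archive order
            zf.write(path, arcname=os.path.basename(path))
    buf.seek(0)
    # A real gRPC handler would wrap each chunk in a response message.
    while chunk := buf.read(CHUNK_SIZE):
        yield chunk
```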
Motivation
Currently, the SGLang model gateway requires local access to the tokenizer files. This creates several challenges:
Configuration complexity: Users must manually ensure the model gateway has access to tokenizer files matching the backend model
Deployment overhead: Tokenizer files must be distributed separately from the backend
Dynamic model support: The model gateway cannot adapt to backend model changes without reconfiguration
By enabling dynamic tokenizer streaming, the router can automatically obtain the correct tokenizer from the backend, simplifying deployment and ensuring consistency.
The GetTokenizer endpoint will be used by the SGLang model gateway when inference gateway mode (multi-model) is turned on. The router has no knowledge of which tokenizer should be used during startup; it fetches the tokenizer at runtime when workers are added into the registry, as sketched below.
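On the gateway side, the runtime fetch could look roughly like the following sketch: streamed chunks are written to a temporary zip, extracted, and the resulting tokenizer is registered. All names here are illustrative, not the PR's exact API:

```python
import tempfile
import zipfile
from pathlib import Path


def build_tokenizer(extract_dir: Path):
    # Placeholder: a real implementation would construct the tokenizer
    # from the extracted files (config, vocab, chat template, ...)
    # before the temporary directory is cleaned up.
    return str(extract_dir)


def load_tokenizer_from_stream(chunks, registry: dict, model_id: str):
    """Assemble streamed zip chunks into a temp file, extract, and register."""
    with tempfile.TemporaryDirectory() as tmp:
        zip_path = Path(tmp) / "tokenizer.zip"
        with open(zip_path, "wb") as f:
            for chunk in chunks:  # raw bytes received over the gRPC stream
                f.write(chunk)
        extract_dir = Path(tmp) / "tokenizer"
        with zipfile.ZipFile(zip_path) as zf:
            zf.extractall(extract_dir)
        registry[model_id] = build_tokenizer(extract_dir)
```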
Changes
On SGLang Server:
On SGLang Router:
Output