
[Router] Add GetTokenizer gRPC method to stream tokenizer files from backend to router #12408

Closed
YouNeedCryDear wants to merge 1 commit into sgl-project:main from YouNeedCryDear:router-grpc-get-tokenizer

Conversation

@YouNeedCryDear (Contributor) commented Oct 30, 2025

Summary

This PR adds a new GetTokenizer gRPC method to the SGLang scheduler service, enabling the router to dynamically fetch tokenizer files from the backend via streaming. This eliminates the need for static tokenizer configuration in the router and ensures the router always uses the same tokenizer as the backend.

Motivation

Currently, the SGLang router requires static tokenizer files to be configured locally. This creates several challenges:

  1. Configuration complexity: Users must manually ensure the router has the correct tokenizer files matching the backend model
  2. Deployment overhead: Tokenizer files must be distributed separately from the backend
  3. Dynamic model support: The router cannot adapt to backend model changes without reconfiguration

By enabling dynamic tokenizer streaming, the router can automatically obtain the correct tokenizer from the backend, simplifying deployment and ensuring consistency.

The Python server implementation of the GetTokenizer RPC is in #12407.

Implementation Details

Protocol Definition

Added new protobuf messages for tokenizer streaming:

  • GetTokenizerRequest: Request message (empty for now)
  • GetTokenizerChunk: Streamed response containing either metadata or file chunks
  • TokenizerMetadata: Information about the tokenizer bundle (model identifier, fingerprint, file list, format)
  • TokenizerFileDescriptor: Metadata for individual tokenizer files
  • TokenizerFileChunk: Chunked file data with sequential indexing and compression support

The protocol follows a metadata-first streaming pattern (sketched below):

  1. First message contains metadata about the tokenizer bundle
  2. Subsequent messages contain file chunks with sequential indices
  3. Last chunk is marked with is_last_chunk flag
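
For orientation, here is a rough sketch of the Rust shapes the client code relies on for these messages. Field names are inferred from the description above and the quoted client code further down; the TokenizerFileDescriptor fields and exact types are assumptions, not the PR's actual definitions, and the real prost-generated types carry additional derive/prost attributes.

// Approximate shapes only (illustrative, not the PR's generated code).
pub struct GetTokenizerRequest {}

pub struct TokenizerMetadata {
    pub model_identifier: String,
    pub fingerprint: String,
    pub files: Vec<TokenizerFileDescriptor>,
    pub bundle_format: String,
}

// Fields here are assumptions; the PR only says this message carries
// per-file metadata.
pub struct TokenizerFileDescriptor {
    pub file_name: String,
    pub size_bytes: u64,
}

pub struct TokenizerFileChunk {
    pub file_name: String,
    pub chunk_index: u32,
    pub data: Vec<u8>,
    pub is_last_chunk: bool,
}

pub struct GetTokenizerChunk {
    // oneof { metadata, file_chunk }
    pub chunk: Option<get_tokenizer_chunk::Chunk>,
}

pub mod get_tokenizer_chunk {
    pub enum Chunk {
        Metadata(super::TokenizerMetadata),
        FileChunk(super::TokenizerFileChunk),
    }
}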

Rust Client Implementation

Added a get_tokenizer() method to SglangSchedulerClient (see the usage sketch below) that:

  • Initiates a streaming RPC call to the backend
  • Validates the streaming protocol (metadata first, sequential chunks, last chunk marker)
  • Accumulates compressed data chunks
  • Returns a TokenizerBundle containing metadata and compressed data

The implementation includes robust error handling for protocol violations:

  • Validates metadata arrives first
  • Ensures chunks have sequential indices
  • Verifies the last chunk marker is received
  • Provides detailed error messages for debugging
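
As a rough usage sketch, a caller on the router side might drive the method like this. The connect constructor and the tracing::debug import are assumptions for illustration; only get_tokenizer() and the TokenizerBundle fields (metadata, compressed_data) come from this PR.

use tracing::debug;

// Hypothetical call site; `SglangSchedulerClient::connect` is assumed here
// and may not match the client's actual constructor.
async fn fetch_tokenizer(
    endpoint: &str,
) -> Result<TokenizerBundle, Box<dyn std::error::Error + Send + Sync>> {
    let client = SglangSchedulerClient::connect(endpoint).await?;
    let bundle = client.get_tokenizer().await?;
    debug!(
        "fetched tokenizer for {}: {} files, {} compressed bytes",
        bundle.metadata.model_identifier,
        bundle.metadata.files.len(),
        bundle.compressed_data.len()
    );
    // Unpacking/decompressing the bundle is left to the (future) TokenizerRegistry.
    Ok(bundle)
}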

Future Work

This PR lays the foundation for dynamic tokenizer streaming. Future enhancements will include:

  • TokenizerRegistry for caching downloaded tokenizers
  • Integration with the router's worker registration workflow
  • Configuration options for remote vs. static tokenizer sources
  • Automatic fallback mechanisms

Testing

Manual testing was performed to verify:

  • Proto definitions compile correctly
  • Rust client compiles without errors
  • Protocol validation logic works as expected

Integration testing will be added once the Python server PR is merged.

@gemini-code-assist (Contributor) commented

Summary of Changes

Hello @YouNeedCryDear, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant enhancement to the SGLang router's tokenizer management by implementing a dynamic streaming mechanism. Instead of relying on statically configured local files, the router can now fetch tokenizer bundles directly from the backend via a new gRPC method. This change streamlines the deployment process, guarantees consistency between the router and backend, and lays the groundwork for more flexible and adaptive model support.

Highlights

  • New gRPC Method: Introduced a new GetTokenizer gRPC method to the SGLang scheduler service, enabling dynamic fetching of tokenizer files.
  • Dynamic Tokenizer Streaming: The router can now stream tokenizer files from the backend, eliminating the need for static local configuration.
  • Protocol Definition: Defined new protobuf messages (GetTokenizerRequest, GetTokenizerChunk, TokenizerMetadata, TokenizerFileDescriptor, TokenizerFileChunk) for the tokenizer streaming protocol, which follows a metadata-first pattern.
  • Rust Client Implementation: Added a get_tokenizer() method to SglangSchedulerClient that handles the streaming RPC call, validates the protocol (metadata first, sequential chunks, last chunk marker), and accumulates compressed data into a TokenizerBundle.
  • Improved Consistency and Deployment: This change ensures the router always uses the same tokenizer as the backend, simplifying configuration, preventing version mismatches, and reducing deployment overhead.

@gemini-code-assist (Bot) left a comment

Code Review

This pull request introduces a new GetTokenizer gRPC method to allow the router to stream tokenizer files from the backend. The changes include new protobuf definitions and a Rust client implementation for this streaming RPC. My review focuses on the client implementation. I've identified a significant ambiguity in the streaming protocol design for handling multi-file tokenizer bundles, which could lead to data corruption. I've also suggested refactoring the main client function and improving its error handling for better maintainability. Overall, the feature is a good addition, but the protocol details need to be solidified before the server-side is implemented.

Comment on lines +293 to +348
// Collect file chunks
let mut compressed_data = Vec::new();
let mut expected_chunk_index = 0u32;
let mut last_chunk_received = false;

debug!("Starting to receive file chunks");
while let Some(chunk) = stream.message().await? {
    match chunk.chunk {
        Some(proto::get_tokenizer_chunk::Chunk::FileChunk(file_chunk)) => {
            // Validate chunk ordering
            if file_chunk.chunk_index != expected_chunk_index {
                return Err(format!(
                    "Protocol error: expected chunk index {}, got {}",
                    expected_chunk_index, file_chunk.chunk_index
                )
                .into());
            }

            debug!(
                "Received file chunk {}: {} bytes, is_last={}",
                file_chunk.chunk_index,
                file_chunk.data.len(),
                file_chunk.is_last_chunk
            );

            // Append data
            compressed_data.extend_from_slice(&file_chunk.data);

            // Check if this is the last chunk
            if file_chunk.is_last_chunk {
                last_chunk_received = true;
                debug!(
                    "Last chunk received, total data size: {} bytes",
                    compressed_data.len()
                );
                break;
            }

            expected_chunk_index += 1;
        }
        Some(proto::get_tokenizer_chunk::Chunk::Metadata(_)) => {
            return Err(format!(
                "Protocol error: unexpected metadata chunk at position {}",
                expected_chunk_index
            )
            .into());
        }
        None => {
            return Err(format!(
                "Protocol error: empty chunk at position {}",
                expected_chunk_index
            )
            .into());
        }
    }
}
gemini-code-assist (Contributor), severity: high

The current implementation collects all file chunks into a single compressed_data vector, ignoring the file_chunk.file_name field. This assumes the server sends a single contiguous data stream (like an archive), which seems to contradict the file_name field in the TokenizerFileChunk proto.

This creates ambiguity in the protocol for multi-file bundles:

  1. If the server sends a single archive, the file_name field in TokenizerFileChunk is redundant and the client would later need to unpack this archive.
  2. If the server sends chunks for individual files, this client implementation will corrupt the data by concatenating them. The client would need to be updated to segregate data by file_name (e.g., into a HashMap<String, Vec<u8>>).

Furthermore, if individual files are streamed:

  • The chunk_index should probably be tracked on a per-file basis, not globally.
  • The is_last_chunk flag's meaning is unclear: does it mark the end of a file or the end of the entire bundle? The client assumes the latter.

It's important to clarify the protocol design for handling multiple files and ensure the client implementation correctly reflects it.
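
For concreteness, here is a minimal sketch of the per-file accumulation this comment suggests, assuming chunk indices are tracked per file; this is not what the PR's client currently does, and it would replace the single compressed_data buffer in the quoted loop.

use std::collections::HashMap;

// Accumulate chunks keyed by file name, validating per-file ordering.
let mut files: HashMap<String, Vec<u8>> = HashMap::new();
let mut next_index: HashMap<String, u32> = HashMap::new();

while let Some(chunk) = stream.message().await? {
    if let Some(proto::get_tokenizer_chunk::Chunk::FileChunk(fc)) = chunk.chunk {
        let expected = next_index.entry(fc.file_name.clone()).or_insert(0);
        if fc.chunk_index != *expected {
            return Err(format!(
                "out-of-order chunk for {}: expected {}, got {}",
                fc.file_name, *expected, fc.chunk_index
            )
            .into());
        }
        files
            .entry(fc.file_name.clone())
            .or_default()
            .extend_from_slice(&fc.data);
        *expected += 1;
    }
}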

Comment on lines +259 to +364
pub async fn get_tokenizer(
    &self,
) -> Result<TokenizerBundle, Box<dyn std::error::Error + Send + Sync>> {
    debug!("Requesting tokenizer from backend");
    let request = Request::new(proto::GetTokenizerRequest {});

    let mut client = self.client.clone();
    let mut stream = client.get_tokenizer(request).await?.into_inner();

    // First message must be metadata
    let first_chunk = stream
        .message()
        .await?
        .ok_or("Empty stream: expected metadata chunk")?;

    let metadata = match first_chunk.chunk {
        Some(proto::get_tokenizer_chunk::Chunk::Metadata(meta)) => {
            debug!(
                "Received tokenizer metadata: model={}, fingerprint={}, files={}, format={}",
                meta.model_identifier,
                meta.fingerprint,
                meta.files.len(),
                meta.bundle_format
            );
            meta
        }
        Some(proto::get_tokenizer_chunk::Chunk::FileChunk(_)) => {
            return Err("Protocol error: first chunk must be metadata, got file chunk".into());
        }
        None => {
            return Err("Protocol error: first chunk is empty".into());
        }
    };

    // Collect file chunks
    let mut compressed_data = Vec::new();
    let mut expected_chunk_index = 0u32;
    let mut last_chunk_received = false;

    debug!("Starting to receive file chunks");
    while let Some(chunk) = stream.message().await? {
        match chunk.chunk {
            Some(proto::get_tokenizer_chunk::Chunk::FileChunk(file_chunk)) => {
                // Validate chunk ordering
                if file_chunk.chunk_index != expected_chunk_index {
                    return Err(format!(
                        "Protocol error: expected chunk index {}, got {}",
                        expected_chunk_index, file_chunk.chunk_index
                    )
                    .into());
                }

                debug!(
                    "Received file chunk {}: {} bytes, is_last={}",
                    file_chunk.chunk_index,
                    file_chunk.data.len(),
                    file_chunk.is_last_chunk
                );

                // Append data
                compressed_data.extend_from_slice(&file_chunk.data);

                // Check if this is the last chunk
                if file_chunk.is_last_chunk {
                    last_chunk_received = true;
                    debug!(
                        "Last chunk received, total data size: {} bytes",
                        compressed_data.len()
                    );
                    break;
                }

                expected_chunk_index += 1;
            }
            Some(proto::get_tokenizer_chunk::Chunk::Metadata(_)) => {
                return Err(format!(
                    "Protocol error: unexpected metadata chunk at position {}",
                    expected_chunk_index
                )
                .into());
            }
            None => {
                return Err(format!(
                    "Protocol error: empty chunk at position {}",
                    expected_chunk_index
                )
                .into());
            }
        }
    }

    // Validate that we received the last chunk marker
    if !last_chunk_received {
        return Err(format!(
            "Protocol error: stream ended without is_last_chunk flag (received {} chunks)",
            expected_chunk_index
        )
        .into());
    }

    debug!("Tokenizer bundle download complete");
    Ok(TokenizerBundle {
        metadata,
        compressed_data,
    })
}
gemini-code-assist (Contributor), severity: medium

This function is quite long and handles multiple concerns (receiving metadata, receiving file chunks, validation). For better readability and maintainability, consider refactoring it into smaller, more focused functions. For example, you could have separate functions for processing the metadata and for streaming the file chunks.

Additionally, the error handling could be improved by using a custom error enum instead of Box<dyn std::error::Error>. This provides more structured errors, making it easier for callers to handle specific failure modes programmatically.

Here's an example of how a custom error type could look using thiserror:

#[derive(Debug, thiserror::Error)]
pub enum GetTokenizerError {
    #[error("gRPC error: {0}")]
    Grpc(#[from] tonic::Status),
    #[error("Empty stream: expected metadata chunk")]
    EmptyStream,
    #[error("Protocol error: first chunk must be metadata, but got a file chunk")]
    MetadataNotFirst,
    #[error("Protocol error: first chunk is empty")]
    FirstChunkEmpty,
    #[error("Protocol error: unexpected metadata chunk at position {0}")]
    UnexpectedMetadata(u32),
    #[error("Protocol error: empty chunk at position {0}")]
    EmptyChunk(u32),
    #[error("Protocol error: expected chunk index {expected}, but got {got}")]
    ChunkOutOfOrder { expected: u32, got: u32 },
    #[error("Protocol error: stream ended without is_last_chunk flag (received {0} chunks)")]
    StreamEndedPrematurely(u32),
}

Using this would make the error handling logic within get_tokenizer cleaner and more explicit.
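
For example, the chunk-ordering check quoted above would then return a typed variant instead of a formatted string (variant names taken from the sketch above):

// Illustrative rewrite of the ordering check using the suggested enum.
if file_chunk.chunk_index != expected_chunk_index {
    return Err(GetTokenizerError::ChunkOutOfOrder {
        expected: expected_chunk_index,
        got: file_chunk.chunk_index,
    });
}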

@chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines +303 to +308
if file_chunk.chunk_index != expected_chunk_index {
    return Err(format!(
        "Protocol error: expected chunk index {}, got {}",
        expected_chunk_index, file_chunk.chunk_index
    )
    .into());

P0: Return an Error type instead of String in protocol validation

The new get_tokenizer function returns Result<TokenizerBundle, Box<dyn std::error::Error + Send + Sync>>, but the error paths use return Err(format!("…").into());. format! produces a String, which does not implement std::error::Error, so these branches cannot be converted into Box<dyn Error> and the crate will fail to compile. Use an error type (e.g., anyhow::Error, tonic::Status, or a custom struct) rather than a bare String, and update the other format! branches in this function accordingly.


@YouNeedCryDear changed the title from "Add GetTokenizer gRPC method to stream tokenizer files from backend to router" to "[Router] Add GetTokenizer gRPC method to stream tokenizer files from backend to router" on Oct 30, 2025
@CatherineSue (Collaborator) commented

> Configuration complexity: Users must manually ensure the router has the correct tokenizer files matching the backend model

This is FALSE: the router already has create_tokenizer_async(), which handles:

  • HuggingFace model IDs: "deepseek-ai/deepseek-v3" → auto-downloads
  • Local paths: "/raid/models/..." → loads directly
  • No manual file management needed

Adding this management to the gRPC server makes things even more complex.

> Deployment overhead: Tokenizer files must be distributed separately from the backend

Can you explain a bit more? I thought we would share the same PersistentVolume for workers and routers.

> Dynamic model support: The router cannot adapt to backend model changes without reconfiguration

Not sure I understand this correctly. Same as in #12045, I thought the root cause is that we need --tokenizer-path at startup right now. Why not have a TokenizerRegistry so we can solve all these issues once and for all?

@YouNeedCryDear (Contributor, Author) commented

Closing this PR as the changes have been merged into #12407.
