
[Router] Add GetTokenizer gRPC method to stream tokenizer files from backend to router #12408

Closed
YouNeedCryDear wants to merge 1 commit into sgl-project:main from YouNeedCryDear:router-grpc-get-tokenizer

Conversation

@YouNeedCryDear (Contributor) commented Oct 30, 2025

Summary

This PR adds a new GetTokenizer gRPC method to the SGLang scheduler service, enabling the router to dynamically fetch tokenizer files from the backend via streaming. This eliminates the need for static tokenizer configuration in the router and ensures the router always uses the same tokenizer as the backend.

Motivation

Currently, the SGLang router requires static tokenizer files to be configured locally. This creates several challenges:

  1. Configuration complexity: Users must manually ensure the router has the correct tokenizer files matching the backend model
  2. Deployment overhead: Tokenizer files must be distributed separately from the backend
  3. Dynamic model support: The router cannot adapt to backend model changes without reconfiguration

By enabling dynamic tokenizer streaming, the router can automatically obtain the correct tokenizer from the backend, simplifying deployment and ensuring consistency.

The Python server implementation of the GetTokenizer RPC is in #12407.

Implementation Details

Protocol Definition

Added new protobuf messages for tokenizer streaming:

  • GetTokenizerRequest: Request message (empty for now)
  • GetTokenizerChunk: Streamed response containing either metadata or file chunks
  • TokenizerMetadata: Information about the tokenizer bundle (model identifier, fingerprint, file list, format)
  • TokenizerFileDescriptor: Metadata for individual tokenizer files
  • TokenizerFileChunk: Chunked file data with sequential indexing and compression support

The protocol follows a metadata-first streaming pattern (sketched below):

  1. First message contains metadata about the tokenizer bundle
  2. Subsequent messages contain file chunks with sequential indices
  3. Last chunk is marked with is_last_chunk flag
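
For orientation, here is a rough sketch of the Rust shapes the client code relies on for these messages. Field names are inferred from the description above and the quoted client code further down; the TokenizerFileDescriptor fields and exact types are assumptions, not the PR's actual definitions, and the real prost-generated types carry additional derive/prost attributes.

// Approximate shapes only (illustrative, not the PR's generated code).
pub struct GetTokenizerRequest {}

pub struct TokenizerMetadata {
    pub model_identifier: String,
    pub fingerprint: String,
    pub files: Vec<TokenizerFileDescriptor>,
    pub bundle_format: String,
}

// Fields here are assumptions; the PR only says this message carries
// per-file metadata.
pub struct TokenizerFileDescriptor {
    pub file_name: String,
    pub size_bytes: u64,
}

pub struct TokenizerFileChunk {
    pub file_name: String,
    pub chunk_index: u32,
    pub data: Vec<u8>,
    pub is_last_chunk: bool,
}

pub struct GetTokenizerChunk {
    // oneof { metadata, file_chunk }
    pub chunk: Option<get_tokenizer_chunk::Chunk>,
}

pub mod get_tokenizer_chunk {
    pub enum Chunk {
        Metadata(super::TokenizerMetadata),
        FileChunk(super::TokenizerFileChunk),
    }
}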

Rust Client Implementation

Added a get_tokenizer() method to SglangSchedulerClient (see the usage sketch below) that:

  • Initiates a streaming RPC call to the backend
  • Validates the streaming protocol (metadata first, sequential chunks, last chunk marker)
  • Accumulates compressed data chunks
  • Returns a TokenizerBundle containing metadata and compressed data

The implementation includes robust error handling for protocol violations:

  • Validates metadata arrives first
  • Ensures chunks have sequential indices
  • Verifies the last chunk marker is received
  • Provides detailed error messages for debugging
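
As a rough usage sketch, a caller on the router side might drive the method like this. The connect constructor and the tracing::debug import are assumptions for illustration; only get_tokenizer() and the TokenizerBundle fields (metadata, compressed_data) come from this PR.

use tracing::debug;

// Hypothetical call site; `SglangSchedulerClient::connect` is assumed here
// and may not match the client's actual constructor.
async fn fetch_tokenizer(
    endpoint: &str,
) -> Result<TokenizerBundle, Box<dyn std::error::Error + Send + Sync>> {
    let client = SglangSchedulerClient::connect(endpoint).await?;
    let bundle = client.get_tokenizer().await?;
    debug!(
        "fetched tokenizer for {}: {} files, {} compressed bytes",
        bundle.metadata.model_identifier,
        bundle.metadata.files.len(),
        bundle.compressed_data.len()
    );
    // Unpacking/decompressing the bundle is left to the (future) TokenizerRegistry.
    Ok(bundle)
}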

Future Work

This PR lays the foundation for dynamic tokenizer streaming. Future enhancements will include:

  • TokenizerRegistry for caching downloaded tokenizers
  • Integration with the router's worker registration workflow
  • Configuration options for remote vs. static tokenizer sources
  • Automatic fallback mechanisms

Testing

Manual testing was performed to verify:

  • Proto definitions compile correctly
  • Rust client compiles without errors
  • Protocol validation logic works as expected

Integration testing will be added once the Python server PR is merged.

@gemini-code-assist (Contributor) commented

Summary of Changes

Hello @YouNeedCryDear, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant enhancement to the SGLang router's tokenizer management by implementing a dynamic streaming mechanism. Instead of relying on statically configured local files, the router can now fetch tokenizer bundles directly from the backend via a new gRPC method. This change streamlines the deployment process, guarantees consistency between the router and backend, and lays the groundwork for more flexible and adaptive model support.

Highlights

  • New gRPC Method: Introduced a new GetTokenizer gRPC method to the SGLang scheduler service, enabling dynamic fetching of tokenizer files.
  • Dynamic Tokenizer Streaming: The router can now stream tokenizer files from the backend, eliminating the need for static local configuration.
  • Protocol Definition: Defined new protobuf messages (GetTokenizerRequest, GetTokenizerChunk, TokenizerMetadata, TokenizerFileDescriptor, TokenizerFileChunk) for the tokenizer streaming protocol, which follows a metadata-first pattern.
  • Rust Client Implementation: Added a get_tokenizer() method to SglangSchedulerClient that handles the streaming RPC call, validates the protocol (metadata first, sequential chunks, last chunk marker), and accumulates compressed data into a TokenizerBundle.
  • Improved Consistency and Deployment: This change ensures the router always uses the same tokenizer as the backend, simplifying configuration, preventing version mismatches, and reducing deployment overhead.

@gemini-code-assist (Bot) left a comment

Code Review

This pull request introduces a new GetTokenizer gRPC method to allow the router to stream tokenizer files from the backend. The changes include new protobuf definitions and a Rust client implementation for this streaming RPC. My review focuses on the client implementation. I've identified a significant ambiguity in the streaming protocol design for handling multi-file tokenizer bundles, which could lead to data corruption. I've also suggested refactoring the main client function and improving its error handling for better maintainability. Overall, the feature is a good addition, but the protocol details need to be solidified before the server-side is implemented.

Comment on lines +293 to +348
// Collect file chunks
let mut compressed_data = Vec::new();
let mut expected_chunk_index = 0u32;
let mut last_chunk_received = false;

debug!("Starting to receive file chunks");
while let Some(chunk) = stream.message().await? {
    match chunk.chunk {
        Some(proto::get_tokenizer_chunk::Chunk::FileChunk(file_chunk)) => {
            // Validate chunk ordering
            if file_chunk.chunk_index != expected_chunk_index {
                return Err(format!(
                    "Protocol error: expected chunk index {}, got {}",
                    expected_chunk_index, file_chunk.chunk_index
                )
                .into());
            }

            debug!(
                "Received file chunk {}: {} bytes, is_last={}",
                file_chunk.chunk_index,
                file_chunk.data.len(),
                file_chunk.is_last_chunk
            );

            // Append data
            compressed_data.extend_from_slice(&file_chunk.data);

            // Check if this is the last chunk
            if file_chunk.is_last_chunk {
                last_chunk_received = true;
                debug!(
                    "Last chunk received, total data size: {} bytes",
                    compressed_data.len()
                );
                break;
            }

            expected_chunk_index += 1;
        }
        Some(proto::get_tokenizer_chunk::Chunk::Metadata(_)) => {
            return Err(format!(
                "Protocol error: unexpected metadata chunk at position {}",
                expected_chunk_index
            )
            .into());
        }
        None => {
            return Err(format!(
                "Protocol error: empty chunk at position {}",
                expected_chunk_index
            )
            .into());
        }
    }
}
gemini-code-assist (Contributor), severity: high

The current implementation collects all file chunks into a single compressed_data vector, ignoring the file_chunk.file_name field. This assumes the server sends a single contiguous data stream (like an archive), which seems to contradict the file_name field in the TokenizerFileChunk proto.

This creates ambiguity in the protocol for multi-file bundles:

  1. If the server sends a single archive, the file_name field in TokenizerFileChunk is redundant and the client would later need to unpack this archive.
  2. If the server sends chunks for individual files, this client implementation will corrupt the data by concatenating them. The client would need to be updated to segregate data by file_name (e.g., into a HashMap<String, Vec<u8>>).

Furthermore, if individual files are streamed:

  • The chunk_index should probably be tracked on a per-file basis, not globally.
  • The is_last_chunk flag's meaning is unclear: does it mark the end of a file or the end of the entire bundle? The client assumes the latter.

It's important to clarify the protocol design for handling multiple files and ensure the client implementation correctly reflects it.
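
For concreteness, here is a minimal sketch of the per-file accumulation this comment suggests, assuming chunk indices are tracked per file; this is not what the PR's client currently does, and it would replace the single compressed_data buffer in the quoted loop.

use std::collections::HashMap;

// Accumulate chunks keyed by file name, validating per-file ordering.
let mut files: HashMap<String, Vec<u8>> = HashMap::new();
let mut next_index: HashMap<String, u32> = HashMap::new();

while let Some(chunk) = stream.message().await? {
    if let Some(proto::get_tokenizer_chunk::Chunk::FileChunk(fc)) = chunk.chunk {
        let expected = next_index.entry(fc.file_name.clone()).or_insert(0);
        if fc.chunk_index != *expected {
            return Err(format!(
                "out-of-order chunk for {}: expected {}, got {}",
                fc.file_name, *expected, fc.chunk_index
            )
            .into());
        }
        files
            .entry(fc.file_name.clone())
            .or_default()
            .extend_from_slice(&fc.data);
        *expected += 1;
    }
}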

Comment on lines +259 to +364
pub async fn get_tokenizer(
    &self,
) -> Result<TokenizerBundle, Box<dyn std::error::Error + Send + Sync>> {
    debug!("Requesting tokenizer from backend");
    let request = Request::new(proto::GetTokenizerRequest {});

    let mut client = self.client.clone();
    let mut stream = client.get_tokenizer(request).await?.into_inner();

    // First message must be metadata
    let first_chunk = stream
        .message()
        .await?
        .ok_or("Empty stream: expected metadata chunk")?;

    let metadata = match first_chunk.chunk {
        Some(proto::get_tokenizer_chunk::Chunk::Metadata(meta)) => {
            debug!(
                "Received tokenizer metadata: model={}, fingerprint={}, files={}, format={}",
                meta.model_identifier,
                meta.fingerprint,
                meta.files.len(),
                meta.bundle_format
            );
            meta
        }
        Some(proto::get_tokenizer_chunk::Chunk::FileChunk(_)) => {
            return Err("Protocol error: first chunk must be metadata, got file chunk".into());
        }
        None => {
            return Err("Protocol error: first chunk is empty".into());
        }
    };

    // Collect file chunks
    let mut compressed_data = Vec::new();
    let mut expected_chunk_index = 0u32;
    let mut last_chunk_received = false;

    debug!("Starting to receive file chunks");
    while let Some(chunk) = stream.message().await? {
        match chunk.chunk {
            Some(proto::get_tokenizer_chunk::Chunk::FileChunk(file_chunk)) => {
                // Validate chunk ordering
                if file_chunk.chunk_index != expected_chunk_index {
                    return Err(format!(
                        "Protocol error: expected chunk index {}, got {}",
                        expected_chunk_index, file_chunk.chunk_index
                    )
                    .into());
                }

                debug!(
                    "Received file chunk {}: {} bytes, is_last={}",
                    file_chunk.chunk_index,
                    file_chunk.data.len(),
                    file_chunk.is_last_chunk
                );

                // Append data
                compressed_data.extend_from_slice(&file_chunk.data);

                // Check if this is the last chunk
                if file_chunk.is_last_chunk {
                    last_chunk_received = true;
                    debug!(
                        "Last chunk received, total data size: {} bytes",
                        compressed_data.len()
                    );
                    break;
                }

                expected_chunk_index += 1;
            }
            Some(proto::get_tokenizer_chunk::Chunk::Metadata(_)) => {
                return Err(format!(
                    "Protocol error: unexpected metadata chunk at position {}",
                    expected_chunk_index
                )
                .into());
            }
            None => {
                return Err(format!(
                    "Protocol error: empty chunk at position {}",
                    expected_chunk_index
                )
                .into());
            }
        }
    }

    // Validate that we received the last chunk marker
    if !last_chunk_received {
        return Err(format!(
            "Protocol error: stream ended without is_last_chunk flag (received {} chunks)",
            expected_chunk_index
        )
        .into());
    }

    debug!("Tokenizer bundle download complete");
    Ok(TokenizerBundle {
        metadata,
        compressed_data,
    })
}
gemini-code-assist (Contributor), severity: medium

This function is quite long and handles multiple concerns (receiving metadata, receiving file chunks, validation). For better readability and maintainability, consider refactoring it into smaller, more focused functions. For example, you could have separate functions for processing the metadata and for streaming the file chunks.

Additionally, the error handling could be improved by using a custom error enum instead of Box<dyn std::error::Error>. This provides more structured errors, making it easier for callers to handle specific failure modes programmatically.

Here's an example of how a custom error type could look using thiserror:

#[derive(Debug, thiserror::Error)]
pub enum GetTokenizerError {
    #[error("gRPC error: {0}")]
    Grpc(#[from] tonic::Status),
    #[error("Empty stream: expected metadata chunk")]
    EmptyStream,
    #[error("Protocol error: first chunk must be metadata, but got a file chunk")]
    MetadataNotFirst,
    #[error("Protocol error: first chunk is empty")]
    FirstChunkEmpty,
    #[error("Protocol error: unexpected metadata chunk at position {0}")]
    UnexpectedMetadata(u32),
    #[error("Protocol error: empty chunk at position {0}")]
    EmptyChunk(u32),
    #[error("Protocol error: expected chunk index {expected}, but got {got}")]
    ChunkOutOfOrder { expected: u32, got: u32 },
    #[error("Protocol error: stream ended without is_last_chunk flag (received {0} chunks)")]
    StreamEndedPrematurely(u32),
}

Using this would make the error handling logic within get_tokenizer cleaner and more explicit.
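
For example, the chunk-ordering check quoted above would then return a typed variant instead of a formatted string (variant names taken from the sketch above):

// Illustrative rewrite of the ordering check using the suggested enum.
if file_chunk.chunk_index != expected_chunk_index {
    return Err(GetTokenizerError::ChunkOutOfOrder {
        expected: expected_chunk_index,
        got: file_chunk.chunk_index,
    });
}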

@chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines +303 to +308
if file_chunk.chunk_index != expected_chunk_index {
    return Err(format!(
        "Protocol error: expected chunk index {}, got {}",
        expected_chunk_index, file_chunk.chunk_index
    )
    .into());

P0: Return an Error type instead of String in protocol validation

The new get_tokenizer function returns Result<TokenizerBundle, Box<dyn std::error::Error + Send + Sync>>, but the error paths use return Err(format!("…").into());. format! produces a String, which does not implement std::error::Error, so these branches cannot be converted into Box<dyn Error> and the crate will fail to compile. Use an error type (e.g., anyhow::Error, tonic::Status, or a custom struct) rather than a bare String, and update the other format! branches in this function accordingly.


@YouNeedCryDear changed the title from "Add GetTokenizer gRPC method to stream tokenizer files from backend to router" to "[Router] Add GetTokenizer gRPC method to stream tokenizer files from backend to router" on Oct 30, 2025
@CatherineSue (Collaborator) commented

> Configuration complexity: Users must manually ensure the router has the correct tokenizer files matching the backend model

This is FALSE: the router already has create_tokenizer_async(), which handles:

  • HuggingFace model IDs: "deepseek-ai/deepseek-v3" → auto-downloads
  • Local paths: "/raid/models/..." → loads directly
  • No manual file management needed

Adding this management to the gRPC server makes things even more complex.

> Deployment overhead: Tokenizer files must be distributed separately from the backend

Can you explain a bit more? I thought we would share the same PersistentVolume for workers and routers.

> Dynamic model support: The router cannot adapt to backend model changes without reconfiguration

Not sure I understand this correctly. Same as in #12045, I thought the root cause is that we need --tokenizer-path at startup right now. Why not have a TokenizerRegistry so we can solve all these issues once and for all?

@YouNeedCryDear (Contributor, Author) commented

Closing this PR as the changes have been merged into #12407.
