[model-gateway]: add gRPC router embeddings endpoint implementation #15273

Merged
slin1237 merged 14 commits into sgl-project:main from Ratish1:grpc-router-embeddings on Dec 23, 2025

Conversation

@Ratish1
Collaborator

@Ratish1 Ratish1 commented Dec 16, 2025

Motivation

This PR introduces initial support for the Embedding API within the gRPC model gateway. This feature allows clients to request vector embeddings for text inputs, leveraging existing gRPC backend workers (SGLang).

Modifications

  1. Dedicated Pipeline Stages: Implements new pipeline stages for embedding requests: EmbeddingPreparationStage, EmbeddingRequestBuildingStage, and EmbeddingResponseProcessingStage. This ensures a clear separation of concerns and a robust processing flow specific to embeddings.
  2. GrpcRouter Integration: Adds an embedding_pipeline to the GrpcRouter and a new route_embeddings_impl method to handle incoming embedding requests, dispatching them through the dedicated pipeline.
  3. Unified Protocol Handling: Extends ProtoRequest and ProtoEmbedResponse to wrap embedding-specific protobuf messages, enabling transparent interaction with different backend types.
  4. RequestContext Extension: Updates the RequestContext with RequestType::Embedding, FinalResponse::Embedding, and ExecutionResult::Embedding to properly track embedding request states throughout the pipeline.
  5. The _convert_embed_request function in python/sglang/srt/entrypoints/grpc_server.py was updated to correctly extract image_inputs, token_type_ids, and sampling_params from the gRPC EmbedRequest.
  6. sampling_params.max_new_tokens is explicitly set to 0 to prevent a TypeError in the downstream scheduler, a workaround for current backend logic.
  7. In python/sglang/srt/managers/scheduler.py, a None check for self.pad_input_ids_func was added within handle_embedding_request, ensuring safe handling of multimodal input in embedding requests with appropriate documentation.
  8. Configurable log_metrics: Adds log_metrics: Option<bool> to EmbeddingRequest for control over metrics logging, defaulting to false.
  9. Implements build_embed_request in sgl-model-gateway/src/grpc_client/sglang_scheduler.rs to abstract construction of the gRPC EmbedRequest.
  10. Adds ENDPOINT_EMBEDDINGS to smg_labels and uses it to label metrics for embedding requests, improving observability.
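
Items 5 to 7 can be illustrated with a minimal Python sketch. It is not the code from grpc_server.py or scheduler.py: `EmbedRequestMsg`, `convert_embed_request`, and `pad_if_multimodal` are hypothetical names, and the dataclass only stands in for the real protobuf message. The field names (`image_inputs`, `token_type_ids`, `sampling_params`, `max_new_tokens`, `pad_input_ids_func`) are the ones mentioned above.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class EmbedRequestMsg:
    """Hypothetical stand-in for the gRPC EmbedRequest message."""
    input_ids: List[int]
    token_type_ids: List[int] = field(default_factory=list)
    image_inputs: Optional[dict] = None
    sampling_params: dict = field(default_factory=dict)


def convert_embed_request(msg: EmbedRequestMsg) -> dict:
    """Copy the embedding-relevant fields and force max_new_tokens to 0."""
    sampling_params = dict(msg.sampling_params)
    # Embedding requests generate no tokens; per the PR description, an
    # explicit 0 avoids a TypeError in the downstream scheduler.
    sampling_params["max_new_tokens"] = 0
    return {
        "input_ids": list(msg.input_ids),
        "token_type_ids": list(msg.token_type_ids) or None,
        "image_inputs": msg.image_inputs,
        "sampling_params": sampling_params,
    }


def pad_if_multimodal(
    input_ids: List[int],
    image_inputs: Optional[dict],
    pad_input_ids_func: Optional[Callable[[List[int], dict], List[int]]],
) -> List[int]:
    """Apply the padding function only when it and the image inputs both exist."""
    if image_inputs is not None and pad_input_ids_func is not None:
        return pad_input_ids_func(input_ids, image_inputs)
    return input_ids


req = convert_embed_request(
    EmbedRequestMsg(input_ids=[1, 2, 3], sampling_params={"temperature": 0.0})
)
print(req["sampling_params"])
print(pad_if_multimodal(req["input_ids"], req["image_inputs"], None))
```

The guard in `pad_if_multimodal` mirrors the idea of the None check on `pad_input_ids_func`: a text-only embedding model may have no padding function at all, so the call has to be skipped rather than assumed.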

Accuracy Tests

Benchmarking and Profiling

Checklist

@gemini-code-assist
Contributor

Summary of Changes

Hello @Ratish1, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the model gateway's capabilities by introducing a comprehensive gRPC endpoint for embedding requests. It establishes a dedicated processing pipeline, from request preparation and building to execution and response processing, ensuring that embedding tasks are handled efficiently and distinctly from generation tasks. The changes also include necessary adaptations for both SGLang and vLLM backends, alongside robust error handling and improved metrics for observability.

Highlights

  • gRPC Embedding API Support: Initial support for the Embedding API has been added to the gRPC model gateway, enabling clients to request vector embeddings for text inputs from SGLang and vLLM backends.
  • Dedicated Pipeline Stages: New pipeline stages (EmbeddingPreparationStage, EmbeddingRequestBuildingStage, EmbeddingResponseProcessingStage) have been implemented to ensure a robust and separated processing flow specifically for embedding requests.
  • Unified Protocol Handling: The system now uses unified ProtoRequest and ProtoEmbedResponse enums to wrap backend-specific protobuf messages, facilitating transparent interaction with different backend types.
  • RequestContext Extension: The RequestContext has been updated to properly track embedding request states, introducing RequestType::Embedding, FinalResponse::Embedding, and ExecutionResult::Embedding.
  • Backend Integration & Fixes: Updates include correctly extracting image inputs and sampling parameters for embedding requests in grpc_server.py (setting max_new_tokens to 0 as a workaround), and adding a None check for pad_input_ids_func in scheduler.py for multimodal input handling.
  • Metrics and Configuration: The EmbeddingRequest now supports a configurable log_metrics option, and ENDPOINT_EMBEDDINGS has been added for accurate metrics tracking of embedding requests.
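
The staged flow described in the highlights can be sketched roughly as follows. This is an illustrative Python rendering, not the gateway's Rust implementation: the stage functions, the `RequestContext` fields, the toy tokenizer, and `fake_backend` are all assumptions made for the example. Only the stage names and the `log_metrics` default of false come from the PR text.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class RequestContext:
    """Hypothetical per-request state passed from stage to stage."""
    text: str
    token_ids: Optional[List[int]] = None
    proto_request: Optional[dict] = None
    raw_embedding: Optional[List[float]] = None
    final_response: Optional[dict] = None


def preparation_stage(ctx: RequestContext) -> RequestContext:
    # Toy tokenizer (assumption): one id per whitespace-separated word.
    ctx.token_ids = [len(word) for word in ctx.text.split()]
    return ctx


def request_building_stage(ctx: RequestContext) -> RequestContext:
    # Build the backend request; log_metrics defaults to False per the PR.
    ctx.proto_request = {"input_ids": ctx.token_ids, "log_metrics": False}
    return ctx


def make_execution_stage(backend: Callable[[dict], List[float]]):
    def execution_stage(ctx: RequestContext) -> RequestContext:
        ctx.raw_embedding = backend(ctx.proto_request)
        return ctx
    return execution_stage


def response_processing_stage(ctx: RequestContext) -> RequestContext:
    ctx.final_response = {
        "embedding": ctx.raw_embedding,
        "prompt_tokens": len(ctx.token_ids),
    }
    return ctx


def run_embedding_pipeline(text: str, backend) -> RequestContext:
    """Run the stages in order, threading one context through all of them."""
    ctx = RequestContext(text=text)
    stages = [
        preparation_stage,
        request_building_stage,
        make_execution_stage(backend),
        response_processing_stage,
    ]
    for stage in stages:
        ctx = stage(ctx)
    return ctx


def fake_backend(request: dict) -> List[float]:
    # Stand-in for a gRPC worker: echoes token ids as floats.
    return [float(t) for t in request["input_ids"]]


result = run_embedding_pipeline("hello world", fake_backend)
print(result.final_response)
```

Keeping each stage a small function over a shared context is what lets an embedding pipeline reuse generic steps such as execution while swapping in its own preparation and response handling.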

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces support for the Embedding API in the gRPC model gateway, which is a significant feature addition. The implementation is well-structured, with dedicated pipeline stages for embedding requests, integration into the GrpcRouter, and updates to the RequestContext and protocol wrappers to handle embeddings. The changes are comprehensive and touch both the Python backend and the Rust model gateway.

My review focuses on a few areas for improvement:

  • Consistency with the PR description regarding default values.
  • Removal of unused code.
  • Ensuring complete implementation of features mentioned in the description.

Overall, this is a solid contribution that extends the gateway's capabilities. Addressing the minor issues identified will further improve the code quality.

Comment thread python/sglang/srt/entrypoints/grpc_server.py
Comment thread sgl-model-gateway/src/grpc_client/sglang_scheduler.rs
Comment thread sgl-model-gateway/src/routers/grpc/pipeline.rs Outdated
@slin1237
Collaborator

/tag-and-rerun-ci

Comment thread sgl-model-gateway/src/routers/grpc/common/stages/request_execution.rs Outdated
Comment thread sgl-model-gateway/src/routers/grpc/common/stages/request_execution.rs Outdated
Comment thread sgl-model-gateway/src/routers/grpc/common/stages/request_execution.rs Outdated
Comment thread sgl-model-gateway/src/routers/grpc/harmony/stages/request_building.rs Outdated
Comment thread sgl-model-gateway/src/routers/grpc/client.rs Outdated
Comment thread sgl-model-gateway/src/routers/grpc/client.rs Outdated
Comment thread sgl-model-gateway/src/routers/grpc/pipeline.rs Outdated
@Ratish1
Collaborator Author

Ratish1 commented Dec 17, 2025

/gemini-review

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces support for the Embedding API in the gRPC model gateway, which is a significant and well-structured addition. The changes include dedicated pipeline stages, integration with the GrpcRouter, and extensions to the protocol and request context to handle embedding requests. The new e2e tests are also a great addition.

My review has identified a few key areas for improvement:

  • The handling of batch embedding requests is not aligned with the OpenAI API specification. Currently, an array of strings is concatenated and processed as a single input, whereas it should be processed as a batch, returning an embedding for each string.
  • There's a discrepancy in the Python gRPC server where image_inputs are hardcoded, which contradicts the PR description.

Addressing these points will ensure the new embedding endpoint is robust and compliant with expected behavior.
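
The batch behavior the first point asks for can be sketched as below. This is a minimal illustration, not the gateway's code: `embed_inputs` and `toy_embed` are hypothetical helpers. The response shape (a top-level `"object": "list"` with one `data` entry per input, each carrying its `index`) follows the OpenAI embeddings response format.

```python
from typing import Callable, List, Sequence, Union


def embed_inputs(
    inputs: Union[str, Sequence[str]],
    embed_one: Callable[[str], List[float]],
) -> dict:
    """Return one embedding per input string instead of concatenating them."""
    items = [inputs] if isinstance(inputs, str) else list(inputs)
    data = [
        {"object": "embedding", "index": i, "embedding": embed_one(text)}
        for i, text in enumerate(items)
    ]
    return {"object": "list", "data": data}


def toy_embed(text: str) -> List[float]:
    # Stand-in for a real model call (assumption): a 1-dimensional "embedding".
    return [float(len(text))]


batch = embed_inputs(["a", "bcd"], toy_embed)
single = embed_inputs("ab", toy_embed)
print(len(batch["data"]), len(single["data"]))
```

A concatenating implementation would return a single entry for `["a", "bcd"]`, so a client indexing `data[1]` would fail; an e2e test that checks `len(data)` against the number of inputs catches that difference.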

Comment thread python/sglang/srt/entrypoints/grpc_server.py
Comment thread sgl-model-gateway/py_test/e2e_grpc/basic/test_embedding_server.py
@Ratish1 Ratish1 requested a review from slin1237 December 17, 2025 13:50
@slin1237
Collaborator

PR looks good to me
please resolve conflict and i can merge this one
nice work

@Ratish1 Ratish1 requested a review from slin1237 December 22, 2025 06:52
@Ratish1
Collaborator Author

Ratish1 commented Dec 22, 2025

> PR looks good to me
> please resolve conflict and i can merge this one
> nice work

I fixed the conflict. Thanks for the review.

@Ratish1
Collaborator Author

Ratish1 commented Dec 23, 2025

I have updated the branch and the tests pass now, @slin1237. Let me know if you need anything else. Thanks for the review.

@slin1237 slin1237 merged commit 5f3a47d into sgl-project:main Dec 23, 2025
101 of 111 checks passed
@Ratish1 Ratish1 deleted the grpc-router-embeddings branch December 23, 2025 17:14