[model-gateway]: add gRPC router embeddings endpoint implementation#15273
slin1237 merged 14 commits into sgl-project:main
Summary of Changes
Hello @Ratish1, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly expands the model gateway's capabilities by introducing a comprehensive gRPC endpoint for embedding requests. It establishes a dedicated processing pipeline, from request preparation and building to execution and response processing, ensuring that embedding tasks are handled efficiently and distinctly from generation tasks. The changes also include necessary adaptations for both SGLang and vLLM backends, alongside robust error handling and improved metrics for observability.
Code Review
This pull request introduces support for the Embedding API in the gRPC model gateway, which is a significant feature addition. The implementation is well-structured, with dedicated pipeline stages for embedding requests, integration into the GrpcRouter, and updates to the RequestContext and protocol wrappers to handle embeddings. The changes are comprehensive and touch both the Python backend and the Rust model gateway.
My review focuses on a few areas for improvement:
- Consistency with the PR description regarding default values.
- Removal of unused code.
- Ensuring complete implementation of features mentioned in the description.
Overall, this is a solid contribution that extends the gateway's capabilities. Addressing the minor issues identified will further improve the code quality.
/tag-and-rerun-ci
…glang into grpc-router-embeddings
/gemini-review
Code Review
This pull request introduces support for the Embedding API in the gRPC model gateway, which is a significant and well-structured addition. The changes include dedicated pipeline stages, integration with the GrpcRouter, and extensions to the protocol and request context to handle embedding requests. The new e2e tests are also a great addition.
My review has identified a few key areas for improvement:
- The handling of batch embedding requests is not aligned with the OpenAI API specification. Currently, an array of strings is concatenated and processed as a single input, whereas it should be processed as a batch, returning an embedding for each string.
- There's a discrepancy in the Python gRPC server where `image_inputs` are hardcoded, which contradicts the PR description.
Addressing these points will ensure the new embedding endpoint is robust and compliant with expected behavior.
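To illustrate the batch behavior the review asks for, here is a minimal Python sketch, not the gateway's actual Rust code. It assumes the OpenAI-style response shape (a `data` list whose items carry `object`, `index`, and `embedding`). The helpers `normalize_input`, `embed_batch`, and `fake_embed` are hypothetical names introduced only for this example.

```python
from typing import Callable, List, Sequence, Union

# Hypothetical stand-in for a call to a backend worker.
EmbedFn = Callable[[str], List[float]]


def normalize_input(raw: Union[str, Sequence[str]]) -> List[str]:
    """Turn a bare string into a one-item batch; keep arrays as separate items."""
    if isinstance(raw, str):
        return [raw]
    return list(raw)


def embed_batch(raw: Union[str, Sequence[str]], embed_one: EmbedFn) -> dict:
    """Embed each input on its own instead of concatenating the array."""
    inputs = normalize_input(raw)
    data = [
        {"object": "embedding", "index": i, "embedding": embed_one(text)}
        for i, text in enumerate(inputs)
    ]
    return {"object": "list", "data": data}


def fake_embed(text: str) -> List[float]:
    # Toy embedding for demonstration only: [length, vowel count].
    return [float(len(text)), float(sum(c in "aeiou" for c in text))]


if __name__ == "__main__":
    print(embed_batch(["hello", "hi"], fake_embed))
```

With this shape, a request carrying two strings yields two entries in `data`, indexed in input order, which is what distinguishes batch processing from embedding the concatenated text once.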
PR looks good to me
I fixed the conflict. Thanks for the review.
I have updated the branch and the tests pass now, @slin1237. Let me know if you need anything else. Thanks for the review.
Motivation
This PR introduces initial support for the Embedding API within the gRPC model gateway. This feature allows clients to request vector embeddings for text inputs, leveraging existing gRPC backend workers (SGLang).
Modifications
- Dedicated pipeline stages: `EmbeddingPreparationStage`, `EmbeddingRequestBuildingStage`, and `EmbeddingResponseProcessingStage`. This ensures a clear separation of concerns and a robust processing flow specific to embeddings.
- `GrpcRouter` integration: Adds an `embedding_pipeline` to the `GrpcRouter` and a new `route_embeddings_impl` method to handle incoming embedding requests, dispatching them through the dedicated pipeline.
- Protocol wrappers: `ProtoRequest` and `ProtoEmbedResponse` wrap embedding-specific protobuf messages, enabling transparent interaction with different backend types.
- `RequestContext` extension: Updates the `RequestContext` with `RequestType::Embedding`, `FinalResponse::Embedding`, and `ExecutionResult::Embedding` to properly track embedding request states throughout the pipeline.
- The `_convert_embed_request` function in `python/sglang/srt/entrypoints/grpc_server.py` was updated to correctly extract `image_inputs`, `token_type_ids`, and `sampling_params` from the gRPC `EmbedRequest`. `sampling_params.max_new_tokens` is explicitly set to `0` to prevent a `TypeError` in the downstream scheduler, a workaround for current backend logic.
- In `python/sglang/srt/managers/scheduler.py`, a `None` check for `self.pad_input_ids_func` was added within `handle_embedding_request`, ensuring safe handling of multimodal input in embedding requests, with appropriate documentation.
- `log_metrics`: Adds `log_metrics: Option<bool>` to `EmbeddingRequest` for control over metrics logging, defaulting to `false`.
- Adds `build_embed_request` in `sgl-model-gateway/src/grpc_client/sglang_scheduler.rs` to abstract the gRPC `EmbedRequest`.
- Adds `ENDPOINT_EMBEDDINGS` to `smg_labels` and utilizes it for accurate metrics tracking of embedding requests, enhancing system observability.
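The two Python-side changes described above (forcing `max_new_tokens` to `0` and guarding the padding function against `None`) can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from `grpc_server.py` or `scheduler.py`. `FakeEmbedRequest`, `convert_embed_request`, and `pad_if_possible` are hypothetical names, the `input_ids` field is an assumption, and the real message and scheduler types will differ.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional


@dataclass
class FakeEmbedRequest:
    """Hypothetical stand-in for the gRPC EmbedRequest message."""
    input_ids: List[int]  # assumed field, not named in the PR text
    token_type_ids: Optional[List[int]] = None
    image_inputs: Optional[Dict[str, Any]] = None
    sampling_params: Dict[str, Any] = field(default_factory=dict)


def convert_embed_request(req: FakeEmbedRequest) -> Dict[str, Any]:
    """Extract the fields named in the PR and apply the max_new_tokens workaround."""
    # Copy so the caller's dict is not mutated.
    sampling_params = dict(req.sampling_params)
    # Embedding requests generate no tokens; per the PR description, an
    # explicit 0 avoids a TypeError in the downstream scheduler.
    sampling_params["max_new_tokens"] = 0
    return {
        "input_ids": req.input_ids,
        "token_type_ids": req.token_type_ids,
        "image_inputs": req.image_inputs,
        "sampling_params": sampling_params,
    }


def pad_if_possible(
    input_ids: List[int],
    image_inputs: Optional[Dict[str, Any]],
    pad_func: Optional[Callable[[List[int], Dict[str, Any]], List[int]]],
) -> List[int]:
    """Apply multimodal padding only when both the inputs and the function exist."""
    if image_inputs is None or pad_func is None:
        return input_ids
    return pad_func(input_ids, image_inputs)
```

The guard returns the token IDs unchanged when no padding function is configured, so a text-only embedding model does not fail on a missing callable.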