fix(serve): Fix Ray Serve LLM embeddings endpoint for pooling models #61959
CharleoY wants to merge 1 commit into ray-project:master
Conversation
Fixes two issues preventing embedding models from working with Ray Serve LLM:

1. Attribute name mismatch: vLLM sets `state.serving_embedding` but Ray was looking for `state.openai_serving_embedding`, causing the endpoint to fail with "This model does not support the 'embed' task".
2. API mismatch: vLLM's `ServingEmbedding` is a callable class, not a class with a `create_embedding` method. Updated to call the instance directly and handle the Starlette Response return type properly.

Tested with the `Qwen3-8B-emb` model using `runner="pooling"` and `convert="embed"`.

Co-Authored-By: Claude
Code Review
This pull request effectively addresses two compatibility issues with the vLLM embeddings endpoint. The changes correctly update the attribute name for the embedding service and adapt the API call to handle vLLM's callable ServingEmbedding class. Additionally, it properly handles the Starlette Response object that can be returned. I've included one suggestion to make the response type checking more robust.
```python
if hasattr(embedding_response, 'body'):
    content = json.loads(embedding_response.body)
    yield EmbeddingResponse(**content)
else:
    yield EmbeddingResponse(**embedding_response.model_dump())
yield embedding_response
```
Using hasattr for duck-typing can be brittle. A more robust way to check if embedding_response is a Starlette-like response object is to also check the type of the body attribute. Starlette Response objects have a body attribute of type bytes. This avoids potential issues if another type of object with a body attribute of a different type is returned in the future.
```diff
-if hasattr(embedding_response, 'body'):
-    content = json.loads(embedding_response.body)
-    yield EmbeddingResponse(**content)
-else:
-    yield EmbeddingResponse(**embedding_response.model_dump())
-yield embedding_response
+if isinstance(getattr(embedding_response, "body", None), bytes):
+    content = json.loads(embedding_response.body)
+    yield EmbeddingResponse(**content)
+else:
+    yield embedding_response
```
```python
    yield EmbeddingResponse(**content)
else:
    yield EmbeddingResponse(**embedding_response.model_dump())
yield embedding_response
```
Error responses mishandled as embedding success responses
Medium Severity
vLLM's PoolingServing.__call__ always returns a Starlette Response object (never a VLLMErrorResponse instance), so the isinstance(embedding_response, VLLMErrorResponse) check on line 589 is dead code. When vLLM returns an error, it will be a JSONResponse with an error status code and error body. This error response will fall through to the hasattr(embedding_response, 'body') branch, where EmbeddingResponse(**content) will crash because the error JSON doesn't match the EmbeddingResponse schema. The response's status_code needs to be checked to distinguish success from error responses before parsing the body.
Thanks for the fix! Started premerge, will review soon. Please fix DCO in the meantime.
eicherseiji
left a comment
Thanks @CharleoY. Added a comment. Please fix lint and add a release test :) Here's an example PR that includes a release test: #57194
Please lmk if you have any questions. Feel free to reach out via Ray Slack.
Context: The change is due to vllm-project/vllm#36110 upstream
```python
    raw_request=raw_request,
)

if isinstance(embedding_response, VLLMErrorResponse):
```
`embedding_response` will always be a Starlette response now; we need to check the status code instead of the instance type.
Thanks for the review! I've addressed all comments:
- Fixed the `embedding_response` handling to check `status_code` instead of `isinstance`
- Added a release test for embeddings endpoint
- Code is ready for another look
- Fix embedding response handling to check `status_code` instead of `isinstance`, since vLLM now always returns Starlette Response objects
- Add a release test for the embeddings endpoint following the ray-project#57194 pattern

Fixes review comments from PR ray-project#61959
kouroshHakha
left a comment
Thanks for identifying the embedding breakage — the two issues you found (attribute rename + callable API change) are real problems introduced in newer vLLM.
However, these API changes (state.serving_embedding instead of state.openai_serving_embedding, callable class instead of create_embedding()) don't exist in the currently pinned vLLM 0.17.0. Merging this as-is would break embeddings on the current release.
#61952 is the active effort to upgrade to vLLM 0.18.0, which is where these API changes originate. The embedding fixes from this PR should be folded into that upgrade PR so everything lands atomically with the version bump.
Recommend closing this PR and contributing the embedding-specific fixes to #61952 instead, or rebasing on top of that branch once it merges.
Thanks for reviewing and pointing out the vLLM version issue. I will build on top of the vLLM 0.18.0 commit once it gets merged.
This is solved via #61952.


Summary
Fixes two issues preventing embedding models from working with Ray Serve LLM:
Attribute name mismatch: vLLM sets `state.serving_embedding` but Ray was looking for `state.openai_serving_embedding`, causing the endpoint to fail with "This model does not support the 'embed' task".
API mismatch: vLLM's `ServingEmbedding` is a callable class, not a class with a `create_embedding` method. Updated to call the instance directly and handle the Starlette Response return type properly.
Changes
Testing
Tested with `Qwen3-8B-emb` model using `runner="pooling"` and `convert="embed"` configuration.
Related
Fixes compatibility issue between Ray Serve LLM and vLLM's pooling/embedding API.