fix(serve): Fix Ray Serve LLM embeddings endpoint for pooling models #61959
CharleoY wants to merge 1 commit into ray-project:master
Conversation
Fixes two issues preventing embedding models from working with Ray Serve LLM:

1. Attribute name mismatch: vLLM sets `state.serving_embedding` but Ray was looking for `state.openai_serving_embedding`, causing the endpoint to fail with "This model does not support the 'embed' task".
2. API mismatch: vLLM's `ServingEmbedding` is a callable class, not a class with a `create_embedding` method. Updated to call the instance directly and handle the Starlette Response return type properly.

Tested with the `Qwen3-8B-emb` model using `runner="pooling"` and `convert="embed"`.

Co-Authored-By: Claude
Code Review
This pull request effectively addresses two compatibility issues with the vLLM embeddings endpoint. The changes correctly update the attribute name for the embedding service and adapt the API call to handle vLLM's callable ServingEmbedding class. Additionally, it properly handles the Starlette Response object that can be returned. I've included one suggestion to make the response type checking more robust.
```python
if hasattr(embedding_response, 'body'):
    content = json.loads(embedding_response.body)
    yield EmbeddingResponse(**content)
else:
    yield EmbeddingResponse(**embedding_response.model_dump())
yield embedding_response
```
Using hasattr for duck-typing can be brittle. A more robust way to check if embedding_response is a Starlette-like response object is to also check the type of the body attribute. Starlette Response objects have a body attribute of type bytes. This avoids potential issues if another type of object with a body attribute of a different type is returned in the future.
```diff
-if hasattr(embedding_response, 'body'):
-    content = json.loads(embedding_response.body)
-    yield EmbeddingResponse(**content)
-else:
-    yield EmbeddingResponse(**embedding_response.model_dump())
-yield embedding_response
+if isinstance(getattr(embedding_response, "body", None), bytes):
+    content = json.loads(embedding_response.body)
+    yield EmbeddingResponse(**content)
+else:
+    yield embedding_response
```
```python
    yield EmbeddingResponse(**content)
else:
    yield EmbeddingResponse(**embedding_response.model_dump())
yield embedding_response
```
Error responses mishandled as embedding success responses
Medium Severity
vLLM's PoolingServing.__call__ always returns a Starlette Response object (never a VLLMErrorResponse instance), so the isinstance(embedding_response, VLLMErrorResponse) check on line 589 is dead code. When vLLM returns an error, it will be a JSONResponse with an error status code and error body. This error response will fall through to the hasattr(embedding_response, 'body') branch, where EmbeddingResponse(**content) will crash because the error JSON doesn't match the EmbeddingResponse schema. The response's status_code needs to be checked to distinguish success from error responses before parsing the body.
Thanks for the fix! Started premerge, will review soon. Please fix DCO in the meantime.
eicherseiji
left a comment
Thanks @CharleoY. Added a comment. Please fix lint and add a release test :) Here's an example PR that includes a release test: #57194
Please lmk if you have any questions. Feel free to reach out via Ray Slack.
Context: The change is due to vllm-project/vllm#36110 upstream
```python
    raw_request=raw_request,
)

if isinstance(embedding_response, VLLMErrorResponse):
```
`embedding_response` will always be a Starlette response now; we need to check the status code instead of the instance type.
Thanks for the review! I've addressed all comments:
- Fixed the `embedding_response` handling to check `status_code` instead of `isinstance`
- Added a release test for embeddings endpoint
- Code is ready for another look
- Fix embedding response handling to check `status_code` instead of `isinstance`, since vLLM now always returns Starlette Response objects
- Add a release test for the embeddings endpoint following the ray-project#57194 pattern

Fixes review comments from PR ray-project#61959
kouroshHakha
left a comment
Thanks for identifying the embedding breakage — the two issues you found (attribute rename + callable API change) are real problems introduced in newer vLLM.
However, these API changes (state.serving_embedding instead of state.openai_serving_embedding, callable class instead of create_embedding()) don't exist in the currently pinned vLLM 0.17.0. Merging this as-is would break embeddings on the current release.
#61952 is the active effort to upgrade to vLLM 0.18.0, which is where these API changes originate. The embedding fixes from this PR should be folded into that upgrade PR so everything lands atomically with the version bump.
Recommend closing this PR and contributing the embedding-specific fixes to #61952 instead, or rebasing on top of that branch once it merges.
Thanks for reviewing and pointing out the vLLM version issue. I will build on top of the vLLM 0.18.0 commit once it gets merged.
This is solved via #61952.


Summary
Fixes two issues preventing embedding models from working with Ray Serve LLM:
Attribute name mismatch: vLLM sets `state.serving_embedding` but Ray was looking for `state.openai_serving_embedding`, causing the endpoint to fail with "This model does not support the 'embed' task".
API mismatch: vLLM's `ServingEmbedding` is a callable class, not a class with a `create_embedding` method. Updated to call the instance directly and handle the Starlette Response return type properly.
Changes
Testing
Tested with `Qwen3-8B-emb` model using `runner="pooling"` and `convert="embed"` configuration.
Related
Fixes compatibility issue between Ray Serve LLM and vLLM's pooling/embedding API.