
fix(serve): Fix Ray Serve LLM embeddings endpoint for pooling models#61959

Closed
CharleoY wants to merge 1 commit intoray-project:masterfrom
CharleoY:fix-serve-llm-embeddings

Conversation

@CharleoY

Summary

Fixes two issues preventing embedding models from working with Ray Serve LLM:

  1. Attribute name mismatch: vLLM sets `state.serving_embedding` but Ray was looking for `state.openai_serving_embedding`, causing the endpoint to fail with "This model does not support the 'embed' task".

  2. API mismatch: vLLM's `ServingEmbedding` is a callable class, not a class with a `create_embedding` method. Updated to call the instance directly and handle the Starlette Response return type properly.

Changes

  • Updated attribute lookup from `openai_serving_embedding` to `serving_embedding` to match vLLM's naming
  • Changed from calling a `create_embedding()` method to invoking the `ServingEmbedding` instance directly (its `__call__`)
  • Added handling for Starlette Response objects returned by vLLM
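The Starlette handling described above can be sketched roughly as follows. This is a simplified stand-in, not the PR's actual code: `FakeStarletteResponse` and `extract_embedding_payload` are names invented for illustration, and only the bytes-body decoding step is shown.

```python
import json

class FakeStarletteResponse:
    """Minimal stand-in for starlette.responses.Response, whose body attribute is bytes."""
    def __init__(self, body: bytes, status_code: int = 200):
        self.body = body
        self.status_code = status_code

def extract_embedding_payload(embedding_response):
    """Decode a Starlette-like response body into a plain dict; pass
    anything else through unchanged."""
    if isinstance(getattr(embedding_response, "body", None), bytes):
        return json.loads(embedding_response.body)
    return embedding_response

resp = FakeStarletteResponse(b'{"object": "list", "data": [], "model": "Qwen3-8B-emb"}')
payload = extract_embedding_payload(resp)
print(payload["object"])  # prints: list
```

In the actual endpoint, the decoded dict would then be used to build the typed `EmbeddingResponse` that Ray Serve yields to the client.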

Testing

Tested with `Qwen3-8B-emb` model using `runner="pooling"` and `convert="embed"` configuration.

Related

Fixes compatibility issue between Ray Serve LLM and vLLM's pooling/embedding API.

Fix two issues preventing embedding models from working with Ray Serve LLM:

1. Attribute name mismatch: vLLM sets state.serving_embedding but Ray was
   looking for state.openai_serving_embedding, causing the endpoint to fail
   with "This model does not support the 'embed' task".

2. API mismatch: vLLM's ServingEmbedding is a callable class, not a class
   with a create_embedding method. Updated to call the instance directly
   and handle the Starlette Response return type properly.

Tested with Qwen3-8B-emb model using runner="pooling" and convert="embed".

Co-Authored-By: Claude
@CharleoY requested a review from a team as a code owner, March 22, 2026 17:37
Contributor

@gemini-code-assist (bot) left a comment


Code Review

This pull request effectively addresses two compatibility issues with the vLLM embeddings endpoint. The changes correctly update the attribute name for the embedding service and adapt the API call to handle vLLM's callable ServingEmbedding class. Additionally, it properly handles the Starlette Response object that can be returned. I've included one suggestion to make the response type checking more robust.

Comment on lines +596 to +600

if hasattr(embedding_response, 'body'):
    content = json.loads(embedding_response.body)
    yield EmbeddingResponse(**content)
else:
    yield EmbeddingResponse(**embedding_response.model_dump())
yield embedding_response
Contributor


Severity: medium

Using hasattr for duck-typing can be brittle. A more robust way to check if embedding_response is a Starlette-like response object is to also check the type of the body attribute. Starlette Response objects have a body attribute of type bytes. This avoids potential issues if another type of object with a body attribute of a different type is returned in the future.

Suggested change

-if hasattr(embedding_response, 'body'):
-    content = json.loads(embedding_response.body)
-    yield EmbeddingResponse(**content)
-else:
-    yield EmbeddingResponse(**embedding_response.model_dump())
-    yield embedding_response
+if isinstance(getattr(embedding_response, "body", None), bytes):
+    content = json.loads(embedding_response.body)
+    yield EmbeddingResponse(**content)
+else:
+    yield embedding_response


@cursor (bot) left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


    yield EmbeddingResponse(**content)
else:
    yield EmbeddingResponse(**embedding_response.model_dump())
yield embedding_response

Error responses mishandled as embedding success responses

Medium Severity

vLLM's PoolingServing.__call__ always returns a Starlette Response object (never a VLLMErrorResponse instance), so the isinstance(embedding_response, VLLMErrorResponse) check on line 589 is dead code. When vLLM returns an error, it will be a JSONResponse with an error status code and error body. This error response will fall through to the hasattr(embedding_response, 'body') branch, where EmbeddingResponse(**content) will crash because the error JSON doesn't match the EmbeddingResponse schema. The response's status_code needs to be checked to distinguish success from error responses before parsing the body.
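The status-code check the bot describes can be sketched like this. It is an illustrative stand-in, not the repository's code: `FakeJSONResponse` and `handle_embedding_response` are invented names, and the error-body shape is assumed to follow the OpenAI-style `{"error": {"message": ...}}` convention.

```python
import json

class FakeJSONResponse:
    """Stand-in for the Starlette JSONResponse that vLLM returns on both
    the success and the error path."""
    def __init__(self, body: bytes, status_code: int):
        self.body = body
        self.status_code = status_code

def handle_embedding_response(resp):
    # Branch on status_code before parsing into the success schema, since
    # an error body will not match EmbeddingResponse's fields.
    content = json.loads(resp.body)
    if resp.status_code >= 400:
        raise ValueError(content.get("error", {}).get("message", "embedding request failed"))
    return content

ok = handle_embedding_response(FakeJSONResponse(b'{"object": "list", "data": []}', 200))
print(ok["object"])  # prints: list
```

With this shape, a 4xx/5xx response surfaces as an explicit error instead of crashing during `EmbeddingResponse(**content)` construction.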


@eicherseiji added the go (add ONLY when ready to merge, run all tests) label Mar 22, 2026
@eicherseiji
Contributor

Thanks for the fix! Started premerge, will review soon.

Please fix the DCO check in the meantime.

@ray-gardener (bot) added the serve (Ray Serve Related Issue), llm, and community-contribution (Contributed by the community) labels Mar 22, 2026
Contributor

@eicherseiji left a comment


Thanks @CharleoY. Added a comment. Please fix lint and add a release test :) Here's an example PR that includes a release test: #57194

Please lmk if you have any questions. Feel free to reach out via Ray Slack.

Context: The change is due to vllm-project/vllm#36110 upstream

    raw_request=raw_request,
)

if isinstance(embedding_response, VLLMErrorResponse):
Contributor


embedding_response will always be a Starlette response now, need to check the status code instead of instance type

Author


Thanks for the review! I've addressed all comments:

  • Fixed the embedding_response handling to check status_code instead of isinstance
  • Added a release test for embeddings endpoint
  • Code is ready for another look

CharleoY pushed a commit to CharleoY/ray that referenced this pull request Mar 23, 2026
- Fix embedding response handling to check status_code instead of isinstance
  since vLLM now always returns Starlette Response objects
- Add release test for embeddings endpoint following PR ray-project#57194 pattern

Fixes review comments from PR ray-project#61959
Contributor

@kouroshHakha left a comment


Thanks for identifying the embedding breakage — the two issues you found (attribute rename + callable API change) are real problems introduced in newer vLLM.

However, these API changes (state.serving_embedding instead of state.openai_serving_embedding, callable class instead of create_embedding()) don't exist in the currently pinned vLLM 0.17.0. Merging this as-is would break embeddings on the current release.

#61952 is the active effort to upgrade to vLLM 0.18.0, which is where these API changes originate. The embedding fixes from this PR should be folded into that upgrade PR so everything lands atomically with the version bump.

Recommend closing this PR and contributing the embedding-specific fixes to #61952 instead, or rebasing on top of that branch once it merges.

@CharleoY
Author

> Thanks for identifying the embedding breakage — the two issues you found (attribute rename + callable API change) are real problems introduced in newer vLLM.
>
> However, these API changes (state.serving_embedding instead of state.openai_serving_embedding, callable class instead of create_embedding()) don't exist in the currently pinned vLLM 0.17.0. Merging this as-is would break embeddings on the current release.
>
> #61952 is the active effort to upgrade to vLLM 0.18.0, which is where these API changes originate. The embedding fixes from this PR should be folded into that upgrade PR so everything lands atomically with the version bump.
>
> Recommend closing this PR and contributing the embedding-specific fixes to #61952 instead, or rebasing on top of that branch once it merges.

Thanks for reviewing and pointing out the vLLM version issue. I will build on top of the vLLM 0.18.0 commit once it gets merged.

@jeffreywang-anyscale
Contributor

@CharleoY I'm working on #61952 and will cherry-pick your changes directly so that these could be addressed atomically. Thanks for identifying the problem!

@CharleoY
Author

> @CharleoY I'm working on #61952 and will cherry-pick your changes directly so that these could be addressed atomically. Thanks for identifying the problem!

Sure thing, thanks :)

@jeffreywang-anyscale
Contributor

This is solved via #61952.


Labels

  • community-contribution: Contributed by the community
  • go: add ONLY when ready to merge, run all tests
  • llm
  • serve: Ray Serve Related Issue
