Skip to content

Make trl vllm-serve OpenAI-compatible (exploratory)#5803

Open
qgallouedec wants to merge 2 commits into
mainfrom
vllm-serve-oai-compat
Open

Make trl vllm-serve OpenAI-compatible (exploratory)#5803
qgallouedec wants to merge 2 commits into
mainfrom
vllm-serve-oai-compat

Conversation

@qgallouedec

@qgallouedec qgallouedec commented May 21, 2026

Copy link
Copy Markdown
Member

Generated with Claude Code. Not intended to merge as-is. Opening this to evaluate whether the OpenAI work is worth doing.

closes #4602

Context

We have trl vllm-serve because old vLLM had no way to sync weights from a training process into a running engine. vLLM 0.17+ fixed that, and we're already on a path to drop older versions. Once that's done, trl vllm-serve can be retired in favour of stock vllm serve (already OpenAI-compatible, ofc). So the question is: is OpenAI compatibility worth doing in the meantime?

What it does

Reshapes the whole HTTP surface to OpenAI: /generate//v1/completions, /chat//v1/chat/completions, new /v1/models, trailing slashes dropped; TRL-only weight-sync endpoints kept. Server gains a concurrent-chat batcher to recover throughput under one-conversation-per-request semantics. Client switches to the openai SDK for /v1/*. The public VLLMClient API is unchanged, so trainers / openenv / distillation work without modification. trl vllm-serve's wire format is not public API.

Pros / cons

  • + Long-standing user request; any OpenAI SDK now works against the server.
  • + Real spec rather than ad-hoc TRL fields.
  • Big diff for a script we plan to retire.
  • New complexity (chat batcher, workers lock, OpenAI-extras plumbing) that wouldn't exist on stock vllm serve.

My take

I'm leaning toward not doing this. The diff is heavy for a script we're planning to retire, and once trl vllm-serve is gone, the OpenAI compatibility comes for free via stock vllm serve. Happy to be argued out of it.
@albertvillanova i'd love to hear what you think.

Not tested

Diff hasn't been run end-to-end. Existing tests call only the preserved public API.


Note

Medium Risk
Moderate risk because it rewires the HTTP contract for trl vllm-serve (new /v1/* endpoints, removed legacy routes) and adds new batching/concurrency paths plus an openai SDK dependency that could affect training integrations and runtime behavior.

Overview
Makes trl vllm-serve OpenAI-compatible by replacing the legacy /generate/ and /chat/ routes with /v1/completions, /v1/chat/completions, plus /v1/models, and by standardizing URL paths (no trailing slashes; /health, /world_size, etc.). Responses now follow OpenAI shapes while carrying TRL extensions like prompt_token_ids/token_ids and logprob token-id payloads.

Updates VLLMClient to use the official openai Python SDK for generation calls while keeping TRL-only side-channel endpoints for weight sync, prefix-cache reset, and teacher-logprob batching; the client also adds concurrent fan-out for chat requests and converts PIL images to OpenAI image_url data URIs.

Server throughput changes: adds a background chat request batcher (queue + grouping by tools/template) and a shared worker lock to serialize pipe I/O across concurrent endpoints, alongside refactoring sampling/logprob formatting to match OpenAI semantics.

Reviewed by Cursor Bugbot for commit a165260. Bugbot is set up for automated code reviews on this repo. Configure here.

@HuggingFaceDocBuilderDev

Copy link
Copy Markdown

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 27026732ef

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread trl/scripts/vllm_serve.py
Comment on lines +986 to +989
{
"index": i,
"message": {"role": "assistant", "content": completion.text},
"finish_reason": completion.finish_reason or "stop",

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve tool calls in chat completion messages

The new /v1/chat/completions handler accepts tools and forwards them to vLLM, but each choice is serialized with only message.content and never includes message.tool_calls. In requests where the model chooses a function/tool call (e.g., finish_reason == "tool_calls"), OpenAI-compatible clients rely on choices[i].message.tool_calls to execute the tool; omitting it makes tool-calling flows fail even though the request advertised tool support.

Useful? React with 👍 / 👎.

Comment thread trl/scripts/vllm_serve.py
if part.get("type") == "image_url":
url = part.get("image_url", {}).get("url", "")
part["image_pil"] = _decode_data_url(url)
return messages

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Image conversion doesn't update part type for vLLM

Medium Severity

_convert_messages_for_vllm adds an image_pil key to content parts but never changes the type from "image_url" to "image_pil" and never removes the image_url key. The function's docstring states it prepares images for vLLM's chat API which "accepts PIL," but without updating the type field, vLLM's message processing will still see the part as image_url type and attempt to process the data URI directly rather than using the pre-decoded PIL object. The added image_pil key becomes dead weight that vLLM never reads.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 2702673. Configure here.

Comment thread trl/scripts/vllm_serve.py
{
"token": tok,
"bytes": None,
"logprob": -1e30 if v is None else v,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Chat logprobs replace None with sentinel value

Low Severity

_format_chat_logprobs replaces None logprob values (from NaN in vLLM) with -1e30, but the client's _extract_choice_logprobs reads these as plain floats from entry["logprob"]. This changes the semantics: the old code preserved None in the returned logprobs list, while the new chat path returns -1e30. Downstream code that checks if value is None to detect invalid logprobs would silently get a very negative float instead.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 2702673. Configure here.

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

There are 3 total unresolved issues (including 2 from previous reviews).

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit a165260. Configure here.

Comment thread trl/scripts/vllm_serve.py
}
for connection in connections:
connection.send({"type": "fire_and_forget", "method": "collective_rpc", "kwargs": kwargs})
return {"message": "Request received, initializing communicator"}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fire-and-forget endpoints bypass workers lock causing races

Low Severity

The init_communicator, update_named_param, and close_communicator endpoints call connection.send() without acquiring _workers_lock. The locked endpoints (/v1/completions, chat batcher, logprob batcher, reset_prefix_cache) use run_in_executor for pipe recv, which runs in a separate thread. If a fire-and-forget endpoint sends to a connection while that executor thread is concurrently sending or receiving on the same Connection object, this constitutes a thread-safety violation on multiprocessing.Connection.

Additional Locations (2)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit a165260. Configure here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

openai-compatible responses and chat completions endpoints in vllm_serve.py

2 participants