Make trl vllm-serve OpenAI-compatible (exploratory)#5803
Conversation
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 27026732ef
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| { | ||
| "index": i, | ||
| "message": {"role": "assistant", "content": completion.text}, | ||
| "finish_reason": completion.finish_reason or "stop", |
There was a problem hiding this comment.
Preserve tool calls in chat completion messages
The new /v1/chat/completions handler accepts tools and forwards them to vLLM, but each choice is serialized with only message.content and never includes message.tool_calls. In requests where the model chooses a function/tool call (e.g., finish_reason == "tool_calls"), OpenAI-compatible clients rely on choices[i].message.tool_calls to execute the tool; omitting it makes tool-calling flows fail even though the request advertised tool support.
Useful? React with 👍 / 👎.
| if part.get("type") == "image_url": | ||
| url = part.get("image_url", {}).get("url", "") | ||
| part["image_pil"] = _decode_data_url(url) | ||
| return messages |
There was a problem hiding this comment.
Image conversion doesn't update part type for vLLM
Medium Severity
_convert_messages_for_vllm adds an image_pil key to content parts but never changes the type from "image_url" to "image_pil" and never removes the image_url key. The function's docstring states it prepares images for vLLM's chat API which "accepts PIL," but without updating the type field, vLLM's message processing will still see the part as image_url type and attempt to process the data URI directly rather than using the pre-decoded PIL object. The added image_pil key becomes dead weight that vLLM never reads.
Reviewed by Cursor Bugbot for commit 2702673. Configure here.
| { | ||
| "token": tok, | ||
| "bytes": None, | ||
| "logprob": -1e30 if v is None else v, |
There was a problem hiding this comment.
Chat logprobs replace None with sentinel value
Low Severity
_format_chat_logprobs replaces None logprob values (from NaN in vLLM) with -1e30, but the client's _extract_choice_logprobs reads these as plain floats from entry["logprob"]. This changes the semantics: the old code preserved None in the returned logprobs list, while the new chat path returns -1e30. Downstream code that checks if value is None to detect invalid logprobs would silently get a very negative float instead.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 2702673. Configure here.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
There are 3 total unresolved issues (including 2 from previous reviews).
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit a165260. Configure here.
| } | ||
| for connection in connections: | ||
| connection.send({"type": "fire_and_forget", "method": "collective_rpc", "kwargs": kwargs}) | ||
| return {"message": "Request received, initializing communicator"} |
There was a problem hiding this comment.
Fire-and-forget endpoints bypass workers lock causing races
Low Severity
The init_communicator, update_named_param, and close_communicator endpoints call connection.send() without acquiring _workers_lock. The locked endpoints (/v1/completions, chat batcher, logprob batcher, reset_prefix_cache) use run_in_executor for pipe recv, which runs in a separate thread. If a fire-and-forget endpoint sends to a connection while that executor thread is concurrently sending or receiving on the same Connection object, this constitutes a thread-safety violation on multiprocessing.Connection.
Additional Locations (2)
Reviewed by Cursor Bugbot for commit a165260. Configure here.


closes #4602
Context
We have
trl vllm-servebecause old vLLM had no way to sync weights from a training process into a running engine. vLLM 0.17+ fixed that, and we're already on a path to drop older versions. Once that's done,trl vllm-servecan be retired in favour of stockvllm serve(already OpenAI-compatible, ofc). So the question is: is OpenAI compatibility worth doing in the meantime?What it does
Reshapes the whole HTTP surface to OpenAI:
/generate/→/v1/completions,/chat/→/v1/chat/completions, new/v1/models, trailing slashes dropped; TRL-only weight-sync endpoints kept. Server gains a concurrent-chat batcher to recover throughput under one-conversation-per-request semantics. Client switches to theopenaiSDK for/v1/*. The publicVLLMClientAPI is unchanged, so trainers / openenv / distillation work without modification.trl vllm-serve's wire format is not public API.Pros / cons
vllm serve.My take
I'm leaning toward not doing this. The diff is heavy for a script we're planning to retire, and once
trl vllm-serveis gone, the OpenAI compatibility comes for free via stockvllm serve. Happy to be argued out of it.@albertvillanova i'd love to hear what you think.
Not tested
Diff hasn't been run end-to-end. Existing tests call only the preserved public API.
Note
Medium Risk
Moderate risk because it rewires the HTTP contract for
trl vllm-serve(new/v1/*endpoints, removed legacy routes) and adds new batching/concurrency paths plus anopenaiSDK dependency that could affect training integrations and runtime behavior.Overview
Makes
trl vllm-serveOpenAI-compatible by replacing the legacy/generate/and/chat/routes with/v1/completions,/v1/chat/completions, plus/v1/models, and by standardizing URL paths (no trailing slashes;/health,/world_size, etc.). Responses now follow OpenAI shapes while carrying TRL extensions likeprompt_token_ids/token_idsand logprob token-id payloads.Updates
VLLMClientto use the officialopenaiPython SDK for generation calls while keeping TRL-only side-channel endpoints for weight sync, prefix-cache reset, and teacher-logprob batching; the client also adds concurrent fan-out for chat requests and converts PIL images to OpenAIimage_urldata URIs.Server throughput changes: adds a background chat request batcher (queue + grouping by tools/template) and a shared worker lock to serialize pipe I/O across concurrent endpoints, alongside refactoring sampling/logprob formatting to match OpenAI semantics.
Reviewed by Cursor Bugbot for commit a165260. Bugbot is set up for automated code reviews on this repo. Configure here.