[DisaggEverything] Tokens in<>out /generate endpoint #24261

Merged
mgoin merged 6 commits into vllm-project:main from NickLucche:generate-api
Nov 14, 2025

Conversation


@NickLucche NickLucche commented Sep 4, 2025

Overview

First step in implementing the "Disaggregated Everything" proposal #22817.
This PR focuses on the following component:

image

In particular, it introduces:

  • A GenerateRequest/Response interface. NOTE: SamplingParams can now be validated and deserialized within a pydantic message (e.g. input-only). Check out PydanticMsgspecMixin.
  • A /generate tokens-only endpoint.
  • An initial set of endpoint args, mimicking /v1/chat/completions for the most part.
  • A --tokens-only "modality" for starting up the server, mostly intended to simplify UX.
  • An /abort_requests endpoint, see below.
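For illustration, a tokens-in request body could be assembled as below. Note that the field names (prompt_token_ids, sampling_params) are assumptions modeled loosely on the example response shown under "How to test", not the actual GenerateRequest schema:

```python
# Hypothetical sketch of a tokens-in/tokens-out request body; field names
# are illustrative assumptions, NOT the authoritative GenerateRequest schema.
import uuid

def make_generate_request(prompt_token_ids, max_tokens=64, temperature=0.0):
    """Build a minimal /generate-style request body (illustrative only)."""
    return {
        "request_id": uuid.uuid4().hex,        # same shape as the example output
        "prompt_token_ids": prompt_token_ids,  # tokens in
        "sampling_params": {
            "max_tokens": max_tokens,
            "temperature": temperature,
            "detokenize": False,               # tokens out, no detokenizer
        },
    }

req = make_generate_request([785, 7513, 9145])
assert sorted(req) == ["prompt_token_ids", "request_id", "sampling_params"]
```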

Implementation Details

To get a "tokenizer-free" endpoint, one can already use --skip_tokenizer_init and/or the detokenize: False sampling option, which forces the use of the basic IncrementalDetokenizer.
To make the UX easier for a Disaggregated Everything setup, a --tokens-only option is added, which enforces the two flags above.
This way the Detokenizer is optional, as intended in the initial design.
INFO 09-10 13:36:17 [arg_utils.py:1281] Skipping tokenizer initialization for tokens-only mode.
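The enforcement can be pictured with a minimal stdlib-only sketch; the dataclasses here are stand-ins, not vLLM's actual ModelConfig/SamplingParams:

```python
# Illustrative sketch (NOT vLLM's actual code) of how a --tokens-only flag
# can enforce the two existing knobs described above.
from dataclasses import dataclass

@dataclass
class ModelConfig:
    skip_tokenizer_init: bool = False

@dataclass
class SamplingParams:
    detokenize: bool = True

def apply_tokens_only(tokens_only: bool, model_config: ModelConfig,
                      sampling_params: SamplingParams) -> None:
    """Force tokenizer-free operation when tokens_only is set."""
    if tokens_only:
        model_config.skip_tokenizer_init = True  # never load the tokenizer
        sampling_params.detokenize = False       # return token IDs only

cfg, sp = ModelConfig(), SamplingParams()
apply_tokens_only(True, cfg, sp)
assert cfg.skip_tokenizer_init and not sp.detokenize
```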

Furthermore, it enables the /abort_requests endpoint.

/abort_requests addresses stop-string detection, one of the main challenges in getting a truly "tokenizer-free" endpoint.
Currently, detection happens in the AsyncLLM output_handler_loop, followed by an IPC abort request back to the EngineCore, like so:

    +-->AsyncLLM---+------------------->API
    |              |
    |ECOs          |stop_string abort
    |              |
EngineCore <-------+
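The stop-string check itself can be sketched as a small incremental matcher (an illustrative stand-in for the role IncrementalDetokenizer plays here, not vLLM's implementation); it keeps a short tail of text so a stop string split across two chunks is still caught:

```python
# Sketch of incremental stop-string detection over a streamed, detokenized
# output. Illustrative only; callers are assumed to have already forwarded
# text outside the retained tail.
def find_stop(buffer: str, chunk: str, stop: str):
    """Append a newly detokenized chunk; report whether `stop` appeared.

    Returns (state, hit): on a hit, `state` is the text before the stop
    string; otherwise it is the small tail kept for cross-chunk matching.
    """
    buffer += chunk
    idx = buffer.find(stop)
    if idx != -1:
        return buffer[:idx], True
    # keep at most len(stop)-1 trailing chars: a match could straddle chunks
    return buffer[-(len(stop) - 1):] if len(stop) > 1 else "", False

buf, hit = "", False
for chunk in ["The EU has 27 member states.<|im", "_end|> extra"]:
    buf, hit = find_stop(buf, chunk, "<|im_end|>")
    if hit:
        break
assert hit  # stop string detected across the chunk boundary
```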

With Disaggregated Everything, we task the "Coordinator" (to be implemented in a follow-up PR) with detokenization. Hence, the "generate" instance needs to act more as a "remote EngineCore". The workflow is the following:

    +-->AsyncLLM---+------------------->API
    |              |
    |              |stop_string abort
    |              |
GenerateResponse   |
    |              |                    Coordinator Node
____|______________|____________________________________
    +-->AsyncLLM   |/abort_requests
    |              |
    |ECOs          |stop_string abort
    |              |
EngineCore <-------+
                                        Generate (tokens-only) Node
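The coordinator-side flow above can be sketched as follows; detokenization and the /abort_requests HTTP call are both stubbed out, so every name here is an illustrative assumption:

```python
# Sketch of the coordinator-side loop implied by the diagram above: the
# coordinator detokenizes streamed token IDs and, on seeing a stop string,
# calls back into the generate node. `detok` stands in for a real tokenizer
# and `abort_fn` for a POST to /abort_requests.
def coordinate(request_id, token_stream, detok, stop, abort_fn):
    """Consume token IDs, stopping early when `stop` appears in the text."""
    text = ""
    for token_id in token_stream:
        text += detok(token_id)
        if stop in text:
            abort_fn(request_id)        # would POST /abort_requests
            return text.split(stop)[0]  # text before the stop string
    return text

aborted = []
vocab = {1: "Hi", 2: "<stop>", 3: " tail"}
out = coordinate("req-0", [1, 2, 3], vocab.get, "<stop>", aborted.append)
assert out == "Hi" and aborted == ["req-0"]
```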

How to test

# vllm serve Qwen/Qwen3-0.6B
# python examples/online_serving/token_generation_client.py
{'request_id': 'a0e37922547c4d95885b9ce19588b9ef', 'choices': [{'index': 0, 'logprobs': None, 'finish_reason': 'stop', 'token_ids': [785, 7513, 9145, 320, 38807, 8, 17167, 315, 3070, 17, 22, 4462, 5302, 334, 13, 151645]}], 'prompt_logprobs': None, 'kv_transfer_params': None}
--------------------------------------------------
Token generation results:
The European Union (EU) consists of **27 member states**.<|im_end|>

or, among other tests:

# Ensures tokenizer+/generate+detokenizer == /v1/chat/completions
python -m pytest -v -s tests/entrypoints/openai/test_serving_tokens.py::test_same_response_as_chat_completions

Follow up PRs:

  • streaming mode
  • MultiModalFeatureSpec input, will add once Renderer effort progresses
  • more endpoint params

@mergify mergify bot added the frontend, v1, and documentation labels Sep 4, 2025

mergify bot commented Sep 8, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @NickLucche.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 8, 2025
@NickLucche NickLucche changed the title from "[do not merge] Tokens in<>out /generate endpoint" to "[DisaggEverything] Tokens in<>out /generate endpoint" Sep 10, 2025
@NickLucche NickLucche marked this pull request as ready for review September 10, 2025 15:04

@mergify mergify bot removed the needs-rebase label Sep 10, 2025

smarterclayton commented Sep 11, 2025

EDIT: Moved to #22817 (comment)

@russellb
Member

I see that /v1/generate was added to the OpenAI API endpoint. This API is often exposed directly to end users. Is this API intended to be potentially used directly by end users, or is it more of an internal infrastructure API?

If it's a different audience, it may be better suited to a different HTTP service scoped to a different audience and purpose. I had similar feedback about an earlier version of HTTP metadata exchange for the Nixl connector, but the latest version seems to have moved it to its own HTTP service: #22274

If it is desired to keep this on the existing OpenAI API, I think it'd be nice if we used namespacing to make it clear which APIs are our own custom ones vs. our implementation of APIs defined by OpenAI. One option would be something like v1/vllm/generate where everything under v1/vllm/ is a vllm-custom API aligned with the V1 OpenAI API. Another option is vllm/generate or vllm/v1/generate. I feel less strongly about the specific choice than just doing something to separate our custom APIs.

@NickLucche
Collaborator Author

> Is this API intended to be potentially used directly by end users, or is it more of an internal infrastructure API?

We're still discussing the full spectrum of intended use cases with @smarterclayton.
In my view, it's definitely aimed at use within a larger infrastructure, but there are also more niche cases where someone just wants vLLM for inference and doesn't care about the added overhead of the OAI spec (e.g. RL).

> I feel less strongly about the specific choice than just doing something to separate our custom APIs

I understand; would you be in favor of a separate entrypoint altogether? My motivation for keeping things inside the OAI one was to enable easy access to the other endpoints, which are not mutually exclusive, at least at this early stage.

vllm/v1/generate works for me, although @smarterclayton raised the issue of keeping the interface "open", as in not vLLM-exclusive.

@russellb
Member

> > Is this API intended to be potentially used directly by end users, or is it more of an internal infrastructure API?
>
> We're still discussing the full spectrum of intended use cases with @smarterclayton. In my view, it's definitely aimed at use within a larger infrastructure, but there are also more niche cases where someone just wants vLLM for inference and doesn't care about the added overhead of the OAI spec (e.g. RL).
>
> I understand; would you be in favor of a separate entrypoint altogether? My motivation for keeping things inside the OAI one was to enable easy access to the other endpoints, which are not mutually exclusive, at least at this early stage.

It's probably fine to keep within the same API. It doesn't seem harmful to expose (unlike, say, internal infrastructure metadata exchange).

> vllm/v1/generate works for me, although @smarterclayton raised the issue of keeping the interface "open", as in not vLLM-exclusive.

Fair point. I just think it'd be nice to make it clear where we're copying OpenAI vs. defining our own completely independent APIs. It could be inference/v1/generate or something ...

@NickLucche
Collaborator Author

@russellb Changed the naming to the one you suggested. Let me know if there's anything else I should change in this PR in your view; looking to move this forward.

@rizar

rizar commented Sep 21, 2025

I'm looking forward to this feature!

Question: will this endpoint propagate data_parallel_rank selector to EngineCoreRequest? Like what is currently added in #24945

@mergify

mergify bot commented Sep 21, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @NickLucche.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot removed the needs-rebase label Oct 24, 2025
@hmellor hmellor self-requested a review as a code owner November 12, 2025 19:35
@mergify mergify bot added the ci/build label Nov 12, 2025
@mgoin mgoin added the ready label (ONLY add when PR is ready to merge/full CI is needed) Nov 12, 2025
Member

@mgoin mgoin left a comment

LGTM to start the structure, nice work. Just some cleanup nits

Member

Future work: we should pull out these APIs into a separate folder, like in this refactor #28040

token_ids = output.token_ids
out_logprobs = output.logprobs

# sampling_params.logprobs == req.top_logprobs
Member

Cruft, or we should assert this?

Collaborator Author

It was more a way of noting that this is the same as in completions, just under a different name, since logprobs is a bit overloaded; redacted.

@mergify

mergify bot commented Nov 14, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @NickLucche.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 14, 2025
Comment on lines -122 to -126
class UtilityResult:
"""Wrapper for special handling when serializing/deserializing."""

def __init__(self, r: Any = None):
self.result = r
Collaborator Author

To avoid a circular import error; also, I believe this belongs in utils anyway.

NickLucche and others added 6 commits November 14, 2025 14:07
  • msgspec+pydantic ser mixin
  • example script
  • tests
  • support lora
  • tokens-only cli arg
  • enforcing tokens-only+abort endpoint
  • stop string tests
  • remove openai prefix from oaiservingtoken

Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
@NickLucche
Collaborator Author

Thanks for the review @mgoin , addressed your comments.

@mergify mergify bot removed the needs-rebase label Nov 14, 2025
Member

@mgoin mgoin left a comment

Great work!

@mgoin mgoin merged commit 6f1e7f7 into vllm-project:main Nov 14, 2025
50 checks passed
geodavic pushed a commit to geodavic/vllm that referenced this pull request Nov 16, 2025
…24261)

Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: George D. Torres <gdavtor@gmail.com>
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
…24261)

Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Comment on lines 569 to +1501
@@ -1495,6 +1496,10 @@ def create_engine_config(
else ParallelConfig.data_parallel_rpc_port
)

if self.tokens_only and not model_config.skip_tokenizer_init:
model_config.skip_tokenizer_init = True
logger.info("Skipping tokenizer initialization for tokens-only mode.")
Member

why does tokens_only need to be an engine arg?

Collaborator Author

This is more of a UX change: the tokenization skip depends on model_config, but ideally in a disaggregated setup you want a more general toggle to just ensure/signal that you're deploying a tokens in-out instance.

I was actually planning to leave this flag for toggling optimizations that are "disaggregated-everything specific".

kitaekatt pushed a commit to kitaekatt/vllm that referenced this pull request Dec 1, 2025
…24261)

Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

Labels

ci/build, documentation (Improvements or additions to documentation), frontend, ready (ONLY add when PR is ready to merge/full CI is needed), v1

7 participants