[DisaggEverything] Tokens in<>out /generate endpoint #24261
mgoin merged 6 commits into vllm-project:main
EDIT: Moved to #22817 (comment)
If it's a different audience, it may be better suited to a different HTTP service scoped to a different audience and purpose. I had similar feedback about an earlier version of HTTP metadata exchange for the Nixl connector, but the latest version seems to have moved it to its own HTTP service: #22274. If it is desired to keep this on the existing OpenAI API, I think it'd be nice if we used namespacing to make it clear which APIs are our own custom ones vs. our implementation of APIs defined by OpenAI. One option would be something like
We're still discussing with @smarterclayton the full spectrum of intended use cases.
I understand. Would you be in favor of a separate entrypoint altogether? My motivation for keeping things inside the OAI one was to enable easy access to the other endpoints, which are not exclusive, at least at this early stage.
It's probably fine to keep within the same API. It doesn't seem harmful to expose (like maybe internal infrastructure metadata exchange would be).
Fair point. I just think it'd be nice to make it clear where we're copying OpenAI vs. defining our own completely independent APIs. It could be
@russellb Changed naming to the one you suggested. Let me know if there's something else I should change in this PR in your view; looking to move this forward.
I'm looking forward to this feature! Question: will this endpoint propagate
mgoin left a comment:
LGTM to start the structure, nice work. Just some cleanup nits
Future work: we should pull out these APIs into a separate folder, like in this refactor #28040
```python
token_ids = output.token_ids
out_logprobs = output.logprobs

# sampling_params.logprobs == req.top_logprobs
```
It was more of a way of commenting this was the same as in completions but under a different name, since logprobs is a bit overloaded; redacted
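For illustration, a minimal sketch of the naming mapping this comment refers to: the request-level `top_logprobs` value feeds the engine-level `SamplingParams.logprobs` field. The dataclasses below are stand-ins for illustration, not vLLM's actual types.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SamplingParams:
    # Engine-level knob: how many top logprobs to return per token.
    logprobs: Optional[int] = None


@dataclass
class GenerateRequest:
    # Request-facing name, chat-completions style.
    top_logprobs: Optional[int] = None


def to_sampling_params(req: GenerateRequest) -> SamplingParams:
    # Same value, different name: request-level `top_logprobs` becomes
    # engine-level `logprobs`, as noted in the review comment above.
    return SamplingParams(logprobs=req.top_logprobs)


params = to_sampling_params(GenerateRequest(top_logprobs=5))
print(params.logprobs)  # 5
```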
```python
class UtilityResult:
    """Wrapper for special handling when serializing/deserializing."""

    def __init__(self, r: Any = None):
        self.result = r
```
to avoid circular import error; also, I believe this belongs to utils anyways
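As a hedged sketch of why a thin wrapper like this helps: it lets a serializer recognize "a utility result" by type alone, without importing the caller's modules (sidestepping the circular import mentioned above). The JSON encode/decode helpers here are illustrative, not vLLM's actual serialization path.

```python
import json
from typing import Any


class UtilityResult:
    """Wrapper for special handling when serializing/deserializing."""

    def __init__(self, r: Any = None):
        self.result = r


def encode(obj: Any) -> str:
    # The wrapper type is the signal: tag wrapped values so the other side
    # can restore them, and pass plain payloads through untouched.
    if isinstance(obj, UtilityResult):
        return json.dumps({"__utility_result__": obj.result})
    return json.dumps(obj)


def decode(data: str) -> Any:
    parsed = json.loads(data)
    if isinstance(parsed, dict) and "__utility_result__" in parsed:
        return UtilityResult(parsed["__utility_result__"])
    return parsed


roundtripped = decode(encode(UtilityResult([1, 2, 3])))
print(roundtripped.result)  # [1, 2, 3]
```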
Squashed commit subjects: msgspec+pydantic ser mixin; example script; tests; support lora; tokens-only cli arg; enforcing tokens-only+abort endpoint; stop string tests; remove openai prefix from oaiservingtoken.
Signed-off-by: NickLucche <nlucches@redhat.com>
Thanks for the review @mgoin, addressed your comments.
…24261) Signed-off-by: NickLucche <nlucches@redhat.com> Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Signed-off-by: George D. Torres <gdavtor@gmail.com>
```diff
@@ -1495,6 +1496,10 @@ def create_engine_config(
         else ParallelConfig.data_parallel_rpc_port
     )

+    if self.tokens_only and not model_config.skip_tokenizer_init:
+        model_config.skip_tokenizer_init = True
+        logger.info("Skipping tokenizer initialization for tokens-only mode.")
```
why does tokens_only need to be an engine arg?
This is more of a UX change: the tokenization skip depends on model_config, but ideally in a disaggregated setup you want a more general toggle to just ensure/signal that you're deploying a tokens in-out instance.
I was actually planning to leave this flag for toggling optimizations that are "disaggregated-everything specific".
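A hedged sketch of the flag's intended semantics as described in this thread: `--tokens-only` simply enforces the two existing toggles, `skip_tokenizer_init` and `detokenize=False`. The classes and function below are illustrative stand-ins, not vLLM's actual config objects.

```python
from dataclasses import dataclass


@dataclass
class ModelConfig:
    skip_tokenizer_init: bool = False


@dataclass
class SamplingDefaults:
    detokenize: bool = True


def apply_tokens_only(tokens_only: bool, model_config: ModelConfig,
                      sampling: SamplingDefaults) -> None:
    # Mirrors the hunk above: a tokens-only instance never initializes the
    # tokenizer and never detokenizes outputs.
    if tokens_only:
        model_config.skip_tokenizer_init = True
        sampling.detokenize = False


cfg, defaults = ModelConfig(), SamplingDefaults()
apply_tokens_only(True, cfg, defaults)
print(cfg.skip_tokenizer_init, defaults.detokenize)  # True False
```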
Overview
First step in implementing the "Disaggregated Everything" proposal #22817.
This PR focuses on the following component:
In particular, it introduces:
- `GenerateRequest`/`GenerateResponse` interface. NOTE: `SamplingParams` can now be validated and deserialized within a pydantic message (eg input-only). Check out `PydanticMsgspecMixin`.
- `/generate` tokens-only endpoint, following `/v1/chat/completions` for the most part.
- `--tokens-only` "modality" for starting up the server, mostly intended to simplify UX.
- `/abort_requests` endpoint, see below.

Implementation Details
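A minimal sketch of the tokens-in/tokens-out exchange the `/generate` endpoint implements: both request and response carry raw token ids rather than text. Field names here are illustrative; see the actual `GenerateRequest`/`GenerateResponse` definitions in this PR for the real schema.

```python
from dataclasses import dataclass, field


@dataclass
class GenerateRequest:
    prompt_token_ids: list[int] = field(default_factory=list)
    max_tokens: int = 16


@dataclass
class GenerateResponse:
    token_ids: list[int] = field(default_factory=list)


def fake_generate(req: GenerateRequest) -> GenerateResponse:
    # Stand-in for the engine: echoes the prompt ids, truncated to
    # max_tokens, just to show that both sides of the exchange are token ids.
    return GenerateResponse(token_ids=req.prompt_token_ids[: req.max_tokens])


resp = fake_generate(GenerateRequest(prompt_token_ids=[101, 7592, 102],
                                     max_tokens=2))
print(resp.token_ids)  # [101, 7592]
```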
To get a "tokenizer-free" endpoint, one can already use `--skip_tokenizer_init` and/or the `detokenize: False` sampling option, forcing the use of the basic `IncrementalDetokenizer`. In order to make UX easier for a Disaggregated Everything setup, a `--tokens-only` option is added, which enforces the two flags above. This way the Detokenizer is optional, as intended in the initial design.

`INFO 09-10 13:36:17 [arg_utils.py:1281] Skipping tokenizer initialization for tokens-only mode.`

Furthermore, it enables the `/abort_requests` endpoint. `/abort_requests` is a solution to the detection of stop strings, which is one of the main challenges in getting a real "tokenizer-free" endpoint. Currently this is done in the AsyncLLM output handler loop, followed by an IPC abort request back to the EngineCore.
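A hedged sketch of the current stop-string path described above: the output handler scans detokenized text for stop strings and, on a hit, issues an abort for that request. The names below are illustrative, not vLLM's.

```python
from typing import Optional


def find_stop(text: str, stop_strings: list[str]) -> Optional[str]:
    # Scan accumulated detokenized text for any configured stop string.
    for s in stop_strings:
        if s in text:
            return s
    return None


class FakeEngineCore:
    def __init__(self) -> None:
        self.aborted: list[str] = []

    def abort_request(self, request_id: str) -> None:
        # Stand-in for the IPC abort message back to the EngineCore.
        self.aborted.append(request_id)


engine = FakeEngineCore()
accumulated = ""
for chunk in ["Hello", " wor", "ld<|stop|>", " ignored"]:
    accumulated += chunk
    if find_stop(accumulated, ["<|stop|>"]):
        engine.abort_request("req-0")
        break

print(engine.aborted)  # ['req-0']
```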
With Disaggregated Everything, we task the "Coordinator" (to be implemented in a follow-up PR) with detokenization. Hence, the "generate" instance needs to act more as a "remote EngineCore". The workflow is the following:
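The workflow can be sketched as follows, with the Coordinator owning detokenization and stop-string detection, and the generate instance acting as a "remote EngineCore". All names are illustrative; the HTTP call to `/abort_requests` is replaced by a direct method for brevity, and the detokenizer is a toy.

```python
class GenerateInstance:
    """Tokens-only server: emits token ids, honors /abort_requests."""

    def __init__(self) -> None:
        self.aborted: set[str] = set()

    def abort_requests(self, request_ids: list[str]) -> None:
        self.aborted.update(request_ids)


class Coordinator:
    def __init__(self, server: GenerateInstance, stop: str) -> None:
        self.server, self.stop, self.text = server, stop, ""

    def detokenize(self, token_id: int) -> str:
        # Toy detokenizer; the real one maps ids to text via the tokenizer.
        return chr(token_id)

    def on_token(self, request_id: str, token_id: int) -> None:
        self.text += self.detokenize(token_id)
        if self.stop in self.text:
            # Stop string found client-side: tell the remote instance to
            # abort, instead of detecting it inside the engine.
            self.server.abort_requests([request_id])


server = GenerateInstance()
coord = Coordinator(server, stop="!")
for tid in [104, 105, 33, 120]:  # "h", "i", "!", "x"
    coord.on_token("req-0", tid)
    if "req-0" in server.aborted:
        break

print(server.aborted)  # {'req-0'}
```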
How to test
or, among other tests:
Follow up PRs:
- `MultiModalFeatureSpec` input, will add once the Renderer effort progresses