Feature request: allow client to stop reasoning in realtime

Currently, an inference request is one-way: the client sends a request and receives an SSE token stream. The limitation is that the client cannot directly "control" the inference in real time, for example, as in this proposed use case, to skip the reasoning.

So, the first requirement is to allow bi-directional communication between server <--> client. There are 2 possible methods (to be discussed):
1. Websocket API: httplib already supports it out of the box, so it won't be too difficult. The main question is: do we want to use our own API schema (maybe just reuse the same /chat/completions schema), or go with a full [OAI-compat schema](https://developers.openai.com/api/docs/guides/realtime-websocket)?
2. Keep the SSE stream but introduce a new API, for example `/chat/completions/control`, that can control the current completion session via an ID
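
To make option 2 more concrete for discussion, here is a minimal sketch of the server-side bookkeeping it would imply, written as a toy Python model rather than actual llama-server code. Everything in it is an assumption for illustration: the `CompletionRegistry` class, the `stop_reasoning` action name, and the idea that the completion ID is generated when the stream starts. None of this exists in the codebase today.

```python
import threading
import uuid


class CompletionRegistry:
    """Hypothetical registry of in-flight completions, keyed by completion ID."""

    def __init__(self):
        self._lock = threading.Lock()
        self._sessions = {}  # completion ID -> threading.Event (stop-reasoning flag)

    def start(self):
        """Register a new completion; the ID would be sent to the client in the SSE stream."""
        completion_id = uuid.uuid4().hex
        with self._lock:
            self._sessions[completion_id] = threading.Event()
        return completion_id

    def control(self, completion_id, action):
        """Handle a control request (hypothetical action name); False means rejected."""
        if action != "stop_reasoning":
            return False
        with self._lock:
            event = self._sessions.get(completion_id)
        if event is None:
            return False  # unknown or already finished completion
        event.set()
        return True

    def should_stop_reasoning(self, completion_id):
        """Polled by the generation loop between decoding steps."""
        with self._lock:
            event = self._sessions.get(completion_id)
        return event is not None and event.is_set()

    def finish(self, completion_id):
        """Drop the session once the stream ends."""
        with self._lock:
            self._sessions.pop(completion_id, None)
```

In this sketch the control request only sets a flag; the generation loop decides when to act on it. A Websocket variant would need the same flag, just delivered over the open connection instead of a second HTTP request.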

The second requirement for this feature is that chat.cpp (especially the reasoning budget) needs to be extended to allow triggering an "end-of-reasoning" event at an arbitrary point, though I'm not quite sure yet how to do that. Tagging @aldehir @pwilkin for discussion
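
One possible shape for this, as a toy Python model and not a description of how chat.cpp works: treat the external trigger as a second condition next to the budget check. The sketch assumes that the budget counts reasoning tokens and that reasoning ends with a single marker token; the `ReasoningGate` class and the `</think>` marker are hypothetical names chosen for illustration.

```python
import threading


class ReasoningGate:
    """Hypothetical gate that ends reasoning on budget exhaustion or an external signal."""

    def __init__(self, budget, stop_event):
        self.remaining = budget        # reasoning tokens still allowed
        self.stop_event = stop_event   # set from outside, e.g. by a control request
        self.reasoning = True

    def step(self, token, end_marker="</think>"):
        """Return the token to emit for this decoding step."""
        if not self.reasoning:
            return token  # reasoning already ended: pass tokens through
        if token == end_marker:
            self.reasoning = False  # the model ended reasoning on its own
            return token
        if self.remaining <= 0 or self.stop_event.is_set():
            # Forced end: the sampled token is discarded and replaced by the marker.
            self.reasoning = False
            return end_marker
        self.remaining -= 1
        return token
```

The point of the sketch is that the client-triggered stop and the budget stop share one code path, so whatever the budget already does to close the reasoning block would apply to both.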

CC @ggml-org/llama-server @ggml-org/llama-ui 

Note for contributors / vibe-coders: do not push a PR for this until we agree on the specs; otherwise your PR will be closed without any questions.


Feature request: allow client to stop reasoning in realtime #23944
