Currently, an inference request is one-way: the client sends a request and receives an SSE token stream. The limitation is that the client cannot directly "control" the inference in real time, for example to skip the reasoning, as in this proposed use case.
So, the first requirement is to allow bi-directional communication between server <--> client. There are 2 possible methods (to be discussed):
- WebSocket API: httplib already supports it out of the box, so it won't be too difficult. The main question is: do we want to use our own API schema (maybe just reuse the same /chat/completions schema), or go with a full OAI-compat schema?
- Keep the SSE stream but introduce a new API, for example /chat/completions/control, that can control the current completion session via an ID
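
To make the second option a bit more concrete, here is a minimal sketch of the server-side bookkeeping it would need. It is only an illustration, written in Python rather than the server's C++ for brevity. Everything in it is an assumption and not an agreed spec: the `SessionRegistry` and `CompletionSession` names, the `{"id": ..., "action": ...}` request body, and the `skip_reasoning` action are all hypothetical.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class CompletionSession:
    """Hypothetical per-request state shared by the SSE stream and the control handler."""
    completion_id: str
    # Set by the control handler, polled by the generation loop.
    skip_reasoning: threading.Event = field(default_factory=threading.Event)


class SessionRegistry:
    """Maps completion IDs to live sessions so a second HTTP request can reach them."""

    def __init__(self):
        self._lock = threading.Lock()
        self._sessions = {}

    def open(self, completion_id):
        # Called when a streaming completion starts.
        session = CompletionSession(completion_id)
        with self._lock:
            self._sessions[completion_id] = session
        return session

    def close(self, completion_id):
        # Called when the stream ends, so stale IDs are rejected afterwards.
        with self._lock:
            self._sessions.pop(completion_id, None)

    def handle_control(self, body):
        """Handle a hypothetical control request body: {"id": ..., "action": ...}."""
        with self._lock:
            session = self._sessions.get(body.get("id"))
        if session is None:
            return {"ok": False, "error": "unknown or finished completion"}
        if body.get("action") != "skip_reasoning":
            return {"ok": False, "error": "unsupported action"}
        session.skip_reasoning.set()
        return {"ok": True}
```

With this shape, the streaming handler would call `open()` before the first token and `close()` after the last one, and the control endpoint would only forward the parsed JSON body to `handle_control()`. The WebSocket option would not need the registry, since the control message arrives on the same connection as the stream.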
The second requirement for this feature is that chat.cpp (especially the reasoning budget) needs to be extended to allow triggering an "end-of-reasoning" event at an arbitrary point, though I'm not quite sure yet how to do that. Tagging @aldehir @pwilkin for discussion
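
As a starting point for that discussion, one possible approach is to force the end-of-reasoning tag into the output in place of the next sampled token. The sketch below shows only that idea, in Python, and is not how chat.cpp works today. The `</think>` tag, the `generate` function, and the `sample_next` callback are assumptions made for illustration; the real tag depends on the chat template, and a real implementation would operate on token IDs, not strings.

```python
import threading

END_TAG = "</think>"  # assumed tag; the actual one depends on the chat template


def generate(sample_next, skip_reasoning, max_tokens=64):
    """Yield tokens, forcing END_TAG once skip_reasoning is set during reasoning.

    sample_next(context) is a hypothetical stand-in for the sampler: it returns
    the next token as a string, or None when generation is finished.
    """
    context = []
    in_reasoning = True  # assumes the template has already opened the reasoning block
    for _ in range(max_tokens):
        if in_reasoning and skip_reasoning.is_set():
            token = END_TAG  # forced, not sampled
        else:
            token = sample_next(context)
        if token is None:
            break
        if token == END_TAG:
            in_reasoning = False
        # The forced tag is appended to the context as well, so the tokens
        # sampled afterwards are conditioned on the reasoning being closed.
        context.append(token)
        yield token
```

The flag is checked once per token, so the latency of a skip request is at most one decode step. An open question this sketch does not address is what to do if the flag is set while the model is partway through a multi-token tag.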
CC @ggml-org/llama-server @ggml-org/llama-ui
Note for contributors / vibe-coders: do not open a PR for this until we agree on the specs, otherwise your PR will be closed with no questions asked.