Feature request: allow client to stop reasoning in realtime #23944

@ngxson

Description

Currently, an inference request is one-way: the client sends a request and receives an SSE token stream. The limitation is that the client cannot directly "control" the inference in real time, for example to skip the reasoning, as in this proposed use case.

So the first requirement is to allow bi-directional communication between server <--> client. There are 2 possible methods (to be discussed):

  1. WebSocket API: httplib already supports it out of the box, so it won't be too difficult. The main question is whether we want to use our own API schema (maybe just reuse the same /chat/completions schema) or go with a full OAI-compat schema
  2. Keep the SSE stream but introduce a new API, for example /chat/completions/control, that can control the current completion session via an ID
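
To make option 2 easier to discuss, here is a rough sketch in Python (not server code) of how a control request could be routed to a running completion by ID. Everything in it is hypothetical: `CompletionSession`, `SessionRegistry`, and the `"skip_reasoning"` action name are made up for illustration, and nothing here is an agreed spec.

```python
import threading
import uuid


class CompletionSession:
    """State for one in-flight completion (hypothetical)."""

    def __init__(self):
        # ID the server would hand back to the client, e.g. in the first SSE event.
        self.id = uuid.uuid4().hex
        # Flag that the generation loop would poll between tokens.
        self.skip_reasoning = threading.Event()


class SessionRegistry:
    """Maps completion IDs to sessions so a second request can reach them."""

    def __init__(self):
        self._lock = threading.Lock()
        self._sessions = {}

    def create(self):
        session = CompletionSession()
        with self._lock:
            self._sessions[session.id] = session
        return session

    def remove(self, session_id):
        # Called when the completion finishes or the client disconnects.
        with self._lock:
            self._sessions.pop(session_id, None)

    def control(self, session_id, action):
        """Body of a hypothetical POST /chat/completions/control handler."""
        with self._lock:
            session = self._sessions.get(session_id)
        if session is None or action != "skip_reasoning":
            return False  # unknown session or unsupported action
        session.skip_reasoning.set()
        return True
```

With this shape the SSE stream itself stays unchanged; the only additions are the ID exposed to the client and a flag the generation loop has to check.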

The second requirement for this feature is that chat.cpp (especially the reasoning budget) needs to be extended to allow triggering an "end-of-reasoning" event at an arbitrary point, though I'm not quite sure yet how to do that. Tagging @aldehir @pwilkin for discussion
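
One possible reading of "triggering end-of-reasoning", sketched as a toy Python loop: when the client asks to skip, the loop emits the closing reasoning tag instead of sampling, and the model continues from there. This is only an illustration and does not reflect how chat.cpp or the reasoning budget work today. The `</think>` tag, `generate`, `sample_next`, and `should_skip` are all assumptions, and the sketch assumes the stream starts inside the reasoning block.

```python
END_OF_REASONING = "</think>"  # assumed closing tag; the real one depends on the model template


def generate(sample_next, should_skip, max_tokens=64):
    """Yield tokens, forcing the end-of-reasoning tag once should_skip() is true.

    sample_next(context) stands in for the model: it returns the next token,
    or None at end of sequence. should_skip() stands in for the client signal.
    """
    context = []
    in_reasoning = True  # assumption: generation starts inside the reasoning block
    for _ in range(max_tokens):
        if in_reasoning and should_skip():
            token = END_OF_REASONING  # forced, not sampled
        else:
            token = sample_next(context)
        if token is None:
            break
        context.append(token)
        if token == END_OF_REASONING:
            in_reasoning = False
        yield token


def toy_model(context):
    """Stand-in sampler: reasons forever until the closing tag appears, then answers once."""
    if END_OF_REASONING not in context:
        return "thinking"
    if context[-1] == END_OF_REASONING:
        return "answer"
    return None
```

Because the forced tag is appended to the context like any other token, later tokens are conditioned on it. The open question from the paragraph above is where such a hook would live.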

CC @ggml-org/llama-server @ggml-org/llama-ui

Note for contributors / vibe-coders: do not open a PR for this until we agree on the specs; otherwise your PR will be closed, no questions asked.
