server: real-time reasoning interruption via control endpoint #23971

Merged
allozaur merged 11 commits into ggml-org:master from ServeurpersoCom:realtime-reasoning-control on Jun 2, 2026

@ServeurpersoCom ServeurpersoCom commented Jun 1, 2026

Contributor

Overview

Builds on the manual reasoning budget trigger from #23949. Adds a CONTROL task that mirrors the CANCEL path on the live slot and calls common_sampler_reasoning_budget_force to end thinking mid-generation. The endpoint is POST /v1/chat/completions/control with { id_slot, action }; the opt-in reasoning_control flag arms the budget sampler on demand. Works in both router and single-model modes.

Minimal WebUI button as a skeleton for further UI work. cc @allozaur

Additional information

Video

think-stop.mp4

Closes #23944

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES Opus 4.8 High + MCP container w/GPU

@ServeurpersoCom ServeurpersoCom requested review from a team as code owners June 1, 2026 12:53

ServeurpersoCom commented Jun 1, 2026

Contributor Author

I'm going to improve it by making the button grey out or disappear at the end of the reasoning inference. WDYT @allozaur? The goal is an event-ready skeleton, and you do the design on top.

allozaur commented Jun 1, 2026

Contributor

I'm going to improve it by making the button grey out or disappear at the end of the reasoning inference. WDYT @allozaur? The goal is an event-ready skeleton, and you do the design on top.

sure, let's go with this

Builds on the manual reasoning budget trigger from ggml-org#23949. Adds a
CONTROL task that mirrors the CANCEL path on the live slot and calls
common_sampler_reasoning_budget_force to end thinking mid-generation.
POST /v1/chat/completions/control with { id_slot, action }, opt-in
reasoning_control arms the budget sampler on demand. Router and single
model. Minimal WebUI button as a skeleton for further UI work.

Add isReasoning to the chat store, mirroring the isLoading pattern:
per conversation map, private setter, public accessor and reactive
export. Set from the stream callbacks, true on reasoning chunks, false
on the first content chunk, reset on stream end and resynced on
conversation switch. The skip button now keys off isReasoning so it
shows only during the thinking phase, not the whole generation.
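
The state transitions described in this commit can be sketched roughly as follows. This is a minimal, framework-free sketch and not the code from the PR: only the names isReasoning and isLoading come from the commit message, while the class name, method names, and the plain Map standing in for the store's reactive state are assumptions made for illustration.

```typescript
// Hypothetical sketch: ReasoningStateStore and its method names are
// illustrative; only `isReasoning` is named in the commit message.
class ReasoningStateStore {
  // One flag per conversation, keyed by conversation id.
  private reasoningByConversation = new Map<string, boolean>();
  private activeConversationId: string | null = null;

  // Stand-in for the reactive export: mirrors the active conversation's flag.
  isReasoning = false;

  // Private setter: updates the map and, if needed, the active flag.
  private setReasoning(convId: string, value: boolean): void {
    if (value) {
      this.reasoningByConversation.set(convId, true);
    } else {
      this.reasoningByConversation.delete(convId);
    }
    if (convId === this.activeConversationId) {
      this.isReasoning = value;
    }
  }

  // Public accessor for any conversation, active or not.
  isConversationReasoning(convId: string): boolean {
    return this.reasoningByConversation.get(convId) ?? false;
  }

  // Stream callbacks: true on reasoning chunks, false on the first
  // content chunk, reset when the stream ends.
  onReasoningChunk(convId: string): void {
    this.setReasoning(convId, true);
  }
  onContentChunk(convId: string): void {
    this.setReasoning(convId, false);
  }
  onStreamEnd(convId: string): void {
    this.setReasoning(convId, false);
  }

  // Resync the active flag when the user switches conversation.
  switchConversation(convId: string): void {
    this.activeConversationId = convId;
    this.isReasoning = this.isConversationReasoning(convId);
  }
}
```

A skip button keyed off isReasoning in a store like this would be visible only between the first reasoning chunk and the first content chunk of the active conversation.
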
Move the chat completion routes, the slots route and the reasoning
control action out of chat.service into api-endpoints and a dedicated
control-actions module. No behavior change, drops the magic strings so
the control protocol has a single source of truth.
@ServeurpersoCom ServeurpersoCom force-pushed the realtime-reasoning-control branch from 6d67931 to f617ae0 Compare June 1, 2026 14:09
Address @ngxson review on the control endpoint.

Switch from id_slot to the chat completion id to avoid a TOCTOU: the
slot can be reassigned between the lookup and the control request, so
matching the live completion (oaicompat_cmpl_id) is safe and a finished
one simply matches nothing. Rename the action to reasoning_end, guard
it on the reasoning_control flag of the target slot, and reduce the
response to {success} with an optional message.
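
To illustrate why matching on the completion id avoids the race, here is a small model of the lookup. It is a sketch under stated assumptions, not the server code: the real implementation is C++ in tools/server, and the Slot shape, the handleControl function, and the error message strings below are all hypothetical. Only the action name reasoning_end, the reasoning_control guard, and the {success, message} response shape come from the commit message.

```typescript
// Hypothetical model of the server-side lookup (the real server is C++).
interface Slot {
  id: number;
  completionId: string | null; // completion currently running, null when idle
  reasoningControl: boolean;   // opt-in flag from the request that owns the slot
  reasoningForcedEnd: boolean; // stands in for calling
                               // common_sampler_reasoning_budget_force
}

interface ControlResponse {
  success: boolean;
  message?: string;
}

function handleControl(
  slots: Slot[],
  completionId: string,
  action: string,
): ControlResponse {
  if (action !== "reasoning_end") {
    return { success: false, message: "unknown action: " + action };
  }
  // Match the live completion, not the slot index: if the slot has been
  // reassigned or the completion has finished, nothing matches.
  const slot = slots.find((s) => s.completionId === completionId);
  if (slot === undefined) {
    return { success: false, message: "no live completion with this id" };
  }
  // Guard on the opt-in flag of the target slot.
  if (!slot.reasoningControl) {
    return { success: false, message: "reasoning_control is not enabled" };
  }
  slot.reasoningForcedEnd = true;
  return { success: true };
}
```

With a slot-index key, a control request arriving after the slot was handed to another completion would hit the wrong request; with the completion id as the key, the same late request simply finds no match.
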
Keep the streamed completion id on the message and post it back to the
control endpoint instead of probing /slots. Drops the slot discovery
and the TOCTOU that came with it. Action renamed to reasoning_end,
response read as {success}.
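
On the client side, the request could look roughly like the sketch below. The route and the action name are taken from this PR; the JSON field name `id` for the completion id, the helper names, and the injected `post` function are assumptions, since the thread does not spell out the exact request schema.

```typescript
// Route and action name as stated in the PR; everything else is illustrative.
const CONTROL_ROUTE = "/v1/chat/completions/control";
const ACTION_REASONING_END = "reasoning_end";

interface ControlResult {
  success: boolean;
  message?: string;
}

interface ControlRequest {
  url: string;
  // ASSUMPTION: the field name `id` is a guess, not confirmed by the PR.
  body: { id: string; action: string };
}

// Build the request from the completion id kept on the streaming message.
// Returns null when there is no id, so the caller can log and bail out.
function buildReasoningEndRequest(
  baseUrl: string,
  completionId: string | undefined,
): ControlRequest | null {
  if (!completionId) return null;
  return {
    url: baseUrl + CONTROL_ROUTE,
    body: { id: completionId, action: ACTION_REASONING_END },
  };
}

// Read the response as {success} with an optional message.
function parseControlResponse(raw: unknown): ControlResult {
  if (typeof raw !== "object" || raw === null) return { success: false };
  const data = raw as { success?: unknown; message?: unknown };
  const success = data.success === true;
  return typeof data.message === "string"
    ? { success, message: data.message }
    : { success };
}

// Transport is injected so the sketch does not depend on a real server.
type PostJson = (url: string, body: unknown) => Promise<unknown>;

async function requestReasoningEnd(
  post: PostJson,
  baseUrl: string,
  completionId: string | undefined,
): Promise<ControlResult> {
  const req = buildReasoningEndRequest(baseUrl, completionId);
  if (req === null) {
    return { success: false, message: "no completion id on the message" };
  }
  return parseControlResponse(await post(req.url, req.body));
}
```

Because the id comes from the stream itself, no call to /slots is needed before sending the control request.
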
@ServeurpersoCom ServeurpersoCom requested a review from ngxson June 1, 2026 14:56

@ngxson ngxson left a comment

Collaborator


remember to also update server docs to add the new endpoint

many AI-generated comments exist only so the AI can respond to your prompt; they have no real technical value. please consider removing all comments about TOCTOU

Move the control fields into task_params and drop the redundant
comments on the control path.

ngxson commented Jun 1, 2026

Collaborator

also need to update server docs

ServeurpersoCom and others added 2 commits June 1, 2026 18:34
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
Per @allozaur review, clearer name for the streamed completion id.

ServeurpersoCom commented Jun 1, 2026

Contributor Author

I need to fix a regression that showed up on my server after the renaming or the rebase onto the latest master.
Retrieving the conversation ID from the UI side is unreliable, and the request does not always go through.

The webui streams through the agentic flow, which relayed onModel but
not onCompletionId, so the completion id never reached the message and
the control request was never sent. Relay it through the flow and its
callbacks type, declare id on the chunk type, and log an explicit error
when the button fires without a usable id.

The model is a property of the completion, so read it from the streaming
message like the id, not from the model dropdown which is unrelated UI
state. Makes the request self-consistent by construction instead of just
unlikely to drift.
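
The two fixes above can be sketched together as follows. This is an illustration under assumptions rather than the webui code: onModel and onCompletionId are the callback names mentioned in the commit message, while the interface names, relayCallbacks, dispatchChunk, and controlTargetFor are hypothetical.

```typescript
// Hypothetical types; only onModel / onCompletionId are named in the PR.
interface StreamCallbacks {
  onModel?: (model: string) => void;
  onCompletionId?: (id: string) => void;
  onContent?: (text: string) => void;
}

interface StreamChunk {
  id?: string; // completion id, declared on the chunk type
  model?: string;
  content?: string;
}

interface StreamingMessage {
  completionId?: string;
  model?: string;
  content: string;
}

// Stand-in for the agentic flow: an intermediate layer must forward every
// callback. Omitting onCompletionId here reproduces the reported bug.
function relayCallbacks(outer: StreamCallbacks): StreamCallbacks {
  return {
    onModel: (model) => outer.onModel?.(model),
    onCompletionId: (id) => outer.onCompletionId?.(id),
    onContent: (text) => outer.onContent?.(text),
  };
}

// Fan a parsed chunk out to the callbacks.
function dispatchChunk(chunk: StreamChunk, cb: StreamCallbacks): void {
  if (chunk.id !== undefined) cb.onCompletionId?.(chunk.id);
  if (chunk.model !== undefined) cb.onModel?.(chunk.model);
  if (chunk.content !== undefined) cb.onContent?.(chunk.content);
}

// Read both id and model from the streaming message, never from unrelated
// UI state, and log an explicit error when either is missing.
function controlTargetFor(
  message: StreamingMessage,
): { id: string; model: string } | null {
  if (!message.completionId || !message.model) {
    console.error("reasoning control: message has no usable id or model");
    return null;
  }
  return { id: message.completionId, model: message.model };
}
```

Taking both values from the same message means the control request always describes the completion that is actually streaming, whatever the model dropdown currently shows.
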
@ServeurpersoCom

Contributor Author

This time it's reliable and semantically more correct:

  • The ID has to be passed through the agentic flow.
  • The model name has to be read from the actual message being inferred.

@allozaur allozaur merged commit 354ebac into ggml-org:master Jun 2, 2026
26 of 28 checks passed

CMay commented Jun 2, 2026


Is there an option to disable this? This new feature (which is nice and cleanly implemented) causes token generation to run 20 t/s slower on my hardware. As aldehir mentioned in #23949, it does cause performance issues for some people.

Looked around the webui for an option to toggle it off, but didn't see one.

Development

Successfully merging this pull request may close these issues.

Feature request: allow client to stop reasoning in realtime

4 participants