fix(agent): stream guardrail_halt explanation to clients (#30770)#30813
Open
xxxigm wants to merge 3 commits into
Open
fix(agent): stream guardrail_halt explanation to clients (#30770)#30813xxxigm wants to merge 3 commits into
xxxigm wants to merge 3 commits into
Conversation
…Research#30770) Hermes synthesizes final assistant text in several turn-exit branches that the LLM never streamed — guardrail_halt is the visible one, but budget_exhausted, max_iterations, all_retries_exhausted_no_response, and a handful of others share the same shape. Until now the synthesized text landed in ``messages`` and in ``result["final_response"]`` only, which works for non-streaming callers (CLI prints, ``/v1/chat/completions`` without ``stream=true``) but starves every streaming consumer: the Chat Completions SSE writer pulls from a queue populated by ``stream_delta_callback``, so without an explicit fire it never receives a ``delta.content`` chunk between the role chunk and the finish chunk — Open WebUI / curl / the OpenAI SDK see an empty assistant message and the stream closes without explanation. Add ``_emit_synthesized_final_response(text)`` next to the existing ``_emit_interim_assistant_message`` and ``_fire_stream_delta`` helpers. It is the single fan-out point a synthesized branch needs to call: * fires ``_fire_stream_delta`` so the gateway SSE queue and TUI streaming display see the text exactly as if the model had produced it (re-using the existing think-block / context-leak scrubbers so no partial-tag artefacts leak), * closes ``stream_delta_callback`` with ``None`` afterwards so the TUI display closes its open response box before the finish chunk lands (matches the existing flush between tool-call iterations; the gateway SSE consumer filters None out of its queue so this is harmless there), * fans out to ``interim_assistant_callback`` for platforms that consume structured messages instead of deltas (Telegram, Discord), with ``already_streamed=True`` so they dedupe correctly, * preserves the ``_stream_callback`` (TTS) end-of-stream semantics by NOT firing None on it — None is its sentinel and a later synthesized branch in the same turn would otherwise be muted. Every callback invocation is wrapped in try/except so a misbehaving downstream consumer cannot prevent the turn from finishing; the whole point of the helper is to deliver SOMETHING when normal streaming has already broken down.
…h#30770) The tool-call loop guardrail (repeated_exact_failure_block, same_tool_failure_block, idempotent_no_progress_block) generates a short user-facing explanation via ``_toolguard_controlled_halt_response`` and appends it to the conversation as an assistant turn. Before this change that text was delivered only through the returned result dict, so streaming clients got no SSE ``delta.content`` between the role chunk and the finish chunk — the conversation appeared to die mid-thought from the user's perspective. Call the new ``_emit_synthesized_final_response`` helper inside the ``guardrail_halt`` branch so the synthesized explanation flows through ``stream_delta_callback`` (Chat Completions SSE, TUI streaming display) and ``interim_assistant_callback`` (Telegram, Discord, Slack adapters) exactly as if the model itself had produced it. Wrapped in try/except: the halt text is already safely captured in ``final_response`` and the ``messages`` history, so even a broken downstream callback must not turn a controlled guardrail halt back into a silent failure. The existing ``_emit_status`` lifecycle warning is intentionally preserved — it still drives the gateway dashboard / TUI warning band output that operators rely on for triage.
…rch#30770) 16 tests in three classes: * ``TestEmitSynthesizedFinalResponseUnit`` — direct coverage of the helper: it fires both ``stream_delta_callback`` and ``interim_assistant_callback`` with the expected ``already_streamed`` flag, strips whitespace, skips empty / whitespace / None inputs, no-ops gracefully when no callbacks are registered, and swallows callback exceptions on both channels. Pins that the ``None`` end-of-stream flush goes to ``stream_delta_callback`` only — NOT the TTS ``_stream_callback`` (where ``None`` is its sentinel). * ``TestGuardrailHaltStreamsToClient`` — end-to-end via ``run_conversation`` with a hard-stop guardrail config: the synthesized halt text is fanned out through the helper exactly once, persisted in the assistant turn of ``messages`` for session resume, names both the tool and the guardrail code so users can correlate with logs, is a clean no-op when no streaming callbacks are registered, leaves the pre-existing ``_emit_status`` lifecycle warning untouched, and survives a fan-out helper that itself raises (the result dict + messages history are still well-formed). * ``TestConversationLoopWiring`` — pins the conversation-loop call site so future refactors that split or rename the guardrail_halt branch fail noisily instead of silently re-introducing the silent- stream symptom; also asserts the helper is NOT called for normal text responses (where the model already streamed its own deltas and a re-emit would duplicate the visible answer for clients that don't dedupe).
This was referenced May 23, 2026
Collaborator
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What does this PR do?
Fixes #30770: when the tool-call loop guardrail fires (
repeated_exact_failure_block,same_tool_failure_block, oridempotent_no_progress_block), the agent generates a short user-facing explanation via_toolguard_controlled_halt_responseand appends it to the conversation as an assistant turn — but the synthesized text was never fanned out through the streaming callbacks. The Chat Completions SSE writer (Open WebUI, curl, OpenAI SDK) and TUI streaming display saw the role chunk, the finish chunk, and nothing between them; from the user's perspective the conversation died mid-thought. Interim-message platform adapters (Telegram, Discord) had the same gap.This PR adds a single fan-out helper, wires it into the
guardrail_haltbranch, and pins the behaviour with a 16-test regression suite.Related Issue
Closes #30770
Type of Change
Changes Made
run_agent.py— add_emit_synthesized_final_response(text)next to the existing_emit_interim_assistant_message/_fire_stream_deltahelpers (+77):_fire_stream_deltaso the gateway SSE queue and TUI streaming display receive the text exactly as if the model had produced it (re-uses the existing think-block / context-leak scrubbers so no partial-tag artefacts leak).stream_delta_callbackwithNoneafterwards so the TUI display closes its open response box before the finish chunk lands (matches the existing flush between tool-call iterations; the gateway SSE consumer filtersNoneout of its queue so this is harmless there).interim_assistant_callbackfor platforms that consume structured messages instead of deltas (Telegram, Discord, Slack), withalready_streamed=Trueso they dedupe correctly._stream_callbackend-of-stream semantics by NOT firingNoneon it —Noneis its sentinel and a later synthesized branch in the same turn would otherwise be muted.agent/conversation_loop.py— call the helper inside theguardrail_haltbranch (+27):_emit_statuswarning, wrapped in try/except so the halt text infinal_response/messagesis always preserved even if a downstream callback explodes.tests/run_agent/test_guardrail_halt_emit_30770.py— 16 regression tests across three classes (+473):TestEmitSynthesizedFinalResponseUnit(8 cases) — direct coverage of the helper: fires both stream + interim callbacks with the expectedalready_streamedflag, strips whitespace, skips empty / None / whitespace input, no-ops gracefully without callbacks, swallows callback exceptions on both channels, never closes the TTS_stream_callback.TestGuardrailHaltStreamsToClient(6 cases) — end-to-end throughrun_conversation: synthesized halt is fanned out exactly once, persisted inmessagesfor session resume, names both the tool and the guardrail code so users can correlate with logs, is a clean no-op when no callbacks registered, preserves the pre-existing_emit_statuslifecycle warning, survives a helper that raises.TestConversationLoopWiring(2 cases) — pins the call-site so future refactors fail noisily; also asserts the helper is NOT called for normal text responses.How to Test
.venvis set up:python3 -m venv .venv && source .venv/bin/activate && pip install -e ".[all,dev]"Checklist
Code
fix(agent): …,test(agent): …)scripts/run_tests.sh tests/run_agent/test_guardrail_halt_emit_30770.pyand all tests passDocumentation & Housekeeping
docs/, docstrings) — N/A; the helper's docstring documents the contract, no user-facing surface changedcli-config.yaml.exampleif I added/changed config keys — N/A (no new config keys)CONTRIBUTING.mdorAGENTS.mdif I changed architecture or workflows — N/AScreenshots / Logs