feat: unified streaming infrastructure — real-time token delivery for CLI + gateway by teknium1 · Pull Request #1538 · NousResearch/hermes-agent

teknium1 · 2026-03-16T12:28:52Z

Summary

Adds streaming token delivery as the default for all API calls when a consumer is registered (CLI display, gateway, TTS). Tokens stream to the user in real-time instead of waiting for the full response. Tool calls are accumulated silently — only text responses stream.

Supersedes PRs #922, #1312, #774, #798, and #697, drawing from the best of each.

What's implemented (Stages 1-2)

Stage 1 — Core streaming (run_agent.py)

- stream_delta_callback parameter on AIAgent.__init__ for real-time token delivery
- _interruptible_streaming_api_call(), unified streaming for all providers:
  - chat_completions: stream=True with stream_options={"include_usage": True}
  - anthropic_messages: client.messages.stream() via Anthropic SDK, returns native Message for downstream compat
  - codex_responses: enhanced _run_codex_stream() fires delta callbacks during Codex streaming
- _fire_stream_delta() fires both display and TTS callbacks
- _fire_reasoning_delta() for reasoning content streaming
- Tool-call suppression: callbacks only fire on text-only responses
- on_first_delta: callback for spinner control on first token arrival
- Provider fallback: graceful degradation to non-streaming when provider doesn't support it
- _has_stream_consumers() unifies stream_delta_callback and _stream_callback checks
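
The accumulation and tool-call suppression described for Stage 1 can be sketched roughly as follows. This is a minimal illustration, not the code in run_agent.py: the chunk shape (dicts with optional "content" and "tool_call" keys) and the function name accumulate_stream are hypothetical, and the rule "stop firing callbacks once any tool call has been seen" is an assumption about how text-only turns are detected.

```python
from typing import Callable, Iterable, Optional


def accumulate_stream(
    chunks: Iterable[dict],
    on_delta: Optional[Callable[[str], None]] = None,
    on_first_delta: Optional[Callable[[], None]] = None,
) -> dict:
    """Fold streamed chunks into one message, forwarding text deltas as they arrive.

    Each chunk is a hypothetical dict with optional "content" (str) and
    "tool_call" (dict with "index", "name", "arguments") keys.
    """
    text_parts: list = []
    tool_calls: dict = {}
    first_fired = False

    for chunk in chunks:
        call = chunk.get("tool_call")
        if call is not None:
            # Tool-call fragments are accumulated silently, keyed by index.
            slot = tool_calls.setdefault(call["index"], {"name": "", "arguments": ""})
            slot["name"] += call.get("name", "")
            slot["arguments"] += call.get("arguments", "")

        content = chunk.get("content")
        if content:
            text_parts.append(content)
            # Assumption: once any tool call has been seen, the turn is no
            # longer text-only, so consumer callbacks stay quiet.
            if not tool_calls and on_delta is not None:
                if not first_fired and on_first_delta is not None:
                    on_first_delta()  # e.g. stop the spinner
                first_fired = True
                on_delta(content)

    return {
        "content": "".join(text_parts),
        "tool_calls": [tool_calls[i] for i in sorted(tool_calls)],
    }
```

Returning the same shape for streamed and non-streamed calls is what lets downstream code stay unaware of which path produced the message.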

Stage 2 — CLI display (cli.py)

- _stream_delta(): line-buffered rendering via _cprint (prompt_toolkit safe)
- _emit_stream_text(): emits filtered text with response box framing
- Reasoning tag suppression: <REASONING_SCRATCHPAD>, <think>, <reasoning> blocks are suppressed during streaming (handles split-token close tags via sliding window)
- Response box: opens on first visible token, closes on flush
- Skips Rich Panel when streaming already displayed content
- Compatible with existing TTS streaming (both fire simultaneously)
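
Line-buffered rendering with box framing might look like the sketch below. The class name LineBufferedWriter, the emit callable (a stand-in for _cprint), and the ASCII frame strings are all hypothetical. One simplification to note: this sketch opens the box on the first complete line rather than on the first visible token.

```python
from typing import Callable


class LineBufferedWriter:
    """Buffer streamed text and emit only complete lines inside a simple frame."""

    def __init__(self, emit: Callable[[str], None]) -> None:
        self._emit = emit        # stand-in for a prompt_toolkit-safe print
        self._buffer = ""
        self.box_open = False

    def feed(self, delta: str) -> None:
        """Append a delta and emit every line that is now complete."""
        self._buffer += delta
        while "\n" in self._buffer:
            line, self._buffer = self._buffer.split("\n", 1)
            self._emit_line(line)

    def flush(self) -> None:
        """Emit any trailing partial line, then close the frame."""
        if self._buffer:
            self._emit_line(self._buffer)
            self._buffer = ""
        if self.box_open:
            self._emit("+--")
            self.box_open = False

    def _emit_line(self, line: str) -> None:
        if not self.box_open:
            self._emit("+-- response")   # hypothetical frame header
            self.box_open = True
        self._emit(line)
```

Emitting whole lines keeps each print call atomic, which is the usual reason to buffer when another component is redrawing the prompt.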

What's coming (Stages 3-4)

- Gateway streaming: StreamingConfig, GatewayStreamConsumer, Telegram dual-transport (draft + edit)
- Discord/Slack progressive editing
- Reasoning streaming visibility (/reasoning command from feat: gateway reasoning visibility modes #1214)
- API server SSE streaming (from feat: OpenAI-compatible API server + streaming support #956)

Test results

- 4572 tests pass, including 14 new streaming tests
- The 14 dedicated streaming tests cover: accumulator shape, callback ordering, tool-call suppression, provider fallback, reasoning streaming, Codex delta callbacks, consumer detection
- Live CLI verified: real API calls with streaming, reasoning suppression, response box framing

Config

Streaming is enabled by default when a consumer is registered. No config needed — it "just works". Future gateway config will add:

streaming:
  enabled: true
  transport: auto          # auto | draft | edit
  edit_interval: 0.15
  buffer_threshold: 20
  cursor: " ▉"

Attribution

Built from the best contributions across multiple PRs:

- jobless0x (feat: telegram streaming with dual-transport (draft + edit fallback) #774, feat(gateway): streaming consumer — dual-transport (Bot API 9.5 draft + edit fallback), FloodWait retry #1312): Telegram dual-transport streaming, GatewayStreamConsumer, StreamingConfig
- OutThisLife (Streaming TUI, streaming CLI output with line-buffered rendering #798): CLI line-buffered rendering, _cprint integration
- clicksingh (feat(gateway): streaming final response for Telegram #697): Original Telegram streaming concept
- raulvidis (feat: gateway reasoning visibility modes #1214): Reasoning streaming visibility modes

Files changed

| File | Changes |
|------|---------|
| `run_agent.py` | +301: stream_delta_callback, unified streaming API call, Codex delta callbacks |
| `cli.py` | +145: stream display, reasoning suppression, response box framing |
| `tests/test_streaming.py` | +524: 14-test streaming suite |

Fixes #1445. When using the Docker backend, the user's current working directory is now automatically bind-mounted to /workspace inside the container. This allows users to run `cd my-project && hermes` and have their project files accessible to the agent without manual volume config.

Changes:
- Add host_cwd and auto_mount_cwd parameters to DockerEnvironment
- Capture original host CWD in _get_env_config() before container fallback
- Pass host_cwd through _create_environment() to Docker backend
- Add TERMINAL_DOCKER_NO_AUTO_MOUNT env var to disable if needed
- Skip auto-mount when /workspace is already explicitly mounted
- Add tests for auto-mount behavior
- Add documentation for the new feature

The auto-mount is skipped when:
1. TERMINAL_DOCKER_NO_AUTO_MOUNT=true is set
2. User configured docker_volumes with :/workspace
3. persistent_filesystem=true (persistent sandbox mode)

This makes the Docker backend behave more intuitively: the agent operates on the user's actual project directory by default.
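
The three skip conditions for the auto-mount could be expressed as a single predicate, sketched below. The function name should_auto_mount_cwd and its argument shapes are hypothetical. The sketch also assumes POSIX-style `host:container[:mode]` volume specs and a case-insensitive match on "true" for the environment variable.

```python
from typing import Mapping, Sequence


def should_auto_mount_cwd(
    env: Mapping[str, str],
    docker_volumes: Sequence[str],
    persistent_filesystem: bool,
) -> bool:
    """Return True when the host CWD should be bind-mounted to /workspace."""
    # 1. Explicit opt-out through the environment variable.
    if env.get("TERMINAL_DOCKER_NO_AUTO_MOUNT", "").strip().lower() == "true":
        return False

    # 2. The user already mounted something at /workspace.
    for spec in docker_volumes:
        parts = spec.split(":")
        if len(parts) >= 2 and parts[1].rstrip("/") == "/workspace":
            return False

    # 3. Persistent sandbox mode keeps its own filesystem.
    if persistent_filesystem:
        return False

    return True
```

For example, `should_auto_mount_cwd(os.environ, ["/src:/workspace:ro"], False)` would skip the mount because /workspace is already claimed.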

… providers

Stage 1 of streaming support. Adds:
- stream_delta_callback parameter on AIAgent.__init__ for real-time token delivery
- _interruptible_streaming_api_call() handling chat_completions + anthropic_messages
- Enhanced _run_codex_stream() to fire delta callbacks during Codex streaming
- _fire_stream_delta() fires both display and TTS callbacks
- _fire_reasoning_delta() for reasoning content streaming
- Tool-call suppression: callbacks only fire on text-only responses
- on_first_delta callback for spinner control on first token
- Provider fallback: graceful degradation to non-streaming
- _has_stream_consumers() unifies stream_delta_callback and _stream_callback checks
- Anthropic streaming returns native Message for downstream compatibility

Drawing from PRs #922 (unified streaming), #1312 (gateway consumer), #774 (Telegram streaming), #798 (CLI streaming), #1214 (reasoning modes).

Credit: jobless0x, OutThisLife, clicksingh, raulvidis.

…ponse box framing

Stage 2 of streaming support. CLI now streams tokens in real-time:
- _stream_delta(): line-buffered rendering via _cprint (prompt_toolkit safe)
- _flush_stream(): emits remaining buffer and closes response box
- Response box opens on first token, closes on flush
- Skip Rich Panel when streaming already displayed content
- Reset streaming state before each agent turn
- Compatible with existing TTS streaming (both can fire simultaneously)
- Uses skin engine for response label branding

Credit: OutThisLife (#798 CLI streaming concept).

…soning, Codex

Tests cover:
- Text/tool-call/mixed response accumulation into correct shape
- Delta callback ordering and on_first_delta firing once
- Tool-call suppression (no callbacks during tool turns)
- Provider fallback on 'not supported' errors
- Reasoning content accumulation and callback
- _has_stream_consumers() detection
- Codex stream delta callback firing

…etection

Fixes two issues found during live testing:

1. Reasoning tag suppression: close tags like </REASONING_SCRATCHPAD> that arrive split across stream tokens (e.g. '</REASONING_SCRATCH' + 'PAD>\n\nHello') were being lost because the buffer was discarded. Fix: keep a sliding window of the tail (max close tag length) so partial tags survive across tokens.
2. Streaming fallback detection was too broad: 'stream' matched any error containing that word (including 'stream_options' rejections). Narrowed to specific phrases: 'streaming is not', 'streaming not support', 'does not support stream', 'not available'.

Verified with real API calls: streaming works end-to-end with reasoning block suppression, response box framing, and proper fallback to Rich Panel when streaming isn't active.
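
The sliding-window fix for split close tags can be illustrated with a small filter. This is a sketch under assumptions rather than the cli.py implementation: the class ReasoningFilter and its methods are hypothetical names. While inside a reasoning block it keeps only the last `len(close_tag) - 1` characters, so a close tag split across deltas is still found. It also holds back a trailing partial open tag, which is an extra not described in the fix.

```python
from typing import Optional, Tuple


class ReasoningFilter:
    """Strip reasoning blocks from a text stream, tolerating tags split across deltas."""

    TAGS = {
        "<REASONING_SCRATCHPAD>": "</REASONING_SCRATCHPAD>",
        "<think>": "</think>",
        "<reasoning>": "</reasoning>",
    }

    def __init__(self) -> None:
        self._buf = ""
        self._close: Optional[str] = None  # close tag we are waiting for, if any

    def feed(self, delta: str) -> str:
        """Consume one delta and return the text that is safe to display."""
        self._buf += delta
        visible = []
        while True:
            if self._close is None:
                hit = self._find_open()
                if hit is None:
                    # Hold back a tail that might be the start of an open tag.
                    cut = len(self._buf) - self._partial_open_len()
                    visible.append(self._buf[:cut])
                    self._buf = self._buf[cut:]
                    break
                start, tag = hit
                visible.append(self._buf[:start])
                self._buf = self._buf[start + len(tag):]
                self._close = self.TAGS[tag]
            else:
                end = self._buf.find(self._close)
                if end == -1:
                    # Sliding window: keep just enough tail to complete a split close tag.
                    self._buf = self._buf[-(len(self._close) - 1):]
                    break
                self._buf = self._buf[end + len(self._close):]
                self._close = None
        return "".join(visible)

    def flush(self) -> str:
        """Release any held-back text at end of stream (nothing if still inside a block)."""
        tail = "" if self._close is not None else self._buf
        self._buf = ""
        return tail

    def _find_open(self) -> Optional[Tuple[int, str]]:
        hits = [(self._buf.find(tag), tag) for tag in self.TAGS]
        hits = [h for h in hits if h[0] != -1]
        return min(hits) if hits else None

    def _partial_open_len(self) -> int:
        longest = max(len(tag) for tag in self.TAGS) - 1
        for n in range(min(longest, len(self._buf)), 0, -1):
            if any(tag.startswith(self._buf[-n:]) for tag in self.TAGS):
                return n
        return 0
```

Feeding the example from the fix, '</REASONING_SCRATCH' followed by 'PAD>\n\nHello', yields only '\n\nHello' as visible text.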

…eamConsumer, already_sent

Stage 3 of streaming support. Gateway now streams tokens to messaging platforms:
- StreamingConfig dataclass (enabled, transport, edit_interval, buffer_threshold, cursor) on GatewayConfig with from_dict/to_dict serialization
- GatewayStreamConsumer: async queue-based consumer that progressively edits a single message on the target platform (edit transport)
- on_delta() → queue → run() async task → send_or_edit() with rate limiting
- already_sent propagation: when streaming delivered the response, handler returns None so base adapter skips duplicate send()
- stream_delta_callback wired into AIAgent constructor in _run_agent
- Consumer lifecycle: started as asyncio task, awaited with timeout in finally

Config (config.yaml):

streaming:
  enabled: true
  transport: edit          # progressive editMessageText
  edit_interval: 0.3       # seconds between edits
  buffer_threshold: 40     # chars before forcing flush
  cursor: ' ▉'

Credit: jobless0x (#774, #1312), OutThisLife (#798), clicksingh (#697).
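
The on_delta() → queue → run() → send_or_edit() flow might be sketched as below. This is an illustration only. The dataclass defaults are assumptions copied from the example config, StreamConsumer and finish() are hypothetical names, and send_or_edit is a caller-supplied coroutine standing in for the platform adapter. The sketch also assumes on_delta() is called on the event loop thread; a call from a worker thread would need loop.call_soon_threadsafe.

```python
import asyncio
import time
from dataclasses import asdict, dataclass
from typing import Awaitable, Callable

_DONE = object()  # sentinel that marks the end of the stream


@dataclass
class StreamingConfig:
    # Defaults are assumptions taken from the example config.
    enabled: bool = False
    transport: str = "edit"
    edit_interval: float = 0.3
    buffer_threshold: int = 40
    cursor: str = " ▉"

    @classmethod
    def from_dict(cls, data: dict) -> "StreamingConfig":
        # Ignore unknown keys so older or newer config files still load.
        known = {k: v for k, v in data.items() if k in cls.__dataclass_fields__}
        return cls(**known)

    def to_dict(self) -> dict:
        return asdict(self)


class StreamConsumer:
    """Queue-based consumer that progressively rewrites one message."""

    def __init__(
        self,
        send_or_edit: Callable[[str], Awaitable[None]],
        config: StreamingConfig,
    ) -> None:
        self._send_or_edit = send_or_edit
        self._config = config
        self._queue: asyncio.Queue = asyncio.Queue()
        self.already_sent = False

    def on_delta(self, text: str) -> None:
        self._queue.put_nowait(text)

    def finish(self) -> None:
        self._queue.put_nowait(_DONE)

    async def run(self) -> None:
        text = ""
        pending = 0                   # characters received since the last edit
        last_edit = float("-inf")     # so the first delta is always due
        while True:
            item = await self._queue.get()
            if item is _DONE:
                break
            text += item
            pending += len(item)
            now = time.monotonic()
            due = now - last_edit >= self._config.edit_interval
            if due or pending >= self._config.buffer_threshold:
                await self._send_or_edit(text + self._config.cursor)
                last_edit, pending = now, 0
                self.already_sent = True
        if text:
            # Final edit drops the cursor and shows the complete response.
            await self._send_or_edit(text)
            self.already_sent = True
```

A caller would typically start `run()` with asyncio.create_task, pass `on_delta` as the stream callback, call `finish()` when the agent returns, and then check `already_sent` to decide whether the normal send path is still needed.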

Previously the fallback only triggered on specific error keywords like 'streaming is not supported'. Many third-party providers have partial or broken streaming: rejecting stream=True, crashing on stream_options, dropping connections mid-stream, returning malformed chunks, etc.

Now: any exception during the streaming API call triggers an automatic fallback to the standard non-streaming request path. The error is logged at INFO level for diagnostics but never surfaces to the user. If the fallback also fails, THAT error propagates normally.

This ensures streaming is additive: it improves UX when it works but never breaks providers that don't support it.

Tests: 2 new (any-error fallback, double-failure propagation), 15 total.
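
The any-exception fallback reduces to a small wrapper. The sketch below is hypothetical: call_with_stream_fallback and its two zero-argument callables are illustrative names, not the signatures used in run_agent.py.

```python
import logging
from typing import Any, Callable

logger = logging.getLogger(__name__)


def call_with_stream_fallback(
    streaming_call: Callable[[], Any],
    standard_call: Callable[[], Any],
) -> Any:
    """Try the streaming path; on any error, retry once without streaming."""
    try:
        return streaming_call()
    except Exception as exc:
        # Logged for diagnostics only; the user never sees this error.
        logger.info("Streaming failed (%s); retrying without streaming", exc)
    # Called outside the except block so a second failure propagates on its own.
    return standard_call()
```

Catching `Exception` rather than matching error strings is the point of the change: provider error messages are too varied to enumerate.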

Thorough code review found 5 issues across run_agent.py, cli.py, and gateway/:

1. CRITICAL: Gateway stream consumer task never started. stream_consumer_holder was checked BEFORE run_sync populated it. Fixed with async polling pattern (same as track_agent).
2. MEDIUM-HIGH: Streaming fallback after partial delivery caused double-response. If streaming failed after some tokens were delivered, the fallback would re-deliver the full response. Now tracks deltas_were_sent and only falls back when no tokens reached consumers yet.
3. MEDIUM: Codex mode lost on_first_delta spinner callback. _run_codex_stream now accepts on_first_delta parameter, fires it on first text delta. Passed through from _interruptible_streaming_api_call via _codex_on_first_delta instance attribute.
4. MEDIUM: CLI close-tag after-text bypassed tag filtering. Text after a reasoning close tag was sent directly to _emit_stream_text, skipping open-tag detection. Now routes through _stream_delta for full filtering.
5. LOW: Removed 140 lines of dead code, the old _streaming_api_call method (superseded by _interruptible_streaming_api_call). Updated 13 tests in test_run_agent.py and test_openai_client_lifecycle.py to use the new method name and signature.

4573 tests passing.
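
The deltas_were_sent guard from issue 2 can be sketched as follows. The function stream_or_fallback and its parameters are hypothetical. The commit does not say what happens when streaming fails after partial delivery, so re-raising in that case is purely an assumption of this sketch.

```python
from typing import Callable, Iterable


def stream_or_fallback(
    deltas: Callable[[], Iterable[str]],
    standard_call: Callable[[], str],
    on_delta: Callable[[str], None],
) -> str:
    """Stream deltas to a consumer, falling back only if nothing was delivered yet."""
    parts = []
    deltas_were_sent = False
    try:
        for piece in deltas():
            on_delta(piece)
            deltas_were_sent = True
            parts.append(piece)
        return "".join(parts)
    except Exception:
        if deltas_were_sent:
            # Assumption: surface the error instead of re-delivering a full
            # response on top of the partial one the consumer already saw.
            raise
        return standard_call()
```

The flag is set only after a delta has actually reached the consumer, so a provider that rejects streaming outright still gets the silent fallback.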

Anthropic native models emit <THINKING> tags in text content (separate from the SDK's thinking_delta events). Without suppression, these tags leak into the streamed CLI output. Found during live provider testing.

Streaming is now off by default for both CLI and gateway. Users opt in:

CLI (config.yaml):

display:
  streaming: true

Gateway (config.yaml):

streaming:
  enabled: true

This lets early adopters test streaming while existing users see zero change. Once we have enough field validation, we flip the default to true in a subsequent release.

Documents the new streaming options in the example config:
- display.streaming for CLI (under display section)
- streaming.enabled + transport/interval/threshold/cursor for gateway
- Added streaming: false to load_cli_config() defaults dict

…ment

The streaming infrastructure already fires reasoning deltas via _fire_reasoning_delta() during streaming. The remaining work is the CLI display layer: a dim reasoning box that opens on first reasoning token, streams live, then transitions to the response box.

Reference: PR #1214 (raulvidis) for gateway reasoning visibility.

When both display.streaming and display.show_reasoning are enabled, reasoning tokens stream in real-time into a dim bordered box. When content tokens start arriving, the reasoning box closes and the response box opens, giving a smooth visual transition.

- _stream_reasoning_delta(): line-buffered rendering in dim text
- _close_reasoning_box(): flush + close, called on first content token
- Reasoning callback routes to streaming version when both flags set
- Skips static post-response reasoning display when streamed live
- State reset per turn via _reset_stream_state()

Works with reasoning_content deltas (OpenRouter reasoning mode) and thinking_delta events (Anthropic extended thinking).

When the stream consumer's first edit_message() call fails (Signal, Email, HomeAssistant don't support editing), it now disables editing for the rest of the stream instead of falling back to sending a new message every 0.3 seconds. The final response is delivered by the normal send path since already_sent stays false. Without this fix, enabling gateway streaming on Signal/Email/HA would flood the chat with dozens of partial messages.

bartokmagic and others added 6 commits March 16, 2026 04:53

teknium1 changed the title ~~feat: unified streaming infrastructure — real-time token delivery for CLI, gateway, and all providers~~ feat: unified streaming infrastructure — real-time token delivery for CLI + gateway Mar 16, 2026

teknium1 added 10 commits March 16, 2026 06:15

fix(cli): add <THINKING> to streaming tag suppression list

fc4080c

docs: add streaming section to configuration guide

8feb9e4

merge: resolve conflicts with main (show_cost, turn routing, docker d…

f4d61c1

…ocs)

teknium1 marked this pull request as ready for review March 16, 2026 21:22

teknium1 merged commit 6c84e26 into main Mar 16, 2026
2 checks passed

raulvidis mentioned this pull request Mar 19, 2026

feat: gateway reasoning visibility modes #1214

Closed

7 tasks

angelburgosrosado pushed a commit to angelburgosrosado/hermes-agent that referenced this pull request Apr 27, 2026

Merge pull request NousResearch#1538 from NousResearch/hermes/hermes-…

ff452be

…a098c323 feat: unified streaming infrastructure — real-time token delivery for CLI + gateway

02356abc pushed a commit to 02356abc/hermes-agent that referenced this pull request May 14, 2026

Merge pull request NousResearch#1538 from NousResearch/hermes/hermes-…

af8251f

…a098c323 feat: unified streaming infrastructure — real-time token delivery for CLI + gateway

olympus-terminal pushed a commit to olympus-terminal/hermes-agent that referenced this pull request May 16, 2026

Merge pull request NousResearch#1538 from NousResearch/hermes/hermes-…

ed3032d

…a098c323 feat: unified streaming infrastructure — real-time token delivery for CLI + gateway

Egavasyug pushed a commit to Egavasyug/hermes-agent that referenced this pull request Jun 10, 2026

Merge pull request NousResearch#1538 from NousResearch/hermes/hermes-…

de69e1a

…a098c323 feat: unified streaming infrastructure — real-time token delivery for CLI + gateway

teknium1 merged 16 commits into main from hermes/hermes-a098c323
