
feat: unified streaming infrastructure — real-time token delivery for CLI + gateway#1538

Merged
teknium1 merged 16 commits into
mainfrom
hermes/hermes-a098c323
Mar 16, 2026

Conversation

@teknium1


Summary

Adds streaming token delivery as the default for all API calls when a consumer is registered (CLI display, gateway, TTS). Tokens stream to the user in real time instead of waiting for the full response. Tool calls are accumulated silently; only text responses stream.

Supersedes PRs #922, #1312, #774, #798, and #697, drawing from the best of each.

What's implemented (Stages 1-2)

Stage 1 — Core streaming (run_agent.py)

  • stream_delta_callback parameter on AIAgent.__init__ for real-time token delivery
  • _interruptible_streaming_api_call() — unified streaming for all providers:
    • chat_completions: stream=True with stream_options={"include_usage": True}
    • anthropic_messages: client.messages.stream() via Anthropic SDK, returns native Message for downstream compat
    • codex_responses: enhanced _run_codex_stream() fires delta callbacks during Codex streaming
  • _fire_stream_delta() fires both display and TTS callbacks
  • _fire_reasoning_delta() for reasoning content streaming
  • Tool-call suppression: callbacks only fire on text-only responses
  • on_first_delta: callback for spinner control on first token arrival
  • Provider fallback: graceful degradation to non-streaming when provider doesn't support it
  • _has_stream_consumers() unifies stream_delta_callback and _stream_callback checks
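A minimal sketch of the accumulate-and-suppress behaviour listed above, not the code in run_agent.py. It assumes simplified chunk dicts instead of real SDK objects; `accumulate_stream`, the chunk shape, and the returned dict are hypothetical. It also assumes one plausible reading of tool-call suppression: callbacks go quiet as soon as any tool-call fragment has been seen in the turn.

```python
from typing import Callable, Dict, Iterable, List, Optional


def accumulate_stream(
    chunks: Iterable[dict],
    on_delta: Optional[Callable[[str], None]] = None,
    on_first_delta: Optional[Callable[[], None]] = None,
) -> dict:
    """Collect streamed chunks into one response, firing callbacks for text only."""
    text_parts: List[str] = []
    tool_calls: Dict[int, dict] = {}
    first_fired = False

    for chunk in chunks:
        # Tool-call fragments are merged by index and never shown to the user.
        for frag in chunk.get("tool_calls", []):
            call = tool_calls.setdefault(frag["index"], {"name": "", "arguments": ""})
            call["name"] += frag.get("name", "")
            call["arguments"] += frag.get("arguments", "")

        text = chunk.get("content")
        if not text:
            continue
        text_parts.append(text)

        # Assumption: once a tool call has appeared, stay silent for this turn.
        if tool_calls or on_delta is None:
            continue
        if not first_fired:
            first_fired = True
            if on_first_delta:
                on_first_delta()  # e.g. stop the spinner
        on_delta(text)

    return {
        "content": "".join(text_parts),
        "tool_calls": [tool_calls[i] for i in sorted(tool_calls)],
    }
```

The full response is returned in both cases, so downstream code sees the same shape whether or not anything was streamed to the user.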

Stage 2 — CLI display (cli.py)

  • _stream_delta(): line-buffered rendering via _cprint (prompt_toolkit safe)
  • _emit_stream_text(): emits filtered text with response box framing
  • Reasoning tag suppression: <REASONING_SCRATCHPAD>, <think>, <reasoning> blocks are suppressed during streaming (handles split-token close tags via sliding window)
  • Response box: opens on first visible token, closes on flush
  • Skips Rich Panel when streaming already displayed content
  • Compatible with existing TTS streaming (both fire simultaneously)
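The line-buffered display and response box lifecycle above can be sketched roughly as follows. This is an illustration under assumptions, not the cli.py implementation: `LineBufferedStream` is a hypothetical name, the `emit` callable stands in for `_cprint`, and the box markers are placeholders.

```python
from typing import Callable


class LineBufferedStream:
    """Hypothetical line-buffered renderer with a response box."""

    def __init__(self, emit: Callable[[str], None] = print) -> None:
        self._emit = emit
        self._buffer = ""
        self._box_open = False

    def _emit_line(self, line: str) -> None:
        if not self._box_open:
            # Skip leading blank lines so the box opens on visible text.
            if not line.strip():
                return
            self._emit("+-- response")
            self._box_open = True
        self._emit(line)

    def delta(self, text: str) -> None:
        """Buffer a token and emit every complete line it finishes."""
        self._buffer += text
        while "\n" in self._buffer:
            line, self._buffer = self._buffer.split("\n", 1)
            self._emit_line(line)

    def flush(self) -> None:
        """Emit whatever is left in the buffer and close the box."""
        if self._buffer:
            self._emit_line(self._buffer)
            self._buffer = ""
        if self._box_open:
            self._emit("+--")
            self._box_open = False
```

Emitting only whole lines keeps partial tokens from interleaving with other terminal output, at the cost of a short delay until each newline arrives.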

What's coming (Stages 3-4)

Test results

  • 4572 tests pass (14 new streaming tests)
  • 14 dedicated streaming tests: accumulator shape, callback ordering, tool-call suppression, provider fallback, reasoning streaming, Codex delta callbacks, consumer detection
  • Live CLI verified: real API calls with streaming, reasoning suppression, response box framing

Config

Streaming is enabled by default when a consumer is registered. No config needed — it "just works". Future gateway config will add:

streaming:
  enabled: true
  transport: auto          # auto | draft | edit
  edit_interval: 0.15
  buffer_threshold: 20
  cursor: ""
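As a rough sketch, the streaming block above could be loaded into a small dataclass. The field names come from the YAML; the defaults simply mirror that snippet, and the class body, unknown-key handling, and method bodies are assumptions for illustration rather than the gateway's actual code.

```python
from dataclasses import asdict, dataclass, fields
from typing import Any, Dict, Optional


@dataclass
class StreamingConfig:
    # Defaults copied from the YAML snippet; treat them as illustrative.
    enabled: bool = True
    transport: str = "auto"  # auto | draft | edit
    edit_interval: float = 0.15
    buffer_threshold: int = 20
    cursor: str = ""

    @classmethod
    def from_dict(cls, data: Optional[Dict[str, Any]]) -> "StreamingConfig":
        """Build a config from a parsed YAML mapping, ignoring unknown keys."""
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in (data or {}).items() if k in known})

    def to_dict(self) -> Dict[str, Any]:
        return asdict(self)
```

Ignoring unknown keys is a design choice in this sketch: it lets an older build read a newer config file without failing.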

Attribution

Built from the best contributions across multiple PRs:

Files changed

File                      Changes
run_agent.py              +301: stream_delta_callback, unified streaming API call, Codex delta callbacks
cli.py                    +145: stream display, reasoning suppression, response box framing
tests/test_streaming.py   +524: 14-test streaming suite

bartokmagic and others added 6 commits March 16, 2026 04:53
Fixes #1445 — When using Docker backend, the user's current working
directory is now automatically bind-mounted to /workspace inside the
container. This allows users to run `cd my-project && hermes` and have
their project files accessible to the agent without manual volume config.

Changes:
- Add host_cwd and auto_mount_cwd parameters to DockerEnvironment
- Capture original host CWD in _get_env_config() before container fallback
- Pass host_cwd through _create_environment() to Docker backend
- Add TERMINAL_DOCKER_NO_AUTO_MOUNT env var to disable if needed
- Skip auto-mount when /workspace is already explicitly mounted
- Add tests for auto-mount behavior
- Add documentation for the new feature

The auto-mount is skipped when:
1. TERMINAL_DOCKER_NO_AUTO_MOUNT=true is set
2. User configured docker_volumes with :/workspace
3. persistent_filesystem=true (persistent sandbox mode)

This makes the Docker backend behave more intuitively — the agent
operates on the user's actual project directory by default.
… providers

Stage 1 of streaming support. Adds:

- stream_delta_callback parameter on AIAgent.__init__ for real-time token delivery
- _interruptible_streaming_api_call() handling chat_completions + anthropic_messages
- Enhanced _run_codex_stream() to fire delta callbacks during Codex streaming
- _fire_stream_delta() fires both display and TTS callbacks
- _fire_reasoning_delta() for reasoning content streaming
- Tool-call suppression: callbacks only fire on text-only responses
- on_first_delta callback for spinner control on first token
- Provider fallback: graceful degradation to non-streaming
- _has_stream_consumers() unifies stream_delta_callback and _stream_callback checks
- Anthropic streaming returns native Message for downstream compatibility

Drawing from PRs #922 (unified streaming), #1312 (gateway consumer),
#774 (Telegram streaming), #798 (CLI streaming), #1214 (reasoning modes).
Credit: jobless0x, OutThisLife, clicksingh, raulvidis.
…ponse box framing

Stage 2 of streaming support. CLI now streams tokens in real-time:

- _stream_delta(): line-buffered rendering via _cprint (prompt_toolkit safe)
- _flush_stream(): emits remaining buffer and closes response box
- Response box opens on first token, closes on flush
- Skip Rich Panel when streaming already displayed content
- Reset streaming state before each agent turn
- Compatible with existing TTS streaming (both can fire simultaneously)
- Uses skin engine for response label branding

Credit: OutThisLife (#798 CLI streaming concept).
…soning, Codex

Tests cover:
- Text/tool-call/mixed response accumulation into correct shape
- Delta callback ordering and on_first_delta firing once
- Tool-call suppression (no callbacks during tool turns)
- Provider fallback on 'not supported' errors
- Reasoning content accumulation and callback
- _has_stream_consumers() detection
- Codex stream delta callback firing
…etection

Fixes two issues found during live testing:

1. Reasoning tag suppression: close tags like </REASONING_SCRATCHPAD>
   that arrive split across stream tokens (e.g. '</REASONING_SCRATCH' +
   'PAD>\n\nHello') were being lost because the buffer was discarded.
   Fix: keep a sliding window of the tail (max close tag length) so
   partial tags survive across tokens.

2. Streaming fallback detection was too broad — 'stream' matched any
   error containing that word (including 'stream_options' rejections).
   Narrowed to specific phrases: 'streaming is not', 'streaming not
   support', 'does not support stream', 'not available'.

Verified with real API calls: streaming works end-to-end with
reasoning block suppression, response box framing, and proper
fallback to Rich Panel when streaming isn't active.
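The sliding-window fix in item 1 can be illustrated with a small filter. This is a sketch under assumptions, not the cli.py code: `ReasoningFilter` and `_partial_suffix` are hypothetical names, and holding back a possible partial open tag is an extra assumption beyond what the commit describes.

```python
from typing import List, Optional, Sequence, Tuple

TAGS: List[Tuple[str, str]] = [
    ("<REASONING_SCRATCHPAD>", "</REASONING_SCRATCHPAD>"),
    ("<think>", "</think>"),
    ("<reasoning>", "</reasoning>"),
]


def _partial_suffix(text: str, tags: Sequence[str]) -> int:
    """Length of the longest proper tag prefix that `text` ends with."""
    best = 0
    for tag in tags:
        for k in range(min(len(tag) - 1, len(text)), 0, -1):
            if text.endswith(tag[:k]):
                best = max(best, k)
                break
    return best


class ReasoningFilter:
    def __init__(self, tags: Sequence[Tuple[str, str]] = TAGS) -> None:
        self._tags = list(tags)
        self._pending = ""  # unprocessed tail that may hold a partial tag
        self._close: Optional[str] = None  # close tag awaited, None if outside

    def feed(self, text: str) -> str:
        """Return the visible part of `text`, hiding reasoning blocks."""
        self._pending += text
        visible: List[str] = []
        while True:
            if self._close is None:
                hits = [(self._pending.find(o), o, c)
                        for o, c in self._tags if o in self._pending]
                if hits:
                    pos, open_tag, close_tag = min(hits)
                    visible.append(self._pending[:pos])
                    self._pending = self._pending[pos + len(open_tag):]
                    self._close = close_tag
                    continue
                # Hold back a tail that could be the start of an open tag.
                keep = _partial_suffix(self._pending, [o for o, _ in self._tags])
                cut = len(self._pending) - keep
                visible.append(self._pending[:cut])
                self._pending = self._pending[cut:]
                break
            pos = self._pending.find(self._close)
            if pos != -1:
                self._pending = self._pending[pos + len(self._close):]
                self._close = None
                continue
            # Sliding window: keep just enough tail to complete a split close tag.
            self._pending = self._pending[-(len(self._close) - 1):]
            break
        return "".join(visible)

    def flush(self) -> str:
        """Release held-back text at end of stream (nothing if inside a block)."""
        tail = "" if self._close else self._pending
        self._pending, self._close = "", None
        return tail
```

Feeding the commit's own example in three pieces ('<REASONING_SCRATCHPAD>secret', '</REASONING_SCRATCH', 'PAD>\n\nHello') yields only '\n\nHello', because the window retains the partial close tag between tokens.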
…eamConsumer, already_sent

Stage 3 of streaming support. Gateway now streams tokens to messaging platforms:

- StreamingConfig dataclass (enabled, transport, edit_interval, buffer_threshold, cursor)
  on GatewayConfig with from_dict/to_dict serialization
- GatewayStreamConsumer: async queue-based consumer that progressively edits
  a single message on the target platform (edit transport)
- on_delta() → queue → run() async task → send_or_edit() with rate limiting
- already_sent propagation: when streaming delivered the response, handler
  returns None so base adapter skips duplicate send()
- stream_delta_callback wired into AIAgent constructor in _run_agent
- Consumer lifecycle: started as asyncio task, awaited with timeout in finally

Config (config.yaml):
  streaming:
    enabled: true
    transport: edit      # progressive editMessageText
    edit_interval: 0.3   # seconds between edits
    buffer_threshold: 40 # chars before forcing flush
    cursor: ' ▉'

Credit: jobless0x (#774, #1312), OutThisLife (#798), clicksingh (#697).
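The on_delta() → queue → run() → send_or_edit() flow described above might look roughly like this. It is a sketch, not GatewayStreamConsumer itself: the class name `StreamConsumer`, the `finish()` sentinel, and the exact flush rule (edit when the interval has elapsed or the unsent buffer reaches the threshold) are assumptions.

```python
import asyncio
import time
from typing import Awaitable, Callable


class StreamConsumer:
    """Hypothetical stand-in for a gateway stream consumer."""

    def __init__(
        self,
        send_or_edit: Callable[[str], Awaitable[None]],
        edit_interval: float = 0.3,
        buffer_threshold: int = 40,
        cursor: str = " ▉",
    ) -> None:
        self._send_or_edit = send_or_edit
        self._edit_interval = edit_interval
        self._buffer_threshold = buffer_threshold
        self._cursor = cursor
        self._queue: asyncio.Queue = asyncio.Queue()
        self.already_sent = False

    def on_delta(self, text: str) -> None:
        # Safe only from the event-loop thread; a worker thread would need
        # loop.call_soon_threadsafe(queue.put_nowait, text) instead.
        self._queue.put_nowait(text)

    def finish(self) -> None:
        self._queue.put_nowait(None)  # sentinel: no more deltas

    async def run(self) -> None:
        text, unsent, last_edit = "", 0, float("-inf")
        while True:
            item = await self._queue.get()
            if item is None:
                break
            text += item
            unsent += len(item)
            due = time.monotonic() - last_edit >= self._edit_interval
            if due or unsent >= self._buffer_threshold:
                await self._send_or_edit(text + self._cursor)
                unsent, last_edit = 0, time.monotonic()
        if text:
            await self._send_or_edit(text)  # final edit drops the cursor
            self.already_sent = True
```

A caller would start `run()` as an asyncio task, pass `on_delta` as the stream callback, call `finish()` when the agent turn ends, and check `already_sent` to decide whether the normal send path should be skipped.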
teknium1 changed the title from "feat: unified streaming infrastructure — real-time token delivery for CLI, gateway, and all providers" to "feat: unified streaming infrastructure — real-time token delivery for CLI + gateway" on Mar 16, 2026
teknium1 added 10 commits March 16, 2026 06:15
Previously the fallback only triggered on specific error keywords like
'streaming is not supported'. Many third-party providers have partial
or broken streaming — rejecting stream=True, crashing on stream_options,
dropping connections mid-stream, returning malformed chunks, etc.

Now: any exception during the streaming API call triggers an automatic
fallback to the standard non-streaming request path. The error is logged
at INFO level for diagnostics but never surfaces to the user. If the
fallback also fails, THAT error propagates normally.

This ensures streaming is additive — it improves UX when it works but
never breaks providers that don't support it.

Tests: 2 new (any-error fallback, double-failure propagation), 15 total.
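The any-exception fallback described in this commit reduces to a small wrapper. The sketch below is illustrative only: `call_with_stream_fallback` and its two callables are hypothetical names, not functions from run_agent.py.

```python
import logging
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


def call_with_stream_fallback(
    streaming_call: Callable[[], T],
    plain_call: Callable[[], T],
) -> T:
    """Try the streaming path; on any error, retry once without streaming."""
    try:
        return streaming_call()
    except Exception as exc:
        # Logged for diagnostics only; the user never sees this error.
        logger.info("Streaming failed (%s); retrying without streaming", exc)
    # If this call also raises, the exception propagates to the caller.
    return plain_call()
```

Catching `Exception` rather than matching error strings is what makes the behaviour independent of how a given provider words its rejection.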
Thorough code review found 5 issues across run_agent.py, cli.py, and gateway/:

1. CRITICAL — Gateway stream consumer task never started: stream_consumer_holder
   was checked BEFORE run_sync populated it. Fixed with async polling pattern
   (same as track_agent).

2. MEDIUM-HIGH — Streaming fallback after partial delivery caused double-response:
   if streaming failed after some tokens were delivered, the fallback would
   re-deliver the full response. Now tracks deltas_were_sent and only falls
   back when no tokens reached consumers yet.

3. MEDIUM — Codex mode lost on_first_delta spinner callback: _run_codex_stream
   now accepts on_first_delta parameter, fires it on first text delta. Passed
   through from _interruptible_streaming_api_call via _codex_on_first_delta
   instance attribute.

4. MEDIUM — CLI close-tag after-text bypassed tag filtering: text after a
   reasoning close tag was sent directly to _emit_stream_text, skipping
   open-tag detection. Now routes through _stream_delta for full filtering.

5. LOW — Removed 140 lines of dead code: old _streaming_api_call method
   (superseded by _interruptible_streaming_api_call). Updated 13 tests in
   test_run_agent.py and test_openai_client_lifecycle.py to use the new
   method name and signature.

4573 tests passing.
Anthropic native models emit <THINKING> tags in text content (separate
from the SDK's thinking_delta events). Without suppression, these tags
leak into the streamed CLI output. Found during live provider testing.
Streaming is now off by default for both CLI and gateway. Users opt in:

CLI (config.yaml):
  display:
    streaming: true

Gateway (config.yaml):
  streaming:
    enabled: true

This lets early adopters test streaming while existing users see zero
change. Once we have enough field validation, we flip the default to
true in a subsequent release.
Documents the new streaming options in the example config:
- display.streaming for CLI (under display section)
- streaming.enabled + transport/interval/threshold/cursor for gateway
- Added streaming: false to load_cli_config() defaults dict
…ment

The streaming infrastructure already fires reasoning deltas via
_fire_reasoning_delta() during streaming. The remaining work is the
CLI display layer: a dim reasoning box that opens on first reasoning
token, streams live, then transitions to the response box.

Reference: PR #1214 (raulvidis) for gateway reasoning visibility.
When both display.streaming and display.show_reasoning are enabled,
reasoning tokens stream in real-time into a dim bordered box. When
content tokens start arriving, the reasoning box closes and the
response box opens — smooth visual transition.

- _stream_reasoning_delta(): line-buffered rendering in dim text
- _close_reasoning_box(): flush + close, called on first content token
- Reasoning callback routes to streaming version when both flags set
- Skips static post-response reasoning display when streamed live
- State reset per turn via _reset_stream_state()

Works with reasoning_content deltas (OpenRouter reasoning mode) and
thinking_delta events (Anthropic extended thinking).
When the stream consumer's first edit_message() call fails (Signal,
Email, HomeAssistant don't support editing), it now disables editing
for the rest of the stream instead of falling back to sending a new
message every 0.3 seconds. The final response is delivered by the
normal send path since already_sent stays false.

Without this fix, enabling gateway streaming on Signal/Email/HA would
flood the chat with dozens of partial messages.
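A simplified, synchronous sketch of the behaviour this commit describes. The real consumer is async and its internals are not shown here, so `ProgressiveMessage`, the `send`/`edit` callables, and the exact point where `already_sent` is set are all assumptions.

```python
from typing import Callable, Optional


class ProgressiveMessage:
    """Hypothetical helper: one message, edited in place while supported."""

    def __init__(
        self,
        send: Callable[[str], str],        # returns a message id
        edit: Callable[[str, str], None],  # (message_id, new_text)
    ) -> None:
        self._send = send
        self._edit = edit
        self._message_id: Optional[str] = None
        self.edit_supported = True
        self.already_sent = False

    def update(self, text: str) -> None:
        if not self.edit_supported:
            return  # stay quiet; the normal send path delivers the final text
        try:
            if self._message_id is None:
                self._message_id = self._send(text)
            else:
                self._edit(self._message_id, text)
            self.already_sent = True
        except Exception:
            # First failure disables further updates instead of re-sending.
            self.edit_supported = False
            self.already_sent = False
```

With a platform whose `edit` always raises, only the first partial message is sent during the stream, and `already_sent` ends up false so the caller still delivers the complete response once.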
teknium1 marked this pull request as ready for review March 16, 2026 21:22
teknium1 merged commit 6c84e26 into main Mar 16, 2026
2 checks passed
angelburgosrosado pushed a commit to angelburgosrosado/hermes-agent that referenced this pull request Apr 27, 2026
…a098c323

feat: unified streaming infrastructure — real-time token delivery for CLI + gateway
02356abc pushed a commit to 02356abc/hermes-agent that referenced this pull request May 14, 2026
…a098c323

feat: unified streaming infrastructure — real-time token delivery for CLI + gateway
olympus-terminal pushed a commit to olympus-terminal/hermes-agent that referenced this pull request May 16, 2026
…a098c323

feat: unified streaming infrastructure — real-time token delivery for CLI + gateway
Egavasyug pushed a commit to Egavasyug/hermes-agent that referenced this pull request Jun 10, 2026
…a098c323

feat: unified streaming infrastructure — real-time token delivery for CLI + gateway