Skip to content

Custom fallback providers fail silently when they don't support SSE streaming #21522

@juniperbevensee

Description

@juniperbevensee

Describe the bug

When a custom fallback provider returns a non-streaming JSON response to a stream=True request, the OpenAI SDK's streaming parser receives zero chunks. This causes:

  • content_parts stays empty → full_content = "".join([]) or None = None
  • Response is flagged as "empty" → retry loop → fallback cascade
  • The provider's valid response is silently discarded

This affects any custom provider that doesn't implement SSE streaming (e.g., lightweight proxies, self-hosted endpoints, Vertex AI REST API).

To reproduce

  1. Configure a custom fallback provider that returns valid JSON but not SSE:
fallback_providers:
  - provider: custom
    model: my-model
    base_url: http://my-proxy:8080/v1
  1. Primary provider hits rate limit → Hermes falls back to custom provider
  2. Custom provider returns valid {"choices": [...]} JSON
  3. Hermes logs: ⚠️ Empty response from model — retrying (1/3)
  4. After 3 retries: cascades to next fallback or gives up

Root cause

run_agent.py line ~8089: _use_streaming = True is unconditional — there's no per-provider or per-fallback streaming toggle. The comment says "Always prefer the streaming path" for health-monitoring benefits, but this assumption breaks custom providers.

When client.chat.completions.create(stream=True) receives a JSON response instead of SSE, the SDK's Stream iterator yields zero chunks. The streaming response builder at line ~5040 produces full_content = None with no tool calls → flagged as invalid.

Expected behavior

Either:

  • (A) Add a per-provider config flag to disable streaming: fallback_providers: [{provider: custom, model: x, base_url: y, stream: false}]
  • (B) Detect non-SSE responses in the streaming path and fall back to non-streaming parsing
  • (C) Document that custom providers MUST support SSE streaming

Workaround

We built a lightweight proxy (~200 lines Python) that translates OpenAI streaming requests to Vertex AI's native streamGenerateContent?alt=sse endpoint and converts the chunks back to OpenAI chat.completion.chunk format. Happy to contribute this as a reference implementation or built-in adapter.

Environment

  • Hermes version: 0.8.0
  • Provider: custom (Vertex AI via proxy)
  • Platform: Docker (gateway mode)
  • OS: macOS (Apple Silicon)

Metadata

Metadata

Assignees

No one assigned

    Labels

    P2Medium — degraded but workaround existsarea/configConfig system, migrations, profilescomp/agentCore agent loop, run_agent.py, prompt buildertype/bugSomething isn't working

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions