Bug: Gateway falsely marks healthy local vLLM endpoints as timed out/overloaded, causing 1–23 min fallback cascades #63229

@clawdia-lobster

Summary

The gateway's model-fallback/routing subsystem incorrectly marks healthy, responsive local vLLM endpoints as "timed out" or "overloaded", causing cascading fallback chains that take 1–23 minutes to resolve. The endpoints themselves respond in 0.27–0.93s when tested directly via curl.

Environment

  • OpenClaw: 2026.4.5 (container, Linux 6.12.63 x64)
  • Gateway: loopback bind, port 18789
  • Local vLLM endpoints:
    • vllm-8001 (gemma4, 27B) on jupiter.wg.local:8001 — dedicated GPU
    • vllm-7002 (qwen3.5-27b) on jupiter.wg.local:7002 — dedicated GPU
  • Remote providers: Novita (GLM-5, Kimi), DeepInfra (Kimi), Anthropic (Sonnet)
  • Config: agents.defaults.timeoutSeconds: 1200, agents.defaults.llm.idleTimeoutSeconds: 300

Observed Behaviour

1. Endpoints are fast (direct curl, concurrent)

Both GPUs idle, tested concurrently:

| Endpoint              | Avg latency (5 reqs) |
|-----------------------|----------------------|
| vllm-8001/gemma4      | 0.28s                |
| vllm-7002/qwen3.5-27b | 0.91s                |
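
For anyone who wants to repeat the direct measurement without curl, a minimal sketch follows. It is not the exact procedure used for the table above: the `/v1/models` path is an assumption (substitute whatever route you normally hit), and `ENDPOINTS`, `probe`, and `average_latency` are illustrative names.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Hosts and ports come from the Environment section; the /v1/models path is
# an assumption, replace it with the route you normally curl.
ENDPOINTS = {
    "vllm-8001/gemma4": "http://jupiter.wg.local:8001/v1/models",
    "vllm-7002/qwen3.5-27b": "http://jupiter.wg.local:7002/v1/models",
}


def probe(url, timeout=10.0):
    """Return wall-clock seconds for one GET, including reading the body."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()
    return time.perf_counter() - start


def average_latency(endpoints, requests_per_endpoint=5, fetch=probe):
    """Probe every endpoint concurrently and return {name: mean seconds}."""
    jobs = [(name, url) for name, url in endpoints.items()
            for _ in range(requests_per_endpoint)]
    # One worker per request so all probes are in flight at the same time.
    with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
        timings = list(pool.map(lambda job: (job[0], fetch(job[1])), jobs))
    per_endpoint = {}
    for name, seconds in timings:
        per_endpoint.setdefault(name, []).append(seconds)
    return {name: sum(vals) / len(vals) for name, vals in per_endpoint.items()}
```

Calling `average_latency(ENDPOINTS)` from the gateway host gives one mean per endpoint. The `fetch` parameter exists only so the averaging can be exercised without a network.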

2. Gateway marks them as timed out or overloaded

From gateway logs, model fallback decisions for today:

Failure reasons:

  • timeout: 17 occurrences
  • unknown: 8
  • overloaded: 2

Error previews:

  • LLM request timed out.: 12
  • Gateway is draining for restart; new tasks are not accepted: 8
  • cron: job execution timed out: 4
  • Live session model switch requested: <model>: 2
  • Request was aborted.: 1
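
Counts like the ones above can be reproduced with a short script. This is a sketch under an assumption: it expects the fallback decisions as one JSON object per line. The field names `reason` and `errorPreview` are the ones visible in the log excerpts quoted in this issue; the JSON Lines framing and the function name `tally_fallbacks` are assumptions.

```python
import json
from collections import Counter


def tally_fallbacks(lines):
    """Count failure reasons and error previews from JSON log lines.

    Assumes one JSON object per line. Lines that are not valid JSON, or
    that carry no "reason" field, are skipped.
    """
    reasons, previews = Counter(), Counter()
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict) or "reason" not in record:
            continue
        reasons[record["reason"]] += 1
        previews[record.get("errorPreview", "<none>")] += 1
    return reasons, previews
```

Any iterable of strings works as input, so an open log file can be passed directly.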

3. Fallback chains take minutes

Example fallback chains from today's logs:

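| Run ID   | Chain                                                 | Total time |
|----------|-------------------------------------------------------|------------|
| 7c914aae | qwen→timeout → Kimi→timeout → gemma→timeout → Sonnet✓ | 23.4 min   |
| 21ca97c0 | gemma→timeout → qwen✓                                 | 4.1 min    |
| 0cb06206 | gemma→timeout → Kimi✓                                 | 56.6s      |
| 66f5e9e5 | gemma→timeout → GLM-5✓                                | 80.6s      |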

4. Gateway can't even spawn subagents

Attempting sessions_spawn returns:

gateway timeout after 10000ms
Gateway target: ws://127.0.0.1:18789

Meanwhile, direct curl to the same endpoints returns in <1s.
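
To help separate "the gateway port is not accepting connections" from "the gateway accepts connections but answers slowly", a plain TCP probe against the gateway target can be useful. The sketch below checks only the TCP handshake, not the WebSocket protocol, and `tcp_connect_seconds` is an illustrative name.

```python
import socket
import time


def tcp_connect_seconds(host, port, timeout=10.0):
    """Return seconds taken to open a TCP connection, or None on failure."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.perf_counter() - start
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return None
```

For the setup in this report the call would be `tcp_connect_seconds("127.0.0.1", 18789)`. A fast connect combined with the 10000ms gateway timeout would point at request handling inside the gateway and not at the listener itself.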

5. "Overloaded" misclassification

The gateway logs show reason: "overloaded" with errorPreview: "Live session model switch requested: novita/zai-org/glm-4.7". A session model mismatch is being classified as a provider overload — the gateway is conflating an internal session state error with provider unavailability.
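
One way to express the distinction, as a sketch only: this is not the gateway's code, it is written in Python purely for illustration, and every name in it (`FailureReason`, `SESSION_STATE`, `classify`, `counts_against_provider`) is hypothetical. Only the model-switch preview text is taken from the logs.

```python
from enum import Enum


class FailureReason(Enum):
    TIMEOUT = "timeout"
    OVERLOADED = "overloaded"
    SESSION_STATE = "session_state"  # hypothetical category, not in the gateway
    UNKNOWN = "unknown"


# Hypothetical prefix table; only this one preview is taken from the logs.
SESSION_STATE_PREFIXES = ("Live session model switch requested:",)


def classify(error_preview):
    """Map an error preview to a failure reason, checking session state first."""
    if error_preview.startswith(SESSION_STATE_PREFIXES):
        return FailureReason.SESSION_STATE
    lowered = error_preview.lower()
    if "timed out" in lowered:
        return FailureReason.TIMEOUT
    if "overloaded" in lowered:
        return FailureReason.OVERLOADED
    return FailureReason.UNKNOWN


def counts_against_provider(reason):
    """Only provider-side failures should feed the fallback logic."""
    return reason in (FailureReason.TIMEOUT, FailureReason.OVERLOADED)
```

With a separate category, a model-switch error would no longer count as evidence that the provider is unhealthy, so it would not start a fallback chain.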

Expected Behaviour

  • Requests to healthy, sub-second local endpoints should not time out
  • Session model switch errors should not be classified as "overloaded"
  • Fallback chains should not take minutes when all providers are responsive
  • sessions_spawn should not time out when the gateway is under normal load

Root Cause Hypothesis

Two distinct bugs:

  1. Internal timeout too aggressive or misapplied: The gateway's LLM request timeout fires before the endpoint responds, or the timeout is applied to an internal queue wait rather than the actual HTTP request. Endpoints respond in <1s, yet the gateway logged 17 timeout failures today, 12 of them previewed as "LLM request timed out."

  2. LiveSessionModelSwitchError misclassified as "overloaded": When a cron job or isolated session requests a model different from the live session's current model, the gateway throws LiveSessionModelSwitchError and classifies this as reason: "overloaded" in the fallback system. This is semantically wrong and triggers unnecessary fallback cascading.
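
The first hypothesis could be checked by timing the two phases separately. The sketch below is an assumption about how such instrumentation might look, not a description of the gateway internals: `RequestTiming` and `timed_out_phase` are hypothetical names, and the timestamps would have to come from wherever the gateway enqueues and dispatches a request.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    enqueued_at: float    # when the request entered the gateway's queue
    dispatched_at: float  # when the HTTP call to the provider started
    finished_at: float    # when the provider's response completed

    @property
    def queue_wait(self):
        return self.dispatched_at - self.enqueued_at

    @property
    def request_time(self):
        return self.finished_at - self.dispatched_at


def timed_out_phase(timing, request_timeout):
    """Report which phase, if any, pushed the request past the budget.

    Returns "request" when the HTTP call alone exceeded the timeout,
    "queue" when the call was within budget but queue wait plus call
    exceeded it, and None when the total fits.
    """
    if timing.request_time > request_timeout:
        return "request"
    if timing.queue_wait + timing.request_time > request_timeout:
        return "queue"
    return None
```

If most of the logged timeouts came back as "queue", the provider would be healthy and the timer would be covering time spent before dispatch, which would match the sub-second curl results.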

Reproduction

  1. Configure two local vLLM providers with fast endpoints (<1s response)
  2. Configure 3+ agents with cron jobs using different model overrides
  3. Observe gateway logs: endpoints will be marked as "timed out" despite being healthy
  4. Run curl directly against the same endpoints to confirm sub-second response

Impact

Additional Context
