Summary
When an agent session encounters a non-retryable LLM error (e.g., HTTP 400 "invalid model ID"), Hermes enters an infinite loop of reinitializing MCP server connections instead of exiting gracefully. This burns 91%+ CPU indefinitely with no backoff.
Reproduction
- Configure an agent with a model ID that returns 400 (e.g., openrouter/anthropic/claude-sonnet-4-6, which is not a valid OpenRouter model ID)
- Start the agent via gateway or CLI
- Agent begins processing, hits 400 error
- Instead of failing the session, Hermes re-initializes MCP connections in a tight loop (~every 30-60 seconds)
- CPU stays pegged at 91%+ indefinitely (observed running for 5+ hours, accumulating 305 minutes of CPU time)
Evidence from logs
2026-04-09 11:03:35 ERROR root: Non-retryable client error: Error code: 400 - {'error': {'message': 'openrouter/anthropic/claude-sonnet-4-6 is not a valid model ID', 'code': 400}}
Followed by hundreds of repeated MCP reinit cycles:
2026-04-10 05:09:56 INFO run_agent: Loaded environment variables from /home/openclaw/.hermes/.env
2026-04-10 05:09:58 INFO tools.mcp_tool: MCP server 'firecrawl' (stdio): registered 22 tool(s)
2026-04-10 05:19:32 INFO run_agent: Loaded environment variables from /home/openclaw/.hermes/.env
2026-04-10 05:19:34 INFO tools.mcp_tool: MCP server 'firecrawl' (stdio): registered 22 tool(s)
... (repeats every 30-60 seconds for hours)
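To quantify the reinit cadence from a log file, something like the following rough sketch may help. It assumes each cycle begins with the "Loaded environment variables" line seen in the excerpt above and that every line starts with a `YYYY-MM-DD HH:MM:SS` timestamp. The helper names `reinit_times` and `gaps_seconds` are made up for this example and are not part of Hermes.

```python
import re
from datetime import datetime

# Marker copied from the log excerpt; assumed to appear once per reinit cycle.
MARKER = "run_agent: Loaded environment variables"
# Leading timestamp in the format shown in the excerpt.
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")


def reinit_times(lines):
    """Return the timestamp of every line that starts a reinit cycle."""
    times = []
    for line in lines:
        match = TIMESTAMP.match(line)
        if match and MARKER in line:
            times.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S"))
    return times


def gaps_seconds(times):
    """Return the number of seconds between consecutive reinit cycles."""
    return [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]
```

Passing the open log file (an iterable of lines) to `reinit_times` and the result to `gaps_seconds` gives the interval between cycles, which can be attached to the report alongside the cycle count `len(times)`.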
Expected behavior
- On a non-retryable 400 error, the agent should fail the current session and exit cleanly
- At minimum, implement exponential backoff on consecutive failures
- Ideally, add a max_consecutive_failures config (default ~5) after which the session is terminated
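As a rough illustration of the behavior requested above, here is a minimal sketch, not Hermes code. `NonRetryableError`, `RetryPolicy`, `run_session`, `base_delay`, and `max_delay` are hypothetical names invented for this example; only `max_consecutive_failures` and its default of 5 come from this issue.

```python
import time
from dataclasses import dataclass


class NonRetryableError(Exception):
    """Hypothetical marker for errors that must never be retried (e.g. HTTP 400)."""


@dataclass
class RetryPolicy:
    max_consecutive_failures: int = 5  # cap proposed in this issue
    base_delay: float = 1.0            # assumed first delay, in seconds
    max_delay: float = 60.0            # assumed upper bound, in seconds

    def delay_for(self, failures: int) -> float:
        """Exponential backoff: base_delay * 2^(failures - 1), capped at max_delay."""
        return min(self.max_delay, self.base_delay * 2 ** (failures - 1))


def run_session(step, policy=None, sleep=time.sleep):
    """Call step() until it succeeds.

    A NonRetryableError ends the session immediately. Any other exception is
    retried with backoff until max_consecutive_failures is reached, at which
    point the last exception is re-raised.
    """
    policy = policy or RetryPolicy()
    failures = 0
    while True:
        try:
            return step()
        except NonRetryableError:
            raise  # fail the session right away; no reinit, no retry
        except Exception:
            failures += 1
            if failures >= policy.max_consecutive_failures:
                raise  # too many consecutive failures; terminate the session
            sleep(policy.delay_for(failures))
```

With a policy like this, the invalid model ID case would surface as a single failed session, and transient errors would be retried at growing intervals rather than at a fixed cadence.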
Impact
On a 4-core server, this pegs an entire core (25% total CPU). Multiple runaway sessions can stack up and make the server unresponsive.
Environment
- Hermes v0.8.0 (2026.4.8)
- Provider: openai-codex (gpt-5.4) with OpenRouter fallback
- MCP server: firecrawl (22 tools)
- OS: Ubuntu, 4 cores, 15 GB RAM