Agent stuck in CPU-burning MCP restart loop on non-retryable model error (400) #7130

@TheophilusChinomona

Description

Summary

When an agent session encounters a non-retryable LLM error (e.g., HTTP 400 "invalid model ID"), Hermes enters an infinite loop of reinitializing MCP server connections instead of exiting gracefully. This keeps one CPU core at 91%+ utilization indefinitely, with no backoff.

Reproduction

  1. Configure an agent with a model ID that returns 400 (e.g., openrouter/anthropic/claude-sonnet-4-6, which is not a valid OpenRouter model ID)
  2. Start the agent via gateway or CLI
  3. The agent begins processing and hits the 400 error
  4. Instead of failing the session, Hermes re-initializes MCP connections in a tight loop (~every 30-60 seconds)
  5. CPU stays pegged at 91%+ indefinitely; one session was observed running for 5+ hours and accumulated 305 minutes of CPU time

Evidence from logs

2026-04-09 11:03:35 ERROR root: Non-retryable client error: Error code: 400 - {'error': {'message': 'openrouter/anthropic/claude-sonnet-4-6 is not a valid model ID', 'code': 400}}

This is followed by hundreds of repeated MCP reinitialization cycles:

2026-04-10 05:09:56 INFO run_agent: Loaded environment variables from /home/openclaw/.hermes/.env
2026-04-10 05:09:58 INFO tools.mcp_tool: MCP server 'firecrawl' (stdio): registered 22 tool(s)
2026-04-10 05:19:32 INFO run_agent: Loaded environment variables from /home/openclaw/.hermes/.env
2026-04-10 05:19:34 INFO tools.mcp_tool: MCP server 'firecrawl' (stdio): registered 22 tool(s)
... (repeats every 30-60 seconds for hours)
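
To quantify the loop from a log file, a small script along these lines could count MCP registrations per server and measure the gaps between them. This is only a sketch: the regex is written against the two log line shapes quoted above, and the helper names (`reinit_times`, `gaps_seconds`) are made up for illustration and are not part of Hermes.

```python
import re
from datetime import datetime

# Matches the "registered N tool(s)" line that appears once per MCP (re)initialization.
# The pattern is derived only from the log excerpt in this issue.
REINIT_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) INFO tools\.mcp_tool: "
    r"MCP server '([^']+)' \(stdio\): registered (\d+) tool\(s\)"
)


def reinit_times(lines):
    """Return {server_name: [datetime, ...]} for every MCP registration line."""
    seen = {}
    for line in lines:
        match = REINIT_RE.match(line)
        if match:
            ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            seen.setdefault(match.group(2), []).append(ts)
    return seen


def gaps_seconds(times):
    """Seconds between consecutive registrations of one server."""
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


# Sample input: the four log lines quoted in this issue.
sample = [
    "2026-04-10 05:09:56 INFO run_agent: Loaded environment variables from /home/openclaw/.hermes/.env",
    "2026-04-10 05:09:58 INFO tools.mcp_tool: MCP server 'firecrawl' (stdio): registered 22 tool(s)",
    "2026-04-10 05:19:32 INFO run_agent: Loaded environment variables from /home/openclaw/.hermes/.env",
    "2026-04-10 05:19:34 INFO tools.mcp_tool: MCP server 'firecrawl' (stdio): registered 22 tool(s)",
]

for server, times in reinit_times(sample).items():
    print(server, len(times), gaps_seconds(times))
```

Running the same functions over the full log (for example, `reinit_times(open(path))`) would give the total number of reinit cycles and their actual spacing.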

Expected behavior

  • On a non-retryable 400 error, the agent should fail the current session and exit cleanly
  • At minimum, implement exponential backoff on consecutive failures
  • Ideally, add a max_consecutive_failures config (default ~5) after which the session is terminated
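
As a rough illustration of the three points above, the retry logic could look something like the following. This is not Hermes code: `NonRetryableError`, `RetryableError`, `RetryPolicy`, and `run_session` are hypothetical names, and the delay values are placeholders. Only the `max_consecutive_failures` default of 5 comes from the suggestion in this issue.

```python
import time
from dataclasses import dataclass


class NonRetryableError(Exception):
    """Hypothetical: a client error such as HTTP 400 'invalid model ID'."""


class RetryableError(Exception):
    """Hypothetical: a transient failure such as a timeout or rate limit."""


@dataclass
class RetryPolicy:
    max_consecutive_failures: int = 5  # default suggested in this issue
    base_delay: float = 1.0            # seconds; placeholder value
    max_delay: float = 60.0            # seconds; placeholder value

    def delay_for(self, failures: int) -> float:
        """Exponential backoff: base_delay * 2^(failures - 1), capped at max_delay."""
        return min(self.max_delay, self.base_delay * 2 ** (failures - 1))


def run_session(step, policy=None, sleep=time.sleep):
    """Call step() until it succeeds and return its result.

    A NonRetryableError propagates immediately, so the session fails
    without any reinitialization. A RetryableError triggers backoff, and
    the session is terminated once max_consecutive_failures is reached.
    """
    policy = policy or RetryPolicy()
    failures = 0
    while True:
        try:
            return step()
        except NonRetryableError:
            raise  # fail the session now; retrying cannot help
        except RetryableError as exc:
            failures += 1
            if failures >= policy.max_consecutive_failures:
                raise RuntimeError(
                    f"session terminated after {failures} consecutive failures"
                ) from exc
            sleep(policy.delay_for(failures))
```

The `sleep` parameter is injectable so the backoff schedule can be checked without actually waiting. The key design point is that the non-retryable branch never reaches the backoff path at all.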

Impact

On a 4-core server, this pegs an entire core (25% of total CPU). Multiple runaway sessions can stack up and make the server unresponsive.

Environment

  • Hermes v0.8.0 (2026.4.8)
  • Provider: openai-codex (gpt-5.4) with OpenRouter fallback
  • MCP server: firecrawl (22 tools)
  • OS: Ubuntu, 4 cores, 15 GB RAM
