fix: don't evict cached agent on failed runs — prevents MCP restart loop by teknium1 · Pull Request #7539 · NousResearch/hermes-agent

teknium1 · 2026-04-11T04:15:20Z

Problem

When a gateway session encounters a persistent non-retryable error (e.g., invalid model ID → HTTP 400), the fallback provider activates and fails too. The gateway then evicts the cached agent (because the agent's model doesn't match the config default). Next message → new AIAgent() → MCP servers reinitialize (stdio processes spawn) → same 400 → fallback → eviction → loop. This burns 91%+ CPU for hours (#7130).

Root cause

Lines 7569-7571: the fallback-eviction check runs unconditionally after every agent run. When fallback activated but the run failed, evicting the agent is pointless because the same error will recur. Worse, eviction forces a full AIAgent recreation on the next message, paying the full MCP initialization cost each time.
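To make the cost of unconditional eviction concrete, here is a minimal simulation. It is a sketch under stated assumptions, not the gateway's code: `FakeAgent`, `handle_messages`, the `"session"` cache key, and the model names are all hypothetical stand-ins. It only counts how many times an agent would be constructed (and so how many times MCP servers would be initialized) for a session that keeps hitting the same non-retryable error.

```python
class FakeAgent:
    """Hypothetical stand-in for AIAgent; counts constructions."""

    mcp_inits = 0

    def __init__(self, model):
        self.model = model
        # A real agent would spawn stdio MCP server processes here.
        FakeAgent.mcp_inits += 1

    def run(self, message):
        # Persistent non-retryable error: primary fails, fallback activates
        # and fails too, leaving the agent on a non-default model.
        self.model = "fallback-model"
        return {"failed": True, "error": "HTTP 400: invalid model ID"}


def handle_messages(n_messages, evict_on_failure, default_model="bad-model-id"):
    """Process n_messages for one session; return the number of agents built."""
    FakeAgent.mcp_inits = 0
    cache = {}
    for i in range(n_messages):
        agent = cache.get("session")
        if agent is None:
            agent = cache["session"] = FakeAgent(default_model)
        result = agent.run(f"message {i}")
        fallback_active = agent.model != default_model
        # Unconditional eviction ignores whether the run failed.
        if fallback_active and (evict_on_failure or not result["failed"]):
            del cache["session"]
    return FakeAgent.mcp_inits


unguarded = handle_messages(5, evict_on_failure=True)
guarded = handle_messages(5, evict_on_failure=False)
print(unguarded, guarded)
```

With eviction on every failed run, each of the five messages builds a fresh agent. When failed runs keep the cached agent, only the first message pays the construction cost.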

Fix

One guard: `and not _run_failed` on the eviction check.

- Failed runs: keep the cached agent. Next message reuses it (no MCP reinit), hits the same error quickly, returns it to the user. No CPU burn.
- Successful runs with fallback: evict as before so the next message retries the primary model.

# Before
if _agent is not None and hasattr(_agent, 'model'):

# After
_run_failed = _result_for_fb.get("failed") if _result_for_fb else False
if _agent is not None and hasattr(_agent, 'model') and not _run_failed:
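The guarded condition can be read as a small predicate. The sketch below is illustrative only: `should_evict`, `default_model`, and the `SimpleNamespace` agents are hypothetical, and the model comparison is assumed from the description above (the agent is evicted when its model no longer matches the config default). It also shows how the `.get("failed")` expression treats a missing result or a missing key as "not failed".

```python
from types import SimpleNamespace


def should_evict(agent, result, default_model):
    """Return True when the cached agent should be dropped after a run.

    Assumed logic: evict only if the agent has drifted off the default
    model (fallback activated) and the run did not fail.
    """
    # Same shape as the guard in the diff: None or {} counts as not failed.
    run_failed = result.get("failed") if result else False
    if agent is None or not hasattr(agent, "model"):
        return False
    return agent.model != default_model and not run_failed


primary = SimpleNamespace(model="primary")
fallback = SimpleNamespace(model="fallback")

cases = {
    "fallback + success": should_evict(fallback, {"failed": False}, "primary"),
    "fallback + failure": should_evict(fallback, {"failed": True}, "primary"),
    "fallback + no result": should_evict(fallback, None, "primary"),
    "fallback + missing key": should_evict(fallback, {}, "primary"),
    "primary + success": should_evict(primary, {"failed": False}, "primary"),
    "no agent": should_evict(None, {"failed": True}, "primary"),
}
for name, evict in cases.items():
    print(f"{name}: {evict}")
```

Only the "fallback + failure" case changes behavior relative to the unguarded check: the agent stays cached instead of being evicted.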

Changes

| File | +/- |
| --- | --- |
| `gateway/run.py` | +10/-3 |
| `tests/gateway/test_fallback_eviction.py` | +44 (new) |

Addresses #7130.

…rors

When a gateway session hits a non-retryable error (e.g. invalid model ID → HTTP 400), the agent fails and returns. But if the session keeps receiving messages (or something periodically recreates agents), each attempt spawns a new AIAgent, reinitializing MCP server connections and burning CPU, only to hit the same 400 error again. On a 4-core server, this pegs an entire core per stuck session and accumulates 300+ minutes of CPU time over hours.

Fix: add a per-session consecutive failure counter in the gateway runner.

- Track consecutive non-retryable failures per session key
- After 3 consecutive failures (_MAX_CONSECUTIVE_FAILURES), block further agent creation for that session and notify the user: '⚠️ This session has failed N times in a row with a non-retryable error. Use /reset to start a new session.'
- Evict the cached agent when the circuit breaker engages to prevent stale state from accumulating
- Reset the counter on successful agent runs
- Clear the counter on /reset and /new so users can recover
- Uses getattr() pattern so bare GatewayRunner instances (common in tests using object.__new__) don't crash

Tests:

- 8 new tests in test_circuit_breaker.py covering counter behavior, threshold, reset, session isolation, and bare-runner safety

Addresses #7130.
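For readers comparing the two approaches, the per-session counter described in this commit message (which the PR later reverted) could look roughly like the following. This is a minimal sketch, not the reverted code: `GatewayRunnerSketch` and its method names are hypothetical, only `_MAX_CONSECUTIVE_FAILURES` and the threshold of 3 come from the message, and the sketch does not distinguish retryable from non-retryable failures.

```python
_MAX_CONSECUTIVE_FAILURES = 3


class GatewayRunnerSketch:
    """Hypothetical runner holding a per-session consecutive failure counter."""

    def _failures(self):
        # getattr() with lazy creation, so instances built with
        # object.__new__ (which skips __init__) do not raise AttributeError.
        counts = getattr(self, "_consecutive_failures", None)
        if counts is None:
            counts = self._consecutive_failures = {}
        return counts

    def is_blocked(self, session_key):
        """True once the session has reached the failure threshold."""
        return self._failures().get(session_key, 0) >= _MAX_CONSECUTIVE_FAILURES

    def record_result(self, session_key, failed):
        """Increment on failure; a successful run resets the counter."""
        counts = self._failures()
        if failed:
            counts[session_key] = counts.get(session_key, 0) + 1
        else:
            counts.pop(session_key, None)

    def reset_session(self, session_key):
        """What /reset or /new would call so the user can recover."""
        self._failures().pop(session_key, None)


# Bare instance, as tests using object.__new__ would create it.
runner = object.__new__(GatewayRunnerSketch)
for _ in range(3):
    runner.record_result("s1", failed=True)

blocked_after_three = runner.is_blocked("s1")
other_session_blocked = runner.is_blocked("s2")
runner.reset_session("s1")
blocked_after_reset = runner.is_blocked("s1")
print(blocked_after_three, other_session_blocked, blocked_after_reset)
```

The counter is keyed by session, so one stuck session does not block others, and clearing the key is enough to let the user start over.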

…stent errors" This reverts commit d848ea7.

When a run fails (e.g. invalid model ID → 400) and the fallback has activated, the gateway was evicting the cached agent to 'retry primary next time.' But evicting a failed agent forces a full AIAgent recreation on the next message, reinitializing MCP server connections and spawning stdio processes, only to hit the same 400 again. This created a CPU-burning loop (91%+ for hours, #7130).

The fix: add `and not _run_failed` to the fallback-eviction check. Failed runs keep the cached agent. The next message reuses it (no MCP reinit), hits the same error, and returns it to the user quickly. The user can /reset or /model to fix their config. Successful fallback runs still evict as before so the next message retries the primary model.

Addresses #7130.

teknium1 added 3 commits April 10, 2026 21:07

Revert "fix: circuit breaker stops CPU-burning restart loops on persistent errors"

df4203c

This reverts commit d848ea7.

teknium1 merged commit 2410324 into main Apr 11, 2026
3 of 4 checks passed

teknium1 mentioned this pull request Apr 11, 2026

Agent stuck in CPU-burning MCP restart loop on non-retryable model error (400) #7130

Closed

github-actions Bot mentioned this pull request Apr 15, 2026

chore: bump NousResearch/hermes-agent version from v2026.4.8 to v2026.4.13 Docker-Hub-sirmark/docker-hermes-agent#1

Merged

teknium1 merged 3 commits into main from hermes/hermes-4a5220fe
