Skip to content

fix: recover MCP stale session retries#19208

Open
mgh3326 wants to merge 3 commits into
NousResearch:mainfrom
mgh3326:feature/ROB-89-hermes-mcp-session-termination-recovery
Open

fix: recover MCP stale session retries#19208
mgh3326 wants to merge 3 commits into
NousResearch:mainfrom
mgh3326:feature/ROB-89-hermes-mcp-session-termination-recovery

Conversation

@mgh3326

@mgh3326 mgh3326 commented May 3, 2026

Copy link
Copy Markdown

Summary

  • Recover MCP client calls when an MCP server reports stale/expired/terminated session state.
  • Trigger reconnect and wait for a rebuilt session identity before retrying tool/resource/prompt handlers.
  • On recovery failure, bump the circuit breaker (so the model backs off a permanently broken transport); on recovery success, fully reset the breaker (count + opened-at) to avoid a stale half-open state.
  • Add regression coverage for terminated sessions, reconnect timeout, retry failure, stale-session retry ordering, and breaker bump/reset semantics.

Verification

  • python -m ruff check tools/mcp_tool.py tests/tools/test_mcp_tool_session_expired.py
  • python -m pytest tests/tools/test_mcp_tool_session_expired.py -v -o 'addopts=' (26 passed)
  • Prior implementer verification: python -m pytest tests/tools -v -k mcp -o 'addopts=' (326 passed, 1 skipped, 4056 deselected, 3 warnings)

Follow-ups (not in this PR)

  • _handle_auth_error_and_retry has the same retry-on-stale-session race (mcp_tool.py:1612-1616) that this PR fixes for the session-expired path; tracked as a separate PR so the OAuth recovery scope stays reviewable on its own.

Safety / non-actions

  • No broker, trading, scheduler, or production database mutations.
  • No secrets added or printed.
  • Hermes Agent MCP client code only; no production deploy/smoke is applicable for this repository slice.

@alt-glitch alt-glitch added P2 Medium — degraded but workaround exists type/bug Something isn't working tool/mcp MCP client and OAuth labels May 3, 2026
Three follow-ups to the structured stale-transport return added in
this PR:

- Failed recovery (timeout / retry raised / retry returned JSON
  error) now calls _bump_server_error() so the circuit breaker still
  trips on a permanently broken transport. Before, the structured
  return bypassed the caller's bump and the model could hammer the
  tool indefinitely.
- Successful retry uses _reset_server_error() instead of a raw count
  assignment, so the breaker_opened_at timestamp is also cleared.
  Without this, a server that recovered after tripping the breaker
  would keep a stale opened-at, miscomputing future cooldowns.
- _stale_transport_error drops the op_description from the
  operator-visible message; the auth-error sibling already keeps its
  message scoped to the server, and the failing op is preserved in
  the surrounding log lines.

Tests cover the three failure-mode bumps, the success-path reset
(both count and opened-at), and document the _install_stub_server
__dict__-copy contract that future contributors are most likely to
trip over.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

P2 Medium — degraded but workaround exists tool/mcp MCP client and OAuth type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants