fix: gateway resilience - token rejection handling, state self-heal, health endpoint (#7023)

Open
Git-on-my-level wants to merge 1 commit into NousResearch:main from Git-on-my-level:fix/gateway-resilience-upstream-v2

@Git-on-my-level

Contributor

Summary

Three gateway resilience improvements to reduce silent failures and crash loops.

Closes Git-on-my-level#3

Changes

1. Telegram token rejection -> degraded state instead of crash loop

File: gateway/platforms/telegram.py

Previously, a revoked or invalid bot token caused the Telegram adapter to exit immediately. Combined with launchd's KeepAlive, this created a tight crash loop that hammered the API.

Now distinguishes between:

  • Transient errors (network, timeout) -> retry with exponential backoff (existing behavior)
  • Non-retryable errors (InvalidToken, Unauthorized) -> set fatal_error_retryable=False, enter a degraded state with a 5-10 minute retry interval, and log the cause clearly
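
For reviewers, the distinction above can be sketched roughly as follows. This is a minimal illustration, not the code in gateway/platforms/telegram.py: the exception classes are local stand-ins, and `RetryDecision`, `classify_failure`, the backoff base and cap, and the choice to treat unknown errors as transient are all assumptions made for the example. Only `fatal_error_retryable` and the 5-10 minute window come from the description.

```python
import random
from dataclasses import dataclass


class TransientError(Exception):
    """Stand-in for network/timeout failures (hypothetical local class)."""


class InvalidToken(Exception):
    """Stand-in for a rejected bot token (hypothetical local class)."""


class Unauthorized(Exception):
    """Stand-in for an unauthorized response (hypothetical local class)."""


NON_RETRYABLE = (InvalidToken, Unauthorized)


@dataclass
class RetryDecision:
    fatal_error_retryable: bool
    degraded: bool
    delay_seconds: float


def classify_failure(exc, attempt, base=1.0, cap=60.0, rng=random.random):
    """Decide how long to wait before the next connection attempt."""
    if isinstance(exc, NON_RETRYABLE):
        # Token problems will not fix themselves quickly: stay alive in a
        # degraded state and retry slowly, 300-600 s (5-10 minutes).
        return RetryDecision(False, True, 300.0 + 300.0 * rng())
    # Assumption: anything else is transient and gets capped exponential backoff.
    return RetryDecision(True, False, min(cap, base * 2 ** attempt))
```

Keeping the process alive in the degraded branch is what prevents the supervisor (launchd in the case described) from restarting it in a tight loop.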

2. gateway_state.json self-heals on successful reconnect

Files: gateway/run.py, gateway/status.py

Stale error messages previously persisted in gateway_state.json even after successful reconnect, making diagnostics confusing (state showed old token error while gateway was healthy).

Now clears error_message and resets error_code when adapters successfully connect or reconnect.
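
A rough sketch of the self-heal step, for illustration only. The JSON layout (an `adapters` map keyed by adapter name, with a `status` field) and the helper name `mark_adapter_connected` are assumptions; only the `error_message` and `error_code` fields are named in this description.

```python
import json
import os
import tempfile


def mark_adapter_connected(state_path, adapter):
    """Clear stale error fields for one adapter after a successful connect."""
    try:
        with open(state_path, encoding="utf-8") as fh:
            state = json.load(fh)
    except (FileNotFoundError, json.JSONDecodeError):
        state = {}  # missing or corrupt file: start from a clean state

    # Assumed layout: {"adapters": {"<name>": {...}}}
    entry = state.setdefault("adapters", {}).setdefault(adapter, {})
    entry["status"] = "connected"
    entry["error_message"] = None
    entry["error_code"] = None

    # Write to a temp file and rename so readers never see a partial file.
    directory = os.path.dirname(os.path.abspath(state_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(state, fh, indent=2)
    os.replace(tmp_path, state_path)
    return state
```

Calling this from both the initial connect path and the reconnect path would keep the state file in line with the adapter's actual health.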

3. Lightweight health check file

Files: gateway/run.py, gateway/status.py

Writes gateway_health.json every 60 seconds with:

  • Gateway state, per-adapter status, uptime, last error
  • Makes external monitoring/alerting trivial (no need to parse gateway_state.json crash artifacts)
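
As a sketch of what one health write might look like: the field names, the `write_health_file` helper, and the atomic-rename approach are assumptions for illustration, and the actual file written by this PR may differ. The 60-second interval and the four kinds of content (gateway state, per-adapter status, uptime, last error) are taken from the list above.

```python
import json
import os
import tempfile
import time

HEALTH_INTERVAL_SECONDS = 60  # interval stated in this PR


def write_health_file(path, gateway_state, adapters, started_at,
                      last_error=None, now=None):
    """Write one health snapshot; intended to be called once per interval."""
    now = time.time() if now is None else now
    snapshot = {
        "gateway_state": gateway_state,   # e.g. "running" or "degraded"
        "adapters": adapters,             # per-adapter status map
        "uptime_seconds": round(now - started_at, 1),
        "last_error": last_error,
        "written_at": now,                # lets monitors detect a stale file
    }
    # Atomic replace so a monitor never reads a half-written file.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(snapshot, fh, indent=2)
    os.replace(tmp_path, path)
    return snapshot
```

An external monitor could then alert when `written_at` is older than a few multiples of `HEALTH_INTERVAL_SECONDS`, or when `gateway_state` is not the expected value.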

Testing

  • Added tests for token rejection degraded state
  • Added tests for state self-healing
  • Added tests for health check file output
  • Existing tests pass

alt-glitch added the type/bug, P2, comp/gateway, and platform/telegram labels on Apr 29, 2026
Development

Successfully merging this pull request may close these issues.

Gateway resilience: upstream PRs to propose
