Bug Report: Gateway becomes unresponsive after Telegram timeout — requires manual restart
Summary
After the Telegram API returns HTTP 502 (Bad Gateway) and the reconnection attempt times out, the gateway keeps running normally for ~25 minutes, then becomes completely unresponsive (no logs, no cron execution, no message handling) until it is manually restarted.
Timeline of Events
2026-03-26 01:11:11 — Initial Telegram error:
WARNING gateway.platforms.telegram: [Telegram] Telegram network error, scheduling reconnect: Bad Gateway
WARNING gateway.platforms.telegram: [Telegram] Telegram network error (attempt 1/10), reconnecting in 5s. Error: Bad Gateway
2026-03-26 01:11:26 — Reconnection fails:
WARNING gateway.platforms.telegram: [Telegram] Telegram polling reconnect failed: Timed out
2026-03-26 01:11 — 01:36 — Gateway continues operating normally:
- Cron jobs execute successfully on schedule
- Logs show normal activity
- Duration: ~25 minutes of normal operation after Telegram failure
2026-03-26 01:37 — 03:09 — Gateway silent (~1 hour 32 minutes):
- No log entries
- No cron job execution (scheduled jobs did not run)
- No response to messages
- Process still running (PID unchanged)
2026-03-26 03:09:21 — Manual restart:
INFO gateway.run: Stopping gateway...
INFO gateway.run: Gateway stopped
INFO gateway.run: Starting Hermes Gateway...
INFO gateway.run: Gateway running with 1 platform(s)
Impact
- Gateway becomes completely unresponsive ~25 minutes after a transient network error
- No automatic recovery mechanism
- No health check or watchdog to detect stuck state
- Requires manual intervention (hermes gateway restart)
- ~1.5 hours of downtime in this case
Expected Behavior
- Gateway should continue operating even if one platform (Telegram) fails
- Cron jobs should continue executing regardless of messaging platform status
- Some form of health monitoring should detect unresponsive state
- Automatic restart or recovery mechanism should exist
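The health-monitoring point above could be read as a heartbeat check: the main loop records progress, and a separate thread reacts when no progress has been recorded for too long. The following is only a minimal sketch under that assumption. `Heartbeat`, `beat`, `is_stale`, and `start_watchdog` are hypothetical names, not part of Hermes, and where `beat()` would be called depends on how the gateway loop is actually structured.

```python
import threading
import time


class Heartbeat:
    """Tracks when the main loop last reported progress (hypothetical helper)."""

    def __init__(self, max_silence_s: float, clock=time.monotonic):
        self._max_silence_s = max_silence_s
        self._clock = clock
        self._lock = threading.Lock()
        self._last_beat = clock()

    def beat(self) -> None:
        # Call this from the gateway loop after each cron tick or handled message.
        with self._lock:
            self._last_beat = self._clock()

    def silence_s(self) -> float:
        with self._lock:
            return self._clock() - self._last_beat

    def is_stale(self) -> bool:
        return self.silence_s() > self._max_silence_s


def start_watchdog(heartbeat: Heartbeat, on_stale, poll_s: float = 30.0):
    """Start a daemon thread that calls on_stale() once the heartbeat goes stale."""

    def _watch():
        while not heartbeat.is_stale():
            time.sleep(poll_s)
        on_stale()

    thread = threading.Thread(target=_watch, name="gateway-watchdog", daemon=True)
    thread.start()
    return thread
```

What `on_stale` does is a design choice. One option is to log a critical message and exit so that an external supervisor restarts the process. An in-process thread can only fire while the interpreter is still able to schedule threads, so an external check that reads a heartbeat timestamp may be more robust.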
Observations
- Gateway process did NOT crash (PID remained the same)
- Gateway continued operating normally for ~25 minutes after Telegram failure
- Gateway stopped writing to logs entirely after 01:36
- Scheduled cron jobs stopped executing (no cycles between 01:36 and 03:09)
- No error messages explaining why gateway became unresponsive
- Telegram reconnection failure appears to be the trigger, but root cause is unclear
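Because the process stays alive but silent, a stack dump taken during the hang would help narrow down the root cause. One possible approach, sketched here as an assumption and not as existing gateway behavior, is to register Python's standard `faulthandler` module on a signal at startup. `install_stack_dump_handler` is a hypothetical helper name. `faulthandler.register` is not available on Windows, which does not affect the macOS setup reported here.

```python
import faulthandler
import signal


def install_stack_dump_handler(log_file, signum=signal.SIGUSR1):
    """Dump the traceback of every thread to log_file when signum is received.

    log_file must remain open for the lifetime of the process, because
    faulthandler writes directly to its file descriptor.
    """
    faulthandler.register(signum, file=log_file, all_threads=True)
```

With this installed, running `kill -USR1 <pid>` against the stuck gateway would write the current tracebacks to the chosen file without stopping the process. The dump is written by a C-level handler, so it should work even if the Python-level event loop is blocked.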
Proposed Solutions
- Add watchdog/health check — Detect when gateway stops processing and auto-restart
- Improve error isolation — Telegram failures should not affect cron execution
- Add logging for stuck detection — Log when no activity for N minutes
- Implement circuit breaker — Stop retrying Telegram after N failures, mark as "unavailable" but continue other operations
- Add metrics/monitoring — Track last successful cron execution, last message processed, etc.
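The circuit-breaker proposal above can be illustrated with a small state holder. This is a sketch under stated assumptions, not the gateway's implementation: `CircuitBreaker` and its methods are hypothetical names, the default of 10 failures simply mirrors the "attempt 1/10" counter in the log, and the 300-second cooldown is an arbitrary placeholder.

```python
import time


class CircuitBreaker:
    """Marks a platform unavailable after repeated failures (hypothetical helper)."""

    def __init__(self, max_failures=10, cooldown_s=300.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed (platform usable)

    @property
    def available(self) -> bool:
        if self._opened_at is None:
            return True
        # Once the cooldown has elapsed, allow trial attempts again (half-open).
        return self._clock() - self._opened_at >= self.cooldown_s

    def record_success(self) -> None:
        # A successful call closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.max_failures:
            # Open (or re-open) the circuit and restart the cooldown.
            self._opened_at = self._clock()
```

The reconnect code would check `available` before each attempt and skip Telegram while it is `False`, leaving cron and other platforms unaffected. A failed trial attempt after the cooldown re-opens the circuit for another full cooldown period.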
Environment
- Hermes Agent version: Latest (2026-03-26)
- Python: 3.11
- OS: macOS
- Gateway mode: hermes gateway run --replace
Related Issues
Priority: High — Gateway becomes unresponsive, blocking all agent functionality
Component: gateway/run.py, gateway/platforms/telegram.py