Gateway becomes unresponsive after Telegram Bad Gateway (502) — reconnect loop fails #3173

@glasharocks

Bug Report: Gateway becomes unresponsive after Telegram timeout — requires manual restart

Summary

After the Telegram API returns HTTP 502 (Bad Gateway) and the reconnection attempt times out, the Gateway keeps running normally for ~25 minutes, then becomes completely unresponsive (no logs, no cron execution, no message handling) until it is manually restarted.

Timeline of Events

2026-03-26 01:11:11 — Initial Telegram error:

WARNING gateway.platforms.telegram: [Telegram] Telegram network error, scheduling reconnect: Bad Gateway
WARNING gateway.platforms.telegram: [Telegram] Telegram network error (attempt 1/10), reconnecting in 5s. Error: Bad Gateway

2026-03-26 01:11:26 — Reconnection fails:

WARNING gateway.platforms.telegram: [Telegram] Telegram polling reconnect failed: Timed out

2026-03-26 01:11 — 01:36 — Gateway continues operating normally:

  • Cron jobs execute successfully on schedule
  • Logs show normal activity
  • Duration: ~25 minutes of normal operation after Telegram failure

2026-03-26 01:37 — 03:09 — Gateway silent (~1 hour 32 minutes):

  • No log entries
  • No cron job execution (scheduled jobs did not run)
  • No response to messages
  • Process still running (PID unchanged)

2026-03-26 03:09:21 — Manual restart:

INFO gateway.run: Stopping gateway...
INFO gateway.run: Gateway stopped
INFO gateway.run: Starting Hermes Gateway...
INFO gateway.run: Gateway running with 1 platform(s)

Impact

  • Gateway becomes completely unresponsive ~25 minutes after a transient network error
  • No automatic recovery mechanism
  • No health check or watchdog to detect stuck state
  • Requires manual intervention (hermes gateway restart)
  • ~1.5 hours of downtime in this case
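
To illustrate the missing watchdog, here is a minimal sketch. It assumes the gateway runs on a single asyncio event loop, which this report does not confirm. `LoopWatchdog` and all of its parameters are hypothetical names, not part of Hermes. A heartbeat coroutine updates a timestamp from inside the loop, and a plain thread checks that timestamp, so the check keeps working even when the loop itself is blocked.

```python
import asyncio
import threading
import time


class LoopWatchdog:
    """Hypothetical helper: detects a stalled asyncio loop from another thread."""

    def __init__(self, stall_after, on_stall, check_every=1.0):
        self.stall_after = stall_after    # seconds without a heartbeat before alarming
        self.on_stall = on_stall          # callback, receives the stall duration
        self.check_every = check_every    # how often the thread checks
        self._last_beat = time.monotonic()
        self._stop = threading.Event()

    async def heartbeat(self, interval):
        # Runs inside the event loop; stops updating if the loop is blocked.
        while True:
            self._last_beat = time.monotonic()
            await asyncio.sleep(interval)

    def _watch(self):
        # Runs in a plain thread, so it is not affected by a blocked loop.
        while not self._stop.wait(self.check_every):
            stalled_for = time.monotonic() - self._last_beat
            if stalled_for > self.stall_after:
                self.on_stall(stalled_for)
                return

    def start(self):
        thread = threading.Thread(target=self._watch, daemon=True)
        thread.start()
        return thread

    def stop(self):
        self._stop.set()
```

In a real deployment `on_stall` might log a critical message and exit the process so that an external supervisor (launchd on macOS, for example) restarts it. This only detects a blocked event loop; other causes of silence would need a different signal, such as the time of the last completed cron cycle.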

Expected Behavior

  1. Gateway should continue operating even if one platform (Telegram) fails
  2. Cron jobs should continue executing regardless of messaging platform status
  3. Some form of health monitoring should detect unresponsive state
  4. Automatic restart or recovery mechanism should exist
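
Points 1 and 2 can be sketched as follows, again assuming an asyncio-based gateway (not confirmed here). `reconnect_with_deadline`, `cron_loop`, and `demo` are hypothetical names for illustration only. The idea is that a reconnect attempt gets a hard deadline and never propagates its failure, while cron work runs as an independent coroutine.

```python
import asyncio
import logging

logger = logging.getLogger("gateway.sketch")  # hypothetical logger name


async def reconnect_with_deadline(connect, timeout):
    """Try one reconnect; a hang or an error is logged, never propagated."""
    try:
        await asyncio.wait_for(connect(), timeout=timeout)
        return True
    except asyncio.TimeoutError:
        logger.warning("reconnect timed out after %.1fs", timeout)
    except Exception:
        logger.exception("reconnect failed")
    return False


async def cron_loop(ticks, interval, count):
    # Stand-in for the scheduler: records a tick on every cycle.
    for i in range(count):
        ticks.append(i)
        await asyncio.sleep(interval)


async def demo():
    async def hanging_connect():
        await asyncio.sleep(3600)  # simulates a connect call that never returns

    ticks = []
    results = await asyncio.gather(
        reconnect_with_deadline(hanging_connect, timeout=0.2),
        cron_loop(ticks, interval=0.05, count=5),
    )
    return results[0], ticks
```

Running `asyncio.run(demo())` shows the cron stand-in completing all five ticks while the reconnect attempt hangs and is then cancelled. Note that `asyncio.wait_for` can only interrupt code that is suspended at an `await`; a blocking synchronous call inside the reconnect path would still freeze the whole loop and would need to run in a thread instead.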

Observations

  • Gateway process did NOT crash (PID remained the same)
  • Gateway continued operating normally for ~25 minutes after Telegram failure
  • Gateway stopped writing to logs entirely after 01:36
  • Scheduled cron jobs stopped executing (no cycles between 01:36 and 03:09)
  • No error messages explaining why gateway became unresponsive
  • Telegram reconnection failure appears to be the trigger, but root cause is unclear
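
Because the process stayed alive but silent, a stack dump taken during the hang would help pin down the root cause. One option is Python's standard `faulthandler` module, which can write the stack of every thread when the process receives a signal. The sketch below is Unix-only (fine for macOS); `enable_stack_dump` and the log path are hypothetical, and where such a call would belong in the gateway startup code is an assumption.

```python
import faulthandler
import signal


def enable_stack_dump(log_path, signum=signal.SIGUSR1):
    """Dump the stacks of all threads to log_path when signum is received."""
    log_file = open(log_path, "a")
    # faulthandler writes straight to the file descriptor from its own
    # C-level signal handler, so it does not depend on the event loop.
    faulthandler.register(signum, file=log_file, all_threads=True)
    return log_file  # caller must keep this open for as long as the handler is registered
```

With this installed at startup, running `kill -USR1 <pid>` against the stuck process (before restarting it) would record where each thread is blocked. The dump covers thread stacks only; suspended asyncio tasks do not appear in it.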

Proposed Solutions

  1. Add watchdog/health check — Detect when gateway stops processing and auto-restart
  2. Improve error isolation — Telegram failures should not affect cron execution
  3. Add logging for stuck detection — Log when no activity for N minutes
  4. Implement circuit breaker — Stop retrying Telegram after N failures, mark as "unavailable" but continue other operations
  5. Add metrics/monitoring — Track last successful cron execution, last message processed, etc.
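
As a rough illustration of proposal 4, here is a minimal circuit breaker sketch. `CircuitBreaker` and its parameters are hypothetical and not taken from the Hermes codebase; the thresholds would have to be matched to the existing retry logic (the log above shows a 10-attempt limit).

```python
import time


class CircuitBreaker:
    """Hypothetical helper: stops retries after repeated failures, then probes again."""

    def __init__(self, max_failures, cooldown, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.cooldown = cooldown          # seconds to wait before allowing a probe
        self._clock = clock               # injectable for testing
        self._failures = 0
        self._opened_at = None            # None means the circuit is closed

    def allow(self):
        """Return True if a connection attempt may be made right now."""
        if self._opened_at is None:
            return True
        # Once the cooldown has elapsed, attempts are allowed again as probes.
        return self._clock() - self._opened_at >= self.cooldown

    def record_success(self):
        # A successful attempt closes the circuit and clears the count.
        self._failures = 0
        self._opened_at = None

    def record_failure(self):
        self._failures += 1
        if self._failures >= self.max_failures:
            # Open (or re-open) the circuit and restart the cooldown.
            self._opened_at = self._clock()
```

The reconnect loop would call `allow()` before each attempt and report the outcome with `record_success()` or `record_failure()`. While the circuit is open, the platform can be marked "unavailable" and the rest of the gateway carries on.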

Environment

  • Hermes Agent version: Latest (2026-03-26)
  • Python: 3.11
  • OS: macOS
  • Gateway mode: hermes gateway run --replace

Priority: High — Gateway becomes unresponsive, blocking all agent functionality
Component: gateway/run.py, gateway/platforms/telegram.py
