Gateway crashes on Telegram Bad Gateway (502) — reconnect loop fails

## Bug Report: Gateway becomes unresponsive after Telegram timeout — requires manual restart

### Summary

After the Telegram API returns HTTP 502 (Bad Gateway) and the reconnection attempt times out, the Gateway keeps operating normally for ~25 minutes, then **becomes completely unresponsive** (no logs, no cron execution, no message handling) until it is manually restarted.

### Timeline of Events

**2026-03-26 01:11:11** — Initial Telegram error:
```
WARNING gateway.platforms.telegram: [Telegram] Telegram network error, scheduling reconnect: Bad Gateway
WARNING gateway.platforms.telegram: [Telegram] Telegram network error (attempt 1/10), reconnecting in 5s. Error: Bad Gateway
```

**2026-03-26 01:11:26** — Reconnection fails:
```
WARNING gateway.platforms.telegram: [Telegram] Telegram polling reconnect failed: Timed out
```

**2026-03-26 01:11 — 01:36** — Gateway continues operating normally:
- Cron jobs execute successfully on schedule
- Logs show normal activity
- **Duration: ~25 minutes of normal operation after Telegram failure**

**2026-03-26 01:37 — 03:09** — **Gateway silent** (~1 hour 32 minutes):
- No log entries
- No cron job execution (scheduled jobs did not run)
- No response to messages
- Process still running (PID unchanged)

**2026-03-26 03:09:21** — Manual restart:
```
INFO gateway.run: Stopping gateway...
INFO gateway.run: Gateway stopped
INFO gateway.run: Starting Hermes Gateway...
INFO gateway.run: Gateway running with 1 platform(s)
```

### Impact

- Gateway becomes completely unresponsive ~25 minutes after a transient network error
- No automatic recovery mechanism
- No health check or watchdog to detect the stuck state
- Requires manual intervention (`hermes gateway restart`)
- ~1.5 hours of downtime in this case (01:37 to 03:09)

### Expected Behavior

1. The Gateway should continue operating even if one platform (Telegram) fails
2. Cron jobs should continue executing regardless of messaging platform status
3. Some form of health monitoring should detect an unresponsive state
4. An automatic restart or recovery mechanism should exist
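
To illustrate points 1 and 2, here is a minimal sketch of per-platform error isolation. It assumes the Gateway runs its platforms and cron scheduler as `asyncio` tasks, which this report does not confirm. `supervise`, `failing_platform`, and `cron_loop` are hypothetical names for illustration, not existing Hermes functions.

```python
import asyncio
import logging

logger = logging.getLogger("gateway.sketch")


async def supervise(name, factory, max_restarts=3, delay=0.01):
    """Run factory() and restart it on failure instead of propagating the error.

    Returns the number of restarts performed. Hypothetical helper.
    """
    restarts = 0
    while True:
        try:
            await factory()
            return restarts
        except asyncio.CancelledError:
            raise  # let normal shutdown proceed
        except Exception as exc:
            restarts += 1
            logger.warning("%s failed (%s), restart %d/%d",
                           name, exc, restarts, max_restarts)
            if restarts >= max_restarts:
                return restarts  # give up on this platform only
            await asyncio.sleep(delay)


async def demo():
    ticks = []

    async def failing_platform():
        # Stand-in for a platform whose connection keeps failing.
        raise ConnectionError("Bad Gateway")

    async def cron_loop():
        # Stand-in for the scheduler; it must keep ticking regardless.
        for i in range(5):
            ticks.append(i)
            await asyncio.sleep(0.005)

    results = await asyncio.gather(
        supervise("telegram", failing_platform),
        cron_loop(),
    )
    return results[0], ticks
```

Note that this only isolates failures that surface as exceptions. A blocking synchronous call on the event loop thread would still stall every task, so this sketch alone would not rule out the silent hang described in this report.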

### Observations

- The Gateway process did NOT crash (PID remained the same)
- The Gateway continued operating normally for ~25 minutes after the Telegram failure
- The Gateway stopped writing to logs entirely after 01:36
- Scheduled cron jobs stopped executing (no cycles between 01:36 and 03:09)
- No error messages were logged explaining why the Gateway became unresponsive
- The Telegram reconnection failure appears to be the trigger, but the root cause is unclear
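
Since the process stayed alive but went silent, a stack dump taken while it is stuck would help narrow down the root cause. The following is a minimal sketch using the standard-library `faulthandler` module (Unix only, which covers macOS). `install_stack_dump_handler` is a hypothetical helper, and calling it during Gateway startup is an assumption about where it would fit.

```python
import faulthandler
import signal
import sys


def install_stack_dump_handler(stream=sys.stderr):
    """Dump the stacks of all threads to `stream` when SIGUSR1 is received.

    `stream` must be a real file (it needs a file descriptor) and must stay
    open for as long as the handler is registered.
    """
    faulthandler.register(signal.SIGUSR1, file=stream,
                          all_threads=True, chain=False)
```

With the handler installed, running `kill -USR1 <pid>` against the stuck process should write every thread's current stack to the chosen stream without stopping the process. The dump is written from a low-level signal handler, so it is expected to work even when the Python-level event loop is blocked.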

### Proposed Solutions

1. **Add watchdog/health check**: Detect when the gateway stops processing and auto-restart it
2. **Improve error isolation**: Telegram failures should not affect cron execution
3. **Add logging for stuck detection**: Log a warning when there has been no activity for N minutes
4. **Implement circuit breaker**: Stop retrying Telegram after N failures, mark it as "unavailable", and continue other operations
5. **Add metrics/monitoring**: Track last successful cron execution, last message processed, etc.
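
As a rough illustration of proposals 1, 3, 4, and 5, the sketch below combines a heartbeat tracker with a simple circuit breaker. Both classes are hypothetical and not part of Hermes. The default threshold of 10 mirrors the `attempt 1/10` seen in the log above, and the 300-second cooldown is an arbitrary assumption.

```python
import time


class Heartbeat:
    """Records the last time each subsystem reported activity (hypothetical)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._last = {}

    def beat(self, name):
        self._last[name] = self._clock()

    def stale(self, max_age):
        """Return the names whose last beat is older than max_age seconds."""
        now = self._clock()
        return sorted(n for n, t in self._last.items() if now - t > max_age)


class CircuitBreaker:
    """Opens after `threshold` consecutive failures, then allows a probe
    once `cooldown` seconds have passed (hypothetical)."""

    def __init__(self, threshold=10, cooldown=300.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def allow(self):
        if self._opened_at is None:
            return True  # closed: calls are allowed
        # open: only allow a probe after the cooldown has elapsed
        return self._clock() - self._opened_at >= self.cooldown

    def record_success(self):
        self._failures = 0
        self._opened_at = None

    def record_failure(self):
        self._failures += 1
        if self._failures >= self.threshold:
            self._opened_at = self._clock()
```

In this sketch the reconnect loop would call `allow()` before each attempt and record the outcome, while the cron scheduler and message handler would call `beat()` after each unit of work. A separate checker could log a warning or trigger a restart whenever `stale(N * 60)` is non-empty. If the whole process is frozen, an in-process checker may not run either, so an external supervisor (for example one that inspects a heartbeat file) may be the more robust option.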

### Environment

- Hermes Agent version: Latest (2026-03-26)
- Python: 3.11
- OS: macOS
- Gateway mode: `hermes gateway run --replace`

### Related Issues

- #2910 — "Telegram message delivery failure not surfaced to user - appears as 'hang/crash'" (similar symptom: user sees hang when Telegram fails)

---

**Priority:** High — Gateway becomes unresponsive, blocking all agent functionality
**Component:** gateway/run.py, gateway/platforms/telegram.py

