Description
When the Matrix homeserver becomes unreachable (e.g., host server goes down), the gateway process enters a rapid crash loop — spawning a new process every ~2 seconds — rather than gracefully retrying or backing off.
Steps to Reproduce
- Configure OpenClaw with a Matrix channel pointing to a self-hosted Synapse instance
- Take the Matrix homeserver offline (power off the host)
- Observe gateway logs
Expected Behavior
The Matrix provider should:
- Catch connection errors gracefully
- Use exponential backoff for reconnection attempts (see the sketch after this list)
- Keep the gateway process alive (other channels like webchat should remain functional)
- Respect channelMaxRestartsPerHour for health-monitor-initiated restarts
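A minimal sketch of what the first three points could look like (illustrative only, not OpenClaw's actual code; the connectOnce callback and the delay bounds are assumptions):

```ts
// Sketch: reconnect with capped exponential backoff and jitter.
// `connectOnce` stands in for whatever call the Matrix provider makes to
// connect/sync; its name and signature are assumptions for illustration.
async function connectWithBackoff(
  connectOnce: () => Promise<void>,
  baseMs = 1_000,
  maxMs = 5 * 60_000,
): Promise<void> {
  for (let attempt = 0; ; attempt++) {
    try {
      await connectOnce();
      return; // connected; normal operation resumes
    } catch (err) {
      // Contain the connection error instead of letting it escape and
      // kill the process; log it and wait before the next attempt.
      const delay = Math.min(maxMs, baseMs * 2 ** attempt);
      const jittered = delay / 2 + Math.random() * (delay / 2);
      console.warn(
        `matrix: connect failed (attempt ${attempt + 1}), retrying in ${Math.round(jittered)}ms`,
        err,
      );
      await new Promise((resolve) => setTimeout(resolve, jittered));
    }
  }
}
```

Capping the delay keeps recovery prompt once the homeserver comes back, and the jitter avoids synchronized reconnect storms if several channels fail at once.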
Actual Behavior
The Matrix SDK appears to throw an uncaught exception on connection failure that kills the entire Node.js process. The macOS LaunchAgent (or systemd) immediately restarts it, which tries Matrix again, crashes again, creating a tight loop.
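If that reading is correct, the failure mode reduces to something like the following (a contrived repro, not the actual SDK code; the function name and URL are made up):

```ts
// Contrived repro: an async "connect" that rejects, invoked fire-and-forget.
// With no .catch() anywhere, the rejection becomes an unhandledRejection,
// which terminates the Node.js process by default (Node 15+), matching the
// observed crash immediately after each connect attempt.
async function connectToHomeserver(url: string): Promise<void> {
  throw new Error(`connect ECONNREFUSED ${url}`);
}

void connectToHomeserver("https://synapse.example.org");
```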
Evidence from logs (2026-04-07):
- 04:16–08:01 BST: Gateway process restarted with a new PID every ~2 seconds for 3.5+ hours
- PIDs increment by ~23 each time (e.g., 77007, 77030, 77053, 77096...)
- Each cycle: starts → Matrix connect attempt → process dies → LaunchAgent restarts
- channelMaxRestartsPerHour had no effect because it's the process crashing, not the health monitor restarting the channel
- Other channels (webchat) were repeatedly disconnected with code=1012 reason=service restart
Separate from health monitor restarts
The Matrix provider also has an "auto-restart attempt N/10" mechanism that does use backoff; that part works correctly. The crash loop is something else: an unhandled exception that takes down the entire gateway.
Environment
- OpenClaw version: 2026.3.13
- Node.js: v22.22.0
- OS: macOS (arm64)
- Matrix homeserver: self-hosted Synapse on NixOS
- Matrix config: allowPrivateNetwork: true
Workaround
Setting gateway.channelMaxRestartsPerHour and gateway.channelStaleEventThresholdMinutes helps with the health-monitor-initiated restarts but does not prevent the crash loop.
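For reference, those settings sit under the gateway block of the config (a sketch; the key names come from the text above, but the YAML layout and values are assumptions):

```yaml
# Workaround settings; values are examples, not recommendations.
gateway:
  channelMaxRestartsPerHour: 6
  channelStaleEventThresholdMinutes: 10
```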
Suggestion
The Matrix provider (or the SDK integration layer) needs a top-level try/catch or process-level unhandled rejection handler that prevents connection failures from crashing the gateway process. Connection errors should be caught and retried with backoff, keeping the rest of the gateway operational.
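Something along these lines at the gateway entry point would act as a last-resort safety net (a minimal sketch, assuming the crash enters via an unhandled rejection; handler placement and logging are illustrative):

```ts
// Last-resort handlers: keep the process alive when a channel's connection
// failure escapes its provider. Errors are only logged here; a real
// implementation would route them into the owning channel's retry path.
process.on("unhandledRejection", (reason) => {
  console.error("gateway: unhandled rejection (process kept alive)", reason);
});

process.on("uncaughtException", (err) => {
  console.error("gateway: uncaught exception (process kept alive)", err);
});
```

That said, a process-wide uncaughtException handler is a blunt instrument, since internal state may be inconsistent after an arbitrary throw; catching at the Matrix provider boundary (as in the backoff sketch under Expected Behavior) is the cleaner fix, with the handlers above as a backstop.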