
Matrix provider connection failure causes rapid gateway process crash loop #62376

@janstenpickle

Description

When the Matrix homeserver becomes unreachable (e.g., host server goes down), the gateway process enters a rapid crash loop — spawning a new process every ~2 seconds — rather than gracefully retrying or backing off.

Steps to Reproduce

  1. Configure OpenClaw with a Matrix channel pointing to a self-hosted Synapse instance
  2. Take the Matrix homeserver offline (power off the host)
  3. Observe gateway logs

Expected Behavior

The Matrix provider should:

  • Catch connection errors gracefully
  • Use exponential backoff for reconnection attempts (see the sketch after this list)
  • Keep the gateway process alive (other channels like webchat should remain functional)
  • Respect channelMaxRestartsPerHour for health-monitor-initiated restarts
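
A minimal sketch of the desired reconnect behavior (`connectWithBackoff` and its wiring are hypothetical stand-ins, not OpenClaw APIs): retry with capped exponential backoff instead of letting the failure escape and kill the process.

```typescript
import { setTimeout as sleep } from "node:timers/promises";

// Retry a connect function with capped exponential backoff. The function
// name and signature are illustrative; they are not part of OpenClaw.
async function connectWithBackoff(
  connect: () => Promise<void>,
  baseMs = 1_000,
  maxMs = 5 * 60_000,
): Promise<void> {
  for (let attempt = 0; ; attempt++) {
    try {
      await connect(); // e.g., the Matrix SDK's start/sync call
      return;          // connected; normal operation resumes
    } catch (err) {
      const delayMs = Math.min(baseMs * 2 ** attempt, maxMs);
      console.warn(
        `matrix: connect failed (attempt ${attempt + 1}), retrying in ${delayMs}ms`,
        err,
      );
      await sleep(delayMs); // other channels keep running meanwhile
    }
  }
}
```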

Actual Behavior

The Matrix SDK appears to throw an uncaught exception on connection failure that kills the entire Node.js process. The macOS LaunchAgent (or systemd) immediately restarts it; the new process attempts the Matrix connection again, crashes again, and the tight loop continues.
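
For illustration, the failure mode can be reproduced in a few lines (a sketch, not OpenClaw or SDK code; the hostname and error text are made up): since Node 15, an unhandled promise rejection terminates the process by default.

```typescript
// Stand-in for the SDK's connect/sync call while the homeserver is down.
async function connectToHomeserver(): Promise<void> {
  throw new Error("connect ECONNREFUSED matrix.example.com:8448");
}

// Fire-and-forget with no .catch(): Node reports ERR_UNHANDLED_REJECTION
// and exits non-zero; the supervisor (LaunchAgent/systemd) respawns it
// immediately, producing the observed ~2-second loop.
void connectToHomeserver();

// Everything else the gateway was doing dies with the process.
setInterval(() => console.log("serving other channels..."), 1_000);
```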

Evidence from logs (2026-04-07):

  • 04:16–08:01 BST: Gateway process restarted with a new PID every ~2 seconds for 3.5+ hours
  • PIDs increment by ~23 each time (e.g., 77007, 77030, 77053, 77096...)
  • Each cycle: starts → Matrix connect attempt → process dies → LaunchAgent restarts
  • channelMaxRestartsPerHour had no effect because it's the process crashing, not the health monitor restarting the channel
  • Other channels (webchat) were repeatedly disconnected with code=1012 reason=service restart

Separate from health monitor restarts

The Matrix provider also has an auto-restart mechanism (logged as attempt N/10) that does use backoff; that part works correctly. The crash loop is something else entirely: an unhandled exception that takes down the whole gateway process.

Environment

  • OpenClaw version: 2026.3.13
  • Node.js: v22.22.0
  • OS: macOS (arm64)
  • Matrix homeserver: self-hosted Synapse on NixOS
  • Matrix config: allowPrivateNetwork: true

Workaround

Setting gateway.channelMaxRestartsPerHour and gateway.channelStaleEventThresholdMinutes helps with the health-monitor-initiated restarts but does not prevent the crash loop.
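
For clarity, a hypothetical sketch of the settings involved (key names are taken from this report; the types and placement are assumptions about OpenClaw's config):

```typescript
// Hypothetical shape of the gateway health settings referenced above.
interface GatewayHealthConfig {
  /** Cap on health-monitor-initiated channel restarts per hour. */
  channelMaxRestartsPerHour?: number;
  /** Minutes without events before a channel is considered stale. */
  channelStaleEventThresholdMinutes?: number;
}
```

These only govern the health monitor's restarts of an individual channel; they never see the crash because the whole process dies first.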

Suggestion

The Matrix provider (or the SDK integration layer) needs a top-level try/catch or process-level unhandled rejection handler that prevents connection failures from crashing the gateway process. Connection errors should be caught and retried with backoff, keeping the rest of the gateway operational.
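
A sketch of what that could look like (handler wiring and names are illustrative, not OpenClaw's actual API): catch errors at the provider boundary first, with process-level handlers only as a last-resort backstop.

```typescript
// Last-resort safety net: keep the process alive on stray rejections.
process.on("unhandledRejection", (reason) => {
  console.error("unhandled rejection (provider error?):", reason);
});

process.on("uncaughtException", (err) => {
  // Node's docs discourage resuming after uncaughtException, so this is
  // a logging backstop; the real fix is the provider-level catch below.
  console.error("uncaught exception:", err);
});

// Preferred: the Matrix integration layer wraps its own lifecycle so
// connection failures never escape to the handlers above.
async function startMatrixChannel(start: () => Promise<void>): Promise<void> {
  try {
    await start();
  } catch (err) {
    console.error("matrix: start failed, scheduling backoff retry", err);
    // e.g., hand off to connectWithBackoff() from the earlier sketch
  }
}
```

With the provider-level catch in place, an unreachable homeserver degrades only the Matrix channel while webchat and the rest of the gateway stay up.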
