Gateway crash loop: no backoff or circuit-breaker on auto-restart #60142

@yhyatt

Description

When a config schema change causes startup validation to fail, the gateway exits with code 1 and systemd restarts it indefinitely — no backoff, no circuit breaker, no user notification.

Real-World Incident

After upgrading from 2026.3.28 → 2026.4.1, the tools.web.search config schema changed. The old key path was now invalid, causing a hard config validation failure on every startup. systemd restarted the gateway 6,198 times over 12.5 hours (20:20 Apr 2 → 08:55 Apr 3) before the user manually intervened.

Actual Log (journalctl)

Apr 02 20:20:44 node[1768617]: [gateway] signal SIGTERM received
Apr 02 20:20:44 node[1768617]: [gateway] received SIGTERM; shutting down
Apr 02 20:20:46 node[2645786]: Config invalid
Apr 02 20:20:46 node[2645786]:   - tools.web.search: Unrecognized key: "brave"
Apr 02 20:20:46 systemd[332]: openclaw-gateway.service: Main process exited, code=exited, status=1/FAILURE
Apr 02 20:20:46 systemd[332]: openclaw-gateway.service: Failed with result 'exit-code'.
Apr 02 20:20:52 systemd[332]: openclaw-gateway.service: Scheduled restart job, restart counter is at 1.
Apr 02 20:20:53 node[2645810]: Config invalid
Apr 02 20:20:53 node[2645810]:   - tools.web.search: Unrecognized key: "brave"
Apr 02 20:20:54 systemd[332]: openclaw-gateway.service: Main process exited, code=exited, status=1/FAILURE
Apr 02 20:20:54 systemd[332]: openclaw-gateway.service: Failed with result 'exit-code'.
Apr 02 20:20:59 systemd[332]: openclaw-gateway.service: Scheduled restart job, restart counter is at 2.
[... same pattern every ~5-7 seconds ...]
Apr 03 08:49:30 systemd[332]: openclaw-gateway.service: Scheduled restart job, restart counter is at 6197.
Apr 03 08:49:38 systemd[332]: openclaw-gateway.service: Scheduled restart job, restart counter is at 6198.

Duration: ~12.5 hours. Restarts: 6,198. Average interval: ~7 seconds.

Steps to Reproduce

  1. Upgrade openclaw from 2026.3.28 → 2026.4.1
  2. Have tools.web.search.brave key in openclaw.json (old schema path)
  3. Restart gateway — immediately enters crash loop

Expected Behavior

After 3 consecutive startup failures within 60 seconds:

  1. Stop auto-restarting
  2. Write a clear message to logs: "Gateway failed to start 3 times in 60s — possible config error. Check ~/.openclaw/openclaw.json. Run 'openclaw doctor' to diagnose."
  3. Exit cleanly (let the user fix the config and restart manually)

Suggested Fix

Option A (application-level): On startup, check a restart-sentinel file (already exists: src/infra/restart-sentinel.ts). If it records 3 or more restarts within 60 seconds, log an error and exit with a distinct code (e.g. 78 = EX_CONFIG) that systemd's RestartPreventExitStatus= can match, so systemd stops restarting the service.
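
A minimal sketch of the Option A check, written as pure functions so it can be tested without touching disk. Every name here (recentStarts, shouldHalt, recordFailure, haltMessage) is hypothetical, and this is not the actual API of src/infra/restart-sentinel.ts. It assumes the sentinel file stores a list of failed-start timestamps in milliseconds and is cleared after a successful start.

```typescript
// Hypothetical sketch: names and state layout are assumptions, not the
// real contents of src/infra/restart-sentinel.ts.

const EX_CONFIG = 78;      // sysexits.h code for configuration errors
const MAX_FAILURES = 3;    // threshold proposed in this issue
const WINDOW_MS = 60_000;  // 60-second window proposed in this issue

/** Keep only the failure timestamps inside the window that ends at `now`. */
function recentStarts(failedStarts: number[], now: number, windowMs: number = WINDOW_MS): number[] {
  return failedStarts.filter((t) => t <= now && now - t < windowMs);
}

/** True when enough recent failures are recorded that startup should stop. */
function shouldHalt(failedStarts: number[], now: number): boolean {
  return recentStarts(failedStarts, now).length >= MAX_FAILURES;
}

/** Append a new failure and drop entries that have aged out of the window. */
function recordFailure(failedStarts: number[], now: number): number[] {
  return [...recentStarts(failedStarts, now), now];
}

/** The log message requested under "Expected Behavior". */
function haltMessage(): string {
  return (
    `Gateway failed to start ${MAX_FAILURES} times in ${WINDOW_MS / 1000}s - possible config error. ` +
    "Check ~/.openclaw/openclaw.json. Run 'openclaw doctor' to diagnose."
  );
}
```

In a real startup path, the caller would load the timestamps from the sentinel file, call shouldHalt before config validation, and on a positive result log haltMessage() and exit with EX_CONFIG. On a validation failure it would persist the result of recordFailure before exiting with status 1.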

Option B (systemd unit): Add the recommended StartLimitBurst=3 and StartLimitIntervalSec=60 settings to the [Unit] section of the generated systemd unit file (src/daemon/node-service.ts), and document them.
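
As a rough illustration of Option B, the generator could emit the rate-limit directives along these lines. The RestartPolicy type and renderRestartPolicy function are hypothetical and may not match how src/daemon/node-service.ts actually builds the unit. Restart=on-failure and the RestartSec value are assumptions too, since the issue does not show the current unit file. The StartLimit* directives go in [Unit], and RestartPreventExitStatus= goes in [Service].

```typescript
// Hypothetical helper: the real generator in src/daemon/node-service.ts
// may be structured differently.

interface RestartPolicy {
  burst: number;               // StartLimitBurst: starts allowed per interval
  intervalSec: number;         // StartLimitIntervalSec: rate-limit window
  restartSec: number;          // RestartSec: delay before each restart (assumed value)
  preventExitStatus: number[]; // exit codes that must never trigger a restart
}

/** Render the restart-related directives of a systemd unit file. */
function renderRestartPolicy(p: RestartPolicy): string {
  return [
    "[Unit]",
    `StartLimitIntervalSec=${p.intervalSec}`,
    `StartLimitBurst=${p.burst}`,
    "",
    "[Service]",
    "Restart=on-failure", // assumption: the existing unit may use a different policy
    `RestartSec=${p.restartSec}`,
    `RestartPreventExitStatus=${p.preventExitStatus.join(" ")}`,
  ].join("\n");
}

// Values from this issue: 3 starts per 60 s, and no restart on EX_CONFIG (78).
const fragment = renderRestartPolicy({
  burst: 3,
  intervalSec: 60,
  restartSec: 5,
  preventExitStatus: [78],
});
console.log(fragment);
```

With these values, systemd refuses further starts once 3 have happened inside 60 seconds and marks the unit failed, which covers crashes that occur before the application-level check can run.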

Both options should be implemented — A for user-visible feedback, B as a safety net.

Environment

  • OpenClaw: 2026.4.1
  • OS: Linux (WSL2, systemd user session)
  • Supervisor: systemd
