doctor --fix self-terminates with SIGTERM when gateway is running #78217

@esqandil

Summary

openclaw doctor --fix aborts with SIGTERM when the gateway service is running, instead of either completing safe-live fixes or cleanly skipping them.

Observed multiple times in the same session:

  • Gateway active under systemd user supervision (systemctl --user status openclaw-gateway.service)
  • /healthz returning {"ok":true,"status":"live"}
  • openclaw doctor --fix --non-interactive emits some output, then exits via SIGTERM before completing all fix actions
  • Gateway-port section of doctor output includes:
    Health check failed: GatewayTransportError: gateway timeout after 3000ms
    Gateway target: ws://127.0.0.1:18789
    ...
    Port 18789 is already in use.
    - pid <X> shadeform: openclaw (127.0.0.1:18789)
    - Gateway already running locally. Stop it (openclaw gateway stop) or use a different port.
    

The doctor self-confuses: it can't talk to the gateway over WS (3-second timeout), then concludes the port is "already in use" (by the same gateway that just answered /healthz), then SIGTERMs itself.
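The gap between "is the port bound?" and "is something serving on it?" can be illustrated with a minimal sketch. This is not doctor's actual probe code; the function names `port_is_bound` and `port_accepts_connections` are hypothetical, and the sketch only assumes plain TCP sockets on loopback. A bind attempt against a port held by a live listener fails with EADDRINUSE, while a connect attempt against the same port succeeds, so "address in use" alone does not indicate a conflict.

```python
import errno
import socket


def port_is_bound(host: str, port: int) -> bool:
    """Return True if trying to bind host:port fails with EADDRINUSE."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return True
            raise  # some other failure (permissions, bad address, ...)
        return False


def port_accepts_connections(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connect to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Against a healthy gateway on 127.0.0.1:18789, both functions would be expected to return True, which is the "already in use" state the doctor output reports.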

Environment

  • OpenClaw 2026.5.4 (commit 325df3e)
  • Linux (Ubuntu 24.04)
  • Gateway running under systemd user supervision (systemctl --user)

Reproduction

  1. Have gateway running and healthy: curl http://127.0.0.1:18789/healthz returns {"ok":true,"status":"live"}
  2. Run openclaw doctor --fix --non-interactive
  3. Observe the run aborts via SIGTERM before completing all fixes
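To confirm that a run really ended by signal rather than by a non-zero exit code, a small harness like the one below may help. It is a sketch, not part of OpenClaw: `describe_exit` and `run_and_describe` are hypothetical helper names. It relies only on the documented POSIX behavior of Python's `subprocess`, where a child killed by signal N is reported with return code -N.

```python
import signal
import subprocess
from typing import Sequence


def describe_exit(returncode: int) -> str:
    """Translate a subprocess return code into a readable description.

    On POSIX, subprocess reports death by signal N as returncode == -N.
    """
    if returncode < 0:
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with code {returncode}"


def run_and_describe(argv: Sequence[str], timeout: float = 120.0) -> str:
    """Run a command to completion and describe how it ended."""
    completed = subprocess.run(
        list(argv), capture_output=True, text=True, timeout=timeout
    )
    return describe_exit(completed.returncode)


# Example call for step 2 of the reproduction:
# run_and_describe(["openclaw", "doctor", "--fix", "--non-interactive"])
```

This only shows that the process died from SIGTERM; it cannot tell whether the signal was sent by doctor itself or by another process.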

Expected behavior

doctor --fix should either:

  • (a) preflight-classify each fix as safe-live vs requires-restart, apply the safe-live ones, and skip-with-explanation the others, OR
  • (b) refuse to run with --fix against a live gateway and tell the user to stop it first

What it should NOT do is partially run, then SIGTERM itself mid-stream, leaving the user uncertain about what got applied.
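Option (a) could look roughly like the following sketch. Everything here is hypothetical: the `FixKind` categories mirror the safe-live / requires-restart wording above, but the `Fix` type, the `plan_fixes` function, and any fix names are illustrative and do not reflect doctor's real internals.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List, Tuple


class FixKind(Enum):
    SAFE_LIVE = "safe-live"
    REQUIRES_RESTART = "requires-restart"


@dataclass(frozen=True)
class Fix:
    name: str  # hypothetical identifier for a fix action
    kind: FixKind


def plan_fixes(
    fixes: Iterable[Fix], gateway_live: bool
) -> Tuple[List[Fix], List[Tuple[Fix, str]]]:
    """Split fixes into (to_apply, skipped) before anything is executed.

    With a live gateway, requires-restart fixes are skipped with a reason
    instead of being attempted and aborted midway.
    """
    to_apply: List[Fix] = []
    skipped: List[Tuple[Fix, str]] = []
    for fix in fixes:
        if gateway_live and fix.kind is FixKind.REQUIRES_RESTART:
            skipped.append((fix, "gateway is running; stop it and re-run"))
        else:
            to_apply.append(fix)
    return to_apply, skipped
```

Because the plan is computed up front, the user can be shown exactly which fixes will run and which are skipped, which removes the uncertainty about what got applied.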

Impact

  • Operators can't use --fix for routine maintenance against a running gateway
  • Manual cleanup is required (renaming orphan transcripts, archiving stale agent dirs, etc.)
  • Self-termination during incident response increases risk and confusion

Workarounds

  • Do cleanup steps manually: archive orphan transcripts via mv *.jsonl *.deleted.<ts>, move stale agent dirs to a .archived/ sibling, etc.
  • Use openclaw doctor --non-interactive (no --fix) to validate state — that one runs cleanly against a live gateway
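The manual cleanup can be scripted. The sketch below is an assumption-heavy illustration: the target naming (`<name>.jsonl.deleted.<ts>`) and the timestamp format are guesses based on the shorthand above, not a documented OpenClaw convention, and the caller is responsible for deciding which transcripts are orphaned and which agent dirs are stale. `archive_transcripts` and `move_to_archived` are hypothetical names.

```python
import shutil
import time
from pathlib import Path
from typing import Iterable, List, Optional


def archive_transcripts(
    paths: Iterable[Path], timestamp: Optional[str] = None
) -> List[Path]:
    """Rename each transcript to <name>.deleted.<ts> in place (assumed layout)."""
    ts = timestamp or time.strftime("%Y%m%dT%H%M%SZ", time.gmtime())
    archived: List[Path] = []
    for path in paths:
        target = path.with_name(f"{path.name}.deleted.{ts}")
        if target.exists():
            raise FileExistsError(target)  # never overwrite an earlier archive
        path.rename(target)
        archived.append(target)
    return archived


def move_to_archived(agent_dir: Path) -> Path:
    """Move a stale agent dir into a .archived/ sibling directory."""
    archive_root = agent_dir.parent / ".archived"
    archive_root.mkdir(exist_ok=True)
    target = archive_root / agent_dir.name
    if target.exists():
        raise FileExistsError(target)
    shutil.move(str(agent_dir), str(target))
    return target
```

Both helpers refuse to overwrite an existing target, so a repeated run fails loudly rather than clobbering an earlier archive.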

Suggested fix

  1. The gateway-port check should not emit the "Port already in use" warning when the PID holding the port is the running gateway service's PID (it's the same process answering /healthz and listening on 18789, so there's no conflict).
  2. The 3s WS timeout on a loopback gateway is too short under load; bump to 10s or read from config.
  3. The SIGTERM appears to come from doctor's own port-bind probe trying to bind 18789 itself to "verify" it's free. That probe should be skipped when a live gateway is detected.
  4. Consider whether --fix should ever be allowed against a running gateway — if not, exit cleanly with code 78 ("config valid, but live gateway detected; stop it first") rather than SIGTERM.
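Items 1 and 4 can be sketched as a small decision helper. This is an illustration under assumptions, not doctor's code: `parse_main_pid`, `gateway_main_pid`, and `port_conflict` are hypothetical names, and the sketch assumes the service's systemd MainPID is the process that holds the port (a wrapper or launcher process would break that assumption). The `systemctl show -p MainPID --value` invocation is standard systemd; the unit name is the one from this report.

```python
import subprocess
from typing import Optional

# Exit code proposed in item 4 (78 is EX_CONFIG in sysexits.h).
EXIT_LIVE_GATEWAY = 78


def parse_main_pid(output: str) -> Optional[int]:
    """Parse `systemctl show -p MainPID --value` output; 0 means not running."""
    text = output.strip()
    if not text.isdigit():
        return None
    return int(text) or None


def gateway_main_pid(unit: str = "openclaw-gateway.service") -> Optional[int]:
    """Ask the systemd user manager for the unit's main PID."""
    result = subprocess.run(
        ["systemctl", "--user", "show", "-p", "MainPID", "--value", unit],
        capture_output=True, text=True, check=False,
    )
    return parse_main_pid(result.stdout)


def port_conflict(listener_pid: Optional[int], gateway_pid: Optional[int]) -> bool:
    """True only when the port is held by a process other than the gateway."""
    if listener_pid is None:
        return False  # nothing is listening, so nothing conflicts
    return listener_pid != gateway_pid
```

With this shape, a listener whose PID equals the service PID is treated as "gateway is live" rather than as a port conflict, and a caller that disallows `--fix` against a live gateway could exit with `EXIT_LIVE_GATEWAY` instead of continuing to the bind probe.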
