
respawnGatewayProcessForUpdate falsely reports mode=supervised on macOS when XPC_SERVICE_NAME is inherited from a launchd-managed parent #85224

@richardmqq


Summary

On macOS, respawnGatewayProcessForUpdate() (and restartGatewayProcessWithFreshPid()) trusts detectRespawnSupervisor() to decide whether launchd will restart the gateway. The detector returns "launchd" if any of LAUNCH_JOB_LABEL, LAUNCH_JOB_NAME, XPC_SERVICE_NAME, or OPENCLAW_LAUNCHD_LABEL is set.

But XPC_SERVICE_NAME is inherited by any child process of a launchd-managed parent. When OpenClaw's GUI app (ai.openclaw.mac) spawns the gateway as a child — or any custom supervisor inherits launchd env from its own parent — the gateway misidentifies itself as launchd-supervised.

Result: the gateway writes gateway-supervisor-restart-handoff.json with supervisorMode: "launchd" and exits cleanly, expecting launchd to restart it. But launchd has no ai.openclaw.gateway service registered (only ai.openclaw.mac, for the parent), so the gateway never comes back.

Environment

  • OpenClaw: confirmed in 2026.5.19 (where I first hit it) and verified still present in 2026.5.20 (currently installed) by reading dist/supervisor-markers-B5EgETF5.js and dist/cli/gateway-lifecycle.runtime.js.
  • Node: 25.x
  • OS: macOS 15 (Darwin 25.4)
  • Precondition: the OpenClaw GUI app is running and the ai.openclaw.gateway LaunchAgent has been unloaded (e.g. by a prior mode=reload restart script that ran launchctl bootout followed by a failed launchctl bootstrap, or by doctor's legacy-service cleanup).
  • Trigger event: any SIGUSR1 / update.run restart, even a dry-run one that ends with status=skipped.

Code-level trace (against 2026.5.20)

dist/supervisor-markers-B5EgETF5.js:

const SUPERVISOR_HINTS = {
  launchd: ["LAUNCH_JOB_LABEL", "LAUNCH_JOB_NAME", "XPC_SERVICE_NAME", "OPENCLAW_LAUNCHD_LABEL"],
  // ...
};
function detectRespawnSupervisor(env = process.env, platform = process.platform) {
  if (platform === "darwin") return hasAnyHint(env, SUPERVISOR_HINTS.launchd) ? "launchd" : null;
  // ...
}
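
The false positive can be reproduced in isolation. The following is a minimal sketch, not the shipped code: it assumes hasAnyHint simply checks for a non-blank value on any listed variable (its real implementation is not shown in the excerpt above), it omits the non-darwin branches, and the XPC_SERVICE_NAME value is illustrative.

```javascript
// Sketch only: hasAnyHint is an assumed implementation, not copied from dist/.
const SUPERVISOR_HINTS = {
  launchd: ["LAUNCH_JOB_LABEL", "LAUNCH_JOB_NAME", "XPC_SERVICE_NAME", "OPENCLAW_LAUNCHD_LABEL"],
};

// Assumption: a hint counts when the variable is set to a non-blank string.
function hasAnyHint(env, keys) {
  return keys.some((key) => typeof env[key] === "string" && env[key].trim() !== "");
}

function detectRespawnSupervisor(env = process.env, platform = process.platform) {
  if (platform === "darwin") return hasAnyHint(env, SUPERVISOR_HINTS.launchd) ? "launchd" : null;
  return null; // other platforms omitted in this sketch
}

// Environment of a gateway spawned as a child of the GUI app: only the
// inherited XPC_SERVICE_NAME is present (illustrative value), yet the
// detector still reports launchd.
const inheritedEnv = { XPC_SERVICE_NAME: "ai.openclaw.mac" };
const detected = detectRespawnSupervisor(inheritedEnv, "darwin");
console.log(detected); // prints "launchd"
```

Under these assumptions the detector cannot distinguish "my parent is a launchd job" from "I am a launchd job", which is the core of the bug.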

dist/cli/gateway-lifecycle.runtime.js:

function respawnGatewayProcessForUpdate(opts = {}) {
  if (isTruthy(process.env.OPENCLAW_NO_RESPAWN)) return { mode: "disabled", detail: "OPENCLAW_NO_RESPAWN" };
  const supervisor = detectRespawnSupervisor(process.env);
  if (supervisor) {
    if (supervisor === "schtasks") { /* ... */ }
    return { mode: "supervised" };  // ← false positive on darwin
  }
  // fallback: spawnDetachedGatewayProcess(...)
}

Observed sequence

09:32:30  update.run dry-run: current 2026.5.19 → target 2026.5.20, status=skipped
09:32:47  gateway PID 84633 receives SIGUSR1
09:33:17  drain timeout (2 tasks + 1 embedded run still active)
09:33:18  shutdown completed cleanly; "restart mode: update process respawn (supervisor restart)"
          → writes handoff.json with supervisorMode=launchd, sleeps 1500ms, exit(0)
[gap]     launchd does NOT restart anything (no ai.openclaw.gateway service registered)
09:33:21  OpenClaw GUI app fallback-spawns a new gateway child (PPID = GUI app, XPC inherited)
09:33:24  new gateway calls cleanStaleGatewayProcessesSync, kills PID 15647 (leftover on :18789)
09:33:57  another banner — yet another spawn, but hits the same bug, exits
[after]   no further gateway log entries; gateway is gone for ~2.5h until I manually
          `launchctl bootstrap`ed ai.openclaw.gateway.plist

Why this matters

The failure is catastrophic and silent: the user's chat bots, agents, and integrations all go offline with no error visible to the gateway operator. Recovery requires enough CLI/launchctl knowledge to discover that the service is unloaded.

Proposed fix

In src/infra/supervisor-markers.ts, narrow darwin detection to OpenClaw's own explicit marker so inherited generic launchd env vars don't trigger a false positive:

function detectRespawnSupervisor(env, platform) {
  if (platform === "darwin") {
    // Only trust the openclaw-specific marker; XPC_SERVICE_NAME and friends
    // are inherited by any child of a launchd-managed process and do not
    // mean *this* process is registered as a launchd service.
    return env.OPENCLAW_LAUNCHD_LABEL?.trim() ? "launchd" : null;
  }
  // ...
}
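
To illustrate the intended behavior, here is a small self-contained check of the narrowed darwin branch. It is a sketch under the same assumptions as the proposal: other platforms are omitted, and the environment values are illustrative rather than captured from a real process.

```javascript
// Sketch of the narrowed darwin check; other platforms are omitted here.
function detectRespawnSupervisor(env = process.env, platform = process.platform) {
  if (platform === "darwin") {
    return env.OPENCLAW_LAUNCHD_LABEL?.trim() ? "launchd" : null;
  }
  return null;
}

const cases = [
  // Child of the GUI app: inherited generic marker only.
  { XPC_SERVICE_NAME: "ai.openclaw.mac" },
  // Gateway started from its own LaunchAgent with the explicit marker.
  { XPC_SERVICE_NAME: "ai.openclaw.gateway", OPENCLAW_LAUNCHD_LABEL: "ai.openclaw.gateway" },
  // A blank marker is treated as unset.
  { OPENCLAW_LAUNCHD_LABEL: "   " },
];
const results = cases.map((env) => detectRespawnSupervisor(env, "darwin"));
console.log(results); // prints [ null, 'launchd', null ]
```

With this change, an inherited XPC_SERVICE_NAME alone falls through to the detached-spawn fallback instead of reporting mode=supervised.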

As a belt-and-suspenders check, the detector could also verify that the service is actually registered before returning "launchd", e.g. via launchctl print "gui/$(id -u)/$LABEL".
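
One way the registration check might look in Node is sketched below. The helper names isLaunchdServiceRegistered and runLaunchctlPrint are hypothetical, and the sketch assumes launchctl print exits with a non-zero status when the service target is unknown. The runner is injectable so the logic can be exercised without launchd.

```javascript
const { spawnSync } = require("node:child_process");

// Default runner: returns launchctl's exit status, or null if launchctl
// could not be started at all (e.g. on a non-macOS host).
function runLaunchctlPrint(target) {
  return spawnSync("launchctl", ["print", target], { stdio: "ignore" }).status;
}

// Hypothetical helper: true only when launchctl knows the service in the
// given user's GUI domain. process.getuid() is available on POSIX platforms.
function isLaunchdServiceRegistered(label, uid = process.getuid(), run = runLaunchctlPrint) {
  if (!label || !label.trim()) return false;
  return run(`gui/${uid}/${label.trim()}`) === 0;
}

// Example with a stubbed runner that reports the service as missing.
const registered = isLaunchdServiceRegistered("ai.openclaw.gateway", 501, () => 113);
console.log(registered); // prints false
```

The cost is one synchronous subprocess call on the restart path, which seems acceptable given that the alternative is exiting with nothing to restart the gateway.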

Operators who run the gateway under launchd should ensure that ai.openclaw.gateway.plist sets OPENCLAW_LAUNCHD_LABEL=ai.openclaw.gateway in its EnvironmentVariables. It would also be worth adding this to the bundled plist generator, so the marker is set by default.
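
For reference, the plist entry in question could be produced by something like the following. This is only a sketch: renderEnvironmentVariables and escapeXml are hypothetical helpers, not part of the existing plist generator, and the output is just the EnvironmentVariables fragment rather than a complete LaunchAgent plist.

```javascript
// Hypothetical helpers; the real plist generator may be structured differently.
function escapeXml(value) {
  return value.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Renders the EnvironmentVariables entry so the gateway process started by
// launchd sees its own label in OPENCLAW_LAUNCHD_LABEL.
function renderEnvironmentVariables(label) {
  return [
    "<key>EnvironmentVariables</key>",
    "<dict>",
    "  <key>OPENCLAW_LAUNCHD_LABEL</key>",
    `  <string>${escapeXml(label)}</string>`,
    "</dict>",
  ].join("\n");
}

const fragment = renderEnvironmentVariables("ai.openclaw.gateway");
console.log(fragment);
```

Because launchd sets these variables only for the job it starts from this plist, the marker identifies the gateway's own service rather than an inherited parent context, provided nothing else in the spawn chain exports it.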

Related (not duplicates)

Workaround

Set OPENCLAW_NO_RESPAWN=1 to force an in-process restart (this loses the fresh-module-graph benefit on real update.run upgrades, but survives spurious dry-run restarts).

Or: make sure ai.openclaw.gateway is bootstrapped into launchd before relying on update-triggered restarts.

Labels

  • P1: High-priority user-facing bug, regression, or broken workflow.
  • clawsweeper:fix-shape-clear: ClawSweeper found a clear likely implementation shape for this issue.
  • clawsweeper:queueable-fix: ClawSweeper marked this issue as an existing queue_fix_pr work candidate.
  • clawsweeper:source-repro: ClawSweeper found a high-confidence source-level issue reproduction.
  • impact:crash-loop: Crash, hang, restart loop, or process-level availability failure.
  • issue-rating: 🦞 diamond lobster: Very strong issue quality with high-confidence source-level or clear reproduction.
