gateway restart/update can fail to come back when respawn reuses unstable package-manager paths #52313

@RichardCao

Summary

Message-triggered restart / update.run can occasionally fail to bring the gateway back after shutdown.

What is happening

The gateway run loop already tries to do a full fresh-process restart after SIGUSR1, but restartGatewayProcessWithFreshPid() currently respawns the child with process.execArgv + process.argv.slice(1).

That is brittle when the running process was launched from a package-manager-managed realpath, especially pnpm versioned paths like:
node_modules/.pnpm/openclaw@<version>/node_modules/openclaw/dist/entry.js

During self-update, that versioned realpath may be replaced or removed. The parent exits cleanly, but the child can then be spawned against an entrypoint that is no longer stable, which matches the observed "message channel restart/update ran, process went down, did not come back" behavior.
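
To make the failure mode concrete, here is a minimal sketch of the argument reuse described above. The helper name buildRespawnArgs, the version 1.2.3, the install prefix, and the trailing "gateway" argument are all hypothetical and only illustrate the shape; the actual restartGatewayProcessWithFreshPid() may differ in detail.

```typescript
// Hypothetical illustration of "execArgv + argv.slice(1)" respawn arguments.
// Takes the two arrays as parameters so the sketch stays a pure function.
function buildRespawnArgs(
  execArgv: readonly string[],
  argv: readonly string[],
): string[] {
  // argv[0] is the node binary; argv[1] is the script path the process was started with.
  return [...execArgv, ...argv.slice(1)];
}

// Assumed example values (not taken from a real installation).
const exampleArgv = [
  "/usr/bin/node",
  "/srv/app/node_modules/.pnpm/openclaw@1.2.3/node_modules/openclaw/dist/entry.js",
  "gateway",
];

// The child is pinned to the versioned pnpm path, whatever happens to it on disk.
const exampleRespawnArgs = buildRespawnArgs(["--enable-source-maps"], exampleArgv);
```

If the update removes the openclaw@1.2.3 directory before the child starts, the second element of exampleRespawnArgs points at a file that no longer exists.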

Why this matters

This is most visible on message-triggered flows because the restart/update is initiated from the running gateway itself, so there is no external operator retrying the command.

Expected behavior

Gateway self-restart should respawn via a stable wrapper/symlinked CLI entrypoint that survives package updates, not by blindly reusing the current argv path.

Proposed fix

Before detached respawn:

  • detect pnpm versioned OpenClaw realpaths and rewrite them back to the stable node_modules/openclaw/openclaw.mjs wrapper
  • otherwise, if the package root is known, respawn through <packageRoot>/openclaw.mjs
  • keep current argv unchanged for dev/source entrypoints such as src/entry.ts
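
The three rules above could be sketched roughly as follows. This is not the prepared fix: resolveStableEntry, the two regular expressions, and the example paths in the comments are assumptions made for illustration, and the real change in src/infra/process-respawn.ts may detect paths differently.

```typescript
// Hypothetical sketch of the entrypoint rewrite; names are illustrative only.

// Matches <prefix>/node_modules/.pnpm/openclaw@<version>/node_modules/openclaw/dist/entry.js
// and captures <prefix>/node_modules. Accepts both "/" and "\" separators.
const PNPM_VERSIONED_ENTRY =
  /^((?:.*[\\/])?node_modules)[\\/]\.pnpm[\\/]openclaw@[^\\/]+[\\/]node_modules[\\/]openclaw[\\/]dist[\\/]entry\.js$/;

// Dev/source entrypoints such as src/entry.ts.
const DEV_ENTRY = /(?:^|[\\/])src[\\/]entry\.ts$/;

function resolveStableEntry(entry: string, packageRoot?: string): string {
  // Rule 3: leave dev/source entrypoints untouched.
  if (DEV_ENTRY.test(entry)) return entry;

  // Rule 1: rewrite pnpm versioned realpaths to the stable wrapper.
  const match = PNPM_VERSIONED_ENTRY.exec(entry);
  if (match) return `${match[1]}/openclaw/openclaw.mjs`;

  // Rule 2: otherwise respawn through <packageRoot>/openclaw.mjs when the root is known.
  if (packageRoot) return `${packageRoot.replace(/[\\/]+$/, "")}/openclaw.mjs`;

  // Fallback: nothing known about the layout, keep the current entry.
  return entry;
}
```

The dev check runs first in this sketch so that a known package root does not rewrite src/entry.ts; the remaining arguments (execArgv and everything after the entry path) would be passed through unchanged.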

Validation

I have a fix prepared locally on top of the latest green main commit:

  • green base: 52a0aa06723fbad5e7c2b0fc07fe04eef433d1c7
  • targeted tests: pnpm exec vitest run src/infra/process-respawn.test.ts
  • targeted lint: pnpm exec oxlint --type-aware src/infra/process-respawn.ts src/infra/process-respawn.test.ts

Full-repo tsc currently reports unrelated pre-existing test typing errors around fetch mocks, so I am not using that as the gating signal for this issue.

Labels

  • P1: High-priority user-facing bug, regression, or broken workflow.
  • clawsweeper:fix-shape-clear: ClawSweeper found a clear likely implementation shape for this issue.
  • clawsweeper:queueable-fix: ClawSweeper marked this issue as an existing queue_fix_pr work candidate.
  • clawsweeper:source-repro: ClawSweeper found a high-confidence source-level issue reproduction.
  • impact:crash-loop: Crash, hang, restart loop, or process-level availability failure.
  • issue-rating: 🦞 diamond lobster: Very strong issue quality with high-confidence source-level or clear reproduction.
  • stale: Marked as stale due to inactivity.
