fix(update): prevent gateway crash loop after failed self-update #18131

Merged
steipete merged 2 commits into openclaw:main from RamiNoodle733:fix/self-update-crash-loop on Feb 16, 2026

RamiNoodle733 (Contributor) commented Feb 16, 2026

Summary

Fixes a bug where the bot breaks and requires manual intervention whenever a user asks it to update itself or modify its config via chat (Telegram/WhatsApp). The root cause is a combination of three bugs in the self-update pipeline:

  • Gateway always restarts after update, even on failure (server-methods/update.ts): scheduleGatewaySigusr1Restart() ran unconditionally regardless of update result status. When deps install or build failed, the gateway would restart into a broken state (corrupted node_modules, missing/partial dist/) and enter a crash loop under systemd.

  • No early bail on step failure (update-runner.ts): Unlike the preflight validation section, which checks each step's exit code, the main update flow ran deps install → build → ui:build → doctor in sequence without intermediate checks. A failed pnpm install would cascade into a broken build, wasting time and leaving the installation in a worse, partially updated state.

  • Doctor doesn't auto-repair config during update (update-runner.ts): The doctor step ran with --non-interactive but not --fix, so when a new version's config schema removed or renamed keys, doctor would detect the unknown keys but not strip them. After restart, the gateway's strict config validation would reject the stale keys and crash immediately.
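
To illustrate the third bug, here is a minimal sketch of why detection alone is not enough under strict validation. The names `stripUnknownKeys` and `validateStrict`, and the flat key-set model of the schema, are hypothetical and only stand in for whatever the doctor and the gateway actually do.

```typescript
// Hypothetical illustration; not the project's actual doctor or validator.
type Config = Record<string, unknown>;

interface StripResult {
  config: Config;
  removed: string[];
}

// Roughly what a "fix" pass might do: keep only keys the current schema knows.
function stripUnknownKeys(config: Config, knownKeys: ReadonlySet<string>): StripResult {
  const cleaned: Config = {};
  const removed: string[] = [];
  for (const [key, value] of Object.entries(config)) {
    if (knownKeys.has(key)) {
      cleaned[key] = value;
    } else {
      removed.push(key);
    }
  }
  return { config: cleaned, removed };
}

// Strict validation: any key outside the schema is a hard error at startup.
function validateStrict(config: Config, knownKeys: ReadonlySet<string>): void {
  const unknown = Object.keys(config).filter((key) => !knownKeys.has(key));
  if (unknown.length > 0) {
    throw new Error(`Unknown config keys: ${unknown.join(", ")}`);
  }
}
```

In this model, a config that still carries a key removed by the new schema fails `validateStrict` until `stripUnknownKeys` has been applied, which mirrors the crash-on-restart behavior described above.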

Changes

  1. src/gateway/server-methods/update.ts: Gate scheduleGatewaySigusr1Restart() on result.status === "ok". Failed/skipped updates still write the restart sentinel so the agent can report the error to the user, but the running gateway process stays alive.

  2. src/infra/update-runner.ts: Add exit-code checks after deps install, build, and ui:build steps — matching the pattern already used in the preflight section. Also pass --fix to the doctor invocation so unknown config keys from schema changes are auto-stripped before restart.

  3. src/infra/update-runner.test.ts: Two new test cases verifying early bail when deps install or build fails (confirms subsequent steps are not executed). Updated existing tests for the new --fix flag.
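
The early-bail pattern from change 2 can be sketched roughly as follows. This is an assumption-laden illustration rather than the code in update-runner.ts: the `Step` shape, `runUpdateSteps`, and the synchronous `run` callback are hypothetical, and only the `reason` strings are taken from the review summary on this PR.

```typescript
// Hypothetical sketch; the real update-runner.ts may be structured differently.
type FailureReason = "deps-install-failed" | "build-failed" | "ui-build-failed";

interface Step {
  name: string;
  failureReason: FailureReason;
  run: () => number; // returns the step's exit code (0 means success)
}

type UpdateResult =
  | { status: "ok"; ran: string[] }
  | { status: "error"; reason: FailureReason; ran: string[] };

// Run steps in order and stop at the first non-zero exit code,
// so later steps never execute against a broken intermediate state.
function runUpdateSteps(steps: Step[]): UpdateResult {
  const ran: string[] = [];
  for (const step of steps) {
    ran.push(step.name);
    const exitCode = step.run();
    if (exitCode !== 0) {
      return { status: "error", reason: step.failureReason, ran };
    }
  }
  return { status: "ok", ran };
}
```

With this shape, a test can assert both the returned `reason` and that the `run` callbacks of later steps were never invoked, which is the property the two new test cases are described as checking.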

Test plan

  • pnpm test -- --run src/infra/update-runner.test.ts — all 15 tests pass (13 existing + 2 new)
  • pnpm test -- --run src/gateway/server-methods/ — all 81 tests pass
  • pnpm build — clean, no type errors
  • pnpm check — format + lint clean
  • Manual test: trigger update.run with a version that has a failing build step, verify gateway stays running and reports error to chat
  • Manual test: trigger update.run with a version that introduces an unknown config key, verify doctor auto-strips it

Greptile Summary

This PR fixes a critical crash loop in the self-update pipeline by addressing three related bugs: unconditional gateway restarts after failed updates, missing early-bail on step failures, and doctor not auto-repairing config during updates.

  • Gateway restart gated on success: scheduleGatewaySigusr1Restart() in update.ts now only runs when result.status === "ok", preventing restarts into broken states (corrupted node_modules, partial dist/). The response payload ok field is also updated to result.status !== "error", correctly distinguishing "skipped" (non-error) from "error" states.
  • Early bail on step failure: update-runner.ts now checks exit codes after deps install, build, and ui:build steps, matching the pattern already established in the preflight validation section. Failed steps return structured error results with distinct reason values (deps-install-failed, build-failed, ui-build-failed).
  • Doctor --fix flag: The doctor invocation during updates now passes --fix alongside --non-interactive, enabling auto-stripping of unknown config keys from schema changes between versions. This prevents startup validation crashes after version upgrades.
  • Tests: Two new test cases verify early bail behavior for deps install and build failures. Existing tests are updated for the --fix flag.
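
The two conditions in the first bullet are easy to conflate, so a small sketch may help. It assumes the three status values named in this PR ("ok", "error", "skipped"); `handleUpdateResult`, the callback parameters, and the `restartScheduled` field are hypothetical stand-ins, not the actual handler in update.ts.

```typescript
// Hypothetical sketch of the restart gate; not the real update.ts handler.
type UpdateStatus = "ok" | "error" | "skipped";

interface UpdateResponse {
  ok: boolean; // false only for "error"; "skipped" still counts as non-error
  restartScheduled: boolean; // illustrative field, not part of the real payload
}

function handleUpdateResult(
  status: UpdateStatus,
  writeSentinel: () => void,
  scheduleRestart: () => void,
): UpdateResponse {
  // Assumed to be written for every outcome so the result can be reported.
  writeSentinel();

  // Restart only after a fully successful update.
  const shouldRestart = status === "ok";
  if (shouldRestart) {
    scheduleRestart();
  }

  return { ok: status !== "error", restartScheduled: shouldRestart };
}
```

Under these assumptions a "skipped" update yields `ok: true` with no restart, while an "error" yields `ok: false` with no restart, and in both cases the running process is left alone.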

Confidence Score: 5/5

  • This PR is safe to merge — it adds defensive checks to prevent crash loops without altering the happy-path behavior.
  • All three changes are well-scoped defensive fixes that follow existing patterns in the codebase. The restart gate is a straightforward conditional check. The early-bail checks mirror the preflight section. The --fix flag is a documented CLI option with clear behavior. New tests cover the two most important failure paths. No regressions are expected.
  • No files require special attention.

Last reviewed commit: a1f5cfc

The gateway unconditionally scheduled a SIGUSR1 restart after every
update.run call, even when the update itself failed (broken deps,
build errors, etc.). This left the process restarting into a broken
state — corrupted node_modules, partial builds — causing a crash loop
that required manual intervention.

Three fixes:

1. Only restart on success: scheduleGatewaySigusr1Restart is now
   gated on result.status === "ok". Failed or skipped updates still
   write the restart sentinel (so the status can be reported back to
   the user) but the running gateway stays alive.

2. Early bail on step failure: deps install, build, and ui:build now
   check exit codes immediately (matching the preflight section) so a
   failed deps install no longer cascades into a broken build and
   ui:build.

3. Auto-repair config during update: the doctor step now runs with
   --fix alongside --non-interactive, so unknown config keys left over
   from schema changes between versions are stripped automatically
   instead of causing a startup validation crash.
openclaw-barnacle (bot) added the "gateway" (Gateway runtime) and "size: S" labels on Feb 16, 2026
steipete merged commit 0b8b95f into openclaw:main on Feb 16, 2026
23 checks passed
sebslight self-assigned this on Feb 17, 2026