fix(update): prevent gateway crash loop after failed self-update #18131
Merged
steipete merged 2 commits into openclaw:main on Feb 16, 2026
The gateway unconditionally scheduled a SIGUSR1 restart after every `update.run` call, even when the update itself failed (broken deps, build errors, etc.). This left the process restarting into a broken state (corrupted `node_modules`, partial builds), causing a crash loop that required manual intervention.

Three fixes:

1. Only restart on success: `scheduleGatewaySigusr1Restart` is now gated on `result.status === "ok"`. Failed or skipped updates still write the restart sentinel (so the status can be reported back to the user), but the running gateway stays alive.
2. Early bail on step failure: deps install, build, and ui:build now check exit codes immediately (matching the preflight section), so a failed deps install no longer cascades into a broken build and ui:build.
3. Auto-repair config during update: the doctor step now runs with `--fix` alongside `--non-interactive`, so unknown config keys left over from schema changes between versions are stripped automatically instead of causing a startup validation crash.
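The restart gating in fix 1 can be sketched roughly as follows. This is a minimal illustration, not the code from `update.ts`: the `UpdateResult` shape, the `RestartHooks` interface, the `writeRestartSentinel` helper, and the `finishUpdate` wrapper are all hypothetical names introduced here. Only `scheduleGatewaySigusr1Restart` and the two status comparisons come from this PR.

```typescript
// Hypothetical result shape; the real type in update.ts is not shown in this PR.
type UpdateStatus = "ok" | "error" | "skipped";

interface UpdateResult {
  status: UpdateStatus;
  reason?: string;
}

// Hooks are injected so the sketch stays self-contained and testable.
interface RestartHooks {
  // Assumed helper: records the outcome so it can be reported to the user.
  writeRestartSentinel: (result: UpdateResult) => void;
  // Name taken from the PR; the zero-argument signature is an assumption.
  scheduleGatewaySigusr1Restart: () => void;
}

function finishUpdate(
  result: UpdateResult,
  hooks: RestartHooks,
): { ok: boolean; restartScheduled: boolean } {
  // The sentinel is written for every outcome, including failures.
  hooks.writeRestartSentinel(result);

  // Only a successful update may restart the running gateway.
  const restartScheduled = result.status === "ok";
  if (restartScheduled) {
    hooks.scheduleGatewaySigusr1Restart();
  }

  // "skipped" is reported as non-error, but it still does not restart.
  return { ok: result.status !== "error", restartScheduled };
}
```

With this shape, a result of `{ status: "error", reason: "build-failed" }` writes the sentinel and returns without touching the running process, which is the behavior the fix is after.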
Summary
Fixes the issue where the bot consistently breaks and requires manual intervention when users ask it to update itself or modify config via chat (Telegram/WhatsApp). The root cause is a combination of three bugs in the self-update pipeline:
1. Gateway always restarts after update, even on failure (`server-methods/update.ts`): `scheduleGatewaySigusr1Restart()` ran unconditionally regardless of update result status. When deps install or build failed, the gateway would restart into a broken state (corrupted `node_modules`, missing/partial `dist/`) and enter a crash loop under systemd.
2. No early bail on step failure (`update-runner.ts`): Unlike the preflight validation section, which checks each step's exit code, the main update flow ran deps install → build → ui:build → doctor in sequence without intermediate checks. A failed `pnpm install` would cascade into a broken build, wasting time and leaving things in a worse partial state.
3. Doctor doesn't auto-repair config during update (`update-runner.ts`): The doctor step ran with `--non-interactive` but not `--fix`, so when a new version's config schema removed or renamed keys, doctor would detect the unknown keys but not strip them. After restart, the gateway's strict config validation would reject the stale keys and crash immediately.

Changes
- `src/gateway/server-methods/update.ts`: Gate `scheduleGatewaySigusr1Restart()` on `result.status === "ok"`. Failed/skipped updates still write the restart sentinel so the agent can report the error to the user, but the running gateway process stays alive.
- `src/infra/update-runner.ts`: Add exit-code checks after the deps install, build, and ui:build steps, matching the pattern already used in the preflight section. Also pass `--fix` to the doctor invocation so unknown config keys from schema changes are auto-stripped before restart.
- `src/infra/update-runner.test.ts`: Two new test cases verifying early bail when deps install or build fails (confirming that subsequent steps are not executed). Updated existing tests for the new `--fix` flag.

Test plan
- `pnpm test -- --run src/infra/update-runner.test.ts`: all 15 tests pass (13 existing + 2 new)
- `pnpm test -- --run src/gateway/server-methods/`: all 81 tests pass
- `pnpm build`: clean, no type errors
- `pnpm check`: format + lint clean
- `update.run` with a version that has a failing build step: verify gateway stays running and reports error to chat
- `update.run` with a version that introduces an unknown config key: verify doctor auto-strips it

Greptile Summary
This PR fixes a critical crash loop in the self-update pipeline by addressing three related bugs: unconditional gateway restarts after failed updates, missing early-bail on step failures, and doctor not auto-repairing config during updates.
- `scheduleGatewaySigusr1Restart()` in `update.ts` now only runs when `result.status === "ok"`, preventing restarts into broken states (corrupted `node_modules`, partial `dist/`). The response payload `ok` field is also updated to `result.status !== "error"`, correctly distinguishing "skipped" (non-error) from "error" states.
- `update-runner.ts` now checks exit codes after the deps install, build, and ui:build steps, matching the pattern already established in the preflight validation section. Failed steps return structured error results with distinct `reason` values (`deps-install-failed`, `build-failed`, `ui-build-failed`).
- `--fix` flag: The doctor invocation during updates now passes `--fix` alongside `--non-interactive`, enabling auto-stripping of unknown config keys from schema changes between versions. This prevents startup validation crashes after version upgrades.
- Existing tests are updated for the new `--fix` flag.

Confidence Score: 5/5
Last reviewed commit: a1f5cfc
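
For readers who want a concrete picture of the early-bail behavior described above, here is a minimal sketch under stated assumptions. It is not the code from `update-runner.ts`: the `Step` and `RunResult` types, the injected `exec` callback, and every command line in `STEPS` (including the `openclaw doctor` invocation) are hypothetical placeholders. Only the step order, the three `reason` values, and the `--non-interactive --fix` flags come from this PR. The PR text does not say how a failing doctor step is handled, so this sketch simply does not bail on it.

```typescript
type StepName = "deps install" | "build" | "ui:build" | "doctor";
type FailureReason = "deps-install-failed" | "build-failed" | "ui-build-failed";

interface Step {
  name: StepName;
  argv: string[]; // placeholder command line, not the real invocation
  failureReason?: FailureReason; // present only for steps that trigger an early bail
}

type RunResult =
  | { status: "ok"; ran: StepName[] }
  | { status: "error"; reason: FailureReason; ran: StepName[] };

const STEPS: Step[] = [
  { name: "deps install", argv: ["pnpm", "install"], failureReason: "deps-install-failed" },
  { name: "build", argv: ["pnpm", "build"], failureReason: "build-failed" },
  { name: "ui:build", argv: ["pnpm", "ui:build"], failureReason: "ui-build-failed" },
  // Doctor now runs with --fix so unknown config keys are stripped before restart.
  { name: "doctor", argv: ["openclaw", "doctor", "--non-interactive", "--fix"] },
];

// `exec` runs one command and returns its exit code; it is injected so the
// sketch can be exercised without spawning real processes.
function runUpdateSteps(exec: (argv: string[]) => number): RunResult {
  const ran: StepName[] = [];
  for (const step of STEPS) {
    ran.push(step.name);
    const exitCode = exec(step.argv);
    // Check the exit code immediately; later steps never run after a failure.
    if (exitCode !== 0 && step.failureReason !== undefined) {
      return { status: "error", reason: step.failureReason, ran };
    }
  }
  return { status: "ok", ran };
}
```

The `ran` array exists only to make the short-circuit observable, in the same spirit as the two new test cases that confirm subsequent steps are not executed after a failure.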