fix(runtime): three stability fixes for SIGUSR1 restart loop and BYOK model override#915

Merged
lefarcen merged 3 commits into main from fix/runtime-stability-trio
Apr 8, 2026

lefarcen (Collaborator) commented Apr 8, 2026

What

Three small surgical fixes for runtime stability bugs found while diagnosing user reports today. Each is a separate commit so they can be reviewed (or split) independently.

Why

| Bug | Symptom | User impact |
| --- | --- | --- |
| #1 PR #836 gap: `installService` early-returns before checking disabled override when plist is unchanged | Upgrades that don't change the plist leave legacy `launchctl unload -w` flags intact. OpenClaw's SIGUSR1 self-restart then fails with `Bootstrap failed: 5: Input/output error` on every reload. | One affected user (sunqingyu) hit 8 SIGUSR1 restarts in a single day, each triggering an 11-second gateway drain window during which user messages get `Gateway is draining for restart` errors. |
| #2 `plugins.allow` non-deterministic ordering | Even when the SET of plugins is unchanged, channel reorderings or transient status flaps produce a differently-ordered `allow` array. OpenClaw's hot-reload checker sees an array diff and triggers `SIGUSR1` (because `plugins.allow` is not a hot-reloadable field). Combined with bug #1, this is what made the restarts so frequent. | Same affected user: frequent unprompted restarts even with no real config change. |
| #3 `resolveAvailableRuntimeModel` silently overrides BYOK model selection | A user (loomis) selected `anthropic/claude-opus-4-6` (BYOK Anthropic with their own API key). Their BYOK provider has `models: []` (an empty array, common when the user enables a provider but never explicitly adds models to its allowlist). On every doSync, `resolveAvailableRuntimeModel` finds the selection isn't in `availableRuntimeModels` and falls back to `selectPreferredModel`, which returns `link/gemini-3-flash-preview`. OpenClaw then errors with `Unknown model: link/gemini-3-flash-preview`. | Bot stops responding entirely; user thinks switching models in the UI is broken. |

How

Commit 1 — `fix(desktop): always clear launchd disabled override at install time`

Move the `isServiceDisabled` + `enableService` block to the very top of `LaunchdManager.installService`, before the `isRegistered` early-return. `launchctl enable` is idempotent so calling it on every install is safe.
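
The reordering can be sketched as follows. This is a minimal illustration, not the actual `LaunchdManager` code: the `LaunchdOps` interface, `plistMatches`, `bootstrap`, and the return type are hypothetical stand-ins, and the sketch is synchronous for brevity. Only `installService`, `isServiceDisabled`, `enableService`, and `isRegistered` are names taken from this PR.

```typescript
// Hypothetical seam over the launchctl calls, so the ordering can be shown
// (and tested) without touching launchd.
interface LaunchdOps {
  isServiceDisabled(label: string): boolean;
  enableService(label: string): void; // assumed to wrap `launchctl enable`
  isRegistered(label: string): boolean;
  plistMatches(label: string, plist: string): boolean; // hypothetical helper
  bootstrap(label: string, plist: string): void; // hypothetical helper
}

type InstallResult = "unchanged" | "bootstrapped";

function installService(ops: LaunchdOps, label: string, plist: string): InstallResult {
  // Step 1: clear any legacy disabled override before anything can return early.
  if (ops.isServiceDisabled(label)) {
    ops.enableService(label);
  }
  // Step 2: the early return for an already-registered service with an unchanged plist.
  if (ops.isRegistered(label) && ops.plistMatches(label, plist)) {
    return "unchanged";
  }
  // Step 3: otherwise (re)register the service.
  ops.bootstrap(label, plist);
  return "bootstrapped";
}
```

With this ordering, an upgrade that leaves the plist untouched still passes through step 1, which is the path the earlier fix missed.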

Commit 2 — `fix(controller): sort plugins.allow to prevent spurious gateway restarts`

Change `compilePlugins` to emit `allow` via `Array.from(new Set([...connectedPluginIds, ...platformPluginIds])).sort()` so the output is fully deterministic regardless of input order.
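
A small sketch of why the sort matters. The expression is the one quoted above; the wrapper name `buildAllowList` and the sample plugin ids are illustrative assumptions, not the real `compilePlugins` signature.

```typescript
// Deterministic allow-list construction: dedup via Set, then sort.
// sort() with no comparator orders strings by UTF-16 code units, so the
// result does not depend on locale or on the order of the inputs.
function buildAllowList(connectedPluginIds: string[], platformPluginIds: string[]): string[] {
  return Array.from(new Set([...connectedPluginIds, ...platformPluginIds])).sort();
}

// Same set of plugins, different insertion order (e.g. after a status flap).
const allowA = buildAllowList(["weixin", "slack"], ["core", "slack"]);
const allowB = buildAllowList(["slack", "weixin"], ["slack", "core"]);

// Both calls yield ["core", "slack", "weixin"], so an array diff sees no change.
console.log(JSON.stringify(allowA) === JSON.stringify(allowB)); // prints "true"
```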

Commit 3 — `fix(controller): trust user model selection under configured BYOK providers`

Pass `configuredProviderKeys` (set of provider keys in `compiled.models.providers`) to `resolveAvailableRuntimeModel`. Add a new rule: trust any `desiredRef` whose provider key is configured. OpenClaw's `resolveModelWithRegistry` has a generic-fallback path that builds a synthetic model entry when `providerConfig` is present, so the request still goes through with the user's chosen model.
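
A hedged sketch of the new rule. The real signature of `resolveAvailableRuntimeModel` is not shown in this PR, so the parameter list below is an assumption, as is deriving the provider key from the segment before the first `/` in the model ref (inferred from refs such as `anthropic/claude-opus-4-6`).

```typescript
// Sketch only: parameter shapes are assumed, not taken from the controller source.
function resolveAvailableRuntimeModel(
  desiredRef: string | undefined,
  availableRuntimeModels: string[],
  configuredProviderKeys: Set<string>,
  selectPreferredModel: () => string, // stands in for the existing fallback
): string {
  if (desiredRef === undefined) {
    return selectPreferredModel();
  }
  // Existing rule: the selection is explicitly listed as available.
  if (availableRuntimeModels.indexOf(desiredRef) !== -1) {
    return desiredRef;
  }
  // New rule: trust the selection when its provider key is configured,
  // even if that provider's models[] list is empty.
  const slashIndex = desiredRef.indexOf("/");
  const providerKey = slashIndex > 0 ? desiredRef.slice(0, slashIndex) : "";
  if (providerKey !== "" && configuredProviderKeys.has(providerKey)) {
    return desiredRef;
  }
  // Otherwise keep the previous behavior and fall back to the preferred model.
  return selectPreferredModel();
}

// The loomis case: BYOK Anthropic configured, models: [] so nothing is listed.
const resolvedRef = resolveAvailableRuntimeModel(
  "anthropic/claude-opus-4-6",
  [],
  new Set(["anthropic"]),
  () => "link/gemini-3-flash-preview",
);
console.log(resolvedRef); // prints "anthropic/claude-opus-4-6"
```

Under this rule a mistyped model id under a configured provider is forwarded unchanged rather than replaced, which matches the trade-off described in the reviewer notes.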

Affected areas

  • Desktop app (Electron shell)
  • Controller (backend / API)

Checklist

  • `pnpm typecheck` passes
  • `pnpm lint` passes
  • `pnpm test` (not run; existing tests don't reference the modified internals — `plugins.allow` test assertions use `toContain` so they remain order-independent)
  • `pnpm generate-types` (no API route/schema changes)
  • No credentials or tokens in code or logs
  • No `any` types introduced

Notes for reviewers

  • Each commit is independent and could be cherry-picked or reverted standalone if any one of them needs to be held back.
  • Bug #3 is a behavioral change in model resolution: previously the controller would silently downgrade unknown selections to a default; now it forwards them as-is. The downside is that a typo'd model id will now reach OpenClaw and produce an "Unknown model" error there, but that's strictly better than silently picking a different model behind the user's back.
  • Bug #1 leaves the existing retry-path `enableService` call at the bottom of `installService` intact. It's now somewhat redundant but harmless and defends against a race where something re-disables the service between the top-of-function check and the bootstrap call.
  • The OpenClaw upstream restart logic in `triggerOpenClawRestart` (which is the actual code that fails on `launchctl bootstrap` for sunqingyu's case) is in vendored OpenClaw source. We can't modify it, but bug #1's fix means desktop boot will preemptively clear the disabled flag before OpenClaw ever tries to use it.

lefarcen added 3 commits April 8, 2026 13:29
Move the isServiceDisabled + enableService check to the very top of
LaunchdManager.installService so it runs even when the service is
already registered with unchanged plist content.

The PR #836 fix only ran the disabled check on the bootstrap path,
which means upgrades that don't change the plist hit the early-return
and never clear legacy `launchctl unload -w` flags. As a result,
OpenClaw's own SIGUSR1 self-restart flow tries `launchctl bootstrap`
and fails with "Bootstrap failed: 5" on every reload, draining the
gateway for ~11s and rejecting user requests with
"Gateway is draining for restart".

`launchctl enable` is idempotent, so calling it on every install is
safe.

Build plugins.allow via sort + dedup so the output array is fully
deterministic regardless of channel insertion order or transient
status flaps.

Without this, every time a channel briefly flips status (e.g. weixin
account probe) the resulting allow array's order changes, OpenClaw
treats it as a config change, and triggers a SIGUSR1 in-process
restart with an 11s gateway drain window. One affected user hit 8
SIGUSR1 restarts in a single day.

…viders

Add the set of configured provider keys to resolveAvailableRuntimeModel
and trust any desiredRef whose provider exists in compiled.models.providers,
even when the provider's explicit models[] list is empty.

Without this, a BYOK Anthropic user who selected
`anthropic/claude-opus-4-6` but never explicitly added models to
the provider's allowlist would silently get their selection
overridden to `link/gemini-3-flash-preview` (the default link
model) on every doSync. OpenClaw's resolveModelWithRegistry has a
generic-fallback path that builds a synthetic model entry whenever
providerConfig is present, so the request still goes through
correctly with the user's chosen model.

sentry Bot commented Apr 8, 2026

Codecov Report

❌ Patch coverage is 22.72727% with 17 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...s/controller/src/services/openclaw-sync-service.ts | 0.00% | 11 Missing ⚠️ |
| apps/desktop/main/services/launchd-manager.ts | 14.28% | 6 Missing ⚠️ |


lefarcen merged commit f1e52ab into main on Apr 8, 2026
19 of 20 checks passed
Celina-create added labels on Apr 9, 2026: priority:p2 (P2-nice-to-have: Low impact, backlog), type:feature (Issue type is a feature request), area/channels (IM channel integration: WeChat/Discord/DingTalk etc.)