[Bug]: doctor --fix discards legacy migrations if any unrelated validation issue remains, leaving config unfixed #76798

@rhubain

Bug type

Upgrade regression / catch-22 — locks users out of openclaw doctor --fix, the only documented escape from upgrade-induced crash loops.

Severity

High. When this fires, the documented recovery path (openclaw doctor --fix) silently does nothing useful. Users are forced to hand-edit ~/.openclaw/openclaw.json (and the .bak LKG file) to escape. Combined with the LKG-restore loop documented in #76700 and the validation-before-migration order described in #68664, the user is effectively bricked.

Summary

migrateLegacyConfig in src/commands/doctor/shared/legacy-config-migrate.ts (bundled at dist/doctor-config-flow-7oxT6MZQ.js:924) applies legacy migrations to a config, then runs full plugin-aware validation on the migrated result. If the post-migration config still has any unrelated validation issue (e.g. a missing plugin, a broken provider), the function returns config: null and the migrated raw is silently discarded, even though the legacy migration itself succeeded and is strictly safe to persist.

The caller (applyLegacyCompatibilityStep, in the same bundle at line ~945) then falls through with the unmigrated candidate, the doctor flow continues, and the write step ultimately throws Error: Config validation failed on the unrelated issue. The legacy migration is never written to disk.

This means doctor --fix is non-incremental: a single unrelated validation issue defeats every safe migration that doctor knows how to apply.

Concrete repro (the path I hit)

This is the scenario that triggered the investigation — two independent issues that, individually, doctor knows how to handle, but together brick the user:

  1. Long-running install on v2026.4.x. Config has agents.defaults.llm.idleTimeoutSeconds: 300 (was valid; deprecation rule for it lives in dist/legacy-config-issues-Bce7-rlH.js:526 and a migration for it lives at line 605 — id agents.defaults.llm->models.providers.timeoutSeconds).
  2. Upgrade to v2026.5.2 (commit 8b2a6e5). The new validator now rejects the legacy key as agents.defaults: Unrecognized key: "llm", with a sibling legacyIssue flagging the migration is available.
  3. Independently, plugins.entries.brave.enabled: true plus tools.web.search.provider: "brave" are present, and ~/.openclaw/plugins/installs.json claims @openclaw/brave-plugin@2026.5.2 is installed at ~/.openclaw/npm/node_modules/@openclaw/brave-plugin/, but the directory does not exist on disk. (I'm not sure how this happens: it could be a half-completed install, an aborted upgrade, or the same provider-registration failure mode that #76700 documents. The cause is orthogonal: a stale install record, a half-uninstalled plugin, or anything else that produces a web_search provider is not available: brave validation error reproduces this just as well.)
  4. Gateway crash-loops at startup with the LKG-restore pattern from #76700 (LKG was promoted in 4.x with agents.defaults.llm still considered valid → identical-hash restore loop; clean repro of #76700's Issue #2).
  5. User runs openclaw doctor --fix. Expected: at least the legacy agents.defaults.llm key is removed, since OpenClaw ships an explicit migration for it. Actual: doctor ends with Error: Config validation failed: tools.web.search.provider: web_search provider is not available: brave. Re-reading ~/.openclaw/openclaw.json shows agents.defaults.llm is still there. The migration ran in memory but was discarded.

I confirmed this by reading the post-doctor --fix config back; the llm block was unchanged. Only manual JSON edits (on both openclaw.json and the .bak LKG, since the LKG restore loop overwrites in-place edits) escaped the loop.

Root cause (code reference)

In dist/doctor-config-flow-7oxT6MZQ.js, with apparent source path src/commands/doctor/shared/legacy-config-migrate.ts:

function migrateLegacyConfig(raw) {
    const { next, changes } = applyLegacyDoctorMigrations(raw);
    if (!next) return { config: null, changes: [] };
    const validated = validateConfigObjectWithPlugins(next);
    if (!validated.ok) {
        changes.push("Migration applied, but config still invalid; fix remaining issues manually.");
        return {
            config: null,           // <-- migrated config thrown away
            changes
        };
    }
    return { config: validated.config, changes };
}

And the caller (src/commands/doctor/shared/config-flow-steps.ts, bundled line ~945):

function applyLegacyCompatibilityStep(params) {
    if (params.snapshot.legacyIssues.length === 0) return { ... };
    const { config: migrated, changes } = migrateLegacyConfig(params.snapshot.parsed);
    if (!migrated) return {
        state: {
            ...params.state,        // <-- candidate stays UNMIGRATED
            pendingChanges: ...,
            fixHints: ...
        },
        ...
    };
    return {
        state: { cfg: migrated, candidate: migrated, ... },
        ...
    };
}

The "all-or-nothing" coupling between legacy migration and full plugin-aware validation is the bug. Legacy migrations are strictly safe (they remove a known-legacy key for which a migration is registered) and should be applied independently of unrelated plugin/provider problems.
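The coupling can be reproduced in isolation with a toy model. The sketch below is not OpenClaw's code: applyLegacyMigrationsStub, validateStub, and migrateAllOrNothing are hypothetical stand-ins that only mirror the control flow quoted above, and the stub migration simply drops the legacy key rather than relocating its value.

```typescript
// Toy model of the all-or-nothing coupling. Every name here is a
// hypothetical stand-in, not OpenClaw's real implementation.
type Config = Record<string, any>;

interface Validation {
  ok: boolean;
  config: Config;
  issues: string[];
}

// Stand-in migration: drops the legacy agents.defaults.llm block.
function applyLegacyMigrationsStub(raw: Config): { next: Config | null; changes: string[] } {
  if (raw.agents?.defaults?.llm === undefined) return { next: null, changes: [] };
  const next: Config = JSON.parse(JSON.stringify(raw)); // deep copy; input stays untouched
  delete next.agents.defaults.llm;
  return { next, changes: ["Removed agents.defaults.llm"] };
}

// Stand-in validator: rejects the legacy key and an unavailable search provider.
function validateStub(cfg: Config): Validation {
  const issues: string[] = [];
  if (cfg.agents?.defaults?.llm !== undefined) {
    issues.push('agents.defaults: Unrecognized key: "llm"');
  }
  if (cfg.tools?.web?.search?.provider === "brave") {
    issues.push("tools.web.search.provider: web_search provider is not available: brave");
  }
  return { ok: issues.length === 0, config: cfg, issues };
}

// Same shape as the quoted migrateLegacyConfig: null whenever any issue remains.
function migrateAllOrNothing(raw: Config): { config: Config | null; changes: string[] } {
  const { next, changes } = applyLegacyMigrationsStub(raw);
  if (!next) return { config: null, changes: [] };
  const validated = validateStub(next);
  if (!validated.ok) return { config: null, changes }; // migrated object is dropped here
  return { config: validated.config, changes };
}

const legacyOnly: Config = { agents: { defaults: { llm: { idleTimeoutSeconds: 300 } } } };
const legacyPlusBrave: Config = {
  agents: { defaults: { llm: { idleTimeoutSeconds: 300 } } },
  tools: { web: { search: { provider: "brave" } } },
};

const legacyOnlyResult = migrateAllOrNothing(legacyOnly);
const legacyPlusBraveResult = migrateAllOrNothing(legacyPlusBrave);
console.log(legacyOnlyResult.config !== null);      // migration survives when nothing else is wrong
console.log(legacyPlusBraveResult.config === null); // migration is discarded once brave is also broken
```

In this model the second input reports a change in changes but hands back no config, which is the same mismatch observed after running doctor --fix.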

Why this is a separate issue from existing ones

| Issue | What it covers | What it doesn't |
| --- | --- | --- |
| #76700 | LKG-restore loop when current ≡ LKG (brave plugin trigger) | doctor --fix as escape hatch; assumes it works |
| #68664 | Validation runs before legacy migrations in the gateway startup path | doctor --fix specifically; different code path |
| #74910 | agents.defaults.llm migration discards user's idleTimeoutSeconds value | the migration not running at all when other issues exist |
| #50561 | Auto-apply safe doctor fixes on gateway start (feature request) | the existing doctor --fix path being itself broken |
| #55347 | Native gateway self-healing (feature request) | same; assumes fixes are applied when invoked |

This issue is specifically: the manually-invoked doctor --fix flow drops legacy migrations on the floor when any other validation issue remains. Even if #68664 lands (migrations before validation in gateway startup), users who already hit the loop and try doctor --fix per OpenClaw's own error message (Run "openclaw doctor --fix") will still find it does nothing.

Suggested fix

Two options, in increasing order of scope.

1. Minimal — make legacy migration independently committable. In migrateLegacyConfig, when post-migration validation fails, still return the migrated raw object instead of null, with a flag noting that other issues remain. The caller commits the legacy-migrated state to state.candidate and surfaces the remaining issues separately. The final write either succeeds (legacy migration alone unbricks the config) or fails on the unrelated issue (but the legacy keys are now gone, so the LKG loop in #76700 no longer reproduces them after the next promotion).

 function migrateLegacyConfig(raw) {
     const { next, changes } = applyLegacyDoctorMigrations(raw);
     if (!next) return { config: null, changes: [] };
     const validated = validateConfigObjectWithPlugins(next);
     if (!validated.ok) {
-        changes.push("Migration applied, but config still invalid; fix remaining issues manually.");
-        return { config: null, changes };
+        changes.push("Migration applied; other unrelated issues remain (see below).");
+        // Keep the migrated object so the caller can commit it; report the rest separately.
+        return { config: next, changes, unresolvedIssues: validated.issues };
     }
     return { config: validated.config, changes };
 }
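On the caller side, the new return shape could be consumed roughly as follows. This is a sketch under assumptions: commitLegacyStep, StepState, and MigrationResult are hypothetical names, the real state object has more fields than shown, and unresolvedIssues is assumed to be a list of strings (the actual shape of validated.issues may differ).

```typescript
// Hypothetical caller-side handling of the proposed return shape.
type Config = Record<string, unknown>;

interface MigrationResult {
  config: Config | null;
  changes: string[];
  unresolvedIssues?: string[]; // proposed field; string format is an assumption
}

interface StepState {
  candidate: Config;
  pendingChanges: string[];
  fixHints: string[];
}

function commitLegacyStep(state: StepState, result: MigrationResult): StepState {
  // No migration was applicable: leave the state as it was.
  if (!result.config) return state;
  return {
    // Commit the migrated object even when other issues remain.
    candidate: result.config,
    pendingChanges: [...state.pendingChanges, ...result.changes],
    // Surface what the migration could not fix instead of failing silently.
    fixHints: [...state.fixHints, ...(result.unresolvedIssues ?? [])],
  };
}

const before: StepState = {
  candidate: { agents: { defaults: { llm: { idleTimeoutSeconds: 300 } } } },
  pendingChanges: [],
  fixHints: [],
};
const after = commitLegacyStep(before, {
  config: { agents: { defaults: {} } },
  changes: ["Removed agents.defaults.llm"],
  unresolvedIssues: ["tools.web.search.provider: web_search provider is not available: brave"],
});
console.log(after.pendingChanges, after.fixHints);
```

The point of the design is that the candidate advances whenever a migration ran, while the remaining problems travel alongside it as hints rather than vetoing the commit.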

2. Broader — split doctor --fix into independently-committed atomic steps. Each migration / fix is its own transaction. Apply, validate the change in isolation (does this specific change make the config strictly closer to valid?), commit. Surface remaining unfixable issues. This is the structural answer that also dovetails with #50561 (gateway can run the legacy-migration subset on startup safely) and addresses #68664 (legacy migrations are applied earlier, in their own commit, before the strict validator gates startup).
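One way to read "independently-committed atomic steps" is sketched below. All names (FixStep, runFixSteps, listIssuesStub, the step id) are hypothetical, and "strictly closer to valid" is interpreted here as: the step introduces no new issue and removes at least one. Other definitions are possible.

```typescript
// Sketch of independently committed fix steps. All names are hypothetical.
type Config = Record<string, any>;

interface FixStep {
  id: string;
  // Returns a new config, or null when the step does not apply.
  apply(cfg: Config): Config | null;
}

interface FixRun {
  config: Config;
  applied: string[];
  remaining: string[];
}

function runFixSteps(
  initial: Config,
  steps: FixStep[],
  listIssues: (cfg: Config) => string[],
): FixRun {
  let current = initial;
  const applied: string[] = [];
  for (const step of steps) {
    const next = step.apply(current);
    if (next === null) continue;
    const before = listIssues(current);
    const after = listIssues(next);
    // Commit only if no new issue appears and at least one issue disappears.
    const introducesNew = after.some((issue) => !before.includes(issue));
    if (!introducesNew && after.length < before.length) {
      current = next;
      applied.push(step.id);
    }
  }
  // Whatever is left is reported, not used as a reason to roll back.
  return { config: current, applied, remaining: listIssues(current) };
}

// Stand-in step and validator for the scenario in this report.
const dropLegacyLlm: FixStep = {
  id: "drop-legacy-llm",
  apply(cfg) {
    if (cfg.agents?.defaults?.llm === undefined) return null;
    const next: Config = JSON.parse(JSON.stringify(cfg));
    delete next.agents.defaults.llm;
    return next;
  },
};

const listIssuesStub = (cfg: Config): string[] => {
  const issues: string[] = [];
  if (cfg.agents?.defaults?.llm !== undefined) issues.push("legacy llm key");
  if (cfg.tools?.web?.search?.provider === "brave") issues.push("brave provider unavailable");
  return issues;
};

const run = runFixSteps(
  {
    agents: { defaults: { llm: { idleTimeoutSeconds: 300 } } },
    tools: { web: { search: { provider: "brave" } } },
  },
  [dropLegacyLlm],
  listIssuesStub,
);
console.log(run.applied, run.remaining);
```

With this structure the legacy step is committed on its own merits, and the brave problem ends up in remaining for the user to resolve.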

In either form, the user-visible contract should be: doctor --fix always applies every safe migration it knows about, and only refuses on the issues it cannot fix. Today, one unfixable issue blocks every fixable one.

Reproduction artefacts (what to check on a repro instance)

  • ~/.openclaw/openclaw.json: has both agents.defaults.llm.* and a config-level reference to a missing/unloadable web_search provider
  • ~/.openclaw/openclaw.json.bak: same legacy key (the LKG was promoted under the old schema)
  • ~/.openclaw/logs/gateway.err.log: repeating "Config auto-restored from last-known-good" for (startup-invalid-config) with Rejected validation details: agents.defaults: Unrecognized key: "llm" (matches the trigger described in #76700, but with the legacy key as the trigger instead of brave)
  • After openclaw doctor --fix: agents.defaults.llm should be gone (per the migration in dist/legacy-config-issues-Bce7-rlH.js:605). It is not.
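The two config checks above can be scripted. The helper below is a hypothetical convenience, not part of OpenClaw; it works on an already-parsed object, so it assumes the file can be read as plain JSON.

```typescript
// Hypothetical helper: reports the two config-level conditions listed above.
interface ArtefactReport {
  hasLegacyLlmKey: boolean;
  searchProvider: string | null;
}

function inspectConfig(cfg: unknown): ArtefactReport {
  const root = (cfg ?? {}) as Record<string, any>;
  const provider = root.tools?.web?.search?.provider;
  return {
    // True while agents.defaults.llm is still present, i.e. the migration was not persisted.
    hasLegacyLlmKey: root.agents?.defaults?.llm !== undefined,
    // The configured web search provider, if any.
    searchProvider: typeof provider === "string" ? provider : null,
  };
}

const sample = inspectConfig({
  agents: { defaults: { llm: { idleTimeoutSeconds: 300 } } },
  tools: { web: { search: { provider: "brave" } } },
});
console.log(sample);
```

Running it on the parsed contents of openclaw.json and openclaw.json.bak before and after doctor --fix shows whether the legacy key was actually removed from either file.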

Environment

  • OpenClaw: 2026.5.2 (commit 8b2a6e5)
  • Install: Homebrew (/opt/homebrew/lib/node_modules/openclaw)
  • Node: v22.22.0 (/opt/homebrew/opt/node@22/bin/node)
  • OS: macOS 26.3.1 (Darwin 25.4.0, arm64)
  • LaunchAgent: gui/$UID/ai.openclaw.gateway
  • Last touched config version (per meta.lastTouchedVersion): 2026.4.14 — covers the upgrade path
