[Bug]: openclaw doctor hangs at 100% CPU after Plugins step with large agents.list containing per-agent model overrides #66159

@lmagitem

Description

Bug type

Crash (process/app exits or hangs)

Beta release blocker

No

Summary

resolveExternalCatalogPreferOver performs uncached synchronous disk reads (3 files per call) inside an O(N^2) loop over plugin auto-enable candidates, causing openclaw doctor to hang indefinitely at 100% CPU when agents.list contains ~50+ agents with per-agent model overrides.

Steps to reproduce

  1. Configure openclaw.json with:

    • Multiple auth profiles (synthetic, openrouter, together, zai, minimax)
    • Corresponding models.providers entries
    • agents.list with ~130 agents, each declaring model: { primary: "provider/model", fallbacks: [...] }
    • plugins.entries enabling telegram, discord, and provider plugins
    • channels with telegram and discord enabled
  2. Run openclaw doctor

  3. Observe: process hangs after displaying the "Plugins" box, consuming 100% CPU indefinitely.

Minimal repro: any config with ~50+ agents.list entries, each declaring model overrides using 3+ different provider prefixes (e.g. synthetic/, openrouter/, together/), should trigger the issue.
Removing all agents.list[].model fields resolves the hang.
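
For anyone trying to reproduce this, a minimal sketch of generating the agents.list portion of such a config is below. Only the `model: { primary, fallbacks }` shape comes from this report; the `id` field and the model names are hypothetical placeholders, and the result still has to be merged into a full openclaw.json with the auth profiles, models.providers, and plugins.entries described above.

```javascript
// Sketch: build the agents.list part of a repro config.
// Assumption: each agent has an `id` field; that name is NOT taken from the issue.
function buildReproConfig(agentCount, providers) {
    const list = [];
    for (let i = 0; i < agentCount; i++) {
        // Rotate through the provider prefixes so every prefix is used.
        const primary = providers[i % providers.length];
        const fallback = providers[(i + 1) % providers.length];
        list.push({
            id: `agent-${i}`, // hypothetical field name
            model: {
                primary: `${primary}/model-a`,      // placeholder model name
                fallbacks: [`${fallback}/model-b`], // placeholder model name
            },
        });
    }
    return { agents: { list } };
}

const repro = buildReproConfig(50, ["synthetic", "openrouter", "together"]);
console.log(JSON.stringify(repro.agents.list[0]));
```

Raising `agentCount` towards 130 should approximate the full config from the reproduction steps.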

Expected behavior

openclaw doctor completes all steps and exits within a reasonable time regardless of the number of configured agents.

Actual behavior

Process hangs after the Plugins step. strace shows a tight loop of synchronous reads of ~/.openclaw/mpm/plugins.json, ~/.openclaw/mpm/catalog.json, and ~/.openclaw/plugins/catalog.json repeating indefinitely. Process must be killed manually. On a 16 GB machine, the process reached 682 MB RSS before being killed, suggesting a possible memory leak in the loop as well.

OpenClaw version

2026.4.11

Operating system

Linux Mint 22.3 (x86_64), Node v22.22.0

Install method

npm global

Model

N/A (bug is in config resolution, not model calls)

Provider / routing chain

N/A (hang occurs before any provider calls)

Additional provider/model setup details

Config uses 5 custom provider prefixes across agents:

  • synthetic (OpenAI-compat, api.synthetic.new)
  • openrouter
  • together
  • zai (OpenAI-compat, api.z.ai)
  • minimax (Anthropic-compat, api.minimax.io)

Each of ~130 agents declares model.primary + 1-2 fallbacks using these providers.
agents.defaults.model also declares a primary + fallbacks.
plugins.entries explicitly enables telegram, discord, minimax, synthetic, openrouter, together, zai.

Logs, screenshots, and evidence

strace output showing the tight read loop:

access("/home/user/.openclaw/mpm/plugins.json", F_OK) = 0
openat(AT_FDCWD, "/home/user/.openclaw/mpm/plugins.json", O_RDONLY|O_CLOEXEC) = 24
read(24, "[]\n", 8192)                  = 3
read(24, "", 8192)                      = 0
close(24)                               = 0
access("/home/user/.openclaw/mpm/catalog.json", F_OK) = 0
openat(AT_FDCWD, "/home/user/.openclaw/mpm/catalog.json", O_RDONLY|O_CLOEXEC) = 24
read(24, "{}\n", 8192)                  = 3
read(24, "", 8192)                      = 0
close(24)                               = 0
access("/home/user/.openclaw/plugins/catalog.json", F_OK) = 0
openat(AT_FDCWD, "/home/user/.openclaw/plugins/catalog.json", O_RDONLY|O_CLOEXEC) = 24
read(24, "{}\n", 8192)                  = 3
read(24, "", 8192)                      = 0
close(24)                               = 0
[repeats indefinitely]


Root cause traced to `plugin-auto-enable-rMc8VJBA.js`:
- `materializePluginAutoEnableCandidatesInternal` iterates all candidates (line ~587)
- For each candidate, `shouldSkipPreferredPluginAutoEnable` iterates all *other* candidates (line ~235)
- For each pair, `resolvePreferredOverIds` calls `resolveExternalCatalogPreferOver` (line ~232)
- `resolveExternalCatalogPreferOver` checks and reads up to 3 catalog files per call (`fs.existsSync` followed by a synchronous `fs.readFileSync` for each), with no caching (lines 207-216)
- Total file reads: O(3 * N^2), where N = number of auto-enable candidates derived from model refs

Confirmed workaround: adding a `Map`-based memo cache on `channelId` to `resolveExternalCatalogPreferOver` resolves the hang completely. Doctor completes in seconds with the full 130-agent config.
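
To make the read counts concrete, here is a toy model of the loop structure described above. It is not the OpenClaw source: the function and variable names are invented, it assumes every uncached call reads all 3 files (as in the strace output), and it assumes one auto-enable candidate per agent, which may not match how candidates are actually derived.

```javascript
// Toy model only; names here are hypothetical and not taken from plugin-auto-enable.
const CATALOG_FILES = 3; // plugins.json plus the two catalog.json files seen in strace

function countReads(candidateIds, memoize) {
    let reads = 0;
    const cache = new Map();
    // Stand-in for resolveExternalCatalogPreferOver: an uncached call "reads" every file.
    const resolvePreferOver = (id) => {
        if (memoize && cache.has(id)) return cache.get(id);
        reads += CATALOG_FILES;
        const result = [];
        if (memoize) cache.set(id, result);
        return result;
    };
    for (const candidate of candidateIds) {   // outer loop over all candidates
        for (const other of candidateIds) {   // inner loop over the other candidates
            if (other === candidate) continue;
            resolvePreferOver(other);
        }
    }
    return reads;
}

const ids = Array.from({ length: 130 }, (_, i) => `candidate-${i}`);
console.log("uncached:", countReads(ids, false)); // 130 * 129 * 3 = 50310
console.log("memoized:", countReads(ids, true));  // 130 * 3 = 390
```

With the memo in place the number of reads grows linearly with the number of distinct ids rather than quadratically with the number of candidates.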

Impact and severity

  • Affected: Any user with a large multi-agent config using per-agent model overrides across multiple providers.
  • Severity: Blocks workflow. openclaw doctor and openclaw gateway restart both hang, making the system unusable until model overrides are removed.
  • Frequency: 100% reproducible with ~50+ agents declaring model overrides.
  • Consequence: Unable to start or restart the gateway, run doctor, or use the system at all without stripping per-agent model config.
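
Until a fix lands, the only config-side workaround is stripping the per-agent overrides. A minimal sketch of doing that on a parsed config object follows; the `id` field and model names in the sample data are placeholders, and since this discards the overrides, the original file should be backed up first.

```javascript
// Sketch: return a copy of the config with every agents.list[].model removed.
// agents.defaults.model is left alone, since that setting alone did not trigger the hang.
function stripAgentModelOverrides(config) {
    const agents = config.agents ?? {};
    const list = (agents.list ?? []).map(({ model, ...rest }) => rest);
    return { ...config, agents: { ...agents, list } };
}

// Sample data; field values are placeholders, not taken from a real config.
const original = {
    agents: {
        defaults: { model: { primary: "openrouter/model-a" } },
        list: [
            { id: "agent-0", model: { primary: "synthetic/model-a", fallbacks: ["together/model-b"] } },
            { id: "agent-1" },
        ],
    },
};
const stripped = stripAgentModelOverrides(original);
console.log(JSON.stringify(stripped.agents.list)); // [{"id":"agent-0"},{"id":"agent-1"}]
```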

Additional information

Suggested fix: memoise resolveExternalCatalogPreferOver by channelId. The external catalog files do not change during a single process invocation, so caching is safe. Patch applied locally and confirmed working:

// Process-lifetime memo keyed by channelId (env is assumed constant within one run).
const _externalCatalogPreferOverCache = new Map();
function resolveExternalCatalogPreferOver(channelId, env) {
    if (_externalCatalogPreferOverCache.has(channelId)) return _externalCatalogPreferOverCache.get(channelId);
    let preferOver = [];
    for (const rawPath of resolveExternalCatalogPaths(env)) {
        const resolved = resolveUserPath(rawPath, env);
        if (!fs.existsSync(resolved)) continue;
        try {
            const entries = parseExternalCatalogChannelEntries(JSON.parse(fs.readFileSync(resolved, "utf-8")));
            const channel = entries.find((entry) => entry.id === channelId);
            if (channel) { preferOver = channel.preferOver; break; }
        } catch {
            // Unreadable or malformed catalog file: skip it and try the next path.
        }
    }
    _externalCatalogPreferOverCache.set(channelId, preferOver);
    return preferOver;
}

An additional improvement would be to also cache resolveExternalCatalogPaths and the parsed file contents, since those are invariant within a process run and currently re-read for every unique channelId.
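
A rough sketch of that second idea, reading and parsing each catalog file at most once and indexing the entries by id, could look like the following. `createPreferOverResolver`, the `channels` JSON key, and the demo file are all invented for illustration; the real code would pass in `resolveExternalCatalogPaths` and `parseExternalCatalogChannelEntries`. One difference from the patch above: this version falls back to `[]` when an entry has no `preferOver`.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Hypothetical helper: parse every catalog file once, then answer lookups from memory.
function createPreferOverResolver(catalogPaths, parseEntries) {
    let index = null; // built lazily on the first lookup
    function buildIndex() {
        const byId = new Map();
        for (const file of catalogPaths) {
            let entries;
            try {
                entries = parseEntries(JSON.parse(fs.readFileSync(file, "utf-8")));
            } catch {
                continue; // missing or malformed file: skip it
            }
            for (const entry of entries) {
                // The first file that mentions an id wins, matching the original search order.
                if (!byId.has(entry.id)) byId.set(entry.id, entry.preferOver ?? []);
            }
        }
        return byId;
    }
    return (channelId) => {
        index ??= buildIndex();
        return index.get(channelId) ?? [];
    };
}

// Demo with a throwaway catalog file; the JSON shape is an assumption for this example.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "catalog-demo-"));
const catalogFile = path.join(dir, "catalog.json");
fs.writeFileSync(catalogFile, JSON.stringify({ channels: [{ id: "telegram", preferOver: ["legacy-telegram"] }] }));
const resolvePreferOver = createPreferOverResolver(
    [catalogFile, path.join(dir, "missing.json")],
    (json) => json.channels ?? [],
);
console.log(resolvePreferOver("telegram")); // [ 'legacy-telegram' ]
console.log(resolvePreferOver("discord"));  // []
```

Because the index is built once, the number of file reads no longer depends on how many distinct channelIds are looked up.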

Bisection results confirming the trigger:

  • plugins.enabled = false -> doctor completes (auto-enable skipped entirely)
  • agents.list = [] + del(agents.defaults.model) -> doctor completes
  • agents.defaults.model alone (no agents.list models) -> doctor completes
  • agents.defaults.models catalog alone (2 entries) -> doctor completes
  • Full agents.list with per-agent models -> hangs


Labels: bug (Something isn't working), bug:crash (Process/app exits unexpectedly or hangs)
