Unknown CLI commands load plugins and leave hot child process while gateway saturates #75287

@bcdonadio

Summary

A live main source checkout is showing two related CPU/event-loop issues:

  1. Running a non-existent command such as openclaw foo does not fail fast. It loads provider plugins, loads the runtime plugin set, prints an unavailable-command error, and leaves a hot openclaw child process alive at ~90% CPU.
  2. The running gateway shows high sustained CPU, degraded event-loop diagnostics, and repeated provider plugin load bursts around log/model/session/control-ui surfaces.

The key failure mode is that invalid CLI input can trigger expensive plugin loading and leave a busy child process instead of returning a normal unknown-command error and exiting cleanly.

Environment

  • Repo: openclaw/openclaw
  • Checkout kind: source/git checkout
  • Affected local commit when reproduced: 4429ee7d2e7f6261bc5af5827e20d9566b2287da
  • origin/main at time of update to this issue: 359d871293e801dc9e5506b5002a4bf545c42662
  • Note on version: this is a live mainline source checkout on 2026-04-30. The package/CLI banner still reports 2026.4.27, so do not interpret the original report as being limited to an old 4.27 release.
  • Runtime: Node 22, systemd user gateway
  • Gateway command: node dist/index.js gateway --port 18789

Repro: non-existent CLI command

openclaw foo

Controlled repro used a separate process group so it could be cleaned up safely:

setsid bash -lc "openclaw foo >/tmp/openclaw-direct-foo.log 2>&1" &
leader=$!  # assumed: PID (and PGID) of the setsid'd bash when run from a non-interactive script
sleep 20
ps -eo pid,ppid,pgid,etime,stat,pcpu,command | awk -v pg="$leader" '$3==pg || $1==pg {print}'

After 20 seconds, the command still had a hot child process:

PID      PPID    PGID    ELAPSED STAT %CPU COMMAND
1104454  1104452 1104454 00:20   Ss    0.0 bash -lc openclaw foo ...
1104481  1104454 1104454 00:19   Sl    0.8 openclaw
1104493  1104481 1104454 00:19   Rl   93.5 openclaw

The command output showed repeated plugin loading before the unknown/unavailable command error:

Config warnings:
- plugins.entries.opik-openclaw: plugin opik-openclaw: duplicate plugin id detected; global plugin will be overridden by config plugin (/home/claw/opik-openclaw/index.ts)
Config warnings:
- plugins.entries.opik-openclaw: plugin opik-openclaw: duplicate plugin id detected; global plugin will be overridden by config plugin (/home/claw/opik-openclaw/index.ts)
[plugins] loading anthropic from .../dist/extensions/anthropic/index.js
[plugins] loading byteplus from .../dist/extensions/byteplus/index.js
[plugins] loading deepseek from .../dist/extensions/deepseek/index.js
[plugins] loading moonshot from .../dist/extensions/moonshot/index.js
[plugins] loading tencent from .../dist/extensions/tencent/index.js
[plugins] loading volcengine from .../dist/extensions/volcengine/index.js
[plugins] loading xai from .../dist/extensions/xai/index.js
[plugins] loaded 7 plugin(s) (7 attempted) in 434.6ms
[plugins] loading anthropic from .../dist/extensions/anthropic/index.js
...
[plugins] loaded 7 plugin(s) (7 attempted) in 31.3ms
[plugins] loading openclaw-honcho from /home/claw/openclaw-honcho/dist/index.js
[plugins] Honcho memory plugin loaded
[plugins] loading opik-openclaw from /home/claw/opik-openclaw/index.ts
[plugins] loading acpx from .../dist/extensions/acpx/index.js
...
[plugins] loading lossless-claw from /home/claw/.openclaw/extensions/lossless-claw/dist/index.js
plugin runtime config.loadConfig() is deprecated (runtime-config-load-write); use config.current().
[plugins] loaded 120 plugin(s) (17 attempted) in 4184.6ms
[openclaw] Failed to start CLI: Error: The `openclaw foo` command is unavailable because `plugins.allow` excludes "foo". Add "foo" to `plugins.allow` if you want that bundled plugin CLI surface.

Expected behavior for openclaw foo:

  • No provider catalog loads.
  • No runtime plugin load.
  • No background/hot child process after printing the error.
  • The error should be a normal unknown-command or unavailable-command response that exits promptly.
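
To make the expected ordering concrete, here is a minimal sketch of a dispatcher that checks the command name against a static table before touching the plugin loader. Everything in it (COMMANDS, dispatch, the needsPlugins flag, the exit code 2) is hypothetical and not taken from the openclaw source; it only illustrates "validate first, load later".

```typescript
// Hypothetical sketch: none of these names come from the openclaw codebase.
interface DispatchResult {
  exitCode: number;
  message: string;
  pluginsLoaded: boolean;
}

// Assumed static table: command name -> whether it needs the plugin runtime.
const COMMANDS: ReadonlyMap<string, { needsPlugins: boolean }> = new Map([
  ["gateway", { needsPlugins: true }],
  ["channels", { needsPlugins: false }],
  ["health", { needsPlugins: false }],
]);

function dispatch(argv: readonly string[], loadPlugins: () => void): DispatchResult {
  const name = argv[0];
  const spec = name === undefined ? undefined : COMMANDS.get(name);
  if (spec === undefined) {
    // Unknown word: fail before any provider or runtime plugin load.
    return { exitCode: 2, message: `unknown command: ${name ?? "(none)"}`, pluginsLoaded: false };
  }
  // Known command: pay for the plugin runtime only if it is actually needed.
  if (spec.needsPlugins) loadPlugins();
  return { exitCode: 0, message: `dispatching ${name}`, pluginsLoaded: spec.needsPlugins };
}
```

With a table like this, openclaw foo would return an error without ever invoking the loader, and a log-reading command could opt out of plugin loading entirely. Whether plugin-provided CLI surfaces can be listed statically like this is an open question for the real dispatcher.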

Signal note: the operator report is that Ctrl-C/SIGINT does not stop the busy-looping openclaw foo path and SIGKILL is needed. In my controlled repro, SIGINT sent to the entire test process group did stop the tree, so the exact interrupt-resistant variant may depend on how the command is launched. The important reproduced bug is that the invalid command leaves a high-CPU child alive after printing the error; it should not require SIGINT, SIGTERM, or SIGKILL cleanup at all.

Repro: channel logs / provider reloads

From the source checkout:

pnpm openclaw gateway status --deep
pnpm openclaw channels logs --lines 80
journalctl --user -u openclaw-gateway.service --since '10 minutes ago' --no-pager
ps -p $(systemctl --user show openclaw-gateway.service -p MainPID --value) -o pid,ppid,etime,pcpu,pmem,rss,stat,command
pnpm openclaw health --json

openclaw channels logs --lines 80 prints plugin load bursts before showing log lines:

[plugins] loading anthropic from .../dist/extensions/anthropic/index.js
[plugins] loading byteplus from .../dist/extensions/byteplus/index.js
[plugins] loading deepseek from .../dist/extensions/deepseek/index.js
[plugins] loading moonshot from .../dist/extensions/moonshot/index.js
[plugins] loading tencent from .../dist/extensions/tencent/index.js
[plugins] loading volcengine from .../dist/extensions/volcengine/index.js
[plugins] loading xai from .../dist/extensions/xai/index.js
[plugins] loaded 7 plugin(s) (7 attempted) in 565.2ms
[plugins] loading anthropic from .../dist/extensions/anthropic/index.js
...
[plugins] loaded 7 plugin(s) (7 attempted) in 92.8ms

Earlier in the same current-generation gateway journal, model/session surfaces triggered much larger provider-plugin reload bursts:

[plugins] loading amazon-bedrock from .../dist/extensions/amazon-bedrock/index.js
[plugins] loading amazon-bedrock-mantle from .../dist/extensions/amazon-bedrock-mantle/index.js
[plugins] loading anthropic from .../dist/extensions/anthropic/index.js
[plugins] loading anthropic-vertex from .../dist/extensions/anthropic-vertex/index.js
...
[plugins] loading zai from .../dist/extensions/zai/index.js
[plugins] loaded 49 plugin(s) (49 attempted) in 1422.3ms
[ws] ⇄ res ✓ models.list 18353ms ...
[ws] ⇄ res ✓ models.list 7590ms ...

sessions.list also showed catalog pressure:

[gateway] sessions.list continuing without model catalog after 750ms
[ws] ⇄ res ✓ sessions.list 10155ms ...
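
One way the repeated cold loads could be avoided is process-level memoization of the catalog load, so that concurrent models.list and sessions.list handlers share a single in-flight load. The sketch below is an assumption about a possible fix, not the gateway's actual code; memoizeAsync, getCatalog, and the provider list are illustrative names only.

```typescript
// Hypothetical helper: wraps an expensive async loader so concurrent and later
// callers share one load. Not taken from the openclaw codebase.
function memoizeAsync<T>(load: () => Promise<T>): () => Promise<T> {
  let inflight: Promise<T> | undefined;
  return () => {
    if (inflight === undefined) {
      // Cache the promise itself so callers arriving mid-load also share it.
      inflight = load().catch((err: unknown) => {
        inflight = undefined; // a failed load must not be cached forever
        throw err;
      });
    }
    return inflight;
  };
}

let coldLoads = 0;
const getCatalog = memoizeAsync(async () => {
  coldLoads += 1; // stands in for loading every provider plugin
  return ["anthropic", "deepseek", "xai"];
});
```

Three handlers calling getCatalog() at once would trigger one cold load instead of three. A real fix would also need an invalidation path (for example on config change), which this sketch leaves out.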

Runtime impact observed

The gateway process stayed hot even with no active agent run:

PID       ELAPSED %CPU %MEM   RSS STAT COMMAND
1101846   07:12  61.8  5.4  892284 Rsl  node dist/index.js gateway --port 18789

Short top samples showed the process around 70-86% CPU.

Gateway diagnostics reported event-loop degradation while active/waiting/queued work was zero:

[diagnostic] liveness warning: reasons=event_loop_delay,event_loop_utilization,cpu interval=30s eventLoopDelayP99Ms=2457.9 eventLoopDelayMaxMs=5469.4 eventLoopUtilization=1 cpuCoreRatio=0.921 active=0 waiting=0 queued=0
[diagnostic] liveness warning: reasons=event_loop_delay interval=30s eventLoopDelayP99Ms=1270.9 eventLoopDelayMaxMs=1598 eventLoopUtilization=0.94 cpuCoreRatio=0.818 active=0 waiting=0 queued=0

At the same time, two Control UI clients were repeatedly polling node.list, each response taking over a second and sometimes over three seconds:

[ws] ⇄ res ✓ node.list 1480ms conn=...
[ws] ⇄ res ✓ node.list 1423ms conn=...
[ws] ⇄ res ✓ node.list 3535ms conn=...
[ws] ⇄ res ✓ node.list 3538ms conn=...
[ws] ⇄ res ✓ node.list 2026ms conn=...
[ws] ⇄ res ✓ node.list 2058ms conn=...
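
If node.list is expensive to compute, a short-TTL shared result is one possible way to keep several polling clients from each paying the full cost. The following is a sketch under that assumption; cachedFor, listNodes, and the 1000 ms TTL are hypothetical, and the clock is injected only so the behavior is easy to check.

```typescript
// Hypothetical short-TTL cache so several polling clients share one computation.
function cachedFor<T>(
  ttlMs: number,
  compute: () => T,
  now: () => number = Date.now,
): () => T {
  let cached: { value: T; expiresAt: number } | undefined;
  return () => {
    const t = now();
    if (cached === undefined || t >= cached.expiresAt) {
      // Recompute only when there is no entry or the entry has expired.
      cached = { value: compute(), expiresAt: t + ttlMs };
    }
    return cached.value;
  };
}

let clock = 0; // fake clock in milliseconds, for illustration
let computes = 0;
const listNodes = cachedFor(
  1000,
  () => {
    computes += 1; // stands in for the expensive node.list work
    return ["node-a", "node-b"];
  },
  () => clock,
);
```

The trade-off is that clients may see a node list up to one TTL old, which may or may not be acceptable for the Control UI.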

A health probe still eventually returned ok: true, but took ~10s:

pnpm openclaw health --json
# durationMs: 10282
# plugins.errors: []
# channels signal/telegram/whatsapp running, whatsapp healthy/linked

Expected behavior

  • Unknown or unavailable CLI commands should fail before provider/plugin runtime loading.
  • Invalid CLI commands should not leave hot child processes alive.
  • channels logs should not require loading provider/model plugins just to print recent channel logs.
  • models.list / sessions.list / Control UI polling should not repeatedly cold-load all provider plugins on the gateway hot path.
  • Repeated UI polling should not be able to keep the main gateway loop at ~60-80% CPU when no agent run is active.
  • Event-loop delay should stay low enough for gateway health/readiness and channel handling to remain responsive.

Troubleshooting notes

  • This was reproduced after a fresh source install/update and gateway restart.
  • The configured channels remained healthy, so this is not a simple channel crash loop.
  • There is a duplicate Opik plugin warning in this environment because config intentionally overrides the global Opik plugin with a local checkout; that warning is present but does not explain the provider catalog reload bursts across model providers or the openclaw foo hot child.
  • Killing a long-running local channels logs command stopped that CLI process, but the gateway process remained CPU hot due to ongoing Control UI node.list polling and event-loop degradation.

Suspected area

Check CLI dispatch and provider/plugin catalog loading for:

  • command validation happening after plugin discovery/runtime initialization,
  • plugin CLI fallback treating arbitrary words as possible plugin commands before checking allowlists/known command tables,
  • child process lifecycle cleanup after CLI startup errors,
  • missing request-level or process-level memoization of provider plugin catalog loads,
  • expensive plugin loader calls inside models.list, sessions.list, or log/status commands,
  • duplicated Control UI polling across multiple clients/tabs,
  • synchronous work on the gateway main loop during catalog/provider discovery.
