Gateway startup race: channels fire before anthropic plugin registers claude-cli harness #71957

@WolvenRA

Summary

The gateway declares itself "ready" and fires channel startup before the anthropic plugin (which provides the claude-cli agent harness) has finished initializing. This causes immediate channel startup failure on every boot:

[gateway] ready (2 plugins: memory-core, memory-wiki; 3.6s)
[gateway/channels] channel startup failed: Error: Requested agent harness "claude-cli" 
  is not registered and PI fallback is disabled.

The anthropic plugin loads later (triggered lazily by models.list), but by then the channel has already failed. This leaves the gateway in a degraded state where followup/embedded agent dispatches can fail with the same harness-not-registered error.

Reproduction

  • Environment: WSL2 (Linux 6.6.x), OpenClaw v2026.4.24, gateway as systemd user service
  • Config: claude-cli/claude-opus-4-7 as primary model (requires anthropic plugin for harness)
  • Steps: Restart gateway (systemctl --user restart openclaw-gateway), observe logs
  • Result: Channel startup fails every time. The anthropic plugin loads 60-90+ seconds later when the Control UI triggers models.list.

Gateway log timeline (typical boot)

21:23:30 [plugins] memory-core installed bundled runtime deps in 890ms
21:23:31 [plugins] memory-wiki installed bundled runtime deps in 347ms
21:23:32 [gateway] ready (2 plugins: memory-core, memory-wiki; 3.6s)
21:23:32 [gateway/channels] channel startup failed: "claude-cli" is not registered
  ... 90+ seconds gap ...
  [models.list triggers remaining plugin init]
  [plugins] anthropic installed bundled runtime deps
  [plugins] brave installed bundled runtime deps
  ... etc ...

Impact

When the harness isn't registered at channel startup:

  1. Followup agent dispatch fails — after a CLI turn completes, the gateway tries to dispatch followup work but the harness is deregistered. All 17 fallback models fail instantly with the same error.
  2. Cascading timeout loop — if the CLI session goes quiet (e.g., while running a tool that produces no streaming output for 180s), the gateway kills it. The abort deregisters the harness, and every subsequent fallback attempt fails instantly: the gateway cycles through opus-4-7 → gpt-5.5 → gpt-5.4 → gemini → opus-4-6 → ... and all 17 candidates fail in under 1 second.
  3. User sees duplicate/lost messages — the Control UI retries sends, messages appear to vanish, and the session becomes unresponsive until a manual WS disconnect/reconnect.

Root cause

The gateway's plugin loading has two phases:

  1. Startup plugins (memory-core, memory-wiki) — loaded before "ready"
  2. Lazy plugins (anthropic, brave, google, openai, xai) — loaded on-demand, triggered by models.list

Channel startup fires immediately after phase 1, but the primary model's harness (claude-cli) is provided by the anthropic plugin in phase 2. There's no mechanism to defer channel startup until the primary model's harness is available.
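
To make the ordering problem concrete, here is a minimal TypeScript model of the two phases. Everything in it (HarnessRegistry, loadStartupPlugins, loadLazyPlugins, startChannels) is a hypothetical name invented for illustration, not an OpenClaw API; it only assumes that channel startup does a hard lookup of the primary model's harness.

```typescript
// Hypothetical model of the two-phase load; none of these names are OpenClaw APIs.
type Harness = { id: string };

class HarnessRegistry {
  private readonly harnesses = new Map<string, Harness>();

  register(harness: Harness): void {
    this.harnesses.set(harness.id, harness);
  }

  // Mirrors the failure mode in the log: an unknown harness is a hard error.
  resolve(id: string): Harness {
    const harness = this.harnesses.get(id);
    if (!harness) {
      throw new Error(`Requested agent harness "${id}" is not registered`);
    }
    return harness;
  }
}

const registry = new HarnessRegistry();

// Phase 1: plugins loaded before "ready". In this model they register no harness.
function loadStartupPlugins(): string[] {
  return ["memory-core", "memory-wiki"];
}

// Phase 2: plugins loaded on demand. In this model anthropic provides "claude-cli".
function loadLazyPlugins(): void {
  registry.register({ id: "claude-cli" });
}

function startChannels(primaryHarness: string): string {
  registry.resolve(primaryHarness); // throws if phase 2 has not run yet
  return "channels started";
}

function tryStartChannels(primaryHarness: string): string {
  try {
    return startChannels(primaryHarness);
  } catch (err) {
    return `channel startup failed: ${(err as Error).message}`;
  }
}

loadStartupPlugins();
const atReady = tryStartChannels("claude-cli"); // runs right after phase 1
loadLazyPlugins(); // e.g. triggered much later by models.list
const afterLazyLoad = tryStartChannels("claude-cli");
console.log(atReady);
console.log(afterLazyLoad);
```

In this model the first attempt fails and the second succeeds, and the only thing that changed between them is whether phase 2 had run.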

Suggested fix

Any one of:

  • Load the primary model's plugin in phase 1 — if the configured primary model requires a specific plugin (e.g., claude-cli/* requires anthropic), ensure that plugin is loaded before declaring "ready"
  • Defer channel startup — don't fire channel startup until the primary model's harness is registered, with a reasonable timeout
  • Retry channel startup — if channel startup fails due to a missing harness, retry after plugin lazy-load completes
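
As a rough illustration of the "defer channel startup" option, the sketch below waits for the primary harness to register and gives up after a timeout. WaitableHarnessRegistry, waitFor, and startChannelsWhenReady are hypothetical names, and the 20ms/30ms/1000ms values are arbitrary demo numbers; the real gateway may expose registration events differently.

```typescript
// Hypothetical sketch: these names are illustrative, not OpenClaw APIs.
class WaitableHarnessRegistry {
  private readonly registered = new Set<string>();
  private readonly waiters = new Map<string, Set<() => void>>();

  register(id: string): void {
    this.registered.add(id);
    // Wake every caller that was waiting for this harness.
    const pending = this.waiters.get(id);
    this.waiters.delete(id);
    pending?.forEach((wake) => wake());
  }

  // Resolves once `id` is registered; rejects if that takes longer than timeoutMs.
  waitFor(id: string, timeoutMs: number): Promise<void> {
    if (this.registered.has(id)) return Promise.resolve();
    return new Promise<void>((resolve, reject) => {
      const wake = (): void => {
        clearTimeout(timer);
        resolve();
      };
      const timer = setTimeout(() => {
        this.waiters.get(id)?.delete(wake); // stop waiting after the timeout
        reject(new Error(`harness "${id}" not registered after ${timeoutMs}ms`));
      }, timeoutMs);
      const pending = this.waiters.get(id) ?? new Set<() => void>();
      pending.add(wake);
      this.waiters.set(id, pending);
    });
  }
}

// Channel startup is deferred until the primary model's harness exists.
async function startChannelsWhenReady(
  registry: WaitableHarnessRegistry,
  primaryHarness: string,
  timeoutMs: number,
): Promise<string> {
  await registry.waitFor(primaryHarness, timeoutMs);
  return `channels started with ${primaryHarness}`;
}

async function demo(): Promise<string[]> {
  const registry = new WaitableHarnessRegistry();
  const results: string[] = [];

  // Simulate the lazy plugin load finishing shortly after "ready".
  setTimeout(() => registry.register("claude-cli"), 20);
  results.push(await startChannelsWhenReady(registry, "claude-cli", 1000));

  // A harness that never registers surfaces as a timeout error.
  try {
    await startChannelsWhenReady(registry, "missing-harness", 30);
    results.push("unexpected success");
  } catch (err) {
    results.push((err as Error).message);
  }
  return results;
}
```

The timeout keeps a misconfigured primary model from blocking startup forever; what the gateway should do once the timeout fires (fail the channel, or fall back) is a separate decision this sketch does not make.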

Related

The harness deregistration on CLI abort is a separate issue that amplifies this one. When a CLI session is aborted (180s no-output timeout), the claude-cli harness deregisters, and all fallback candidates fail instantly instead of being able to spawn fresh CLI processes.
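
One possible shape for avoiding that amplification, sketched under assumptions: treat the harness as a long-lived factory and make abort tear down only the session it targets. CliHarness, CliSession, and dispatch are hypothetical names for illustration; this is not how the gateway is known to be structured.

```typescript
// Hypothetical sketch; names are illustrative, not OpenClaw APIs.
interface CliSession {
  readonly id: number;
  aborted: boolean;
}

class CliHarness {
  private nextId = 1;
  private readonly sessions = new Map<number, CliSession>();

  spawn(): CliSession {
    const session: CliSession = { id: this.nextId++, aborted: false };
    this.sessions.set(session.id, session);
    return session;
  }

  // Aborting tears down one session only; the harness itself stays registered.
  abort(session: CliSession): void {
    session.aborted = true;
    this.sessions.delete(session.id);
  }
}

const harnesses = new Map<string, CliHarness>([["claude-cli", new CliHarness()]]);

function dispatch(harnessId: string): CliSession {
  const harness = harnesses.get(harnessId);
  if (!harness) {
    throw new Error(`Requested agent harness "${harnessId}" is not registered`);
  }
  return harness.spawn();
}

const first = dispatch("claude-cli");
harnesses.get("claude-cli")!.abort(first); // e.g. the 180s no-output timeout fired
const retry = dispatch("claude-cli"); // a fresh session can still be spawned
console.log(first.aborted, retry.id);
```

With registration decoupled from session lifetime, a fallback attempt after an abort would at least get the chance to spawn a fresh CLI process rather than failing on the registry lookup.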

Environment

  • OpenClaw: v2026.4.24
  • OS: WSL2 on Windows (Linux 6.6.87.2-microsoft-standard-WSL2)
  • Gateway: loopback-only, systemd user service
  • Primary model: claude-cli/claude-opus-4-7
