[Bug]: claude-cli harness registers lazily after gateway boot — silently drops inbound traffic with MissingAgentHarnessError for 96 minutes

## Environment

- **OpenClaw**: 2026.5.4 (gateway process that exhibited the bug)
- **Backend**: `cliBackends.claude-cli` pointing at Anthropic's Claude Code CLI binary
- **Channel**: Telegram (polling mode), single DM lane
- **Agent**: single `agents.list[main]` with model `claude-cli/claude-opus-4-7`, fallbacks `claude-cli/claude-sonnet-4-6`, `claude-cli/claude-haiku-4-5`
- **Plugins enabled**: anthropic, browser, canvas, device-pair, file-transfer, memory-core, phone-control, slack, talk-voice, telegram (10 total)
- **Host**: single-VPS Linux deployment, no container

## TL;DR

After a gateway restart, the `claude-cli` agent harness was not present in the dispatch registry for ~96 minutes despite the gateway reporting all plugins healthy and listening. Every inbound user message during that window failed with `MissingAgentHarnessError`, and the user got a generic "Something went wrong" reply. The state self-healed on the first subsequent inbound — strongly suggesting the harness registers *lazily* on first dispatch attempt rather than synchronously at boot, and that the registration race won at 16:57 but lost at 18:33 for some intermediate reason (or vice versa). Either way: no log line warned that a critical harness was missing, and the gateway considered itself healthy throughout.

## Symptom

After gateway restart at `T=00:00:08` (process up and listening), every inbound Telegram message for the next 96 minutes failed identically:

```
[diagnostic] message dispatch completed:
  channel=telegram sessionId=unknown
  sessionKey=agent:main:telegram:direct:<user>
  source=replyResolver outcome=error duration=17696ms
  error="MissingAgentHarnessError: Requested agent harness "claude-cli" is not registered."
[diagnostic] message processed:
  channel=telegram chatId=<chat> messageId=<id>
  sessionId=unknown sessionKey=agent:main:telegram:direct:<user>
  outcome=error duration=19833ms
  error="MissingAgentHarnessError: Requested agent harness "claude-cli" is not registered."
[telegram] dispatch failed: MissingAgentHarnessError:
  Requested agent harness "claude-cli" is not registered.
```

Each failure took ~17–19s of CPU time before erroring (suggests the dispatch path was retrying or waiting on something internal before giving up). Telegram still sent the canned "Something went wrong while processing your request. Please try again." back to the user for each — so from the user's perspective the bot looked online and responsive, just broken.

The gateway, meanwhile, was reporting itself healthy:

```
[gateway] http server listening (10 plugins: anthropic, browser, canvas,
  device-pair, file-transfer, memory-core, phone-control, slack, talk-voice,
  telegram; 6.5s)
```

No warn/error line was emitted about the missing harness.

## Timeline (UTC, anonymized; 4 messages affected)

```
T=00:00:00  prior gateway shutdown errored (separate issue, filed as peer)
T=00:00:08  new gateway PID listening, 10 plugins reported, healthy
T=00:00:36  inbound message #1  → MissingAgentHarnessError, 19s, canned reply sent
T=00:02:36  inbound message #2  → MissingAgentHarnessError, 19s, canned reply sent
T=00:24:02  inbound message #3  → MissingAgentHarnessError, 19s, canned reply sent
T=01:36:15  inbound message #4  → MissingAgentHarnessError, 19s, canned reply sent
T=01:36:47  inbound message #5  → SUCCESS, cli-backend cold-started
T=01:36:56  [agent/cli-backend] cli exec: provider=claude-cli model=opus
            promptChars=447 trigger=user useResume=false session=none
            resumeSession=none reuse=none historyPrompt=none
T=01:37:16  [agent/cli-backend] claude live session turn: provider=claude-cli
            model=claude-opus-4-7 durationMs=19782 rawLines=53
```

Total user-visible outage: 96 minutes. 4 user messages got the canned error before the gateway healed itself.

## What went wrong

Two distinct problems, both worth fixing:

### 1. Harness registration is lazy / racy, not synchronous-at-boot

The gateway reports `http server listening (10 plugins: ..., 6.5s)` *before* the `claude-cli` agent harness has actually registered into the dispatch registry. Dispatch then races: if the first inbound arrives before registration completes, dispatch hits the "harness not registered" branch and the user sees an error. After 96 minutes of dispatches failing this way on this host, the next inbound succeeded — which means *something* about that 5th dispatch (or the slot of time it ran in) finally caused the registration to complete. The `cli exec ... session=none resumeSession=none reuse=none` log line on the first successful exec confirms it was a *cold* cli-backend start, consistent with the registry having been empty until that moment.

The exact race condition needs investigation — possible candidates: harness `register()` waiting on an async secret-resolution that never completed on the first 4 dispatches, or a one-shot retry that exhausted retries on each dispatch without registering. But the fix is the same regardless: the gateway should not declare itself ready to dispatch until the registry is populated with every declared agent harness.

### 2. Missing-harness condition is silent

The dispatch path knows perfectly well, on each failed attempt, that an *expected* harness (declared in `agents.list[main]`) is missing. But it only logs the per-message dispatch error — there is no:

- boot-time invariant check that fails the gateway start if a declared harness fails to register
- periodic health-line that reports "registered harnesses: [...]" so operators can spot a missing one
- distinct warn/error message on missing-harness vs other dispatch errors

The user-facing canned reply also doesn't distinguish "transient network glitch" from "gateway is structurally broken." Both look identical to the user, so the operator has to be reading journals to discover the outage exists.

## Ask

1. **Fix lazy registration.** Harness registration should be synchronous (or properly awaited) during plugin init, so that `http server listening` only fires after the dispatch registry can resolve every declared agent. Treat a declared-but-unregistered harness as a fatal boot error.

2. **Boot-time invariant check.** Even after #1, add a startup assertion that compares declared agents (from `agents.list`) against the registry — fail fast at boot if any declared harness failed to register.

3. **Loud logging on missing-harness dispatch.** When dispatch hits a `MissingAgentHarnessError` for a harness that's in `agents.list` but not in the registry, escalate the log to a warn/error level with a distinctive prefix (e.g. `[gateway] CRITICAL: declared harness "claude-cli" missing from registry on dispatch`). This is a different class of failure from a user typo'ing an agent name.

4. **Distinguish user-facing error.** Either keep the canned "Something went wrong" message but include a follow-up operator-facing alert (Telegram DM to the deployment owner / system-bus event / etc.), or upgrade the message to "The gateway is unhealthy; check journals" so a human gets a clear signal vs. just retrying into the same wall.

## Related

- Co-occurred 2026-05-24 with #86226 (`synthetic-auth.runtime.js` shutdown SyntaxError). Together they produced the 96-minute outage: the SyntaxError fired during the prior gateway's shutdown, and the new gateway then came up with this lazy-registration bug, so by the time the dispatch path saw user traffic the registry was empty.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Bug]: claude-cli harness registers lazily after gateway boot — silently drops inbound traffic with MissingAgentHarnessError for 96 minutes #86227

Environment

TL;DR

Symptom

Timeline (UTC, anonymized; 4 messages affected)

What went wrong

1. Harness registration is lazy / racy, not synchronous-at-boot

2. Missing-harness condition is silent

Ask

Related

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

[Bug]: claude-cli harness registers lazily after gateway boot — silently drops inbound traffic with MissingAgentHarnessError for 96 minutes #86227

Description

Environment

TL;DR

Symptom

Timeline (UTC, anonymized; 4 messages affected)

What went wrong

1. Harness registration is lazy / racy, not synchronous-at-boot

2. Missing-harness condition is silent

Ask

Related

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions