TL;DR
After a gateway restart, the claude-cli agent harness was not present in the dispatch registry for ~96 minutes despite the gateway reporting all plugins healthy and listening. Every inbound user message during that window failed with MissingAgentHarnessError, and the user got a generic "Something went wrong" reply. The state self-healed on the next inbound message after that window, which strongly suggests that the harness registers lazily on a dispatch attempt rather than synchronously at boot, and that the registration race was lost at 16:57 and won at 18:33 for a reason that is still undetermined. Either way: no log line warned that a critical harness was missing, and the gateway considered itself healthy throughout.
Symptom
After gateway restart at T=00:00:08 (process up and listening), every inbound Telegram message for the next 96 minutes failed identically:
[diagnostic] message dispatch completed:
channel=telegram sessionId=unknown
sessionKey=agent:main:telegram:direct:<user>
source=replyResolver outcome=error duration=17696ms
error="MissingAgentHarnessError: Requested agent harness "claude-cli" is not registered."
[diagnostic] message processed:
channel=telegram chatId=<chat> messageId=<id>
sessionId=unknown sessionKey=agent:main:telegram:direct:<user>
outcome=error duration=19833ms
error="MissingAgentHarnessError: Requested agent harness "claude-cli" is not registered."
[telegram] dispatch failed: MissingAgentHarnessError:
Requested agent harness "claude-cli" is not registered.
Each failure took roughly 17–20s before erroring (duration=17696ms and duration=19833ms in the lines above), which suggests the dispatch path was retrying or waiting on something internal before giving up. The Telegram channel still sent the canned "Something went wrong while processing your request. Please try again." back to the user for each, so from the user's perspective the bot looked online and responsive, just broken.
The gateway, meanwhile, was reporting itself healthy: it had logged http server listening (10 plugins: ..., 6.5s), and no warn/error line was emitted about the missing harness.
Total user-visible outage: 96 minutes. Four user messages got the canned error before the gateway healed itself.
What went wrong
Two distinct problems, both worth fixing:
1. Harness registration is lazy / racy, not synchronous-at-boot
The gateway reports http server listening (10 plugins: ..., 6.5s) before the claude-cli agent harness has actually registered into the dispatch registry. Dispatch then races: if an inbound message arrives before registration completes, dispatch hits the "harness not registered" branch and the user sees an error. After 96 minutes of dispatches failing this way on this host, the next inbound succeeded, which means something about that 5th dispatch (or the moment it ran in) finally caused the registration to complete. The cli exec ... session=none resumeSession=none reuse=none log line on the first successful exec confirms it was a cold cli-backend start, consistent with the registry having been empty until that moment.
The exact race condition needs investigation. Possible candidates: harness register() waiting on an async secret-resolution that never completed on the first 4 dispatches, or a bounded retry that ran out of attempts on each dispatch without ever registering. But the fix is the same regardless: the gateway should not declare itself ready to dispatch until the registry is populated with every declared agent harness.
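One way to read that requirement is as a readiness gate: await every harness initializer, verify the registry, and only then start listening. The following is a minimal TypeScript sketch under stated assumptions. HarnessRegistry, withTimeout, and startGateway are hypothetical names invented for illustration, not the gateway's actual API, and the 5000ms default is arbitrary.

```typescript
// Hypothetical sketch: none of these names come from the gateway's real code.
type Harness = { id: string };

class HarnessRegistry {
  private readonly harnesses = new Map<string, Harness>();

  register(harness: Harness): void {
    this.harnesses.set(harness.id, harness);
  }

  has(id: string): boolean {
    return this.harnesses.has(id);
  }
}

// Rejects if `work` does not settle within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Awaits every harness initializer, then checks the registry against the
// declared ids before calling `listen`. A missing harness aborts the boot.
async function startGateway(
  declared: string[],
  initializers: Array<(registry: HarnessRegistry) => Promise<void>>,
  listen: () => void,
  timeoutMs = 5000,
): Promise<HarnessRegistry> {
  const registry = new HarnessRegistry();
  await withTimeout(
    Promise.all(initializers.map((init) => init(registry))),
    timeoutMs,
    "harness registration",
  );
  const missing = declared.filter((id) => !registry.has(id));
  if (missing.length > 0) {
    throw new Error(`declared harnesses not registered: ${missing.join(", ")}`);
  }
  listen(); // only now is it safe to accept inbound messages
  return registry;
}
```

With this shape, a harness whose registration hangs on an unresolved dependency fails the boot after timeoutMs instead of leaving the gateway listening with an empty registry.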
2. Missing-harness condition is silent
The dispatch path knows perfectly well, on each failed attempt, that an expected harness (declared in agents.list[main]) is missing. But it only logs the per-message dispatch error — there is no:
boot-time invariant check that fails the gateway start if a declared harness fails to register
periodic health-line that reports "registered harnesses: [...]" so operators can spot a missing one
distinct warn/error message on missing-harness vs other dispatch errors
The user-facing canned reply also doesn't distinguish "transient network glitch" from "gateway is structurally broken." Both look identical to the user, so the operator has to be reading journals to discover the outage exists.
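The periodic health line mentioned above could be as small as a diff between the declared and the registered harness ids. A sketch, assuming both are available as string arrays. The function name and the line format are illustrative only and are not taken from the gateway.

```typescript
// Hypothetical helper; the name and the output format are illustrative only.
function harnessHealthLine(
  declared: readonly string[],
  registered: readonly string[],
): string {
  // Declared ids that the registry cannot currently resolve.
  const missing = declared.filter((id) => registered.indexOf(id) === -1);
  const base = `[gateway] registered harnesses: [${registered.slice().sort().join(", ")}]`;
  return missing.length === 0
    ? `${base} (all declared harnesses present)`
    : `${base} MISSING: [${missing.join(", ")}]`;
}

// State during the outage: one declared harness, empty registry.
console.log(harnessHealthLine(["claude-cli"], []));
// prints: [gateway] registered harnesses: [] MISSING: [claude-cli]
```

Emitted on an interval, a line like this would have put the missing harness in the journal within one period of the restart rather than after 96 minutes.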
Ask
Fix lazy registration. Harness registration should be synchronous (or properly awaited) during plugin init, so that http server listening only fires after the dispatch registry can resolve every declared agent. Treat a declared-but-unregistered harness as a fatal boot error.
Boot-time invariant check. Even after the lazy-registration fix above, add a startup assertion that compares declared agents (from agents.list) against the registry, and fail fast at boot if any declared harness failed to register.
Loud logging on missing-harness dispatch. When dispatch hits a MissingAgentHarnessError for a harness that's in agents.list but not in the registry, escalate the log to a warn/error level with a distinctive prefix (e.g. [gateway] CRITICAL: declared harness "claude-cli" missing from registry on dispatch). This is a different class of failure from a user mistyping an agent name.
Distinguish user-facing error. Either keep the canned "Something went wrong" message but include a follow-up operator-facing alert (Telegram DM to the deployment owner / system-bus event / etc.), or upgrade the message to "The gateway is unhealthy; check journals" so a human gets a clear signal vs. just retrying into the same wall.
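The boot-time check and the louder dispatch log asked for above both depend on the same input: the list of declared harness ids. A sketch, assuming agents.list can be reduced to an array of ids such as "claude-cli". Both function names and the "unknown agent harness" wording are hypothetical. The CRITICAL prefix is the one proposed in this issue.

```typescript
// Hypothetical sketch; function names are not taken from the gateway's code.

// Boot-time invariant: throw if any declared harness is absent from the registry.
function assertDeclaredHarnessesRegistered(
  declared: readonly string[],
  registered: readonly string[],
): void {
  const missing = declared.filter((id) => registered.indexOf(id) === -1);
  if (missing.length > 0) {
    throw new Error(`boot aborted: declared harness(es) not registered: ${missing.join(", ")}`);
  }
}

type LogLine = { level: "warn" | "error"; message: string };

// Dispatch-time classification: a declared harness that is missing is a
// structural fault; an undeclared one is more likely a mistyped agent name.
function describeMissingHarness(requested: string, declared: readonly string[]): LogLine {
  if (declared.indexOf(requested) !== -1) {
    return {
      level: "error",
      message: `[gateway] CRITICAL: declared harness "${requested}" missing from registry on dispatch`,
    };
  }
  return {
    level: "warn",
    message: `[gateway] unknown agent harness "${requested}" requested`,
  };
}
```

Keeping the classification separate from the logger call means the same result could also drive the operator-facing alert described in the last item.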
Environment
cliBackends.claude-cli pointing at Anthropic's Claude Code CLI binary
agents.list[main] with model claude-cli/claude-opus-4-7, fallbacks claude-cli/claude-sonnet-4-6, claude-cli/claude-haiku-4-5
Related
synthetic-auth.runtime.js shutdown SyntaxError). Together they produced the 96-minute outage: the SyntaxError fired during the prior gateway's shutdown, and the new gateway then came up with this lazy-registration bug, so by the time the dispatch path saw user traffic the registry was empty.