Summary
For agents using a CLI backend (e.g. claude-cli/sonnet), every successful turn poisons the persisted session entry with a bare model alias ('sonnet', 'haiku', 'opus'). On the next request the bare alias is sent literally to api.anthropic.com → HTTP 404 model_not_found. A single 404 puts the auth profile into cooldown, blocking subsequent retries on Anthropic. Replies fail or take 20–30 s as the failover walks through the chain.
The bug is the interaction of three places in the codebase:
1. `src/agents/cli-runner.ts:298` writes the bare alias to `agentMeta.model`:
const modelId = (params.model ?? "default").trim() || "default"; // line 77 → bare alias
...
agentMeta: { ..., model: modelId, ... } // line 298
2. `src/auto-reply/reply/session-usage.ts:54` (`persistSessionUsageUpdate`) saves that alias to `sessions.json`:
model: params.modelUsed ?? entry.model
3. `src/agents/model-fallback.ts:172` adds the primary candidate to the failover list without alias resolution, so the bare alias is sent on the wire:
addCandidate({ provider, model }, false); // not alias-resolved
while the fallbacks correctly go through `resolveModelRefFromString` (lines 188–197).
The asymmetry in (3) is the root cause; (1) is what feeds it; (2) is what makes it persistent across requests.
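To make the interaction concrete, here is a minimal sketch of the three steps as a toy model. The function names (`runCliTurn`, `persistUsage`, `primaryCandidate`) and the `SessionEntry` shape are hypothetical stand-ins, not OpenClaw's actual code; only the two quoted expressions are taken from the snippets above.

```typescript
// Hypothetical, simplified model of the three steps described in this report.
interface SessionEntry {
  modelProvider?: string;
  model?: string;
}

// Step 1 (cli-runner.ts): the runner reports whatever model string it was given.
function runCliTurn(params: { provider: string; model?: string }): { provider: string; model: string } {
  const modelId = (params.model ?? "default").trim() || "default";
  return { provider: params.provider, model: modelId };
}

// Step 2 (session-usage.ts): the reported model overwrites the stored one.
function persistUsage(entry: SessionEntry, used: { provider: string; model?: string }): SessionEntry {
  return { ...entry, modelProvider: used.provider, model: used.model ?? entry.model };
}

// Step 3 (model-fallback.ts): the primary candidate is built without alias resolution.
// Assumption: "claude-cli" maps to the canonical provider "anthropic".
function primaryCandidate(entry: SessionEntry): string {
  const provider =
    entry.modelProvider === "claude-cli" ? "anthropic" : (entry.modelProvider ?? "anthropic");
  return `${provider}/${entry.model ?? "default"}`;
}

const afterFirstTurn = persistUsage({}, runCliTurn({ provider: "claude-cli", model: "sonnet" }));
const nextPrimary = primaryCandidate(afterFirstTurn);
console.log(afterFirstTurn, nextPrimary);
```

In this toy model the second request ends up with `anthropic/sonnet` as its primary candidate, which matches the `requested=anthropic/sonnet` line in the logs.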
Code word: lobster-biscuit
Steps to reproduce
- Configure `agents.list[main].model.primary = "claude-cli/sonnet"` (the published Pattern B CLI shell-out).
- Have `agents.defaults.models` declare aliases, e.g. `"anthropic/claude-sonnet-4-20250514": { "alias": "sonnet" }`.
- Send a message → agent replies successfully.
- Inspect the session entry in `~/.openclaw/agents/<id>/sessions/sessions.json`: `model` is now `'sonnet'`, `modelProvider` is `'claude-cli'`.
- Send a follow-up on the same session.
- The follower path / failover path resolves provider to `anthropic` (canonical for `claude-cli`) and model `sonnet` literally → primary candidate becomes `anthropic/sonnet` → 404 → auth profile cooldown.
Expected
agentMeta.model should be the canonical full model ID, or the primary candidate in resolveFallbackCandidates should go through resolveModelRefFromString like the fallbacks already do. Either fix prevents the cascade.
Actual
Cascading 404s → auth profile cooldown → "Embedded agent failed before reply: All models failed" → silent reply drop.
Environment
- OpenClaw version: 2026.4.26 (be8c246)
- Node: v25.2.1
- OS: macOS Darwin 25.3.0 (Apple Silicon)
- Install method: `npm install -g openclaw`
- Channel where observed: Slack (Socket Mode), `main` agent on `claude-cli/sonnet`
Logs / evidence
Persisted session entry after one successful Slack reply (the failure seed):
```
"agent:main:slack:channel:CXXXXXXX": {
...
"modelProvider": "claude-cli",
"model": "sonnet", ← bare alias persisted by session-usage.ts:54
...
}
```
Subsequent request:
```
[agent/embedded] embedded run agent end: runId=85a1c67a... isError=true
model=sonnet provider=anthropic
error=HTTP 404 not_found_error: model: sonnet
rawError=404 {"type":"error","error":{"type":"not_found_error","message":"model: sonnet"}}
[agent/embedded] auth profile failure state updated: runId=85a1c67a... profile=sha256:05de... provider=anthropic
reason=model_not_found window=cooldown reused=false
[model-fallback/decision] decision=candidate_failed requested=anthropic/sonnet candidate=anthropic/sonnet
reason=model_not_found providerErrorType=not_found_error next=anthropic/haiku detail=model: sonnet
Embedded agent failed before reply: All models failed (1):
anthropic/sonnet: Provider anthropic is in cooldown (all profiles unavailable) (model_not_found)
```
Adjacent issue: an auth profile being moved to cooldown on `model_not_found` (a config bug, not an auth failure) compounds the impact. After a single alias 404, the profile is locked out for the cooldown window. Worth a separate filing if not already known.
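One way to express the distinction is a small classifier that decides whether a failure reason says anything about the credential. This is only a sketch: the `FailureReason` union and `shouldCooldownProfile` are hypothetical, and `model_not_found` is the only reason that actually appears in the logs above.

```typescript
// Hypothetical failure reasons; only "model_not_found" is taken from the logs.
type FailureReason = "auth_invalid" | "rate_limited" | "model_not_found";

// Assumption: only failures that reflect on the credential (or its quota)
// should put the auth profile into cooldown.
function shouldCooldownProfile(reason: FailureReason): boolean {
  switch (reason) {
    case "auth_invalid":
    case "rate_limited":
      return true;
    case "model_not_found":
      // The request named a model that does not exist; the profile itself is fine.
      return false;
  }
}

console.log(shouldCooldownProfile("model_not_found"));
```

Under that assumption, the alias 404 would still fail its own candidate but would not block later same-provider retries.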
Impact
High. Every successful CLI-backend reply seeds a future 404 on the same session, making Slack/iMessage replies feel unreliable. Auth profile cooldown widens the blast radius to also block legitimate same-provider retries.
Workarounds
- Local rewrite watchdog — a 60s LaunchAgent reads `sessions.json`, resolves bare-alias `model` fields against the alias map, atomic-rewrites to canonical IDs. Preserves session continuity. Low-cost, fully external to OpenClaw.
- Wipe alias entries — bare deletion works but loses Slack thread context, `cliSessionIds`, etc. Not durable (re-infects on next reply).
- Drop `alias` declarations from `agents.defaults.models` — untested; may break other paths that legitimately read aliases.
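The core of the first workaround can be sketched as a pure function over the parsed `sessions.json`. The type shapes and the helper names `buildAliasMap` and `canonicalizeSessions` are assumptions for illustration; the real file carries many more fields, and this sketch assumes the canonical ID to store is the part of the configured reference after the provider prefix.

```typescript
// Hypothetical shapes: the real sessions.json carries many more fields.
type SessionEntry = { modelProvider?: string; model?: string; [key: string]: unknown };
type Sessions = Record<string, SessionEntry>;
// Mirrors agents.defaults.models: canonical "provider/model" -> { alias }.
type ModelsConfig = Record<string, { alias?: string }>;

// Build alias -> canonical model ID (the part after "provider/").
function buildAliasMap(models: ModelsConfig): Map<string, string> {
  const map = new Map<string, string>();
  for (const [ref, meta] of Object.entries(models)) {
    const slash = ref.indexOf("/");
    if (meta.alias && slash > 0) {
      map.set(meta.alias.toLowerCase(), ref.slice(slash + 1));
    }
  }
  return map;
}

// Return a copy of the sessions with bare aliases replaced by canonical IDs.
// Entries that need no change are reused as-is, so all other fields survive.
function canonicalizeSessions(
  sessions: Sessions,
  aliases: Map<string, string>,
): { sessions: Sessions; changed: number } {
  let changed = 0;
  const out: Sessions = {};
  for (const [key, entry] of Object.entries(sessions)) {
    const canonical =
      typeof entry.model === "string" ? aliases.get(entry.model.toLowerCase()) : undefined;
    if (canonical && canonical !== entry.model) {
      out[key] = { ...entry, model: canonical };
      changed++;
    } else {
      out[key] = entry;
    }
  }
  return { sessions: out, changed };
}
```

A watchdog built on this would read the file, call `canonicalizeSessions`, and only when `changed > 0` write the result to a temporary file and rename it over the original, so a concurrent reader never sees a half-written file.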
Suggested fix
In `src/agents/model-fallback.ts:172`, replace the bare `addCandidate({ provider, model }, false)` with a call through ``resolveModelRefFromString({ raw: `${provider}/${model}`, defaultProvider, aliasIndex })``, mirroring the fallback path at lines 188–197. This makes the primary candidate symmetric with fallbacks and self-heals the persisted-alias case.
A complementary fix in `cli-runner.ts:298` to report the canonical model ID in `agentMeta.model` (using `normalizedModel` plus the alias index) would also stop the persistence side at the source — but the model-fallback fix alone is sufficient.
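A rough sketch of the symmetric version follows. Everything here is a stand-in: the `ModelRef` and `AliasIndex` types, the body of `resolveModelRefFromString`, and `resolveCandidates` are assumptions modeled on the call shape quoted above, not the real implementations. In particular, it assumes the resolver treats the model part of `provider/model` as an alias key.

```typescript
// Hypothetical stand-ins for the real helpers; signatures are assumptions.
type ModelRef = { provider: string; model: string };
type AliasIndex = Map<string, ModelRef>;

function resolveModelRefFromString(args: {
  raw: string;
  defaultProvider: string;
  aliasIndex: AliasIndex;
}): ModelRef {
  const raw = args.raw.trim();
  const slash = raw.indexOf("/");
  const provider = slash > 0 ? raw.slice(0, slash) : args.defaultProvider;
  const model = slash > 0 ? raw.slice(slash + 1) : raw;
  // A bare alias in the model slot resolves to its canonical reference.
  return args.aliasIndex.get(model.toLowerCase()) ?? { provider, model };
}

function resolveCandidates(primary: ModelRef, fallbacks: string[], aliasIndex: AliasIndex): ModelRef[] {
  const seen = new Set<string>();
  const out: ModelRef[] = [];
  const addCandidate = (ref: ModelRef): void => {
    const key = `${ref.provider}/${ref.model}`;
    if (!seen.has(key)) {
      seen.add(key);
      out.push(ref);
    }
  };
  // The primary now takes the same resolution path as the fallbacks.
  addCandidate(
    resolveModelRefFromString({
      raw: `${primary.provider}/${primary.model}`,
      defaultProvider: primary.provider,
      aliasIndex,
    }),
  );
  for (const raw of fallbacks) {
    addCandidate(resolveModelRefFromString({ raw, defaultProvider: primary.provider, aliasIndex }));
  }
  return out;
}
```

With an alias index mapping `sonnet` to `anthropic/claude-sonnet-4-20250514`, a persisted primary of `anthropic/sonnet` resolves to the canonical reference before it reaches the wire, and a fallback that resolves to the same model is deduplicated.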