[Bug]: cli-runner reports bare alias as agentMeta.model; persisted to sessions.json; primary candidate not alias-resolved → cascading 404 + auth profile cooldown #73657

@hashslingers

Summary

For agents using a CLI backend (e.g. claude-cli/sonnet), every successful turn poisons the persisted session entry with a bare model alias ('sonnet', 'haiku', 'opus'). On the next request the bare alias is sent literally to api.anthropic.com → HTTP 404 model_not_found. A single 404 puts the auth profile into cooldown, blocking subsequent retries on Anthropic. Replies fail or take 20–30 s as the failover walks through the chain.

The bug is the interaction of three places in the codebase:

  1. src/agents/cli-runner.ts:298 writes the bare alias to agentMeta.model:
    const modelId = (params.model ?? "default").trim() || "default";   // line 77 → bare alias
    ...
    agentMeta: { ..., model: modelId, ... }                            // line 298
  2. src/auto-reply/reply/session-usage.ts:54 (persistSessionUsageUpdate) saves that alias to sessions.json:
    model: params.modelUsed ?? entry.model
  3. src/agents/model-fallback.ts:172 adds the primary candidate to the failover list without alias resolution, so the bare alias is sent on the wire:
    addCandidate({ provider, model }, false);   // not alias-resolved
    while the fallbacks correctly go through resolveModelRefFromString (lines 188–197).

The asymmetry in (3) is the root cause; (1) is what feeds it; (2) is what makes it persistent across requests.
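
The asymmetry can be illustrated with a small, self-contained sketch. This is not the OpenClaw source: `AliasIndex`, `resolveRef`, and `buildCandidates` are hypothetical stand-ins for the real alias index, `resolveModelRefFromString`, and the candidate-building logic, and the alias entry mirrors the one used in the reproduction steps.

```typescript
// Hypothetical stand-ins; the real types and signatures in model-fallback.ts may differ.
type ModelRef = { provider: string; model: string };
type AliasIndex = Map<string, ModelRef>; // alias -> canonical ref

const aliasIndex: AliasIndex = new Map([
  ["sonnet", { provider: "anthropic", model: "claude-sonnet-4-20250514" }],
]);

// Stand-in for resolveModelRefFromString: expands a known alias in the model
// position, otherwise passes the ref through unchanged.
function resolveRef(raw: string, defaultProvider: string, index: AliasIndex): ModelRef {
  const slash = raw.indexOf("/");
  const provider = slash === -1 ? defaultProvider : raw.slice(0, slash);
  const model = slash === -1 ? raw : raw.slice(slash + 1);
  return index.get(model) ?? { provider, model };
}

// Mirrors the reported behavior: the primary is added as-is, fallbacks are resolved.
function buildCandidates(primary: ModelRef, fallbacks: string[], index: AliasIndex): ModelRef[] {
  const candidates: ModelRef[] = [primary]; // not alias-resolved
  for (const raw of fallbacks) {
    candidates.push(resolveRef(raw, primary.provider, index));
  }
  return candidates;
}

const candidates = buildCandidates(
  { provider: "anthropic", model: "sonnet" }, // as read back from sessions.json
  ["sonnet"],
  aliasIndex,
);
// candidates[0] keeps the bare alias; candidates[1] carries the canonical model ID.
```

In this sketch the same alias produces two different wire values depending only on whether it arrives as the primary or as a fallback.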

Code word: lobster-biscuit

Steps to reproduce

  1. Configure agents.list[main].model.primary = "claude-cli/sonnet" (the published Pattern B CLI shell-out).
  2. Have agents.defaults.models declare aliases, e.g. "anthropic/claude-sonnet-4-20250514": { "alias": "sonnet" }.
  3. Send a message → agent replies successfully.
  4. Inspect the session entry in ~/.openclaw/agents/<id>/sessions/sessions.json: model is now 'sonnet', modelProvider is 'claude-cli'.
  5. Send a follow-up on the same session.
  6. The follower/failover path resolves the provider to anthropic (canonical for claude-cli) but passes the model sonnet through literally → primary candidate becomes anthropic/sonnet → 404 → auth profile cooldown.

Expected

agentMeta.model should be the canonical full model ID, or the primary candidate in resolveFallbackCandidates should go through resolveModelRefFromString like the fallbacks already do. Either fix prevents the cascade.

Actual

Cascading 404s → auth profile cooldown → "Embedded agent failed before reply: All models failed" → silent reply drop.

Environment

  • OpenClaw version: 2026.4.26 (be8c246)
  • Node: v25.2.1
  • OS: macOS Darwin 25.3.0 (Apple Silicon)
  • Install method: `npm install -g openclaw`
  • Channel where observed: Slack (Socket Mode), `main` agent on `claude-cli/sonnet`

Logs / evidence

Persisted session entry after one successful Slack reply (the failure seed):
```
"agent:main:slack:channel:CXXXXXXX": {
...
"modelProvider": "claude-cli",
"model": "sonnet", ← bare alias persisted by session-usage.ts:54
...
}
```

Subsequent request:
```
[agent/embedded] embedded run agent end: runId=85a1c67a... isError=true
model=sonnet provider=anthropic
error=HTTP 404 not_found_error: model: sonnet
rawError=404 {"type":"error","error":{"type":"not_found_error","message":"model: sonnet"}}

[agent/embedded] auth profile failure state updated: runId=85a1c67a... profile=sha256:05de... provider=anthropic
reason=model_not_found window=cooldown reused=false

[model-fallback/decision] decision=candidate_failed requested=anthropic/sonnet candidate=anthropic/sonnet
reason=model_not_found providerErrorType=not_found_error next=anthropic/haiku detail=model: sonnet

Embedded agent failed before reply: All models failed (1):
anthropic/sonnet: Provider anthropic is in cooldown (all profiles unavailable) (model_not_found)
```

Adjacent issue: moving an auth profile to cooldown on `model_not_found` (a config bug, not an auth failure) compounds the impact. After a single alias 404, the profile is locked out for the cooldown window. Worth a separate filing if not already known.
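
One way to picture the adjacent fix is a reason-based gate in front of the cooldown. The sketch below is purely illustrative: only `model_not_found` appears in the logs above, and the other reason names, the `FailureReason` type, and `shouldCooldownProfile` are assumptions, not OpenClaw identifiers.

```typescript
// Hypothetical failure reasons; only "model_not_found" is taken from the logs.
type FailureReason = "model_not_found" | "auth_error" | "rate_limited";

// Reasons that say something about the credential or quota rather than the request.
const COOLDOWN_REASONS: ReadonlySet<FailureReason> = new Set<FailureReason>([
  "auth_error",
  "rate_limited",
]);

// A request-level config error should fail the candidate without
// penalizing the auth profile that carried it.
function shouldCooldownProfile(reason: FailureReason): boolean {
  return COOLDOWN_REASONS.has(reason);
}
```

With a gate like this, the 404 would still fail the `anthropic/sonnet` candidate, but the next candidate on the same provider could run immediately.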

Impact

High. Every successful CLI-backend reply seeds a future 404 on the same session, making Slack/iMessage replies feel unreliable. Auth profile cooldown widens the blast radius to also block legitimate same-provider retries.

Workarounds

  1. Local rewrite watchdog: a LaunchAgent on a 60 s interval reads `sessions.json`, resolves bare-alias `model` fields against the alias map, and atomically rewrites them to canonical IDs. Preserves session continuity. Low-cost, fully external to OpenClaw.
  2. Wipe alias entries: deleting them outright works but loses Slack thread context, `cliSessionIds`, etc. Not durable (re-infects on next reply).
  3. Drop `alias` declarations from `agents.defaults.models`: untested; may break other paths that legitimately read aliases.
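
For reference, the core of workaround 1 can be sketched in a few lines of Node. The sketch assumes that `sessions.json` is a flat map from session key to entry (as the excerpt above suggests), that only the `model` field needs rewriting while `modelProvider` is left alone, and that the canonical ID is the part after the `provider/` prefix. All function and type names are hypothetical.

```typescript
import { readFileSync, renameSync, writeFileSync } from "node:fs";

type SessionEntry = { model?: string; modelProvider?: string; [key: string]: unknown };
type SessionStore = Record<string, SessionEntry>;
// Shape of agents.defaults.models from the reproduction steps: canonical ref -> { alias }.
type ModelsConfig = Record<string, { alias?: string }>;

// Build alias -> canonical model ID (the part after "provider/").
function buildAliasMap(models: ModelsConfig): Map<string, string> {
  const map = new Map<string, string>();
  for (const [ref, cfg] of Object.entries(models)) {
    if (cfg.alias) map.set(cfg.alias, ref.slice(ref.indexOf("/") + 1));
  }
  return map;
}

// Rewrites bare aliases in place and returns how many entries changed.
function canonicalizeSessions(store: SessionStore, aliases: Map<string, string>): number {
  let changed = 0;
  for (const entry of Object.values(store)) {
    const canonical = entry.model ? aliases.get(entry.model) : undefined;
    if (canonical) {
      entry.model = canonical;
      changed++;
    }
  }
  return changed;
}

// Write to a temp file in the same directory, then rename over the original,
// so a reader never observes a half-written sessions.json.
function rewriteSessionsFile(path: string, models: ModelsConfig): number {
  const store = JSON.parse(readFileSync(path, "utf8")) as SessionStore;
  const changed = canonicalizeSessions(store, buildAliasMap(models));
  if (changed > 0) {
    const tmp = `${path}.tmp-${process.pid}`;
    writeFileSync(tmp, JSON.stringify(store, null, 2));
    renameSync(tmp, path);
  }
  return changed;
}
```

The rename keeps the write atomic, but it does not protect against OpenClaw writing the file between the read and the rename, so an update made in that window can be lost.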

Suggested fix

In `src/agents/model-fallback.ts:172`, replace the bare `addCandidate({ provider, model }, false)` with a call through ``resolveModelRefFromString({ raw: `${provider}/${model}`, defaultProvider, aliasIndex })``, mirroring the fallback path at lines 188–197. This makes the primary candidate symmetric with fallbacks and self-heals the persisted-alias case.

A complementary fix in `cli-runner.ts:298` to report the canonical model ID in `agentMeta.model` (using `normalizedModel` plus the alias index) would also stop the persistence side at the source — but the model-fallback fix alone is sufficient.
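
A rough sketch of the symmetric version is below. It is written against stand-ins rather than the real code: `resolveRef` only imitates the argument shape of the `resolveModelRefFromString` call quoted above, `buildCandidatesFixed` is a hypothetical name, and the duplicate check is an assumption about what `addCandidate` does.

```typescript
type ModelRef = { provider: string; model: string };
type ResolveArgs = {
  raw: string;
  defaultProvider: string;
  aliasIndex: Map<string, ModelRef>; // alias -> canonical ref
};

// Stand-in for resolveModelRefFromString; the real implementation may differ.
function resolveRef({ raw, defaultProvider, aliasIndex }: ResolveArgs): ModelRef {
  const slash = raw.indexOf("/");
  const provider = slash === -1 ? defaultProvider : raw.slice(0, slash);
  const model = slash === -1 ? raw : raw.slice(slash + 1);
  return aliasIndex.get(model) ?? { provider, model };
}

// Primary and fallbacks go through the same resolution step.
function buildCandidatesFixed(
  primary: ModelRef,
  fallbacks: string[],
  aliasIndex: Map<string, ModelRef>,
): ModelRef[] {
  const seen = new Set<string>();
  const out: ModelRef[] = [];
  const add = (ref: ModelRef): void => {
    const key = `${ref.provider}/${ref.model}`;
    if (seen.has(key)) return; // assumption: duplicates are skipped
    seen.add(key);
    out.push(ref);
  };

  add(resolveRef({
    raw: `${primary.provider}/${primary.model}`,
    defaultProvider: primary.provider,
    aliasIndex,
  }));
  for (const raw of fallbacks) {
    add(resolveRef({ raw, defaultProvider: primary.provider, aliasIndex }));
  }
  return out;
}
```

Because a canonical ID passes through the resolver unchanged, sessions that already hold a full model ID would be unaffected, while sessions poisoned with a bare alias would resolve correctly on their next request.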
