fix(models): provider auth pre-warm #85125
Conversation
0acb0f6 to
2c791b9
Compare
|
Codex review: needs real behavior proof before merge. Workflow note: Future ClawSweeper reviews update this same comment in place. How this review workflow works
Summary Reproducibility: yes. Current main has the 10-second singleton prepared-auth state and scope guards that can make PR rating Rank-up moves:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics. Real behavior proof Mantis proof suggestion Risk before merge
Maintainer options:
Next step before merge Security Review findings
Review detailsBest possible solution: Keep the Do we have a high-confidence way to reproduce the issue? Yes. Current main has the 10-second singleton prepared-auth state and scope guards that can make Is this the best way to solve the issue? Not yet. The Label justifications:
Full review comments:
Overall correctness: patch is incorrect What I checked:
Likely related people:
Codex review notes: model gpt-5.5, reasoning high; reviewed against e399a92e6cc9. |
|
ClawSweeper PR egg 🎁 Pass real behavior proof to wake the egg and unlock a hatchable treat. Where did the egg go?
|
2c791b9 to
8735f04
Compare
…y by agent ID instead of agent dir
8735f04 to
5ed8a63
Compare
|
@clawsweeper review |
|
🦞🧹 I asked ClawSweeper to review this item again. Re-review progress:
|
|
Re: the P1 "Preserve agent-scoped auth in picker flows" finding — I don't think this is a real concern. Walked every production caller of No production caller passes
Same path, same answer. The picker's own If a future caller ever needs agent-scoped picker auth, the fix is additive: thread |
When markAuthProfileFailure observes an auth failure at request time (token rotated, OAuth revoke, etc.), fire a hook that clears the prepared provider-auth map so the next model-listing call recomputes against the real auth state. Single mutable hook slot wired up at gateway startup; no TTL or polling. Addresses ClawSweeper's P1 freshness finding on #85125 without reintroducing the TTL.
|
@clawsweeper review |
|
🦞🧹 I asked ClawSweeper to review this item again. Re-review progress:
|
…e visible Adds a chokidar watcher on every configured agent's auth-profiles.json. Any change fires clearCurrentProviderAuthState so the next model-listing call recomputes against the on-disk auth state. Closes the stale-FALSE direction (user adds auth via codex login, hand-edit, etc.) that the auth-failure hook can't catch on its own.
4b1e017 to
8ad489a
Compare
|
@clawsweeper review |
|
🦞🧹 I asked ClawSweeper to review this item again. Re-review progress:
|
…nvalidate Addresses three ClawSweeper findings on the fs-watcher commit: - [P1] auth-profile watcher now handles chokidar 'error' events (logs + closes once) mirroring the gateway config-reload pattern. Without this, an unhandled error from chokidar can crash the gateway. - [P2] auth-profile watcher handle is pushed into postReadySidecars so stopPostReadySidecarsAfterCloseStarted closes it on gateway shutdown. - [P2] auth-failure and file-change invalidation paths now schedule a background rewarm (with a 'reason=' log line). Without this, the next /models call after an invalidation paid the slow per-provider path until the next reload. The warmer's existing generation counter handles concurrent rewarms safely.
|
@clawsweeper review |
|
🦞🧹 I asked ClawSweeper to review this item again. Re-review progress:
|
When markAuthProfileFailure observes an auth failure at request time (token rotated, OAuth revoke, etc.), fire a hook that clears the prepared provider-auth map so the next model-listing call recomputes against the real auth state. Single mutable hook slot wired up at gateway startup; no TTL or polling. Addresses ClawSweeper's P1 freshness finding on #85125 without reintroducing the TTL.
When markAuthProfileFailure observes an auth failure at request time (token rotated, OAuth revoke, etc.), fire a hook that clears the prepared provider-auth map so the next model-listing call recomputes against the real auth state. Single mutable hook slot wired up at gateway startup; no TTL or polling. Addresses ClawSweeper's P1 freshness finding on openclaw#85125 without reintroducing the TTL.
When markAuthProfileFailure observes an auth failure at request time (token rotated, OAuth revoke, etc.), fire a hook that clears the prepared provider-auth map so the next model-listing call recomputes against the real auth state. Single mutable hook slot wired up at gateway startup; no TTL or polling. Addresses ClawSweeper's P1 freshness finding on openclaw#85125 without reintroducing the TTL.
When markAuthProfileFailure observes an auth failure at request time (token rotated, OAuth revoke, etc.), fire a hook that clears the prepared provider-auth map so the next model-listing call recomputes against the real auth state. Single mutable hook slot wired up at gateway startup; no TTL or polling. Addresses ClawSweeper's P1 freshness finding on openclaw#85125 without reintroducing the TTL.
When markAuthProfileFailure observes an auth failure at request time (token rotated, OAuth revoke, etc.), fire a hook that clears the prepared provider-auth map so the next model-listing call recomputes against the real auth state. Single mutable hook slot wired up at gateway startup; no TTL or polling. Addresses ClawSweeper's P1 freshness finding on openclaw#85125 without reintroducing the TTL.
When markAuthProfileFailure observes an auth failure at request time (token rotated, OAuth revoke, etc.), fire a hook that clears the prepared provider-auth map so the next model-listing call recomputes against the real auth state. Single mutable hook slot wired up at gateway startup; no TTL or polling. Addresses ClawSweeper's P1 freshness finding on openclaw#85125 without reintroducing the TTL.
When markAuthProfileFailure observes an auth failure at request time (token rotated, OAuth revoke, etc.), fire a hook that clears the prepared provider-auth map so the next model-listing call recomputes against the real auth state. Single mutable hook slot wired up at gateway startup; no TTL or polling. Addresses ClawSweeper's P1 freshness finding on openclaw#85125 without reintroducing the TTL.
When markAuthProfileFailure observes an auth failure at request time (token rotated, OAuth revoke, etc.), fire a hook that clears the prepared provider-auth map so the next model-listing call recomputes against the real auth state. Single mutable hook slot wired up at gateway startup; no TTL or polling. Addresses ClawSweeper's P1 freshness finding on openclaw#85125 without reintroducing the TTL.
When markAuthProfileFailure observes an auth failure at request time (token rotated, OAuth revoke, etc.), fire a hook that clears the prepared provider-auth map so the next model-listing call recomputes against the real auth state. Single mutable hook slot wired up at gateway startup; no TTL or polling. Addresses ClawSweeper's P1 freshness finding on openclaw#85125 without reintroducing the TTL.
When markAuthProfileFailure observes an auth failure at request time (token rotated, OAuth revoke, etc.), fire a hook that clears the prepared provider-auth map so the next model-listing call recomputes against the real auth state. Single mutable hook slot wired up at gateway startup; no TTL or polling. Addresses ClawSweeper's P1 freshness finding on openclaw#85125 without reintroducing the TTL.
When markAuthProfileFailure observes an auth failure at request time (token rotated, OAuth revoke, etc.), fire a hook that clears the prepared provider-auth map so the next model-listing call recomputes against the real auth state. Single mutable hook slot wired up at gateway startup; no TTL or polling. Addresses ClawSweeper's P1 freshness finding on openclaw#85125 without reintroducing the TTL.
When markAuthProfileFailure observes an auth failure at request time (token rotated, OAuth revoke, etc.), fire a hook that clears the prepared provider-auth map so the next model-listing call recomputes against the real auth state. Single mutable hook slot wired up at gateway startup; no TTL or polling. Addresses ClawSweeper's P1 freshness finding on openclaw#85125 without reintroducing the TTL.
Fix auth-config pre-warm
Untitled.mp4
The gateway pre-warms a "prepared provider auth" map so every
/modelscall (Discord, Telegram, CLI, status, pickers) can skip the per-provider auth-discovery sweep that costs ~22 s on a typical 30-provider config. Three things were wrong with the existing pre-warm:1. TTL. The prepared map carried a 10 s TTL — any read more than 10 s after the warm completed fell through to the slow compute path. With warm taking ~50 s and
/modelstypically used minutes apart, the map was useful for ~10 s and inert thereafter. The TTL was added with the intent of bounding staleness if auth state changes outside the gateway, but with no auto-rewarm it just guaranteed the perf benefit applied to almost no real call. Removed entirely. The map is invalidated explicitly on the events that actually change the answer — config reload, plugin reload, auth-profile logout, and now any observed auth-profile failure (see below). With no TTL, the prepared map is valid for the gateway's whole lifetime; freshness is event-driven rather than time-bounded.2. workspaceDir scope mismatch. The warmer derived
workspaceDirviaresolveDefaultAgentWorkspaceDir()(lands on~/.openclaw/workspace). The actual call path foragentId="main"derives it viaresolveAgentWorkspaceDir(cfg, "main")(lands on the repo-local.openclaw/workspaceon dev checkouts). Scope check failed → fell through every time, even though the map was populated. Fixed: warmer now usesresolveAgentWorkspaceDir(cfg, agentId), matching the call path. workspaceDir is no longer stored on the prepared state — it's recomputed at read time from(cfg, agentId)since both inputs are already matched.3. Default-agent only. The prepared state was a single value scoped to the default agent. Calls for any non-default
agentIdalways missed and paid the full price. Fixed: replaced withMap<agentId, PreparedProviderAuthState>populated by iteratinglistAgentIds(cfg)at warm time. Catalog load is shared across agents, so per-extra-agent cost is just the ~18 s auth sweep, not the full ~50 s.API cleanup:
hasAuthForModelProviderandcreateProviderAuthCheckernow takeagentId?: stringinstead ofagentDir?: string. agentId is the canonical identifier; agentDir is a derived path and could theoretically collide across agents via config override. Slow path derivesagentDir = resolveAgentDir(cfg, agentId)internally when it needs the filesystem store. Callers (commands-models.ts,model-catalog-visibility.ts,flows/model-picker.ts) updated to passagentId.Self-heal on observed auth failure
ClawSweeper's P1 finding flagged that removing the TTL leaves the prepared map stale if auth state changes outside the gateway's own reload/logout paths (token rotation, OAuth revoke, billing failure). The follow-up commit
961b5bc63faddresses this without reintroducing the TTL:auth-profiles/usage.tsexposes a single mutable hook slot —setAuthProfileFailureHook(hook).markAuthProfileFailurecalls the hook (try/catched) on both success paths after recording the failure.clearCurrentProviderAuthState. Any observed auth failure now clears the prepared map, forcing the next/modelscall to recompute against the real auth state.Set/registry — single-purpose seam between two modules that intentionally don't statically import each other.This covers the stale TRUE direction (map says auth, reality says none — the user-visible case where Discord shows a model that 401s when picked).
Watch auth-profiles.json for external changes
Closes the stale FALSE direction (user added auth externally —
codex loginfrom another shell, hand-edit, etc. — and the gateway hasn't observed any request through that profile yet, so the auth-failure hook never fires). Newsrc/agents/auth-profiles-watcher.tsuses chokidar (already a dep, same options asgateway/config-reload.ts) to watch<agentDir>/auth-profiles.jsonfor every configured agent. Any change event firesclearCurrentProviderAuthState, so the next/modelscall recomputes against the on-disk state. Wired up in gateway startup right after the failure hook.Together with the failure hook, the prepared map now self-heals both directions without a TTL: the events that change the answer (failed auth call, file write) explicitly invalidate; otherwise the map is valid for the gateway's whole lifetime.
Real behavior proof — hot path
Behavior addressed:
/models(Discord, Telegram, CLI) latency on a real openclaw gateway. Before this PR the call took ~22 s for every invocation because the prepared auth map was either expired (10 s TTL) or scope-mismatched and every call fell back to the per-provider auth-discovery sweep. After this PR the call hits the warmed map and returns in single-digit ms.Real environment tested: local openclaw dev gateway on macOS, single machine, same config, same plugin set, captured back-to-back. Catalog at capture time: 955 model entries across ~30 unique providers, 40 providers warmed.
Exact steps or command run after this patch:
Evidence after fix —
/tmp/openclaw/openclaw-2026-05-20.log:BEFORE (HEAD before this PR, no prepared-state hits — 10 s TTL expired and falling through):
Matching gateway liveness diagnostic captured during the BEFORE run shows the call blocks the event loop:
The 24.5 s blocked window aligns with the ~22.9 s
buildModelsProviderDataruntime. Discord's 3 s ack deadline runs on the same loop, so the deferral never gets sent → "This interaction failed."AFTER (this PR's prepared-state with workspace/agentId fixes — gateway restarted with the branch applied, warmer ran at startup,
/modelsexercised 12×):Observed result after fix:
/modelsafter restart/modelsDiscord behavior: ack lands inside the 3 s deadline on every interaction; no more "This interaction failed." Event-loop liveness warnings (p99 ~24 s) gone.
Real behavior proof — file-change invalidation (fs watcher)
Captured against the latest build on
sjf/discord-picker-pagination(fc49e59147). Gateway restarted, startup warm completed, then~/.openclaw/agents/main/agent/auth-profiles.jsonwas hand-edited three times in ~30 s (edit, delete, re-create) to exercise the chokidar watcher. Each event clears the prepared map and schedules a background re-warm./tmp/openclaw/openclaw-2026-05-21.log:Re-warm is ~3× faster than the cold pre-warm because the model catalog is reused across warms — only the per-agent auth-discovery sweep re-runs.
Note on chokidar
unlink: deletingauth-profiles.jsonproduces a normal change/unlink event, not an error event. The watcher'serrorhandler is for transient OS-level failures (lost fsevents stream, permission errors); a deliberatermis handled cleanly and the watcher picks the file up again on re-create.What was not tested in this gateway run: the auth-failure self-heal hook under live API traffic (requires a real outbound call that 401s — covered by unit tests at
src/agents/auth-profiles.markauthprofilefailure.test.ts); clean gateway-shutdown sequence (the watcher handle is now inpostReadySidecarssostopPostReadySidecarsAfterCloseStartedcloses it, but I didn't capture a shutdown log in this run); fs-watcher invalidation under high write churn.Verification
pnpm test src/agents/model-provider-auth.test.ts— 5/5 pass.pnpm test src/agents/auth-profiles.markauthprofilefailure.test.ts— 13/13 pass (added 2 tests covering the hook fires + survives a throwing listener)../run.sh build) clean (exit 0).Discord picker pagination is on the companion PR #85138.