fix(auth-profiles): exclude format rejections from profile cooldown #77280
Codex review: needs maintainer review before merge.
Summary
Reproducibility: yes. Source inspection shows current main classifies the reported 400/422 request-shape rejection as
Next step before merge
Security
Review details
Best possible solution: Land the resolver, test, and changelog change after normal maintainer review while leaving the broader transcript placeholder and session-repair work tracked in #77228.
Do we have a high-confidence way to reproduce the issue? Yes. Source inspection shows current main classifies the reported 400/422 request-shape rejection as
Is this the best way to solve the issue? Yes. Filtering
What I checked:
Likely related people:
Remaining risk / open question:
Codex review notes: model gpt-5.5, reasoning high; reviewed against 89db1e5440f5.
040ef6b to 3c7f49b
A format-classified failure means the provider rejected the request payload shape (e.g. an assistant-prefill 400 when a session transcript ends with a stream-error placeholder turn). That is a per-session transcript-shape problem, not a profile-wide reliability signal. Exclude the reason alongside the existing transport-timeout exclusion so that a single bad session no longer cascades into a profile cooldown that takes down every other healthy session sharing the same auth profile or, when all profiles share the same fault, the whole provider for the backoff window.

Refs openclaw#77228. This addresses the cascading-cooldown amplifier only. The other two items in the same issue (the prefill placeholder leaving transcripts ending in assistant, and the auto-repair filling the JSONL with null-role entries) are separate failure modes and remain open.
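As a rough illustration of the exclusion this commit message describes, the resolver could look something like the sketch below. Only the name `resolveAuthProfileFailureReason` and the reason strings come from this PR; the parameter shape, the type definitions, and the `NON_POISONING_REASONS` set are assumptions made for the example and may not match the repository.

```typescript
// Hypothetical, simplified types. The real definitions live in the repository
// and may contain more members than shown here.
type FailoverReason =
  | "auth"
  | "billing"
  | "rate_limit"
  | "overloaded"
  | "timeout"
  | "format"
  | "unknown";

type FailurePolicy = "shared" | "local";

// Assumed helper: reasons that describe the request or the transport rather
// than the health of the auth profile itself.
const NON_POISONING_REASONS: ReadonlySet<FailoverReason> =
  new Set<FailoverReason>(["timeout", "format"]);

// Returns the reason that should be recorded against the auth profile, or
// null when the failure must not count toward profile cooldown.
function resolveAuthProfileFailureReason(params: {
  failoverReason: FailoverReason | null;
  policy?: FailurePolicy;
}): FailoverReason | null {
  // Helper-local runs never affect shared auth health.
  if (params.policy === "local") return null;
  if (params.failoverReason === null) return null;
  // Transport timeouts and request-shape rejections are per-request or
  // per-session problems, so they are not persisted as profile failures.
  if (NON_POISONING_REASONS.has(params.failoverReason)) return null;
  return params.failoverReason;
}
```

In this sketch a caller would skip `markAuthProfileFailure` whenever the resolver returns `null`, while still surfacing the original error to the user.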
3c7f49b to f4188b4
Merged via squash.
Thanks @openperf!
…ainer-hardening

* origin/main: (843 commits)
  docs(changelog): relocate openclaw#77046 and openclaw#77280 entries from 2026.5.3 to Unreleased (openclaw#77728)
  docs: reorder unreleased changelog
  fix: expose ollama thinking profile before activation (openclaw#77617) (thanks @yfge)
  fix: expose ollama thinking profile before activation
  test(gateway): preserve dispatch timers in waiter
  test(gateway): keep startup context timer live
  docs: document cache-friendly activity helper
  ci: install ffmpeg for Mantis media previews
  fix: avoid impossible device token rotation advice (openclaw#77688) (thanks @Conan-Scott)
  docs(changelog): note doctor device pairing advice fix
  fix(doctor): avoid impossible device token rotation advice
  ci: use Crabbox media previews for Mantis
  docs: filter maintainer-owned triage noise
  test: cover GitHub activity helper
  fix(session-file-repair): drop null-role message entries instead of preserving them (openclaw#77288)
  fix: prune orphan session artifacts
  perf: reduce GitHub activity cache misses
  fix: cache session list model resolution (openclaw#77650) (thanks @ragesaq)
  ci: embed Mantis desktop previews
  fix(replay-history): drop trailing stream-error placeholder before provider send (openclaw#77287)
  ...

# Conflicts:
# CHANGELOG.md
…penclaw#77280)
Merged via squash.
Prepared head SHA: f4188b4
Co-authored-by: openperf <80630709+openperf@users.noreply.github.com>
Co-authored-by: openperf <80630709+openperf@users.noreply.github.com>
Reviewed-by: @openperf

…rom 2026.5.3 to Unreleased (openclaw#77728)
Merged via squash.
Prepared head SHA: 1bd228f
Co-authored-by: openperf <80630709+openperf@users.noreply.github.com>
Co-authored-by: openperf <80630709+openperf@users.noreply.github.com>
Reviewed-by: @openperf
Summary
Problem: A single session-specific request-shape rejection from a provider takes down every other healthy session sharing the same auth profile and, when all configured profiles for a provider share the same fault, locks out the entire provider for the configured backoff window. The reporter in #77228 (Session corruption: prefill error cascades into provider cooldown + repair makes it worse) saw "Provider github-copilot is in cooldown (all profiles unavailable)" persist for 42+ minutes after a single 400 ("This model does not support assistant message prefill. The conversation must end with a user message.") caused by one corrupted session whose transcript ended in a stream-error placeholder assistant turn. Healthy sessions on the same profile were unable to make any provider call during the entire window.
Root Cause:
src/agents/pi-embedded-runner/run/auth-profile-failure-policy.ts:5-14 resolves which FailoverReasons should be persisted as auth-profile health signals. Today it excludes policy === "local" (helper-local runs) and failoverReason === "timeout" (transport timeouts), both annotated as "should not poison shared provider auth health". A format-classified failure (src/agents/pi-embedded-helpers/errors.ts:710-727: a 400/422 whose payload couldn't be reclassified as auth/billing/rate-limit/etc.) is also a non-poisonous signal: it means the provider rejected the request payload shape, which is per-session and per-transcript, not a profile-wide reliability problem. However, it is currently passed through as failoverReason and reaches markAuthProfileFailure at src/agents/auth-profiles/usage.ts:649. Inside computeNextProfileUsageStats (usage.ts:539-642), format runs through calculateAuthProfileCooldownMs (usage.ts:363-372) just like rate_limit/overloaded, producing 30s → 60s → 5min capped backoff, and crucially without the model scoping that rate_limit gets (usage.ts:637: cooldownModel is only set for rate_limit). So one bad transcript in one session repeatedly hits the same 400, the post-cooldown retry hits the same 400, the cooldown re-lengthens to its 5-min cap, and back-to-back cycles produce the 42-min provider-wide outage observed in the report. Other sessions on the same profile, with valid transcripts, are blocked the entire time.
Fix: Add failoverReason === "format" to the existing exclusion list in resolveAuthProfileFailureReason. This is the single chokepoint through which markAuthProfileFailure learns about run-time failovers in pi-embedded-runner/run.ts (call sites at :1858, :2005, :2506, :2615 all funnel through resolveRunAuthProfileFailureReason at :872). When a session's transcript shape is rejected, the rejection still surfaces to the user via the existing FailoverError and the run still logs the failure, but the auth-profile cooldown machinery is no longer triggered. The bad session continues to fail (a separate, per-session repair concern explicitly tracked as the other two open items in #77228), but other sessions on the same profile keep working, and the provider is no longer killed for everyone for the cooldown window.
What changed:
src/agents/pi-embedded-runner/run/auth-profile-failure-policy.ts: extend the existing policy === "local" / timeout exclusion guard to also cover failoverReason === "format". Comment expanded to document why a request-shape rejection is per-session, not profile-wide.
src/agents/pi-embedded-runner/run/auth-profile-failure-policy.test.ts: add a format-rejection case (with and without policy: "shared"), asserting the resolver returns null so markAuthProfileFailure is never called.
CHANGELOG.md: single Fixes line under Unreleased referencing the issue with non-closing Refs syntax.
What did NOT change (scope boundary):
src/agents/pi-embedded-helpers/errors.ts); 400/422 schema rejections still classify as format and still surface to the user as a FailoverError.
markAuthProfileFailure / computeNextProfileUsageStats / calculateAuthProfileCooldownMs. Profile-cooldown semantics for legitimately profile-poisoning reasons (auth, auth_permanent, billing, rate_limit, overloaded, model_not_found, unknown, …) are untouched, so the existing behavior verified by src/agents/auth-profiles.markauthprofilefailure.test.ts is preserved.
src/agents/stream-message-shared.ts:STREAM_ERROR_FALLBACK_TEXT) that ends up in the transcript and is the upstream root of the prefill 400. Stripping or rewriting it for prefill-strict providers is a distinct and provider-specific concern; the existing sentinel design was deliberately chosen for Bedrock Converse compatibility (stream-message-shared.ts:75-90).
Reproduction
STREAM_ERROR_FALLBACK_TEXT) followed by a blank/empty user turn that itself fails to produce a usable user message; the provider then sees a payload ending in an assistant message.
"400 This model does not support assistant message prefill ... must end with a user message.", the failure classifies as format, markAuthProfileFailure fires for the profile, the auth profile enters cooldown.
Risk / Mitigation
format from profile cooldown make schema problems invisible? No. The format failure still surfaces to the user via FailoverError, still logs through the existing run-warn path, and still appears in failureCounts if and when the resolver does return non-null on other code paths. We are removing only the profile-cooldown side effect, not the diagnostic surface.
Mitigation: covered by the existing markAuthProfileFailure test suite continuing to pass. Those tests do not assert anything about format, since format was not previously a documented profile-poisoning reason; the reporter's symptom shows it was de-facto poisoning behavior, not by design.
format errors across many independent sessions and many profiles, and the operator would still see them as FailoverErrors in logs and per-turn UI. Profile-cooldown was never the right mitigation for a provider-side schema bug; failover/fallback chain configuration is.
Mitigation: behavior matches the existing precedent for timeout exclusion (transport timeouts also surface to logs and UI but do not poison auth health).
replay_invalid and schema (in the ProviderRuntimeFailureKind taxonomy at errors.ts:258-277). Both already collapse into format through classifyFailoverClassificationFromMessage. They are covered by this single change.
Mitigation: the test asserts resolveAuthProfileFailureReason returns null for format regardless of policy, locking the contract.
resolveAuthProfileFailureReason means there is exactly one path to audit.
Change Type (select all)
Scope (select all touched areas)
Linked Issue/PR
Refs #77228 — addresses the cascading profile cooldown amplifier only. The two other items in the same issue (the stream-error placeholder leaving transcripts ending in assistant, and the auto-repair amplification that fills the JSONL with 935+ null-role entries) are independent failure modes; they remain open and out of scope here so they can be tracked and shipped separately.
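To make the cascade described under Root Cause more concrete, here is a small, self-contained simulation. It is only a sketch under stated assumptions: the 30s, 60s, 5min schedule is taken from the description above, but `cooldownMs`, `ProfileHealth`, `simulate`, the 10-second tick, and the retry behavior are all hypothetical and are not the repository's `calculateAuthProfileCooldownMs` or `computeNextProfileUsageStats`.

```typescript
// Hypothetical model of one auth profile shared by a broken session and a
// healthy session. Names and timing are assumptions for illustration only.
type FailoverReason = "rate_limit" | "timeout" | "format";

// Assumed backoff schedule: 30s, then 60s, then capped at 5min.
function cooldownMs(consecutiveFailures: number): number {
  if (consecutiveFailures <= 1) return 30_000;
  if (consecutiveFailures === 2) return 60_000;
  return 300_000;
}

class ProfileHealth {
  private failures = 0;
  private cooldownUntil = 0;
  private readonly excluded: ReadonlySet<FailoverReason>;

  constructor(excluded: ReadonlySet<FailoverReason>) {
    this.excluded = excluded;
  }

  isAvailable(now: number): boolean {
    return now >= this.cooldownUntil;
  }

  // Records a failure unless its reason is excluded from profile health.
  recordFailure(reason: FailoverReason, now: number): void {
    if (this.excluded.has(reason)) return;
    this.failures += 1;
    this.cooldownUntil = now + cooldownMs(this.failures);
  }
}

// Every 10 seconds the broken session retries (and always gets the same
// format rejection), then the healthy session tries to use the same profile.
function simulate(
  excluded: FailoverReason[],
  durationMs: number,
): { blocked: number; served: number } {
  const profile = new ProfileHealth(new Set(excluded));
  let blocked = 0;
  let served = 0;
  for (let now = 0; now < durationMs; now += 10_000) {
    if (profile.isAvailable(now)) {
      profile.recordFailure("format", now);
    }
    if (profile.isAvailable(now)) {
      served += 1;
    } else {
      blocked += 1;
    }
  }
  return { blocked, served };
}

const beforeFix = simulate([], 600_000);
const afterFix = simulate(["format"], 600_000);
console.log({ beforeFix, afterFix });
```

In this toy model the healthy session never gets a turn while `format` counts toward cooldown, because the broken session re-triggers the backoff each time the profile becomes available. With `format` excluded, the healthy session is served on every tick while the broken session keeps failing on its own.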