
fix(auth-profiles): exclude format rejections from profile cooldown#77280

Merged
openperf merged 3 commits into openclaw:main from
openperf:fix/77228-format-failure-no-profile-cooldown
May 5, 2026

Conversation

@openperf
Member

@openperf openperf commented May 4, 2026

Summary

  • Problem: A single session-specific request-shape rejection from a provider takes down every other healthy session sharing the same auth profile and, when all configured profiles for a provider share the same fault, locks out the entire provider for the configured backoff window. The reporter in #77228 ("Session corruption: prefill error cascades into provider cooldown + repair makes it worse") saw "Provider github-copilot is in cooldown (all profiles unavailable)" persist for 42+ minutes after a single 400 ("This model does not support assistant message prefill. The conversation must end with a user message.") caused by one corrupted session whose transcript ended in a stream-error placeholder assistant turn. Healthy sessions on the same profile could not make any provider call during the entire window.

  • Root Cause: src/agents/pi-embedded-runner/run/auth-profile-failure-policy.ts:5-14 resolves which FailoverReasons should be persisted as auth-profile health signals. Today it excludes policy === "local" (helper-local runs) and failoverReason === "timeout" (transport timeouts) — both annotated as "should not poison shared provider auth health". A format-classified failure (src/agents/pi-embedded-helpers/errors.ts:710-727: a 400/422 whose payload couldn't be reclassified as auth/billing/rate-limit/etc.) is also a non-poisonous signal — it means the provider rejected the request payload shape, which is per-session and per-transcript, not a profile-wide reliability problem — but it is currently passed through as failoverReason and reaches markAuthProfileFailure at src/agents/auth-profiles/usage.ts:649. Inside computeNextProfileUsageStats (usage.ts:539-642), format runs through calculateAuthProfileCooldownMs (usage.ts:363-372) just like rate_limit / overloaded, producing 30s → 60s → 5min capped backoff, and crucially without the model scoping that rate_limit gets (usage.ts:637: cooldownModel is only set for rate_limit). So one bad transcript in one session repeatedly hits the same 400, the post-cooldown retry hits the same 400, the cooldown re-lengthens to its 5-min cap, and back-to-back cycles produce the 42-min provider-wide outage observed in the report. Other sessions on the same profile, with valid transcripts, are blocked the entire time.

  • Fix: Add failoverReason === "format" to the existing exclusion list in resolveAuthProfileFailureReason. This is the single chokepoint through which markAuthProfileFailure learns about run-time failovers in pi-embedded-runner/run.ts (call sites at :1858, :2005, :2506, :2615 all funnel through resolveRunAuthProfileFailureReason at :872). When a session's transcript shape is rejected, the rejection still surfaces to the user via the existing FailoverError and the run still logs the failure, but the auth-profile cooldown machinery is no longer triggered. The bad session continues to fail; that is a separate, per-session repair concern tracked as the other two open items in #77228. Other sessions on the same profile keep working, and the provider is no longer taken down for everyone for the cooldown window.

  • What changed:

    • src/agents/pi-embedded-runner/run/auth-profile-failure-policy.ts — extend the existing policy === "local" / timeout exclusion guard to also cover failoverReason === "format". Comment expanded to document why a request-shape rejection is per-session, not profile-wide.
    • src/agents/pi-embedded-runner/run/auth-profile-failure-policy.test.ts — add a format-rejection case (with and without policy: "shared"), asserting the resolver returns null so markAuthProfileFailure is never called.
    • CHANGELOG.md — single Fixes line under Unreleased referencing the issue with non-closing Refs syntax.
  • What did NOT change (scope boundary):

    • No changes to the failure-reason classification (src/agents/pi-embedded-helpers/errors.ts); 400/422 schema rejections still classify as format and still surface to the user as a FailoverError.
    • No changes to markAuthProfileFailure / computeNextProfileUsageStats / calculateAuthProfileCooldownMs. Profile-cooldown semantics for legitimately profile-poisoning reasons (auth, auth_permanent, billing, rate_limit, overloaded, model_not_found, unknown, …) are untouched, so the existing behavior verified by src/agents/auth-profiles.markauthprofilefailure.test.ts is preserved.
    • No changes to the streaming-error placeholder (src/agents/stream-message-shared.ts:STREAM_ERROR_FALLBACK_TEXT) that ends up in the transcript and is the upstream root of the prefill 400. Stripping or rewriting it for prefill-strict providers is a distinct and provider-specific concern; the existing sentinel design was deliberately chosen for Bedrock Converse compatibility (stream-message-shared.ts:75-90).
    • No changes to the auto-repair routine that (per the report) rewrites the transcript with 935+ null-role entries on the second turn. That is a separate failure mode in the session-file repair path and deserves its own narrowly scoped PR.
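
For reviewers who want the shape of the guard without opening the diff, here is a minimal sketch of the exclusion logic described above. It is not the code in auth-profile-failure-policy.ts: the type names, the parameter object, and the `Sketch` suffix are assumptions made for illustration, and the reason list is abbreviated.

```typescript
// Hypothetical, simplified types; the real definitions live in the repository.
type FailoverReason =
  | "auth"
  | "billing"
  | "rate_limit"
  | "overloaded"
  | "timeout"
  | "format"
  | "unknown";
type FailurePolicy = "shared" | "local";

// Reasons that describe a single request or transport attempt,
// not the health of the shared auth profile.
const NON_POISONING_REASONS: ReadonlySet<FailoverReason> = new Set<FailoverReason>([
  "timeout", // transport timeouts (already excluded before this PR)
  "format", // per-session request-shape rejections (added by this PR)
]);

// Returns the reason that should be persisted as a profile health signal,
// or null when nothing should be recorded against the profile.
function resolveAuthProfileFailureReasonSketch(params: {
  failoverReason: FailoverReason | null;
  policy?: FailurePolicy;
}): FailoverReason | null {
  if (params.policy === "local") return null; // helper-local runs never poison
  if (!params.failoverReason) return null; // nothing to record
  if (NON_POISONING_REASONS.has(params.failoverReason)) return null;
  return params.failoverReason;
}

// A caller would only mark the profile when the resolver returns non-null.
const reason = resolveAuthProfileFailureReasonSketch({
  failoverReason: "format",
  policy: "shared",
});
console.log(reason === null ? "profile untouched" : `mark profile: ${reason}`);
```

Keeping the exclusions in one set means a reviewer auditing which reasons skip profile cooldown has one place to look, which mirrors the chokepoint argument made in the Fix bullet.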

Reproduction

  1. Start an agent on a prefill-strict provider/model (e.g. github-copilot/claude-opus-4.6 in the report; any provider that rejects "conversation must end with a user message" works) and configure two or more sessions on the same auth profile (e.g. session A and session B).
  2. In session A, drive the transcript into the failure shape: an assistant turn that errored out before producing content gets persisted as the final entry (STREAM_ERROR_FALLBACK_TEXT) followed by a blank/empty user turn that itself fails to produce a usable user message — the provider then sees a payload ending in an assistant message.
  3. Send a message in session A. Without this fix: the provider returns 400 "This model does not support assistant message prefill ... must end with a user message.", the failure classifies as format, markAuthProfileFailure fires for the profile, and the auth profile enters cooldown.
  4. Send a message in session B (different session, same auth profile, valid transcript). Without this fix: rejected with "Provider github-copilot is in cooldown (all profiles unavailable)" — even though session B's request was perfectly valid. With this fix: session B succeeds; only session A keeps failing on its own bad transcript.
  5. Cycle session A's retries past the cooldown expiry. Without this fix: the post-cooldown retry hits the same 400 and the cooldown re-extends to 5 minutes; back-to-back cycles reproduce the 42-minute outage in the report. With this fix: session A still fails per turn, but no profile cooldown is recorded, so no other session is affected and no extension cycle is created.
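
The arithmetic behind step 5 can be sketched as a small simulation. The step values (30s, 60s, 5min cap) come from the description above; how calculateAuthProfileCooldownMs actually derives them is not shown here, so the function below is an assumption, as is the simplification that session A retries the instant each cooldown expires and that request latency is zero.

```typescript
// Hypothetical stepped backoff based on the "30s → 60s → 5min capped" description;
// the real calculateAuthProfileCooldownMs may compute its steps differently.
function cooldownMsSketch(consecutiveFailures: number): number {
  if (consecutiveFailures <= 1) return 30_000;
  if (consecutiveFailures === 2) return 60_000;
  return 300_000; // capped at 5 minutes
}

// Total time the shared profile is unavailable when one session keeps failing
// and retries as soon as each cooldown expires.
function totalBlockedMsSketch(failures: number, recordsCooldown: boolean): number {
  let blockedMs = 0;
  for (let n = 1; n <= failures; n++) {
    // With the fix, a format failure records no cooldown at all.
    if (recordsCooldown) blockedMs += cooldownMsSketch(n);
  }
  return blockedMs;
}

const minutesWithoutFix = totalBlockedMsSketch(10, true) / 60_000;
const minutesWithFix = totalBlockedMsSketch(10, false) / 60_000;
console.log(minutesWithoutFix, minutesWithFix); // prints "41.5 0"
```

Under these assumptions, ten consecutive failures from one bad transcript are enough to keep the profile unavailable for about 41.5 minutes, which is the same order as the outage in the report. With the fix the same ten failures contribute nothing to profile cooldown.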

Risk / Mitigation

  • Risk 1 — losing telemetry on schema problems: Could excluding format from profile cooldown make schema problems invisible? No. The format failure still surfaces to the user via FailoverError, still logs through the existing run-warn path, and still appears in failureCounts if and when the resolver does return non-null on other code paths. We are removing only the profile-cooldown side effect, not the diagnostic surface.
    Mitigation: covered by the existing markAuthProfileFailure test suite continuing to pass — those tests do not assert anything about format, since format was not previously a documented profile-poisoning reason; the reporter's symptom shows it was de-facto poisoning behavior, not by design.
  • Risk 2 — masking a real provider-side schema fault: Could a provider's own bug (where everyone's request shape is rejected legitimately) hide behind this exclusion? Unlikely to be material: such an outage would manifest as format errors across many independent sessions and many profiles, and the operator would still see them as FailoverErrors in logs and per-turn UI. Profile-cooldown was never the right mitigation for a provider-side schema bug — failover/fallback chain configuration is.
    Mitigation: behavior matches the existing precedent for timeout exclusion (transport timeouts also surface to logs and UI but do not poison auth health).
  • Risk 3 — extending the exclusion list: Could other reasons need similar treatment? The two other "session-shape" reasons surfacing in the codebase are replay_invalid and schema (in the ProviderRuntimeFailureKind taxonomy at errors.ts:258-277). Both already collapse into format through classifyFailoverClassificationFromMessage. They are covered by this single change.
    Mitigation: the test asserts resolveAuthProfileFailureReason returns null for format regardless of policy, locking the contract.
  • Risk 4 — minimal blast radius: The change is one file in production code, four lines of effective logic edit. The chokepoint nature of resolveAuthProfileFailureReason means there is exactly one path to audit.

Change Type (select all)

  • Bug fix

Scope (select all touched areas)

  • Agents (auth-profile failure policy)
  • Tests (resolver unit coverage)
  • Changelog (Unreleased Fixes entry)

Linked Issue/PR

Refs #77228 — addresses the cascading profile cooldown amplifier only. The two other items in the same issue (the stream-error placeholder leaving transcripts ending in assistant, and the auto-repair amplification that fills the JSONL with 935+ null-role entries) are independent failure modes; they remain open and out of scope here so they can be tracked and shipped separately.

@openperf openperf requested a review from a team as a code owner May 4, 2026 11:37
@openclaw-barnacle openclaw-barnacle Bot added the agents (Agent runtime and tooling), size: XS, and maintainer (Maintainer-authored PR) labels May 4, 2026
@clawsweeper
Contributor

clawsweeper Bot commented May 4, 2026

Codex review: needs maintainer review before merge.

Summary
The PR makes the auth-profile failure resolver ignore format failover reasons, adds resolver unit coverage, and adds an Unreleased changelog fix entry.

Reproducibility: yes. Source inspection shows current main classifies the reported 400/422 request-shape rejection as format, passes it through the resolver, and persists it into profile-wide cooldown state.

Next step before merge
No automated repair lane is needed: the PR is maintainer-labeled and the earlier changelog blocker is resolved in the updated head.

Security
Cleared: The diff only touches a failure-classifier helper, its unit test, and the changelog; it adds no dependency, CI, secret-handling, or supply-chain surface.

Review details

Best possible solution:

Land the resolver, test, and changelog change after normal maintainer review while leaving the broader transcript placeholder and session-repair work tracked in #77228.

Do we have a high-confidence way to reproduce the issue?

Yes. Source inspection shows current main classifies the reported 400/422 request-shape rejection as format, passes it through the resolver, and persists it into profile-wide cooldown state.

Is this the best way to solve the issue?

Yes. Filtering format at the resolver is the narrowest maintainable fix because all runner cooldown writes pass through it, and the PR now includes focused unit coverage plus the required changelog entry.

What I checked:

  • Protected PR metadata: The provided GitHub context shows authorAssociation=MEMBER and the maintainer label, so this PR should stay on the normal maintainer path rather than cleanup close automation. (040ef6b70424)
  • Current main persists format failures: Current main only excludes local policy, missing reasons, and timeout; a format failover reason currently passes through as an auth-profile failure reason. (src/agents/pi-embedded-runner/run/auth-profile-failure-policy.ts:10, 89db1e5440f5)
  • Format classification matches the reported 400/422 request-shape failure: Unclassified 400/422 responses with a body are classified as format, matching the assistant-prefill rejection described in the linked report. (src/agents/pi-embedded-helpers/errors.ts:710, 89db1e5440f5)
  • Persisted non-disabled reasons create profile cooldowns: computeNextProfileUsageStats applies stepped cooldowns to non-disabled failure reasons and only model-scopes rate_limit, so format is currently profile-wide cooldown state. (src/agents/auth-profiles/usage.ts:593, 89db1e5440f5)
  • Runner cooldown writes use the resolver chokepoint: Runner prompt and assistant failure paths resolve the failover reason before calling the auth-profile failure marker, making the resolver the narrow place to stop format from poisoning shared profile health. (src/agents/pi-embedded-runner/run.ts:872, 89db1e5440f5)
  • PR adds the missing changelog entry: The updated PR file list includes a single Unreleased Fixes entry for the auth-profile format cooldown behavior, resolving the earlier ClawSweeper changelog finding. (CHANGELOG.md:310, 040ef6b70424)
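
The model-scoping point in the checks above (only rate_limit gets a cooldownModel) is what turns a format failure into a profile-wide block. The following sketch illustrates that distinction only. The record shape, both function names, and the availability check are assumptions; the real logic in usage.ts is not reproduced here.

```typescript
// Hypothetical cooldown record; field names other than cooldownModel are assumptions.
interface CooldownRecordSketch {
  reason: string;
  cooldownUntil: number; // epoch milliseconds
  cooldownModel?: string; // per the PR description, set only for rate_limit
}

function recordCooldownSketch(
  reason: string,
  model: string,
  now: number,
  durationMs: number,
): CooldownRecordSketch {
  const record: CooldownRecordSketch = { reason, cooldownUntil: now + durationMs };
  if (reason === "rate_limit") record.cooldownModel = model;
  return record;
}

// Assumed check: an unscoped record blocks every model on the profile,
// a scoped record blocks only the model it names.
function isBlockedSketch(record: CooldownRecordSketch, model: string, now: number): boolean {
  if (now >= record.cooldownUntil) return false;
  return record.cooldownModel === undefined || record.cooldownModel === model;
}

const now = 0;
const formatRecord = recordCooldownSketch("format", "model-a", now, 30_000);
const rateLimitRecord = recordCooldownSketch("rate_limit", "model-a", now, 30_000);

console.log(isBlockedSketch(formatRecord, "model-b", now + 1_000)); // true: unscoped, blocks all models
console.log(isBlockedSketch(rateLimitRecord, "model-b", now + 1_000)); // false: scoped to model-a
```

Under this reading, a format failure recorded before the fix behaves like an unscoped record, so a session on an unrelated model is blocked as well. Filtering format at the resolver avoids creating the record in the first place.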

Likely related people:

  • Peter Steinberger: Current-main blame in this checkout points to Peter on the resolver, runner auth failure funnel, auth controller, and cooldown code; local history also shows recent embedded-runner auth controller refactoring. (role: recent maintainer; confidence: high; commits: 5d9752ba18b7, 18dc98b00e43; files: src/agents/pi-embedded-runner/run/auth-profile-failure-policy.ts, src/agents/pi-embedded-runner/run.ts, src/agents/pi-embedded-runner/run/auth-controller.ts)
  • kiranvk2011: Local history for the auth-profile cooldown path includes the per-model cooldown scope and stepped backoff work that this PR deliberately preserves. (role: adjacent owner; confidence: medium; commits: 84401223c7b8; files: src/agents/auth-profiles/usage.ts, src/agents/model-fallback.ts, src/agents/pi-embedded-runner/run.ts)
  • Ayaan Zaidi: Recent available history on the provider error-classification file includes HTTP classification work near the format taxonomy that feeds this cooldown path. (role: adjacent maintainer; confidence: low; commits: de129a6530c0; files: src/agents/pi-embedded-helpers/errors.ts)

Remaining risk / open question:

  • I did not run the targeted Vitest file or a live provider reproduction because this was a read-only sweep; the conclusion is based on source inspection and the provided PR diff.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 89db1e5440f5.

@openperf openperf force-pushed the fix/77228-format-failure-no-profile-cooldown branch from 040ef6b to 3c7f49b Compare May 5, 2026 05:25
openperf added 3 commits May 5, 2026 13:33
A format-classified failure means the provider rejected the request payload
shape (e.g. an assistant-prefill 400 when a session transcript ends with a
stream-error placeholder turn). That is a per-session transcript-shape
problem, not a profile-wide reliability signal. Mark the reason with the
existing transport-timeout exclusion so a single bad session no longer
cascades to a profile cooldown that takes down every other healthy session
sharing the same auth profile or, when all profiles share the same fault,
the whole provider for the backoff window.

Refs openclaw#77228 — addresses the cascading-cooldown amplifier only.
The other two items in the same issue (the prefill placeholder leaving
transcripts ending in assistant, and the auto-repair filling the JSONL
with null-role entries) are separate failure modes and remain open.
@openperf openperf force-pushed the fix/77228-format-failure-no-profile-cooldown branch from 3c7f49b to f4188b4 Compare May 5, 2026 05:34
@openperf openperf merged commit 31da1fe into openclaw:main May 5, 2026
101 checks passed
@openperf
Member Author

openperf commented May 5, 2026

Merged via squash.

Thanks @openperf!

@openperf openperf deleted the fix/77228-format-failure-no-profile-cooldown branch May 5, 2026 05:36
@openperf
Member Author

openperf commented May 5, 2026

Merged as 31da1fe5b05caf2368d53a2027382349c28918cb — fix(auth-profiles): exclude format rejections from profile cooldown. Refs #77228.

openperf added a commit that referenced this pull request May 5, 2026
…Unreleased (#77728)

Merged via squash.

Prepared head SHA: 1bd228f
Co-authored-by: openperf <80630709+openperf@users.noreply.github.com>
Co-authored-by: openperf <80630709+openperf@users.noreply.github.com>
Reviewed-by: @openperf
vincentkoc added a commit to VintageAyu/openclaw that referenced this pull request May 5, 2026
…ainer-hardening

* origin/main: (843 commits)
  docs(changelog): relocate openclaw#77046 and openclaw#77280 entries from 2026.5.3 to Unreleased (openclaw#77728)
  docs: reorder unreleased changelog
  fix: expose ollama thinking profile before activation (openclaw#77617) (thanks @yfge)
  fix: expose ollama thinking profile before activation
  test(gateway): preserve dispatch timers in waiter
  test(gateway): keep startup context timer live
  docs: document cache-friendly activity helper
  ci: install ffmpeg for Mantis media previews
  fix: avoid impossible device token rotation advice (openclaw#77688) (thanks @Conan-Scott)
  docs(changelog): note doctor device pairing advice fix
  fix(doctor): avoid impossible device token rotation advice
  ci: use Crabbox media previews for Mantis
  docs: filter maintainer-owned triage noise
  test: cover GitHub activity helper
  fix(session-file-repair): drop null-role message entries instead of preserving them (openclaw#77288)
  fix: prune orphan session artifacts
  perf: reduce GitHub activity cache misses
  fix: cache session list model resolution (openclaw#77650) (thanks @ragesaq)
  ci: embed Mantis desktop previews
  fix(replay-history): drop trailing stream-error placeholder before provider send (openclaw#77287)
  ...

# Conflicts:
#	CHANGELOG.md
steipete pushed a commit that referenced this pull request May 5, 2026
…77280)

Merged via squash.

Prepared head SHA: f4188b4
Co-authored-by: openperf <80630709+openperf@users.noreply.github.com>
Co-authored-by: openperf <80630709+openperf@users.noreply.github.com>
Reviewed-by: @openperf

(cherry picked from commit 31da1fe)
github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request May 9, 2026
…penclaw#77280)

Merged via squash.

Prepared head SHA: f4188b4
Co-authored-by: openperf <80630709+openperf@users.noreply.github.com>
Co-authored-by: openperf <80630709+openperf@users.noreply.github.com>
Reviewed-by: @openperf
github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request May 9, 2026
…rom 2026.5.3 to Unreleased (openclaw#77728)

Merged via squash.

Prepared head SHA: 1bd228f
Co-authored-by: openperf <80630709+openperf@users.noreply.github.com>
Co-authored-by: openperf <80630709+openperf@users.noreply.github.com>
Reviewed-by: @openperf

Labels

agents (Agent runtime and tooling), maintainer (Maintainer-authored PR), size: XS

Projects

None yet


1 participant