Summary
In pi-embedded-runner, the model-fallback loop reuses a single shared session file across every candidate provider. When the first candidate writes an assistant row with an errorMessage (e.g. OpenAI returns a real 429), and a later candidate (e.g. Anthropic, Google) times out without producing a new assistant for the current attempt, the failover path falls back to sessionLastAssistant.errorMessage from the shared session file and reports the previous provider's error string as if it came from the current candidate.
The net effect: a single real upstream error from provider A is re-surfaced as the failure cause for providers B and C, producing false-positive "all providers failed with the same error" output.
Where
src/agents/pi-embedded-runner/run/attempt.ts — lastAssistant is computed by scanning messagesSnapshot for the most recent assistant regardless of provider.
- Bundled (production) lines from a 2026-05-24 build:
pi-embedded-Bcz04p2i.js:2865 (failover error construction):
const assistantForFailover = currentAttemptAssistant ?? sessionLastAssistant;
…
new FailoverError(resolveAssistantFailoverErrorMessage(params), {
…,
rawError: params.lastAssistant?.errorMessage?.trim(),
});
model-fallback-DIXhOaxb.js:379 (recordFailedCandidateAttempt) stores error: described.rawError ?? described.message, so the stale rawError wins over the candidate-attributed message.
(File names with hashes are from the published build artifact; map back to the corresponding source modules.)
Reproduction / observed cascade
- OpenAI candidate hits a real 429 → session file now contains an assistant row with
errorMessage = "You exceeded your current quota…" (and OpenAI as provider).
runWithModelFallback advances to Anthropic and spawns a fresh runEmbeddedPiAgent against the same sessionFile/sessionId.
- The Anthropic request queues / hangs / aborts at the run-level timeout — no new assistant produced this attempt.
- Failover-decision construction sees
currentAttemptAssistant === undefined and falls back to sessionLastAssistant — which is still the OpenAI errored row.
- The resulting
FailoverError carries the OpenAI quota text as rawError, attributed (by the outer model-fallback) to Anthropic.
- Same again for Google.
Smoking gun in our logs
Every [agent/embedded] embedded run failover decision line for two distinct runs (3c5d7ca0-83df-418b-be48-a9327459046a and b9ad1b27-…) logs from=openai/gpt-5.5 — including the decisions that the outer model-fallback layer interprets as Anthropic and Google candidate failures. There are zero from=anthropic/… or from=google/… decision logs. Inner pi-embedded-runner never saw a non-OpenAI-attributed assistant error.
Evidence table
For one failed run (runId=3c5d7ca0-83df-418b-be48-a9327459046a, 2026-05-24 06:38–06:43 PT):
| Candidate |
OUTER reason |
OUTER detail (logged error string) |
Real upstream call status |
| openai/gpt-5.5 |
rate_limit |
OpenAI quota text |
Real 429 — /v1/responses returned a genuine quota error from OpenAI (proxy logs confirm) |
| openai/gpt-5.5 (retry, different profile) |
rate_limit |
OpenAI quota text |
Real 429 — same |
| anthropic/claude-opus-4-7 |
timeout |
OpenAI quota text |
No Anthropic 429 observed. Run ran ~123s and ended on run-level timeout; upstream proxy shows queued/rate-limited entries but no terminal quota-exhausted response for opus-4-7 |
| google/gemini-3-pro-preview |
timeout |
OpenAI quota text |
Same shape — reason=timeout, but detail is again the OpenAI quota text verbatim |
Note specifically: two of the three statuses are timeout, not rate_limit — real quota errors produce immediate 429s, not run-level timeouts. And the error message is verbatim OpenAI's quota string; Anthropic and Google use entirely different wording for billing/quota errors.
Impact
- Misleads operators into believing all three providers are concurrently exhausted, when only one actually is.
- Drives unnecessary top-up / billing action on providers that aren't out of quota.
- Makes accurate triage of multi-provider gateways effectively impossible because the surfaced
detail is unreliable for any candidate after the first failing provider.
- Hard to spot from the operator side because the
detail looks like a fully-formed quota error.
In our own incident this caused a false-positive triple-provider quota outage that was only caught by RCA inspection of inner failover-decision logs.
Suggested fix
Two complementary guards:
- In pi-embedded-runner failover construction (around the
assistantForFailover = currentAttemptAssistant ?? sessionLastAssistant site):
- If
currentAttemptAssistant is undefined AND sessionLastAssistant?.provider !== <this candidate's provider>, do not propagate sessionLastAssistant.errorMessage as the FailoverError.rawError. Fall through to the candidate-attributed default (e.g. "LLM request timed out." / "no response from provider").
- In
recordFailedCandidateAttempt (the error: described.rawError ?? described.message site in model-fallback-…):
- Additionally guard: if
described.provider (from rawError attribution) differs from params.candidate.provider, prefer described.message over described.rawError.
Either guard alone would have prevented the false-positive in our case; both together are belt-and-suspenders.
Workaround
Operators can read from= in [agent/embedded] embedded run failover decision log lines instead of trusting the surfaced detail / outer error string. The workaround works but is fragile — it requires log access and inner-line correlation, and any wrapper that surfaces the outer error to humans (Slack alert, dashboard) will still show the stale text.
Severity
Medium. Functional fallback still works (requests do route to the next provider), and the workaround exists. But the misleading attribution actively damages triage of multi-provider failures, which is exactly when accurate signals matter most.
Filed by
Kentro.io engineering. Internal RCA tracked at KEN-4598 / KEN-4603. Happy to attach more log excerpts or test against a patch if useful.
Summary
In
pi-embedded-runner, the model-fallback loop reuses a single shared session file across every candidate provider. When the first candidate writes an assistant row with anerrorMessage(e.g. OpenAI returns a real 429), and a later candidate (e.g. Anthropic, Google) times out without producing a new assistant for the current attempt, the failover path falls back tosessionLastAssistant.errorMessagefrom the shared session file and reports the previous provider's error string as if it came from the current candidate.The net effect: a single real upstream error from provider A is re-surfaced as the failure cause for providers B and C, producing false-positive "all providers failed with the same error" output.
Where
src/agents/pi-embedded-runner/run/attempt.ts—lastAssistantis computed by scanningmessagesSnapshotfor the most recent assistant regardless of provider.pi-embedded-Bcz04p2i.js:2865(failover error construction):model-fallback-DIXhOaxb.js:379(recordFailedCandidateAttempt) storeserror: described.rawError ?? described.message, so the stalerawErrorwins over the candidate-attributedmessage.(File names with hashes are from the published build artifact; map back to the corresponding source modules.)
Reproduction / observed cascade
errorMessage = "You exceeded your current quota…"(and OpenAI as provider).runWithModelFallbackadvances to Anthropic and spawns a freshrunEmbeddedPiAgentagainst the samesessionFile/sessionId.currentAttemptAssistant === undefinedand falls back tosessionLastAssistant— which is still the OpenAI errored row.FailoverErrorcarries the OpenAI quota text asrawError, attributed (by the outer model-fallback) to Anthropic.Smoking gun in our logs
Every
[agent/embedded] embedded run failover decisionline for two distinct runs (3c5d7ca0-83df-418b-be48-a9327459046aandb9ad1b27-…) logsfrom=openai/gpt-5.5— including the decisions that the outer model-fallback layer interprets as Anthropic and Google candidate failures. There are zerofrom=anthropic/…orfrom=google/…decision logs. Inner pi-embedded-runner never saw a non-OpenAI-attributed assistant error.Evidence table
For one failed run (
runId=3c5d7ca0-83df-418b-be48-a9327459046a, 2026-05-24 06:38–06:43 PT):reasondetail(logged error string)rate_limit/v1/responsesreturned a genuine quota error from OpenAI (proxy logs confirm)rate_limittimeouttimeoutreason=timeout, butdetailis again the OpenAI quota text verbatimNote specifically: two of the three statuses are
timeout, notrate_limit— real quota errors produce immediate 429s, not run-level timeouts. And the error message is verbatim OpenAI's quota string; Anthropic and Google use entirely different wording for billing/quota errors.Impact
detailis unreliable for any candidate after the first failing provider.detaillooks like a fully-formed quota error.In our own incident this caused a false-positive triple-provider quota outage that was only caught by RCA inspection of inner failover-decision logs.
Suggested fix
Two complementary guards:
assistantForFailover = currentAttemptAssistant ?? sessionLastAssistantsite):currentAttemptAssistantis undefined ANDsessionLastAssistant?.provider !== <this candidate's provider>, do not propagatesessionLastAssistant.errorMessageas theFailoverError.rawError. Fall through to the candidate-attributed default (e.g."LLM request timed out."/"no response from provider").recordFailedCandidateAttempt(theerror: described.rawError ?? described.messagesite inmodel-fallback-…):described.provider(fromrawErrorattribution) differs fromparams.candidate.provider, preferdescribed.messageoverdescribed.rawError.Either guard alone would have prevented the false-positive in our case; both together are belt-and-suspenders.
Workaround
Operators can read
from=in[agent/embedded] embedded run failover decisionlog lines instead of trusting the surfaceddetail/ outer error string. The workaround works but is fragile — it requires log access and inner-line correlation, and any wrapper that surfaces the outer error to humans (Slack alert, dashboard) will still show the stale text.Severity
Medium. Functional fallback still works (requests do route to the next provider), and the workaround exists. But the misleading attribution actively damages triage of multi-provider failures, which is exactly when accurate signals matter most.
Filed by
Kentro.io engineering. Internal RCA tracked at KEN-4598 / KEN-4603. Happy to attach more log excerpts or test against a patch if useful.