Summary
When an API rate limit or model fallback (e.g. Claude → Gemini) occurs mid-stream, the agent state machine in pi-embedded-runner/run/attempt.ts can stall permanently with no error surfaced to the user. The session goes silent and never resumes, even after the rate limit clears.
Behaviour
- Agent is mid-execution (streaming tool results, running browser automation, etc.)
- Rate limit hit → provider switches to fallback model (or waits for retry)
- Agent session goes completely silent — no response, no error, no timeout
- Cron job eventually times out, or hangs indefinitely if timeoutSeconds: 0
- No user-visible error; only surfaced in run history as a vague failure
Root Cause (suspected)
The existing BUGFIX comment in attempt.ts acknowledges fragility in the idle-check / tool flush loop:
"Wait for the agent to be truly idle before flushing pending tool results"
The async coordination between streaming state, tool result flushing, and the idle detection appears to be sensitive to mid-stream interruptions. When the stream is cut (rate limit, network drop, model switch), the state machine can enter a state where it's waiting for an idle signal that never arrives.
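The suspected failure mode can be illustrated with a minimal sketch. This is not the code in attempt.ts; StreamSignals, waitForIdleFragile and waitForIdleOrAbort are hypothetical names invented for the example. It only shows why a wait that listens for a single idle signal never settles once the stream is cut, and how also listening for the interruption avoids that.

```typescript
// Hypothetical illustration only; none of these names come from attempt.ts.
type Listener = () => void;

// Tiny stand-in for whatever emits "idle" and "aborted" in the real runner.
class StreamSignals {
  private idleListeners: Listener[] = [];
  private abortListeners: Listener[] = [];

  onIdle(fn: Listener): void {
    this.idleListeners.push(fn);
  }
  onAbort(fn: Listener): void {
    this.abortListeners.push(fn);
  }
  emitIdle(): void {
    this.idleListeners.forEach((fn) => fn());
  }
  emitAbort(): void {
    this.abortListeners.forEach((fn) => fn());
  }
}

// Fragile: settles only on idle. If the stream is cut and idle is never
// emitted, this promise stays pending forever and the caller stalls silently.
function waitForIdleFragile(signals: StreamSignals): Promise<"idle"> {
  return new Promise((resolve) => {
    signals.onIdle(() => resolve("idle"));
  });
}

// Safer: settles on whichever signal arrives first, so an interruption
// becomes an observable outcome instead of an endless wait.
function waitForIdleOrAbort(
  signals: StreamSignals,
): Promise<"idle" | "aborted"> {
  return new Promise((resolve) => {
    signals.onIdle(() => resolve("idle"));
    signals.onAbort(() => resolve("aborted"));
  });
}
```

If the real loop resembles the fragile variant, a rate limit or model switch that tears down the stream without emitting idle would produce exactly the silence described above.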
Repro
- Run a long browser automation cron job with timeoutSeconds: 0
- Let it hit a Claude API rate limit mid-session (easier to repro with a busy schedule of concurrent cron jobs)
- Even with a Gemini fallback configured, the session stalls rather than switching cleanly
- Session produces no output, no error — just silence
Suggested Fix
Replace the manual idle-check polling loop with a formal FSM (e.g. XState) with deterministic transitions between streaming, tool_execution, and flushing states. At minimum, add a hard timeout on the idle-wait loop with a recoverable error on expiry.
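Both parts of this suggestion can be sketched as follows. The state names streaming, tool_execution and flushing come from the proposal above; everything else (AgentEvent, the failed and done states, nextState, IdleWaitTimeoutError, withHardTimeout) is an assumption made for illustration and is not an existing API in the runner or in XState. The transition table is a plain-TypeScript stand-in for what an FSM library would provide.

```typescript
// Hypothetical names throughout; not the actual attempt.ts API.
type AgentState = "streaming" | "tool_execution" | "flushing" | "failed" | "done";
type AgentEvent = "TOOL_CALL" | "TOOL_RESULT" | "STREAM_END" | "FLUSHED" | "INTERRUPTED";

// Every non-terminal state has an explicit exit for INTERRUPTED, so a cut
// stream always lands in "failed" instead of waiting for an idle signal.
const transitions: Record<AgentState, Partial<Record<AgentEvent, AgentState>>> = {
  streaming: { TOOL_CALL: "tool_execution", STREAM_END: "flushing", INTERRUPTED: "failed" },
  tool_execution: { TOOL_RESULT: "streaming", INTERRUPTED: "failed" },
  flushing: { FLUSHED: "done", INTERRUPTED: "failed" },
  failed: {},
  done: {},
};

// Pure transition function: events with no entry leave the state unchanged.
function nextState(state: AgentState, event: AgentEvent): AgentState {
  return transitions[state][event] ?? state;
}

// Error raised when the idle wait exceeds its budget. The `recoverable`
// flag is an assumed convention that lets the caller retry or fall back.
class IdleWaitTimeoutError extends Error {
  readonly recoverable = true;
  constructor(ms: number) {
    super(`Agent did not become idle within ${ms} ms`);
    this.name = "IdleWaitTimeoutError";
    Object.setPrototypeOf(this, IdleWaitTimeoutError.prototype);
  }
}

// Minimum fix: race the idle wait against a hard deadline and always
// clear the timer so it cannot keep the process alive.
function withHardTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new IdleWaitTimeoutError(ms)), ms);
  });
  return Promise.race([work, deadline]).finally(() => clearTimeout(timer));
}
```

A caller would wrap the existing idle wait, for example withHardTimeout(waitUntilIdle(), 30000) where waitUntilIdle is a placeholder for the current loop, and treat IdleWaitTimeoutError as a signal to surface an error or retry on the fallback model. The timeout alone does not fix the underlying race, but it turns a silent stall into a visible, recoverable failure.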
Version
v2026.2.26