bug: agent runner stalls silently on rate limit / model switch mid-stream (attempt.ts race condition) #31664

@yodason

Summary

When an API rate limit or model fallback (e.g. Claude → Gemini) occurs mid-stream, the agent state machine in pi-embedded-runner/run/attempt.ts can stall permanently with no error surfaced to the user. The session goes silent and never resumes, even after the rate limit clears.

Behaviour

  • Agent is mid-execution (streaming tool results, running browser automation, etc.)
  • Rate limit hit → provider switches to fallback model (or waits for retry)
  • Agent session goes completely silent — no response, no error, no timeout
  • The cron job eventually times out, or hangs indefinitely when timeoutSeconds: 0 is set
  • No user-visible error; only surfaced in run history as a vague failure

Root Cause (suspected)

The existing BUGFIX comment in attempt.ts acknowledges fragility in the idle-check / tool flush loop:

"Wait for the agent to be truly idle before flushing pending tool results"

The async coordination between streaming state, tool result flushing, and the idle detection appears to be sensitive to mid-stream interruptions. When the stream is cut (rate limit, network drop, model switch), the state machine can enter a state where it's waiting for an idle signal that never arrives.
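To make the suspected failure mode concrete, here is a minimal model of it. This is only an illustration: `FakeAgent`, `waitForIdle`, `finishStream`, and `interruptStream` are hypothetical names invented for this sketch and are not taken from attempt.ts. It assumes the idle waiter is only ever released by the normal end-of-stream path.

```typescript
// Hypothetical model of the suspected hang. None of these names come from attempt.ts.
type IdleListener = () => void;

class FakeAgent {
  private idleListeners: IdleListener[] = [];
  private streaming = true;

  // Resolves only when the agent reports that it is idle.
  waitForIdle(): Promise<void> {
    if (!this.streaming) return Promise.resolve();
    return new Promise<void>((resolve) => {
      this.idleListeners.push(() => resolve());
    });
  }

  // Normal completion: the stream ends and every idle waiter is released.
  finishStream(): void {
    this.streaming = false;
    const listeners = this.idleListeners;
    this.idleListeners = [];
    for (const listener of listeners) listener();
  }

  // Assumed buggy path: a rate limit or model switch tears the stream down,
  // but the streaming flag is never cleared and no idle waiter is released.
  interruptStream(): void {
    // Intentionally empty: this is the missing notification.
  }

  // Number of callers still blocked in waitForIdle().
  pendingWaiters(): number {
    return this.idleListeners.length;
  }
}

// Usage: the flush step starts waiting, then the stream is interrupted.
const agent = new FakeAgent();
void agent.waitForIdle(); // would gate the tool result flush
agent.interruptStream();
console.log(agent.pendingWaiters()); // the waiter is still pending
```

If the real code has a path like `interruptStream` here, anything awaiting the idle signal would block forever, which would match the silent stall described above.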

Repro

  1. Run a long browser automation cron job with timeoutSeconds: 0
  2. Let it hit a Claude API rate limit mid-session (easier to repro with a busy schedule of concurrent cron jobs)
  3. Even with a Gemini fallback configured, the session stalls rather than switching cleanly
  4. Session produces no output, no error — just silence

Suggested Fix

Replace the manual idle-check polling loop with a formal FSM (e.g. XState) with deterministic transitions between streaming, tool_execution, and flushing states. At minimum, add a hard timeout on the idle-wait loop with a recoverable error on expiry.
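A rough sketch of both halves of this suggestion, written as plain TypeScript rather than XState so it stays self-contained. Everything here is an assumption for illustration: the `recovering` and `done` states, the event names, `IdleTimeoutError`, and `withIdleTimeout` do not exist in the codebase as far as I know. Only the `streaming`, `tool_execution`, and `flushing` state names come from the proposal above.

```typescript
// Hypothetical states and events; only the first three state names are from this issue.
type AgentState = "streaming" | "tool_execution" | "flushing" | "recovering" | "done";
type AgentEvent =
  | "TOOL_CALL"
  | "TOOL_RESULT"
  | "STREAM_END"
  | "FLUSHED"
  | "STREAM_INTERRUPTED"
  | "IDLE_TIMEOUT"
  | "RETRY";

// Explicit transition table: every state names a way out for an interruption
// or timeout, so there is no state that can only wait for an idle signal.
const transitions: Record<AgentState, Partial<Record<AgentEvent, AgentState>>> = {
  streaming: {
    TOOL_CALL: "tool_execution",
    STREAM_END: "flushing",
    STREAM_INTERRUPTED: "recovering",
  },
  tool_execution: { TOOL_RESULT: "streaming", STREAM_INTERRUPTED: "recovering" },
  flushing: { FLUSHED: "done", IDLE_TIMEOUT: "recovering", STREAM_INTERRUPTED: "recovering" },
  recovering: { RETRY: "streaming" },
  done: {},
};

// Returns the next state, or throws on an event the current state does not allow.
function transition(state: AgentState, event: AgentEvent): AgentState {
  const next = transitions[state][event];
  if (next === undefined) {
    throw new Error(`Illegal transition: ${event} in state ${state}`);
  }
  return next;
}

// Error type a caller can recognise and recover from (retry, fallback model, ...).
class IdleTimeoutError extends Error {
  readonly recoverable = true;
  constructor(ms: number) {
    super(`Agent did not become idle within ${ms} ms`);
    this.name = "IdleTimeoutError";
  }
}

// Minimum fix: bound the idle wait so it rejects instead of hanging forever.
function withIdleTimeout<T>(idle: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new IdleTimeoutError(ms)), ms);
  });
  // Clear the timer whichever promise settles first.
  return Promise.race([idle, timeout]).finally(() => clearTimeout(timer));
}
```

A caller could wrap the existing idle wait as `await withIdleTimeout(waitForIdle(), 30_000)` (the 30 second value is arbitrary) and, on `IdleTimeoutError`, feed `IDLE_TIMEOUT` into the state machine so the run surfaces an error or retries instead of going silent.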

Version

v2026.2.26

Labels: stale (marked as stale due to inactivity)