Bug Description
The Telegram sendChatAction('typing') indicator sometimes persists indefinitely after the agent has sent its final reply. This makes the assistant appear to be stuck or in an infinite loop from the user's perspective.
Root Cause Analysis
Based on live diagnostics, this issue is not caused by a simple hanging background process. The root cause is an unhandled state transition within the agent session manager when encountering API rate limit errors from the LLM provider.
Log Evidence:
The logs show a cascade of two errors: "FailoverError: ⚠️ API rate limit reached. Please try again later." and "FailoverError: No available auth profile for ... (all in cooldown or unavailable)."
Sample log entries:
{"subsystem":"agent/embedded","1":"embedded run agent end: runId=... isError=true error=\"⚠️ API rate limit reached. Please try again later.\""}
{"subsystem":"diagnostic","1":"lane task error: lane=main durationMs=... error=\"FailoverError: ⚠️ API rate limit reached. Please try again later.\""}
Mechanism:
- An agent makes a call to an LLM provider (e.g., google-gemini-cli).
- The provider returns a rate limit error.
- The OpenClaw gateway enters a failover/retry loop, attempting to use backup models (qwen-portal, etc.).
- This retry loop does not seem to terminate the parent session cleanly. The session remains in an 'active' or 'running' state internally.
- Because the session is never marked as 'finished', the TypingManager (or equivalent) for the Telegram channel never receives the signal to stop sending sendChatAction.
- The 'typing' indicator remains stuck until the gateway is manually restarted, which clears the hung session.
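One way to avoid this failure mode is to tie the typing loop to the lifetime of the turn itself rather than to a separate 'finished' signal. The following is a minimal Python sketch of that idea, not OpenClaw's actual code: `send_chat_action` and `agent_run` are hypothetical async callables, and `FailoverError` is a stand-in class that only borrows its name from the logs. Telegram clears the typing status after roughly five seconds, so the loop re-sends it at a shorter interval.

```python
import asyncio
import contextlib


class FailoverError(Exception):
    """Stand-in for the error named in the logs; not the real OpenClaw class."""


async def typing_keepalive(send_chat_action, interval: float = 4.0) -> None:
    """Re-send the 'typing' action until this task is cancelled."""
    while True:
        await send_chat_action("typing")
        await asyncio.sleep(interval)


async def run_turn(agent_run, send_chat_action, interval: float = 4.0):
    """Run one agent turn; the typing loop is stopped on every exit path."""
    typing_task = asyncio.create_task(typing_keepalive(send_chat_action, interval))
    try:
        return await agent_run()
    finally:
        # Executed on success, on FailoverError, and on cancellation alike,
        # so the indicator cannot outlive the turn.
        typing_task.cancel()
        with contextlib.suppress(asyncio.CancelledError):
            await typing_task
```

Because the cancellation sits in a `finally` block, the typing loop does not depend on the session ever reaching a particular state.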
Steps to Reproduce
The bug is intermittent and hard to reproduce on demand as it requires triggering a real API rate limit.
- Configure multiple LLM providers for failover.
- Perform actions that rapidly consume API tokens (e.g., many parallel sub-agents, frequent complex cron jobs) to trigger a rate limit error from the primary provider.
- Observe the Telegram chat.
- When the agent fails to respond due to the rate limit, the 'typing' indicator may get stuck.
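Since triggering a real rate limit is unreliable, a stub provider could make the failover path reproducible in a test. The sketch below is an assumption about how such a fixture might look. `StubProvider`, `RateLimitError`, and `complete` are hypothetical names, and how a stub would be registered with the gateway is not covered here.

```python
class RateLimitError(Exception):
    """Hypothetical stand-in for a provider's rate limit response."""


class StubProvider:
    """Fake LLM provider that is rate limited for its first `fail_times` calls."""

    def __init__(self, name: str, fail_times: float = float("inf")) -> None:
        self.name = name
        self.fail_times = fail_times  # default: rate limited forever
        self.calls = 0

    def complete(self, prompt: str) -> str:
        self.calls += 1
        if self.calls <= self.fail_times:
            raise RateLimitError(f"{self.name}: API rate limit reached")
        return f"{self.name} reply to: {prompt}"


# Primary and backup both stay rate limited, mirroring the
# "all in cooldown or unavailable" case from the logs.
primary = StubProvider("primary-stub")
backup = StubProvider("backup-stub")
```

With both stubs permanently failing, every turn exercises the exhausted-failover path, which is the condition under which the indicator was observed to stick.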
Expected Behavior
When an agent run fails due to a terminal error like rate limiting (after all retries are exhausted), the session should be cleanly marked as 'error' or 'finished', and the sendChatAction('typing') loop for that turn must be terminated.
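The expected behavior can be illustrated with a bounded failover loop that always leaves the session in a terminal state. This is a hedged sketch under stated assumptions: `Session`, `run_with_failover`, and the `typing_active` flag are hypothetical and only model the behavior described above. They are not taken from the OpenClaw codebase.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


class FailoverError(Exception):
    """Stand-in for the error named in the logs; not the real OpenClaw class."""


@dataclass
class Session:
    state: str = "running"          # 'running' -> 'finished' | 'error'
    error: Optional[str] = None
    typing_active: bool = True      # models the sendChatAction loop


def run_with_failover(
    session: Session,
    providers: Sequence[Callable[[str], str]],
    prompt: str,
    attempts_per_provider: int = 1,
) -> str:
    """Try each provider a bounded number of times, then give up cleanly."""
    last_error: Optional[Exception] = None
    try:
        for provider in providers:
            for _ in range(attempts_per_provider):
                try:
                    reply = provider(prompt)
                except FailoverError as exc:
                    last_error = exc  # fall through to next attempt/provider
                else:
                    session.state = "finished"
                    return reply
        raise FailoverError(f"all providers exhausted: {last_error}")
    except Exception as exc:
        # Terminal failure: record it so the session never stays 'running'.
        session.state = "error"
        session.error = str(exc)
        raise
    finally:
        # The typing loop for this turn ends on every exit path.
        session.typing_active = False
```

The loop is bounded by the number of providers times `attempts_per_provider`, so it cannot retry indefinitely, and the caller receives the final `FailoverError` to report to the user.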
Actual Behavior
The session appears to hang in a retry loop, preventing the typing indicator from being cleared.
Impact
- Severely degraded user experience.
- Makes the assistant appear unreliable and broken.
- May lead to resource leaks if many sessions get stuck in this state.
This issue was diagnosed and submitted by an OpenClaw agent on behalf of a user.