fix(feishu): re-throw rate-limit errors from typing indicator to trip circuit breaker#28086
Conversation
… circuit breaker The addTypingIndicator and removeTypingIndicator functions caught all errors silently, preventing the circuit breaker in typing-start-guard from detecting failures. When the Feishu API returns 429 or quota error 99991403, the keepalive loop retried indefinitely (~4,800 failed calls per hour). Re-throw rate-limit and quota errors so the circuit breaker can count consecutive failures and trip, stopping the keepalive loop. Other non-critical errors (e.g. message deleted) remain silently handled. Fixes openclaw#28062
|
Hey @steipete @vincentkoc — fix for #28062 (Feishu typing indicator retries indefinitely on 429). Small, targeted change — only re-throws rate-limit errors so the existing circuit breaker can do its job. Would appreciate a review! |
Greptile SummaryThis PR fixes a real and impactful bug: The fix is correct and well-targeted for Key observations:
Confidence Score: 4/5
Last reviewed commit: 077db0a |
extensions/feishu/src/typing.ts
Outdated
| // Re-throw rate-limit / quota errors so the circuit breaker can trip (#28062). | ||
| if (isRateLimitError(err)) { | ||
| throw err; | ||
| } |
There was a problem hiding this comment.
Misleading comment — removeTypingIndicator re-throw does not reach the circuit breaker
The comment says "so the circuit breaker can trip", but the circuit breaker only counts failures from the start callback, which is wrapped in startGuard.run(). The stop callback is invoked as void stop().catch(...) in typing.ts (line 95 of src/channels/typing.ts):
void stop().catch((err) => (params.onStopError ?? params.onStartError)(err));Any error thrown here is caught by .catch() and forwarded to onStopError / logTypingFailure — it never increments consecutiveFailures in the guard. The re-throw is harmless (rate-limit errors will still be logged via logTypingFailure instead of console.log), but the comment overstates what it achieves. Consider updating it to reflect the actual behavior, e.g.:
| // Re-throw rate-limit / quota errors so the circuit breaker can trip (#28062). | |
| if (isRateLimitError(err)) { | |
| throw err; | |
| } | |
| // Re-throw rate-limit / quota errors so callers (onStopError) can observe them (#28062). | |
| if (isRateLimitError(err)) { | |
| throw err; | |
| } |
Prompt To Fix With AI
This is a comment left during a code review.
Path: extensions/feishu/src/typing.ts
Line: 82-85
Comment:
**Misleading comment — `removeTypingIndicator` re-throw does not reach the circuit breaker**
The comment says "so the circuit breaker can trip", but the circuit breaker only counts failures from the **`start`** callback, which is wrapped in `startGuard.run()`. The `stop` callback is invoked as `void stop().catch(...)` in `typing.ts` (line 95 of `src/channels/typing.ts`):
```ts
void stop().catch((err) => (params.onStopError ?? params.onStartError)(err));
```
Any error thrown here is caught by `.catch()` and forwarded to `onStopError` / `logTypingFailure` — it never increments `consecutiveFailures` in the guard. The re-throw is harmless (rate-limit errors will still be logged via `logTypingFailure` instead of `console.log`), but the comment overstates what it achieves. Consider updating it to reflect the actual behavior, e.g.:
```suggestion
// Re-throw rate-limit / quota errors so callers (onStopError) can observe them (#28062).
if (isRateLimitError(err)) {
throw err;
}
```
How can I resolve this? If you propose a fix, please make it concise.
extensions/feishu/src/typing.ts
Outdated
| function isRateLimitError(err: unknown): boolean { | ||
| if (typeof err !== "object" || err === null) { | ||
| return false; | ||
| } | ||
| const response = (err as Record<string, unknown>).response as Record<string, unknown> | undefined; | ||
| if (!response) { | ||
| return false; | ||
| } | ||
| if (response.status === 429) { | ||
| return true; | ||
| } | ||
| const data = response.data as Record<string, unknown> | undefined; | ||
| return data?.code === 99991403; | ||
| } |
There was a problem hiding this comment.
Missing unit tests for isRateLimitError
The new isRateLimitError function has specific detection logic (HTTP 429 status and Feishu error code 99991403) that would benefit from a colocated typing.test.ts. Currently there is no test file for extensions/feishu/src/typing.ts at all.
Scenarios worth covering:
- An Axios error with
response.status === 429→true - An Axios error with
response.data.code === 99991403→true - An Axios error with
response.data.code === 99991672(permission, not rate-limit) →false - A plain
Errorwith noresponseproperty →false null/ non-object errors →false
This aligns with the repo's testing guidelines (Vitest, colocated *.test.ts, pnpm test before pushing logic changes).
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Prompt To Fix With AI
This is a comment left during a code review.
Path: extensions/feishu/src/typing.ts
Line: 95-108
Comment:
**Missing unit tests for `isRateLimitError`**
The new `isRateLimitError` function has specific detection logic (HTTP `429` status and Feishu error code `99991403`) that would benefit from a colocated `typing.test.ts`. Currently there is no test file for `extensions/feishu/src/typing.ts` at all.
Scenarios worth covering:
- An Axios error with `response.status === 429` → `true`
- An Axios error with `response.data.code === 99991403` → `true`
- An Axios error with `response.data.code === 99991672` (permission, not rate-limit) → `false`
- A plain `Error` with no `response` property → `false`
- `null` / non-object errors → `false`
This aligns with the repo's testing guidelines (Vitest, colocated `*.test.ts`, `pnpm test` before pushing logic changes).
<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>
How can I resolve this? If you propose a fix, please make it concise.- Updated removeTypingIndicator comment to accurately reflect that re-thrown errors reach onStopError, not the circuit breaker directly. - Added unit tests for isRateLimitError covering HTTP 429, Feishu quota code 99991403, non-rate-limit codes, and edge cases. - Exported isRateLimitError for testability.
|
Addressed both review comments in 1e3e6f1:
All tests pass. Also exported |
addTypingIndicatorandremoveTypingIndicatorcaught all errors silently, preventing the circuit breaker from tripping on Feishu API rate-limit / quota errors. This caused the keepalive loop to retry indefinitely (~4,800 failed calls/hour).Fixes #28062
Summary
99991403. The circuit breaker intyping-start-guard.tsrelies on errors being thrown to count consecutive failures and trip — but it never sees them.99991403) are now re-thrown so the circuit breaker can trip and stop the keepalive loop. Other non-critical errors (e.g. message deleted) remain silently handled.Change Type (select all)
Scope (select all touched areas)
Linked Issue/PR
User-visible / Behavior Changes
When Feishu API quota is exceeded, the typing indicator stops retrying after 2 consecutive failures (circuit breaker default) instead of retrying indefinitely.
Security Impact (required)
NoNoNoNoNoRepro + Verification
Environment
Steps
Expected
Circuit breaker trips after 2 consecutive failures, keepalive loop stops.
Actual (before fix)
Keepalive loop retries every 3 seconds indefinitely, generating ~4,800 failed calls/hour.
Evidence
From the issue: 76,000+ total 429 errors accumulated in ~8 hours before manual intervention.
The fix ensures these errors propagate to the circuit breaker (
typing-start-guard.tsline 41:consecutiveFailures += 1) which trips atmaxConsecutiveFailures: 2and callskeepaliveLoop.stop().Human Verification (required)
addTypingIndicator→typing-start-guard.run()→ circuit breaker. Confirmed the guard expects thrown errors to count failures.removeTypingIndicatorgets the same treatment.Compatibility / Migration
YesNoNoFailure Recovery (if this breaks)
Risks and Mitigations
addTypingIndicatormight not expect thrown errors.createTypingCallbacks→typing-start-guard.run(), which already has a try-catch that handles thrown errors (counting failures + optional rethrow). This is the intended contract.