Skip to content

fix(feishu): re-throw rate-limit errors from typing indicator to trip circuit breaker#28086

Closed
kollaikal-rupesh wants to merge 2 commits intoopenclaw:mainfrom
kollaikal-rupesh:fix/feishu-typing-indicator-infinite-retry
Closed

fix(feishu): re-throw rate-limit errors from typing indicator to trip circuit breaker#28086
kollaikal-rupesh wants to merge 2 commits intoopenclaw:mainfrom
kollaikal-rupesh:fix/feishu-typing-indicator-infinite-retry

Conversation

@kollaikal-rupesh
Copy link

addTypingIndicator and removeTypingIndicator caught all errors silently, preventing the circuit breaker from tripping on Feishu API rate-limit / quota errors. This caused the keepalive loop to retry indefinitely (~4,800 failed calls/hour).

Fixes #28062

Summary

  • Problem: Feishu typing indicator catch block swallows all errors including HTTP 429 and quota error 99991403. The circuit breaker in typing-start-guard.ts relies on errors being thrown to count consecutive failures and trip — but it never sees them.
  • Why it matters: After quota is hit, generates ~76,000+ failed API calls in 8 hours. Worsens quota exhaustion and floods logs.
  • What changed: Rate-limit errors (HTTP 429 / Feishu code 99991403) are now re-thrown so the circuit breaker can trip and stop the keepalive loop. Other non-critical errors (e.g. message deleted) remain silently handled.
  • What did NOT change: No changes to the circuit breaker itself, the keepalive loop, or any other extension. Only the Feishu typing indicator error handling is affected.

Change Type (select all)

  • Bug fix

Scope (select all touched areas)

  • Integrations

Linked Issue/PR

User-visible / Behavior Changes

When Feishu API quota is exceeded, the typing indicator stops retrying after 2 consecutive failures (circuit breaker default) instead of retrying indefinitely.

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No
  • Data access scope changed? No

Repro + Verification

Environment

  • Integration/channel: Feishu (Lark)

Steps

  1. Exhaust Feishu API monthly quota (or simulate 429 response)
  2. Send a message that triggers agent reply with typing indicator
  3. Observe keepalive loop behavior

Expected

Circuit breaker trips after 2 consecutive failures, keepalive loop stops.

Actual (before fix)

Keepalive loop retries every 3 seconds indefinitely, generating ~4,800 failed calls/hour.

Evidence

  • Trace/log snippets

From the issue: 76,000+ total 429 errors accumulated in ~8 hours before manual intervention.

The fix ensures these errors propagate to the circuit breaker (typing-start-guard.ts line 41: consecutiveFailures += 1) which trips at maxConsecutiveFailures: 2 and calls keepaliveLoop.stop().

Human Verification (required)

  • Verified scenarios: Traced error flow from addTypingIndicatortyping-start-guard.run() → circuit breaker. Confirmed the guard expects thrown errors to count failures.
  • Edge cases checked: Non-rate-limit errors (e.g. message deleted, network timeout) still silently handled. removeTypingIndicator gets the same treatment.
  • What I did not verify: Live Feishu 429 reproduction (no Feishu API access).

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No

Failure Recovery (if this breaks)

  • Revert this single commit
  • The only behavioral change: rate-limit errors now throw instead of being swallowed

Risks and Mitigations

  • Risk: Some other caller of addTypingIndicator might not expect thrown errors.
    • Mitigation: All callers go through createTypingCallbackstyping-start-guard.run(), which already has a try-catch that handles thrown errors (counting failures + optional rethrow). This is the intended contract.

… circuit breaker

The addTypingIndicator and removeTypingIndicator functions caught all
errors silently, preventing the circuit breaker in typing-start-guard
from detecting failures. When the Feishu API returns 429 or quota error
99991403, the keepalive loop retried indefinitely (~4,800 failed calls
per hour).

Re-throw rate-limit and quota errors so the circuit breaker can count
consecutive failures and trip, stopping the keepalive loop. Other
non-critical errors (e.g. message deleted) remain silently handled.

Fixes openclaw#28062
@kollaikal-rupesh
Copy link
Author

Hey @steipete @vincentkoc — fix for #28062 (Feishu typing indicator retries indefinitely on 429). Small, targeted change — only re-throws rate-limit errors so the existing circuit breaker can do its job. Would appreciate a review!

@openclaw-barnacle openclaw-barnacle bot added channel: feishu Channel integration: feishu size: XS labels Feb 27, 2026
@greptile-apps
Copy link
Contributor

greptile-apps bot commented Feb 27, 2026

Greptile Summary

This PR fixes a real and impactful bug: addTypingIndicator was silently swallowing all errors (including HTTP 429 / Feishu quota code 99991403), which prevented the typing-start-guard circuit breaker from ever seeing failures — causing the keepalive loop to retry indefinitely (~4,800 failed API calls/hour).

The fix is correct and well-targeted for addTypingIndicator: errors now propagate through startGuard.run()consecutiveFailures counter → circuit breaker trip + keepaliveLoop.stop().

Key observations:

  • The addTypingIndicator re-throw correctly reaches the circuit breaker via startGuard.run() in src/channels/typing.ts.
  • The removeTypingIndicator re-throw is harmless but its comment ("so the circuit breaker can trip") is inaccurate — errors from the stop callback are caught by void stop().catch(...) and forwarded to onStopError/logTypingFailure, never touching the circuit breaker's failure counter.
  • The new isRateLimitError helper follows the existing Axios error shape used elsewhere in the Feishu extension (e.g. bot.ts) and the detection logic is sound.
  • No unit tests are included for the new isRateLimitError function; a colocated typing.test.ts would improve confidence and match repo testing conventions.

Confidence Score: 4/5

  • Safe to merge — the core fix for addTypingIndicator is correct, no other code paths are changed, and the only behavioral risk is a slightly misleading comment in removeTypingIndicator.
  • The fix correctly threads rate-limit errors from addTypingIndicator through startGuard.run() to the circuit breaker, directly solving the root cause. The isRateLimitError shape is consistent with how other Feishu SDK errors are handled in the codebase. The removeTypingIndicator re-throw is harmless. Minor deductions for a misleading comment and the absence of unit tests for the new helper.
  • No files require special attention beyond the inline notes on extensions/feishu/src/typing.ts.

Last reviewed commit: 077db0a

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 file reviewed, 2 comments

Edit Code Review Agent Settings | Greptile

Comment on lines +82 to +85
// Re-throw rate-limit / quota errors so the circuit breaker can trip (#28062).
if (isRateLimitError(err)) {
throw err;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Misleading comment — removeTypingIndicator re-throw does not reach the circuit breaker

The comment says "so the circuit breaker can trip", but the circuit breaker only counts failures from the start callback, which is wrapped in startGuard.run(). The stop callback is invoked as void stop().catch(...) in typing.ts (line 95 of src/channels/typing.ts):

void stop().catch((err) => (params.onStopError ?? params.onStartError)(err));

Any error thrown here is caught by .catch() and forwarded to onStopError / logTypingFailure — it never increments consecutiveFailures in the guard. The re-throw is harmless (rate-limit errors will still be logged via logTypingFailure instead of console.log), but the comment overstates what it achieves. Consider updating it to reflect the actual behavior, e.g.:

Suggested change
// Re-throw rate-limit / quota errors so the circuit breaker can trip (#28062).
if (isRateLimitError(err)) {
throw err;
}
// Re-throw rate-limit / quota errors so callers (onStopError) can observe them (#28062).
if (isRateLimitError(err)) {
throw err;
}
Prompt To Fix With AI
This is a comment left during a code review.
Path: extensions/feishu/src/typing.ts
Line: 82-85

Comment:
**Misleading comment — `removeTypingIndicator` re-throw does not reach the circuit breaker**

The comment says "so the circuit breaker can trip", but the circuit breaker only counts failures from the **`start`** callback, which is wrapped in `startGuard.run()`. The `stop` callback is invoked as `void stop().catch(...)` in `typing.ts` (line 95 of `src/channels/typing.ts`):

```ts
void stop().catch((err) => (params.onStopError ?? params.onStartError)(err));
```

Any error thrown here is caught by `.catch()` and forwarded to `onStopError` / `logTypingFailure` — it never increments `consecutiveFailures` in the guard. The re-throw is harmless (rate-limit errors will still be logged via `logTypingFailure` instead of `console.log`), but the comment overstates what it achieves. Consider updating it to reflect the actual behavior, e.g.:

```suggestion
    // Re-throw rate-limit / quota errors so callers (onStopError) can observe them (#28062).
    if (isRateLimitError(err)) {
      throw err;
    }
```

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +95 to +108
function isRateLimitError(err: unknown): boolean {
if (typeof err !== "object" || err === null) {
return false;
}
const response = (err as Record<string, unknown>).response as Record<string, unknown> | undefined;
if (!response) {
return false;
}
if (response.status === 429) {
return true;
}
const data = response.data as Record<string, unknown> | undefined;
return data?.code === 99991403;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing unit tests for isRateLimitError

The new isRateLimitError function has specific detection logic (HTTP 429 status and Feishu error code 99991403) that would benefit from a colocated typing.test.ts. Currently there is no test file for extensions/feishu/src/typing.ts at all.

Scenarios worth covering:

  • An Axios error with response.status === 429true
  • An Axios error with response.data.code === 99991403true
  • An Axios error with response.data.code === 99991672 (permission, not rate-limit) → false
  • A plain Error with no response property → false
  • null / non-object errors → false

This aligns with the repo's testing guidelines (Vitest, colocated *.test.ts, pnpm test before pushing logic changes).

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Prompt To Fix With AI
This is a comment left during a code review.
Path: extensions/feishu/src/typing.ts
Line: 95-108

Comment:
**Missing unit tests for `isRateLimitError`**

The new `isRateLimitError` function has specific detection logic (HTTP `429` status and Feishu error code `99991403`) that would benefit from a colocated `typing.test.ts`. Currently there is no test file for `extensions/feishu/src/typing.ts` at all.

Scenarios worth covering:
- An Axios error with `response.status === 429``true`
- An Axios error with `response.data.code === 99991403``true`
- An Axios error with `response.data.code === 99991672` (permission, not rate-limit) → `false`
- A plain `Error` with no `response` property → `false`
- `null` / non-object errors → `false`

This aligns with the repo's testing guidelines (Vitest, colocated `*.test.ts`, `pnpm test` before pushing logic changes).

<sub>Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!</sub>

How can I resolve this? If you propose a fix, please make it concise.

- Updated removeTypingIndicator comment to accurately reflect that
  re-thrown errors reach onStopError, not the circuit breaker directly.
- Added unit tests for isRateLimitError covering HTTP 429, Feishu
  quota code 99991403, non-rate-limit codes, and edge cases.
- Exported isRateLimitError for testability.
@kollaikal-rupesh
Copy link
Author

Addressed both review comments in 1e3e6f1:

  1. Misleading comment — Updated the removeTypingIndicator comment to accurately reflect that re-thrown errors reach onStopError, not the circuit breaker directly. Good catch.

  2. Missing tests — Added typing.test.ts with 6 tests for isRateLimitError:

    • HTTP 429 → true
    • Feishu quota code 99991403 → true
    • Non-rate-limit code (99991672) → false
    • Plain Error with no response → false
    • nullfalse
    • Non-object values → false

All tests pass. Also exported isRateLimitError for testability.

@Takhoffman
Copy link
Contributor

PR #28086 is superseded by commit 32ee2f0:
32ee2f0

Closing as superseded to keep one canonical fix path.

@Takhoffman Takhoffman closed this Feb 28, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

channel: feishu Channel integration: feishu close:superseded PR close reason size: S

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feishu typing indicator ignores 429/quota errors, causing infinite retry loop

2 participants