fix(feishu): retry on send rate-limit errors (230020/230006)#89659
Conversation
|
Codex review: needs changes before merge. Reviewed June 8, 2026, 10:21 AM ET / 14:21 UTC. Summary PR surface: Source +113, Tests +611. Total +724 across 5 files. Reproducibility: yes. for the review findings: source inspection shows 11233 is absent from FEISHU_SEND_RATE_LIMIT_CODES and the retry delay never reads x-ogw-ratelimit-reset. The contributor also provided live Feishu proof for the 230020 throw-path behavior, but 11232/429/fulfilled-body paths remain unit/parity proof only. Review metrics: 1 noteworthy metric.
Merge readiness Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch. Rank-up moves:
Risk before merge
Maintainer options:
Copy recommended automerge instructionNext step before merge
Security Review findings
Review detailsBest possible solution: Keep the shared send retry helper, but align the retryable code set and gateway backoff timing with Feishu's documented send-limit contract and add focused tests before merge. Do we have a high-confidence way to reproduce the issue? Yes for the review findings: source inspection shows 11233 is absent from FEISHU_SEND_RATE_LIMIT_CODES and the retry delay never reads x-ogw-ratelimit-reset. The contributor also provided live Feishu proof for the 230020 throw-path behavior, but 11232/429/fulfilled-body paths remain unit/parity proof only. Is this the best way to solve the issue? No, not yet: the shared helper is the right layer for direct/reply/streaming send retries, but the retry contract is still narrower than Feishu's documented send flow-control surface and the gateway backoff timing is not reliable. Full review comments:
Overall correctness: patch is incorrect AGENTS.md: found and applied where relevant. Codex review notes: model gpt-5.5, reasoning high; reviewed against da401341b685. Label changesLabel changes:
Label justifications:
Evidence reviewedPR surface: Source +113, Tests +611. Total +724 across 5 files. View PR surface stats
Acceptance criteria:
What I checked:
Likely related people:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics. How this review workflow works
|
|
@clawsweeper re-review |
|
🦞🧹 I asked ClawSweeper to review this item again. Re-review progress:
|
|
🦞👀 Command router queued. I will update this comment with the next step. |
@clawsweeper re-review
|
|
🦞👀 Command router queued. I will update this comment with the next step. Re-review progress:
|
|
Regarding the two P1 findings: 1. While Feishu docs label 2. Reply path retry coverage Agreed this is a gap. The reply path ( @clawsweeper re-review |
|
🦞👀 Command router queued. I will update this comment with the next step. Re-review progress:
|
|
@clawsweeper re-review |
|
🦞👀 Command router queued. I will update this comment with the next step. Re-review progress:
|
6261bb3 to
a4fdc9b
Compare
|
checking~ |
thank you very much! |
|
逻辑整体没问题,不过现在只 retry 注意不要对所有失败统一 retry,还是按明确的 retryable code 分类处理,并给非 retryable code 保持立即失败。这样能解决 broader issue,同时不会把权限、格式、敏感词、目标无效这类永久错误变慢。 |
收到,我先去研究下这几类错误码是否需要处理,测试的时候只触发了230020,感谢大佬 |
90c7870 大佬已按照建议修改,改动如下: 1.针对issue里面提到的2200(属于内部错误包含好几种情况不需要重试处理)、11232(租户级别的限流做了重试)、429(http状态码级别的做了重试) 以及我原来的测试发现的230020(单用户级别的消息限流重试) |
|
🦞🧹 I asked ClawSweeper to review this item again. Re-review progress:
|
修复完成后,1s发送20个请求,全部完成,包含重试请求10个
|
|
@clawsweeper re-review Addressed all three findings from the previous review:
Tests:
|
|
🦞🧹 I asked ClawSweeper to review this item again. Re-review progress:
|
|
@clawsweeper re-review Addressed the new findings from the previous review:
Tests (with fresh Vitest fs cache):
Live (5 × 20 concurrent, base = 500 ms):
|
|
🦞🧹 I asked ClawSweeper to review this item again. Re-review progress:
|
When Feishu returns code 230020 (per-chat rate limit), requestFeishuApi now retries up to 2 times with linear backoff (500ms, 1000ms). The reply path (im.message.reply) is also covered via the same retry helper. Confirmed by a real 20-concurrent-send stress test: all 20 messages succeed after retry. Closes openclaw#70879
Feishu Open API has three send-time rate limit signals: HTTP 429 (gateway-wide quota), business code 11232 (tenant-level message service: 100/min, 5/sec), and 230020 (per-chat). Previously only 230020 was retried; HTTP 429 and 11232 propagated as fatal errors. - Add 11232 to FEISHU_SEND_RATE_LIMIT_CODES. - In getFeishuSendRateLimitCode, recognize HTTP 429 before reading the body code so gateway-level limits enter the retry loop. - Update doc comment listing both gateway and business sources.
The previous send.retry.test.ts only exercised 230020 / 230006 / non-rate codes / plain errors. After expanding the retry policy in 90c7870 to cover code 11232 (tenant-level message rate limit) and gateway-level HTTP 429, ClawSweeper review openclaw#89659 (P2) flagged the tests as no longer matching the production behavior. - getFeishuSendRateLimitCode: assert 11232 returns 11232, HTTP 429 returns 429, and HTTP 429 wins over body code when both are present. - requestFeishuApi: cover 11232 retry-then-success, 429 retry-then-success, exhaustion paths for both, and a mixed 230020 → 11232 → ok recovery.
The Feishu node SDK sometimes resolves a non-throwing response that
carries a rate-limit code in its body (e.g. { code: 11232, msg: ... })
instead of rejecting. requestFeishuApi previously returned that body
straight away and downstream assertFeishuMessageApiSuccess failed once
with no retry — the same shape that issue openclaw#28157 fixed earlier on the
typing/reaction path via getBackoffCodeFromResponse.
ClawSweeper review on openclaw#89659 (P1, comment-shared.ts:140) flagged the
gap. Mirror the typing-path pattern for the send helper:
- Add getFeishuSendRateLimitCodeFromResponse to classify fulfilled
bodies against FEISHU_SEND_RATE_LIMIT_CODES (230020, 11232).
- In requestFeishuApi, after each fulfilled await, classify before
returning. If the body is a retryable rate limit and there are
attempts left, continue the loop. After exhaustion, wrap the last
fulfilled body into a synthetic AxiosError-shaped error so callers
see the same error shape as the throw path.
- Add 11 focused tests covering fulfilled 11232/230020 retry-then-ok,
exhaustion, mixed throw → fulfilled → ok recovery, and pass-through
for code 0 / non-rate-limit codes.
ClawSweeper review on dc8d3be (P1, comment-shared.ts:166) caught a real bug: when the final retry attempt also fulfilled with a rate-limit body (e.g. { code: 11232, ... }), the guard `attempt < FEISHU_SEND_MAX_RETRIES` was false so control fell through to `return result` — bypassing the synthetic-error exhaustion path and handing the rate-limit body to the caller as if it were a successful response. The fulfilled-exhaustion test missed this because Vitest's local fs module cache served the pre-fix shape; running with a fresh cache reproduces the failure. Split the fulfilled-rate-limit branch so the body is always captured, then continue on a non-final attempt or break on the final attempt. Breaking falls through to the synthetic AxiosError-shaped throw below, which is exactly what the existing exhaustion test asserts.


Summary
requestFeishuApifor all three Feishu send-time rate-limit signals:im.message.create(direct send) andim.message.reply(reply send) now go through the retry helper.streaming-card.tsalso adopt the helper.getFeishuSendRateLimitCode()to detect retryable signals from AxiosError shapes; HTTP 429 short-circuits ahead of body-code inspection because gateway 429 has no Feishu business code.Test plan
pnpm test extensions/feishu/src/send.retry.test.ts— 13/13 pass (incl. new 11232 / HTTP 429 / mixed-signal coverage)pnpm test extensions/feishu/src/send.concurrent.test.ts— 9/9 passReal behavior proof
Behavior addressed: Feishu plugin now retries transient rate-limit signals on direct send, reply, and streaming-card send paths. All three production rate-limit codes (HTTP 429, 11232, 230020) enter the same bounded retry loop, instead of failing immediately and losing user-visible messages.
Real environment tested: Live Feishu Open API (
open.feishu.cn), appcli_aa947bd28afa5bd9, realtenant_access_token, macOS local OpenClaw build. Test driver mirrorsrequestFeishuApisemantics 1:1 with the patched source (set base delay = 1000 ms, max retries = 2).Live A/B comparison — 20 concurrent sends to one chat
requestFeishuApi(retry on 230020/11232/HTTP 429)Five consecutive 20-concurrent runs with the patched helper (run-to-run reliability)
In every run roughly half of the 20 sends triggered code 230020 on the first attempt and were recovered by the retry loop, with no message loss.
Per-signal coverage
send.retry.test.ts. Reproducing them live requires sustained ≥ 5 req/s per app or app-wide quota saturation, which is impractical against a shared production tenant; the tests exercise the exactrequestFeishuApiretry classification path with both signal shapes (AxiosError withresponse.status=429, AxiosError withresponse.data.code=11232).Observed result after fix:
getFeishuSendRateLimitCodeandrequestFeishuApiunit tests.What was not tested: Live reproduction of 11232 and HTTP 429 against the production Feishu app (would require degrading the shared tenant's quota or coordinated multi-bot pressure). The classification path is fully covered by unit tests.
Fulfilled-body coverage justification
The fulfilled-body retry path (commit
dc8d3be7dc+fc9380ea19) mirrors theestablished pattern in
extensions/feishu/src/typing.ts:80-96, originallyintroduced for issue #28157 to handle the same SDK behavior on the typing /
reaction API:
typing.ts(#28157, in production)comment-shared.ts(this PR)getBackoffCodeFromResponse(response)getFeishuSendRateLimitCodeFromResponse(result)FEISHU_BACKOFF_CODES(99991400 / 99991403 / 429)FEISHU_SEND_RATE_LIMIT_CODES(230020 / 11232)throw FeishuBackoffErrorto trip the keepalive breakerLive reproduction of fulfilled-body rate limit on the send path is not
feasible from the caller side: shape selection (HTTP 4xx throw vs. HTTP 200
fulfill) is internal to
@larksuiteoapi/node-sdkand not user-controllable.Pattern parity with the production typing-path implementation is the
strongest available equivalence proof, plus exhaustive unit coverage of every
branch (retry, exhaustion, pass-through, mixed-shape recovery).
Closes #70879