feat(plugin): rate-limit-circuit-breaker to prevent death loops in group chats#64031
feat(plugin): rate-limit-circuit-breaker to prevent death loops in group chats#64031guci314 wants to merge 1 commit into
Conversation
…n group chats When multiple agents share a group chat with requireMention:false, a rate-limit error surfaced by one agent triggers other agents to respond, causing them to also hit rate limits in an infinite cascade. This plugin detects consecutive rate-limit error messages per room and suppresses them using a circuit breaker pattern with exponential backoff. Verified in production: a 5-agent Matrix group hit a 13+ minute death loop (146 surface_error events) before this fix. After deployment, zero surface_error events with the same rate limit conditions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Greptile SummaryThis PR adds a new bundled plugin,
Confidence Score: 4/5Safe to merge after fixing the inverted hook priority; the end-to-end suppression works but the priority sign is clearly wrong and contradicts its own comment. One P1 finding (hook priority -100 vs. intended +100) means other message_sending hooks run before cancellation fires, defeating the stated isolation goal. The core functionality still works end-to-end, but the intent/code mismatch warrants a fix before landing. extensions/rate-limit-circuit-breaker/index.ts (priority bug), extensions/rate-limit-circuit-breaker/src/circuit-breaker.ts (cleanup half_open gap) Prompt To Fix All With AIThis is a comment left during a code review.
Path: extensions/rate-limit-circuit-breaker/index.ts
Line: 64
Comment:
**Hook priority is backwards — this hook runs last, not first**
The hook system sorts by descending priority (`(b.priority ?? 0) - (a.priority ?? 0)` in `src/plugins/hooks.ts:192`), so higher numbers run first. With `priority: -100` this hook runs *after* all default-priority (0) hooks, meaning other `message_sending` hooks can do their processing (logging, transformation, analytics) on the rate-limit error message before this cancellation fires. The comment says "Run early so we can cancel before other hooks process" but the sign is inverted.
Message delivery is still suppressed (the final `hookResult.cancelled` check in `deliver.ts` respects `cancel: true` regardless of order), but the side-effect isolation benefit is lost.
```suggestion
{ priority: 100 }, // Run early so we can cancel before other hooks process
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: extensions/rate-limit-circuit-breaker/src/circuit-breaker.ts
Line: 224-235
Comment:
**`half_open` entries are never cleaned up**
`cleanup()` removes `closed`+zero-error rooms and `open` rooms whose `openedAt` is older than `maxAgeMs`, but `half_open` rooms are silently skipped. A room that transitions to `half_open` (cooldown expires, one probe allowed through) and then goes quiet will stay in memory indefinitely. The `openedAt` timestamp survives the `open → half_open` transition and can be used for staleness here.
```suggestion
cleanup(maxAgeMs: number = 3_600_000): void {
const now = Date.now();
for (const [key, room] of this.rooms) {
if (room.state === "closed" && room.consecutiveErrors === 0) {
this.rooms.delete(key);
continue;
}
if (
(room.state === "open" || room.state === "half_open") &&
now - room.openedAt > maxAgeMs
) {
this.rooms.delete(key);
}
}
}
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: extensions/rate-limit-circuit-breaker/index.ts
Line: 27
Comment:
**`api: any` violates the strict-typing project rule**
Per the project coding guidelines, `any` should be avoided and extension production code should import from `openclaw/plugin-sdk/*`. The typed entry point (`PluginEntryApi` or equivalent from `openclaw/plugin-sdk/plugin-entry`) would give proper inference on `api.on`, `api.pluginConfig`, `api.logger`, and `api.registerGatewayMethod`, catching mismatched hook signatures at compile time rather than at runtime.
How can I resolve this? If you propose a fix, please make it concise.Reviews (1): Last reviewed commit: "feat(plugin): add rate-limit-circuit-bre..." | Re-trigger Greptile |
| // Allow the message through (no modification) | ||
| return undefined; | ||
| }, | ||
| { priority: -100 }, // Run early so we can cancel before other hooks process |
There was a problem hiding this comment.
Hook priority is backwards — this hook runs last, not first
The hook system sorts by descending priority ((b.priority ?? 0) - (a.priority ?? 0) in src/plugins/hooks.ts:192), so higher numbers run first. With priority: -100 this hook runs after all default-priority (0) hooks, meaning other message_sending hooks can do their processing (logging, transformation, analytics) on the rate-limit error message before this cancellation fires. The comment says "Run early so we can cancel before other hooks process" but the sign is inverted.
Message delivery is still suppressed (the final hookResult.cancelled check in deliver.ts respects cancel: true regardless of order), but the side-effect isolation benefit is lost.
| { priority: -100 }, // Run early so we can cancel before other hooks process | |
| { priority: 100 }, // Run early so we can cancel before other hooks process |
Prompt To Fix With AI
This is a comment left during a code review.
Path: extensions/rate-limit-circuit-breaker/index.ts
Line: 64
Comment:
**Hook priority is backwards — this hook runs last, not first**
The hook system sorts by descending priority (`(b.priority ?? 0) - (a.priority ?? 0)` in `src/plugins/hooks.ts:192`), so higher numbers run first. With `priority: -100` this hook runs *after* all default-priority (0) hooks, meaning other `message_sending` hooks can do their processing (logging, transformation, analytics) on the rate-limit error message before this cancellation fires. The comment says "Run early so we can cancel before other hooks process" but the sign is inverted.
Message delivery is still suppressed (the final `hookResult.cancelled` check in `deliver.ts` respects `cancel: true` regardless of order), but the side-effect isolation benefit is lost.
```suggestion
{ priority: 100 }, // Run early so we can cancel before other hooks process
```
How can I resolve this? If you propose a fix, please make it concise.| cleanup(maxAgeMs: number = 3_600_000): void { | ||
| const now = Date.now(); | ||
| for (const [key, room] of this.rooms) { | ||
| if (room.state === "closed" && room.consecutiveErrors === 0) { | ||
| this.rooms.delete(key); | ||
| continue; | ||
| } | ||
| if (room.state === "open" && now - room.openedAt > maxAgeMs) { | ||
| this.rooms.delete(key); | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
half_open entries are never cleaned up
cleanup() removes closed+zero-error rooms and open rooms whose openedAt is older than maxAgeMs, but half_open rooms are silently skipped. A room that transitions to half_open (cooldown expires, one probe allowed through) and then goes quiet will stay in memory indefinitely. The openedAt timestamp survives the open → half_open transition and can be used for staleness here.
| cleanup(maxAgeMs: number = 3_600_000): void { | |
| const now = Date.now(); | |
| for (const [key, room] of this.rooms) { | |
| if (room.state === "closed" && room.consecutiveErrors === 0) { | |
| this.rooms.delete(key); | |
| continue; | |
| } | |
| if (room.state === "open" && now - room.openedAt > maxAgeMs) { | |
| this.rooms.delete(key); | |
| } | |
| } | |
| } | |
| cleanup(maxAgeMs: number = 3_600_000): void { | |
| const now = Date.now(); | |
| for (const [key, room] of this.rooms) { | |
| if (room.state === "closed" && room.consecutiveErrors === 0) { | |
| this.rooms.delete(key); | |
| continue; | |
| } | |
| if ( | |
| (room.state === "open" || room.state === "half_open") && | |
| now - room.openedAt > maxAgeMs | |
| ) { | |
| this.rooms.delete(key); | |
| } | |
| } | |
| } |
Prompt To Fix With AI
This is a comment left during a code review.
Path: extensions/rate-limit-circuit-breaker/src/circuit-breaker.ts
Line: 224-235
Comment:
**`half_open` entries are never cleaned up**
`cleanup()` removes `closed`+zero-error rooms and `open` rooms whose `openedAt` is older than `maxAgeMs`, but `half_open` rooms are silently skipped. A room that transitions to `half_open` (cooldown expires, one probe allowed through) and then goes quiet will stay in memory indefinitely. The `openedAt` timestamp survives the `open → half_open` transition and can be used for staleness here.
```suggestion
cleanup(maxAgeMs: number = 3_600_000): void {
const now = Date.now();
for (const [key, room] of this.rooms) {
if (room.state === "closed" && room.consecutiveErrors === 0) {
this.rooms.delete(key);
continue;
}
if (
(room.state === "open" || room.state === "half_open") &&
now - room.openedAt > maxAgeMs
) {
this.rooms.delete(key);
}
}
}
```
How can I resolve this? If you propose a fix, please make it concise.| // Periodic cleanup interval handle | ||
| let cleanupInterval: ReturnType<typeof setInterval> | null = null; | ||
|
|
||
| export default function register(api: any) { |
There was a problem hiding this comment.
api: any violates the strict-typing project rule
Per the project coding guidelines, any should be avoided and extension production code should import from openclaw/plugin-sdk/*. The typed entry point (PluginEntryApi or equivalent from openclaw/plugin-sdk/plugin-entry) would give proper inference on api.on, api.pluginConfig, api.logger, and api.registerGatewayMethod, catching mismatched hook signatures at compile time rather than at runtime.
Prompt To Fix With AI
This is a comment left during a code review.
Path: extensions/rate-limit-circuit-breaker/index.ts
Line: 27
Comment:
**`api: any` violates the strict-typing project rule**
Per the project coding guidelines, `any` should be avoided and extension production code should import from `openclaw/plugin-sdk/*`. The typed entry point (`PluginEntryApi` or equivalent from `openclaw/plugin-sdk/plugin-entry`) would give proper inference on `api.on`, `api.pluginConfig`, `api.logger`, and `api.registerGatewayMethod`, catching mismatched hook signatures at compile time rather than at runtime.
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0fcd5faac6
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| */ | ||
| const RATE_LIMIT_PATTERNS: RegExp[] = [ | ||
| /API rate limit reached/i, | ||
| /rate limit/i, |
There was a problem hiding this comment.
Narrow rate-limit detection to real surfaced error copy
The matcher /rate limit/i treats ordinary assistant text as an error signal, so normal replies like “here’s how to handle API rate limits” increment consecutiveErrors. After three such legitimate messages in the same room, the breaker opens and starts canceling non-error conversation. This creates false positives that can suppress real user-visible responses; the pattern set should be restricted to explicit surfaced error formats instead of broad phrase matches.
Useful? React with 👍 / 👎.
| // Allow the message through (no modification) | ||
| return undefined; | ||
| }, | ||
| { priority: -100 }, // Run early so we can cancel before other hooks process |
There was a problem hiding this comment.
Raise hook priority so cancellation runs before other hooks
message_sending hooks are executed in descending priority order (higher first), so priority: -100 runs late, not early. That means other default-priority hooks can execute side effects before this plugin cancels the message, which defeats the stated goal of short-circuiting repeated error sends; for example, other plugins can still perform outbound checks/actions on messages that should have been dropped.
Useful? React with 👍 / 👎.
|
Closing this as better suited for ClawHub/community plugin work after Codex review. Close as What I checked:
So I’m closing this as a scope-fit item for the plugin/community path rather than keeping it open as an OpenClaw core request. Review notes: reviewed against 5b8bd6371c66. |
Summary
rate-limit-circuit-breakerthat prevents infinite message loops in multi-agent group chats when LLM API rate limits are hitmessage_sendingto suppress repeated rate-limit error messages per roomProblem
When multiple agents share a group chat with
requireMention: false, a single rate-limitsurface_errorcascades into an infinite loop:Observed in production: 146
surface_errorevents in 13 minutes across 5 agents sharing one Matrix room and one zhipu API key. The LLM was never invoked — the loop was entirely in the gateway's error surfacing path. No amount of LLM intelligence can break this loop because the LLM is never called.Solution
The plugin intercepts outgoing messages via the
message_sendinghook:Configurable via
plugins.entriesinopenclaw.json:Verification
After deployment (combined with fallback model config):
surface_errorevents: 146 → 0Test plan
🤖 Generated with Claude Code