Skip to content

feat(plugin): rate-limit-circuit-breaker to prevent death loops in group chats#64031

Closed
guci314 wants to merge 1 commit into
openclaw:mainfrom
guci314:feat/rate-limit-circuit-breaker
Closed

feat(plugin): rate-limit-circuit-breaker to prevent death loops in group chats#64031
guci314 wants to merge 1 commit into
openclaw:mainfrom
guci314:feat/rate-limit-circuit-breaker

Conversation

@guci314

@guci314 guci314 commented Apr 10, 2026

Copy link
Copy Markdown

Summary

  • Adds a new community plugin rate-limit-circuit-breaker that prevents infinite message loops in multi-agent group chats when LLM API rate limits are hit
  • Uses a standard circuit breaker pattern (closed → open → half-open) with exponential backoff
  • Hooks into message_sending to suppress repeated rate-limit error messages per room

Problem

When multiple agents share a group chat with requireMention: false, a single rate-limit surface_error cascades into an infinite loop:

  1. Agent A calls LLM API → 429 rate limit → gateway surfaces error to chat
  2. Agent B sees error message → tries to respond → also rate-limited → surfaces error
  3. 5 agents × continuous cycle = 6.3 error messages/minute, indefinitely

Observed in production: 146 surface_error events in 13 minutes across 5 agents sharing one Matrix room and one zhipu API key. The LLM was never invoked — the loop was entirely in the gateway's error surfacing path. No amount of LLM intelligence can break this loop because the LLM is never called.

Solution

The plugin intercepts outgoing messages via the message_sending hook:

  • CLOSED: Counts consecutive rate-limit errors per room. First N-1 errors pass through (agents can see the error).
  • OPEN: On the Nth error, suppresses all further error messages for a cooldown period. Normal messages still flow.
  • HALF_OPEN: After cooldown, allows one retry. Success → reset. Failure → re-open with 2× cooldown.

Configurable via plugins.entries in openclaw.json:

"rate-limit-circuit-breaker": {
  "enabled": true,
  "config": {
    "maxConsecutiveErrors": 3,
    "baseCooldownMs": 60000,
    "maxCooldownMs": 600000
  }
}

Verification

After deployment (combined with fallback model config):

  • surface_error events: 146 → 0
  • Death loops: 13+ minutes → none
  • Normal agent communication: unaffected

Test plan

  • 22 unit tests covering all state transitions, exponential backoff, room isolation, cleanup
  • End-to-end verification in production Matrix group with 5 agents
  • Confirmed fallback model failover works as primary defense
  • Confirmed circuit breaker activates as secondary defense when all fallbacks exhausted

🤖 Generated with Claude Code

…n group chats

When multiple agents share a group chat with requireMention:false, a rate-limit
error surfaced by one agent triggers other agents to respond, causing them to
also hit rate limits in an infinite cascade. This plugin detects consecutive
rate-limit error messages per room and suppresses them using a circuit breaker
pattern with exponential backoff.

Verified in production: a 5-agent Matrix group hit a 13+ minute death loop
(146 surface_error events) before this fix. After deployment, zero surface_error
events with the same rate limit conditions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@greptile-apps

greptile-apps Bot commented Apr 10, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR adds a new bundled plugin, rate-limit-circuit-breaker, that intercepts outgoing message_sending events to suppress repeated rate-limit error messages using a standard closed/open/half-open circuit breaker with exponential backoff. The state machine logic and tests are well-implemented.

  • Hook priority is inverted (P1): { priority: -100 } causes this hook to run last among message_sending handlers (the system sorts descending). The comment says "Run early" but the sign should be positive (e.g. +100) to actually pre-empt other hooks. Message delivery is still suppressed, but the stated isolation benefit is not achieved.
  • cleanup() leaks half_open entries (P2): rooms that go quiet in half_open state are never evicted by the cleanup job.

Confidence Score: 4/5

Safe to merge after fixing the inverted hook priority; the end-to-end suppression works but the priority sign is clearly wrong and contradicts its own comment.

One P1 finding (hook priority -100 vs. intended +100) means other message_sending hooks run before cancellation fires, defeating the stated isolation goal. The core functionality still works end-to-end, but the intent/code mismatch warrants a fix before landing.

extensions/rate-limit-circuit-breaker/index.ts (priority bug), extensions/rate-limit-circuit-breaker/src/circuit-breaker.ts (cleanup half_open gap)

Prompt To Fix All With AI
This is a comment left during a code review.
Path: extensions/rate-limit-circuit-breaker/index.ts
Line: 64

Comment:
**Hook priority is backwards — this hook runs last, not first**

The hook system sorts by descending priority (`(b.priority ?? 0) - (a.priority ?? 0)` in `src/plugins/hooks.ts:192`), so higher numbers run first. With `priority: -100` this hook runs *after* all default-priority (0) hooks, meaning other `message_sending` hooks can do their processing (logging, transformation, analytics) on the rate-limit error message before this cancellation fires. The comment says "Run early so we can cancel before other hooks process" but the sign is inverted.

Message delivery is still suppressed (the final `hookResult.cancelled` check in `deliver.ts` respects `cancel: true` regardless of order), but the side-effect isolation benefit is lost.

```suggestion
    { priority: 100 }, // Run early so we can cancel before other hooks process
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: extensions/rate-limit-circuit-breaker/src/circuit-breaker.ts
Line: 224-235

Comment:
**`half_open` entries are never cleaned up**

`cleanup()` removes `closed`+zero-error rooms and `open` rooms whose `openedAt` is older than `maxAgeMs`, but `half_open` rooms are silently skipped. A room that transitions to `half_open` (cooldown expires, one probe allowed through) and then goes quiet will stay in memory indefinitely. The `openedAt` timestamp survives the `open → half_open` transition and can be used for staleness here.

```suggestion
  cleanup(maxAgeMs: number = 3_600_000): void {
    const now = Date.now();
    for (const [key, room] of this.rooms) {
      if (room.state === "closed" && room.consecutiveErrors === 0) {
        this.rooms.delete(key);
        continue;
      }
      if (
        (room.state === "open" || room.state === "half_open") &&
        now - room.openedAt > maxAgeMs
      ) {
        this.rooms.delete(key);
      }
    }
  }
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: extensions/rate-limit-circuit-breaker/index.ts
Line: 27

Comment:
**`api: any` violates the strict-typing project rule**

Per the project coding guidelines, `any` should be avoided and extension production code should import from `openclaw/plugin-sdk/*`. The typed entry point (`PluginEntryApi` or equivalent from `openclaw/plugin-sdk/plugin-entry`) would give proper inference on `api.on`, `api.pluginConfig`, `api.logger`, and `api.registerGatewayMethod`, catching mismatched hook signatures at compile time rather than at runtime.

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "feat(plugin): add rate-limit-circuit-bre..." | Re-trigger Greptile

// Allow the message through (no modification)
return undefined;
},
{ priority: -100 }, // Run early so we can cancel before other hooks process

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Hook priority is backwards — this hook runs last, not first

The hook system sorts by descending priority ((b.priority ?? 0) - (a.priority ?? 0) in src/plugins/hooks.ts:192), so higher numbers run first. With priority: -100 this hook runs after all default-priority (0) hooks, meaning other message_sending hooks can do their processing (logging, transformation, analytics) on the rate-limit error message before this cancellation fires. The comment says "Run early so we can cancel before other hooks process" but the sign is inverted.

Message delivery is still suppressed (the final hookResult.cancelled check in deliver.ts respects cancel: true regardless of order), but the side-effect isolation benefit is lost.

Suggested change
{ priority: -100 }, // Run early so we can cancel before other hooks process
{ priority: 100 }, // Run early so we can cancel before other hooks process
Prompt To Fix With AI
This is a comment left during a code review.
Path: extensions/rate-limit-circuit-breaker/index.ts
Line: 64

Comment:
**Hook priority is backwards — this hook runs last, not first**

The hook system sorts by descending priority (`(b.priority ?? 0) - (a.priority ?? 0)` in `src/plugins/hooks.ts:192`), so higher numbers run first. With `priority: -100` this hook runs *after* all default-priority (0) hooks, meaning other `message_sending` hooks can do their processing (logging, transformation, analytics) on the rate-limit error message before this cancellation fires. The comment says "Run early so we can cancel before other hooks process" but the sign is inverted.

Message delivery is still suppressed (the final `hookResult.cancelled` check in `deliver.ts` respects `cancel: true` regardless of order), but the side-effect isolation benefit is lost.

```suggestion
    { priority: 100 }, // Run early so we can cancel before other hooks process
```

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +224 to +235
cleanup(maxAgeMs: number = 3_600_000): void {
const now = Date.now();
for (const [key, room] of this.rooms) {
if (room.state === "closed" && room.consecutiveErrors === 0) {
this.rooms.delete(key);
continue;
}
if (room.state === "open" && now - room.openedAt > maxAgeMs) {
this.rooms.delete(key);
}
}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 half_open entries are never cleaned up

cleanup() removes closed+zero-error rooms and open rooms whose openedAt is older than maxAgeMs, but half_open rooms are silently skipped. A room that transitions to half_open (cooldown expires, one probe allowed through) and then goes quiet will stay in memory indefinitely. The openedAt timestamp survives the open → half_open transition and can be used for staleness here.

Suggested change
cleanup(maxAgeMs: number = 3_600_000): void {
const now = Date.now();
for (const [key, room] of this.rooms) {
if (room.state === "closed" && room.consecutiveErrors === 0) {
this.rooms.delete(key);
continue;
}
if (room.state === "open" && now - room.openedAt > maxAgeMs) {
this.rooms.delete(key);
}
}
}
cleanup(maxAgeMs: number = 3_600_000): void {
const now = Date.now();
for (const [key, room] of this.rooms) {
if (room.state === "closed" && room.consecutiveErrors === 0) {
this.rooms.delete(key);
continue;
}
if (
(room.state === "open" || room.state === "half_open") &&
now - room.openedAt > maxAgeMs
) {
this.rooms.delete(key);
}
}
}
Prompt To Fix With AI
This is a comment left during a code review.
Path: extensions/rate-limit-circuit-breaker/src/circuit-breaker.ts
Line: 224-235

Comment:
**`half_open` entries are never cleaned up**

`cleanup()` removes `closed`+zero-error rooms and `open` rooms whose `openedAt` is older than `maxAgeMs`, but `half_open` rooms are silently skipped. A room that transitions to `half_open` (cooldown expires, one probe allowed through) and then goes quiet will stay in memory indefinitely. The `openedAt` timestamp survives the `open → half_open` transition and can be used for staleness here.

```suggestion
  cleanup(maxAgeMs: number = 3_600_000): void {
    const now = Date.now();
    for (const [key, room] of this.rooms) {
      if (room.state === "closed" && room.consecutiveErrors === 0) {
        this.rooms.delete(key);
        continue;
      }
      if (
        (room.state === "open" || room.state === "half_open") &&
        now - room.openedAt > maxAgeMs
      ) {
        this.rooms.delete(key);
      }
    }
  }
```

How can I resolve this? If you propose a fix, please make it concise.

// Periodic cleanup interval handle
let cleanupInterval: ReturnType<typeof setInterval> | null = null;

export default function register(api: any) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 api: any violates the strict-typing project rule

Per the project coding guidelines, any should be avoided and extension production code should import from openclaw/plugin-sdk/*. The typed entry point (PluginEntryApi or equivalent from openclaw/plugin-sdk/plugin-entry) would give proper inference on api.on, api.pluginConfig, api.logger, and api.registerGatewayMethod, catching mismatched hook signatures at compile time rather than at runtime.

Prompt To Fix With AI
This is a comment left during a code review.
Path: extensions/rate-limit-circuit-breaker/index.ts
Line: 27

Comment:
**`api: any` violates the strict-typing project rule**

Per the project coding guidelines, `any` should be avoided and extension production code should import from `openclaw/plugin-sdk/*`. The typed entry point (`PluginEntryApi` or equivalent from `openclaw/plugin-sdk/plugin-entry`) would give proper inference on `api.on`, `api.pluginConfig`, `api.logger`, and `api.registerGatewayMethod`, catching mismatched hook signatures at compile time rather than at runtime.

How can I resolve this? If you propose a fix, please make it concise.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0fcd5faac6

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

*/
const RATE_LIMIT_PATTERNS: RegExp[] = [
/API rate limit reached/i,
/rate limit/i,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Narrow rate-limit detection to real surfaced error copy

The matcher /rate limit/i treats ordinary assistant text as an error signal, so normal replies like “here’s how to handle API rate limits” increment consecutiveErrors. After three such legitimate messages in the same room, the breaker opens and starts canceling non-error conversation. This creates false positives that can suppress real user-visible responses; the pattern set should be restricted to explicit surfaced error formats instead of broad phrase matches.

Useful? React with 👍 / 👎.

// Allow the message through (no modification)
return undefined;
},
{ priority: -100 }, // Run early so we can cancel before other hooks process

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Raise hook priority so cancellation runs before other hooks

message_sending hooks are executed in descending priority order (higher first), so priority: -100 runs late, not early. That means other default-priority hooks can execute side effects before this plugin cancels the message, which defeats the stated goal of short-circuiting repeated error sends; for example, other plugins can still perform outbound checks/actions on messages that should have been dropped.

Useful? React with 👍 / 👎.

@steipete

Copy link
Copy Markdown
Contributor

Closing this as better suited for ClawHub/community plugin work after Codex review.

Close as clawhub: this is a useful plugin idea, but current OpenClaw policy and docs say optional community plugins should ship externally via ClawHub/npm rather than land as new bundled plugins in core. Main already exposes the message_sending hook and typed plugin-entry API needed to implement it outside the repo.

What I checked:

  • Repo vision prefers external plugin distribution: VISION.md says the preferred plugin path is npm/local development in the plugin's own repository, that the bar for adding optional plugins to core is intentionally high, and that plugin discovery/review belongs in ClawHub. (VISION.md:68, 5b8bd6371c66)
  • Public plugin docs say not to add new plugins to this repo: docs/plugins/building-plugins.md states: contributors do not need to add plugins to the OpenClaw repository and should publish to ClawHub or npm instead; it also documents the external-plugin publish/install flow. (docs/plugins/building-plugins.md:16, 5b8bd6371c66)
  • Current main already exposes the needed hook seam: docs/plugins/hooks.md documents message_sending as the outbound hook for rewriting or cancelling delivery, with descending priority and terminal cancel: true behavior. (docs/plugins/hooks.md:52, 5b8bd6371c66)
  • Hook ordering and cancellation behavior are implemented on main: Hook registrations are sorted by descending priority in src/plugins/hooks.ts, and outbound delivery skips send when hookResult.cancelled is set in src/infra/outbound/deliver.ts. A regression test in src/infra/outbound/deliver.test.ts verifies higher-priority message_sending cancellation short-circuits lower-priority hooks. (src/plugins/hooks.ts:185, 5b8bd6371c66)
  • Typed external plugin entrypoint already exists on main: src/plugin-sdk/plugin-entry.ts exports definePluginEntry(...) with register: (api: OpenClawPluginApi) => void, which is the supported non-channel plugin seam for publishing this as an external plugin instead of bundling it in core. (src/plugin-sdk/plugin-entry.ts:168, 5b8bd6371c66)
  • Existing PR review found branch-local correctness issues: The provided GitHub review context already calls out an inverted hook priority, a half_open cleanup leak, and false-positive rate-limit matching risk on PR head 0fcd5faac6877aa8904b8bdf0902da84d7cc1f9f, reinforcing that this is not a maintainer-ready bundled-plugin merge as submitted. (0fcd5faac687)

So I’m closing this as a scope-fit item for the plugin/community path rather than keeping it open as an OpenClaw core request.

Review notes: reviewed against 5b8bd6371c66.

@steipete steipete closed this Apr 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants